hairloom package#

Module contents#

class hairloom.Breakpoint(chrom, pos, orientation)[source]#

Bases: object

Represents a genomic breakpoint with associated properties and methods.

chrom#

The chromosome name where the breakpoint is located.

Type:

str

pos#

The 1-based position of the breakpoint on the chromosome.

Type:

int

ori#

The orientation of the breakpoint (‘+’ or ‘-‘).

Type:

str

upstream#

Sequence upstream of the breakpoint, initialized to None.

Type:

str or None

downstream#

Sequence downstream of the breakpoint, initialized to None.

Type:

str or None

seq_rearranged#

Rearranged sequence at the breakpoint, initialized to None.

Type:

str or None

seq_removed#

Removed sequence at the breakpoint, initialized to None.

Type:

str or None

chroms#

List of valid chromosome names, including both standard (‘1’, ‘2’, …, ‘X’, ‘Y’, ‘M’) and prefixed (‘chr1’, ‘chr2’, …, ‘chrX’, ‘chrY’).

Type:

list[str]

chroms = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'M', 'chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']#
get_breakpoint_seqs(margin, genome)[source]#

Retrieves upstream and downstream sequences around the breakpoint.

Computes rearranged and removed sequences based on the given margin and genome dictionary.

Parameters:
  • margin (int) – Number of bases to include upstream and downstream of the breakpoint.

  • genome (dict) – A dictionary mapping chromosome names to their respective sequences.

Raises:

ValueError – If the chromosome is not found in the genome.

class hairloom.BreakpointChain(brks_iterable)[source]#

Bases: list

Represents a chain of genomic breakpoints.

This will often represent the chain of breakpoints coming from a single read. This class extends the Python list to store breakpoints and provides methods to enumerate transitions and segments.

tras#

List of transitions (pairs of breakpoints).

Type:

list[BreakpointPair]

segs#

List of segments (pairs of breakpoints).

Type:

list[BreakpointPair]

Parameters:

brks_iterable (iterable) – An iterable containing breakpoint objects.

Example

>>> brk1 = Breakpoint("chr1", 100, "+")
>>> brk2 = Breakpoint("chr1", 200, "-")
>>> chain = BreakpointChain([brk1, brk2])
>>> chain.tras
[BreakpointPair(brk1, brk2)]
class hairloom.BreakpointPair(brk1, brk2)[source]#

Bases: object

Represents a pair of genomic breakpoints.

brk1#

The first breakpoint in the pair.

Type:

Breakpoint

brk2#

The second breakpoint in the pair.

Type:

Breakpoint

aln_segment#

Indicates whether this pair is part of an alignment segment. Defaults to False.

Type:

bool

__repr__()[source]#

Returns a string representation of the breakpoint pair.

class hairloom.Segments(df)[source]#

Bases: object

Calculates and stores genomic segments (middle fragments).

list#

A list of segments, each represented as a tuple (chrom, start, end).

Type:

list[tuple]

Parameters:

df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, and ‘end’.

Example

>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1', 'read1'],
...     'chrom': ['chr1', 'chr1', 'chr1'],
...     'start': [100, 200, 300],
...     'end': [150, 250, 350],
... })
>>> segments = Segments(df)
>>> segments.list
[('chr1', 200, 250)]
get_list()[source]#

Computes segments (middle fragments) from the DataFrame.

Notes

  • Segments are defined as genomic fragments that are neither the first nor the last fragment in a group (grouped by ‘qname’).

  • If a group contains fewer than three fragments, no segments are added.

Modifies:

list (list[tuple]): Appends computed segments to the list attribute.

Example

>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1', 'read1', 'read2'],
...     'chrom': ['chr1', 'chr1', 'chr1', 'chr2'],
...     'start': [100, 200, 300, 400],
...     'end': [150, 250, 350, 450],
... })
>>> segments = Segments(df)
>>> segments.list
[('chr1', 200, 250)]
class hairloom.Transitions(df)[source]#

Bases: object

Calculates transitions (tra) between genomic fragments.

list#

A list of transitions, where each transition is a tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).

Type:

list[tuple]

Parameters:

df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, ‘end’, and ‘strand’.

Example

>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1'],
...     'chrom': ['chr1', 'chr2'],
...     'start': [100, 200],
...     'end': [150, 250],
...     'strand': ['+', '-']
... })
>>> transitions = Transitions(df)
>>> transitions.list
[(('chr1', 150, '+'), ('chr2', 200, '-'))]
get_list()[source]#

Computes transitions (tra) from the DataFrame.

Notes

  • Transitions are calculated between consecutive fragments in the DataFrame grouped by ‘qname’.

  • For each fragment pair, the orientation and positions are determined based on the strand, creating a transition tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).

Modifies:

list (list[tuple]): Appends computed transitions to the list attribute.

Example

>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1', 'read2'],
...     'chrom': ['chr1', 'chr2', 'chr3'],
...     'start': [100, 200, 300],
...     'end': [150, 250, 350],
...     'strand': ['+', '-', '+']
... })
>>> transitions = Transitions(df)
>>> transitions.list # Note: 'read2' won't be included
[(('chr1', 150, '+'), ('chr2', 250, '-'))]
hairloom.extract_read_data(bam, contig, start=None, end=None, max_reads=500)[source]#

Extract alignment tables for split reads and concatenate them into a single DataFrame.

This function retrieves reads from a specified region in a BAM file, extracts split alignments, and organizes them into a structured pandas DataFrame.

Parameters:
  • bam (pysam.AlignmentFile) – The input BAM file opened with pysam.AlignmentFile.

  • contig (str) – The contig (e.g., chromosome) to extract reads from.

  • start (int, optional) – 1-based start position of the region. If None, reads are fetched from the beginning of the contig. Defaults to None.

  • end (int, optional) – 1-based end position of the region. If None, reads are fetched to the end of the contig. Defaults to None.

  • max_reads (int, optional) – The maximum number of reads to extract. Defaults to 500.

Returns:

A DataFrame containing alignment data for all split reads in the region,

concatenated and organized.

Return type:

pd.DataFrame

Notes

  • The start and end positions are converted to 0-based coordinates for compatibility with pysam.AlignmentFile.fetch.

  • The make_split_read_table function is used to create the DataFrame from the extracted alignments.

Example

>>> import pysam
>>> bam = pysam.AlignmentFile("example.bam", "rb")
>>> df = extract_read_data(bam, contig="chr1", start=100, end=1000, max_reads=100)
>>> print(df.head())
hairloom.get_svtype(tra)[source]#

Get SV type string for a given BreakpointPair

Parameters:

tra (BreakpointPair) – Paired breakpoint object

Raises:

ValueError – If no SV type has been assigned

Returns:

SV type string

Return type:

str

hairloom.make_brk_table(bundle, chroms=None)[source]#

Generates a table of breakpoints with support information.

This function processes a bundle of BreakpointChain objects, extracts breakpoint coordinates, and organizes them into a structured pandas DataFrame. Optionally, the resulting table can be filtered by specific chromosomes.

Parameters:
  • bundle (list[BreakpointChain]) – A list of BreakpointChain objects containing breakpoint information.

  • chroms (list[str], optional) – A list of chromosome names to include. If None, all chromosomes are considered. Defaults to None.

Returns:

A DataFrame containing the breakpoint table with the following

columns:

  • chrom (str): Chromosome name.

  • pos (int): Position of the breakpoint.

  • ori (str): Orientation of the breakpoint (‘+’ or ‘-‘).

  • support (int): Support count for the breakpoint.

Return type:

pandas.DataFrame

Notes

  • Breakpoints are filtered based on the chroms parameter if provided.

  • Infinite values in the resulting DataFrame are replaced with NaN to maintain consistency.

Example

>>> from hairloom.datatypes import Breakpoint, BreakpointChain
>>> from hairloom.utils import make_brk_table
>>> bundle = [
...     BreakpointChain([
...         Breakpoint("chr1", 100, "+"),
...         Breakpoint("chr1", 200, "+"),
...         Breakpoint("chr2", 300, "+"),
...     ]),
... ]
>>> chroms = ["chr1", "chr2"]
>>> brk_df = make_brk_table(bundle, chroms)
>>> print(brk_df)
   chrom  pos ori  support
0  chr1  100   +       1
1  chr1  200   +       1
2  chr2  300   +       1
hairloom.make_bundle(reads)[source]#

Make a list of BreapointChain based on alignment table

Parameters:

reads (pandas.DataFrame) – Table of read alignment statistics

Returns:

List of BreakpointChain

Return type:

list[BreakpointChain]

hairloom.make_seg_table(bundle, chroms=None)[source]#

Creates a table of genomic segments with support information.

This function processes a bundle of BreakpointChain objects, extracts genomic segment coordinates, and combines them with support data into a structured pandas DataFrame.

Parameters:
  • bundle (list[BreakpointChain]) – A list of BreakpointChain objects containing segment information.

  • chroms (list[str], optional) – A list of chromosomes to include. If None, all chromosomes are considered. Defaults to None.

Returns:

A DataFrame containing the breakpoint table with the following

columns:

  • chrom (str): Chromosome name.

  • pos (int): Position of the breakpoint.

  • ori (str): Orientation of the breakpoint (‘+’ or ‘-‘).

  • support (int): Support count for the breakpoint.

Return type:

pandas.DataFrame

Notes

  • Segments are filtered based on the chroms parameter if provided.

  • Infinite values in the resulting DataFrame are replaced with NaN for consistency.

Example

>>> from your_module import BreakpointChain, make_seg_table
>>> bundle = [
...     Breakpoint('chr1', 100, '+'),
...     Breakpoint('chr2', 200, '+'),
...     Breakpoint('chr2', 300, '+'),
...     Breakpoint('chr1', 400, '+'),
... ]  # List of BreakpointChain objects
>>> seg_df = make_seg_table(bundle, chroms=["chr1", "chr2"])
>>> print(seg_df.head())
   chrom  pos1  pos2  support
0  chr2   200   300       1
hairloom.make_tra_table(bundle)[source]#

Creates a table of translocations with support information.

This function processes a bundle of BreakpointChain objects, extracts translocation coordinates, and combines them with support data into a structured pandas DataFrame.

Parameters:

bundle (list[BreakpointChain]) – A list of BreakpointChain objects containing translocation information.

Returns:

A DataFrame containing the following columns:

  • chrom1 (str): Chromosome name of the first breakpoint.

  • pos1 (int): Position of the first breakpoint.

  • ori1 (str): Orientation of the first breakpoint (‘+’ or ‘-‘).

  • chrom2 (str): Chromosome name of the second breakpoint.

  • pos2 (int): Position of the second breakpoint.

  • ori2 (str): Orientation of the second breakpoint (‘+’ or ‘-‘).

  • support (int): Support count for the translocation.

Return type:

pandas.DataFrame

Notes

  • Translocations are identified by pairs of breakpoints (BreakpointPair objects) in the tras attribute of each BreakpointChain.

  • Duplicate translocations are removed by maintaining a set of seen coordinate pairs.

  • Infinite values in the resulting DataFrame are replaced with NaN for consistency.

Example

>>> from hairloom.datatypes import Breakpoint, BreakpointChain
>>> from hairloom.utils import make_tra_table
>>> chain1 = BreakpointChain([
...     Breakpoint('chr1', 100, '+'),
...     Breakpoint('chr2', 200, '+')
... ])
>>> chain2 = BreakpointChain([
...     Breakpoint('chr2', 300, '+'),
...     Breakpoint('chr1', 400, '+')
... ])
>>> bundle = [chain1, chain2]
>>> tra_df = make_tra_table(bundle)
>>> print(tra_df)
   chrom1  pos1 ori1 chrom2  pos2 ori2  support
0   chr1   100    +   chr2   200    +       1
1   chr2   300    +   chr1   400    +       1
hairloom.normalize_sv_table(sv, chrom1_col='chromosome_1', chrom2_col='chromosome_2', pos1_col='position_1', pos2_col='position_2', ori1_col='strand_1', ori2_col='strand_2', chroms=None)[source]#

Sort breakpoint1 and breakpoint2 of a SV table

Parameters:
  • sv (pandas.DataFrame) – Table of SVs

  • chrom1_col (str, optional) – Defaults to ‘chromosome_1’.

  • chrom2_col (str, optional) – Defaults to ‘chromosome_2’.

  • pos1_col (str, optional) – Defaults to ‘position_1’.

  • pos2_col (str, optional) – Defaults to ‘position_2’.

  • ori1_col (str, optional) – Defaults to ‘strand_1’.

  • ori2_col (str, optional) – Defaults to ‘strand_2’.

  • chroms (list, optional) – List of input contigs for coordinate sorting. Defaults to None.

Returns:

Sorted (normalized) SV table

Return type:

pandas.DataFrame

Submodules#

hairloom.collect module#

hairloom.collect.extract_read_data(bam, contig, start=None, end=None, max_reads=500)[source]#

Extract alignment tables for split reads and concatenate them into a single DataFrame.

This function retrieves reads from a specified region in a BAM file, extracts split alignments, and organizes them into a structured pandas DataFrame.

Parameters:
  • bam (pysam.AlignmentFile) – The input BAM file opened with pysam.AlignmentFile.

  • contig (str) – The contig (e.g., chromosome) to extract reads from.

  • start (int, optional) – 1-based start position of the region. If None, reads are fetched from the beginning of the contig. Defaults to None.

  • end (int, optional) – 1-based end position of the region. If None, reads are fetched to the end of the contig. Defaults to None.

  • max_reads (int, optional) – The maximum number of reads to extract. Defaults to 500.

Returns:

A DataFrame containing alignment data for all split reads in the region,

concatenated and organized.

Return type:

pd.DataFrame

Notes

  • The start and end positions are converted to 0-based coordinates for compatibility with pysam.AlignmentFile.fetch.

  • The make_split_read_table function is used to create the DataFrame from the extracted alignments.

Example

>>> import pysam
>>> bam = pysam.AlignmentFile("example.bam", "rb")
>>> df = extract_read_data(bam, contig="chr1", start=100, end=1000, max_reads=100)
>>> print(df.head())
hairloom.collect.extract_split_alignments(reads, max_reads=500)[source]#

Extract SplitAlignment objects from IteratorRow with a max_reads parameter

Parameters:
  • reads (pysam.IteratorRow) – Reads fetched from a pysam.Alignmentfile

  • max_reads (int, optional) – Number of reads to extract at maximum. Defaults to 500.

Returns:

list of SplitAlignment objects

Return type:

list

hairloom.collect.find_presence_of_matching_sv(sv1, sv2, margin=50)[source]#

Check overlap of sv2 for sv1 table

Parameters:
  • sv1 (pandas.DataFrame) – SV table to label matching SVs

  • sv2 (pandas.DataFrame) – SV table reference to check presence of overlap

  • margin (int, optional) – Margin (bp) of breakpoint coordinate difference. Defaults to 50.

Returns:

{True, False} list of matches. Length equal to sv1 row size.

Return type:

pd.Series

hairloom.collect.fix_lower_support_coordinates(bundle, coord_map)[source]#

Map breakpoint of lower support to close-by breakpoint with higher support

Parameters:
  • bundle (list) – List of BreakpointChain

  • coord_map (dict) – Map of str(Breakpoint) coordinates

Returns:

List of BreakpointChain, mapped to fixed coordinates

Return type:

list[BreakpointChain]

hairloom.collect.get_breakpoint_support_from_bundle(bundle)[source]#

Get breakpoint support count

Parameters:

bundle (list[BreakpointChain]) – List of BreakpointChain

Returns:

Support for str(Breakpoint) coordinates

Return type:

collections.Counter

hairloom.collect.get_normalized_sv(tra)[source]#

Sorts (normalizes) a BreakpointPair based on chromosome and position order.

This function ensures a consistent ordering of breakpoints in a BreakpointPair by sorting them based on chromosome precedence and genomic position.

Parameters:

tra (BreakpointPair) – A pair of breakpoints to normalize. Each breakpoint in the pair is expected to have the attributes chrom, pos, and ori.

Returns:

A flattened tuple of the normalized breakpoint coordinates in the format:

[chrom1, pos1, ori1, chrom2, pos2, ori2].

Return type:

tuple

Notes

  • Chromosomes are sorted based on their order in Breakpoint.chroms.

  • If the chromosomes are the same, the breakpoints are sorted by position.

  • The orientation (ori) remains associated with its respective breakpoint.

Example

>>> from your_module import Breakpoint, BreakpointPair, get_normalized_sv
>>> brk1 = Breakpoint("chr2", 100, "+")
>>> brk2 = Breakpoint("chr1", 200, "-")
>>> pair = BreakpointPair(brk1, brk2)
>>> get_normalized_sv(pair)
('chr1', 200, '-', 'chr2', 100, '+')
hairloom.collect.get_svtype(tra)[source]#

Get SV type string for a given BreakpointPair

Parameters:

tra (BreakpointPair) – Paired breakpoint object

Raises:

ValueError – If no SV type has been assigned

Returns:

SV type string

Return type:

str

hairloom.collect.make_bundle(reads)[source]#

Make a list of BreapointChain based on alignment table

Parameters:

reads (pandas.DataFrame) – Table of read alignment statistics

Returns:

List of BreakpointChain

Return type:

list[BreakpointChain]

hairloom.collect.make_tumor_sv_table(bundle, sv=None, margin=10, get_support=True)[source]#

Make SV table from list of BreakpointChain

Parameters:
  • bundle (list) – List of BreakpointChain

  • sv (pandas.DataFrame, optional) – Table of source SVs as reference for in_source flag. Defaults to None

  • margin (int, optional) – Margin (bp) for merging clustered breakpoints. Defaults to 10.

  • get_support (bool, optional) – Merge breakpoints with same coordinates and add count as support. Defaults to True.

Returns:

SV table from bundle [, with in_source labels] [, collapsed by coordinate with support counts]

Return type:

pandas.DataFrame

hairloom.collect.map_similar_coordinate_to_higher_rank(bundle, breakpoint_support, margin=10)[source]#

Make mapping of close-by coordinates, with breakpoints of higher support taking priority

Parameters:
  • bundle (list) – List of BreakpointChain

  • breakpoint_support (dict | collections.Counter) – Support for breakpoint coordinates

  • margin (int, optional) – Margin (bp) to merge close-by coordinates. Defaults to 10.

Returns:

tuple containing:

  • coord_map (dict): source -> destination coordinate

  • coord_map_log (tuple): (max_coord, src_count, max_count) [only for debugging]

Return type:

tuple

hairloom.collect.normalize_sv_table(sv, chrom1_col='chromosome_1', chrom2_col='chromosome_2', pos1_col='position_1', pos2_col='position_2', ori1_col='strand_1', ori2_col='strand_2', chroms=None)[source]#

Sort breakpoint1 and breakpoint2 of a SV table

Parameters:
  • sv (pandas.DataFrame) – Table of SVs

  • chrom1_col (str, optional) – Defaults to ‘chromosome_1’.

  • chrom2_col (str, optional) – Defaults to ‘chromosome_2’.

  • pos1_col (str, optional) – Defaults to ‘position_1’.

  • pos2_col (str, optional) – Defaults to ‘position_2’.

  • ori1_col (str, optional) – Defaults to ‘strand_1’.

  • ori2_col (str, optional) – Defaults to ‘strand_2’.

  • chroms (list, optional) – List of input contigs for coordinate sorting. Defaults to None.

Returns:

Sorted (normalized) SV table

Return type:

pandas.DataFrame

hairloom.collect.pull_breakpoints_from_reads_in_sv_regions(bam, tra, get_read_table=False, min_n_breakpoint=2, margin=10)[source]#

Extract and append BreakpointChain objects from a bam file and a table of SVs

Parameters:
  • bam (pysam.AlignmentFile) – BAM file

  • tra (pandas.DataFrame) – Table of SVs

  • get_read_table (bool, optional) – Return table of read alignment stats. Defaults to False.

  • min_n_breakpoint (int, optional) – Minimum number of breakpoints required to be saved. Useful in selecting complex rearrangements if the number is high. Defaults to 3.

  • margin (int, optional) – Margin (bp) from breakpoints to fetch reads. Defaults to 10.

Returns:

A list of BreakpointChain objects

Return type:

list[BreakpointChain]

hairloom.collect.pull_sv_supporting_reads_from_bundle(sv, bundle)[source]#

Filter bundle to include BreakpointChain objects that have breakpoints matching that of the input sv table

Parameters:
  • sv (pandas.DataFrame) – SV table

  • bundle (list) – list of BreapointChain

Returns:

Filtered list of BreakpointChain

Return type:

list

hairloom.datatypes module#

class hairloom.datatypes.Breakpoint(chrom, pos, orientation)[source]#

Bases: object

Represents a genomic breakpoint with associated properties and methods.

chrom#

The chromosome name where the breakpoint is located.

Type:

str

pos#

The 1-based position of the breakpoint on the chromosome.

Type:

int

ori#

The orientation of the breakpoint (‘+’ or ‘-‘).

Type:

str

upstream#

Sequence upstream of the breakpoint, initialized to None.

Type:

str or None

downstream#

Sequence downstream of the breakpoint, initialized to None.

Type:

str or None

seq_rearranged#

Rearranged sequence at the breakpoint, initialized to None.

Type:

str or None

seq_removed#

Removed sequence at the breakpoint, initialized to None.

Type:

str or None

chroms#

List of valid chromosome names, including both standard (‘1’, ‘2’, …, ‘X’, ‘Y’, ‘M’) and prefixed (‘chr1’, ‘chr2’, …, ‘chrX’, ‘chrY’).

Type:

list[str]

chroms = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'M', 'chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']#
get_breakpoint_seqs(margin, genome)[source]#

Retrieves upstream and downstream sequences around the breakpoint.

Computes rearranged and removed sequences based on the given margin and genome dictionary.

Parameters:
  • margin (int) – Number of bases to include upstream and downstream of the breakpoint.

  • genome (dict) – A dictionary mapping chromosome names to their respective sequences.

Raises:

ValueError – If the chromosome is not found in the genome.

class hairloom.datatypes.BreakpointChain(brks_iterable)[source]#

Bases: list

Represents a chain of genomic breakpoints.

This will often represent the chain of breakpoints coming from a single read. This class extends the Python list to store breakpoints and provides methods to enumerate transitions and segments.

tras#

List of transitions (pairs of breakpoints).

Type:

list[BreakpointPair]

segs#

List of segments (pairs of breakpoints).

Type:

list[BreakpointPair]

Parameters:

brks_iterable (iterable) – An iterable containing breakpoint objects.

Example

>>> brk1 = Breakpoint("chr1", 100, "+")
>>> brk2 = Breakpoint("chr1", 200, "-")
>>> chain = BreakpointChain([brk1, brk2])
>>> chain.tras
[BreakpointPair(brk1, brk2)]
class hairloom.datatypes.BreakpointPair(brk1, brk2)[source]#

Bases: object

Represents a pair of genomic breakpoints.

brk1#

The first breakpoint in the pair.

Type:

Breakpoint

brk2#

The second breakpoint in the pair.

Type:

Breakpoint

aln_segment#

Indicates whether this pair is part of an alignment segment. Defaults to False.

Type:

bool

__repr__()[source]#

Returns a string representation of the breakpoint pair.

class hairloom.datatypes.Segments(df)[source]#

Bases: object

Calculates and stores genomic segments (middle fragments).

list#

A list of segments, each represented as a tuple (chrom, start, end).

Type:

list[tuple]

Parameters:

df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, and ‘end’.

Example

>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1', 'read1'],
...     'chrom': ['chr1', 'chr1', 'chr1'],
...     'start': [100, 200, 300],
...     'end': [150, 250, 350],
... })
>>> segments = Segments(df)
>>> segments.list
[('chr1', 200, 250)]
get_list()[source]#

Computes segments (middle fragments) from the DataFrame.

Notes

  • Segments are defined as genomic fragments that are neither the first nor the last fragment in a group (grouped by ‘qname’).

  • If a group contains fewer than three fragments, no segments are added.

Modifies:

list (list[tuple]): Appends computed segments to the list attribute.

Example

>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1', 'read1', 'read2'],
...     'chrom': ['chr1', 'chr1', 'chr1', 'chr2'],
...     'start': [100, 200, 300, 400],
...     'end': [150, 250, 350, 450],
... })
>>> segments = Segments(df)
>>> segments.list
[('chr1', 200, 250)]
class hairloom.datatypes.SplitAlignment(cigarstring, read_name, refname, read_pos, strand)[source]#

Bases: object

Represents a split alignment from a sequencing read.

Parses the CIGAR string and extracts alignment information such as clip lengths, matched bases, and strand-corrected values.

read_name#

The name of the sequencing read.

Type:

str

refname#

The reference sequence name (e.g., chromosome or contig).

Type:

str

cigarstring#

The CIGAR string of the alignment.

Type:

str

start#

The start position of the alignment on the reference.

Type:

int

strand#

The strand information (‘+’ or ‘-‘).

Type:

str

cigar_tuples#

Parsed CIGAR string as a list of (operation, length) tuples.

Type:

list[tuple]

primary#

Placeholder for primary alignment information, initialized to None.

Type:

NoneType

match#

Total number of matched bases in the alignment.

Type:

int

aln_cols#

Column headers for alignment fields.

Type:

list[str]

clip1#

Length of the first clip (soft/hard) before the matched region.

Type:

int

clip2#

Length of the second clip (soft/hard) after the matched region.

Type:

int

end#

The end position of the alignment on the reference.

Type:

int

pclip1#

Strand-corrected length of the first clip.

Type:

int

extract_cigar_field()[source]#

Parses the CIGAR string to calculate clip lengths, matched bases, and alignment end position.

static get_cigar_tuples(cigarstring)[source]#

Parses a CIGAR string and converts it into a list of operation-length tuples.

Parameters:

cigarstring (str) – The CIGAR string to parse, following the standard format used in sequence alignments (e.g., “10M5I20M”).

Returns:

Parsed CIGAR operations and lengths.

Return type:

list[tuple[int, int]]

Raises:

ValueError – If the CIGAR string contains invalid operations.

Example

>>> cigarstring = "10M5I20M"
>>> SplitAlignment.get_cigar_tuples(cigarstring)
[(0, 10), (1, 5), (0, 20)]

Notes

Supported CIGAR operations:
  • ‘M’ (0): Alignment match (can be a sequence match or mismatch).

  • ‘I’ (1): Insertion to the reference.

  • ‘D’ (2): Deletion from the reference.

  • ‘N’ (3): Skipped region from the reference.

  • ‘S’ (4): Soft clipping (clipped sequences present in the read).

  • ‘H’ (5): Hard clipping (clipped sequences not present in the read).

  • ‘P’ (6): Padding (silent deletion from padded reference).

  • ‘=’ (7): Sequence match.

  • ‘X’ (8): Sequence mismatch.

class hairloom.datatypes.Transitions(df)[source]#

Bases: object

Calculates transitions (tra) between genomic fragments.

list#

A list of transitions, where each transition is a tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).

Type:

list[tuple]

Parameters:

df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, ‘end’, and ‘strand’.

Example

>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1'],
...     'chrom': ['chr1', 'chr2'],
...     'start': [100, 200],
...     'end': [150, 250],
...     'strand': ['+', '-']
... })
>>> transitions = Transitions(df)
>>> transitions.list
[(('chr1', 150, '+'), ('chr2', 200, '-'))]
get_list()[source]#

Computes transitions (tra) from the DataFrame.

Notes

  • Transitions are calculated between consecutive fragments in the DataFrame grouped by ‘qname’.

  • For each fragment pair, the orientation and positions are determined based on the strand, creating a transition tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).

Modifies:

list (list[tuple]): Appends computed transitions to the list attribute.

Example

>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1', 'read2'],
...     'chrom': ['chr1', 'chr2', 'chr3'],
...     'start': [100, 200, 300],
...     'end': [150, 250, 350],
...     'strand': ['+', '-', '+']
... })
>>> transitions = Transitions(df)
>>> transitions.list # Note: 'read2' won't be included
[(('chr1', 150, '+'), ('chr2', 250, '-'))]
hairloom.datatypes.get_breakpoint_seqs(chrom, pos, margin, genome)[source]#

Extracts upstream and downstream sequences around a breakpoint.

Parameters:
  • chrom (str) – Chromosome name.

  • pos (int) – 1-based position of the breakpoint.

  • margin (int) – Number of bases upstream and downstream to extract.

  • genome (dict) – Dictionary mapping chromosome names to sequences.

Returns:

A tuple containing:
  • upstream: Sequence upstream of the breakpoint.

  • downstream: Sequence downstream of the breakpoint.

Return type:

tuple[str, str]

Example

>>> genome = {'chr1': "A" * 1000, 'chr2': "T" * 1000}
>>> get_breakpoint_seqs('chr1', 5, 3, genome)
('AA', 'AAAA')

hairloom.utils module#

hairloom.utils.enumerate_breakpoints(df)[source]#

Enumerates breakpoints from a DataFrame of genomic fragments.

This function generates a BreakpointChain object containing Breakpoint objects based on the start and end positions of fragments in the input DataFrame. Breakpoints at the beginning and end of reads are treated differently to omit read boundaries.

Parameters:

df (pandas.DataFrame) –

A DataFrame containing fragment information, with the following columns:

  • chrom: The chromosome name (str).

  • start: The start position of the fragment (int).

  • end: The end position of the fragment (int).

  • strand: The strand information (‘+’ or ‘-‘).

Returns:

A chain of Breakpoint objects enumerated from the DataFrame.

Return type:

BreakpointChain

Notes

  • For the first fragment in the DataFrame, only the end position is included as a breakpoint.

  • For the last fragment in the DataFrame, only the start position is included as a breakpoint.

  • For middle fragments, both start and end positions are included, with orientations determined by the strand.

Example

>>> import pandas as pd
>>> from your_module import Breakpoint, BreakpointChain, enumerate_breakpoints
>>> df = pd.DataFrame({
...     'chrom': ['chr1', 'chr1', 'chr1'],
...     'start': [100, 200, 300],
...     'end': [150, 250, 350],
...     'strand': ['+', '+', '-']
... })
>>> brks = enumerate_breakpoints(df)
>>> print(brks)
[chr1:150:+, chr1:200:-, chr1:250:+, chr1:300:-]
hairloom.utils.get_secondaries(read)[source]#

Extracts secondary alignments from a sequencing read’s ‘SA’ tag.

This function retrieves secondary alignment information stored in the ‘SA’ tag of a sequencing read. It parses the tag and returns a list of secondary alignments.

Parameters:

read (pysam.AlignedSegment) – A sequencing read object, typically from a BAM or SAM file, that includes tags.

Returns:

A list of secondary alignment strings parsed from the ‘SA’ tag. If the ‘SA’ tag is not present, an empty list is returned.

Return type:

list of str

Example

>>> from pysam import AlignedSegment
>>> read = AlignedSegment()
>>> read.tags = [('SA', 'chr1,100,+,60M,60;chr2,200,-,60M,60;')]
>>> secondaries = get_secondaries(read)
>>> print(secondaries)
['chr1,100,+,60M,60', 'chr2,200,-,60M,60']

Notes

  • The ‘SA’ tag stores secondary alignments in a semicolon-separated format.

  • Each secondary alignment string typically contains the following fields:

    [chromosome, position, strand, CIGAR, mapping quality].

hairloom.utils.is_breakpoints_not_sorted(chrom1, pos1, chrom2, pos2, chrom_order)[source]#
hairloom.utils.make_brk_table(bundle, chroms=None)[source]#

Generates a table of breakpoints with support information.

This function processes a bundle of BreakpointChain objects, extracts breakpoint coordinates, and organizes them into a structured pandas DataFrame. Optionally, the resulting table can be filtered by specific chromosomes.

Parameters:
  • bundle (list[BreakpointChain]) – A list of BreakpointChain objects containing breakpoint information.

  • chroms (list[str], optional) – A list of chromosome names to include. If None, all chromosomes are considered. Defaults to None.

Returns:

A DataFrame containing the breakpoint table with the following

columns:

  • chrom (str): Chromosome name.

  • pos (int): Position of the breakpoint.

  • ori (str): Orientation of the breakpoint (‘+’ or ‘-‘).

  • support (int): Support count for the breakpoint.

Return type:

pandas.DataFrame

Notes

  • Breakpoints are filtered based on the chroms parameter if provided.

  • Infinite values in the resulting DataFrame are replaced with NaN to maintain consistency.

Example

>>> from hairloom.datatypes import Breakpoint, BreakpointChain
>>> from hairloom.utils import make_brk_table
>>> bundle = [
...     BreakpointChain([
...         Breakpoint("chr1", 100, "+"),
...         Breakpoint("chr1", 200, "+"),
...         Breakpoint("chr2", 300, "+"),
...     ]),
... ]
>>> chroms = ["chr1", "chr2"]
>>> brk_df = make_brk_table(bundle, chroms)
>>> print(brk_df)
   chrom  pos ori  support
0  chr1  100   +       1
1  chr1  200   +       1
2  chr2  300   +       1
hairloom.utils.make_seg_table(bundle, chroms=None)[source]#

Creates a table of genomic segments with support information.

This function processes a bundle of BreakpointChain objects, extracts genomic segment coordinates, and combines them with support data into a structured pandas DataFrame.

Parameters:
  • bundle (list[BreakpointChain]) – A list of BreakpointChain objects containing segment information.

  • chroms (list[str], optional) – A list of chromosomes to include. If None, all chromosomes are considered. Defaults to None.

Returns:

A DataFrame containing the breakpoint table with the following

columns:

  • chrom (str): Chromosome name.

  • pos (int): Position of the breakpoint.

  • ori (str): Orientation of the breakpoint (‘+’ or ‘-‘).

  • support (int): Support count for the breakpoint.

Return type:

pandas.DataFrame

Notes

  • Segments are filtered based on the chroms parameter if provided.

  • Infinite values in the resulting DataFrame are replaced with NaN for consistency.

Example

>>> from your_module import BreakpointChain, make_seg_table
>>> bundle = [
...     Breakpoint('chr1', 100, '+'),
...     Breakpoint('chr2', 200, '+'),
...     Breakpoint('chr2', 300, '+'),
...     Breakpoint('chr1', 400, '+'),
... ]  # List of BreakpointChain objects
>>> seg_df = make_seg_table(bundle, chroms=["chr1", "chr2"])
>>> print(seg_df.head())
   chrom  pos1  pos2  support
0  chr2   200   300       1
hairloom.utils.make_split_read_table(alignments)[source]#

Creates a table summarizing split-read alignments.

This function processes a list of alignment objects, extracting key attributes and organizing them into a structured pandas DataFrame. The resulting table is deduplicated and sorted for consistent organization.

Parameters:
  • alignments (list) – A list of alignment objects. Each alignment object is expected to have the following attributes:

  • read_name (-) – Query name of the read.

  • refname (-) – Reference sequence name (chromosome or contig).

  • start (-) – Start position of the alignment on the reference.

  • end (-) – End position of the alignment on the reference.

  • strand (-) – Strand information (‘+’ or ‘-‘).

  • clip1 (-) – Length of the first soft/hard clip before the matched region.

  • clip2 (-) – Length of the second soft/hard clip after the matched region.

  • match (-) – Total number of matched bases in the alignment.

  • pclip1 (-) – Strand-corrected length of the first clip.

Returns:

A DataFrame containing the following columns:

  • qname (str): Query name of the read.

  • chrom (str): Reference sequence name (chromosome or contig).

  • start (int): Start position of the alignment on the reference.

  • end (int): End position of the alignment on the reference.

  • strand (str): Strand information (‘+’ or ‘-‘).

  • clip1 (int): Length of the first soft/hard clip before the matched region.

  • match (int): Total number of matched bases in the alignment.

  • clip2 (int): Length of the second soft/hard clip after the matched region.

  • pclip1 (int): Strand-corrected length of the first clip.

Return type:

pandas.DataFrame

Notes

  • Duplicate rows are removed based on the full set of columns.

  • The DataFrame is sorted by qname and pclip1 for consistent organization.

Example

>>> from hairloom.utils import make_split_read_table
>>> class Alignment:
...     def __init__(self, read_name, refname, start, end, strand, clip1, match, clip2, pclip1):
...         self.read_name = read_name
...         self.refname = refname
...         self.start = start
...         self.end = end
...         self.strand = strand
...         self.clip1 = clip1
...         self.match = match
...         self.clip2 = clip2
...         self.pclip1 = pclip1
>>> alignments = [
...     Alignment(read_name="read1", refname="chr1", start=100, end=150,
...               strand='+', clip1=10, match=40, clip2=5, pclip1=10),
...     Alignment(read_name="read2", refname="chr2", start=200, end=250,
...               strand='-', clip1=15, match=35, clip2=10, pclip1=15)
... ]
>>> df = make_split_read_table(alignments)
>>> print(df)
   qname chrom  start  end strand  clip1  match  clip2  pclip1
0  read1  chr1    100  150      +     10     40      5      10
1  read2  chr2    200  250      -     15     35     10      15
hairloom.utils.make_tra_table(bundle)[source]#

Creates a table of translocations with support information.

This function processes a bundle of BreakpointChain objects, extracts translocation coordinates, and combines them with support data into a structured pandas DataFrame.

Parameters:

bundle (list[BreakpointChain]) – A list of BreakpointChain objects containing translocation information.

Returns:

A DataFrame containing the following columns:

  • chrom1 (str): Chromosome name of the first breakpoint.

  • pos1 (int): Position of the first breakpoint.

  • ori1 (str): Orientation of the first breakpoint (‘+’ or ‘-‘).

  • chrom2 (str): Chromosome name of the second breakpoint.

  • pos2 (int): Position of the second breakpoint.

  • ori2 (str): Orientation of the second breakpoint (‘+’ or ‘-‘).

  • support (int): Support count for the translocation.

Return type:

pandas.DataFrame

Notes

  • Translocations are identified by pairs of breakpoints (BreakpointPair objects) in the tras attribute of each BreakpointChain.

  • Duplicate translocations are removed by maintaining a set of seen coordinate pairs.

  • Infinite values in the resulting DataFrame are replaced with NaN for consistency.

Example

>>> from hairloom.datatypes import Breakpoint, BreakpointChain
>>> from hairloom.utils import make_tra_table
>>> chain1 = BreakpointChain([
...     Breakpoint('chr1', 100, '+'),
...     Breakpoint('chr2', 200, '+')
... ])
>>> chain2 = BreakpointChain([
...     Breakpoint('chr2', 300, '+'),
...     Breakpoint('chr1', 400, '+')
... ])
>>> bundle = [chain1, chain2]
>>> tra_df = make_tra_table(bundle)
>>> print(tra_df)
   chrom1  pos1 ori1 chrom2  pos2 ori2  support
0   chr1   100    +   chr2   200    +       1
1   chr2   300    +   chr1   400    +       1
hairloom.utils.reverse_complement(seq)[source]#