hairloom.utils#

Functions

enumerate_breakpoints(df)

Enumerates breakpoints from a DataFrame of genomic fragments.

get_secondaries(read)

Extracts secondary alignments from a sequencing read's 'SA' tag.

is_breakpoints_not_sorted(chrom1, pos1, ...)

make_brk_table(bundle[, chroms])

Generates a table of breakpoints with support information.

make_seg_table(bundle[, chroms])

Creates a table of genomic segments with support information.

make_split_read_table(alignments)

Creates a table summarizing split-read alignments.

make_tra_table(bundle)

Creates a table of translocations with support information.

reverse_complement(seq)

Classes

Breakpoint(chrom, pos, orientation)

Represents a genomic breakpoint with associated properties and methods.

BreakpointChain(brks_iterable)

Represents a chain of genomic breakpoints.

hairloom.utils.enumerate_breakpoints(df)[source]#

Enumerates breakpoints from a DataFrame of genomic fragments.

This function generates a BreakpointChain object containing Breakpoint objects based on the start and end positions of fragments in the input DataFrame. Breakpoints at the beginning and end of reads are treated differently to omit read boundaries.

Parameters:

df (pandas.DataFrame) –

A DataFrame containing fragment information, with the following columns:

  • chrom: The chromosome name (str).

  • start: The start position of the fragment (int).

  • end: The end position of the fragment (int).

  • strand: The strand information (‘+’ or ‘-‘).

Returns:

A chain of Breakpoint objects enumerated from the DataFrame.

Return type:

BreakpointChain

Notes

  • For the first fragment in the DataFrame, only the end position is included as a breakpoint.

  • For the last fragment in the DataFrame, only the start position is included as a breakpoint.

  • For middle fragments, both start and end positions are included, with orientations determined by the strand.

Example

>>> import pandas as pd
>>> from your_module import Breakpoint, BreakpointChain, enumerate_breakpoints
>>> df = pd.DataFrame({
...     'chrom': ['chr1', 'chr1', 'chr1'],
...     'start': [100, 200, 300],
...     'end': [150, 250, 350],
...     'strand': ['+', '+', '-']
... })
>>> brks = enumerate_breakpoints(df)
>>> print(brks)
[chr1:150:+, chr1:200:-, chr1:250:+, chr1:300:-]
hairloom.utils.get_secondaries(read)[source]#

Extracts secondary alignments from a sequencing read’s ‘SA’ tag.

This function retrieves secondary alignment information stored in the ‘SA’ tag of a sequencing read. It parses the tag and returns a list of secondary alignments.

Parameters:

read (pysam.AlignedSegment) – A sequencing read object, typically from a BAM or SAM file, that includes tags.

Returns:

A list of secondary alignment strings parsed from the ‘SA’ tag. If the ‘SA’ tag is not present, an empty list is returned.

Return type:

list of str

Example

>>> from pysam import AlignedSegment
>>> read = AlignedSegment()
>>> read.tags = [('SA', 'chr1,100,+,60M,60;chr2,200,-,60M,60;')]
>>> secondaries = get_secondaries(read)
>>> print(secondaries)
['chr1,100,+,60M,60', 'chr2,200,-,60M,60']

Notes

  • The ‘SA’ tag stores secondary alignments in a semicolon-separated format.

  • Each secondary alignment string typically contains the following fields:

    [chromosome, position, strand, CIGAR, mapping quality].

hairloom.utils.is_breakpoints_not_sorted(chrom1, pos1, chrom2, pos2, chrom_order)[source]#
hairloom.utils.make_brk_table(bundle, chroms=None)[source]#

Generates a table of breakpoints with support information.

This function processes a bundle of BreakpointChain objects, extracts breakpoint coordinates, and organizes them into a structured pandas DataFrame. Optionally, the resulting table can be filtered by specific chromosomes.

Parameters:
  • bundle (list[BreakpointChain]) – A list of BreakpointChain objects containing breakpoint information.

  • chroms (list[str], optional) – A list of chromosome names to include. If None, all chromosomes are considered. Defaults to None.

Returns:

A DataFrame containing the breakpoint table with the following

columns:

  • chrom (str): Chromosome name.

  • pos (int): Position of the breakpoint.

  • ori (str): Orientation of the breakpoint (‘+’ or ‘-‘).

  • support (int): Support count for the breakpoint.

Return type:

pandas.DataFrame

Notes

  • Breakpoints are filtered based on the chroms parameter if provided.

  • Infinite values in the resulting DataFrame are replaced with NaN to maintain consistency.

Example

>>> from hairloom.datatypes import Breakpoint, BreakpointChain
>>> from hairloom.utils import make_brk_table
>>> bundle = [
...     BreakpointChain([
...         Breakpoint("chr1", 100, "+"),
...         Breakpoint("chr1", 200, "+"),
...         Breakpoint("chr2", 300, "+"),
...     ]),
... ]
>>> chroms = ["chr1", "chr2"]
>>> brk_df = make_brk_table(bundle, chroms)
>>> print(brk_df)
   chrom  pos ori  support
0  chr1  100   +       1
1  chr1  200   +       1
2  chr2  300   +       1
hairloom.utils.make_seg_table(bundle, chroms=None)[source]#

Creates a table of genomic segments with support information.

This function processes a bundle of BreakpointChain objects, extracts genomic segment coordinates, and combines them with support data into a structured pandas DataFrame.

Parameters:
  • bundle (list[BreakpointChain]) – A list of BreakpointChain objects containing segment information.

  • chroms (list[str], optional) – A list of chromosomes to include. If None, all chromosomes are considered. Defaults to None.

Returns:

A DataFrame containing the breakpoint table with the following

columns:

  • chrom (str): Chromosome name.

  • pos (int): Position of the breakpoint.

  • ori (str): Orientation of the breakpoint (‘+’ or ‘-‘).

  • support (int): Support count for the breakpoint.

Return type:

pandas.DataFrame

Notes

  • Segments are filtered based on the chroms parameter if provided.

  • Infinite values in the resulting DataFrame are replaced with NaN for consistency.

Example

>>> from your_module import BreakpointChain, make_seg_table
>>> bundle = [
...     Breakpoint('chr1', 100, '+'),
...     Breakpoint('chr2', 200, '+'),
...     Breakpoint('chr2', 300, '+'),
...     Breakpoint('chr1', 400, '+'),
... ]  # List of BreakpointChain objects
>>> seg_df = make_seg_table(bundle, chroms=["chr1", "chr2"])
>>> print(seg_df.head())
   chrom  pos1  pos2  support
0  chr2   200   300       1
hairloom.utils.make_split_read_table(alignments)[source]#

Creates a table summarizing split-read alignments.

This function processes a list of alignment objects, extracting key attributes and organizing them into a structured pandas DataFrame. The resulting table is deduplicated and sorted for consistent organization.

Parameters:
  • alignments (list) – A list of alignment objects. Each alignment object is expected to have the following attributes:

  • read_name (-) – Query name of the read.

  • refname (-) – Reference sequence name (chromosome or contig).

  • start (-) – Start position of the alignment on the reference.

  • end (-) – End position of the alignment on the reference.

  • strand (-) – Strand information (‘+’ or ‘-‘).

  • clip1 (-) – Length of the first soft/hard clip before the matched region.

  • clip2 (-) – Length of the second soft/hard clip after the matched region.

  • match (-) – Total number of matched bases in the alignment.

  • pclip1 (-) – Strand-corrected length of the first clip.

Returns:

A DataFrame containing the following columns:

  • qname (str): Query name of the read.

  • chrom (str): Reference sequence name (chromosome or contig).

  • start (int): Start position of the alignment on the reference.

  • end (int): End position of the alignment on the reference.

  • strand (str): Strand information (‘+’ or ‘-‘).

  • clip1 (int): Length of the first soft/hard clip before the matched region.

  • match (int): Total number of matched bases in the alignment.

  • clip2 (int): Length of the second soft/hard clip after the matched region.

  • pclip1 (int): Strand-corrected length of the first clip.

Return type:

pandas.DataFrame

Notes

  • Duplicate rows are removed based on the full set of columns.

  • The DataFrame is sorted by qname and pclip1 for consistent organization.

Example

>>> from hairloom.utils import make_split_read_table
>>> class Alignment:
...     def __init__(self, read_name, refname, start, end, strand, clip1, match, clip2, pclip1):
...         self.read_name = read_name
...         self.refname = refname
...         self.start = start
...         self.end = end
...         self.strand = strand
...         self.clip1 = clip1
...         self.match = match
...         self.clip2 = clip2
...         self.pclip1 = pclip1
>>> alignments = [
...     Alignment(read_name="read1", refname="chr1", start=100, end=150,
...               strand='+', clip1=10, match=40, clip2=5, pclip1=10),
...     Alignment(read_name="read2", refname="chr2", start=200, end=250,
...               strand='-', clip1=15, match=35, clip2=10, pclip1=15)
... ]
>>> df = make_split_read_table(alignments)
>>> print(df)
   qname chrom  start  end strand  clip1  match  clip2  pclip1
0  read1  chr1    100  150      +     10     40      5      10
1  read2  chr2    200  250      -     15     35     10      15
hairloom.utils.make_tra_table(bundle)[source]#

Creates a table of translocations with support information.

This function processes a bundle of BreakpointChain objects, extracts translocation coordinates, and combines them with support data into a structured pandas DataFrame.

Parameters:

bundle (list[BreakpointChain]) – A list of BreakpointChain objects containing translocation information.

Returns:

A DataFrame containing the following columns:

  • chrom1 (str): Chromosome name of the first breakpoint.

  • pos1 (int): Position of the first breakpoint.

  • ori1 (str): Orientation of the first breakpoint (‘+’ or ‘-‘).

  • chrom2 (str): Chromosome name of the second breakpoint.

  • pos2 (int): Position of the second breakpoint.

  • ori2 (str): Orientation of the second breakpoint (‘+’ or ‘-‘).

  • support (int): Support count for the translocation.

Return type:

pandas.DataFrame

Notes

  • Translocations are identified by pairs of breakpoints (BreakpointPair objects) in the tras attribute of each BreakpointChain.

  • Duplicate translocations are removed by maintaining a set of seen coordinate pairs.

  • Infinite values in the resulting DataFrame are replaced with NaN for consistency.

Example

>>> from hairloom.datatypes import Breakpoint, BreakpointChain
>>> from hairloom.utils import make_tra_table
>>> chain1 = BreakpointChain([
...     Breakpoint('chr1', 100, '+'),
...     Breakpoint('chr2', 200, '+')
... ])
>>> chain2 = BreakpointChain([
...     Breakpoint('chr2', 300, '+'),
...     Breakpoint('chr1', 400, '+')
... ])
>>> bundle = [chain1, chain2]
>>> tra_df = make_tra_table(bundle)
>>> print(tra_df)
   chrom1  pos1 ori1 chrom2  pos2 ori2  support
0   chr1   100    +   chr2   200    +       1
1   chr2   300    +   chr1   400    +       1
hairloom.utils.reverse_complement(seq)[source]#