hairloom.collect

Functions

enumerate_breakpoints(df)

Enumerates breakpoints from a DataFrame of genomic fragments.

extract_read_data(bam, contig[, start, end, ...])

Extract alignment tables for split reads and concatenate them into a single DataFrame.

extract_split_alignments(reads[, max_reads])

Extract up to max_reads SplitAlignment objects from a pysam IteratorRow.

find_presence_of_matching_sv(sv1, sv2[, margin])

Flag each SV in the sv1 table that has a matching SV in sv2.

fix_lower_support_coordinates(bundle, coord_map)

Map breakpoint of lower support to close-by breakpoint with higher support

get_breakpoint_support_from_bundle(bundle)

Get breakpoint support count

get_normalized_sv(tra)

Sorts (normalizes) a BreakpointPair based on chromosome and position order.

get_secondaries(read)

Extracts secondary alignments from a sequencing read's 'SA' tag.

get_svtype(tra)

Get SV type string for a given BreakpointPair

make_bundle(reads)

Make a list of BreakpointChain objects from an alignment table

make_split_read_table(alignments)

Creates a table summarizing split-read alignments.

make_tumor_sv_table(bundle[, sv, margin, ...])

Make SV table from list of BreakpointChain

map_similar_coordinate_to_higher_rank(...[, ...])

Make mapping of close-by coordinates, with breakpoints of higher support taking priority

normalize_sv_table(sv[, chrom1_col, ...])

Sort breakpoint1 and breakpoint2 of a SV table

pull_breakpoints_from_reads_in_sv_regions(...)

Extract and append BreakpointChain objects from a bam file and a table of SVs

pull_sv_supporting_reads_from_bundle(sv, bundle)

Filter bundle to include BreakpointChain objects that have breakpoints matching that of the input sv table

Classes

Breakpoint(chrom, pos, orientation)

Represents a genomic breakpoint with associated properties and methods.

BreakpointPair(brk1, brk2)

Represents a pair of genomic breakpoints.

Counter([iterable])

Dict subclass for counting hashable items.

SplitAlignment(cigarstring, read_name, ...)

Represents a split alignment from a sequencing read.

hairloom.collect.extract_read_data(bam, contig, start=None, end=None, max_reads=500)

Extract alignment tables for split reads and concatenate them into a single DataFrame.

This function retrieves reads from a specified region in a BAM file, extracts split alignments, and organizes them into a structured pandas DataFrame.

Parameters:
  • bam (pysam.AlignmentFile) – The input BAM file opened with pysam.AlignmentFile.

  • contig (str) – The contig (e.g., chromosome) to extract reads from.

  • start (int, optional) – 1-based start position of the region. If None, reads are fetched from the beginning of the contig. Defaults to None.

  • end (int, optional) – 1-based end position of the region. If None, reads are fetched to the end of the contig. Defaults to None.

  • max_reads (int, optional) – The maximum number of reads to extract. Defaults to 500.

Returns:

A DataFrame containing alignment data for all split reads in the region, concatenated and organized.

Return type:

pd.DataFrame

Notes

  • The start and end positions are converted to 0-based coordinates for compatibility with pysam.AlignmentFile.fetch.

  • The make_split_read_table function is used to create the DataFrame from the extracted alignments.
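The coordinate conversion in the first note can be sketched as follows. This is a minimal illustration, assuming the 1-based region is inclusive at both ends and maps onto the 0-based, half-open interval that pysam.AlignmentFile.fetch expects. The helper name to_fetch_interval is hypothetical and is not part of hairloom, and the function's actual arithmetic may differ.

```python
def to_fetch_interval(start=None, end=None):
    """Convert a 1-based inclusive region to a 0-based half-open one.

    Hypothetical helper, not part of hairloom. None is passed through so
    that an open-ended region stays open-ended.
    """
    # Shift the first base of the region to index 0
    start0 = None if start is None else start - 1
    # A 1-based inclusive end equals the 0-based exclusive end
    end0 = end
    return start0, end0


print(to_fetch_interval(100, 1000))  # (99, 1000)
print(to_fetch_interval())           # (None, None)
```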

Example

>>> import pysam
>>> from hairloom.collect import extract_read_data
>>> bam = pysam.AlignmentFile("example.bam", "rb")
>>> df = extract_read_data(bam, contig="chr1", start=100, end=1000, max_reads=100)
>>> print(df.head())

hairloom.collect.extract_split_alignments(reads, max_reads=500)

Extract up to max_reads SplitAlignment objects from a pysam IteratorRow.

Parameters:
  • reads (pysam.IteratorRow) – Reads fetched from a pysam.AlignmentFile

  • max_reads (int, optional) – Number of reads to extract at maximum. Defaults to 500.

Returns:

list of SplitAlignment objects

Return type:

list

hairloom.collect.find_presence_of_matching_sv(sv1, sv2, margin=50)

Flag each SV in the sv1 table that has a matching SV in sv2.

Parameters:
  • sv1 (pandas.DataFrame) – SV table to label matching SVs

  • sv2 (pandas.DataFrame) – SV table reference to check presence of overlap

  • margin (int, optional) – Margin (bp) of breakpoint coordinate difference. Defaults to 50.

Returns:

Boolean Series of matches (True or False), with one entry per row of sv1.

Return type:

pd.Series
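The role of margin can be illustrated with a plain-Python sketch. It assumes that two SVs match when both breakpoints lie on the same chromosomes and each position differs by at most margin bp. The tuple layout (chrom1, pos1, chrom2, pos2) and the helper has_match are hypothetical, and the actual function works on DataFrames and may also compare orientation.

```python
def has_match(query, reference, margin=50):
    """Return one boolean per query SV: True if any reference SV matches.

    Hypothetical helper, not part of hairloom.
    """
    def close(a, b):
        # a and b are (chrom1, pos1, chrom2, pos2) tuples (assumed layout)
        return (a[0] == b[0] and a[2] == b[2]
                and abs(a[1] - b[1]) <= margin
                and abs(a[3] - b[3]) <= margin)

    return [any(close(q, r) for r in reference) for q in query]


sv1 = [("chr1", 1000, "chr2", 5000), ("chr3", 200, "chr3", 900)]
sv2 = [("chr1", 1030, "chr2", 4990)]
print(has_match(sv1, sv2))  # [True, False]
```

The result has one entry per row of the query table, which mirrors the documented return shape.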

hairloom.collect.fix_lower_support_coordinates(bundle, coord_map)

Map breakpoint of lower support to close-by breakpoint with higher support

Parameters:
  • bundle (list) – List of BreakpointChain

  • coord_map (dict) – Map of str(Breakpoint) coordinates

Returns:

List of BreakpointChain, mapped to fixed coordinates

Return type:

list[BreakpointChain]
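Applying a coordinate map can be sketched as below. This is only an illustration under stated assumptions: each chain is modelled as a plain list of coordinate strings, the "chrom:pos:ori" string format is invented for the example, and apply_coord_map is a hypothetical name rather than a hairloom function.

```python
def apply_coord_map(bundle, coord_map):
    """Rewrite every coordinate that has an entry in coord_map.

    Hypothetical helper; coordinates absent from the map are kept unchanged.
    """
    return [[coord_map.get(c, c) for c in chain] for chain in bundle]


# Each inner list stands in for one BreakpointChain (assumed representation)
bundle = [["chr1:1000:+", "chr2:5003:-"], ["chr1:1002:+", "chr2:5000:-"]]
coord_map = {"chr1:1002:+": "chr1:1000:+", "chr2:5003:-": "chr2:5000:-"}
print(apply_coord_map(bundle, coord_map))
# [['chr1:1000:+', 'chr2:5000:-'], ['chr1:1000:+', 'chr2:5000:-']]
```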

hairloom.collect.get_breakpoint_support_from_bundle(bundle)

Get breakpoint support count

Parameters:

bundle (list[BreakpointChain]) – List of BreakpointChain

Returns:

Support for str(Breakpoint) coordinates

Return type:

collections.Counter
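Counting support with collections.Counter can be sketched as follows. The sketch assumes each chain can be treated as a list of coordinate strings. The "chrom:pos:ori" format is a placeholder, since the actual str(Breakpoint) format is not shown here.

```python
from collections import Counter

# Each inner list stands in for one BreakpointChain (hypothetical string format)
bundle = [
    ["chr1:1000:+", "chr2:5000:-"],
    ["chr1:1000:+", "chr2:5003:-"],
    ["chr1:1000:+"],
]

# Count how many times each coordinate appears across all chains
support = Counter(coord for chain in bundle for coord in chain)
print(support.most_common(1))  # [('chr1:1000:+', 3)]
```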

hairloom.collect.get_normalized_sv(tra)

Sorts (normalizes) a BreakpointPair based on chromosome and position order.

This function ensures a consistent ordering of breakpoints in a BreakpointPair by sorting them based on chromosome precedence and genomic position.

Parameters:

tra (BreakpointPair) – A pair of breakpoints to normalize. Each breakpoint in the pair is expected to have the attributes chrom, pos, and ori.

Returns:

A flattened tuple of the normalized breakpoint coordinates in the format (chrom1, pos1, ori1, chrom2, pos2, ori2).

Return type:

tuple

Notes

  • Chromosomes are sorted based on their order in Breakpoint.chroms.

  • If the chromosomes are the same, the breakpoints are sorted by position.

  • The orientation (ori) remains associated with its respective breakpoint.
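The ordering rules in these notes can be sketched with stand-in types. The namedtuple Brk, the CHROMS list, and the function normalized are hypothetical substitutes for Breakpoint, Breakpoint.chroms, and get_normalized_sv. The sketch shows the sorting idea and is not the library's implementation.

```python
from collections import namedtuple

Brk = namedtuple("Brk", ["chrom", "pos", "ori"])  # stand-in for Breakpoint
CHROMS = ["chr1", "chr2", "chr3"]  # assumed stand-in for Breakpoint.chroms


def normalized(brk1, brk2, chroms=CHROMS):
    """Order two breakpoints by chromosome rank, then by position."""
    def key(b):
        return (chroms.index(b.chrom), b.pos)

    first, second = sorted([brk1, brk2], key=key)
    # Each orientation travels with its own breakpoint when flattened
    return (*first, *second)


print(normalized(Brk("chr2", 100, "+"), Brk("chr1", 200, "-")))
# ('chr1', 200, '-', 'chr2', 100, '+')
```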

Example

>>> from hairloom.collect import Breakpoint, BreakpointPair, get_normalized_sv
>>> brk1 = Breakpoint("chr2", 100, "+")
>>> brk2 = Breakpoint("chr1", 200, "-")
>>> pair = BreakpointPair(brk1, brk2)
>>> get_normalized_sv(pair)
('chr1', 200, '-', 'chr2', 100, '+')

hairloom.collect.get_svtype(tra)

Get SV type string for a given BreakpointPair

Parameters:

tra (BreakpointPair) – Paired breakpoint object

Raises:

ValueError – If no SV type has been assigned

Returns:

SV type string

Return type:

str

hairloom.collect.make_bundle(reads)

Make a list of BreakpointChain objects from an alignment table

Parameters:

reads (pandas.DataFrame) – Table of read alignment statistics

Returns:

List of BreakpointChain

Return type:

list[BreakpointChain]

hairloom.collect.make_tumor_sv_table(bundle, sv=None, margin=10, get_support=True)

Make SV table from list of BreakpointChain

Parameters:
  • bundle (list) – List of BreakpointChain

  • sv (pandas.DataFrame, optional) – Table of source SVs as reference for in_source flag. Defaults to None

  • margin (int, optional) – Margin (bp) for merging clustered breakpoints. Defaults to 10.

  • get_support (bool, optional) – Merge breakpoints with same coordinates and add count as support. Defaults to True.

Returns:

SV table built from bundle. It carries in_source labels when sv is given, and is collapsed by coordinate with support counts when get_support is True.

Return type:

pandas.DataFrame

hairloom.collect.map_similar_coordinate_to_higher_rank(bundle, breakpoint_support, margin=10)

Make mapping of close-by coordinates, with breakpoints of higher support taking priority

Parameters:
  • bundle (list) – List of BreakpointChain

  • breakpoint_support (dict | collections.Counter) – Support for breakpoint coordinates

  • margin (int, optional) – Margin (bp) to merge close-by coordinates. Defaults to 10.

Returns:

tuple containing:

  • coord_map (dict): source -> destination coordinate

  • coord_map_log (tuple): (max_coord, src_count, max_count) [only for debugging]

Return type:

tuple
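One way to build such a mapping is sketched below. It assumes coordinates are (chrom, pos, ori) tuples and that a coordinate is redirected to the best-supported coordinate on the same chromosome and orientation within margin bp. The function map_to_higher_support is hypothetical. The library's own procedure, including how it uses bundle and what it records in coord_map_log, may differ.

```python
def map_to_higher_support(support, margin=10):
    """Map each coordinate to the best-supported coordinate within margin bp.

    Hypothetical helper, not part of hairloom.
    support: dict of (chrom, pos, ori) -> read count (assumed key layout).
    """
    coord_map = {}
    for src, src_count in support.items():
        best, best_count = src, src_count
        for dst, dst_count in support.items():
            same_site = dst[0] == src[0] and dst[2] == src[2]
            near = abs(dst[1] - src[1]) <= margin
            # Only redirect to a strictly better-supported neighbour
            if same_site and near and dst_count > best_count:
                best, best_count = dst, dst_count
        coord_map[src] = best
    return coord_map


support = {
    ("chr1", 1000, "+"): 5,
    ("chr1", 1004, "+"): 1,
    ("chr1", 2000, "+"): 2,
}
coord_map = map_to_higher_support(support)
print(coord_map[("chr1", 1004, "+")])  # ('chr1', 1000, '+')
```

Coordinates with no better-supported neighbour map to themselves, so the result can be applied to every breakpoint without special cases.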

hairloom.collect.normalize_sv_table(sv, chrom1_col='chromosome_1', chrom2_col='chromosome_2', pos1_col='position_1', pos2_col='position_2', ori1_col='strand_1', ori2_col='strand_2', chroms=None)

Sort breakpoint1 and breakpoint2 of a SV table

Parameters:
  • sv (pandas.DataFrame) – Table of SVs

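  • chrom1_col (str, optional) – Column name for the chromosome of breakpoint 1. Defaults to ‘chromosome_1’.

  • chrom2_col (str, optional) – Column name for the chromosome of breakpoint 2. Defaults to ‘chromosome_2’.

  • pos1_col (str, optional) – Column name for the position of breakpoint 1. Defaults to ‘position_1’.

  • pos2_col (str, optional) – Column name for the position of breakpoint 2. Defaults to ‘position_2’.

  • ori1_col (str, optional) – Column name for the orientation of breakpoint 1. Defaults to ‘strand_1’.

  • ori2_col (str, optional) – Column name for the orientation of breakpoint 2. Defaults to ‘strand_2’.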

  • chroms (list, optional) – List of input contigs for coordinate sorting. Defaults to None.

Returns:

Sorted (normalized) SV table

Return type:

pandas.DataFrame

hairloom.collect.pull_breakpoints_from_reads_in_sv_regions(bam, tra, get_read_table=False, min_n_breakpoint=2, margin=10)

Extract and append BreakpointChain objects from a bam file and a table of SVs

Parameters:
  • bam (pysam.AlignmentFile) – BAM file

  • tra (pandas.DataFrame) – Table of SVs

  • get_read_table (bool, optional) – Return table of read alignment stats. Defaults to False.

  • min_n_breakpoint (int, optional) – Minimum number of breakpoints a read must have to be saved. A higher value helps select complex rearrangements. Defaults to 2.

  • margin (int, optional) – Margin (bp) from breakpoints to fetch reads. Defaults to 10.

Returns:

A list of BreakpointChain objects

Return type:

list[BreakpointChain]

hairloom.collect.pull_sv_supporting_reads_from_bundle(sv, bundle)

Filter bundle to include BreakpointChain objects that have breakpoints matching that of the input sv table

Parameters:
  • sv (pandas.DataFrame) – SV table

  • bundle (list) – List of BreakpointChain

Returns:

Filtered list of BreakpointChain

Return type:

list