hairloom.collect

Functions

`enumerate_breakpoints`(df)	Enumerates breakpoints from a DataFrame of genomic fragments.
`extract_read_data`(bam, contig[, start, end, ...])	Extract alignment tables for split reads and concatenate them into a single DataFrame.
`extract_split_alignments`(reads[, max_reads])	Extract `SplitAlignment` objects from a pysam `IteratorRow`, up to `max_reads` reads.
`find_presence_of_matching_sv`(sv1, sv2[, margin])	Flag the SVs in sv1 that have a matching SV in sv2.
`fix_lower_support_coordinates`(bundle, coord_map)	Map breakpoint of lower support to close-by breakpoint with higher support
`get_breakpoint_support_from_bundle`(bundle)	Get breakpoint support count
`get_normalized_sv`(tra)	Sorts (normalizes) a `BreakpointPair` based on chromosome and position order.
`get_secondaries`(read)	Extracts secondary alignments from a sequencing read's 'SA' tag.
`get_svtype`(tra)	Get SV type string for a given `BreakpointPair`
`make_bundle`(reads)	Make a list of `BreakpointChain` based on alignment table
`make_split_read_table`(alignments)	Creates a table summarizing split-read alignments.
`make_tumor_sv_table`(bundle[, sv, margin, ...])	Make SV table from list of `BreakpointChain`
`map_similar_coordinate_to_higher_rank`(...[, ...])	Make mapping of close-by coordinates, with breakpoints of higher support taking priority
`normalize_sv_table`(sv[, chrom1_col, ...])	Sort breakpoint1 and breakpoint2 of a SV table
`pull_breakpoints_from_reads_in_sv_regions`(...)	Extract and append `BreakpointChain` objects from a bam file and a table of SVs
`pull_sv_supporting_reads_from_bundle`(sv, bundle)	Filter bundle to include `BreakpointChain` objects that have breakpoints matching that of the input sv table

Classes

`Breakpoint`(chrom, pos, orientation)	Represents a genomic breakpoint with associated properties and methods.
`BreakpointPair`(brk1, brk2)	Represents a pair of genomic breakpoints.
`Counter`([iterable])	Dict subclass for counting hashable items.
`SplitAlignment`(cigarstring, read_name, ...)	Represents a split alignment from a sequencing read.

hairloom.collect.extract_read_data(bam, contig, start=None, end=None, max_reads=500)

Extract alignment tables for split reads and concatenate them into a single DataFrame.

This function retrieves reads from a specified region in a BAM file, extracts split alignments, and organizes them into a structured pandas DataFrame.

Parameters:

bam (pysam.AlignmentFile) – The input BAM file opened with pysam.AlignmentFile.
contig (str) – The contig (e.g., chromosome) to extract reads from.
start (int, optional) – 1-based start position of the region. If None, reads are fetched from the beginning of the contig. Defaults to None.
end (int, optional) – 1-based end position of the region. If None, reads are fetched to the end of the contig. Defaults to None.
max_reads (int, optional) – The maximum number of reads to extract. Defaults to 500.

Returns:

A DataFrame containing alignment data for all split reads in the region, concatenated and organized.

Return type:

pd.DataFrame

Notes

The start and end positions are converted to 0-based coordinates for compatibility with pysam.AlignmentFile.fetch.
The make_split_read_table function is used to create the DataFrame from the extracted alignments.

Example

>>> import pysam
>>> bam = pysam.AlignmentFile("example.bam", "rb")
>>> df = extract_read_data(bam, contig="chr1", start=100, end=1000, max_reads=100)
>>> print(df.head())
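
The coordinate conversion mentioned in the Notes can be sketched as follows. This is a minimal illustration, not hairloom's code: `to_fetch_coords` is a hypothetical helper, and it assumes the 1-based region is inclusive on both ends, which the documentation above does not state. Under that assumption only the start shifts, because a 1-based inclusive end equals the 0-based exclusive end that `pysam.AlignmentFile.fetch` expects.

```python
from typing import Optional, Tuple


def to_fetch_coords(
    start: Optional[int], end: Optional[int]
) -> Tuple[Optional[int], Optional[int]]:
    """Convert a 1-based inclusive region to 0-based half-open coordinates.

    Hypothetical helper for illustration; not part of hairloom.
    """
    # None means "no bound" and is passed through unchanged.
    zero_start = None if start is None else start - 1
    # Assumption: `end` is inclusive in 1-based terms, so it is already
    # the correct exclusive end in 0-based terms.
    zero_end = end
    return zero_start, zero_end


print(to_fetch_coords(100, 1000))   # (99, 1000)
print(to_fetch_coords(None, None))  # (None, None)
```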

hairloom.collect.extract_split_alignments(reads, max_reads=500)

Extract SplitAlignment objects from a pysam IteratorRow, up to max_reads reads.

Parameters:

reads (pysam.IteratorRow) – Reads fetched from a pysam.AlignmentFile
max_reads (int, optional) – Number of reads to extract at maximum. Defaults to 500.

Returns:

list of SplitAlignment objects

Return type:

list
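
One generic way to cap how many items are taken from an iterator, as `max_reads` does here, is `itertools.islice`. The sketch below is an assumption about the general technique, not hairloom's implementation; `take_at_most` is a hypothetical name, and plain integers stand in for reads.

```python
from itertools import islice


def take_at_most(reads, max_reads=500):
    """Return up to `max_reads` items from any iterable.

    Hypothetical helper for illustration; not part of hairloom.
    """
    # islice stops after `max_reads` items, so a large region is never
    # read in full just to be truncated afterwards.
    return list(islice(reads, max_reads))


reads = iter(range(1000))  # stand-in for an iterator of reads
subset = take_at_most(reads, max_reads=3)
print(subset)  # [0, 1, 2]
```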

hairloom.collect.find_presence_of_matching_sv(sv1, sv2, margin=50)

Flag the SVs in sv1 that have a matching SV in sv2.

Parameters:

sv1 (pandas.DataFrame) – SV table to label matching SVs
sv2 (pandas.DataFrame) – SV table reference to check presence of overlap
margin (int, optional) – Margin (bp) of breakpoint coordinate difference. Defaults to 50.

Returns:

Boolean Series marking which rows of sv1 have a match in sv2. Its length equals the number of rows in sv1.

Return type:

pd.Series
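
The margin-based matching can be sketched in plain Python. The rule below is an assumption rather than hairloom's confirmed logic: two SVs are treated as matching when chromosomes and orientations agree and both positions differ by at most `margin` bp. `sv_matches` and `presence_flags` are hypothetical names, and tuples stand in for DataFrame rows.

```python
from typing import List, Tuple

# (chrom1, pos1, ori1, chrom2, pos2, ori2); illustrative row layout
SV = Tuple[str, int, str, str, int, str]


def sv_matches(a: SV, b: SV, margin: int = 50) -> bool:
    """Assumed rule: same chromosomes/orientations, positions within margin."""
    same_labels = (a[0], a[2], a[3], a[5]) == (b[0], b[2], b[3], b[5])
    return (
        same_labels
        and abs(a[1] - b[1]) <= margin
        and abs(a[4] - b[4]) <= margin
    )


def presence_flags(sv1: List[SV], sv2: List[SV], margin: int = 50) -> List[bool]:
    """One flag per row of sv1: True if any row of sv2 matches it."""
    return [any(sv_matches(a, b, margin) for b in sv2) for a in sv1]


sv1 = [
    ("chr1", 1000, "+", "chr2", 5000, "-"),
    ("chr3", 200, "-", "chr3", 900, "+"),
]
sv2 = [("chr1", 1020, "+", "chr2", 4990, "-")]
flags = presence_flags(sv1, sv2)
print(flags)  # [True, False]
```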

hairloom.collect.fix_lower_support_coordinates(bundle, coord_map)

Map breakpoint of lower support to close-by breakpoint with higher support

Parameters:

bundle (list) – List of BreakpointChain
coord_map (dict) – Map of str(Breakpoint) coordinates

Returns:

List of BreakpointChain, mapped to fixed coordinates

Return type:

list[BreakpointChain]
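
Applying a coordinate map to a bundle can be sketched as a dictionary lookup per breakpoint. This is illustrative only: chains are modelled as lists of strings, the `"chrom:pos:ori"` key format is an assumption about what `str(Breakpoint)` looks like, and `apply_coord_map` is a hypothetical name.

```python
def apply_coord_map(bundle, coord_map):
    """Replace each coordinate with its mapped destination, if any.

    Hypothetical helper for illustration; not part of hairloom.
    """
    # Coordinates missing from the map are kept unchanged.
    return [[coord_map.get(coord, coord) for coord in chain] for chain in bundle]


bundle = [
    ["chr1:1000:+", "chr2:5003:-"],
    ["chr1:1000:+", "chr2:5000:-"],
]
coord_map = {"chr2:5003:-": "chr2:5000:-"}  # lower support -> higher support
fixed = apply_coord_map(bundle, coord_map)
print(fixed[0])  # ['chr1:1000:+', 'chr2:5000:-']
```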

hairloom.collect.get_breakpoint_support_from_bundle(bundle)

Get breakpoint support count

Parameters: bundle (list[BreakpointChain]) – List of BreakpointChain
Returns: Support for str(Breakpoint) coordinates
Return type: collections.Counter
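
Counting support per coordinate string maps naturally onto `collections.Counter`. The sketch below assumes each chain can be iterated to yield breakpoint coordinate strings. The `"chrom:pos:ori"` format is a stand-in, since the real `str(Breakpoint)` output is not shown here.

```python
from collections import Counter

# Each inner list stands in for one BreakpointChain (assumed layout).
bundle = [
    ["chr1:1000:+", "chr2:5000:-"],
    ["chr1:1000:+", "chr2:5003:-"],
    ["chr1:1000:+", "chr2:5000:-"],
]

# Count how many times each coordinate string occurs across all chains.
support = Counter(coord for chain in bundle for coord in chain)

print(support["chr1:1000:+"])  # 3
print(support["chr2:5003:-"])  # 1
```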

hairloom.collect.get_normalized_sv(tra)

Sorts (normalizes) a BreakpointPair based on chromosome and position order.

This function ensures a consistent ordering of breakpoints in a BreakpointPair by sorting them based on chromosome precedence and genomic position.

Parameters:

tra (BreakpointPair) – A pair of breakpoints to normalize. Each breakpoint in the pair is expected to have the attributes chrom, pos, and ori.

Returns:

A flattened tuple of the normalized breakpoint coordinates in the format (chrom1, pos1, ori1, chrom2, pos2, ori2).

Return type:

tuple

Notes

Chromosomes are sorted based on their order in Breakpoint.chroms.
If the chromosomes are the same, the breakpoints are sorted by position.
The orientation (ori) remains associated with its respective breakpoint.

Example

>>> from hairloom.collect import Breakpoint, BreakpointPair, get_normalized_sv
>>> brk1 = Breakpoint("chr2", 100, "+")
>>> brk2 = Breakpoint("chr1", 200, "-")
>>> pair = BreakpointPair(brk1, brk2)
>>> get_normalized_sv(pair)
('chr1', 200, '-', 'chr2', 100, '+')
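
The ordering rule in the Notes can be reproduced with a sort key of (chromosome rank, position). This is a standalone sketch under stated assumptions: `Brk` is a hypothetical stand-in for `Breakpoint`, `CHROMS` is an example contig order, and `normalized` is not hairloom's function.

```python
from collections import namedtuple

# Stand-in for Breakpoint, exposing the attributes named in the docs.
Brk = namedtuple("Brk", ["chrom", "pos", "ori"])

CHROMS = ["chr1", "chr2", "chr3"]  # assumed contig order


def normalized(brk1, brk2, chroms=CHROMS):
    """Order two breakpoints by chromosome rank, then by position."""
    first, second = sorted(
        [brk1, brk2], key=lambda b: (chroms.index(b.chrom), b.pos)
    )
    # Each orientation travels with its own breakpoint.
    return (first.chrom, first.pos, first.ori,
            second.chrom, second.pos, second.ori)


print(normalized(Brk("chr2", 100, "+"), Brk("chr1", 200, "-")))
# ('chr1', 200, '-', 'chr2', 100, '+')
print(normalized(Brk("chr1", 500, "+"), Brk("chr1", 200, "-")))
# ('chr1', 200, '-', 'chr1', 500, '+')
```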

hairloom.collect.get_svtype(tra)

Get SV type string for a given BreakpointPair

Parameters: tra (BreakpointPair) – Paired breakpoint object
Raises: ValueError – If no SV type has been assigned
Returns: SV type string
Return type: str

hairloom.collect.make_bundle(reads)

Make a list of BreakpointChain based on alignment table

Parameters: reads (pandas.DataFrame) – Table of read alignment statistics
Returns: List of BreakpointChain
Return type: list[BreakpointChain]

hairloom.collect.make_tumor_sv_table(bundle, sv=None, margin=10, get_support=True)

Make SV table from list of BreakpointChain

Parameters:

bundle (list) – List of BreakpointChain
sv (pandas.DataFrame, optional) – Table of source SVs as reference for in_source flag. Defaults to None
margin (int, optional) – Margin (bp) for merging clustered breakpoints. Defaults to 10.
get_support (bool, optional) – Merge breakpoints with same coordinates and add count as support. Defaults to True.

Returns:

SV table built from the bundle, with in_source labels if sv is given, and collapsed by coordinate with support counts if get_support is True.

Return type:

pandas.DataFrame

hairloom.collect.map_similar_coordinate_to_higher_rank(bundle, breakpoint_support, margin=10)

Make mapping of close-by coordinates, with breakpoints of higher support taking priority

Parameters:

bundle (list) – List of BreakpointChain
breakpoint_support (dict | collections.Counter) – Support for breakpoint coordinates
margin (int, optional) – Margin (bp) to merge close-by coordinates. Defaults to 10.

Returns:

tuple containing:

coord_map (dict): source -> destination coordinate
coord_map_log (tuple): (max_coord, src_count, max_count) [only for debugging]

Return type:

tuple
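
The idea of mapping close-by coordinates to the one with higher support can be sketched as below. The rule is an assumption, not hairloom's confirmed algorithm: a coordinate is remapped to the best-supported coordinate on the same chromosome and orientation within `margin` bp, and only if that neighbour has strictly higher support. Coordinates are modelled as `(chrom, pos, ori)` tuples and `map_to_higher_support` is a hypothetical name.

```python
from collections import Counter


def map_to_higher_support(support, margin=10):
    """Map each coordinate to its best-supported neighbour within margin.

    Hypothetical helper for illustration; not part of hairloom.
    """
    coord_map = {}
    for src, src_count in support.items():
        best, best_count = src, src_count
        for dst, dst_count in support.items():
            same_side = (dst[0], dst[2]) == (src[0], src[2])
            close = abs(dst[1] - src[1]) <= margin
            # Only a strictly better-supported neighbour replaces the source.
            if same_side and close and dst_count > best_count:
                best, best_count = dst, dst_count
        coord_map[src] = best
    return coord_map


support = Counter({
    ("chr1", 1000, "+"): 5,
    ("chr1", 1004, "+"): 1,
    ("chr1", 2000, "+"): 2,
})
coord_map = map_to_higher_support(support)
print(coord_map[("chr1", 1004, "+")])  # ('chr1', 1000, '+')
```

In this sketch a coordinate with no better neighbour maps to itself. Whether the real coord_map contains such identity entries is not stated above.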

hairloom.collect.normalize_sv_table(sv, chrom1_col='chromosome_1', chrom2_col='chromosome_2', pos1_col='position_1', pos2_col='position_2', ori1_col='strand_1', ori2_col='strand_2', chroms=None)

Sort breakpoint1 and breakpoint2 of a SV table

Parameters:

sv (pandas.DataFrame) – Table of SVs
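chrom1_col (str, optional) – Column name for the chromosome of breakpoint 1. Defaults to ‘chromosome_1’.
chrom2_col (str, optional) – Column name for the chromosome of breakpoint 2. Defaults to ‘chromosome_2’.
pos1_col (str, optional) – Column name for the position of breakpoint 1. Defaults to ‘position_1’.
pos2_col (str, optional) – Column name for the position of breakpoint 2. Defaults to ‘position_2’.
ori1_col (str, optional) – Column name for the orientation (strand) of breakpoint 1. Defaults to ‘strand_1’.
ori2_col (str, optional) – Column name for the orientation (strand) of breakpoint 2. Defaults to ‘strand_2’.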
chroms (list, optional) – List of input contigs for coordinate sorting. Defaults to None.

Returns:

Sorted (normalized) SV table

Return type:

pandas.DataFrame

hairloom.collect.pull_breakpoints_from_reads_in_sv_regions(bam, tra, get_read_table=False, min_n_breakpoint=2, margin=10)

Extract and append BreakpointChain objects from a BAM file and a table of SVs

Parameters:

bam (pysam.AlignmentFile) – BAM file
tra (pandas.DataFrame) – Table of SVs
get_read_table (bool, optional) – Return table of read alignment stats. Defaults to False.
min_n_breakpoint (int, optional) – Minimum number of breakpoints required to be saved. A higher value helps select complex rearrangements. Defaults to 2.
margin (int, optional) – Margin (bp) from breakpoints to fetch reads. Defaults to 10.

Returns:

A list of BreakpointChain objects

Return type:

list[BreakpointChain]

hairloom.collect.pull_sv_supporting_reads_from_bundle(sv, bundle)

Filter bundle to include BreakpointChain objects that have breakpoints matching that of the input sv table

Parameters:

sv (pandas.DataFrame) – SV table
bundle (list) – List of BreakpointChain

Returns:

Filtered list of BreakpointChain

Return type:

list
