hairloom.collect
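Functions
| | Enumerates breakpoints from a DataFrame of genomic fragments. |
| extract_read_data | Extract alignment tables for split reads and concatenate them into a single DataFrame. |
| extract_split_alignments | Extract SplitAlignment objects from IteratorRow with a max_reads parameter |
| find_presence_of_matching_sv | Check each SV in sv1 for a matching SV in sv2 |
| fix_lower_support_coordinates | Map breakpoint of lower support to close-by breakpoint with higher support |
| get_breakpoint_support_from_bundle | Get breakpoint support count |
| get_normalized_sv | Sorts (normalizes) a BreakpointPair based on chromosome and position order. |
| | Extracts secondary alignments from a sequencing read's 'SA' tag. |
| get_svtype | Get SV type string for a given BreakpointPair |
| make_bundle | Make a list of BreakpointChain based on alignment table |
| make_split_read_table | Creates a table summarizing split-read alignments. |
| make_tumor_sv_table | Make SV table from list of BreakpointChain |
| map_similar_coordinate_to_higher_rank | Make mapping of close-by coordinates, with breakpoints of higher support taking priority |
| normalize_sv_table | Sort breakpoint1 and breakpoint2 of an SV table |
| pull_breakpoints_from_reads_in_sv_regions | Extract and append BreakpointChain objects from a BAM file and a table of SVs |
| pull_sv_supporting_reads_from_bundle | Filter bundle to include BreakpointChain objects that have breakpoints matching those of the input sv table |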
Classes
| Breakpoint | Represents a genomic breakpoint with associated properties and methods. |
| BreakpointPair | Represents a pair of genomic breakpoints. |
| Counter | Dict subclass for counting hashable items. |
| SplitAlignment | Represents a split alignment from a sequencing read. |
- hairloom.collect.extract_read_data(bam, contig, start=None, end=None, max_reads=500)
Extract alignment tables for split reads and concatenate them into a single DataFrame.
This function retrieves reads from a specified region in a BAM file, extracts split alignments, and organizes them into a structured pandas DataFrame.
- Parameters:
bam (pysam.AlignmentFile) – The input BAM file opened with pysam.AlignmentFile.
contig (str) – The contig (e.g., chromosome) to extract reads from.
start (int, optional) – 1-based start position of the region. If None, reads are fetched from the beginning of the contig. Defaults to None.
end (int, optional) – 1-based end position of the region. If None, reads are fetched to the end of the contig. Defaults to None.
max_reads (int, optional) – The maximum number of reads to extract. Defaults to 500.
- Returns:
A DataFrame containing alignment data for all split reads in the region, concatenated into a single table.
- Return type:
pd.DataFrame
Notes
The start and end positions are converted to 0-based coordinates for compatibility with pysam.AlignmentFile.fetch.
The make_split_read_table function is used to create the DataFrame from the extracted alignments.
Example
>>> import pysam
>>> from hairloom.collect import extract_read_data
>>> bam = pysam.AlignmentFile("example.bam", "rb")
>>> df = extract_read_data(bam, contig="chr1", start=100, end=1000, max_reads=100)
>>> print(df.head())
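The coordinate conversion mentioned in the Notes can be sketched as follows. This is a minimal illustration, assuming the 1-based start and end describe an inclusive interval; the helper name to_fetch_interval is hypothetical and not part of hairloom. pysam's fetch interprets its start and stop arguments as a 0-based, half-open interval.

```python
def to_fetch_interval(start=None, end=None):
    """Convert a 1-based inclusive interval to a 0-based half-open one.

    Hypothetical helper for illustration only. None means "unbounded"
    and is passed through unchanged.
    """
    # A 1-based position p corresponds to the 0-based index p - 1.
    zero_based_start = None if start is None else start - 1
    # A 1-based inclusive end has the same value as the 0-based exclusive stop.
    zero_based_stop = end
    return zero_based_start, zero_based_stop


print(to_fetch_interval(100, 1000))  # (99, 1000)
print(to_fetch_interval())           # (None, None)
```

Under this assumption, a call such as bam.fetch(contig, *to_fetch_interval(100, 1000)) would cover the same bases as the 1-based region 100 to 1000.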
- hairloom.collect.extract_split_alignments(reads, max_reads=500)
Extract SplitAlignment objects from an IteratorRow, reading at most max_reads reads.
- Parameters:
reads (pysam.IteratorRow) – Reads fetched from a pysam.AlignmentFile
max_reads (int, optional) – Number of reads to extract at maximum. Defaults to 500.
- Returns:
list of SplitAlignment objects
- Return type:
list
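To illustrate the kind of data a split-read extraction works with, here is a minimal sketch that parses the SAM 'SA' tag and stops after max_reads reads. It is not hairloom's implementation: FakeRead, parse_sa_tag, and collect_supplementary are hypothetical names, FakeRead only mimics the has_tag/get_tag/query_name interface of a pysam read, and treating max_reads as a cap on reads examined is an assumption. The SA field layout (rname,pos,strand,CIGAR,mapQ,NM, entries separated by semicolons) comes from the SAM specification.

```python
from dataclasses import dataclass
from itertools import islice


@dataclass
class FakeRead:
    """Stand-in for a pysam read, for illustration only."""
    query_name: str
    tags: dict

    def has_tag(self, tag):
        return tag in self.tags

    def get_tag(self, tag):
        return self.tags[tag]


def parse_sa_tag(sa):
    """Split an SA tag value into (rname, pos, strand, cigar, mapq, nm) tuples."""
    records = []
    for entry in sa.split(";"):
        if not entry:  # the value ends with ';', which leaves an empty string
            continue
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        records.append((rname, int(pos), strand, cigar, int(mapq), int(nm)))
    return records


def collect_supplementary(reads, max_reads=500):
    """Parse SA tags from at most max_reads reads (assumed cap semantics)."""
    collected = {}
    for read in islice(reads, max_reads):
        if read.has_tag("SA"):
            collected[read.query_name] = parse_sa_tag(read.get_tag("SA"))
    return collected


reads = [
    FakeRead("r1", {"SA": "chr2,500,-,60S40M,60,0;"}),
    FakeRead("r2", {}),
    FakeRead("r3", {"SA": "chr1,10,+,30M70S,30,1;chr3,99,-,70S30M,20,2;"}),
]
result = collect_supplementary(iter(reads), max_reads=2)
print(result)  # only r1 is kept: r2 has no SA tag and r3 lies beyond the cap
```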
- hairloom.collect.find_presence_of_matching_sv(sv1, sv2, margin=50)
Check each SV in sv1 for a matching SV in sv2
- Parameters:
sv1 (pandas.DataFrame) – SV table to label matching SVs
sv2 (pandas.DataFrame) – SV table reference to check presence of overlap
margin (int, optional) – Margin (bp) of breakpoint coordinate difference. Defaults to 50.
- Returns:
Boolean flags marking which rows of sv1 have a match in sv2, one value per row of sv1.
- Return type:
pd.Series
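The margin-based matching described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: SVs are represented as pairs of (chrom, pos, ori) tuples instead of DataFrame rows, both tables are assumed to be normalized already, and requiring matching orientation is an assumption. The helper names are hypothetical.

```python
def breakpoints_match(bp_a, bp_b, margin=50):
    """True if two (chrom, pos, ori) breakpoints agree within margin bp."""
    chrom_a, pos_a, ori_a = bp_a
    chrom_b, pos_b, ori_b = bp_b
    return chrom_a == chrom_b and ori_a == ori_b and abs(pos_a - pos_b) <= margin


def has_matching_sv(sv, reference, margin=50):
    """True if any reference SV matches sv at both breakpoints."""
    bp1, bp2 = sv
    return any(
        breakpoints_match(bp1, ref1, margin) and breakpoints_match(bp2, ref2, margin)
        for ref1, ref2 in reference
    )


sv1 = [
    (("chr1", 1000, "+"), ("chr1", 5000, "-")),
    (("chr2", 300, "+"), ("chr5", 800, "+")),
]
sv2 = [(("chr1", 1030, "+"), ("chr1", 4990, "-"))]

# One flag per entry of sv1, mirroring the "length equal to sv1" return shape.
flags = [has_matching_sv(sv, sv2, margin=50) for sv in sv1]
print(flags)  # [True, False]
```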
- hairloom.collect.fix_lower_support_coordinates(bundle, coord_map)
Map breakpoint of lower support to close-by breakpoint with higher support
- Parameters:
bundle (list) – List of BreakpointChain
coord_map (dict) – Map of str(Breakpoint) coordinates
- Returns:
List of BreakpointChain, mapped to fixed coordinates
- Return type:
list[BreakpointChain]
- hairloom.collect.get_breakpoint_support_from_bundle(bundle)
Get breakpoint support count
- Parameters:
bundle (list[BreakpointChain]) – List of BreakpointChain
- Returns:
Support for str(Breakpoint) coordinates
- Return type:
collections.Counter
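Counting support per coordinate string maps naturally onto collections.Counter. The sketch below assumes each chain can be reduced to a list of coordinate strings; the "chrom:pos:ori" format is an assumption made for this example and may differ from what str(Breakpoint) actually produces.

```python
from collections import Counter

# Each chain is represented as a list of breakpoint coordinate strings.
# The "chrom:pos:ori" format is assumed for this sketch.
bundle = [
    ["chr1:100:+", "chr2:500:-"],
    ["chr1:100:+", "chr2:503:-"],
    ["chr1:100:+", "chr2:500:-", "chr3:42:+"],
]

support = Counter()
for chain in bundle:
    # update() adds one count for every coordinate in the chain.
    support.update(chain)

print(support.most_common(1))  # [('chr1:100:+', 3)]
```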
- hairloom.collect.get_normalized_sv(tra)
Sorts (normalizes) a BreakpointPair based on chromosome and position order.
This function ensures a consistent ordering of breakpoints in a BreakpointPair by sorting them based on chromosome precedence and genomic position.
- Parameters:
tra (BreakpointPair) – A pair of breakpoints to normalize. Each breakpoint in the pair is expected to have the attributes chrom, pos, and ori.
- Returns:
A flattened tuple of the normalized breakpoint coordinates in the format (chrom1, pos1, ori1, chrom2, pos2, ori2).
- Return type:
tuple
Notes
Chromosomes are sorted based on their order in Breakpoint.chroms.
If the chromosomes are the same, the breakpoints are sorted by position.
The orientation (ori) remains associated with its respective breakpoint.
Example
>>> from hairloom.collect import Breakpoint, BreakpointPair, get_normalized_sv
>>> brk1 = Breakpoint("chr2", 100, "+")
>>> brk2 = Breakpoint("chr1", 200, "-")
>>> pair = BreakpointPair(brk1, brk2)
>>> get_normalized_sv(pair)
('chr1', 200, '-', 'chr2', 100, '+')
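The ordering rule from the Notes (chromosome rank first, then position, with orientation carried along) can be sketched without the hairloom classes. The function normalize_pair and the explicit chroms list are stand-ins for this example; the real function reads the chromosome order from Breakpoint.chroms.

```python
def normalize_pair(brk1, brk2, chroms):
    """Order two (chrom, pos, ori) breakpoints by chromosome rank, then position."""
    def sort_key(brk):
        chrom, pos, _ori = brk
        # Rank chromosomes by their index in the reference list, not alphabetically.
        return (chroms.index(chrom), pos)

    first, second = sorted([brk1, brk2], key=sort_key)
    # Flatten into one tuple; each orientation stays with its own breakpoint.
    return (*first, *second)


chroms = ["chr1", "chr2", "chr3"]
print(normalize_pair(("chr2", 100, "+"), ("chr1", 200, "-"), chroms))
# ('chr1', 200, '-', 'chr2', 100, '+')
print(normalize_pair(("chr1", 900, "-"), ("chr1", 200, "+"), chroms))
# ('chr1', 200, '+', 'chr1', 900, '-')
```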
- hairloom.collect.get_svtype(tra)
Get SV type string for a given BreakpointPair
- Parameters:
tra (BreakpointPair) – Paired breakpoint object
- Raises:
ValueError – If no SV type has been assigned
- Returns:
SV type string
- Return type:
str
- hairloom.collect.make_bundle(reads)
Make a list of BreakpointChain based on alignment table
- Parameters:
reads (pandas.DataFrame) – Table of read alignment statistics
- Returns:
List of BreakpointChain
- Return type:
list[BreakpointChain]
- hairloom.collect.make_tumor_sv_table(bundle, sv=None, margin=10, get_support=True)
Make SV table from list of BreakpointChain
- Parameters:
bundle (list) – List of BreakpointChain
sv (pandas.DataFrame, optional) – Table of source SVs as reference for in_source flag. Defaults to None.
margin (int, optional) – Margin (bp) for merging clustered breakpoints. Defaults to 10.
get_support (bool, optional) – Merge breakpoints with same coordinates and add count as support. Defaults to True.
- Returns:
SV table from bundle [, with in_source labels] [, collapsed by coordinate with support counts]
- Return type:
pandas.DataFrame
- hairloom.collect.map_similar_coordinate_to_higher_rank(bundle, breakpoint_support, margin=10)
Make mapping of close-by coordinates, with breakpoints of higher support taking priority
- Parameters:
bundle (list) – List of BreakpointChain
breakpoint_support (dict | collections.Counter) – Support for breakpoint coordinates
margin (int, optional) – Margin (bp) to merge close-by coordinates. Defaults to 10.
- Returns:
tuple containing:
coord_map (dict): source -> destination coordinate
coord_map_log (tuple): (max_coord, src_count, max_count) [only for debugging]
- Return type:
tuple
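The idea of snapping low-support coordinates onto a nearby, better-supported coordinate, and then rewriting chains through the resulting map (as fix_lower_support_coordinates is described as doing), can be sketched as below. This is one possible reading of the description, not the library's algorithm: the "chrom:pos:ori" string format, the requirement that chromosome and orientation agree, and the helper names are all assumptions, and the debugging log from the real return value is omitted.

```python
from collections import Counter


def parse_coord(coord):
    """Split an assumed 'chrom:pos:ori' string into (chrom, pos, ori)."""
    chrom, pos, ori = coord.split(":")
    return chrom, int(pos), ori


def build_coord_map(breakpoint_support, margin=10):
    """Map each coordinate to the best-supported coordinate within margin bp."""
    coord_map = {}
    for src, src_count in breakpoint_support.items():
        src_chrom, src_pos, src_ori = parse_coord(src)
        best, best_count = src, src_count
        for dst, dst_count in breakpoint_support.items():
            dst_chrom, dst_pos, dst_ori = parse_coord(dst)
            same_side = (dst_chrom, dst_ori) == (src_chrom, src_ori)
            close = abs(dst_pos - src_pos) <= margin
            # Only a strictly higher count replaces the current best.
            if same_side and close and dst_count > best_count:
                best, best_count = dst, dst_count
        coord_map[src] = best
    return coord_map


def apply_coord_map(bundle, coord_map):
    """Rewrite every coordinate in every chain through coord_map."""
    return [[coord_map.get(coord, coord) for coord in chain] for chain in bundle]


support = Counter({"chr2:500:-": 5, "chr2:503:-": 1, "chr1:100:+": 6})
coord_map = build_coord_map(support, margin=10)
print(coord_map["chr2:503:-"])  # chr2:500:-
print(apply_coord_map([["chr1:100:+", "chr2:503:-"]], coord_map))
# [['chr1:100:+', 'chr2:500:-']]
```

The nested loop is quadratic in the number of distinct coordinates, which keeps the sketch short; sorting coordinates per chromosome would be the usual way to scale it.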
- hairloom.collect.normalize_sv_table(sv, chrom1_col='chromosome_1', chrom2_col='chromosome_2', pos1_col='position_1', pos2_col='position_2', ori1_col='strand_1', ori2_col='strand_2', chroms=None)
Sort breakpoint1 and breakpoint2 of an SV table
- Parameters:
sv (pandas.DataFrame) – Table of SVs
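chrom1_col (str, optional) – Name of the chromosome column for breakpoint 1. Defaults to ‘chromosome_1’.
chrom2_col (str, optional) – Name of the chromosome column for breakpoint 2. Defaults to ‘chromosome_2’.
pos1_col (str, optional) – Name of the position column for breakpoint 1. Defaults to ‘position_1’.
pos2_col (str, optional) – Name of the position column for breakpoint 2. Defaults to ‘position_2’.
ori1_col (str, optional) – Name of the orientation (strand) column for breakpoint 1. Defaults to ‘strand_1’.
ori2_col (str, optional) – Name of the orientation (strand) column for breakpoint 2. Defaults to ‘strand_2’.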
chroms (list, optional) – List of input contigs for coordinate sorting. Defaults to None.
- Returns:
Sorted (normalized) SV table
- Return type:
pandas.DataFrame
- hairloom.collect.pull_breakpoints_from_reads_in_sv_regions(bam, tra, get_read_table=False, min_n_breakpoint=2, margin=10)
Extract and append BreakpointChain objects from a BAM file and a table of SVs
- Parameters:
bam (pysam.AlignmentFile) – BAM file
tra (pandas.DataFrame) – Table of SVs
get_read_table (bool, optional) – Return table of read alignment stats. Defaults to False.
min_n_breakpoint (int, optional) – Minimum number of breakpoints required to be saved. Higher values select for complex rearrangements. Defaults to 2.
margin (int, optional) – Margin (bp) from breakpoints to fetch reads. Defaults to 10.
- Returns:
A list of BreakpointChain objects
- Return type:
list[BreakpointChain]
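Two of the parameters above lend themselves to a short sketch: margin widens the region fetched around each breakpoint, and min_n_breakpoint drops chains with too few breakpoints. The helpers below are hypothetical and only illustrate that reading; they assume 1-based positions and represent chains as plain lists of coordinate strings.

```python
def fetch_window(pos, margin=10):
    """Return a (start, end) window of +/- margin bp around pos, clipped at 1."""
    return max(1, pos - margin), pos + margin


def keep_complex_chains(bundle, min_n_breakpoint=2):
    """Keep only chains with at least min_n_breakpoint breakpoints."""
    return [chain for chain in bundle if len(chain) >= min_n_breakpoint]


print(fetch_window(1000, margin=10))  # (990, 1010)
print(fetch_window(5, margin=10))     # (1, 15)

bundle = [
    ["chr1:100:+"],
    ["chr1:100:+", "chr2:500:-"],
    ["chr1:100:+", "chr2:500:-", "chr3:42:+"],
]
print(len(keep_complex_chains(bundle, min_n_breakpoint=2)))  # 2
print(len(keep_complex_chains(bundle, min_n_breakpoint=3)))  # 1
```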
- hairloom.collect.pull_sv_supporting_reads_from_bundle(sv, bundle)
Filter bundle to include BreakpointChain objects that have breakpoints matching those of the input sv table
- Parameters:
sv (pandas.DataFrame) – SV table
bundle (list) – List of BreakpointChain
- Returns:
Filtered list of BreakpointChain
- Return type:
list
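The filtering step can be sketched as a set-membership test. This is a simplified illustration rather than the library's logic: it assumes exact coordinate matches, represents the SV table as pairs of "chrom:pos:ori" strings, and keeps a chain if any one of its breakpoints appears in the table. The function names are hypothetical.

```python
def chain_supports_sv(chain, sv_breakpoints):
    """True if any breakpoint in the chain is one of the SV breakpoints."""
    return any(coord in sv_breakpoints for coord in chain)


def filter_bundle(sv_pairs, bundle):
    """Keep chains that share at least one breakpoint with the SV pairs."""
    # Flatten the SV pairs into one set for constant-time membership tests.
    sv_breakpoints = {coord for pair in sv_pairs for coord in pair}
    return [chain for chain in bundle if chain_supports_sv(chain, sv_breakpoints)]


sv_pairs = [("chr1:100:+", "chr2:500:-")]
bundle = [
    ["chr1:100:+", "chr2:500:-"],
    ["chr3:42:+", "chr4:7:-"],
    ["chr2:500:-", "chr9:1:+"],
]
filtered = filter_bundle(sv_pairs, bundle)
print(len(filtered))  # 2
```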