hairloom.utils#
Functions
Enumerates breakpoints from a DataFrame of genomic fragments. |
|
|
Extracts secondary alignments from a sequencing read's 'SA' tag. |
|
|
|
Generates a table of breakpoints with support information. |
|
Creates a table of genomic segments with support information. |
|
Creates a table summarizing split-read alignments. |
|
Creates a table of translocations with support information. |
|
Classes
|
Represents a genomic breakpoint with associated properties and methods. |
|
Represents a chain of genomic breakpoints. |
- hairloom.utils.enumerate_breakpoints(df)[source]#
Enumerates breakpoints from a DataFrame of genomic fragments.
This function generates a
BreakpointChainobject containingBreakpointobjects based on the start and end positions of fragments in the input DataFrame. Breakpoints at the beginning and end of reads are treated differently to omit read boundaries.- Parameters:
df (pandas.DataFrame) –
A DataFrame containing fragment information, with the following columns:
chrom: The chromosome name (str).
start: The start position of the fragment (int).
end: The end position of the fragment (int).
strand: The strand information (‘+’ or ‘-‘).
- Returns:
A chain of Breakpoint objects enumerated from the DataFrame.
- Return type:
Notes
For the first fragment in the DataFrame, only the end position is included as a breakpoint.
For the last fragment in the DataFrame, only the start position is included as a breakpoint.
For middle fragments, both start and end positions are included, with orientations determined by the strand.
Example
>>> import pandas as pd >>> from your_module import Breakpoint, BreakpointChain, enumerate_breakpoints >>> df = pd.DataFrame({ ... 'chrom': ['chr1', 'chr1', 'chr1'], ... 'start': [100, 200, 300], ... 'end': [150, 250, 350], ... 'strand': ['+', '+', '-'] ... }) >>> brks = enumerate_breakpoints(df) >>> print(brks) [chr1:150:+, chr1:200:-, chr1:250:+, chr1:300:-]
- hairloom.utils.get_secondaries(read)[source]#
Extracts secondary alignments from a sequencing read’s ‘SA’ tag.
This function retrieves secondary alignment information stored in the ‘SA’ tag of a sequencing read. It parses the tag and returns a list of secondary alignments.
- Parameters:
read (pysam.AlignedSegment) – A sequencing read object, typically from a BAM or SAM file, that includes tags.
- Returns:
A list of secondary alignment strings parsed from the ‘SA’ tag. If the ‘SA’ tag is not present, an empty list is returned.
- Return type:
list of str
Example
>>> from pysam import AlignedSegment >>> read = AlignedSegment() >>> read.tags = [('SA', 'chr1,100,+,60M,60;chr2,200,-,60M,60;')] >>> secondaries = get_secondaries(read) >>> print(secondaries) ['chr1,100,+,60M,60', 'chr2,200,-,60M,60']
Notes
The ‘SA’ tag stores secondary alignments in a semicolon-separated format.
- Each secondary alignment string typically contains the following fields:
[chromosome, position, strand, CIGAR, mapping quality].
- hairloom.utils.make_brk_table(bundle, chroms=None)[source]#
Generates a table of breakpoints with support information.
This function processes a bundle of
BreakpointChainobjects, extracts breakpoint coordinates, and organizes them into a structured pandas DataFrame. Optionally, the resulting table can be filtered by specific chromosomes.- Parameters:
bundle (list[BreakpointChain]) – A list of
BreakpointChainobjects containing breakpoint information.chroms (list[str], optional) – A list of chromosome names to include. If None, all chromosomes are considered. Defaults to None.
- Returns:
- A DataFrame containing the breakpoint table with the following
columns:
chrom (str): Chromosome name.
pos (int): Position of the breakpoint.
ori (str): Orientation of the breakpoint (‘+’ or ‘-‘).
support (int): Support count for the breakpoint.
- Return type:
pandas.DataFrame
Notes
Breakpoints are filtered based on the chroms parameter if provided.
Infinite values in the resulting DataFrame are replaced with NaN to maintain consistency.
Example
>>> from hairloom.datatypes import Breakpoint, BreakpointChain >>> from hairloom.utils import make_brk_table >>> bundle = [ ... BreakpointChain([ ... Breakpoint("chr1", 100, "+"), ... Breakpoint("chr1", 200, "+"), ... Breakpoint("chr2", 300, "+"), ... ]), ... ] >>> chroms = ["chr1", "chr2"] >>> brk_df = make_brk_table(bundle, chroms) >>> print(brk_df) chrom pos ori support 0 chr1 100 + 1 1 chr1 200 + 1 2 chr2 300 + 1
- hairloom.utils.make_seg_table(bundle, chroms=None)[source]#
Creates a table of genomic segments with support information.
This function processes a bundle of
BreakpointChainobjects, extracts genomic segment coordinates, and combines them with support data into a structured pandas DataFrame.- Parameters:
bundle (list[BreakpointChain]) – A list of
BreakpointChainobjects containing segment information.chroms (list[str], optional) – A list of chromosomes to include. If None, all chromosomes are considered. Defaults to None.
- Returns:
- A DataFrame containing the breakpoint table with the following
columns:
chrom (str): Chromosome name.
pos (int): Position of the breakpoint.
ori (str): Orientation of the breakpoint (‘+’ or ‘-‘).
support (int): Support count for the breakpoint.
- Return type:
pandas.DataFrame
Notes
Segments are filtered based on the chroms parameter if provided.
Infinite values in the resulting DataFrame are replaced with NaN for consistency.
Example
>>> from your_module import BreakpointChain, make_seg_table >>> bundle = [ ... Breakpoint('chr1', 100, '+'), ... Breakpoint('chr2', 200, '+'), ... Breakpoint('chr2', 300, '+'), ... Breakpoint('chr1', 400, '+'), ... ] # List of BreakpointChain objects >>> seg_df = make_seg_table(bundle, chroms=["chr1", "chr2"]) >>> print(seg_df.head()) chrom pos1 pos2 support 0 chr2 200 300 1
- hairloom.utils.make_split_read_table(alignments)[source]#
Creates a table summarizing split-read alignments.
This function processes a list of alignment objects, extracting key attributes and organizing them into a structured pandas DataFrame. The resulting table is deduplicated and sorted for consistent organization.
- Parameters:
alignments (list) – A list of alignment objects. Each alignment object is expected to have the following attributes:
read_name (-) – Query name of the read.
refname (-) – Reference sequence name (chromosome or contig).
start (-) – Start position of the alignment on the reference.
end (-) – End position of the alignment on the reference.
strand (-) – Strand information (‘+’ or ‘-‘).
clip1 (-) – Length of the first soft/hard clip before the matched region.
clip2 (-) – Length of the second soft/hard clip after the matched region.
match (-) – Total number of matched bases in the alignment.
pclip1 (-) – Strand-corrected length of the first clip.
- Returns:
A DataFrame containing the following columns:
qname (str): Query name of the read.
chrom (str): Reference sequence name (chromosome or contig).
start (int): Start position of the alignment on the reference.
end (int): End position of the alignment on the reference.
strand (str): Strand information (‘+’ or ‘-‘).
clip1 (int): Length of the first soft/hard clip before the matched region.
match (int): Total number of matched bases in the alignment.
clip2 (int): Length of the second soft/hard clip after the matched region.
pclip1 (int): Strand-corrected length of the first clip.
- Return type:
pandas.DataFrame
Notes
Duplicate rows are removed based on the full set of columns.
The DataFrame is sorted by qname and pclip1 for consistent organization.
Example
>>> from hairloom.utils import make_split_read_table >>> class Alignment: ... def __init__(self, read_name, refname, start, end, strand, clip1, match, clip2, pclip1): ... self.read_name = read_name ... self.refname = refname ... self.start = start ... self.end = end ... self.strand = strand ... self.clip1 = clip1 ... self.match = match ... self.clip2 = clip2 ... self.pclip1 = pclip1 >>> alignments = [ ... Alignment(read_name="read1", refname="chr1", start=100, end=150, ... strand='+', clip1=10, match=40, clip2=5, pclip1=10), ... Alignment(read_name="read2", refname="chr2", start=200, end=250, ... strand='-', clip1=15, match=35, clip2=10, pclip1=15) ... ] >>> df = make_split_read_table(alignments) >>> print(df) qname chrom start end strand clip1 match clip2 pclip1 0 read1 chr1 100 150 + 10 40 5 10 1 read2 chr2 200 250 - 15 35 10 15
- hairloom.utils.make_tra_table(bundle)[source]#
Creates a table of translocations with support information.
This function processes a bundle of
BreakpointChainobjects, extracts translocation coordinates, and combines them with support data into a structured pandas DataFrame.- Parameters:
bundle (list[BreakpointChain]) – A list of
BreakpointChainobjects containing translocation information.- Returns:
A DataFrame containing the following columns:
chrom1 (str): Chromosome name of the first breakpoint.
pos1 (int): Position of the first breakpoint.
ori1 (str): Orientation of the first breakpoint (‘+’ or ‘-‘).
chrom2 (str): Chromosome name of the second breakpoint.
pos2 (int): Position of the second breakpoint.
ori2 (str): Orientation of the second breakpoint (‘+’ or ‘-‘).
support (int): Support count for the translocation.
- Return type:
pandas.DataFrame
Notes
Translocations are identified by pairs of breakpoints (BreakpointPair objects) in the tras attribute of each BreakpointChain.
Duplicate translocations are removed by maintaining a set of seen coordinate pairs.
Infinite values in the resulting DataFrame are replaced with NaN for consistency.
Example
>>> from hairloom.datatypes import Breakpoint, BreakpointChain >>> from hairloom.utils import make_tra_table >>> chain1 = BreakpointChain([ ... Breakpoint('chr1', 100, '+'), ... Breakpoint('chr2', 200, '+') ... ]) >>> chain2 = BreakpointChain([ ... Breakpoint('chr2', 300, '+'), ... Breakpoint('chr1', 400, '+') ... ]) >>> bundle = [chain1, chain2] >>> tra_df = make_tra_table(bundle) >>> print(tra_df) chrom1 pos1 ori1 chrom2 pos2 ori2 support 0 chr1 100 + chr2 200 + 1 1 chr2 300 + chr1 400 + 1