hairloom package#
Module contents#
- class hairloom.Breakpoint(chrom, pos, orientation)[source]#
Bases:
objectRepresents a genomic breakpoint with associated properties and methods.
- chrom#
The chromosome name where the breakpoint is located.
- Type:
str
- pos#
The 1-based position of the breakpoint on the chromosome.
- Type:
int
- ori#
The orientation of the breakpoint (‘+’ or ‘-‘).
- Type:
str
- upstream#
Sequence upstream of the breakpoint, initialized to None.
- Type:
str or None
- downstream#
Sequence downstream of the breakpoint, initialized to None.
- Type:
str or None
- seq_rearranged#
Rearranged sequence at the breakpoint, initialized to None.
- Type:
str or None
- seq_removed#
Removed sequence at the breakpoint, initialized to None.
- Type:
str or None
- chroms#
List of valid chromosome names, including both standard (‘1’, ‘2’, …, ‘X’, ‘Y’, ‘M’) and prefixed (‘chr1’, ‘chr2’, …, ‘chrX’, ‘chrY’).
- Type:
list[str]
- chroms = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'M', 'chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']#
- get_breakpoint_seqs(margin, genome)[source]#
Retrieves upstream and downstream sequences around the breakpoint.
Computes rearranged and removed sequences based on the given margin and genome dictionary.
- Parameters:
margin (int) – Number of bases to include upstream and downstream of the breakpoint.
genome (dict) – A dictionary mapping chromosome names to their respective sequences.
- Raises:
ValueError – If the chromosome is not found in the genome.
- class hairloom.BreakpointChain(brks_iterable)[source]#
Bases:
listRepresents a chain of genomic breakpoints.
This will often represent the chain of breakpoints coming from a single read. This class extends the Python list to store breakpoints and provides methods to enumerate transitions and segments.
- tras#
List of transitions (pairs of breakpoints).
- Type:
list[BreakpointPair]
- segs#
List of segments (pairs of breakpoints).
- Type:
list[BreakpointPair]
- Parameters:
brks_iterable (iterable) – An iterable containing breakpoint objects.
Example
>>> brk1 = Breakpoint("chr1", 100, "+") >>> brk2 = Breakpoint("chr1", 200, "-") >>> chain = BreakpointChain([brk1, brk2]) >>> chain.tras [BreakpointPair(brk1, brk2)]
- class hairloom.BreakpointPair(brk1, brk2)[source]#
Bases:
objectRepresents a pair of genomic breakpoints.
- brk1#
The first breakpoint in the pair.
- Type:
- brk2#
The second breakpoint in the pair.
- Type:
- aln_segment#
Indicates whether this pair is part of an alignment segment. Defaults to False.
- Type:
bool
- class hairloom.Segments(df)[source]#
Bases:
objectCalculates and stores genomic segments (middle fragments).
- list#
A list of segments, each represented as a tuple (chrom, start, end).
- Type:
list[tuple]
- Parameters:
df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, and ‘end’.
Example
>>> df = pd.DataFrame({ ... 'qname': ['read1', 'read1', 'read1'], ... 'chrom': ['chr1', 'chr1', 'chr1'], ... 'start': [100, 200, 300], ... 'end': [150, 250, 350], ... }) >>> segments = Segments(df) >>> segments.list [('chr1', 200, 250)]
- get_list()[source]#
Computes segments (middle fragments) from the DataFrame.
Notes
Segments are defined as genomic fragments that are neither the first nor the last fragment in a group (grouped by ‘qname’).
If a group contains fewer than three fragments, no segments are added.
- Modifies:
list (list[tuple]): Appends computed segments to the list attribute.
Example
>>> df = pd.DataFrame({ ... 'qname': ['read1', 'read1', 'read1', 'read2'], ... 'chrom': ['chr1', 'chr1', 'chr1', 'chr2'], ... 'start': [100, 200, 300, 400], ... 'end': [150, 250, 350, 450], ... }) >>> segments = Segments(df) >>> segments.list [('chr1', 200, 250)]
- class hairloom.Transitions(df)[source]#
Bases:
objectCalculates transitions (tra) between genomic fragments.
- list#
A list of transitions, where each transition is a tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).
- Type:
list[tuple]
- Parameters:
df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, ‘end’, and ‘strand’.
Example
>>> df = pd.DataFrame({ ... 'qname': ['read1', 'read1'], ... 'chrom': ['chr1', 'chr2'], ... 'start': [100, 200], ... 'end': [150, 250], ... 'strand': ['+', '-'] ... }) >>> transitions = Transitions(df) >>> transitions.list [(('chr1', 150, '+'), ('chr2', 200, '-'))]
- get_list()[source]#
Computes transitions (tra) from the DataFrame.
Notes
Transitions are calculated between consecutive fragments in the DataFrame grouped by ‘qname’.
For each fragment pair, the orientation and positions are determined based on the strand, creating a transition tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).
- Modifies:
list (list[tuple]): Appends computed transitions to the list attribute.
Example
>>> df = pd.DataFrame({ ... 'qname': ['read1', 'read1', 'read2'], ... 'chrom': ['chr1', 'chr2', 'chr3'], ... 'start': [100, 200, 300], ... 'end': [150, 250, 350], ... 'strand': ['+', '-', '+'] ... }) >>> transitions = Transitions(df) >>> transitions.list # Note: 'read2' won't be included [(('chr1', 150, '+'), ('chr2', 250, '-'))]
- hairloom.extract_read_data(bam, contig, start=None, end=None, max_reads=500)[source]#
Extract alignment tables for split reads and concatenate them into a single DataFrame.
This function retrieves reads from a specified region in a BAM file, extracts split alignments, and organizes them into a structured pandas DataFrame.
- Parameters:
bam (pysam.AlignmentFile) – The input BAM file opened with pysam.AlignmentFile.
contig (str) – The contig (e.g., chromosome) to extract reads from.
start (int, optional) – 1-based start position of the region. If None, reads are fetched from the beginning of the contig. Defaults to None.
end (int, optional) – 1-based end position of the region. If None, reads are fetched to the end of the contig. Defaults to None.
max_reads (int, optional) – The maximum number of reads to extract. Defaults to 500.
- Returns:
- A DataFrame containing alignment data for all split reads in the region,
concatenated and organized.
- Return type:
pd.DataFrame
Notes
The start and end positions are converted to 0-based coordinates for compatibility with pysam.AlignmentFile.fetch.
The make_split_read_table function is used to create the DataFrame from the extracted alignments.
Example
>>> import pysam >>> bam = pysam.AlignmentFile("example.bam", "rb") >>> df = extract_read_data(bam, contig="chr1", start=100, end=1000, max_reads=100) >>> print(df.head())
- hairloom.get_svtype(tra)[source]#
Get SV type string for a given
BreakpointPair- Parameters:
tra (
BreakpointPair) – Paired breakpoint object- Raises:
ValueError – If no SV type has been assigned
- Returns:
SV type string
- Return type:
str
- hairloom.make_brk_table(bundle, chroms=None)[source]#
Generates a table of breakpoints with support information.
This function processes a bundle of
BreakpointChainobjects, extracts breakpoint coordinates, and organizes them into a structured pandas DataFrame. Optionally, the resulting table can be filtered by specific chromosomes.- Parameters:
bundle (list[BreakpointChain]) – A list of
BreakpointChainobjects containing breakpoint information.chroms (list[str], optional) – A list of chromosome names to include. If None, all chromosomes are considered. Defaults to None.
- Returns:
- A DataFrame containing the breakpoint table with the following
columns:
chrom (str): Chromosome name.
pos (int): Position of the breakpoint.
ori (str): Orientation of the breakpoint (‘+’ or ‘-‘).
support (int): Support count for the breakpoint.
- Return type:
pandas.DataFrame
Notes
Breakpoints are filtered based on the chroms parameter if provided.
Infinite values in the resulting DataFrame are replaced with NaN to maintain consistency.
Example
>>> from hairloom.datatypes import Breakpoint, BreakpointChain >>> from hairloom.utils import make_brk_table >>> bundle = [ ... BreakpointChain([ ... Breakpoint("chr1", 100, "+"), ... Breakpoint("chr1", 200, "+"), ... Breakpoint("chr2", 300, "+"), ... ]), ... ] >>> chroms = ["chr1", "chr2"] >>> brk_df = make_brk_table(bundle, chroms) >>> print(brk_df) chrom pos ori support 0 chr1 100 + 1 1 chr1 200 + 1 2 chr2 300 + 1
- hairloom.make_bundle(reads)[source]#
Make a list of
BreapointChainbased on alignment table- Parameters:
reads (pandas.DataFrame) – Table of read alignment statistics
- Returns:
List of
BreakpointChain- Return type:
list[BreakpointChain]
- hairloom.make_seg_table(bundle, chroms=None)[source]#
Creates a table of genomic segments with support information.
This function processes a bundle of
BreakpointChainobjects, extracts genomic segment coordinates, and combines them with support data into a structured pandas DataFrame.- Parameters:
bundle (list[BreakpointChain]) – A list of
BreakpointChainobjects containing segment information.chroms (list[str], optional) – A list of chromosomes to include. If None, all chromosomes are considered. Defaults to None.
- Returns:
- A DataFrame containing the breakpoint table with the following
columns:
chrom (str): Chromosome name.
pos (int): Position of the breakpoint.
ori (str): Orientation of the breakpoint (‘+’ or ‘-‘).
support (int): Support count for the breakpoint.
- Return type:
pandas.DataFrame
Notes
Segments are filtered based on the chroms parameter if provided.
Infinite values in the resulting DataFrame are replaced with NaN for consistency.
Example
>>> from your_module import BreakpointChain, make_seg_table >>> bundle = [ ... Breakpoint('chr1', 100, '+'), ... Breakpoint('chr2', 200, '+'), ... Breakpoint('chr2', 300, '+'), ... Breakpoint('chr1', 400, '+'), ... ] # List of BreakpointChain objects >>> seg_df = make_seg_table(bundle, chroms=["chr1", "chr2"]) >>> print(seg_df.head()) chrom pos1 pos2 support 0 chr2 200 300 1
- hairloom.make_tra_table(bundle)[source]#
Creates a table of translocations with support information.
This function processes a bundle of
BreakpointChainobjects, extracts translocation coordinates, and combines them with support data into a structured pandas DataFrame.- Parameters:
bundle (list[BreakpointChain]) – A list of
BreakpointChainobjects containing translocation information.- Returns:
A DataFrame containing the following columns:
chrom1 (str): Chromosome name of the first breakpoint.
pos1 (int): Position of the first breakpoint.
ori1 (str): Orientation of the first breakpoint (‘+’ or ‘-‘).
chrom2 (str): Chromosome name of the second breakpoint.
pos2 (int): Position of the second breakpoint.
ori2 (str): Orientation of the second breakpoint (‘+’ or ‘-‘).
support (int): Support count for the translocation.
- Return type:
pandas.DataFrame
Notes
Translocations are identified by pairs of breakpoints (BreakpointPair objects) in the tras attribute of each BreakpointChain.
Duplicate translocations are removed by maintaining a set of seen coordinate pairs.
Infinite values in the resulting DataFrame are replaced with NaN for consistency.
Example
>>> from hairloom.datatypes import Breakpoint, BreakpointChain >>> from hairloom.utils import make_tra_table >>> chain1 = BreakpointChain([ ... Breakpoint('chr1', 100, '+'), ... Breakpoint('chr2', 200, '+') ... ]) >>> chain2 = BreakpointChain([ ... Breakpoint('chr2', 300, '+'), ... Breakpoint('chr1', 400, '+') ... ]) >>> bundle = [chain1, chain2] >>> tra_df = make_tra_table(bundle) >>> print(tra_df) chrom1 pos1 ori1 chrom2 pos2 ori2 support 0 chr1 100 + chr2 200 + 1 1 chr2 300 + chr1 400 + 1
- hairloom.normalize_sv_table(sv, chrom1_col='chromosome_1', chrom2_col='chromosome_2', pos1_col='position_1', pos2_col='position_2', ori1_col='strand_1', ori2_col='strand_2', chroms=None)[source]#
Sort breakpoint1 and breakpoint2 of a SV table
- Parameters:
sv (pandas.DataFrame) – Table of SVs
chrom1_col (str, optional) – Defaults to ‘chromosome_1’.
chrom2_col (str, optional) – Defaults to ‘chromosome_2’.
pos1_col (str, optional) – Defaults to ‘position_1’.
pos2_col (str, optional) – Defaults to ‘position_2’.
ori1_col (str, optional) – Defaults to ‘strand_1’.
ori2_col (str, optional) – Defaults to ‘strand_2’.
chroms (list, optional) – List of input contigs for coordinate sorting. Defaults to None.
- Returns:
Sorted (normalized) SV table
- Return type:
pandas.DataFrame
Submodules#
hairloom.collect module#
- hairloom.collect.extract_read_data(bam, contig, start=None, end=None, max_reads=500)[source]#
Extract alignment tables for split reads and concatenate them into a single DataFrame.
This function retrieves reads from a specified region in a BAM file, extracts split alignments, and organizes them into a structured pandas DataFrame.
- Parameters:
bam (pysam.AlignmentFile) – The input BAM file opened with pysam.AlignmentFile.
contig (str) – The contig (e.g., chromosome) to extract reads from.
start (int, optional) – 1-based start position of the region. If None, reads are fetched from the beginning of the contig. Defaults to None.
end (int, optional) – 1-based end position of the region. If None, reads are fetched to the end of the contig. Defaults to None.
max_reads (int, optional) – The maximum number of reads to extract. Defaults to 500.
- Returns:
- A DataFrame containing alignment data for all split reads in the region,
concatenated and organized.
- Return type:
pd.DataFrame
Notes
The start and end positions are converted to 0-based coordinates for compatibility with pysam.AlignmentFile.fetch.
The make_split_read_table function is used to create the DataFrame from the extracted alignments.
Example
>>> import pysam >>> bam = pysam.AlignmentFile("example.bam", "rb") >>> df = extract_read_data(bam, contig="chr1", start=100, end=1000, max_reads=100) >>> print(df.head())
- hairloom.collect.extract_split_alignments(reads, max_reads=500)[source]#
Extract
SplitAlignmentobjects from IteratorRow with a max_reads parameter- Parameters:
reads (pysam.IteratorRow) – Reads fetched from a pysam.Alignmentfile
max_reads (int, optional) – Number of reads to extract at maximum. Defaults to 500.
- Returns:
list of
SplitAlignmentobjects- Return type:
list
- hairloom.collect.find_presence_of_matching_sv(sv1, sv2, margin=50)[source]#
Check overlap of sv2 for sv1 table
- Parameters:
sv1 (pandas.DataFrame) – SV table to label matching SVs
sv2 (pandas.DataFrame) – SV table reference to check presence of overlap
margin (int, optional) – Margin (bp) of breakpoint coordinate difference. Defaults to 50.
- Returns:
{True, False} list of matches. Length equal to sv1 row size.
- Return type:
pd.Series
- hairloom.collect.fix_lower_support_coordinates(bundle, coord_map)[source]#
Map breakpoint of lower support to close-by breakpoint with higher support
- Parameters:
bundle (list) – List of
BreakpointChaincoord_map (dict) – Map of str(
Breakpoint) coordinates
- Returns:
List of
BreakpointChain, mapped to fixed coordinates- Return type:
list[BreakpointChain]
- hairloom.collect.get_breakpoint_support_from_bundle(bundle)[source]#
Get breakpoint support count
- Parameters:
bundle (list[BreakpointChain]) – List of
BreakpointChain- Returns:
Support for str(
Breakpoint) coordinates- Return type:
collections.Counter
- hairloom.collect.get_normalized_sv(tra)[source]#
Sorts (normalizes) a
BreakpointPairbased on chromosome and position order.This function ensures a consistent ordering of breakpoints in a
BreakpointPairby sorting them based on chromosome precedence and genomic position.- Parameters:
tra (
BreakpointPair) – A pair of breakpoints to normalize. Each breakpoint in the pair is expected to have the attributes chrom, pos, and ori.- Returns:
- A flattened tuple of the normalized breakpoint coordinates in the format:
[chrom1, pos1, ori1, chrom2, pos2, ori2].
- Return type:
tuple
Notes
Chromosomes are sorted based on their order in
Breakpoint.chroms.If the chromosomes are the same, the breakpoints are sorted by position.
The orientation (ori) remains associated with its respective breakpoint.
Example
>>> from your_module import Breakpoint, BreakpointPair, get_normalized_sv >>> brk1 = Breakpoint("chr2", 100, "+") >>> brk2 = Breakpoint("chr1", 200, "-") >>> pair = BreakpointPair(brk1, brk2) >>> get_normalized_sv(pair) ('chr1', 200, '-', 'chr2', 100, '+')
- hairloom.collect.get_svtype(tra)[source]#
Get SV type string for a given
BreakpointPair- Parameters:
tra (
BreakpointPair) – Paired breakpoint object- Raises:
ValueError – If no SV type has been assigned
- Returns:
SV type string
- Return type:
str
- hairloom.collect.make_bundle(reads)[source]#
Make a list of
BreapointChainbased on alignment table- Parameters:
reads (pandas.DataFrame) – Table of read alignment statistics
- Returns:
List of
BreakpointChain- Return type:
list[BreakpointChain]
- hairloom.collect.make_tumor_sv_table(bundle, sv=None, margin=10, get_support=True)[source]#
Make SV table from list of
BreakpointChain- Parameters:
bundle (list) – List of
BreakpointChainsv (pandas.DataFrame, optional) – Table of source SVs as reference for in_source flag. Defaults to None
margin (int, optional) – Margin (bp) for merging clustered breakpoints. Defaults to 10.
get_support (bool, optional) – Merge breakpoints with same coordinates and add count as support. Defaults to True.
- Returns:
SV table from bundle [, with in_source labels] [, collapsed by coordinate with support counts]
- Return type:
pandas.DataFrame
- hairloom.collect.map_similar_coordinate_to_higher_rank(bundle, breakpoint_support, margin=10)[source]#
Make mapping of close-by coordinates, with breakpoints of higher support taking priority
- Parameters:
bundle (list) – List of
BreakpointChainbreakpoint_support (dict | collections.Counter) – Support for breakpoint coordinates
margin (int, optional) – Margin (bp) to merge close-by coordinates. Defaults to 10.
- Returns:
tuple containing:
coord_map (dict): source -> destination coordinate
coord_map_log (tuple): (max_coord, src_count, max_count) [only for debugging]
- Return type:
tuple
- hairloom.collect.normalize_sv_table(sv, chrom1_col='chromosome_1', chrom2_col='chromosome_2', pos1_col='position_1', pos2_col='position_2', ori1_col='strand_1', ori2_col='strand_2', chroms=None)[source]#
Sort breakpoint1 and breakpoint2 of a SV table
- Parameters:
sv (pandas.DataFrame) – Table of SVs
chrom1_col (str, optional) – Defaults to ‘chromosome_1’.
chrom2_col (str, optional) – Defaults to ‘chromosome_2’.
pos1_col (str, optional) – Defaults to ‘position_1’.
pos2_col (str, optional) – Defaults to ‘position_2’.
ori1_col (str, optional) – Defaults to ‘strand_1’.
ori2_col (str, optional) – Defaults to ‘strand_2’.
chroms (list, optional) – List of input contigs for coordinate sorting. Defaults to None.
- Returns:
Sorted (normalized) SV table
- Return type:
pandas.DataFrame
- hairloom.collect.pull_breakpoints_from_reads_in_sv_regions(bam, tra, get_read_table=False, min_n_breakpoint=2, margin=10)[source]#
Extract and append
BreakpointChainobjects from a bam file and a table of SVs- Parameters:
bam (pysam.AlignmentFile) – BAM file
tra (pandas.DataFrame) – Table of SVs
get_read_table (bool, optional) – Return table of read alignment stats. Defaults to False.
min_n_breakpoint (int, optional) – Minimum number of breakpoints required to be saved. Useful in selecting complex rearrangements if the number is high. Defaults to 3.
margin (int, optional) – Margin (bp) from breakpoints to fetch reads. Defaults to 10.
- Returns:
A list of BreakpointChain objects
- Return type:
list[BreakpointChain]
- hairloom.collect.pull_sv_supporting_reads_from_bundle(sv, bundle)[source]#
Filter bundle to include
BreakpointChainobjects that have breakpoints matching that of the input sv table- Parameters:
sv (pandas.DataFrame) – SV table
bundle (list) – list of
BreapointChain
- Returns:
Filtered list of
BreakpointChain- Return type:
list
hairloom.datatypes module#
- class hairloom.datatypes.Breakpoint(chrom, pos, orientation)[source]#
Bases:
objectRepresents a genomic breakpoint with associated properties and methods.
- chrom#
The chromosome name where the breakpoint is located.
- Type:
str
- pos#
The 1-based position of the breakpoint on the chromosome.
- Type:
int
- ori#
The orientation of the breakpoint (‘+’ or ‘-‘).
- Type:
str
- upstream#
Sequence upstream of the breakpoint, initialized to None.
- Type:
str or None
- downstream#
Sequence downstream of the breakpoint, initialized to None.
- Type:
str or None
- seq_rearranged#
Rearranged sequence at the breakpoint, initialized to None.
- Type:
str or None
- seq_removed#
Removed sequence at the breakpoint, initialized to None.
- Type:
str or None
- chroms#
List of valid chromosome names, including both standard (‘1’, ‘2’, …, ‘X’, ‘Y’, ‘M’) and prefixed (‘chr1’, ‘chr2’, …, ‘chrX’, ‘chrY’).
- Type:
list[str]
- chroms = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'M', 'chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']#
- get_breakpoint_seqs(margin, genome)[source]#
Retrieves upstream and downstream sequences around the breakpoint.
Computes rearranged and removed sequences based on the given margin and genome dictionary.
- Parameters:
margin (int) – Number of bases to include upstream and downstream of the breakpoint.
genome (dict) – A dictionary mapping chromosome names to their respective sequences.
- Raises:
ValueError – If the chromosome is not found in the genome.
- class hairloom.datatypes.BreakpointChain(brks_iterable)[source]#
Bases:
listRepresents a chain of genomic breakpoints.
This will often represent the chain of breakpoints coming from a single read. This class extends the Python list to store breakpoints and provides methods to enumerate transitions and segments.
- tras#
List of transitions (pairs of breakpoints).
- Type:
list[BreakpointPair]
- segs#
List of segments (pairs of breakpoints).
- Type:
list[BreakpointPair]
- Parameters:
brks_iterable (iterable) – An iterable containing breakpoint objects.
Example
>>> brk1 = Breakpoint("chr1", 100, "+") >>> brk2 = Breakpoint("chr1", 200, "-") >>> chain = BreakpointChain([brk1, brk2]) >>> chain.tras [BreakpointPair(brk1, brk2)]
- class hairloom.datatypes.BreakpointPair(brk1, brk2)[source]#
Bases:
objectRepresents a pair of genomic breakpoints.
- brk1#
The first breakpoint in the pair.
- Type:
- brk2#
The second breakpoint in the pair.
- Type:
- aln_segment#
Indicates whether this pair is part of an alignment segment. Defaults to False.
- Type:
bool
- class hairloom.datatypes.Segments(df)[source]#
Bases:
objectCalculates and stores genomic segments (middle fragments).
- list#
A list of segments, each represented as a tuple (chrom, start, end).
- Type:
list[tuple]
- Parameters:
df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, and ‘end’.
Example
>>> df = pd.DataFrame({ ... 'qname': ['read1', 'read1', 'read1'], ... 'chrom': ['chr1', 'chr1', 'chr1'], ... 'start': [100, 200, 300], ... 'end': [150, 250, 350], ... }) >>> segments = Segments(df) >>> segments.list [('chr1', 200, 250)]
- get_list()[source]#
Computes segments (middle fragments) from the DataFrame.
Notes
Segments are defined as genomic fragments that are neither the first nor the last fragment in a group (grouped by ‘qname’).
If a group contains fewer than three fragments, no segments are added.
- Modifies:
list (list[tuple]): Appends computed segments to the list attribute.
Example
>>> df = pd.DataFrame({ ... 'qname': ['read1', 'read1', 'read1', 'read2'], ... 'chrom': ['chr1', 'chr1', 'chr1', 'chr2'], ... 'start': [100, 200, 300, 400], ... 'end': [150, 250, 350, 450], ... }) >>> segments = Segments(df) >>> segments.list [('chr1', 200, 250)]
- class hairloom.datatypes.SplitAlignment(cigarstring, read_name, refname, read_pos, strand)[source]#
Bases:
objectRepresents a split alignment from a sequencing read.
Parses the CIGAR string and extracts alignment information such as clip lengths, matched bases, and strand-corrected values.
- read_name#
The name of the sequencing read.
- Type:
str
- refname#
The reference sequence name (e.g., chromosome or contig).
- Type:
str
- cigarstring#
The CIGAR string of the alignment.
- Type:
str
- start#
The start position of the alignment on the reference.
- Type:
int
- strand#
The strand information (‘+’ or ‘-‘).
- Type:
str
- cigar_tuples#
Parsed CIGAR string as a list of (operation, length) tuples.
- Type:
list[tuple]
- primary#
Placeholder for primary alignment information, initialized to None.
- Type:
NoneType
- match#
Total number of matched bases in the alignment.
- Type:
int
- aln_cols#
Column headers for alignment fields.
- Type:
list[str]
- clip1#
Length of the first clip (soft/hard) before the matched region.
- Type:
int
- clip2#
Length of the second clip (soft/hard) after the matched region.
- Type:
int
- end#
The end position of the alignment on the reference.
- Type:
int
- pclip1#
Strand-corrected length of the first clip.
- Type:
int
- extract_cigar_field()[source]#
Parses the CIGAR string to calculate clip lengths, matched bases, and alignment end position.
- static get_cigar_tuples(cigarstring)[source]#
Parses a CIGAR string and converts it into a list of operation-length tuples.
- Parameters:
cigarstring (str) – The CIGAR string to parse, following the standard format used in sequence alignments (e.g., “10M5I20M”).
- Returns:
Parsed CIGAR operations and lengths.
- Return type:
list[tuple[int, int]]
- Raises:
ValueError – If the CIGAR string contains invalid operations.
Example
>>> cigarstring = "10M5I20M" >>> SplitAlignment.get_cigar_tuples(cigarstring) [(0, 10), (1, 5), (0, 20)]
Notes
- Supported CIGAR operations:
‘M’ (0): Alignment match (can be a sequence match or mismatch).
‘I’ (1): Insertion to the reference.
‘D’ (2): Deletion from the reference.
‘N’ (3): Skipped region from the reference.
‘S’ (4): Soft clipping (clipped sequences present in the read).
‘H’ (5): Hard clipping (clipped sequences not present in the read).
‘P’ (6): Padding (silent deletion from padded reference).
‘=’ (7): Sequence match.
‘X’ (8): Sequence mismatch.
- class hairloom.datatypes.Transitions(df)[source]#
Bases:
objectCalculates transitions (tra) between genomic fragments.
- list#
A list of transitions, where each transition is a tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).
- Type:
list[tuple]
- Parameters:
df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, ‘end’, and ‘strand’.
Example
>>> df = pd.DataFrame({ ... 'qname': ['read1', 'read1'], ... 'chrom': ['chr1', 'chr2'], ... 'start': [100, 200], ... 'end': [150, 250], ... 'strand': ['+', '-'] ... }) >>> transitions = Transitions(df) >>> transitions.list [(('chr1', 150, '+'), ('chr2', 200, '-'))]
- get_list()[source]#
Computes transitions (tra) from the DataFrame.
Notes
Transitions are calculated between consecutive fragments in the DataFrame grouped by ‘qname’.
For each fragment pair, the orientation and positions are determined based on the strand, creating a transition tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).
- Modifies:
list (list[tuple]): Appends computed transitions to the list attribute.
Example
>>> df = pd.DataFrame({ ... 'qname': ['read1', 'read1', 'read2'], ... 'chrom': ['chr1', 'chr2', 'chr3'], ... 'start': [100, 200, 300], ... 'end': [150, 250, 350], ... 'strand': ['+', '-', '+'] ... }) >>> transitions = Transitions(df) >>> transitions.list # Note: 'read2' won't be included [(('chr1', 150, '+'), ('chr2', 250, '-'))]
- hairloom.datatypes.get_breakpoint_seqs(chrom, pos, margin, genome)[source]#
Extracts upstream and downstream sequences around a breakpoint.
- Parameters:
chrom (str) – Chromosome name.
pos (int) – 1-based position of the breakpoint.
margin (int) – Number of bases upstream and downstream to extract.
genome (dict) – Dictionary mapping chromosome names to sequences.
- Returns:
- A tuple containing:
upstream: Sequence upstream of the breakpoint.
downstream: Sequence downstream of the breakpoint.
- Return type:
tuple[str, str]
Example
>>> genome = {'chr1': "A" * 1000, 'chr2': "T" * 1000} >>> get_breakpoint_seqs('chr1', 5, 3, genome) ('AA', 'AAAA')
hairloom.utils module#
- hairloom.utils.enumerate_breakpoints(df)[source]#
Enumerates breakpoints from a DataFrame of genomic fragments.
This function generates a
BreakpointChainobject containingBreakpointobjects based on the start and end positions of fragments in the input DataFrame. Breakpoints at the beginning and end of reads are treated differently to omit read boundaries.- Parameters:
df (pandas.DataFrame) –
A DataFrame containing fragment information, with the following columns:
chrom: The chromosome name (str).
start: The start position of the fragment (int).
end: The end position of the fragment (int).
strand: The strand information (‘+’ or ‘-‘).
- Returns:
A chain of Breakpoint objects enumerated from the DataFrame.
- Return type:
Notes
For the first fragment in the DataFrame, only the end position is included as a breakpoint.
For the last fragment in the DataFrame, only the start position is included as a breakpoint.
For middle fragments, both start and end positions are included, with orientations determined by the strand.
Example
>>> import pandas as pd >>> from your_module import Breakpoint, BreakpointChain, enumerate_breakpoints >>> df = pd.DataFrame({ ... 'chrom': ['chr1', 'chr1', 'chr1'], ... 'start': [100, 200, 300], ... 'end': [150, 250, 350], ... 'strand': ['+', '+', '-'] ... }) >>> brks = enumerate_breakpoints(df) >>> print(brks) [chr1:150:+, chr1:200:-, chr1:250:+, chr1:300:-]
- hairloom.utils.get_secondaries(read)[source]#
Extracts secondary alignments from a sequencing read’s ‘SA’ tag.
This function retrieves secondary alignment information stored in the ‘SA’ tag of a sequencing read. It parses the tag and returns a list of secondary alignments.
- Parameters:
read (pysam.AlignedSegment) – A sequencing read object, typically from a BAM or SAM file, that includes tags.
- Returns:
A list of secondary alignment strings parsed from the ‘SA’ tag. If the ‘SA’ tag is not present, an empty list is returned.
- Return type:
list of str
Example
>>> from pysam import AlignedSegment >>> read = AlignedSegment() >>> read.tags = [('SA', 'chr1,100,+,60M,60;chr2,200,-,60M,60;')] >>> secondaries = get_secondaries(read) >>> print(secondaries) ['chr1,100,+,60M,60', 'chr2,200,-,60M,60']
Notes
The ‘SA’ tag stores secondary alignments in a semicolon-separated format.
- Each secondary alignment string typically contains the following fields:
[chromosome, position, strand, CIGAR, mapping quality].
- hairloom.utils.make_brk_table(bundle, chroms=None)[source]#
Generates a table of breakpoints with support information.
This function processes a bundle of
BreakpointChainobjects, extracts breakpoint coordinates, and organizes them into a structured pandas DataFrame. Optionally, the resulting table can be filtered by specific chromosomes.- Parameters:
bundle (list[BreakpointChain]) – A list of
BreakpointChainobjects containing breakpoint information.chroms (list[str], optional) – A list of chromosome names to include. If None, all chromosomes are considered. Defaults to None.
- Returns:
- A DataFrame containing the breakpoint table with the following
columns:
chrom (str): Chromosome name.
pos (int): Position of the breakpoint.
ori (str): Orientation of the breakpoint (‘+’ or ‘-‘).
support (int): Support count for the breakpoint.
- Return type:
pandas.DataFrame
Notes
Breakpoints are filtered based on the chroms parameter if provided.
Infinite values in the resulting DataFrame are replaced with NaN to maintain consistency.
Example
>>> from hairloom.datatypes import Breakpoint, BreakpointChain >>> from hairloom.utils import make_brk_table >>> bundle = [ ... BreakpointChain([ ... Breakpoint("chr1", 100, "+"), ... Breakpoint("chr1", 200, "+"), ... Breakpoint("chr2", 300, "+"), ... ]), ... ] >>> chroms = ["chr1", "chr2"] >>> brk_df = make_brk_table(bundle, chroms) >>> print(brk_df) chrom pos ori support 0 chr1 100 + 1 1 chr1 200 + 1 2 chr2 300 + 1
- hairloom.utils.make_seg_table(bundle, chroms=None)[source]#
Creates a table of genomic segments with support information.
This function processes a bundle of
BreakpointChainobjects, extracts genomic segment coordinates, and combines them with support data into a structured pandas DataFrame.- Parameters:
bundle (list[BreakpointChain]) – A list of
BreakpointChainobjects containing segment information.chroms (list[str], optional) – A list of chromosomes to include. If None, all chromosomes are considered. Defaults to None.
- Returns:
- A DataFrame containing the breakpoint table with the following
columns:
chrom (str): Chromosome name.
pos (int): Position of the breakpoint.
ori (str): Orientation of the breakpoint (‘+’ or ‘-‘).
support (int): Support count for the breakpoint.
- Return type:
pandas.DataFrame
Notes
Segments are filtered based on the chroms parameter if provided.
Infinite values in the resulting DataFrame are replaced with NaN for consistency.
Example
>>> from your_module import BreakpointChain, make_seg_table >>> bundle = [ ... Breakpoint('chr1', 100, '+'), ... Breakpoint('chr2', 200, '+'), ... Breakpoint('chr2', 300, '+'), ... Breakpoint('chr1', 400, '+'), ... ] # List of BreakpointChain objects >>> seg_df = make_seg_table(bundle, chroms=["chr1", "chr2"]) >>> print(seg_df.head()) chrom pos1 pos2 support 0 chr2 200 300 1
- hairloom.utils.make_split_read_table(alignments)[source]#
Creates a table summarizing split-read alignments.
This function processes a list of alignment objects, extracting key attributes and organizing them into a structured pandas DataFrame. The resulting table is deduplicated and sorted for consistent organization.
- Parameters:
alignments (list) – A list of alignment objects. Each alignment object is expected to have the following attributes:
read_name (-) – Query name of the read.
refname (-) – Reference sequence name (chromosome or contig).
start (-) – Start position of the alignment on the reference.
end (-) – End position of the alignment on the reference.
strand (-) – Strand information (‘+’ or ‘-‘).
clip1 (-) – Length of the first soft/hard clip before the matched region.
clip2 (-) – Length of the second soft/hard clip after the matched region.
match (-) – Total number of matched bases in the alignment.
pclip1 (-) – Strand-corrected length of the first clip.
- Returns:
A DataFrame containing the following columns:
qname (str): Query name of the read.
chrom (str): Reference sequence name (chromosome or contig).
start (int): Start position of the alignment on the reference.
end (int): End position of the alignment on the reference.
strand (str): Strand information (‘+’ or ‘-‘).
clip1 (int): Length of the first soft/hard clip before the matched region.
match (int): Total number of matched bases in the alignment.
clip2 (int): Length of the second soft/hard clip after the matched region.
pclip1 (int): Strand-corrected length of the first clip.
- Return type:
pandas.DataFrame
Notes
Duplicate rows are removed based on the full set of columns.
The DataFrame is sorted by qname and pclip1 for consistent organization.
Example
>>> from hairloom.utils import make_split_read_table >>> class Alignment: ... def __init__(self, read_name, refname, start, end, strand, clip1, match, clip2, pclip1): ... self.read_name = read_name ... self.refname = refname ... self.start = start ... self.end = end ... self.strand = strand ... self.clip1 = clip1 ... self.match = match ... self.clip2 = clip2 ... self.pclip1 = pclip1 >>> alignments = [ ... Alignment(read_name="read1", refname="chr1", start=100, end=150, ... strand='+', clip1=10, match=40, clip2=5, pclip1=10), ... Alignment(read_name="read2", refname="chr2", start=200, end=250, ... strand='-', clip1=15, match=35, clip2=10, pclip1=15) ... ] >>> df = make_split_read_table(alignments) >>> print(df) qname chrom start end strand clip1 match clip2 pclip1 0 read1 chr1 100 150 + 10 40 5 10 1 read2 chr2 200 250 - 15 35 10 15
- hairloom.utils.make_tra_table(bundle)[source]#
Creates a table of translocations with support information.
This function processes a bundle of
BreakpointChainobjects, extracts translocation coordinates, and combines them with support data into a structured pandas DataFrame.- Parameters:
bundle (list[BreakpointChain]) – A list of
BreakpointChainobjects containing translocation information.- Returns:
A DataFrame containing the following columns:
chrom1 (str): Chromosome name of the first breakpoint.
pos1 (int): Position of the first breakpoint.
ori1 (str): Orientation of the first breakpoint (‘+’ or ‘-‘).
chrom2 (str): Chromosome name of the second breakpoint.
pos2 (int): Position of the second breakpoint.
ori2 (str): Orientation of the second breakpoint (‘+’ or ‘-‘).
support (int): Support count for the translocation.
- Return type:
pandas.DataFrame
Notes
Translocations are identified by pairs of breakpoints (BreakpointPair objects) in the tras attribute of each BreakpointChain.
Duplicate translocations are removed by maintaining a set of seen coordinate pairs.
Infinite values in the resulting DataFrame are replaced with NaN for consistency.
Example
>>> from hairloom.datatypes import Breakpoint, BreakpointChain >>> from hairloom.utils import make_tra_table >>> chain1 = BreakpointChain([ ... Breakpoint('chr1', 100, '+'), ... Breakpoint('chr2', 200, '+') ... ]) >>> chain2 = BreakpointChain([ ... Breakpoint('chr2', 300, '+'), ... Breakpoint('chr1', 400, '+') ... ]) >>> bundle = [chain1, chain2] >>> tra_df = make_tra_table(bundle) >>> print(tra_df) chrom1 pos1 ori1 chrom2 pos2 ori2 support 0 chr1 100 + chr2 200 + 1 1 chr2 300 + chr1 400 + 1