hairloom.datatypes#
Functions
|
Extracts upstream and downstream sequences around a breakpoint. |
Classes
|
Represents a genomic breakpoint with associated properties and methods. |
|
Represents a chain of genomic breakpoints. |
|
Represents a pair of genomic breakpoints. |
|
Calculates and stores genomic segments (middle fragments). |
|
Represents a split alignment from a sequencing read. |
|
Calculates transitions (tra) between genomic fragments. |
- class hairloom.datatypes.Breakpoint(chrom, pos, orientation)[source]#
Bases:
objectRepresents a genomic breakpoint with associated properties and methods.
- chrom#
The chromosome name where the breakpoint is located.
- Type:
str
- pos#
The 1-based position of the breakpoint on the chromosome.
- Type:
int
- ori#
The orientation of the breakpoint (‘+’ or ‘-‘).
- Type:
str
- upstream#
Sequence upstream of the breakpoint, initialized to None.
- Type:
str or None
- downstream#
Sequence downstream of the breakpoint, initialized to None.
- Type:
str or None
- seq_rearranged#
Rearranged sequence at the breakpoint, initialized to None.
- Type:
str or None
- seq_removed#
Removed sequence at the breakpoint, initialized to None.
- Type:
str or None
- chroms#
List of valid chromosome names, including both standard (‘1’, ‘2’, …, ‘X’, ‘Y’, ‘M’) and prefixed (‘chr1’, ‘chr2’, …, ‘chrX’, ‘chrY’).
- Type:
list[str]
- chroms = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'M', 'chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']#
- get_breakpoint_seqs(margin, genome)[source]#
Retrieves upstream and downstream sequences around the breakpoint.
Computes rearranged and removed sequences based on the given margin and genome dictionary.
- Parameters:
margin (int) – Number of bases to include upstream and downstream of the breakpoint.
genome (dict) – A dictionary mapping chromosome names to their respective sequences.
- Raises:
ValueError – If the chromosome is not found in the genome.
- class hairloom.datatypes.BreakpointChain(brks_iterable)[source]#
Bases:
listRepresents a chain of genomic breakpoints.
This will often represent the chain of breakpoints coming from a single read. This class extends the Python list to store breakpoints and provides methods to enumerate transitions and segments.
- tras#
List of transitions (pairs of breakpoints).
- Type:
list[BreakpointPair]
- segs#
List of segments (pairs of breakpoints).
- Type:
list[BreakpointPair]
- Parameters:
brks_iterable (iterable) – An iterable containing breakpoint objects.
Example
>>> brk1 = Breakpoint("chr1", 100, "+") >>> brk2 = Breakpoint("chr1", 200, "-") >>> chain = BreakpointChain([brk1, brk2]) >>> chain.tras [BreakpointPair(brk1, brk2)]
- class hairloom.datatypes.BreakpointPair(brk1, brk2)[source]#
Bases:
objectRepresents a pair of genomic breakpoints.
- brk1#
The first breakpoint in the pair.
- Type:
- brk2#
The second breakpoint in the pair.
- Type:
- aln_segment#
Indicates whether this pair is part of an alignment segment. Defaults to False.
- Type:
bool
- class hairloom.datatypes.Segments(df)[source]#
Bases:
objectCalculates and stores genomic segments (middle fragments).
- list#
A list of segments, each represented as a tuple (chrom, start, end).
- Type:
list[tuple]
- Parameters:
df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, and ‘end’.
Example
>>> df = pd.DataFrame({ ... 'qname': ['read1', 'read1', 'read1'], ... 'chrom': ['chr1', 'chr1', 'chr1'], ... 'start': [100, 200, 300], ... 'end': [150, 250, 350], ... }) >>> segments = Segments(df) >>> segments.list [('chr1', 200, 250)]
- get_list()[source]#
Computes segments (middle fragments) from the DataFrame.
Notes
Segments are defined as genomic fragments that are neither the first nor the last fragment in a group (grouped by ‘qname’).
If a group contains fewer than three fragments, no segments are added.
- Modifies:
list (list[tuple]): Appends computed segments to the list attribute.
Example
>>> df = pd.DataFrame({ ... 'qname': ['read1', 'read1', 'read1', 'read2'], ... 'chrom': ['chr1', 'chr1', 'chr1', 'chr2'], ... 'start': [100, 200, 300, 400], ... 'end': [150, 250, 350, 450], ... }) >>> segments = Segments(df) >>> segments.list [('chr1', 200, 250)]
- class hairloom.datatypes.SplitAlignment(cigarstring, read_name, refname, read_pos, strand)[source]#
Bases:
objectRepresents a split alignment from a sequencing read.
Parses the CIGAR string and extracts alignment information such as clip lengths, matched bases, and strand-corrected values.
- read_name#
The name of the sequencing read.
- Type:
str
- refname#
The reference sequence name (e.g., chromosome or contig).
- Type:
str
- cigarstring#
The CIGAR string of the alignment.
- Type:
str
- start#
The start position of the alignment on the reference.
- Type:
int
- strand#
The strand information (‘+’ or ‘-‘).
- Type:
str
- cigar_tuples#
Parsed CIGAR string as a list of (operation, length) tuples.
- Type:
list[tuple]
- primary#
Placeholder for primary alignment information, initialized to None.
- Type:
NoneType
- match#
Total number of matched bases in the alignment.
- Type:
int
- aln_cols#
Column headers for alignment fields.
- Type:
list[str]
- clip1#
Length of the first clip (soft/hard) before the matched region.
- Type:
int
- clip2#
Length of the second clip (soft/hard) after the matched region.
- Type:
int
- end#
The end position of the alignment on the reference.
- Type:
int
- pclip1#
Strand-corrected length of the first clip.
- Type:
int
- extract_cigar_field()[source]#
Parses the CIGAR string to calculate clip lengths, matched bases, and alignment end position.
- static get_cigar_tuples(cigarstring)[source]#
Parses a CIGAR string and converts it into a list of operation-length tuples.
- Parameters:
cigarstring (str) – The CIGAR string to parse, following the standard format used in sequence alignments (e.g., “10M5I20M”).
- Returns:
Parsed CIGAR operations and lengths.
- Return type:
list[tuple[int, int]]
- Raises:
ValueError – If the CIGAR string contains invalid operations.
Example
>>> cigarstring = "10M5I20M" >>> SplitAlignment.get_cigar_tuples(cigarstring) [(0, 10), (1, 5), (0, 20)]
Notes
- Supported CIGAR operations:
‘M’ (0): Alignment match (can be a sequence match or mismatch).
‘I’ (1): Insertion to the reference.
‘D’ (2): Deletion from the reference.
‘N’ (3): Skipped region from the reference.
‘S’ (4): Soft clipping (clipped sequences present in the read).
‘H’ (5): Hard clipping (clipped sequences not present in the read).
‘P’ (6): Padding (silent deletion from padded reference).
‘=’ (7): Sequence match.
‘X’ (8): Sequence mismatch.
- class hairloom.datatypes.Transitions(df)[source]#
Bases:
objectCalculates transitions (tra) between genomic fragments.
- list#
A list of transitions, where each transition is a tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).
- Type:
list[tuple]
- Parameters:
df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, ‘end’, and ‘strand’.
Example
>>> df = pd.DataFrame({ ... 'qname': ['read1', 'read1'], ... 'chrom': ['chr1', 'chr2'], ... 'start': [100, 200], ... 'end': [150, 250], ... 'strand': ['+', '-'] ... }) >>> transitions = Transitions(df) >>> transitions.list [(('chr1', 150, '+'), ('chr2', 200, '-'))]
- get_list()[source]#
Computes transitions (tra) from the DataFrame.
Notes
Transitions are calculated between consecutive fragments in the DataFrame grouped by ‘qname’.
For each fragment pair, the orientation and positions are determined based on the strand, creating a transition tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).
- Modifies:
list (list[tuple]): Appends computed transitions to the list attribute.
Example
>>> df = pd.DataFrame({ ... 'qname': ['read1', 'read1', 'read2'], ... 'chrom': ['chr1', 'chr2', 'chr3'], ... 'start': [100, 200, 300], ... 'end': [150, 250, 350], ... 'strand': ['+', '-', '+'] ... }) >>> transitions = Transitions(df) >>> transitions.list # Note: 'read2' won't be included [(('chr1', 150, '+'), ('chr2', 250, '-'))]
- hairloom.datatypes.get_breakpoint_seqs(chrom, pos, margin, genome)[source]#
Extracts upstream and downstream sequences around a breakpoint.
- Parameters:
chrom (str) – Chromosome name.
pos (int) – 1-based position of the breakpoint.
margin (int) – Number of bases upstream and downstream to extract.
genome (dict) – Dictionary mapping chromosome names to sequences.
- Returns:
- A tuple containing:
upstream: Sequence upstream of the breakpoint.
downstream: Sequence downstream of the breakpoint.
- Return type:
tuple[str, str]
Example
>>> genome = {'chr1': "A" * 1000, 'chr2': "T" * 1000} >>> get_breakpoint_seqs('chr1', 5, 3, genome) ('AA', 'AAAA')