hairloom.datatypes

Functions

get_breakpoint_seqs(chrom, pos, margin, genome)

Extracts upstream and downstream sequences around a breakpoint.

Classes

`Breakpoint`(chrom, pos, orientation)	Represents a genomic breakpoint with associated properties and methods.
`BreakpointChain`(brks_iterable)	Represents a chain of genomic breakpoints.
`BreakpointPair`(brk1, brk2)	Represents a pair of genomic breakpoints.
`Segments`(df)	Calculates and stores genomic segments (middle fragments).
`SplitAlignment`(cigarstring, read_name, ...)	Represents a split alignment from a sequencing read.
`Transitions`(df)	Calculates transitions (tra) between genomic fragments.

class hairloom.datatypes.Breakpoint(chrom, pos, orientation)

Bases: object

Represents a genomic breakpoint with associated properties and methods.

chrom

The chromosome name where the breakpoint is located.

Type: str

pos

The 1-based position of the breakpoint on the chromosome.

Type: int

ori

The orientation of the breakpoint (‘+’ or ‘-’).

Type: str

upstream

Sequence upstream of the breakpoint, initialized to None.

Type: str or None

downstream

Sequence downstream of the breakpoint, initialized to None.

Type: str or None

seq_rearranged

Rearranged sequence at the breakpoint, initialized to None.

Type: str or None

seq_removed

Removed sequence at the breakpoint, initialized to None.

Type: str or None

chroms

List of valid chromosome names, including both unprefixed (‘1’, ‘2’, …, ‘X’, ‘Y’, ‘M’) and prefixed (‘chr1’, ‘chr2’, …, ‘chrX’, ‘chrY’, ‘chrM’) forms.

Type: list[str]

chroms = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'M', 'chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']
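
The list follows a regular pattern, so it can be rebuilt programmatically. The following is a minimal sketch rather than the library's own code; `build_chrom_names` is a hypothetical helper name introduced only for illustration.

```python
def build_chrom_names():
    """Return the 25 unprefixed chromosome names followed by their 'chr'-prefixed forms."""
    # Hypothetical helper, not part of hairloom.
    bare = [str(n) for n in range(1, 23)] + ["X", "Y", "M"]
    prefixed = ["chr" + name for name in bare]
    return bare + prefixed


chroms = build_chrom_names()
print(len(chroms))                         # 50
print(chroms[24], chroms[25], chroms[-1])  # M chr1 chrM
```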

get_breakpoint_seqs(margin, genome)

Retrieves upstream and downstream sequences around the breakpoint.

Computes rearranged and removed sequences based on the given margin and genome dictionary.

Parameters:

margin (int) – Number of bases to include upstream and downstream of the breakpoint.
genome (dict) – A dictionary mapping chromosome names to their respective sequences.

Raises:

ValueError – If the chromosome is not found in the genome.
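
The documented error condition can be illustrated with a small stand-alone sketch. `lookup_chrom_sequence` is a hypothetical helper, not a hairloom function; it only shows the kind of check that leads to a ValueError when the genome dictionary has no entry for the requested chromosome.

```python
def lookup_chrom_sequence(genome, chrom):
    """Return the sequence for `chrom`, or raise ValueError if it is absent."""
    # Hypothetical helper illustrating the documented error condition.
    if chrom not in genome:
        raise ValueError(f"chromosome {chrom!r} not found in genome")
    return genome[chrom]


genome = {"chr1": "ACGT" * 5, "chr2": "TTTT"}
print(lookup_chrom_sequence(genome, "chr2"))  # TTTT

try:
    lookup_chrom_sequence(genome, "chr3")
except ValueError as err:
    print(err)  # chromosome 'chr3' not found in genome
```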

class hairloom.datatypes.BreakpointChain(brks_iterable)

Bases: list

Represents a chain of genomic breakpoints.

A chain typically holds the breakpoints derived from a single read. The class extends the Python list to store breakpoints and provides methods to enumerate transitions and segments.

tras

List of transitions (pairs of breakpoints).

Type: list[BreakpointPair]

segs

List of segments (pairs of breakpoints).

Type: list[BreakpointPair]

Parameters: brks_iterable (iterable) – An iterable containing breakpoint objects.

Example

>>> from hairloom.datatypes import Breakpoint, BreakpointChain
>>> brk1 = Breakpoint("chr1", 100, "+")
>>> brk2 = Breakpoint("chr1", 200, "-")
>>> chain = BreakpointChain([brk1, brk2])
>>> chain.tras
[BreakpointPair(brk1, brk2)]
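
To make the idea of a list subclass that exposes transitions and segments concrete, here is a minimal sketch under a stated assumption: breakpoints at positions (0, 1), (2, 3), ... are paired as transitions, and the interior pairs (1, 2), (3, 4), ... are paired as segments. That pairing rule is an assumption for illustration, `ChainSketch` is a hypothetical class, and breakpoints are plain tuples here rather than Breakpoint objects.

```python
class ChainSketch(list):
    """List subclass that splits its items into transition and segment pairs."""

    def __init__(self, brks_iterable):
        super().__init__(brks_iterable)
        # Assumed rule: items (0,1), (2,3), ... are transitions;
        # items (1,2), (3,4), ... are the segments between them.
        self.tras = [(self[i], self[i + 1]) for i in range(0, len(self) - 1, 2)]
        self.segs = [(self[i], self[i + 1]) for i in range(1, len(self) - 1, 2)]


# Breakpoints are plain (chrom, pos, orientation) tuples in this sketch.
chain = ChainSketch([
    ("chr1", 150, "+"), ("chr2", 200, "-"),
    ("chr2", 250, "+"), ("chr3", 300, "-"),
])
print(chain.tras)
print(chain.segs)
```

With four breakpoints, this sketch yields two transitions and one segment; with two breakpoints it yields a single transition and no segments.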

class hairloom.datatypes.BreakpointPair(brk1, brk2)

Bases: object

Represents a pair of genomic breakpoints.

brk1

The first breakpoint in the pair.

Type: Breakpoint

brk2

The second breakpoint in the pair.

Type: Breakpoint

aln_segment

Indicates whether this pair is part of an alignment segment. Defaults to False.

Type: bool

__repr__(): Returns a string representation of the breakpoint pair.

class hairloom.datatypes.Segments(df)

Bases: object

Calculates and stores genomic segments (middle fragments).

list

A list of segments, each represented as a tuple (chrom, start, end).

Type: list[tuple]

Parameters: df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, and ‘end’.

Example

>>> import pandas as pd
>>> from hairloom.datatypes import Segments
>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1', 'read1'],
...     'chrom': ['chr1', 'chr1', 'chr1'],
...     'start': [100, 200, 300],
...     'end': [150, 250, 350],
... })
>>> segments = Segments(df)
>>> segments.list
[('chr1', 200, 250)]

get_list()

Computes segments (middle fragments) from the DataFrame.

Notes

Segments are defined as genomic fragments that are neither the first nor the last fragment in a group (grouped by ‘qname’).
If a group contains fewer than three fragments, no segments are added.

Modifies: list (list[tuple]) – Appends computed segments to the list attribute.

Example

>>> import pandas as pd
>>> from hairloom.datatypes import Segments
>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1', 'read1', 'read2'],
...     'chrom': ['chr1', 'chr1', 'chr1', 'chr2'],
...     'start': [100, 200, 300, 400],
...     'end': [150, 250, 350, 450],
... })
>>> segments = Segments(df)
>>> segments.list
[('chr1', 200, 250)]
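
The middle-fragment rule described in the notes can also be sketched without pandas. This is a pure-Python illustration of the rule, not the DataFrame-based implementation; `middle_fragments` is a hypothetical helper, and rows are assumed to arrive in read order.

```python
from collections import defaultdict


def middle_fragments(rows):
    """Return (chrom, start, end) for every non-terminal fragment of each read."""
    by_read = defaultdict(list)
    for row in rows:  # rows are assumed to be in read order
        by_read[row["qname"]].append(row)

    segments = []
    for fragments in by_read.values():
        # Fewer than three fragments leaves nothing between first and last.
        for frag in fragments[1:-1]:
            segments.append((frag["chrom"], frag["start"], frag["end"]))
    return segments


rows = [
    {"qname": "read1", "chrom": "chr1", "start": 100, "end": 150},
    {"qname": "read1", "chrom": "chr1", "start": 200, "end": 250},
    {"qname": "read1", "chrom": "chr1", "start": 300, "end": 350},
    {"qname": "read2", "chrom": "chr2", "start": 400, "end": 450},
]
print(middle_fragments(rows))  # [('chr1', 200, 250)]
```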

class hairloom.datatypes.SplitAlignment(cigarstring, read_name, refname, read_pos, strand)

Bases: object

Represents a split alignment from a sequencing read.

Parses the CIGAR string and extracts alignment information such as clip lengths, matched bases, and strand-corrected values.

read_name

The name of the sequencing read.

Type: str

refname

The reference sequence name (e.g., chromosome or contig).

Type: str

cigarstring

The CIGAR string of the alignment.

Type: str

start

The start position of the alignment on the reference.

Type: int

strand

The strand of the alignment (‘+’ or ‘-’).

Type: str

cigar_tuples

Parsed CIGAR string as a list of (operation, length) tuples.

Type: list[tuple]

primary

Placeholder for primary alignment information, initialized to None.

Type: NoneType

match

Total number of matched bases in the alignment.

Type: int

aln_cols

Column headers for alignment fields.

Type: list[str]

clip1

Length of the first clip (soft/hard) before the matched region.

Type: int

clip2

Length of the second clip (soft/hard) after the matched region.

Type: int

end

The end position of the alignment on the reference.

Type: int

pclip1

Strand-corrected length of the first clip.

Type: int

extract_cigar_field(): Parses the CIGAR string to calculate clip lengths, matched bases, and alignment end position.
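
As a rough illustration of how such fields can be derived from parsed CIGAR tuples, here is a stand-alone sketch. It rests on several assumptions that are not confirmed by this page: ‘M’, ‘=’ and ‘X’ all count as matched bases, the end position is the start plus the reference-consuming lengths, and the strand-corrected first clip is `clip2` on the ‘-’ strand. `summarize_cigar` is a hypothetical function, not the library method.

```python
def summarize_cigar(cigar_tuples, start, strand):
    """Derive clip lengths, matched bases and reference end from CIGAR tuples."""
    clips = {4, 5}                    # S, H
    matches = {0, 7, 8}               # M, =, X (assumed to count as "matched")
    ref_consuming = {0, 2, 3, 7, 8}   # M, D, N, =, X

    clip1 = clip2 = match = 0
    end = start
    seen_aligned = False
    for op, length in cigar_tuples:
        if op in clips:
            # Clips before any aligned operation go to clip1, later ones to clip2.
            if seen_aligned:
                clip2 += length
            else:
                clip1 += length
            continue
        seen_aligned = True
        if op in matches:
            match += length
        if op in ref_consuming:
            end += length

    # Assumed strand correction: on '-' the read is reversed, so the clips swap.
    pclip1 = clip1 if strand == "+" else clip2
    return {"clip1": clip1, "match": match, "clip2": clip2, "end": end, "pclip1": pclip1}


# 5S10M2D20M8H starting at reference position 100 on the reverse strand
fields = summarize_cigar([(4, 5), (0, 10), (2, 2), (0, 20), (5, 8)], 100, "-")
print(fields)
```

Whether the library treats `end` as inclusive or exclusive is not stated here, so the sketch simply adds the reference-consuming lengths to the start position.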

static get_cigar_tuples(cigarstring)

Parses a CIGAR string and converts it into a list of operation-length tuples.

Parameters: cigarstring (str) – The CIGAR string to parse, in the standard format used in sequence alignments (e.g., “10M5I20M”).
Returns: Parsed CIGAR operations and lengths, as (operation, length) tuples.
Return type: list[tuple[int, int]]
Raises: ValueError – If the CIGAR string contains invalid operations.

Example

>>> from hairloom.datatypes import SplitAlignment
>>> cigarstring = "10M5I20M"
>>> SplitAlignment.get_cigar_tuples(cigarstring)
[(0, 10), (1, 5), (0, 20)]

Notes

Supported CIGAR operations:

‘M’ (0): Alignment match (can be a sequence match or mismatch).
‘I’ (1): Insertion to the reference.
‘D’ (2): Deletion from the reference.
‘N’ (3): Skipped region from the reference.
‘S’ (4): Soft clipping (clipped sequences present in the read).
‘H’ (5): Hard clipping (clipped sequences not present in the read).
‘P’ (6): Padding (silent deletion from padded reference).
‘=’ (7): Sequence match.
‘X’ (8): Sequence mismatch.
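
The mapping above can be turned into a small parser. The following is a minimal regex-based sketch that assumes the numeric codes listed in the notes; `parse_cigar` is a hypothetical stand-alone function and may differ from how the library implements `get_cigar_tuples`.

```python
import re

CIGAR_OPS = "MIDNSHP=X"  # index in this string is the numeric operation code
_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigarstring):
    """Convert a CIGAR string into a list of (operation_code, length) tuples."""
    tuples = []
    consumed = 0
    for m in _TOKEN.finditer(cigarstring):
        if m.start() != consumed:
            break  # a gap between tokens means an unrecognised character
        tuples.append((CIGAR_OPS.index(m.group(2)), int(m.group(1))))
        consumed = m.end()
    if consumed != len(cigarstring):
        raise ValueError(f"invalid CIGAR string: {cigarstring!r}")
    return tuples


print(parse_cigar("10M5I20M"))    # [(0, 10), (1, 5), (0, 20)]
print(parse_cigar("5S30M2D10="))  # [(4, 5), (0, 30), (2, 2), (7, 10)]
```

Tracking how much of the string has been consumed lets the sketch reject input such as "10M5Z20M" instead of silently skipping the unknown operation.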

class hairloom.datatypes.Transitions(df)

Bases: object

Calculates transitions (tra) between genomic fragments.

list

A list of transitions, where each transition is a tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).

Type: list[tuple]

Parameters: df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, ‘end’, and ‘strand’.

Example

>>> import pandas as pd
>>> from hairloom.datatypes import Transitions
>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1'],
...     'chrom': ['chr1', 'chr2'],
...     'start': [100, 200],
...     'end': [150, 250],
...     'strand': ['+', '-']
... })
>>> transitions = Transitions(df)
>>> transitions.list
[(('chr1', 150, '+'), ('chr2', 200, '-'))]

get_list()

Computes transitions (tra) from the DataFrame.

Notes

Transitions are calculated between consecutive fragments in the DataFrame grouped by ‘qname’.
For each fragment pair, the orientation and positions are determined based on the strand, creating a transition tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).

Modifies: list (list[tuple]) – Appends computed transitions to the list attribute.

Example

>>> import pandas as pd
>>> from hairloom.datatypes import Transitions
>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1', 'read2'],
...     'chrom': ['chr1', 'chr2', 'chr3'],
...     'start': [100, 200, 300],
...     'end': [150, 250, 350],
...     'strand': ['+', '-', '+']
... })
>>> transitions = Transitions(df)
>>> transitions.list # Note: 'read2' won't be included
[(('chr1', 150, '+'), ('chr2', 250, '-'))]
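
The grouping and consecutive-pairing step described in the notes can be sketched in pure Python. This illustration covers only how fragments of the same read are paired; how positions and orientations are then derived from the strand is deliberately left out. `consecutive_fragment_pairs` is a hypothetical helper, and rows are assumed to be in read order.

```python
from collections import defaultdict


def consecutive_fragment_pairs(rows):
    """Pair each fragment with the next fragment of the same read."""
    by_read = defaultdict(list)
    for row in rows:  # rows are assumed to be in read order
        by_read[row["qname"]].append(row)

    pairs = []
    for fragments in by_read.values():
        # zip stops one short, so a read with a single fragment yields nothing.
        pairs.extend(zip(fragments, fragments[1:]))
    return pairs


rows = [
    {"qname": "read1", "chrom": "chr1", "start": 100, "end": 150, "strand": "+"},
    {"qname": "read1", "chrom": "chr2", "start": 200, "end": 250, "strand": "-"},
    {"qname": "read2", "chrom": "chr3", "start": 300, "end": 350, "strand": "+"},
]
for left, right in consecutive_fragment_pairs(rows):
    print(left["chrom"], "->", right["chrom"])  # chr1 -> chr2
```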

hairloom.datatypes.get_breakpoint_seqs(chrom, pos, margin, genome)

Extracts upstream and downstream sequences around a breakpoint.

Parameters:

chrom (str) – Chromosome name.
pos (int) – 1-based position of the breakpoint.
margin (int) – Number of bases upstream and downstream to extract.
genome (dict) – Dictionary mapping chromosome names to sequences.

Returns:

A tuple containing:

upstream: Sequence upstream of the breakpoint.
downstream: Sequence downstream of the breakpoint.

Return type:

tuple[str, str]

Example

>>> from hairloom.datatypes import get_breakpoint_seqs
>>> genome = {'chr1': "A" * 1000, 'chr2': "T" * 1000}
>>> get_breakpoint_seqs('chr1', 5, 3, genome)
('AA', 'AAAA')
