hairloom.datatypes#

Functions

get_breakpoint_seqs(chrom, pos, margin, genome)

Extracts upstream and downstream sequences around a breakpoint.

Classes

Breakpoint(chrom, pos, orientation)

Represents a genomic breakpoint with associated properties and methods.

BreakpointChain(brks_iterable)

Represents a chain of genomic breakpoints.

BreakpointPair(brk1, brk2)

Represents a pair of genomic breakpoints.

Segments(df)

Calculates and stores genomic segments (middle fragments).

SplitAlignment(cigarstring, read_name, ...)

Represents a split alignment from a sequencing read.

Transitions(df)

Calculates transitions (tra) between genomic fragments.

class hairloom.datatypes.Breakpoint(chrom, pos, orientation)[source]#

Bases: object

Represents a genomic breakpoint with associated properties and methods.

chrom#

The chromosome name where the breakpoint is located.

Type:

str

pos#

The 1-based position of the breakpoint on the chromosome.

Type:

int

ori#

The orientation of the breakpoint (‘+’ or ‘-‘).

Type:

str

upstream#

Sequence upstream of the breakpoint, initialized to None.

Type:

str or None

downstream#

Sequence downstream of the breakpoint, initialized to None.

Type:

str or None

seq_rearranged#

Rearranged sequence at the breakpoint, initialized to None.

Type:

str or None

seq_removed#

Removed sequence at the breakpoint, initialized to None.

Type:

str or None

chroms#

List of valid chromosome names, including both standard (‘1’, ‘2’, …, ‘X’, ‘Y’, ‘M’) and prefixed (‘chr1’, ‘chr2’, …, ‘chrX’, ‘chrY’).

Type:

list[str]

chroms = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'M', 'chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']#
get_breakpoint_seqs(margin, genome)[source]#

Retrieves upstream and downstream sequences around the breakpoint.

Computes rearranged and removed sequences based on the given margin and genome dictionary.

Parameters:
  • margin (int) – Number of bases to include upstream and downstream of the breakpoint.

  • genome (dict) – A dictionary mapping chromosome names to their respective sequences.

Raises:

ValueError – If the chromosome is not found in the genome.

class hairloom.datatypes.BreakpointChain(brks_iterable)[source]#

Bases: list

Represents a chain of genomic breakpoints.

This will often represent the chain of breakpoints coming from a single read. This class extends the Python list to store breakpoints and provides methods to enumerate transitions and segments.

tras#

List of transitions (pairs of breakpoints).

Type:

list[BreakpointPair]

segs#

List of segments (pairs of breakpoints).

Type:

list[BreakpointPair]

Parameters:

brks_iterable (iterable) – An iterable containing breakpoint objects.

Example

>>> brk1 = Breakpoint("chr1", 100, "+")
>>> brk2 = Breakpoint("chr1", 200, "-")
>>> chain = BreakpointChain([brk1, brk2])
>>> chain.tras
[BreakpointPair(brk1, brk2)]
class hairloom.datatypes.BreakpointPair(brk1, brk2)[source]#

Bases: object

Represents a pair of genomic breakpoints.

brk1#

The first breakpoint in the pair.

Type:

Breakpoint

brk2#

The second breakpoint in the pair.

Type:

Breakpoint

aln_segment#

Indicates whether this pair is part of an alignment segment. Defaults to False.

Type:

bool

__repr__()[source]#

Returns a string representation of the breakpoint pair.

class hairloom.datatypes.Segments(df)[source]#

Bases: object

Calculates and stores genomic segments (middle fragments).

list#

A list of segments, each represented as a tuple (chrom, start, end).

Type:

list[tuple]

Parameters:

df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, and ‘end’.

Example

>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1', 'read1'],
...     'chrom': ['chr1', 'chr1', 'chr1'],
...     'start': [100, 200, 300],
...     'end': [150, 250, 350],
... })
>>> segments = Segments(df)
>>> segments.list
[('chr1', 200, 250)]
get_list()[source]#

Computes segments (middle fragments) from the DataFrame.

Notes

  • Segments are defined as genomic fragments that are neither the first nor the last fragment in a group (grouped by ‘qname’).

  • If a group contains fewer than three fragments, no segments are added.

Modifies:

list (list[tuple]): Appends computed segments to the list attribute.

Example

>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1', 'read1', 'read2'],
...     'chrom': ['chr1', 'chr1', 'chr1', 'chr2'],
...     'start': [100, 200, 300, 400],
...     'end': [150, 250, 350, 450],
... })
>>> segments = Segments(df)
>>> segments.list
[('chr1', 200, 250)]
class hairloom.datatypes.SplitAlignment(cigarstring, read_name, refname, read_pos, strand)[source]#

Bases: object

Represents a split alignment from a sequencing read.

Parses the CIGAR string and extracts alignment information such as clip lengths, matched bases, and strand-corrected values.

read_name#

The name of the sequencing read.

Type:

str

refname#

The reference sequence name (e.g., chromosome or contig).

Type:

str

cigarstring#

The CIGAR string of the alignment.

Type:

str

start#

The start position of the alignment on the reference.

Type:

int

strand#

The strand information (‘+’ or ‘-‘).

Type:

str

cigar_tuples#

Parsed CIGAR string as a list of (operation, length) tuples.

Type:

list[tuple]

primary#

Placeholder for primary alignment information, initialized to None.

Type:

NoneType

match#

Total number of matched bases in the alignment.

Type:

int

aln_cols#

Column headers for alignment fields.

Type:

list[str]

clip1#

Length of the first clip (soft/hard) before the matched region.

Type:

int

clip2#

Length of the second clip (soft/hard) after the matched region.

Type:

int

end#

The end position of the alignment on the reference.

Type:

int

pclip1#

Strand-corrected length of the first clip.

Type:

int

extract_cigar_field()[source]#

Parses the CIGAR string to calculate clip lengths, matched bases, and alignment end position.

static get_cigar_tuples(cigarstring)[source]#

Parses a CIGAR string and converts it into a list of operation-length tuples.

Parameters:

cigarstring (str) – The CIGAR string to parse, following the standard format used in sequence alignments (e.g., “10M5I20M”).

Returns:

Parsed CIGAR operations and lengths.

Return type:

list[tuple[int, int]]

Raises:

ValueError – If the CIGAR string contains invalid operations.

Example

>>> cigarstring = "10M5I20M"
>>> SplitAlignment.get_cigar_tuples(cigarstring)
[(0, 10), (1, 5), (0, 20)]

Notes

Supported CIGAR operations:
  • ‘M’ (0): Alignment match (can be a sequence match or mismatch).

  • ‘I’ (1): Insertion to the reference.

  • ‘D’ (2): Deletion from the reference.

  • ‘N’ (3): Skipped region from the reference.

  • ‘S’ (4): Soft clipping (clipped sequences present in the read).

  • ‘H’ (5): Hard clipping (clipped sequences not present in the read).

  • ‘P’ (6): Padding (silent deletion from padded reference).

  • ‘=’ (7): Sequence match.

  • ‘X’ (8): Sequence mismatch.

class hairloom.datatypes.Transitions(df)[source]#

Bases: object

Calculates transitions (tra) between genomic fragments.

list#

A list of transitions, where each transition is a tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).

Type:

list[tuple]

Parameters:

df (pandas.DataFrame) – A DataFrame with fragment information, including ‘qname’, ‘chrom’, ‘start’, ‘end’, and ‘strand’.

Example

>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1'],
...     'chrom': ['chr1', 'chr2'],
...     'start': [100, 200],
...     'end': [150, 250],
...     'strand': ['+', '-']
... })
>>> transitions = Transitions(df)
>>> transitions.list
[(('chr1', 150, '+'), ('chr2', 200, '-'))]
get_list()[source]#

Computes transitions (tra) from the DataFrame.

Notes

  • Transitions are calculated between consecutive fragments in the DataFrame grouped by ‘qname’.

  • For each fragment pair, the orientation and positions are determined based on the strand, creating a transition tuple of the form ((chrom1, pos1, ori1), (chrom2, pos2, ori2)).

Modifies:

list (list[tuple]): Appends computed transitions to the list attribute.

Example

>>> df = pd.DataFrame({
...     'qname': ['read1', 'read1', 'read2'],
...     'chrom': ['chr1', 'chr2', 'chr3'],
...     'start': [100, 200, 300],
...     'end': [150, 250, 350],
...     'strand': ['+', '-', '+']
... })
>>> transitions = Transitions(df)
>>> transitions.list # Note: 'read2' won't be included
[(('chr1', 150, '+'), ('chr2', 250, '-'))]
hairloom.datatypes.get_breakpoint_seqs(chrom, pos, margin, genome)[source]#

Extracts upstream and downstream sequences around a breakpoint.

Parameters:
  • chrom (str) – Chromosome name.

  • pos (int) – 1-based position of the breakpoint.

  • margin (int) – Number of bases upstream and downstream to extract.

  • genome (dict) – Dictionary mapping chromosome names to sequences.

Returns:

A tuple containing:
  • upstream: Sequence upstream of the breakpoint.

  • downstream: Sequence downstream of the breakpoint.

Return type:

tuple[str, str]

Example

>>> genome = {'chr1': "A" * 1000, 'chr2': "T" * 1000}
>>> get_breakpoint_seqs('chr1', 5, 3, genome)
('AA', 'AAAA')