linkapy.parsing

Classes

Linkapy_Parser

Linkapy_Parser mainly functions to create matrices (arrow format for RNA, mtx format for accessibility / methylation)

Functions

parse_rna(→ None)

Read one or more featureCounts files, combine them, and write the result to a counts arrow file and a metadata arrow file.

read_rna_to_anndata(→ anndata.AnnData)

From a prefix, read the count matrix and the metadata, and combine them into an AnnData object.

read_meth_to_anndata(→ anndata.AnnData)

From a prefix, read the fraction matrices and their metadata, and combine them into an AnnData object.

match_cells(→ tuple[List[List[str]], ...)

Take a list of lists containing putative cell names. Per list, we need a 'best match'.

get_common_cellname(→ str | float)

Module Contents

class linkapy.parsing.Linkapy_Parser(methylation_path=None, transcriptome_path=None, output='linkapy_output', methylation_pattern=('*GC*tsv.gz',), methylation_pattern_names=(), transcriptome_pattern=('*tsv',), transcriptome_pattern_names=(), NOMe=False, threads=1, chromsizes=None, regions=None, blacklist=None, binsize=10000, project='linkapy', verbose=False)

Linkapy_Parser mainly functions to create matrices (arrow format for RNA, mtx format for accessibility / methylation) from directories containing processed multi-modal single-cell data.

At least one option from each of the following items must be provided:
  • methylation_path and/or transcriptome_path

  • regions or chromsizes file (if methylation_path is provided).
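The input requirement above can be expressed as a small check. This is a minimal sketch of the documented rule only; `check_inputs` is a hypothetical helper, not part of linkapy, and the actual `_validate()` method may test more than this.

```python
from typing import Optional, Sequence


def check_inputs(
    methylation_path: Optional[str] = None,
    transcriptome_path: Optional[str] = None,
    regions: Optional[Sequence[str]] = None,
    chromsizes: Optional[str] = None,
) -> None:
    """Raise ValueError if the documented minimum inputs are missing.

    Hypothetical helper that mirrors the rule in the text above.
    """
    # At least one data directory is needed.
    if methylation_path is None and transcriptome_path is None:
        raise ValueError("Provide methylation_path and/or transcriptome_path.")
    # Methylation signal needs regions or genome bins to aggregate over.
    if methylation_path is not None and not regions and chromsizes is None:
        raise ValueError("methylation_path requires regions or chromsizes.")


# RNA only: no regions or chromsizes needed (paths are placeholders).
check_inputs(transcriptome_path="rna_out")
# Methylation with genome-wide bins.
check_inputs(methylation_path="meth_out", chromsizes="genome.chrom.sizes")
```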

Parameters:
  • methylation_path (str) – The path to the methylation directory (will be searched recursively!).

  • transcriptome_path (str) – The path to the RNA output directory (will be searched recursively!).

  • output (str) – The output directory to which matrices will be written. Defaults to a folder named ‘linkapy_output’ in the current working directory.

  • methylation_pattern (tuple) – The glob pattern(s) used to search the methylation path recursively. Defaults to (‘*GC*tsv.gz’,). Note that this is a tuple.

  • transcriptome_pattern (tuple) – The glob pattern(s) used to search the transcriptome path recursively. Defaults to (‘*tsv’,). Note that this is a tuple.

  • NOMe (bool) – If set, methylation_path will be searched for NOMe-seq data using the patterns (‘GCHN’, ‘WCGN’).

  • threads (int) – Number of threads to use for parsing. Defaults to 1.

  • chromsizes (str) – Path to the chromsizes file for the genome. If set, methylation signal will be aggregated over bins (see binsize).

  • regions (tuple) – Path or paths to bed files containing regions to aggregate methylation signal over. Can be gzipped. Note that this is a tuple.

  • blacklist (tuple) – Path or paths to bed files containing regions to exclude from the aggregation. Can be gzipped. Note that this is a tuple.

  • binsize (int) – Size of the bins to aggregate over. Only relevant if no regions are provided. Defaults to 10000.

  • project (str) – Name of the project. Will be treated as a prefix for the output files. Defaults to ‘linkapy’.
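To illustrate how chromsizes and binsize interact, the sketch below tiles each chromosome into fixed-size bins. It is an illustration under stated assumptions, not linkapy's implementation: `make_bins` is a hypothetical name, the coordinates are assumed to be 0-based and half-open (BED-like), and truncating the last bin at the chromosome end is an assumption.

```python
from typing import Dict, Iterator, Tuple


def make_bins(
    chromsizes: Dict[str, int], binsize: int = 10000
) -> Iterator[Tuple[str, int, int]]:
    """Yield (chrom, start, end) bins covering every chromosome.

    Hypothetical helper; coordinates are 0-based, half-open (assumption).
    """
    for chrom, size in chromsizes.items():
        for start in range(0, size, binsize):
            # The final bin is cut off at the chromosome end (assumption).
            yield chrom, start, min(start + binsize, size)


# Toy chromosome sizes, as they would be read from a two-column chromsizes file.
sizes = {"chr1": 25000, "chrM": 9000}
bins = list(make_bins(sizes, binsize=10000))
for chrom, start, end in bins:
    print(chrom, start, end)
```

With these toy sizes, chr1 is split into three bins (the last one 5000 bp long) and chrM into a single 9000 bp bin.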

output
project = 'linkapy'
logfile
logger
methylation_path
transcriptome_path
chromsizes
regions
blacklist
threads = 1
methylation_pattern = ('*GC*tsv.gz',)
methylation_pattern_names = ()
transcriptome_pattern = ('*tsv',)
transcriptome_pattern_names = ()
binsize = 10000
_validate()

Validate the provided paths and parameters.

_glob()

Discover files to aggregate over based on the paths and patterns provided.
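The recursive pattern search can be pictured with the standard library. The sketch below applies the default patterns from the class signature to a temporary directory using `pathlib.Path.rglob`; the `discover` helper and the file names are hypothetical, and `_glob()` itself may work differently.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Sequence


def discover(root: Path, patterns: Sequence[str]) -> Dict[str, List[str]]:
    """Map each glob pattern to the sorted relative paths it matches under root."""
    return {
        pattern: sorted(p.relative_to(root).as_posix() for p in root.rglob(pattern))
        for pattern in patterns
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "plate1").mkdir()
    # Hypothetical file names for one cell, plus an unrelated file.
    for name in (
        "plate1/cellA.GCHN.tsv.gz",
        "plate1/cellA.WCGN.tsv.gz",
        "plate1/cellA.counts.tsv",
        "notes.txt",
    ):
        (root / name).touch()
    methylation_files = discover(root, ("*GC*tsv.gz",))
    transcriptome_files = discover(root, ("*tsv",))

print(methylation_files)
print(transcriptome_files)
```

Note that the default methylation pattern does not pick up the WCGN file in this toy layout, because that file name contains no "GC" substring; a separate pattern would be needed for it.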

parse()

Parse the globbed files and create the different matrices and their corresponding metadata.

dump_mudata()

linkapy.parsing.parse_rna(files, prefix) → None

Read one or more featureCounts files, combine them, and write the result to a counts arrow file and a metadata arrow file.
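As an illustration of the combine step only, the sketch below merges featureCounts tables with the standard library. It relies on the usual featureCounts layout (a leading `#` comment line, six annotation columns, then one count column per sample). `combine_featurecounts` is a hypothetical helper, filling genes absent from a file with 0 is an assumption, and writing the arrow files is omitted.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO, Tuple

ANNOTATION_COLUMNS = ("Geneid", "Chr", "Start", "End", "Strand", "Length")


def combine_featurecounts(
    handles: Iterable[TextIO],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Merge featureCounts tables into (sample names, gene -> counts per sample)."""
    samples: List[str] = []
    counts: Dict[str, List[int]] = {}
    for handle in handles:
        # Skip the '# Program:featureCounts ...' comment line.
        rows = (line for line in handle if not line.startswith("#"))
        reader = csv.DictReader(rows, delimiter="\t")
        new_samples = [c for c in reader.fieldnames if c not in ANNOTATION_COLUMNS]
        for row in reader:
            # Genes first seen in this file get zeros for earlier samples (assumption).
            gene_counts = counts.setdefault(row["Geneid"], [0] * len(samples))
            gene_counts.extend(int(row[s]) for s in new_samples)
        samples.extend(new_samples)
        # Genes missing from this file get zeros for its samples (assumption).
        for gene_counts in counts.values():
            gene_counts.extend([0] * (len(samples) - len(gene_counts)))
    return samples, counts


header = "Geneid\tChr\tStart\tEnd\tStrand\tLength\t"
file_a = io.StringIO(
    "# Program:featureCounts\n" + header + "cellA.bam\n"
    "Gene1\tchr1\t1\t100\t+\t100\t5\n"
    "Gene2\tchr1\t200\t300\t-\t101\t0\n"
)
file_b = io.StringIO(
    "# Program:featureCounts\n" + header + "cellB.bam\n"
    "Gene1\tchr1\t1\t100\t+\t100\t3\n"
    "Gene3\tchr2\t1\t50\t+\t50\t7\n"
)
samples, counts = combine_featurecounts([file_a, file_b])
print(samples)
print(counts)
```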

linkapy.parsing.read_rna_to_anndata(prefix) → anndata.AnnData

From a prefix, read the count matrix and the metadata, and combine them into an AnnData object.

linkapy.parsing.read_meth_to_anndata(prefix) → anndata.AnnData

From a prefix, read the fraction matrices and their metadata, and combine them into an AnnData object.

linkapy.parsing.match_cells(_l: List[List[str]], patterns: List[str], logger) → tuple[List[List[str]], pandas.DataFrame] | tuple[None, None]

Take a list of lists containing putative cell names. Per list, we need a ‘best match’. This is needed because an assay- or context-specific prefix or suffix is often used, and the names must be matched for the MuData object.
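One way to picture the ‘best match’ idea is fuzzy string matching between the name lists of two modalities. The sketch below uses `difflib.get_close_matches` from the standard library; it is a swapped-in illustrative technique, not the algorithm used by match_cells, and `best_match`, the cutoff value, and the cell names are hypothetical.

```python
from difflib import get_close_matches
from typing import Dict, List, Optional


def best_match(name: str, candidates: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the candidate most similar to name, or None if none is close enough."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical names: same cells, different assay-specific suffixes.
rna_names = ["plate1_A1_rna", "plate1_A2_rna"]
meth_names = ["plate1_A2.GCHN", "plate1_A1.GCHN"]

pairs: Dict[str, Optional[str]] = {
    name: best_match(name, meth_names) for name in rna_names
}
for rna_name, meth_name in pairs.items():
    print(rna_name, "<->", meth_name)
```

A similarity cutoff keeps unrelated names from being paired; names without a sufficiently close counterpart come back as `None`.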

linkapy.parsing.get_common_cellname(cellnames: List[str]) → str | float