linkapy.parsing

Classes

Linkapy_Parser

Linkapy_Parser mainly functions to create matrices (arrow format for RNA, mtx format for accessibility / methylation)

Functions

`parse_rna`(→ None)	Read one or more featureCounts files, combine them, and write them to a counts and a metadata arrow file.
`read_rna_to_anndata`(→ anndata.AnnData)	From a prefix, read the count matrix and the metadata, and combine them into an AnnData object.
`read_meth_to_anndata`(→ anndata.AnnData)	From a prefix, read the fraction matrices and their metadata, and combine them into an AnnData object.
`match_cells`(→ tuple[List[List[str]], ...)	Take a list of lists containing putative cell names. Per list, we need a 'best match'.
`get_common_cellname`(→ str | float)

Module Contents

class linkapy.parsing.Linkapy_Parser(methylation_path=None, transcriptome_path=None, output='linkapy_output', methylation_pattern=('*GC*tsv.gz',), methylation_pattern_names=(), transcriptome_pattern=('*tsv',), transcriptome_pattern_names=(), NOMe=False, threads=1, chromsizes=None, regions=None, blacklist=None, binsize=10000, project='linkapy', verbose=False)

Linkapy_Parser creates matrices (arrow format for RNA, mtx format for accessibility / methylation) from directories containing processed multi-modal single-cell data.

For each of the following items, at least one option must be provided:

methylation_path and/or transcriptome_path
regions or chromsizes file (if methylation_path is provided).
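
The two requirements above can be read as a small pre-flight check. The following is a minimal sketch of that rule only; it is not linkapy's own `_validate()`, and the helper name `check_inputs` is hypothetical.

```python
def check_inputs(methylation_path=None, transcriptome_path=None,
                 regions=None, chromsizes=None):
    """Hypothetical helper mirroring the documented input requirements."""
    # Requirement 1: at least one data directory must be given.
    if methylation_path is None and transcriptome_path is None:
        raise ValueError("Provide methylation_path and/or transcriptome_path.")
    # Requirement 2: methylation data needs regions or bins to aggregate over.
    if methylation_path is not None and not regions and chromsizes is None:
        raise ValueError("methylation_path requires regions or chromsizes.")
    return True
```

An RNA-only call passes with `transcriptome_path` alone, while a call with only `methylation_path` raises `ValueError` under this sketch.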

Parameters:

methylation_path (str) – The path to the methylation directory (will be searched recursively!).
transcriptome_path (str) – The path to the RNA output directory (will be searched recursively!).
output (str) – The output directory where matrices will be written to. Defaults to a folder named ‘linkapy_output’ in the current working directory.
methylation_pattern (tuple) – The glob pattern(s) used to search the methylation path recursively. Defaults to (‘*GC*tsv.gz’,). Note that this is a tuple.
transcriptome_pattern (tuple) – The glob pattern(s) used to search the transcriptome path recursively. Defaults to (‘*tsv’,). Note that this is a tuple.
NOMe (bool) – If set, methylation_path will be searched for NOMe-seq data, using the patterns (‘GCHN’, ‘WCGN’).
threads (int) – Number of threads to use for parsing. Defaults to 1.
chromsizes (str) – Path to the chromsizes file for the genome. If set, methylation signal will be aggregated over bins (see binsize).
regions (tuple) – Path or paths to bed files containing regions to aggregate methylation signal over. Can be gzipped. Note that this is a tuple.
blacklist (tuple) – Path or paths to bed files containing regions to exclude from the aggregation. Can be gzipped. Note that this is a tuple.
binsize (int) – Size of the bins to aggregate over. Only relevant if no regions are provided. Defaults to 10000.
project (str) – Name of the project. Will be treated as a prefix for the output files. Defaults to ‘linkapy’.
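
To make the interaction of `chromsizes` and `binsize` concrete, here is a sketch of fixed-width binning. It assumes a two-column, tab-separated chromsizes file (chromosome name, length) and half-open, 0-based intervals. The function `make_bins` is hypothetical, and the source does not state how linkapy treats the final partial bin of a chromosome, so the truncation shown here is an assumption.

```python
def make_bins(chromsizes_lines, binsize=10000):
    """Yield (chrom, start, end) bins; the last bin per chromosome may be shorter."""
    for line in chromsizes_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chrom, length = line.split("\t")[:2]
        length = int(length)
        for start in range(0, length, binsize):
            # Clip the final bin at the chromosome end (assumed behaviour).
            yield chrom, start, min(start + binsize, length)


# Two toy chromosomes instead of a real chromsizes file.
bins = list(make_bins(["chr1\t25000", "chrM\t16569"], binsize=10000))
```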

output

project = 'linkapy'

logfile

logger

methylation_path

transcriptome_path

chromsizes

regions

blacklist

threads = 1

methylation_pattern = ('*GC*tsv.gz',)

methylation_pattern_names = ()

transcriptome_pattern = ('*tsv',)

transcriptome_pattern_names = ()

binsize = 10000

_validate(): Validate the provided paths and parameters.

_glob(): Discover files to aggregate over based on the paths and patterns provided.

parse(): Parse the globbed files and create the different matrices and their corresponding metadata.

dump_mudata()

linkapy.parsing.parse_rna(files, prefix) → None: Read one or more featureCounts files, combine them, and write them to a counts and a metadata arrow file.
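
The combining step can be pictured with the standard library alone. This sketch assumes the usual featureCounts layout (a leading `#` comment line, then the columns Geneid, Chr, Start, End, Strand, Length followed by one count column per sample). The function `combine_featurecounts` is hypothetical, works on in-memory text, and leaves out the Arrow output that `parse_rna` writes.

```python
import csv
import io

META_COLS = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]


def combine_featurecounts(handles):
    """Merge per-sample count columns from several featureCounts tables."""
    counts = {}  # gene -> {sample: count}
    meta = {}    # gene -> annotation columns
    for handle in handles:
        # Drop the '#' comment line that featureCounts writes first.
        rows = (line for line in handle if not line.startswith("#"))
        reader = csv.DictReader(rows, delimiter="\t")
        samples = [c for c in reader.fieldnames if c not in META_COLS]
        for row in reader:
            gene = row["Geneid"]
            meta.setdefault(gene, {k: row[k] for k in META_COLS[1:]})
            for sample in samples:
                counts.setdefault(gene, {})[sample] = int(row[sample])
    return counts, meta


header = "# Program:featureCounts\n" + "\t".join(META_COLS)
file_a = io.StringIO(header + "\tcellA.bam\nGene1\tchr1\t1\t100\t+\t100\t5\n")
file_b = io.StringIO(header + "\tcellB.bam\nGene1\tchr1\t1\t100\t+\t100\t7\n")
counts, meta = combine_featurecounts([file_a, file_b])
```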

linkapy.parsing.read_rna_to_anndata(prefix) → anndata.AnnData: From a prefix, read the count matrix and the metadata, and combine them into an AnnData object.

linkapy.parsing.read_meth_to_anndata(prefix) → anndata.AnnData: From a prefix, read the fraction matrices and their metadata, and combine them into an AnnData object.

linkapy.parsing.match_cells(_l: List[List[str]], patterns: List[str], logger) → tuple[List[List[str]], pandas.DataFrame] | tuple[None, None]: Take a list of lists containing putative cell names and find a ‘best match’ for each list. This is needed because an assay- or context-specific prefix or suffix is often added to cell names, and the names must be matched across lists for the MuData object.
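
The matching problem can be illustrated with a simplified sketch: strip known assay-specific affixes, then keep the cells present in every list. This is not the algorithm `match_cells` uses and it does not reproduce its signature or return type; the helper `match_by_core_name`, the affixes, and the cell names are all hypothetical.

```python
def match_by_core_name(name_lists, affixes):
    """Hypothetical helper: align cell names across assays after stripping affixes."""
    def core(name):
        for affix in affixes:
            name = name.replace(affix, "")
        return name

    # Map core name -> original name, once per assay.
    lookups = [{core(n): n for n in names} for names in name_lists]
    # Keep only cells seen in every assay, in a stable (sorted) order.
    shared = sorted(set.intersection(*(set(lk) for lk in lookups)))
    matched = [[lk[c] for c in shared] for lk in lookups]
    return matched, shared


rna_names = ["cellA_RNA", "cellB_RNA", "cellC_RNA"]
meth_names = ["cellB_GCHN", "cellA_GCHN"]
matched, shared = match_by_core_name([rna_names, meth_names], ("_RNA", "_GCHN"))
```

In this toy input, `cellC` has no methylation counterpart, so it is dropped from both matched lists.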

linkapy.parsing.get_common_cellname(cellnames: List[str]) → str | float