Skip to content

Commit

Permalink
Merge pull request #87 from Mayrlab/dev
Browse files Browse the repository at this point in the history
Milestone v0.5.0
  • Loading branch information
mfansler committed May 28, 2024
2 parents 876264d + a29003a commit 4073878
Show file tree
Hide file tree
Showing 11 changed files with 375 additions and 66 deletions.
37 changes: 26 additions & 11 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,20 +44,31 @@ determined by simulations.
Please see [our accompanying manuscript][ref:scUTRquant] for more details.

## Outputs
### SingleCellExperiment object (*default*)
The primary output of the pipeline is a Bioconductor `SingleCellExperiment` object.
The `counts` in the object is a sparse `Matrix` of 3' UTR isoform counts; the `rowRanges`
is a `GenomicRanges` of the 3' UTR isoforms; the `rowData` is a `DataFrame` with additional
information about 3' UTR isoforms; and the `colData` is a `DataFrame` populated with sample
metadata and optional user-provide cell annotations.

### H5AD AnnData object (*experimental*)
scUTRquant v0.5.0 adds support for [AnnData objects](https://anndata.readthedocs.io/en/stable/index.html),
making it easier for users who prefer Python and working with [the `scverse` ecosystem](https://scverse.org).
Similar annotations will be attached in the `obs` (cell) and `var` (3'UTR isoform) tables.
However, no analogous `rowRanges` is attached.

The output format can be controlled with the `output_format` configuration option,
with `"h5ad"` and "`sce`" being currently supported options.

### Reporting
To assist users in quality control, the pipeline additionally generates HTML reports
for each sample.

### Additional Files
The pipeline is configured to retain intermediate files, such as BUS and MTX files.
Advanced users can readily customize the pipeline to only generate the files they
require. For example, users who prefer to work with alternative scRNA-seq data structures,
such as those used in Scanpy or Seurat, may wish to terminate the pipeline at MTX
generation.
require. For example, users who prefer to work with alternative scRNA-seq data structures
may wish to terminate the pipeline at MTX generation.

# Setup

Expand All @@ -73,22 +84,23 @@ repository directly tests running examples on the GitHub-hosted runners.
### Conda/Mamba Mode (MacOS or Linux)
Snakemake can use Conda to install the needed software. This configuration requires:

- [Snakemake][ref:snakemake] >= 6.0, < 8.0<sup>ª</sup>
- [Snakemake][ref:snakemake] >= 6.0<sup>ª</sup>
- [Conda](https://docs.conda.io/projects/conda/en/latest/)

If Conda is not already installed, we strongly recommend installing
[Mambaforge](https://github.com/conda-forge/miniforge#mambaforge). If Conda was previously
installed, strongly recommend installing Mamba:
[Miniforge](https://github.com/conda-forge/miniforge#miniforge). If Conda was previously
installed, strongly recommend that Conda be upgraded to at least v23.11 which uses the
faster `libmamba` solver:

```bash
conda install -n base -c conda-forge mamba
conda install -n base 'conda>=23.11'
```

### Singularity Mode (Linux)
Snakemake can use the pre-built scUTRsquant Docker image to provide all additional software.
This configuration requires installing:
Snakemake can use [the pre-built scUTRquant Docker image](https://hub.docker.com/repository/docker/mfansler/scutr-quant)
to provide all additional software. This configuration requires installing:

- [Snakemake][ref:snakemake] >= 6.0, < 8.0<sup>ª</sup>
- [Snakemake][ref:snakemake] >= 6.0<sup>ª</sup>
- [Singularity](https://singularity.lbl.gov/index.html)


Expand Down Expand Up @@ -206,7 +218,7 @@ On GitHub runners with 2-3 cores, these examples have typical execution times of
On HPC systems with multiple nodes with multiple cores, a large job (e.g., 1-2TB raw data)
can process in under an hour when properly configured.

## Full Examples
## Full-Scale Examples
The inputs used to process data in [the manuscript][ref:scUTRquant] are also available in the
[scUTRquant-inputs repository](https://github.com/Mayrlab/scUTRquant-inputs).
These also include individual Snakemake pipelines to download atlas-scale datasets.
Expand Down Expand Up @@ -236,6 +248,7 @@ pipeline. The following keys are expected:
- `cell_annots`: (optional) CSV file with a key column that matches the `<sample_id>_<cell_bx>` format
- `cell_annots_key`: specifies the name of the key column in the `cell_annots` file; default is `cell_id`
- `exclude_unannotated_cells`: boolean indicating whether unannotated cells should be excluded from the final output; default is `False`
- `output_format`: a list of formats, `"sce"` (*default*) and/or `"h5ad"` (*experimental*); `null` will end processing at MTX files.

### Default Values

Expand Down Expand Up @@ -270,7 +283,9 @@ The `extdata/targets.yaml` defines the targets available to pseudoalign to. The
- `kdx`: Kallisto index for UTRome
- `merge`: TSV for merging features (isoforms)
- `tx_annots`: (optional) RDS file containing Bioconductor DataFrame object with annotations for transcripts
- `tx_annots_csv`: (optional) CSV file with annotations for transcripts (used for AnnData)
- `gene_annots`: (optional) RDS file containing Bioconductor DataFrame object with annotations for genes
- `gene_annots_csv`: (optional) CSV file with annotations for genes (used for AnnData)
# Customization
Expand Down
105 changes: 86 additions & 19 deletions Snakefile
Original file line number Diff line number Diff line change
@@ -1,19 +1,29 @@
container: "docker://mfansler/scutr-quant:0.4.0"
container: "docker://mfansler/scutr-quant:0.5.0"
configfile: "config.yaml"
SQ_VERSION="0.5.0"

import os
import pandas as pd
from snakemake.io import load_configfile
from snakemake.utils import min_version
from sys import stderr

# robust loading of `load_configfile`
try:
## version >=8.0.0
from snakemake.common.configfile import load_configfile
except:
## version <8.0.0
from snakemake.io import load_configfile

# set minimum Snakemake version
min_version("6.0")

# print to stderr
def message(*args, **kwargs):
print(*args, file=stderr, **kwargs)

message(f"[INFO] scUTRquant v{SQ_VERSION}")

# make sure the tmp directory exists
os.makedirs(config['tmp_dir'], exist_ok=True)

Expand Down Expand Up @@ -54,19 +64,25 @@ def get_outputs():
outputs += expand("qc/umi_count/{target}/{sample_id}.umi_count.html",
target=target_list,
sample_id=samples.index.values)
if config['use_hdf5']:
outputs += expand("data/sce/{target}/{dataset}.{output_type}.{file_type}",
target=target_list,
dataset=config['dataset_name'],
output_type=config['output_type'],
file_type=['se.rds', 'assays.h5'])
else:
outputs += expand("data/sce/{target}/{dataset}.{output_type}.Rds",
target=target_list,
dataset=config['dataset_name'],
output_type=config['output_type'])
if (not config['output_format']) or ("sce" in config['output_format']):
if config['use_hdf5']:
outputs += expand("data/sce/{target}/{dataset}.{output_type}.{file_type}",
target=target_list,
dataset=config['dataset_name'],
output_type=config['output_type'],
file_type=['se.rds', 'assays.h5'])
else:
outputs += expand("data/sce/{target}/{dataset}.{output_type}.Rds",
target=target_list,
dataset=config['dataset_name'],
output_type=config['output_type'])
if "h5ad" in config['output_format']:
outputs += expand("data/h5ad/{target}/{dataset}.{output_type}.h5ad",
target=target_list,
dataset=config['dataset_name'],
output_type=config['output_type'])
return outputs

rule all:
input: get_outputs()

Expand All @@ -82,9 +98,11 @@ for target_id, target in targets.items():

target_path = target['path']
download_script = target_path + target['download_script']

FILE_KEYS = ['gtf', 'kdx', 'merge_tsv', 'tx_annots', 'gene_annots']
target_files = [target_path + target[k] for k in FILE_KEYS]

FILE_KEYS = ['gtf', 'kdx', 'merge_tsv',
'tx_annots', 'gene_annots',
'tx_annots_csv', 'gene_annots_csv']
target_files = [target_path + target[k] for k in FILE_KEYS if target[k] is not None]

rule:
name: f"download_{target_id}"
Expand Down Expand Up @@ -379,6 +397,55 @@ rule mtxs_to_sce_h5_genes:
script:
"scripts/mtxs_to_sce_genes.R"

rule mtxs_to_h5ad_txs:
input:
bxs=expand("data/kallisto/{target}/{sample_id}/txs.barcodes.txt",
sample_id=samples.index.values, allow_missing=True),
txs=expand("data/kallisto/{target}/{sample_id}/txs.genes.txt",
sample_id=samples.index.values, allow_missing=True),
mtxs=expand("data/kallisto/{target}/{sample_id}/txs.mtx",
sample_id=samples.index.values, allow_missing=True),
gtf=get_target_file('gtf'),
tx_annots=get_target_file('tx_annots_csv'),
cell_annots=config['cell_annots']
output:
h5ad="data/h5ad/{target}/%s.txs.h5ad" % config['dataset_name']
params:
genome=lambda wcs: targets[wcs.target]['genome'],
sample_ids=samples.index.values,
min_umis=config['min_umis'],
cell_annots_key=config['cell_annots_key'],
exclude_unannotated_cells=config['exclude_unannotated_cells']
resources:
mem_mb=16000
conda: "envs/anndata.yaml"
script:
"scripts/mtxs_to_h5ad_txs.py"

rule mtxs_to_h5ad_genes:
input:
bxs=expand("data/kallisto/{target}/{sample_id}/genes.barcodes.txt",
sample_id=samples.index.values, allow_missing=True),
genes=expand("data/kallisto/{target}/{sample_id}/genes.genes.txt",
sample_id=samples.index.values, allow_missing=True),
mtxs=expand("data/kallisto/{target}/{sample_id}/genes.mtx",
sample_id=samples.index.values, allow_missing=True),
gtf=get_target_file('gtf'),
gene_annots=get_target_file('gene_annots_csv'),
cell_annots=config['cell_annots']
output:
h5ad="data/h5ad/{target}/%s.genes.h5ad" % config['dataset_name']
params:
genome=lambda wcs: targets[wcs.target]['genome'],
sample_ids=samples.index.values,
min_umis=config['min_umis'],
cell_annots_key=config['cell_annots_key'],
exclude_unannotated_cells=config['exclude_unannotated_cells']
resources:
mem_mb=16000
conda: "envs/anndata.yaml"
script:
"scripts/mtxs_to_h5ad_genes.py"

################################################################################
## Reports
Expand All @@ -390,10 +457,10 @@ rule report_umis_per_cell:
txs="data/kallisto/{target}/{sample_id}/txs.genes.txt",
mtx="data/kallisto/{target}/{sample_id}/txs.mtx"
params:
min_umis=config['min_umis']
min_umis=config['min_umis'],
sq_version=SQ_VERSION
output:
"qc/umi_count/{target}/{sample_id}.umi_count.html"
conda: "envs/rmd-reporting.yaml"
script:
"scripts/report_umi_counts_per_cell.Rmd"

2 changes: 2 additions & 0 deletions config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,8 @@ target: "utrome_mm10_v2"
output_type:
- "txs"
- "genes"
output_format:
- "sce"
tmp_dir: "/tmp"
bx_whitelist: null
correct_bus: True
Expand Down
26 changes: 16 additions & 10 deletions docker/scutr-quant.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -15,14 +15,20 @@ dependencies:
- bustools=0.40.0

## sce + reporting
- r-base=4.1
- r-dplyr=1.0
- r-ggplot2=3.3
- r-matrix=1.4
- r-base=4.3
- r-dplyr=1.1
- r-ggplot2=3.5
- r-matrix=1.6
- r-readr=2.1
- r-rmarkdown=2.14
- r-stringr=1.4
- bioconductor-hdf5array=1.22
- bioconductor-singlecellexperiment=1.16
- bioconductor-plyranges=1.14
- openssl=1
- r-rmarkdown=2.27
- r-stringr=1.5
- bioconductor-hdf5array=1.30
- bioconductor-singlecellexperiment=1.24
- bioconductor-plyranges=1.22

## anndata
- python=3.11
- anndata=0.10
- numpy=1.26
- pandas=2.2
- scipy=1.13
11 changes: 11 additions & 0 deletions envs/anndata.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
name: anndata
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python=3.11
- anndata=0.10
- numpy=1.26
- pandas=2.2
- scipy=1.13
15 changes: 7 additions & 8 deletions envs/bioconductor-sce.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4,12 +4,11 @@ channels:
- bioconda
- defaults
dependencies:
- r-base=4.1
- r-dplyr=1.0
- r-ggplot2=3.3
- r-base=4.3
- r-dplyr=1.1
- r-ggplot2=3.5
- r-readr=2.1
- r-stringr=1.4
- bioconductor-hdf5array=1.22
- bioconductor-singlecellexperiment=1.16
- bioconductor-plyranges=1.14
- openssl=1
- r-stringr=1.5
- bioconductor-hdf5array=1.30
- bioconductor-singlecellexperiment=1.24
- bioconductor-plyranges=1.22
12 changes: 6 additions & 6 deletions envs/rmd-reporting.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,10 @@ channels:
- conda-forge
- defaults
dependencies:
- r-base=4.1
- r-dplyr=1.0
- r-ggplot2=3.3
- r-matrix=1.4
- r-base=4.3
- r-dplyr=1.1
- r-ggplot2=3.5
- r-matrix=1.6
- r-readr=2.1
- r-rmarkdown=2.14
- r-stringr=1.4
- r-rmarkdown=2.27
- r-stringr=1.5
12 changes: 10 additions & 2 deletions extdata/targets/targets.yaml
Original file line number Diff line number Diff line change
@@ -1,11 +1,13 @@
utrome_mm10_v1:
path: "extdata/targets/utrome_mm10_v1/" # files should be relative to this
path: "extdata/targets/utrome_mm10_v1/"
genome: "mm10"
gtf: "adult.utrome.e3.t200.f0.999.w500.gtf"
kdx: "adult.utrome.e3.t200.f0.999.w500.kdx"
merge_tsv: "adult.utrome.e3.t200.f0.999.w500.merge.tsv"
tx_annots: "utrome_txs_annotation.Rds"
tx_annots_csv: null
gene_annots: "utrome_genes_annotation.Rds"
gene_annots_csv: null
download_script: "download_utrome.sh"

utrome_mm10_v2:
Expand All @@ -15,7 +17,9 @@ utrome_mm10_v2:
kdx: "utrome.e30.t5.gc25.pas3.f0.9999.w500.kdx"
merge_tsv: "utrome.e30.t5.gc25.pas3.f0.9999.w500.m200.tsv"
tx_annots: "utrome_mm10_v2_tx_annots.2024.05.13.Rds"
tx_annots_csv: "utrome_mm10_v2_tx_annots.2024.05.13.csv.gz"
gene_annots: "utrome_mm10_v2_gene_annots.2024.05.13.Rds"
gene_annots_csv: "utrome_mm10_v2_gene_annots.2024.05.13.csv.gz"
download_script: "download_utrome.sh"

utrome_hg38_v1:
Expand All @@ -25,16 +29,20 @@ utrome_hg38_v1:
kdx: "utrome.e30.t5.gc39.pas3.f0.9999.w500.kdx"
merge_tsv: "utrome.e30.t5.gc39.pas3.f0.9999.w500.m200.tsv"
tx_annots: "utrome_hg38_v1_tx_annots.2024.05.13.Rds"
tx_annots_csv: "utrome_hg38_v1_tx_annots.2024.05.13.csv.gz"
gene_annots: "utrome_hg38_v1_gene_annots.2024.05.13.Rds"
gene_annots_csv: "utrome_hg38_v1_gene_annots.2024.05.13.csv.gz"
download_script: "download_utrome.sh"

## example custom target using GENCODE
## see https://github.com/Mayrlab/txcutr-db
# gencode_v38_pc_w500:
# path: "extdata/targets/gencode_v38_pc_w500/"
# path: "extdata/targets/gencode_v38_pc_w500/" # files must be relative to this
# genome: "hg38"
# gtf: "gencode.v38.annotation.pc.txcutr.w500.gtf"
# kdx: "gencode.v38.annotation.pc.txcutr.w500.kdx"
# merge_tsv: "gencode.v38.annotation.pc.txcutr.w500.merge.tsv"
# tx_annots: null
# tx_annots_csv: null
# gene_annots: null
# gene_annots_csv: null
Loading

0 comments on commit 4073878

Please sign in to comment.