Merge pull request #87 from Mayrlab/dev

Milestone v0.5.0
Mayrlab · May 28, 2024 · 4073878 · 4073878
2 parents 876264d + a29003a
commit 4073878
Show file tree

Hide file tree

Showing 11 changed files with 375 additions and 66 deletions.
diff --git a/README.md b/README.md
@@ -44,20 +44,31 @@ determined by simulations.
 Please see [our accompanying manuscript][ref:scUTRquant] for more details.
 
 ## Outputs
+### SingleCellExperiment object (*default*)
 The primary output of the pipeline is a Bioconductor `SingleCellExperiment` object.
 The `counts` in the object is a sparse `Matrix` of 3' UTR isoform counts; the `rowRanges` 
 is a `GenomicRanges` of the 3' UTR isoforms; the `rowData` is a `DataFrame` with additional
 information about 3' UTR isoforms; and the `colData` is a `DataFrame` populated with sample 
 metadata and optional user-provide cell annotations.
 
+### H5AD AnnData object (*experimental*)
+scUTRquant v0.5.0 adds support for [AnnData objects](https://anndata.readthedocs.io/en/stable/index.html),
+making it easier for users who prefer Python and working with [the `scverse` ecosystem](https://scverse.org).
+Similar annotations will be attached in the `obs` (cell) and `var` (3'UTR isoform) tables.
+However, no analogous `rowRanges` is attached.
+
+The output format can be controlled with the `output_format` configuration option,
+with `"h5ad"` and "`sce`" being currently supported options. 
+
+### Reporting
 To assist users in quality control, the pipeline additionally generates HTML reports 
 for each sample.
 
+### Additional Files
 The pipeline is configured to retain intermediate files, such as BUS and MTX files.
 Advanced users can readily customize the pipeline to only generate the files they 
-require. For example, users who prefer to work with alternative scRNA-seq data structures,
-such as those used in Scanpy or Seurat, may wish to terminate the pipeline at MTX 
-generation.
+require. For example, users who prefer to work with alternative scRNA-seq data structures
+may wish to terminate the pipeline at MTX generation.
 
 # Setup
 
@@ -73,22 +84,23 @@ repository directly tests running examples on the GitHub-hosted runners.
 ### Conda/Mamba Mode (MacOS or Linux)
 Snakemake can use Conda to install the needed software. This configuration requires:
 
- - [Snakemake][ref:snakemake] >= 6.0, < 8.0<sup>ª</sup>
+ - [Snakemake][ref:snakemake] >= 6.0<sup>ª</sup>
  - [Conda](https://docs.conda.io/projects/conda/en/latest/)
 
 If Conda is not already installed, we strongly recommend installing 
-[Mambaforge](https://github.com/conda-forge/miniforge#mambaforge). If Conda was previously
-installed, strongly recommend installing Mamba:
+[Miniforge](https://github.com/conda-forge/miniforge#miniforge). If Conda was previously
+installed, strongly recommend that Conda be upgraded to at least v23.11 which uses the 
+faster `libmamba` solver:
 
 ```bash
-conda install -n base -c conda-forge mamba
+conda install -n base 'conda>=23.11'
 ```
 
 ### Singularity Mode (Linux)
-Snakemake can use the pre-built scUTRsquant Docker image to provide all additional software.
-This configuration requires installing:
+Snakemake can use [the pre-built scUTRquant Docker image](https://hub.docker.com/repository/docker/mfansler/scutr-quant) 
+to provide all additional software. This configuration requires installing:
 
- - [Snakemake][ref:snakemake] >= 6.0, < 8.0<sup>ª</sup>
+ - [Snakemake][ref:snakemake] >= 6.0<sup>ª</sup>
  - [Singularity](https://singularity.lbl.gov/index.html)
 
 
@@ -206,7 +218,7 @@ On GitHub runners with 2-3 cores, these examples have typical execution times of
 On HPC systems with multiple nodes with multiple cores, a large job (e.g., 1-2TB raw data) 
 can process in under an hour when properly configured.
 
-## Full Examples
+## Full-Scale Examples
 The inputs used to process data in [the manuscript][ref:scUTRquant] are also available in the 
 [scUTRquant-inputs repository](https://github.com/Mayrlab/scUTRquant-inputs).
 These also include individual Snakemake pipelines to download atlas-scale datasets. 
@@ -236,6 +248,7 @@ pipeline. The following keys are expected:
  - `cell_annots`: (optional) CSV file with a key column that matches the `<sample_id>_<cell_bx>` format
  - `cell_annots_key`: specifies the name of the key column in the `cell_annots` file; default is `cell_id`
  - `exclude_unannotated_cells`: boolean indicating whether unannotated cells should be excluded from the final output; default is `False`
+ - `output_format`: a list of formats, `"sce"` (*default*) and/or `"h5ad"` (*experimental*); `null` will end processing at MTX files.
 
 ### Default Values
 
@@ -270,7 +283,9 @@ The `extdata/targets.yaml` defines the targets available to pseudoalign to. The
  - `kdx`: Kallisto index for UTRome
  - `merge`: TSV for merging features (isoforms)
  - `tx_annots`: (optional) RDS file containing Bioconductor DataFrame object with annotations for transcripts
+ - `tx_annots_csv`: (optional) CSV file with annotations for transcripts (used for AnnData)
  - `gene_annots`: (optional) RDS file containing Bioconductor DataFrame object with annotations for genes
+ - `gene_annots_csv`: (optional) CSV file with annotations for genes (used for AnnData)
 
 
 # Customization

diff --git a/Snakefile b/Snakefile
@@ -1,19 +1,29 @@
-container: "docker://mfansler/scutr-quant:0.4.0"
+container: "docker://mfansler/scutr-quant:0.5.0"
 configfile: "config.yaml"
+SQ_VERSION="0.5.0"
 
 import os
 import pandas as pd
-from snakemake.io import load_configfile
 from snakemake.utils import min_version
 from sys import stderr
 
+# robust loading of `load_configfile`
+try:
+    ## version >=8.0.0
+    from snakemake.common.configfile import load_configfile
+except:
+    ## version <8.0.0
+    from snakemake.io import load_configfile
+
 # set minimum Snakemake version
 min_version("6.0")
 
 # print to stderr
 def message(*args, **kwargs):
     print(*args, file=stderr, **kwargs)
 
+message(f"[INFO] scUTRquant v{SQ_VERSION}")
+
 # make sure the tmp directory exists
 os.makedirs(config['tmp_dir'], exist_ok=True)
 
@@ -54,19 +64,25 @@ def get_outputs():
         outputs += expand("qc/umi_count/{target}/{sample_id}.umi_count.html",
                           target=target_list,
                           sample_id=samples.index.values)
-    if config['use_hdf5']:
-        outputs += expand("data/sce/{target}/{dataset}.{output_type}.{file_type}",
-                          target=target_list,
-                          dataset=config['dataset_name'],
-                          output_type=config['output_type'],
-                          file_type=['se.rds', 'assays.h5'])
-    else:
-        outputs += expand("data/sce/{target}/{dataset}.{output_type}.Rds",
-                          target=target_list,
-                          dataset=config['dataset_name'],
-                          output_type=config['output_type'])
+    if (not config['output_format']) or ("sce" in config['output_format']):
+        if config['use_hdf5']:
+            outputs += expand("data/sce/{target}/{dataset}.{output_type}.{file_type}",
+                            target=target_list,
+                            dataset=config['dataset_name'],
+                            output_type=config['output_type'],
+                            file_type=['se.rds', 'assays.h5'])
+        else:
+            outputs += expand("data/sce/{target}/{dataset}.{output_type}.Rds",
+                            target=target_list,
+                            dataset=config['dataset_name'],
+                            output_type=config['output_type'])
+    if "h5ad" in config['output_format']:
+        outputs += expand("data/h5ad/{target}/{dataset}.{output_type}.h5ad",
+                            target=target_list,
+                            dataset=config['dataset_name'],
+                            output_type=config['output_type'])
     return outputs
-    
+
 rule all:
     input: get_outputs()
 
@@ -82,9 +98,11 @@ for target_id, target in targets.items():
 
     target_path = target['path']
     download_script = target_path + target['download_script']
-
-    FILE_KEYS = ['gtf', 'kdx', 'merge_tsv', 'tx_annots', 'gene_annots']
-    target_files = [target_path + target[k] for k in FILE_KEYS]
+
+    FILE_KEYS = ['gtf', 'kdx', 'merge_tsv', 
+                 'tx_annots', 'gene_annots', 
+                 'tx_annots_csv', 'gene_annots_csv']
+    target_files = [target_path + target[k] for k in FILE_KEYS if target[k] is not None]
 
     rule:
         name: f"download_{target_id}"
@@ -379,6 +397,55 @@ rule mtxs_to_sce_h5_genes:
     script:
         "scripts/mtxs_to_sce_genes.R"
 
+rule mtxs_to_h5ad_txs:
+    input:
+        bxs=expand("data/kallisto/{target}/{sample_id}/txs.barcodes.txt",
+                   sample_id=samples.index.values, allow_missing=True),
+        txs=expand("data/kallisto/{target}/{sample_id}/txs.genes.txt",
+                   sample_id=samples.index.values, allow_missing=True),
+        mtxs=expand("data/kallisto/{target}/{sample_id}/txs.mtx",
+                    sample_id=samples.index.values, allow_missing=True),
+        gtf=get_target_file('gtf'),
+        tx_annots=get_target_file('tx_annots_csv'),
+        cell_annots=config['cell_annots']
+    output:
+        h5ad="data/h5ad/{target}/%s.txs.h5ad" % config['dataset_name']
+    params:
+        genome=lambda wcs: targets[wcs.target]['genome'],
+        sample_ids=samples.index.values,
+        min_umis=config['min_umis'],
+        cell_annots_key=config['cell_annots_key'],
+        exclude_unannotated_cells=config['exclude_unannotated_cells']
+    resources:
+        mem_mb=16000
+    conda: "envs/anndata.yaml"
+    script:
+        "scripts/mtxs_to_h5ad_txs.py"
+
+rule mtxs_to_h5ad_genes:
+    input:
+        bxs=expand("data/kallisto/{target}/{sample_id}/genes.barcodes.txt",
+                   sample_id=samples.index.values, allow_missing=True),
+        genes=expand("data/kallisto/{target}/{sample_id}/genes.genes.txt",
+                   sample_id=samples.index.values, allow_missing=True),
+        mtxs=expand("data/kallisto/{target}/{sample_id}/genes.mtx",
+                    sample_id=samples.index.values, allow_missing=True),
+        gtf=get_target_file('gtf'),
+        gene_annots=get_target_file('gene_annots_csv'),
+        cell_annots=config['cell_annots']
+    output:
+        h5ad="data/h5ad/{target}/%s.genes.h5ad" % config['dataset_name']
+    params:
+        genome=lambda wcs: targets[wcs.target]['genome'],
+        sample_ids=samples.index.values,
+        min_umis=config['min_umis'],
+        cell_annots_key=config['cell_annots_key'],
+        exclude_unannotated_cells=config['exclude_unannotated_cells']
+    resources:
+        mem_mb=16000
+    conda: "envs/anndata.yaml"
+    script:
+        "scripts/mtxs_to_h5ad_genes.py"
 
 ################################################################################
 ## Reports
@@ -390,10 +457,10 @@ rule report_umis_per_cell:
         txs="data/kallisto/{target}/{sample_id}/txs.genes.txt",
         mtx="data/kallisto/{target}/{sample_id}/txs.mtx"
     params:
-        min_umis=config['min_umis']
+        min_umis=config['min_umis'],
+        sq_version=SQ_VERSION
     output:
         "qc/umi_count/{target}/{sample_id}.umi_count.html"
     conda: "envs/rmd-reporting.yaml"
     script:
         "scripts/report_umi_counts_per_cell.Rmd"
-
diff --git a/config.yaml b/config.yaml
@@ -7,6 +7,8 @@ target: "utrome_mm10_v2"
 output_type:
   - "txs"
   - "genes"
+output_format:
+  - "sce"
 tmp_dir: "/tmp"
 bx_whitelist: null
 correct_bus: True

diff --git a/docker/scutr-quant.yaml b/docker/scutr-quant.yaml
@@ -15,14 +15,20 @@ dependencies:
   - bustools=0.40.0
 
   ## sce + reporting
-  - r-base=4.1
-  - r-dplyr=1.0
-  - r-ggplot2=3.3
-  - r-matrix=1.4
+  - r-base=4.3
+  - r-dplyr=1.1
+  - r-ggplot2=3.5
+  - r-matrix=1.6
   - r-readr=2.1
-  - r-rmarkdown=2.14
-  - r-stringr=1.4
-  - bioconductor-hdf5array=1.22
-  - bioconductor-singlecellexperiment=1.16
-  - bioconductor-plyranges=1.14
-  - openssl=1
+  - r-rmarkdown=2.27
+  - r-stringr=1.5
+  - bioconductor-hdf5array=1.30
+  - bioconductor-singlecellexperiment=1.24
+  - bioconductor-plyranges=1.22
+
+  ## anndata
+  - python=3.11
+  - anndata=0.10
+  - numpy=1.26
+  - pandas=2.2
+  - scipy=1.13
diff --git a/envs/anndata.yaml b/envs/anndata.yaml
@@ -0,0 +1,11 @@
+name: anndata
+channels:
+  - conda-forge
+  - bioconda
+  - defaults
+dependencies:
+  - python=3.11
+  - anndata=0.10
+  - numpy=1.26
+  - pandas=2.2
+  - scipy=1.13
diff --git a/envs/bioconductor-sce.yaml b/envs/bioconductor-sce.yaml
@@ -4,12 +4,11 @@ channels:
   - bioconda
   - defaults
 dependencies:
-  - r-base=4.1
-  - r-dplyr=1.0
-  - r-ggplot2=3.3
+  - r-base=4.3
+  - r-dplyr=1.1
+  - r-ggplot2=3.5
   - r-readr=2.1
-  - r-stringr=1.4
-  - bioconductor-hdf5array=1.22
-  - bioconductor-singlecellexperiment=1.16
-  - bioconductor-plyranges=1.14
-  - openssl=1
+  - r-stringr=1.5
+  - bioconductor-hdf5array=1.30
+  - bioconductor-singlecellexperiment=1.24
+  - bioconductor-plyranges=1.22
diff --git a/envs/rmd-reporting.yaml b/envs/rmd-reporting.yaml
@@ -3,10 +3,10 @@ channels:
   - conda-forge
   - defaults
 dependencies:
-  - r-base=4.1
-  - r-dplyr=1.0
-  - r-ggplot2=3.3
-  - r-matrix=1.4
+  - r-base=4.3
+  - r-dplyr=1.1
+  - r-ggplot2=3.5
+  - r-matrix=1.6
   - r-readr=2.1
-  - r-rmarkdown=2.14
-  - r-stringr=1.4
+  - r-rmarkdown=2.27
+  - r-stringr=1.5
diff --git a/extdata/targets/targets.yaml b/extdata/targets/targets.yaml
@@ -1,11 +1,13 @@
 utrome_mm10_v1:
-  path: "extdata/targets/utrome_mm10_v1/"  # files should be relative to this
+  path: "extdata/targets/utrome_mm10_v1/"
   genome: "mm10"
   gtf: "adult.utrome.e3.t200.f0.999.w500.gtf"
   kdx: "adult.utrome.e3.t200.f0.999.w500.kdx"
   merge_tsv: "adult.utrome.e3.t200.f0.999.w500.merge.tsv"
   tx_annots: "utrome_txs_annotation.Rds"
+  tx_annots_csv: null
   gene_annots: "utrome_genes_annotation.Rds"
+  gene_annots_csv: null
   download_script: "download_utrome.sh"
 
 utrome_mm10_v2:
@@ -15,7 +17,9 @@ utrome_mm10_v2:
   kdx: "utrome.e30.t5.gc25.pas3.f0.9999.w500.kdx"
   merge_tsv: "utrome.e30.t5.gc25.pas3.f0.9999.w500.m200.tsv"
   tx_annots: "utrome_mm10_v2_tx_annots.2024.05.13.Rds"
+  tx_annots_csv: "utrome_mm10_v2_tx_annots.2024.05.13.csv.gz"
   gene_annots: "utrome_mm10_v2_gene_annots.2024.05.13.Rds"
+  gene_annots_csv: "utrome_mm10_v2_gene_annots.2024.05.13.csv.gz"
   download_script: "download_utrome.sh"
 
 utrome_hg38_v1:
@@ -25,16 +29,20 @@ utrome_hg38_v1:
   kdx: "utrome.e30.t5.gc39.pas3.f0.9999.w500.kdx"
   merge_tsv: "utrome.e30.t5.gc39.pas3.f0.9999.w500.m200.tsv"
   tx_annots: "utrome_hg38_v1_tx_annots.2024.05.13.Rds"
+  tx_annots_csv: "utrome_hg38_v1_tx_annots.2024.05.13.csv.gz"
   gene_annots: "utrome_hg38_v1_gene_annots.2024.05.13.Rds"
+  gene_annots_csv: "utrome_hg38_v1_gene_annots.2024.05.13.csv.gz"
   download_script: "download_utrome.sh"
 
 ## example custom target using GENCODE
 ## see https://github.com/Mayrlab/txcutr-db
 # gencode_v38_pc_w500:
-#   path: "extdata/targets/gencode_v38_pc_w500/"
+#   path: "extdata/targets/gencode_v38_pc_w500/"  # files must be relative to this
 #   genome: "hg38"
 #   gtf: "gencode.v38.annotation.pc.txcutr.w500.gtf"
 #   kdx: "gencode.v38.annotation.pc.txcutr.w500.kdx"
 #   merge_tsv: "gencode.v38.annotation.pc.txcutr.w500.merge.tsv"
 #   tx_annots: null
+#   tx_annots_csv: null
 #   gene_annots: null
+#   gene_annots_csv: null