Method workflows

This is where the workflows for running APAeval participants (= "method workflows") live.

NOTE: The following sections give in-depth instructions on how to create new APAeval method workflows. If you're looking for instructions on how to run an existing workflow for one of our benchmarked methods, please refer to the README.md in the respective directory. You can find quick links to those directories in the participant overview table below. In any case, make sure you have the APAeval conda environment set up and running.

Benchmarking Participants

List of bioinformatic methods benchmarked in APAeval. Please update columns as the method workflows progress.

| Method | Citation | Type | Status in APAeval | Benchmarked | OpenEBench link |
| --- | --- | --- | --- | --- | --- |
| APA-Scan | Fahmi et al. 2020 | Identification, relative quantification, differential usage | Issue #26, PR #160 | No (incompatible with APAeval input and metrics, bugs) | https://dev-openebench.bsc.es/tool/apa-scan |
| APAlyzer | Wang & Tian 2020 | Relative quantification, differential usage | Snakemake workflow | No (incompatible with APAeval metrics) | https://dev-openebench.bsc.es/tool/apalyzer |
| APAtrap | Ye et al. 2018 | Identification, absolute and relative quantification, differential usage | Nextflow workflow (high time/memory consumption), Issue #244 | Yes | NA |
| Aptardi | Lusk et al. 2021 | Identification | Nextflow workflow (high time/memory consumption, only tested on small test files, no ML model building, uses authors' published model) | No (time/memory issues) | https://openebench.bsc.es/tool/aptardi |
| CSI-UTR | Harrison et al. 2019 | Differential usage | Issue #388, Nextflow workflow (only tested on small test files) | No (incompatible with APAeval inputs, bugs) | NA |
| DaPars | Xia et al. 2014 | Identification, relative quantification, differential usage | Nextflow workflow | Yes | NA |
| DaPars2 | Feng et al. 2018 | Identification, relative quantification, differential usage | Snakemake workflow | Yes | NA |
| diffUTR | Gerber et al. 2021 | Differential usage | Nextflow workflow (only tested on small test files) | No (incompatible with APAeval metrics) | https://dev-openebench.bsc.es/tool/diffutr |
| GETUTR | Kim et al. 2015 | Identification, relative quantification, differential usage | Nextflow workflow | Yes | https://openebench.bsc.es/tool/getutr |
| IsoSCM | Shenker et al. 2015 | Identification, relative quantification, differential usage | Nextflow workflow | Yes | https://dev-openebench.bsc.es/tool/isoscm |
| LABRAT | Goering et al. 2020 | Relative quantification, differential usage | Nextflow workflow (only tested on small test files), Issue #406 | No (incompatible with APAeval metrics) | https://openebench.bsc.es/tool/labrat |
| MISO | Katz et al. 2010 | Absolute and relative quantification, differential usage | Issue #36, PR #85 | No (incompatible with APAeval input) | https://openebench.bsc.es/tool/miso |
| mountainClimber | Cass & Xiao 2019 | Identification, quantification, differential usage (according to publication) | Issue #37, PR #86 | No (bugs, utter lack of user-friendliness) | https://openebench.bsc.es/tool/mountainclimber |
| PAQR | Gruber et al. 2014 | Absolute and relative quantification, differential usage | Snakemake workflow, Issue #457 | Yes | https://openebench.bsc.es/tool/paqr |
| QAPA | Ha et al. 2018 | Absolute and relative quantification, differential usage | Nextflow workflow (hardcoded defaults, build mode in beta, we recommend using pre-built annotations), Issue #457 | Yes | https://openebench.bsc.es/tool/qapa |
| Roar | Grassi et al. 2016 | Relative quantification, differential usage | PR #161, Issue #38 | No (incompatible with APAeval input) | https://openebench.bsc.es/tool/roar |
| TAPAS | Arefeen et al. 2018 | Identification, relative quantification, differential usage | Nextflow workflow (differential usage functionality not implemented) | Yes | https://openebench.bsc.es/tool/tapas |

Overview

Method workflows contain all steps that need to be run per method (in OEB terms: per participant). Depending on the participant, a method workflow will have to perform pre-processing steps to convert the APAeval-sanctioned input files into a format the participant can consume. This does not include, e.g., adapter trimming or mapping of reads, as those steps are already performed in our general pre-processing pipeline. After pre-processing, the actual execution of the method has to be implemented, and subsequently post-processing steps might be required to convert the obtained output into the format defined by the APAeval specifications.

(Figure: method_workflows)

More details

  1. Sanctioned input files: Each of the processed input datasets APAeval uses for its challenges is provided as a .bam file (see the specifications for file formats). If a participant needs other file formats, these HAVE TO be created as part of the pre-processing within the method workflow (see 2.). Similarly, for each dataset we provide a GENCODE annotation in .gtf format, as well as a reference PAS atlas in .bed format for participants that depend on pre-defined PAS. All other annotation formats that might be needed HAVE TO be created from those. Non-sanctioned annotation or similar auxiliary files MUST NOT be downloaded as part of the method workflows, in order to ensure comparability of all participants' performance.

As several method workflows might have to perform the same pre-processing tasks, we created a utils directory where such scripts are stored (their corresponding Docker images are uploaded to the APAeval Docker Hub). Please check the utils directory before writing your own conversion scripts, and/or add your pre-processing scripts to the utils directory if you think others might be able to re-use them.

  2. Method execution: For each method to be benchmarked ("participant"), one method workflow has to be written. The workflow MUST include all necessary pre- and post-processing steps that are needed to get from the input formats provided by APAeval (see 1.) to the output specified by APAeval in their metrics specifications (see 3.). The workflow should include run mode parameters for the benchmarking events it qualifies for, set to either true or false (e.g. run_identification = true; a sketch of such flags is shown below this list). Each run of the method workflow should output files for the events whose run modes are set to true. If a method has distinct run modes other than those concerning the three benchmarking events, the calls to those should also be parameterized. If those run modes could significantly alter the behaviour of the method, please discuss with the APAeval community whether they should actually be treated as distinct participants in APAeval (see the section on parameters). That could for example be the case if the method can be run with either mathematical model A or model B, and the expected results would differ considerably. At the moment we can't foresee all possibilities, so we count on you to report and discuss any such cases. In any case, please document extensively how the method can be used and how you employed it. In general, all relevant participant parameters should be configurable in the workflow config files. Parameters, file names, run modes, etc. MUST NOT be hardcoded within the workflow.

IMPORTANT: Do not download any additional annotation files just because the documentation of your participant says so. Instead, create all files the participant needs from the ones provided by APAeval. If you don't know how, please don't hesitate to start discussions within the APAeval community! Chances are high that somebody has already encountered a similar problem and will be able to help.

  3. Post-processing: To ensure compatibility with the OEB benchmarking events, APAeval provides specifications for file formats (output of method workflows = input for benchmarking workflows). There is one specification per metric (= statistical parameter to assess the performance of a participant), but the calculation of several metrics can require a common input file format (in that case, the file has to be created only once by the method workflow). The required method workflow outputs are: a .bed file containing the coordinates of identified PAS (and their respective expression in TPM, if applicable), a .tsv file containing information on differential expression (if applicable), and a .json file containing information about compute resource and time requirements (see the output specifications for a detailed description of the file formats). These files have to be created within the method workflows as post-processing steps.
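To make point 2 above more concrete, here is a minimal sketch of what the run mode flags could look like in a Snakemake-style config.yaml; the key and parameter names are illustrative only and not prescribed by APAeval:

    # config.MY_TOOL.yaml -- illustrative sketch, adapt names to your participant
    run_identification: true        # produce output for the identification event
    run_quantification: true        # produce output for the quantification event
    run_differential_usage: false   # skip the differential usage event
    # participant-specific run modes should be parameterized as well, e.g.:
    model: "A"                      # hypothetical switch between model A and model B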

Method workflows should be implemented in either Nextflow or Snakemake, and individual steps should be isolated through the use of containers. For more information on how to create these containers, see the Containers section.

Templates

To implement a method workflow for a participant, copy either the Snakemake template or one of the Nextflow templates (DSL1 or DSL2) into the participant's directory and adapt the workflow directory names as described in the template's README. Don't forget to adapt the README itself as well.

Example:

method_workflows/
 |--QAPA/
     |--QAPA_snakemake/
          |--workflow/Snakefile
          |--config/config.QAPA.yaml
          |--envs/QAPA.yaml
          |--envs/QAPA.Dockerfile
          |-- ...
 |--MISO/
     |--MISO_nextflow/
          |-- ...

Containers

For the sake of reproducibility and interoperability, we require the use of Docker containers in our method workflows. Not only do the participants to be benchmarked have to be available in containers; any other tools used for pre- or post-processing in a method workflow should be containerized as well. Whether you use individual containers for all the tools of your workflow or combine them inside one container is up to you (the former being the more flexible option, of course).

IMPORTANT: Do check out the utils directory before you work on containers for pre- or post-processing tools; someone may already have done the same thing. If not, and you end up building useful containers, don't forget to add them there as well.

Here are some pointers on how to best approach the containerization:

  1. Check if your participant (or other tool) is already available as a Docker container, e.g. on BioContainers or Docker Hub.

  2. If no Docker image is available for your tool:

    • create a container on BioContainers via either a bioconda recipe or a Dockerfile
    • naming conventions:
      • if your container only contains one tool: apaeval/{tool_name}:{tool_version}, e.g. apaeval/my_tool:v1.0.0
      • if you combine all tools required for your workflow: apaeval/mwf_{participant_name}:{commit_hash}, where commit_hash is the short SHA of the Git commit in the APAeval repo that last modified the corresponding Dockerfile, e.g., 65132f2
  3. Now you just have to specify the Docker image(s) in your method workflow:

    • For Nextflow, the individual containers can be specified in the process definitions.
    • For Snakemake, the individual containers can be specified per rule.
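As a hedged illustration of the Snakemake case, the rule sketch below pins a Docker image via the container directive (activated e.g. with Snakemake's --use-singularity option); the rule name, file names, image tag, and command line are placeholders, not APAeval conventions. In Nextflow, the analogous container directive is set inside the process definition.

    # Snakemake rule sketch -- names, paths and the command line are placeholders
    rule convert_annotation:
        input:
            gtf="resources/annotation.gtf"
        output:
            bed="results/annotation.bed"
        container:
            "docker://apaeval/my_tool:v1.0.0"
        shell:
            "my_tool --gtf {input.gtf} --out {output.bed}"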

Input

Test data

For more information about input files, see "Sanctioned input files" above. For development and debugging you can use the small test input dataset we provide with this repository. You should use the .bam and/or .gtf files as input to your workflow. The .bed file serves as an example of a ground truth file. As long as the test_data directory doesn't contain a dedicated "poly(A) sites database file", which some methods require, you should also use the .bed file for that purpose during testing.

Parameters

Both the Snakemake template and the Nextflow template contain example samples.csv files. Here you fill in the paths to the samples you're running, and any other sample-specific information required by the workflow you're implementing. Adapt the fields of this samples.csv according to your workflow's requirements.
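A purely illustrative samples.csv could look like the sketch below (the column names besides sample_name are hypothetical and depend on your workflow's requirements; the paths are placeholders):

    sample_name,bam,gtf
    P19_siControl_R1,/path/to/P19_siControl_R1.bam,/path/to/gencode_annotation.gtf
    ANOTHER_SAMPLE,/path/to/ANOTHER_SAMPLE.bam,/path/to/gencode_annotation.gtf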

Moreover, both workflow languages require additional information in config files. This is the place to specify run- or participant-specific parameters.

Important notes:

  • Describe extensively in your README where parameters (sample info, participant-specific parameters) have to be specified for a new run of the pipeline.
  • Describe in the README if your participant has different run modes, or parameter settings that might alter the participant's performance considerably. In such a case, you should suggest that the different modes be treated as entirely distinct participants in APAeval. Feel free to start discussions about this in our GitHub discussions board.
  • Parameterize your code as much as possible, so that the user only has to change the sample sheet and config file, not the code. For example, output file paths should be built from information the user has filled into the sample sheet or config file (see the sketch in the Filenames section below).
  • For information on how files need to be named, see the Filenames section below.

Output

In principle you are free to store output files however it best suits you (or the participant). However, the "real" and final outputs for each run of the benchmarking will need to be copied to a directory of the form
PATH/TO/APAEVAL/EVENT/PARTICIPANT/

This directory must contain:

  • Output files (check formats and filenames)
  • Configuration files (with parameter settings), e.g. config.yaml and samples.csv.
  • logs/ directory with all log files created by the workflow execution.
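Using the example names from the Filenames section below, such a directory could look like this (purely for orientation; the exact OUTCODEs and extensions are defined in the output specification):

    Identification_01/
     |--MISO/
         |--MISO.P19_siControl_R1.01.bed
         |--config.yaml
         |--samples.csv
         |--logs/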

Formats

File formats for the 3 benchmarking events are described in the output specification.

Filenames

As mentioned above, it is best to parameterize filenames, so that for each run the names and codes can be set by changing only the sample sheet and config file!

File names must adhere to the following schema: PARTICIPANT.CHALLENGE.OUTCODE.ext
For the codes, please refer to the corresponding specification documents (e.g. method_workflow_file_specification.md).

Example:
Identification_01/MISO/MISO.P19_siControl_R1.01.bed would be the output of MISO (your participant) for the identification benchmarking event (OUTCODE 01, which we know from method_workflow_file_specification.md), run on the dataset "P19_siControl_R1" (exact name as sample_name in the APAeval Zenodo snapshot).
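As a sketch of what parameterizing file names can look like in practice (Snakemake shown; the config key names are illustrative, not APAeval conventions), the final path above could be assembled from config values only:

    # Snakemake sketch -- assemble the final file name from config entries only
    configfile: "config.yaml"
    # illustrative config keys:
    #   event: "Identification_01"
    #   participant: "MISO"
    #   challenge: "P19_siControl_R1"
    #   outcode: "01"

    rule collect_identification_output:
        input:
            "results/identification_raw.bed"   # placeholder for the method's post-processed output
        output:
            f"{config['event']}/{config['participant']}/"
            f"{config['participant']}.{config['challenge']}.{config['outcode']}.bed"
        shell:
            "cp {input} {output}"

Changing the event, challenge, or outcode for another run then only requires editing config.yaml, not the workflow code.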

PR reviews

At least 2 independent reviews are required before your code can be merged into the main APAeval branch. Why not review some other PR while you wait for yours to be accepted? You can find some instructions in Sam's PR review guide.