Merge pull request #605 from nextstrain/support-gisaid-downloads
Support GISAID downloads
Showing 19 changed files with 727 additions and 99 deletions.
Binary files not shown (4 files).
@@ -0,0 +1,2 @@
strain	virus	gisaid_epi_isl	genbank_accession	date	region	country	division	location	region_exposure	country_exposure	division_exposure	segment	length	host	age	sex	originating_lab	submitting_lab	authors	url	title	date_submitted
Wuhan/WH01/2019	ncov	EPI_ISL_406798	LR757998	2019-12-26	Asia	China	Hubei	Wuhan	Asia	China	Hubei	genome	29866	Human	44	Male	General Hospital of Central Theater Command of People's Liberation Army of China	BGI & Institute of Microbiology, Chinese Academy of Sciences & Shandong First Medical University & Shandong Academy of Medical Sciences & General Hospital of Central Theater Command of People's Liberation Army of China	Weijun Chen et al	https://www.gisaid.org	?	2020-01-30
@@ -1,3 +1,3 @@
Wuhan/Hu-1/2019
Wuhan-Hu-1/2019
Wuhan/WH01/2019
Wuhan/WH01/2019
@@ -1,56 +1,47 @@

## This YAML file is sparsely commented, with a focus on the parts relevant to multiple inputs
## See my_profiles/example/builds.yaml for more general comments
## See docs/multiple_inputs.md for a walkthrough of this config.

# custom_rules:
# - my_profiles/example_multiple_inputs/rules.smk

use_nextalign: true

# Define an ordered list of input datasets including a "focal" set (e.g., a
# custom selection of strains from GISAID's "custom selection" or search
# interfaces) and a "contextual" set (e.g., a curated selection of strains
# representing global or regional circulation at a specific point in time from
# GISAID's "nextregion" downloads).
inputs:
  - name: local
    metadata: "data/example_metadata.tsv"
    sequences: "data/example_sequences.fasta"
  - name: global
    # Define local paths to a pre-filtered global context.
    metadata: "data/global_subsampled_metadata.tsv.gz"
    filtered: "data/global_subsampled_sequences.fasta.gz"
    # Paths to files on S3 also work. For example:
    #metadata: "s3://nextstrain-ncov-private/global_subsampled_metadata.tsv.xz"
    #filtered: "s3://nextstrain-ncov-private/global_subsampled_sequences.fasta.xz"
    # TODO: ideally, we could support something like the following, to inject
    # a collection of contextual sequences into the analysis after all other
    # subsampling.
    # subsampled: "data/global_subsampled_sequences.fasta.gz"

  - name: focal
    metadata: "data/example_metadata_aus.tsv.xz"
    sequences: "data/example_sequences_aus.fasta.xz"
  - name: contextual
    metadata: "data/example_metadata_worldwide.tsv.xz"
    filtered: "data/example_sequences_worldwide.fasta.xz"

# Define a single build named "global".
builds:
  global-context:
    subsampling_scheme: custom-scheme # use a custom subsampling scheme defined below
  global:
    # Use a custom subsampling scheme defined below
    subsampling_scheme: custom-scheme

# STAGE 1: Input-specific filtering parameters
# Align sequences with nextalign instead of mafft.
use_nextalign: true

# Input-specific filtering parameters.
filter:
  aus:
    min_length: 5000 # Allow shorter genomes. Parameter used to filter alignment.
    skip_diagnostics: True # skip diagnostics (which can remove genomes) for this input
  focal:
    # Allow shorter genomes. Parameter used to filter alignment.
    min_length: 5000

    # Skip diagnostics (which can remove genomes) for this input.
    skip_diagnostics: True

# STAGE 2: Subsampling parameters
# Subsampling parameters
subsampling:
  custom-scheme:
    # Use metadata key to include ALL from `input1`
    australian_focus:
      exclude: "--exclude-where 'aus!=yes'" # subset to sequences from input `aus`
    # Include all global contextual sequences without subsampling.
    global_context:
      exclude: "--exclude-where 'aus=yes'" # i.e. subset to sequences _not_ from input `aus`

files:
  auspice_config: "my_profiles/example_global_context/my_auspice_config.json"
  description: "my_profiles/example_global_context/my_description.md"
    # Subsample from the focal sequences. Remove or comment out the
    # `max_sequences` line to select all focal sequences.
    focal:
      max_sequences: 100
      query: --query "focal == 'yes'"
    # Subsample from the contextual sequences. Remove or comment out the
    # `max_sequences` line to select all contextual sequences.
    contextual:
      max_sequences: 100
      query: --query "contextual == 'yes'"

skip_travel_history_adjustment: True

traits:
  global-context:
    sampling_bias_correction: 2.5
    columns: ["country"]
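The `query` entries in this config select records by a per-input flag column (`focal == 'yes'`, `contextual == 'yes'`). The following is a minimal sketch of that selection logic, assuming the merged metadata carries one column per input name whose value is `yes` for records from that input (consistent with the `--exclude-where 'aus!=yes'` filter in the same diff). The function `select_by_input` and the sample records are hypothetical, and keeping the first `max_sequences` matches is a simplification; it does not reproduce how `augur filter` chooses which records to keep.

```python
def select_by_input(records, input_name, max_sequences=None):
    """Return records flagged 'yes' for the given input (hypothetical helper).

    `records` is a list of dicts, one per strain. Truncating to the first
    `max_sequences` matches is an illustrative simplification only.
    """
    selected = [record for record in records if record.get(input_name) == "yes"]
    if max_sequences is not None:
        selected = selected[:max_sequences]
    return selected


# Made-up sample records for illustration.
SAMPLE_RECORDS = [
    {"strain": "A", "focal": "yes", "contextual": ""},
    {"strain": "B", "focal": "", "contextual": "yes"},
    {"strain": "C", "focal": "yes", "contextual": ""},
]

if __name__ == "__main__":
    focal = select_by_input(SAMPLE_RECORDS, "focal", max_sequences=1)
    contextual = select_by_input(SAMPLE_RECORDS, "contextual")
    print([record["strain"] for record in focal])
    print([record["strain"] for record in contextual])
```

Leaving `max_sequences` as `None` mirrors the config comment about removing the `max_sequences` line to select every record from that input.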
@@ -0,0 +1,34 @@
import argparse
from augur.utils import read_metadata
import pandas as pd
import re


if __name__ == '__main__':
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("--metadata", required=True, help="metadata to be sanitized")
    parser.add_argument("--strip-prefixes", nargs="+", help="prefixes to strip from strain names in the metadata")
    parser.add_argument("--output", required=True, help="sanitized metadata")

    args = parser.parse_args()

    metadata, columns = read_metadata(args.metadata)
    metadata = pd.DataFrame.from_dict(metadata, orient="index")

    if args.strip_prefixes:
        prefixes = "|".join(args.strip_prefixes)
        pattern = f"^({prefixes})"

        metadata["strain"] = metadata["strain"].apply(
            lambda strain: re.sub(
                pattern,
                "",
                strain
            )
        )

    metadata.to_csv(
        args.output,
        sep="\t",
        index=False
    )
@@ -0,0 +1,23 @@
import argparse
from augur.io import open_file, read_sequences, write_sequences
import re


if __name__ == '__main__':
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("--sequences", nargs="+", required=True, help="sequences to be sanitized")
    parser.add_argument("--strip-prefixes", nargs="+", help="prefixes to strip from strain names in the sequences")
    parser.add_argument("--output", required=True, help="sanitized sequences")

    args = parser.parse_args()

    if args.strip_prefixes:
        prefixes = "|".join(args.strip_prefixes)
        pattern = f"^({prefixes})"
    else:
        pattern = ""

    with open_file(args.output, "w") as output_handle:
        for sequence in read_sequences(*args.sequences):
            sequence.id = re.sub(pattern, "", sequence.id)
            write_sequences(sequence, output_handle)
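The two scripts in this commit join the `--strip-prefixes` values directly into a regular expression, so a prefix containing regex metacharacters (such as `.` or `+`) would be read as a pattern rather than literal text. Below is a minimal sketch of a more defensive variant, assuming the prefixes are meant as literal strings. The helper name `strip_prefixes` and the example prefixes are illustrative and not part of the commit.

```python
import re


def strip_prefixes(name, prefixes):
    """Remove a leading literal prefix from a strain name.

    Hypothetical helper, not part of the commit: each prefix is passed
    through re.escape so metacharacters are matched literally.
    """
    if not prefixes:
        return name
    # Anchor at the start of the name and try each escaped prefix in turn.
    pattern = "^(" + "|".join(re.escape(prefix) for prefix in prefixes) + ")"
    return re.sub(pattern, "", name)


if __name__ == "__main__":
    # "hCoV-19/" and "SARS-CoV-2/" are assumed example prefixes.
    print(strip_prefixes("hCoV-19/Wuhan/WH01/2019", ["hCoV-19/", "SARS-CoV-2/"]))
    print(strip_prefixes("Wuhan/WH01/2019", ["hCoV-19/"]))
```

With escaping, a prefix such as `a.b/` only strips names that literally start with `a.b/`, whereas the unescaped form would also strip `axb/`.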