cellxgene-schema CLI must update annotation of var['feature_length'] #990

Open
brianraymor opened this issue Aug 20, 2024 · 0 comments
Labels: 5.2 (Next minor CELLxGENE schema version after 5.1), curation software

brianraymor commented Aug 20, 2024

Implementation notes copied from #960

@Bento007 wrote:

The code changes will be easy. There are no unit tests at the moment which should be written, and this will make this work take longer. I estimate 2-3 days.

@nayib-jose-gloria wrote:

Note: after re-writing _get_gene_lengths_from_gtf to use the new approach, we will also need to re-run the Gene Processing script and push the resulting updated CSVs to /genecode_files/

Design

There are two changes in requirements.

The feature length for the "spike-in" feature_biotype must now be calculated. Previously, the feature length calculation was limited to the "gene" feature_biotype.

See the conversation in cell-science-census.

@pablo-gar requests that the length for "spike-in" features, which is already calculated, be surfaced in the feature_length annotation.

We already calculate length in bps for spike ins https://github.com/chanzuckerberg/single-cell-curation/blob/main/cellxgene_schema_cli/cellxgene_schema/ontology_files/genes_ercc.csv.gz

Also capturing comments from cell-sci-platform related to the format of the ERCC download:

I found the ERCC92.fa & ERCC92.gtf sequence and annotation files (.zip) if we wanted to replace the custom processing of the current reference with generic GTF processing ... Not sure that it's worth the bother since the ERCC reference has been stable.
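A minimal sketch of the first requirement, assuming the lengths are available as two lookup tables (one for genes, one for spike-ins). The function name `annotate_feature_length`, the `ERCC-` prefix test, and the length values are illustrative assumptions, not the CLI's actual code or data.

```python
from typing import Dict, Iterable, Tuple

# Assumption: spike-in feature IDs are recognised by this prefix.
SPIKE_IN_PREFIX = "ERCC-"


def annotate_feature_length(
    feature_ids: Iterable[str],
    gene_lengths: Dict[str, int],
    spike_in_lengths: Dict[str, int],
) -> Dict[str, Tuple[str, int]]:
    """Return {feature_id: (feature_biotype, feature_length)}.

    Hypothetical helper: both biotypes now get a looked-up length,
    rather than only the "gene" biotype.
    """
    annotations: Dict[str, Tuple[str, int]] = {}
    for feature_id in feature_ids:
        if feature_id.startswith(SPIKE_IN_PREFIX):
            biotype, table = "spike-in", spike_in_lengths
        else:
            biotype, table = "gene", gene_lengths
        if feature_id not in table:
            raise KeyError(f"no length recorded for {feature_id}")
        annotations[feature_id] = (biotype, table[feature_id])
    return annotations


if __name__ == "__main__":
    # Placeholder lengths for illustration only.
    genes = {"ENSG00000000001": 1500}
    spike_ins = {"ERCC-00002": 1000}
    print(annotate_feature_length(["ENSG00000000001", "ERCC-00002"], genes, spike_ins))
```

Keeping the two tables separate mirrors the fact that the spike-in lengths already live in their own file (genes_ercc.csv.gz), so only the lookup step would need to change.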


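The calculation for genes changes from the "merged" length (the length of the union of a gene's exons across all isoforms) to the "median" length (the median of its isoform lengths), using the GTFtools implementation.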

See feature_length.

Implementation details from @pablo-gar:

Current implementation in our code base is here:

def _get_gene_lengths_from_gtf(self, gtf_path: str) -> Dict[str, int]:

The new implementation should be taken from here:

https://github.com/RacconC/gtftools/blob/140fc21003a565a0f69b5176db734b9a04a004a4/gtftools/gtftools.py#L670-L688
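To make the difference between the two calculations concrete, here is a rough sketch that computes both from the exon records of a GTF file. It is not the GTFtools code or the existing `_get_gene_lengths_from_gtf`; the function names are hypothetical, and truncating a fractional median with `int()` is an assumption that should be checked against the linked implementation.

```python
import re
import statistics
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def _attr(attributes: str, key: str) -> str:
    """Extract one quoted value (e.g. gene_id "X";) from the GTF attribute column."""
    match = re.search(rf'\b{key} "([^"]+)"', attributes)
    if match is None:
        raise ValueError(f"missing {key} in: {attributes}")
    return match.group(1)


def exon_records(gtf_lines: Iterable[str]) -> Iterator[Tuple[str, str, int, int]]:
    """Yield (gene_id, transcript_id, start, end) for every exon line."""
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        yield (_attr(fields[8], "gene_id"), _attr(fields[8], "transcript_id"),
               int(fields[3]), int(fields[4]))


def median_gene_lengths(gtf_lines: Iterable[str]) -> Dict[str, int]:
    """Median over isoforms of the summed exon length of each isoform."""
    per_transcript: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for gene_id, transcript_id, start, end in exon_records(gtf_lines):
        per_transcript[gene_id][transcript_id] += end - start + 1  # GTF is 1-based, inclusive
    # Assumption: a fractional median is truncated to an integer.
    return {g: int(statistics.median(t.values())) for g, t in per_transcript.items()}


def merged_gene_lengths(gtf_lines: Iterable[str]) -> Dict[str, int]:
    """Length of the union of all exon intervals of a gene."""
    exons: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for gene_id, _, start, end in exon_records(gtf_lines):
        exons[gene_id].append((start, end))
    lengths: Dict[str, int] = {}
    for gene_id, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:  # overlapping or directly adjacent
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        lengths[gene_id] = total + cur_end - cur_start + 1
    return lengths
```

For a gene with isoforms of 200, 100, and 150 bp whose exons together cover 300 bp, the median length is 150 while the merged length is 300, so the regenerated CSVs can be expected to differ from the current ones wherever isoforms do not fully overlap.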
