Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculating the sequence diversity between two variants returns a key error #356

Open
vsbuffalo opened this issue Mar 13, 2021 · 0 comments

Comments

@vsbuffalo
Copy link

Hello,

I found this behavior to be a bit unintuitive... I would be happy to submit a PR to address it, but it's not necessarily clear what the right behavior is. Here is an MRE; the issue is that calculating diversity on a range between variant raises an exception, rather than just returning zero pairwise diversity:

import allel as al
g = al.GenotypeArray([[[0, 0], [0, 0]],
                          [[0, 0], [0, 1]],
                          [[0, 0], [1, 1]],
                          [[0, 1], [1, 1]],
                          [[1, 1], [1, 1]],
                          [[0, 0], [1, 2]],
                          [[0, 1], [1, 2]],
                          [[0, 1], [-1, -1]],
                          [[-1, -1], [-1, -1]]])

ac = g.count_alleles()
pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]

al.sequence_diversity(pos, ac, start=21, stop=24)

Because the selected range resides between two variants, pos.locate_range() errors out with a KeyError. Here is the full traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-0cac1b30194e> in <module>
     12 pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]
     13 
---> 14 al.sequence_diversity(pos, ac, start=21, stop=24)

~/miniconda3/envs/simple/lib/python3.8/site-packages/allel/stats/diversity.py in sequence_diversity(pos, ac, start, stop, is_accessible)
    266     # deal with subregion
    267     if start is not None or stop is not None:
--> 268         loc = pos.locate_range(start, stop)
    269         pos = pos[loc]
    270         ac = ac[loc]

~/miniconda3/envs/simple/lib/python3.8/site-packages/allel/model/ndarray.py in locate_range(self, start, stop)
   3609 
   3610         if stop_index - start_index == 0:
-> 3611             raise KeyError(start, stop)
   3612 
   3613         loc = slice(start_index, stop_index)

KeyError: (21, 24)

Generally, I think addressing this by modifying the behavior through pos.locate_range() is probably the wrong approach. Instead, we could wrap this in a try/except block for sequence_diversity() and like functions, and immediately return 0 if there are no variants within the region of interest (though, perhaps if no bases are accessible there, it should be np.nan instead? With the usual caveats this is a bit of a hack to deal with missingness values....).

Additionally, it's worth pointing out that this behavior is similar if no bases are accessible, but the range does include variants:

import allel as al
import numpy as np
g = al.GenotypeArray([[[0, 0], [0, 0]],
                          [[0, 0], [0, 1]],
                          [[0, 0], [1, 1]],
                          [[0, 1], [1, 1]],
                          [[1, 1], [1, 1]],
                          [[0, 0], [1, 2]],
                          [[0, 1], [1, 2]],
                          [[0, 1], [-1, -1]],
                          [[-1, -1], [-1, -1]]])

ac = g.count_alleles()
pos = [2, 4, 7, 14, 15, 18, 19, 25, 27]

al.sequence_diversity(pos, ac, start=12, stop=24, is_accessible=np.repeat(False, 30))

The issue here is that mask_inaccessible() is removing all positions that are inaccessible. In this case, I think a np.nan might be a more desired return value too?

I am curious what you think.

thanks,
Vince

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant