
Add more bigwigaverageoverbed stats #42

Merged · 7 commits · Jun 28, 2024

Conversation

@ghuls (Contributor) commented Jun 13, 2024

Closes #38

@jackh726 (Owner) left a comment

A few thoughts. Really appreciate the PR!

Inline review comments on pybigtools/src/lib.rs and bigtools/src/utils/misc.rs (all resolved).
@jackh726 (Owner)

Just btw, I have a rebase over #44 at https://github.com/jackh726/bigtools/tree/ghuls/add_more_bigwigaverageoverbed_stats.

(Once I get that other PR merged, I'll get this updated for review comments - if you can't get to it in time - so we can get it merged)

@ghuls (Contributor, Author) commented Jun 24, 2024

> Just btw, I have a rebase over #44 at https://github.com/jackh726/bigtools/tree/ghuls/add_more_bigwigaverageoverbed_stats.
>
> (Once I get that other PR merged, I'll get this updated for review comments - if you can't get to it in time - so we can get it merged)

Looks like quite a bit of work in #44.

I was trying out returning a named tuple and got it to work, but noticed that my command was taking 5 seconds instead of around 4 seconds before.

So I decided to test creation times of tuples, dicts, named tuples and class instances. Named tuples seem to be the slowest to create:

In [1]: %timeit tuple_style = ('chr1\t3094805\t3095305\tchr1:3094805-3095305', 0.06325921607762575, 0.06325921607762575, 0.044780001044273376, 0.044780001044273376, 0.08956000208854675, 0.08956000208854675, 500, 500, 31.629608038812876)
6.97 ns ± 0.0829 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)


In [3]: %timeit dict_style = {"name": 'chr1\t3094805\t3095305\tchr1:3094805-3095305', "mean": 0.06325921607762575, "mean0": 0.06325921607762575, "min": 0.044780001044273376, "min0": 0.044780001044273376, "max": 0.08956000208854675, "max0": 0.08956000208854675, "size": 500, "bases": 500, "sum": 31.629608038812876}
202 ns ± 6.27 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


In [5]: %timeit named_tuple_style = pybigtools.BigWigAverageOverBedSummaryStatistics(name='chr1\t3094805\t3095305\tchr1:3094805-3095305', mean=0.06325921607762575, mean0=0.06325921607762575, min=0.044780001044273376, min0=0.044780001044273376, max=0.08956000208854675, max0=0.08956000208854675, size=500, bases=500, sum=31.629608038812876)
548 ns ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

class BigWigAverageOverBedSummaryStatisticsClass:
    def __init__(self, name, mean, mean0, min, min0, max, max0, size, bases, sum):
        self.name = name
        self.mean = mean
        self.mean0 = mean0
        self.min = min
        self.min0 = min0
        self.max = max
        self.max0 = max0
        self.size = size
        self.bases = bases
        self.sum = sum
        
In [12]: %timeit class_style = BigWigAverageOverBedSummaryStatisticsClass(name='chr1\t3094805\t3095305\tchr1:3094805-3095305', mean=0.06325921607762575, mean0=0.06325921607762575, min=0.044780001044273376, min0=0.044780001044273376, max=0.08956000208854675, max0=0.08956000208854675, size=500, bases=500, sum=31.629608038812876)
449 ns ± 7.09 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

It would be nice if bigwigaverageoverbed returned mean, min, max (only considering positions that have actual values, returning NaN if needed) and mean0, min0, and max0 (assuming 0.0 for all positions that don't have an entry in the bigWig). This would change the behavior of mean compared to Kent tools (NaN instead of 0.0 when no values are found), but it feels more consistent.
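
To make the proposed semantics concrete, a minimal pure-Python sketch (the helper name and the use of NaN markers for uncovered bases are illustrative assumptions, not pybigtools API):

import math

def summarize(values):
    # values: one float per base in the BED interval, NaN where the bigWig has no entry.
    covered = [v for v in values if not math.isnan(v)]
    zero_filled = [0.0 if math.isnan(v) else v for v in values]
    return {
        # Covered-only statistics: NaN when no base in the interval is covered.
        "mean": sum(covered) / len(covered) if covered else math.nan,
        "min": min(covered) if covered else math.nan,
        "max": max(covered) if covered else math.nan,
        # "0" variants: uncovered bases count as 0.0.
        "mean0": sum(zero_filled) / len(zero_filled),
        "min0": min(zero_filled),
        "max0": max(zero_filled),
        "size": len(values),
        "bases": len(covered),
        "sum": sum(covered),
    }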

-fn pybigtools(_py: Python, m: &PyModule) -> PyResult<()> {
+fn pybigtools(py: Python, m: &PyModule) -> PyResult<()> {
     m.add("__version__", env!("CARGO_PKG_VERSION"))?;
     m.add_wrapped(wrap_pyfunction!(open))?;
     m.add_wrapped(wrap_pyfunction!(bigWigAverageOverBed))?;
@@ -1366,5 +1382,25 @@ fn pybigtools(_py: Python, m: &PyModule) -> PyResult<()> {
 
     m.add_class::<EntriesIterator>()?;
 
+    let collections = PyModule::import(py, "collections").unwrap();
+    let namedtuple = collections.getattr("namedtuple").unwrap();
+    let bigwigaverageoverbed_summary_statistics = namedtuple
+        .call(
+            (
+                "BigWigAverageOverBedSummaryStatistics".to_object(py),
+                (
+                    "name", "mean", "mean0", "min", "min0", "max", "max0", "size", "bases", "sum",
+                )
+                    .to_object(py),
+            ),
+            None,
+        )
+        .unwrap();
+    m.add(
+        "BigWigAverageOverBedSummaryStatistics",
+        bigwigaverageoverbed_summary_statistics,
+    )
+    .unwrap();
+
     Ok(())
 }

@jackh726 (Owner)

I'm simultaneously surprised and not surprised that both dict and namedtuple have that much worse creation performance than a tuple.

I think, at least for now, I'm not so worried about the time difference. I imagine if speed is needed at that level, people would either drop down to Rust, or use some future API that allows batching and returning a pre-allocated array or something.

I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of the existing ones compared to the UCSC tools, since that will either get confusing or become too much. For now, I'd recommend just calling values (which will have a robust enough API) and doing the summary stats on that.
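
For example, something along these lines (a rough sketch assuming NumPy and that values() gives one value per base with NaN for uncovered positions; the exact signature and NaN/0.0 fill behavior may differ):

import numpy as np
import pybigtools

with pybigtools.open("test.bw", "r") as bw:
    vals = np.asarray(bw.values("chr1", 3094805, 3095305), dtype=np.float64)
    # nan-aware reductions ignore uncovered bases entirely
    mean, vmin, vmax = np.nanmean(vals), np.nanmin(vals), np.nanmax(vals)
    # treat uncovered bases as 0.0 for the "0" variants
    filled = np.nan_to_num(vals, nan=0.0)
    mean0, min0, max0 = filled.mean(), filled.min(), filled.max()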

I think I'm pretty happy with the branch I have at https://github.com/jackh726/bigtools/tree/ghuls/add_more_bigwigaverageoverbed_stats. I updated the API to match the comments above. Would love to hear your thoughts on it.

@ghuls (Contributor, Author) commented Jun 25, 2024

Looks good. Do you know how I can check that branch out to try it locally?

@jackh726 (Owner)

git fetch https://github.com/jackh726/bigtools.git bigtools_refactor_rebase2:bwaob_refactor
git checkout bwaob_refactor

should work

@ghuls (Contributor, Author) commented Jun 26, 2024

Managed to get the correct branch:

git fetch https://github.com/jackh726/bigtools.git ghuls/add_more_bigwigaverageoverbed_stats:bigtools_ghuls_add_more_bigwigaverageoverbed_stats

The help text needs to be fixed:

    If ``"all"`` is specified, all summary statistics are returned in a named tuple. 

    If a single statistic is provided as a string, that statistic is returned
    as a float.
    If a single statistic is provided as a string, that statistic is returned
    as a float or int depending on the statistic.

Giving an unknown statistic (like "all") does not trigger an exception or an error message:

In [32]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["mean0"]))
    ...: x[0]
    ...: 
    ...: 
CPU times: user 4.93 s, sys: 36.2 ms, total: 4.97 s
Wall time: 4.96 s
Out[32]: 0.06325921607762575

In [33]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["all"]))
    ...: x[0]
    ...: 
    ...: 
CPU times: user 90 µs, sys: 2 µs, total: 92 µs
Wall time: 95.8 µs
Out[33]: 0.06325921607762575

In [34]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["all2"]))
    ...: x[0]
    ...: 
    ...: 
CPU times: user 49 µs, sys: 1e+03 ns, total: 50 µs
Wall time: 52.9 µs
Out[34]: 0.06325921607762575

In [35]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["mean0", "mean", "min", "max", "size", "bases", "sum"]))
    ...: x[0]
    ...: 
    ...: 
CPU times: user 5.13 s, sys: 80 ms, total: 5.21 s
Wall time: 5.2 s
Out[35]: 
(0.06325921607762575,
 0.06325921607762575,
 0.044780001044273376,
 0.08956000208854675,
 500,
 500,
 31.629608038812876)

@ghuls (Contributor, Author) commented Jun 26, 2024

It would be nice to have a method on BBIRead to reinitialize it (similar to seek(0)), so that you can run e.g. multiple average_over_bed calls on the same bigWig filehandle:

with pybigtools.open("test.bw", "r") as bw:
    x = list(bw.average_over_bed("consensus_peaks_set1.bed", stats=["mean0"]))
    y = list(bw.average_over_bed("consensus_peaks_set2.bed", stats=["mean0"]))

@ghuls (Contributor, Author) commented Jun 26, 2024

Opening BED files as a Path does not work (works for opening the bigwig):

bw.average_over_bed(Path("file.bed"))

@jackh726 (Owner)

> The help text needs to be fixed:
>
>     If ``"all"`` is specified, all summary statistics are returned in a named tuple.
>
>     If a single statistic is provided as a string, that statistic is returned
>     as a float.
>     If a single statistic is provided as a string, that statistic is returned
>     as a float or int depending on the statistic.

Done

> Giving an unknown statistic (like "all") does not trigger an exception or an error message:

Strange...there is a test that shows that it does:

assert pytest.raises(ValueError, bw.average_over_bed, path, None, 'bad')
assert pytest.raises(ValueError, bw.average_over_bed, path, None, ['min', 'bad'])
assert pytest.raises(ValueError, bw.average_over_bed, path, None, ['min', 'all'])

> It would be nice to have a method on BBIRead to reinitialize it (similar to seek(0)), so that you can run e.g. multiple average_over_bed calls on the same bigWig filehandle:
>
> with pybigtools.open("test.bw", "r") as bw:
>     x = list(bw.average_over_bed("consensus_peaks_set1.bed", stats=["mean0"]))
>     y = list(bw.average_over_bed("consensus_peaks_set2.bed", stats=["mean0"]))

This already should have worked - added a test to show.

> Opening BED files as a Path does not work (works for opening the bigwig):
>
> bw.average_over_bed(Path("file.bed"))

Added this + test

@jackh726 (Owner)

(Feel free to just force-push that branch to this PR branch - I can do it directly if you don't know how. I do want to give you credit for your work here.)

@ghuls (Contributor, Author) commented Jun 27, 2024

Calculating multiple average_over_bed results on the same BBIRead handle and the "all" statistics now work. Not sure what was wrong with my installation before.

In [14]: %%time
    ...: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     y_list = list(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"))
    ...:     x_list = list(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="all"))
    ...: print(y_list[0], x_list[0])
    ...: 
    ...: 
0.06325921607762575 SummaryStatistics(size=500, bases=500, sum=31.629608038812876, mean0=0.06325921607762575, mean=0.06325921607762575, min=0.044780001044273376, max=0.08956000208854675)
CPU times: user 8.71 s, sys: 188 ms, total: 8.9 s
Wall time: 8.89 s

Raising an exception from a with block does not work for me:

import pybigtools
from pathlib import Path

# Don't import numpy, to show that the with block does not propagate the exception.
# import numpy as np

In [2]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_list = list(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"))
   ...:     x_list = list(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"))
   ...: print(y_list[0], x_list[0])
   ...: 
   ...: 
0.06325921607762575 0.08956000208854675
CPU times: user 8.1 s, sys: 228 ms, total: 8.33 s
Wall time: 8.32 s

# Numpy is not imported, so the code should throw an exception, but doesn't:
In [3]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
   ...: print(y_array[0], x_array[0])
   ...: 
   ...: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:4

NameError: name 'y_array' is not defined

In [4]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
   ...: #print(y_array[0], x_array[0])
   ...: 
   ...: 
CPU times: user 47 μs, sys: 1 μs, total: 48 μs
Wall time: 50.5 μs

# In a normal with block it throws an exception:
In [5]: %%time
   ...: with open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.read(), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.read(), dtype=np.float32)
   ...: print(y_array[0], x_array[0])
   ...: 
   ...: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:2

NameError: name 'np' is not defined


# After importing numpy
import numpy as np

In [9]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
   ...: print(y_array[0], x_array[0])
   ...: 
   ...: 
0.063259214 0.08956
CPU times: user 8.24 s, sys: 160 ms, total: 8.4 s
Wall time: 8.4 s

Minimal example:

In [11]: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     raise ValueError("Some error")
    ...: 

In [12]: with open(Path("test.bw"), "r") as bw:
    ...:     raise ValueError("Some error")
    ...: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 2
      1 with open(Path("test.bw"), "r") as bw:
----> 2     raise ValueError("Some error")

ValueError: Some error
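
(For reference, this silent-swallowing symptom is exactly what a with block does when the context manager's __exit__ returns a truthy value. A minimal pure-Python sketch of that behavior, unrelated to the actual pybigtools/pyo3 implementation:)

class SwallowingHandle:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Returning True tells Python the exception was handled,
        # so it is silently suppressed instead of propagating.
        return True

with SwallowingHandle():
    raise ValueError("Some error")  # never reaches the caller

print("execution continues")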

@jackh726 (Owner)

> Raising an exception from a with block does not work for me:

Thanks, this was a problem with #44, which is fixed now (with a test)

@jackh726 force-pushed the add_more_bigwigaverageoverbed_stats branch from 86485e6 to b391f93 on June 27, 2024, 16:08
@ghuls (Contributor, Author) commented Jun 28, 2024

> Raising an exception from a with block does not work for me:
>
> Thanks, this was a problem with #44, which is fixed now (with a test)

I now assume this was the reason why invoking average_over_bed twice on the bigWig filehandle didn't work, as I might have used some code that raised an exception.

Exceptions are now raised properly.

Is there a timeframe for a 0.2.0 release with this pull request merged? We hope to use it soon in a package we want to publish.

@ghuls (Contributor, Author) commented Jun 28, 2024

Seems like bigWigAverageOverBed is semi-deprecated:
https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ

If that is really the case, could the min0/max0 options be considered in the future? (They could be behind a different flag than --min-max, so --min-max stays compatible with Kent tools.)

> I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of the existing ones compared to the UCSC tools, since that will either get confusing or become too much. For now, I'd recommend just calling values (which will have a robust enough API) and doing the summary stats on that.

@jackh726 force-pushed the add_more_bigwigaverageoverbed_stats branch from b391f93 to 34a3691 on June 28, 2024, 16:13
@jackh726 (Owner)

> I now assume this was the reason why invoking average_over_bed twice on the bigWig filehandle didn't work, as I might have used some code that raised an exception.
>
> Exceptions are now raised properly.

Good to hear

> Is there a timeframe for a 0.2.0 release with this pull request merged? We hope to use it soon in a package we want to publish.

Maybe today, most likely sometime this weekend

> Seems like bigWigAverageOverBed is semi-deprecated: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ
>
> If that is really the case, could the min0/max0 options be considered in the future? (They could be behind a different flag than --min-max, so --min-max stays compatible with Kent tools.)

Potentially! I do think that the general question of whether uncovered is treated as 0 or not is important - it's relevant to values too.

> I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of the existing ones compared to the UCSC tools, since that will either get confusing or become too much. For now, I'd recommend just calling values (which will have a robust enough API) and doing the summary stats on that.

Just pushed a commit for this.

@jackh726 (Owner)

Going to go ahead and merge this - happy to follow up on min0/max0.

@jackh726 merged commit 5161c9a into jackh726:master on Jun 28, 2024
4 checks passed
@ghuls (Contributor, Author) commented Jun 28, 2024

> Is there a timeframe for a 0.2.0 release with this pull request merged? We hope to use it soon in a package we want to publish.
>
> Maybe today, most likely sometime this weekend

Great!

> Seems like bigWigAverageOverBed is semi-deprecated: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ
>
> If that is really the case, could the min0/max0 options be considered in the future? (They could be behind a different flag than --min-max, so --min-max stays compatible with Kent tools.)
>
> Potentially! I do think that the general question of whether uncovered is treated as 0 or not is important - it's relevant to values too.

I think pyBigWig returns np.nan for those positions.
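
For comparison, a small sketch of checking that with pyBigWig (written from memory, so treat the exact behavior as an assumption):

import math
import pyBigWig

bw = pyBigWig.open("test.bw")
# Positions without coverage come back as nan rather than 0.0.
vals = bw.values("chr1", 3094805, 3095305)
print(any(math.isnan(v) for v in vals))
bw.close()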

> I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of the existing ones compared to the UCSC tools, since that will either get confusing or become too much. For now, I'd recommend just calling values (which will have a robust enough API) and doing the summary stats on that.

> Just pushed a commit for this.

Thanks.

@ghuls (Contributor, Author) commented Jul 4, 2024

Seems like UCSC does not have time to fix bigWigAverageOverBed: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ

> Thank you for providing those details. Unfortunately, we don't have the bandwidth to fix this issue with bigWigAverageOverBed at this time.
>
> However, we can point people to the bigtools suite in the future if they ask about tools with similar functionality.

So I guess adding min0/max0 to bigtools should not be a big issue if bigWigAverageOverBed is basically abandoned.

Successfully merging this pull request may close: Implement -minMax option of bigWigAverageOverBed.