
Add more bigwigaverageoverbed stats #42

Merged · 7 commits · Jun 28, 2024

Conversation

@ghuls (Contributor) commented Jun 13, 2024

Closes #38

@jackh726 (Owner) left a comment

A few thoughts. Really appreciate the PR!

Inline review comments on pybigtools/src/lib.rs and bigtools/src/utils/misc.rs (all resolved).
@jackh726 (Owner)

Just btw, I have a rebase over #44 at https://github.com/jackh726/bigtools/tree/ghuls/add_more_bigwigaverageoverbed_stats.

(Once I get that other PR merged, I'll get this updated for review comments - if you can't get to it in time - so we can get it merged)

@ghuls (Contributor, Author) commented Jun 24, 2024

> Just btw, I have a rebase over #44 at https://github.com/jackh726/bigtools/tree/ghuls/add_more_bigwigaverageoverbed_stats.
>
> (Once I get that other PR merged, I'll get this updated for review comments - if you can't get to it in time - so we can get it merged)

Looks like quite a bit of work in #44.

I was trying out returning a named tuple and got it to work, but noticed that my command was taking 5 seconds instead of around 4 seconds before.

So I decided to test creation times of tuples, dicts, named tuples and class instances. Named tuples seem to be the slowest to create:

In [1]: %timeit tuple_style = ('chr1\t3094805\t3095305\tchr1:3094805-3095305', 0.06325921607762575, 0.06325921607762575, 0.044780001044273376, 0.044780001044273376, 0.08956000208854675, 0.08956000208854675, 500, 500, 31.629608038812876)
6.97 ns ± 0.0829 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)


In [3]: %timeit dict_style = {"name": 'chr1\t3094805\t3095305\tchr1:3094805-3095305', "mean": 0.06325921607762575, "mean0": 0.06325921607762575, "min": 0.044780001044273376, "min0": 0.044780001044273376, "max": 0.08956000208854675, "max0": 0.08956000208854675, "size": 500, "bases": 500, "sum": 31.629608038812876}
202 ns ± 6.27 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)


In [5]: %timeit named_tuple_style = pybigtools.BigWigAverageOverBedSummaryStatistics(name='chr1\t3094805\t3095305\tchr1:3094805-3095305', mean=0.06325921607762575, mean0=0.06325921607762575, min=0.044780001044273376, min0=0.044780001044273376, max=0.08956000208854675, max0=0.08956000208854675, size=500, bases=500, sum=31.629608038812876)
548 ns ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

class BigWigAverageOverBedSummaryStatisticsClass:
    def __init__(self, name, mean, mean0, min, min0, max, max0, size, bases, sum):
        self.name = name
        self.mean = mean
        self.mean0 = mean0
        self.min = min
        self.min0 = min0
        self.max = max
        self.max0 = max0
        self.size = size
        self.bases = bases
        self.sum = sum
        
In [12]: %timeit class_style = BigWigAverageOverBedSummaryStatisticsClass(name='chr1\t3094805\t3095305\tchr1:3094805-3095305', mean=0.06325921607762575, mean0=0.06325921607762575, min=0.044780001044273376, min0=0.044780001044273376, max=0.08956000208854675, max0=0.08956000208854675, size=500, bases=500, sum=31.629608038812876)
449 ns ± 7.09 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

It would be nice if bigwigaverageoverbed returned mean, min, max (only considering positions that have actual values, returning NaN if needed) and mean0, min0, and max0 (assuming 0.0 for all positions that don't have an entry in the bigWig). This would change the behavior of mean compared to Kent tools (NaN instead of 0.0 when no values are found), but it feels more consistent.
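
To make the proposed semantics concrete, a minimal pure-Python sketch (the helper name and the use of NaN markers for uncovered bases are illustrative assumptions, not pybigtools API):

import math

def summarize(values):
    # values: one float per base in the BED interval, NaN where the bigWig has no entry.
    covered = [v for v in values if not math.isnan(v)]
    zero_filled = [0.0 if math.isnan(v) else v for v in values]
    return {
        # Covered-only statistics: NaN when no base in the interval is covered.
        "mean": sum(covered) / len(covered) if covered else math.nan,
        "min": min(covered) if covered else math.nan,
        "max": max(covered) if covered else math.nan,
        # "0" variants: uncovered bases count as 0.0.
        "mean0": sum(zero_filled) / len(zero_filled),
        "min0": min(zero_filled),
        "max0": max(zero_filled),
        "size": len(values),
        "bases": len(covered),
        "sum": sum(covered),
    }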

-fn pybigtools(_py: Python, m: &PyModule) -> PyResult<()> {
+fn pybigtools(py: Python, m: &PyModule) -> PyResult<()> {
     m.add("__version__", env!("CARGO_PKG_VERSION"))?;
     m.add_wrapped(wrap_pyfunction!(open))?;
     m.add_wrapped(wrap_pyfunction!(bigWigAverageOverBed))?;
@@ -1366,5 +1382,25 @@ fn pybigtools(_py: Python, m: &PyModule) -> PyResult<()> {
 
     m.add_class::<EntriesIterator>()?;
 
+    let collections = PyModule::import(py, "collections").unwrap();
+    let namedtuple = collections.getattr("namedtuple").unwrap();
+    let bigwigaverageoverbed_summary_statistics = namedtuple
+        .call(
+            (
+                "BigWigAverageOverBedSummaryStatistics".to_object(py),
+                (
+                    "name", "mean", "mean0", "min", "min0", "max", "max0", "size", "bases", "sum",
+                )
+                    .to_object(py),
+            ),
+            None,
+        )
+        .unwrap();
+    m.add(
+        "BigWigAverageOverBedSummaryStatistics",
+        bigwigaverageoverbed_summary_statistics,
+    )
+    .unwrap();
+
     Ok(())
 }

@jackh726 (Owner)

I'm simultaneously surprised and not surprised that both dict and namedtuple have that much worse creation performance than a tuple.

I think, at least for now, I'm not so worried about the time difference. I imagine if speed is needed at that level, people would either drop down to Rust, or use some future API that allows batching and returning a pre-allocated array or something.

I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of the existing ones compared to the UCSC tools, since that will either get confusing or become too much. For now, I'd recommend just calling values (which will have a robust enough API) and doing the summary stats on that.
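
For example, something along these lines (a rough sketch assuming NumPy and that values() gives one value per base with NaN for uncovered positions; the exact signature and NaN/0.0 fill behavior may differ):

import numpy as np
import pybigtools

with pybigtools.open("test.bw", "r") as bw:
    vals = np.asarray(bw.values("chr1", 3094805, 3095305), dtype=np.float64)
    # nan-aware reductions ignore uncovered bases entirely
    mean, vmin, vmax = np.nanmean(vals), np.nanmin(vals), np.nanmax(vals)
    # treat uncovered bases as 0.0 for the "0" variants
    filled = np.nan_to_num(vals, nan=0.0)
    mean0, min0, max0 = filled.mean(), filled.min(), filled.max()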

I think I'm pretty happy with the branch I have at https://github.com/jackh726/bigtools/tree/ghuls/add_more_bigwigaverageoverbed_stats. I updated the API to match the comments above. Would love to hear your thoughts on it.

@ghuls (Contributor, Author) commented Jun 25, 2024

Looks good. Do you know how I can check that branch out to try it locally?

@jackh726 (Owner)

git fetch https://github.com/jackh726/bigtools.git bigtools_refactor_rebase2:bwaob_refactor
git checkout bwaob_refactor

should work

@ghuls (Contributor, Author) commented Jun 26, 2024

Managed to get the correct branch:

git fetch https://github.com/jackh726/bigtools.git ghuls/add_more_bigwigaverageoverbed_stats:bigtools_ghuls_add_more_bigwigaverageoverbed_stats

The help text needs to be fixed:

    If ``"all"`` is specified, all summary statistics are returned in a named tuple. 

    If a single statistic is provided as a string, that statistic is returned
    as a float.
    If a single statistic is provided as a string, that statistic is returned
    as a float or int depending on the statistic.

Giving an unknown statistic (like "all") does not trigger an exception or an error message:

In [32]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["mean0"]))
    ...: x[0]
    ...: 
    ...: 
CPU times: user 4.93 s, sys: 36.2 ms, total: 4.97 s
Wall time: 4.96 s
Out[32]: 0.06325921607762575

In [33]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["all"]))
    ...: x[0]
    ...: 
    ...: 
CPU times: user 90 µs, sys: 2 µs, total: 92 µs
Wall time: 95.8 µs
Out[33]: 0.06325921607762575

In [34]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["all2"]))
    ...: x[0]
    ...: 
    ...: 
CPU times: user 49 µs, sys: 1e+03 ns, total: 50 µs
Wall time: 52.9 µs
Out[34]: 0.06325921607762575

In [35]: %%time
    ...: with pybigtools.open("test.bw", "r") as bw:
    ...:     x = list(bw.average_over_bed("consensus_peaks.bed", stats=["mean0", "mean", "min", "max", "size", "bases", "sum"]))
    ...: x[0]
    ...: 
    ...: 
CPU times: user 5.13 s, sys: 80 ms, total: 5.21 s
Wall time: 5.2 s
Out[35]: 
(0.06325921607762575,
 0.06325921607762575,
 0.044780001044273376,
 0.08956000208854675,
 500,
 500,
 31.629608038812876)

@ghuls (Contributor, Author) commented Jun 26, 2024

It would be nice to have a method on BBIRead to reinitialize it (similar to seek(0)), so that you can run e.g. multiple average_over_bed calls on the same bigWig filehandle:

with pybigtools.open("test.bw", "r") as bw:
    x = list(bw.average_over_bed("consensus_peaks_set1.bed", stats=["mean0"]))
    y = list(bw.average_over_bed("consensus_peaks_set2.bed", stats=["mean0"]))

@ghuls (Contributor, Author) commented Jun 26, 2024

Opening BED files as a Path does not work (works for opening the bigwig):

bw.average_over_bed(Path("file.bed"))

@jackh726 (Owner)

> The help text needs to be fixed:
>
>     If ``"all"`` is specified, all summary statistics are returned in a named tuple.
>
>     If a single statistic is provided as a string, that statistic is returned
>     as a float.
>     If a single statistic is provided as a string, that statistic is returned
>     as a float or int depending on the statistic.

Done

> Giving an unknown statistic (like "all") does not trigger an exception or an error message:

Strange...there is a test that shows that it does:

assert pytest.raises(ValueError, bw.average_over_bed, path, None, 'bad')
assert pytest.raises(ValueError, bw.average_over_bed, path, None, ['min', 'bad'])
assert pytest.raises(ValueError, bw.average_over_bed, path, None, ['min', 'all'])

> It would be nice to have a method on BBIRead to reinitialize it (similar to seek(0)), so that you can run e.g. multiple average_over_bed calls on the same bigWig filehandle:
>
> with pybigtools.open("test.bw", "r") as bw:
>     x = list(bw.average_over_bed("consensus_peaks_set1.bed", stats=["mean0"]))
>     y = list(bw.average_over_bed("consensus_peaks_set2.bed", stats=["mean0"]))

This already should have worked - added a test to show.

> Opening BED files as a Path does not work (works for opening the bigwig):
>
> bw.average_over_bed(Path("file.bed"))

Added this + test

@jackh726 (Owner)

(Feel free to just force-push that branch to this PR branch - I can do it directly if you don't know how. I do want to give you credit for your work here.)

@ghuls (Contributor, Author) commented Jun 27, 2024

Calculating multiple average_over_bed results on the same BBIRead handle and the "all" statistics now work. Not sure what was wrong with my installation before.

In [14]: %%time
    ...: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     y_list = list(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"))
    ...:     x_list = list(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="all"))
    ...: print(y_list[0], x_list[0])
    ...: 
    ...: 
0.06325921607762575 SummaryStatistics(size=500, bases=500, sum=31.629608038812876, mean0=0.06325921607762575, mean=0.06325921607762575, min=0.044780001044273376, max=0.08956000208854675)
CPU times: user 8.71 s, sys: 188 ms, total: 8.9 s
Wall time: 8.89 s

Raising an exception from a with block does not work for me:

import pybigtools
from pathlib import Path

# Don't import numpy, to show that the with block does not propagate the exception.
# import numpy as np

In [2]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_list = list(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"))
   ...:     x_list = list(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"))
   ...: print(y_list[0], x_list[0])
   ...: 
   ...: 
0.06325921607762575 0.08956000208854675
CPU times: user 8.1 s, sys: 228 ms, total: 8.33 s
Wall time: 8.32 s

# Numpy is not imported, so the code should throw an exception, but doesn't:
In [3]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
   ...: print(y_array[0], x_array[0])
   ...: 
   ...: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:4

NameError: name 'y_array' is not defined

In [4]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
   ...: #print(y_array[0], x_array[0])
   ...: 
   ...: 
CPU times: user 47 μs, sys: 1 μs, total: 48 μs
Wall time: 50.5 μs

# In a normal with block it throws an exception:
In [5]: %%time
   ...: with open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.read(), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.read(), dtype=np.float32)
   ...: print(y_array[0], x_array[0])
   ...: 
   ...: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:2

NameError: name 'np' is not defined


# After importing numpy
import numpy as np

In [9]: %%time
   ...: with pybigtools.open(Path("test.bw"), "r") as bw:
   ...:     y_array = np.fromiter(bw.average_over_bed("consensus_peaks.bed", names=None, stats="mean0"), dtype=np.float32)
   ...:     x_array = np.fromiter(bw.average_over_bed(Path("consensus_peaks.bed"), names=None, stats="max"), dtype=np.float32)
   ...: print(y_array[0], x_array[0])
   ...: 
   ...: 
0.063259214 0.08956
CPU times: user 8.24 s, sys: 160 ms, total: 8.4 s
Wall time: 8.4 s

Minimal example:

In [11]: with pybigtools.open(Path("test.bw"), "r") as bw:
    ...:     raise ValueError("Some error")
    ...: 

In [12]: with open(Path("test.bw"), "r") as bw:
    ...:     raise ValueError("Some error")
    ...: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[12], line 2
      1 with open(Path("test.bw"), "r") as bw:
----> 2     raise ValueError("Some error")

ValueError: Some error
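
(For reference, this silent-swallowing symptom is exactly what a with block does when the context manager's __exit__ returns a truthy value. A minimal pure-Python sketch of that behavior, unrelated to the actual pybigtools/pyo3 implementation:)

class SwallowingHandle:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Returning True tells Python the exception was handled,
        # so it is silently suppressed instead of propagating.
        return True

with SwallowingHandle():
    raise ValueError("Some error")  # never reaches the caller

print("execution continues")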

@jackh726 (Owner)

> Raising an exception from a with block does not work for me:

Thanks, this was a problem with #44, which is fixed now (with a test)

@jackh726 force-pushed the add_more_bigwigaverageoverbed_stats branch from 86485e6 to b391f93 on June 27, 2024, 16:08
@ghuls (Contributor, Author) commented Jun 28, 2024

> Raising an exception from a with block does not work for me:
>
> Thanks, this was a problem with #44, which is fixed now (with a test)

I now assume this was the reason why invoking average_over_bed twice on the bigWig filehandle didn't work, as I might have used some code that raised an exception.

Exceptions are now raised properly.

Is there a timeframe for a 0.2.0 release with this pull request merged? We hope to use it soon in a package we want to publish.

@ghuls (Contributor, Author) commented Jun 28, 2024

Seems like bigWigAverageOverBed is semi-deprecated:
https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ

If that is really the case, could the min0/max0 options be considered in the future? (They could be behind a different flag than --min-max, so --min-max stays compatible with Kent tools.)

> I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of the existing ones compared to the UCSC tools, since that will either get confusing or become too much. For now, I'd recommend just calling values (which will have a robust enough API) and doing the summary stats on that.

@jackh726 force-pushed the add_more_bigwigaverageoverbed_stats branch from b391f93 to 34a3691 on June 28, 2024, 16:13
@jackh726 (Owner)

> I now assume this was the reason why invoking average_over_bed twice on the bigWig filehandle didn't work, as I might have used some code that raised an exception.
>
> Exceptions are now raised properly.

Good to hear

> Is there a timeframe for a 0.2.0 release with this pull request merged? We hope to use it soon in a package we want to publish.

Maybe today, most likely sometime this weekend

> Seems like bigWigAverageOverBed is semi-deprecated: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ
>
> If that is really the case, could the min0/max0 options be considered in the future? (They could be behind a different flag than --min-max, so --min-max stays compatible with Kent tools.)

Potentially! I do think that the general question of whether uncovered is treated as 0 or not is important - it's relevant to values too.

> I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of the existing ones compared to the UCSC tools, since that will either get confusing or become too much. For now, I'd recommend just calling values (which will have a robust enough API) and doing the summary stats on that.

Just pushed a commit for this.

@jackh726 (Owner)

Going to go ahead and merge this - happy to follow up on min0/max0.

@jackh726 merged commit 5161c9a into jackh726:master on Jun 28, 2024
4 checks passed
@ghuls (Contributor, Author) commented Jun 28, 2024

> Is there a timeframe for a 0.2.0 release with this pull request merged? We hope to use it soon in a package we want to publish.
>
> Maybe today, most likely sometime this weekend

Great!

> Seems like bigWigAverageOverBed is semi-deprecated: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ
>
> If that is really the case, could the min0/max0 options be considered in the future? (They could be behind a different flag than --min-max, so --min-max stays compatible with Kent tools.)
>
> Potentially! I do think that the general question of whether uncovered is treated as 0 or not is important - it's relevant to values too.

I think pyBigWig returns np.nan for those positions.
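
For comparison, a small sketch of checking that with pyBigWig (written from memory, so treat the exact behavior as an assumption):

import math
import pyBigWig

bw = pyBigWig.open("test.bw")
# Positions without coverage come back as nan rather than 0.0.
vals = bw.values("chr1", 3094805, 3095305)
print(any(math.isnan(v) for v in vals))
bw.close()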

> I think I probably agree that fully uncovered regions should return nan instead of 0. I'm less inclined to add more stats or change the semantics of the existing ones compared to the UCSC tools, since that will either get confusing or become too much. For now, I'd recommend just calling values (which will have a robust enough API) and doing the summary stats on that.

> Just pushed a commit for this.

Thanks.

@ghuls (Contributor, Author) commented Jul 4, 2024

Seems like UCSC does not have time to fix bigWigAverageOverBed: https://groups.google.com/a/soe.ucsc.edu/g/genome/c/fh9nNK52Qrw/m/ITy1wlCBBQAJ

> Thank you for providing those details. Unfortunately, we don't have the bandwidth to fix this issue with bigWigAverageOverBed at this time.
>
> However, we can point people to the bigtools suite in the future if they ask about tools with similar functionality.

So I guess adding min0/max0 to bigtools should not be a big issue if bigWigAverageOverBed is basically abandoned.

Successfully merging this pull request may close: Implement -minMax option of bigWigAverageOverBed.