
Large data sets cause IOError #4

Open
forestdussault opened this issue Nov 27, 2017 · 1 comment
@forestdussault

I'm not sure if this is an intended use case for Neptune, but I attempted to run the program with ~150 inclusion genomes (450 MB) and ~8000 exclusion genomes (32 GB), and it crashed before completion. Here is the log from my console:

Estimating k-mer size ...
k = 25

k-mer Counting...
Submitted 8164 jobs.
44.61319 seconds

k-mer Aggregation...
Submitted 65 jobs.
Traceback (most recent call last):
  File "/home/dussaultf/miniconda3/envs/neptune/bin/neptune-conda", line 11, in <module>
    load_entry_point('neptune==1.2.5', 'console_scripts', 'neptune')()
  File "/home/dussaultf/miniconda3/envs/neptune/lib/python2.7/site-packages/neptune/Neptune.py", line 986, in main
    parse(parameters)
  File "/home/dussaultf/miniconda3/envs/neptune/lib/python2.7/site-packages/neptune/Neptune.py", line 765, in parse
    executeParallel(parameters)
  File "/home/dussaultf/miniconda3/envs/neptune/lib/python2.7/site-packages/neptune/Neptune.py", line 749, in executeParallel
    execute(execution)
  File "/home/dussaultf/miniconda3/envs/neptune/lib/python2.7/site-packages/neptune/Neptune.py", line 662, in execute
    aggregateKMers(execution, inclusionKMerLocations, exclusionKMerLocations)
  File "/home/dussaultf/miniconda3/envs/neptune/lib/python2.7/site-packages/neptune/Neptune.py", line 290, in aggregateKMers
    inclusionKMerLocations, exclusionKMerLocations)
  File "/home/dussaultf/miniconda3/envs/neptune/lib/python2.7/site-packages/neptune/Neptune.py", line 356, in aggregateMultipleFiles
    execution.jobManager.runJobs(jobs)
  File "/home/dussaultf/miniconda3/envs/neptune/lib/python2.7/site-packages/neptune/JobManagerParallel.py", line 138, in runJobs
    self.synchronize(jobs)
  File "/home/dussaultf/miniconda3/envs/neptune/lib/python2.7/site-packages/neptune/JobManagerParallel.py", line 178, in synchronize
    job.get()  # get() over wait() to propagate excetions upwards
  File "/home/dussaultf/miniconda3/envs/neptune/lib/python2.7/multiprocessing/pool.py", line 572, in get
    raise self._value
IOError: [Errno 24] Too many open files: '/mnt/scratch/Forest/neptune_analysis/output_debug/kmers/exclusion/GCF_001642675.1_ASM164267v1_genomic.fna.kmers.AAA'
@emarinier
Member

Thanks for reporting this error.

What's happening is that each aggregation job opens a temporary file for each input file (~150 + ~8000). I suspect that opening ~8150 files simultaneously exceeds the operating system's per-process open-file limit, which is what raises this IOError (Errno 24).
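
For context, the per-process open-file limit that Errno 24 refers to can be inspected (and, within the hard limit, raised) from Python on Unix-like systems. This is just an illustrative check, not part of Neptune:

```python
import resource

# Query the soft and hard limits on open file descriptors (RLIMIT_NOFILE).
# IOError [Errno 24] ("Too many open files") is raised once a process
# exceeds the soft limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft=%d, hard=%d" % (soft, hard))

# The soft limit can usually be raised up to the hard limit without extra
# privileges (the shell equivalent is `ulimit -n`), which may give enough
# headroom for a run of this size.
if soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```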

The problem is that there is currently no input parameter that avoids this. The number of aggregation jobs can be changed, but each job will still try to open as many files simultaneously as there are inputs.

The short-term solution would be to run Neptune with fewer input files. I believe the biggest data set we've run the software on had approximately 800 total input files. The long-term solution (on my end) might involve limiting the software to perform aggregation in iterative batches, with a reasonable number of files open simultaneously; a rough sketch of that idea is below.
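
For anyone hitting this in the meantime, here is a rough sketch of what batched aggregation could look like. This is not Neptune's actual code: the function names are hypothetical, it assumes the per-genome k-mer files are already sorted, and it omits the per-genome bookkeeping the real aggregation performs. It only illustrates keeping a bounded number of files open at once:

```python
import heapq
import os
import shutil
import tempfile


def merge_batch(paths, out_path):
    # Merge a handful of *sorted* k-mer files into one sorted file,
    # holding only len(paths) file handles open at a time.
    handles = [open(p) for p in paths]
    try:
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*handles))
    finally:
        for handle in handles:
            handle.close()


def aggregate_in_batches(kmer_paths, final_path, batch_size=64):
    # Repeatedly merge the inputs in batches of `batch_size` until a single
    # file remains, so the number of simultaneously open files is bounded by
    # `batch_size` rather than by the number of input genomes.
    current = list(kmer_paths)
    while len(current) > 1:
        merged = []
        for i in range(0, len(current), batch_size):
            fd, tmp = tempfile.mkstemp(suffix=".kmers")
            os.close(fd)
            merge_batch(current[i:i + batch_size], tmp)
            merged.append(tmp)
        current = merged
    shutil.move(current[0], final_path)
```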
