Out of Memory but the job seemed already finished #143
Update: The file content in the summary folder of the OOM run:
The file content in the summary folder of a completed run:
Is there a way to resume EarlGrey from where it failed? |
Hi @ting-hsuan-chen! In this case it is likely that the OOM event prevented proper processing during the divergence calculations, where the annotations are read into memory to calculate Kimura divergence. It is probably worth rerunning these jobs just to make sure. You can rerun the failed steps of EarlGrey here by deleting |
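A minimal sketch of that resume pattern, assuming the exact subdirectory named in the (truncated) comment above; the path below is only a placeholder and the earlGrey flag names are taken from its documentation (check `earlGrey -h` if they differ):

```bash
# Sketch only: remove (or better, move) the output of the interrupted step,
# then rerun EarlGrey with the identical command line and output directory.
OUTDIR=/path/to/earlgrey_out/mygenome_EarlGrey   # placeholder path

# move rather than delete, in case the step had in fact finished
mv "${OUTDIR}/<failed_step_dir>" "${OUTDIR}/<failed_step_dir>.bak"

# steps whose outputs already exist are skipped on rerun
# (per the later comments in this thread)
earlGrey -g mygenome.fasta -s mygenome -o /path/to/earlgrey_out -t 16
```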
Thank you @TobyBaril, I'll try it. |
Hi @TobyBaril, it turned out I restarted a fresh run using |
Hi @ting-hsuan-chen, the library construction terminates after On this, I've made a note to add another subscript for the final annotation and defragmentation for the next release! |
Thank you @TobyBaril! I allocated 10 CPUs and a total of 100 GB of memory for each genome (each about 500-600 Mb in size). For some genomes, I still got the "Out Of Memory" error from slurm when using For your reference, the tail of the log file for the slurm job with the OOM error is attached. It seems that the TEstrainer step has completed? If not, how do I resume
|
Hi! You can resume by following the comment. In the case of the log you posted it looks as though the job should have successfully completed. OOM in slurm can be a bit odd though as it terminates the job without giving it a chance to finish nicely, so there's no guarantee the step it interrupted finished properly. |
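A quick way to see what SLURM actually recorded for such a job is shown below; the job ID is a placeholder, and `seff` is a common SLURM contrib tool that may not be installed on every cluster:

```bash
# Peak memory (MaxRSS) per job step shows whether the OOM kill was a marginal
# overshoot or something much larger; 12345 is a placeholder job ID.
sacct -j 12345 --format=JobID,JobName,State,ExitCode,Elapsed,ReqMem,MaxRSS

# If available, seff prints a one-page CPU/memory efficiency summary
# for a finished job.
seff 12345
```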
Hi, I've also had the same issue when running EarlGrey on a SLURM HPC, and what surprised me is that for exactly the same input file, one run consumed more than 300 GB of RAM (and generated an OOM error) while the second run ended without error and a maximum consumption of ca. 20 GB of RAM. |
Hi @jeankeller, This is strange...are you happy to provide the log files from both runs for us to take a look at? |
Hi @TobyBaril Best, |
Hi Jean, It shouldn't be related to conda. It might be related to the divergence calculations, several of which run in parallel, although this shouldn't really cause any issues until we get to files with millions of TE hits... I'll continue trying to narrow this down - it is a strange one, as nothing in the core pipeline has changed for several iterations now! |
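As a rough gauge of how much the divergence step has to hold in memory, one can count the TE hits in the annotation; the filename below is only a placeholder for whichever GFF sits in the summary folder:

```bash
# Count non-comment lines (one per TE hit) in the final annotation GFF.
grep -vc '^#' mygenome.filteredRepeats.gff
```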
Hi Toby, Yes, it is weird. The HPC team installed EarlGrey as a SLURM module instead of a user conda environment, and in the tests I have run it looks like the error has gone. I am running more tests on species with different genome sizes to confirm the pattern. I can share with you the log of the failed run (under the conda environment) that used more than 300 GB of RAM. We redid it in exactly the same way and it consumed only 10-15 GB of RAM. |
Hi Toby, Cheers |
Okay so this looks like it could be linked to something in TEstrainer, or potentially a conda module. @jamesdgalbraith might be able to provide more information on specific sections of TEstrainer that could be the culprit, but we will look into it |
The memory-hungry stage of TEstrainer is the multiple sequence alignment step using MAFFT, and the amount of memory used can vary between runs on the same genome depending on several factors, including which repeats that particular run of RepeatModeler found (the seed it uses varies), especially if it detects a satellite repeat, as constructing MSAs of long arrays of tandem repeats is very memory-hungry. This may be what you're encountering, @jeankeller. Unfortunately I don't currently have a fix for this, but I have been exploring potential ways of overcoming this issue. In the first jobs you mentioned, @ting-hsuan-chen, I don't think it's OOM in TEstrainer due to the presence of the |
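A hedged way to spot the families that tend to make those MSAs expensive (anything labelled as satellite, plus unusually long consensus sequences); the filename is a placeholder for the `<species>-families.fa` library produced by RepeatModeler:

```bash
# How many families are labelled as satellites in the library
grep -c 'Satellite' mygenome-families.fa

# List the longest consensus sequences (length <TAB> header), longest first
awk '/^>/{if (name) print len "\t" name; name=$0; len=0; next}
     {len += length($0)}
     END{if (name) print len "\t" name}' mygenome-families.fa | sort -nr | head
```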
Hi all, |
Hi @jeankeller, thanks for the update! It would be great to have the log file if possible. I've had a chat with @jamesdgalbraith and we think this might be related to running several instances of MAFFT in parallel, particularly on big genomes when generating alignments for families with high copy number. We are working on refining the memory management, but it would still be useful to check the logs to make sure this is indeed the issue you have faced with this genome. Thanks! |
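For intuition only (this is not an EarlGrey or TEstrainer option): peak RAM scales roughly with the number of simultaneous alignments times the footprint of the largest one, so capping concurrency bounds the peak. A generic illustration with GNU parallel over hypothetical per-family FASTA files:

```bash
# At most 4 MAFFT alignments run at once, so peak memory is bounded by roughly
# 4x the largest single alignment rather than by the total number of families.
ls family_*.fa | parallel -j 4 'mafft --thread 1 {} > {.}.aln'
```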
Hi @TobyBaril, |
Hey all! Just a quick update on this - we have tracked this to the memory-hungry EMBOSS |
Thanks @TobyBaril! I'm looking forward to it!!! |
This has been pushed in release |
Hi Toby, |
Version 5.1.0 contains the changes that should solve these issues. It is currently waiting for approval in bioconda so should be live later this evening or tomorrow depending on when it gets a review (it has already passed all the appropriate tests). |
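Once the bioconda review is through, an existing install can be brought up to date with something like the following; the bioconda package name is assumed to be `earlgrey`:

```bash
# Update an existing environment (package name assumed to be "earlgrey")
conda update -c conda-forge -c bioconda earlgrey

# or pin the new version explicitly in a fresh environment
conda create -n earlgrey-5.1.0 -c conda-forge -c bioconda earlgrey=5.1.0
```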
excellent, thank you! I'll try it asap :) |
This is awesome! |
Hi, Happy new year 2025! Best, |
Hi all, I have the OOM issue with slurm also. I am running earlGreyLibConstruct with 250 GB of memory and 32 CPUs using Apptainer with slurm. The 5 rounds of RepeatModeler have completed. It seems like the OOM kill event occurred during the strainer step (at 5 am this morning). These are the files that are present in the strained dir:
And in the summary dir, also with the 5 am timestamp:
It seems that strainer completed and then the next step failed immediately.
Because the families.fa.strained file is present in the summaryFiles dir, I wasn't sure whether I should remove the contents of the strained dir to resume, because perhaps it actually completed. Any idea what the problem was with lines 193-195 in the script? Many thanks, |
Thanks for sharing @gemmacol. This looks like TEstrainer finished as expected (@jamesdgalbraith could confirm for sure). The OOM is potentially happening during the file tidying steps of TEstrainer. In your tree above, can you check the size of This should help us to narrow down exactly where the issue is. Unfortunately, this seems to mainly be an issue on (very) large and/or highly repetitive genomes, so is taking us some time to work out exactly what is happening. |
Hi Toby, I have checked the two things you suggested and indeed it seems that TEstrainer finished and the issue is further down the pipeline.
AXX_polished_unitigs-families.fa.strained = 45581630
AXX_polished_unitigs-families.fa.nonsatellite.classified = 45150251
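For reference, the kind of check behind those numbers (filenames taken from the comment above): byte sizes plus the number of FASTA records in each file:

```bash
# Sizes of the two library files mentioned above
ls -l AXX_polished_unitigs-families.fa.strained \
      AXX_polished_unitigs-families.fa.nonsatellite.classified

# Number of sequences in each, as a sanity check that the strained library
# made it through intact
grep -c '^>' AXX_polished_unitigs-families.fa.strained
grep -c '^>' AXX_polished_unitigs-families.fa.nonsatellite.classified
```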
Is there a way I can just test the next step? Because of the error in the log file:
And is the file Also, do you know Dr Sarah Semeraro from your building? If you see her, say hi from me! Many thanks, |
Hi @gemmacol, Thanks for the update! In this case, I reckon the culprit might be the divergence calculator, as this performs lots of alignments, which gets quite memory hungry on very large annotation files (i.e. lots of alignments to do). The All steps will be skipped up to the defragmentation step. My guess is the OOM might occur in this or the following step. If we can narrow this down, it might help us to come up with a solution for large genomes. I haven't run into Sarah, but if I do I'll say hello! Best Wishes, Toby |
@gemmacol I've realised you're running just the library construction, so (weirdly) for you the pipeline actually finished successfully, even with the OOM message. Each step was successful and the appropriate files were generated and look as they should, so in this case either the OOM just occurred as TEstrainer finished, or it occurred during the cleaning up of all the files (which seems a little unlikely). |
@jeankeller would you be able to share the |
@jamesdgalbraith yes sure. You can find the files here: https://sdrive.cnrs.fr/s/MaWzpD46fm9MjjJ |
Sorry for the long delay @jeankeller, I think I've found the error but not the bug causing it. Looking at the GFF, it appears all the repeats in it were discovered using HELIANO and none of the RepeatModeler+TEstrainer repeats were found. I'm assuming this is a bug, as the TEstrainer repeat library has over 2000 sequences in it! We'll see if we can narrow down what happened and get back to you. |
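A quick, hedged way to tally which tool each annotation line came from, assuming the source is recorded in GFF column 2 (adjust the field if EarlGrey stores it in the attributes instead); the filename is a placeholder:

```bash
# Tally the GFF source column (column 2 in standard GFF)
grep -v '^#' annotation.gff | cut -f2 | sort | uniq -c | sort -nr

# Fallback: count lines that mention a given tool anywhere on the line
grep -v '^#' annotation.gff | grep -ci 'HELIANO'
```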
Hello!
I ran EarlGrey (v4.4.4) for multiple genomes (between 500 and 600 Mb in size) using Slurm. Some jobs completed, but others showed Out Of Memory (exit code 0).
For those OOM jobs, I checked the log file generated by earlGrey and it seemed that the pipeline had completed, like the following:
And the number of files in the summary folder is the same as for the genomes with a completed run.
What would be the cause of the OOM error? Which step is the most RAM-consuming? Should I rerun EarlGrey for the jobs with the OOM error, or can I ignore it? Or could it be a problem caused by our Slurm system instead?
P.S. I used 16 cores and 60 GB of RAM for each job.
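Those resources map onto a SLURM header along these lines (a sketch only; paths, species label, and time limit are placeholders, and the earlGrey flags are assumed from its documentation):

```bash
#!/bin/bash
#SBATCH --job-name=earlgrey_mygenome
#SBATCH --cpus-per-task=16
#SBATCH --mem=60G
#SBATCH --time=7-00:00:00

# placeholders throughout; flag names assumed from the EarlGrey docs
earlGrey -g mygenome.fasta -s mygenome -o "$PWD/earlgrey_out" -t 16
```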
Any guidance is much appreciated.
Cheers
Ting-Hsuan