Add DBs needed for MAGs workflow to CVMFS #945

Open
paulzierep opened this issue Mar 20, 2025 · 14 comments
Comments

@paulzierep

paulzierep commented Mar 20, 2025

We want to add a MAGs workflow to IWC, and it needs some DBs in CVMFS. The DBs are available on .eu and can all be installed via DMs.
Can we provide a list of DBs to be copied to CVMFS?
What exactly do you need for that: the DMs with parameters, or the .loc entries on .eu?

@paulzierep
Author

  • SemiBin

@SantaMcCloud

SantaMcCloud commented Mar 27, 2025

  • gtdbtk_database_metadata_versioned
  • CheckM2
  • Update NCBI DB?

@paulzierep
Author

They need to go in here:
Xref: http://datacache.galaxyproject.org/indexes/mapseq/

@paulzierep
Author

checkm2:

Table: checkm2

1.0.2	1.0.2	CheckM2 diamond DB downloaded with version 1.0.2	1.0.2	/data/db/data_managers/checkm2/1.0.2/uniref100.KO.1.dmnd

semibin

Table: gtdb

17102022	GTDB reference genome generated by MMseqs2 used in SemiBin	gtdb	/data/db/data_managers/semibin/data/gtdb

GTDB-tk

Table: gtdbtk_database_versioned

full_database_release_220_downloaded_2024-10-28	Full Database - release 220 (2024-10-28)	220	/data/db/data_managers/gtdbtk_database_versioned/full_database_release_220_downloaded_2024-10-28

bakta

Table: amrfinderplus_versioned_database

amrfinderplus_V3.12_2024-05-02.2	V3.12-2024-05-02.2	3.12	/data/db/data_managers/amrfinderplus-db/amrfinderplus_V3.12_2024-05-02.2

and
Table: bakta_database

V5.1_2024-01-19	10522951	1.7	/data/db/data_managers/bakta_database/10522951

ncbi taxonomy

Table: ncbi_taxonomy

2024-06-05	2024-06-05	/data/db/data_managers/ncbi_taxonomy/2024-06-05

gtdbtk metadata

Table: gtdbtk_database_metadata_versioned

GTDB-Tk Database Version 220 Metadata from 09-10-2024 _release_220_downloaded_2024-09-10	GTDB-Tk Database Version 220 Metadata from 09-10-2024	220	/data/db/data_managers/gtdbtk_database_metadata_versioned/GTDB-Tk Database Version 220 Metadata from 09-10-2024 _release_220_downloaded_2024-09-10
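
For reference, each entry above is one tab-separated line of a `.loc` file. The following is a minimal Python sketch of splitting such a line into fields. It assumes only that columns are separated by single tabs, that the first column is the ID of the entry, and that the last column is the path (true of every entry quoted above). `LocEntry` and `parse_loc_line` are hypothetical helper names, not part of Galaxy.

```python
from typing import NamedTuple


class LocEntry(NamedTuple):
    """One row of a .loc file (hypothetical helper, not a Galaxy API)."""
    value: str      # first column; assumed to be the ID of the entry
    middle: tuple   # remaining columns (name, version, ...); table-specific
    path: str       # last column; where the data lives on disk


def parse_loc_line(line: str) -> LocEntry:
    # Assumption: columns are separated by single tabs and the last one is the path.
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError(f"not a tab-separated .loc entry: {line!r}")
    return LocEntry(value=fields[0], middle=tuple(fields[1:-1]), path=fields[-1])


# The ncbi_taxonomy entry quoted above.
entry = parse_loc_line(
    "2024-06-05\t2024-06-05\t/data/db/data_managers/ncbi_taxonomy/2024-06-05"
)
print(entry.value, entry.path)
```

The meaning of the middle columns differs per table, so the sketch keeps them as an uninterpreted tuple.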

@SantaMcCloud

SantaMcCloud commented Mar 28, 2025

CheckM2

current DB: https://zenodo.org/records/5571251/files/checkm2_database.tar.gz
after galaxyproject/tools-iuc#6861 is merged and the DB is downloaded via the DM: https://zenodo.org/records/14897628/files/checkm2_database.tar.gz

SemiBin

current DB: https://zenodo.org/record/4751564/files/GTDB_v95.tar.gz
after galaxyproject/tools-iuc#6817 is merged and the DB is downloaded via the DM:

=> This data will be used by MMseqs2 to build the DB that SemiBin uses. The current DM downloads an old version that is already a finished MMseqs2 build.

GTDB-Tk

https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz

GTDB-Tk Metadata

=> These are the latest versions, which are used in the DM. Older ones can also be downloaded with it.

Bakta

https://zenodo.org/records/10522951/files/db.tar.gz

=> Latest version, also used in the DM

Amrfinderplus

Data will be downloaded using this command: wget -nd -np -r ftp://anonymous:[email protected]/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/3.12/2024-05-02.2 -P <output_dir_name>. I could not find a direct FTP link, sorry if one is needed.

NCBI

https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz

=> It has to be this link. NOTE: NCBI updated this file this morning, which means the DB currently installed via the DM is the old version.

@natefoo
Member

natefoo commented Mar 28, 2025

checkm2

nate@alchemist% cat /cvmfs/data.galaxyproject.org/byhand/location/checkm2.loc
1.0.2	1.0.2	CheckM2 diamond DB downloaded with version 1.0.2	1.0.2	/cvmfs/data.galaxyproject.org/byhand/checkm2/1.0.2/uniref100.KO.1.dmnd

GTDB-tk

nate@alchemist% grep 220 /cvmfs/data.galaxyproject.org/managed/location/gtdbtk_database_versioned.loc
full_database_release_220_downloaded_2024-10-19	Full Database - release 220 (2024-10-19)	220	/cvmfs/data.galaxyproject.org/managed/gtdbtk_database_versioned/full_database_release_220_downloaded_2024-10-19

bakta

nate@alchemist% grep ^amr /cvmfs/data.galaxyproject.org/byhand/location/amrfinderplus_versioned.loc
amrfinderplus_V3.12_2024-05-02.2	V3.12-2024-05-02.2	3.12	/cvmfs/data.galaxyproject.org/byhand/amrfinderplus-db/amrfinderplus_V3.12_2024-05-02.2
nate@alchemist% grep 5.1 /cvmfs/data.galaxyproject.org/byhand/location/bakta_database.loc
V5.1_2024-01-19	10522951	1.7	/cvmfs/data.galaxyproject.org/byhand/bakta_database/10522951

ncbi taxonomy

nate@alchemist% grep 2024-06-05 /cvmfs/data.galaxyproject.org/byhand/location/ncbi_taxonomy.loc
2024-06-05	2024-06-05	/cvmfs/data.galaxyproject.org/byhand/ncbi_taxonomy/2024-06-05

the rest

I'll work on these ASAP.
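
Most of the CVMFS entries above are the .eu entries with only the path prefix changed. The sketch below illustrates that rewrite in Python. It is an illustration under assumptions, not how the entries were actually produced: `relocate_loc_line` is a hypothetical helper, and the prefixes are passed in explicitly because they differ per entry (the GTDB-Tk entry, for example, lives under `managed/` and carries a different download date, so a prefix swap alone would not reproduce it).

```python
def relocate_loc_line(line: str, old_prefix: str, new_prefix: str) -> str:
    """Swap the path prefix in the last column of a tab-separated .loc line.

    Hypothetical helper; assumes the last column holds the path.
    """
    fields = line.rstrip("\n").split("\t")
    path = fields[-1]
    if not path.startswith(old_prefix):
        raise ValueError(f"path {path!r} does not start with {old_prefix!r}")
    fields[-1] = new_prefix + path[len(old_prefix):]
    return "\t".join(fields)


# The bakta_database entry as listed for .eu earlier in this thread.
eu_line = "V5.1_2024-01-19\t10522951\t1.7\t/data/db/data_managers/bakta_database/10522951"

cvmfs_line = relocate_loc_line(
    eu_line,
    old_prefix="/data/db/data_managers/",
    new_prefix="/cvmfs/data.galaxyproject.org/byhand/",
)
print(cvmfs_line)
```

For this entry the result matches the `bakta_database.loc` line shown above.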

@paulzierep
Author

paulzierep commented Mar 31, 2025

Since one needs to know the entries of the loc files and the DBs in /data.galaxyproject.org/byhand/ to write the IWC workflow, is this information publicly available somewhere? https://datacache.galaxyproject.org/byhand gives a 404. Ping @natefoo

@mvdbeek
Member

mvdbeek commented Mar 31, 2025

Why is this needed? You should only need the dbkey.

@paulzierep
Author

> Why is this needed ? You should only need the dbkey

Well, how do I know that the dbkey corresponds to the DB I need without knowing the content of the loc file? Some DMs allow manual entry, and where do I find the dbkey if I cannot see the loc file?

@mvdbeek
Member

mvdbeek commented Mar 31, 2025

If you need one specific database, select it in the editor. If you want to have the user select it, connect a text param to a select, like in https://usegalaxy.org/u/marius/w/checkm2-example. As an author you do not need access to loc files, and unless you know an admin, you will never get it. It's not part of what we consider available to users.

@natefoo
Member

natefoo commented Apr 3, 2025

For the record, /cvmfs/data.galaxyproject.org/byhand is this: http://datacache.galaxyproject.org/indexes/

@paulzierep
Author

> For the record, /cvmfs/data.galaxyproject.org/byhand is this: http://datacache.galaxyproject.org/indexes/

Thanks @natefoo, now I get it. I was looking for the loc files!

@natefoo
Member

natefoo commented Apr 7, 2025

The last two are done:

semibin

nate@alchemist% cat /cvmfs/data.galaxyproject.org/byhand/location/gtdb.loc
17102022	GTDB reference genome generated by MMseqs2 used in SemiBin	gtdb	/cvmfs/data.galaxyproject.org/byhand/gtdb

gtdbtk metadata

nate@alchemist% cat /cvmfs/data.galaxyproject.org/byhand/location/gtdbtk_database_metadata_versioned.loc
GTDB-Tk Database Version 220 Metadata from 09-10-2024 _release_220_downloaded_2024-09-10	GTDB-Tk Database Version 220 Metadata from 09-10-2024	220	/cvmfs/data.galaxyproject.org/byhand/gtdbtk_database_metadata_versioned/_release_220_downloaded_2024-09-10
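
With all entries in place, a quick way to see whether an ID chosen on .eu would also resolve on CVMFS is to compare the first column of each entry. The Python sketch below does that using only the values quoted in this thread. It assumes the first column is the ID a workflow would pin; the dictionaries are keyed by the .eu table names, which do not always match the CVMFS `.loc` file names.

```python
# First column of each entry as quoted in this thread, keyed by the .eu table name.
META_ID = (
    "GTDB-Tk Database Version 220 Metadata from 09-10-2024 "
    "_release_220_downloaded_2024-09-10"
)

eu_ids = {
    "checkm2": "1.0.2",
    "gtdb": "17102022",
    "gtdbtk_database_versioned": "full_database_release_220_downloaded_2024-10-28",
    "amrfinderplus_versioned_database": "amrfinderplus_V3.12_2024-05-02.2",
    "bakta_database": "V5.1_2024-01-19",
    "ncbi_taxonomy": "2024-06-05",
    "gtdbtk_database_metadata_versioned": META_ID,
}

cvmfs_ids = {
    "checkm2": "1.0.2",
    "gtdb": "17102022",
    "gtdbtk_database_versioned": "full_database_release_220_downloaded_2024-10-19",
    "amrfinderplus_versioned_database": "amrfinderplus_V3.12_2024-05-02.2",
    "bakta_database": "V5.1_2024-01-19",
    "ncbi_taxonomy": "2024-06-05",
    "gtdbtk_database_metadata_versioned": META_ID,
}

# Collect tables whose ID differs between the two servers.
mismatches = {
    table: (eu_id, cvmfs_ids.get(table))
    for table, eu_id in eu_ids.items()
    if cvmfs_ids.get(table) != eu_id
}

for table, (eu_id, cvmfs_id) in mismatches.items():
    print(f"{table}: .eu has {eu_id!r}, CVMFS has {cvmfs_id!r}")
```

On the values quoted here, only the GTDB-Tk full database differs (download date 2024-10-28 on .eu versus 2024-10-19 on CVMFS).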

@paulzierep
Author

thanks @natefoo !
