Duplicate field error in GenomicsDBImport 4.1.2.0 #6158

Closed
tmm211 opened this issue Sep 12, 2019 · 8 comments

tmm211 commented Sep 12, 2019

Several users have run into an issue where GenomicsDBImport fails because of duplicate field names among the INFO, FORMAT, and/or FILTER fields of their input files. They want to be able to run GenomicsDBImport without having to manually edit their files to remove the duplicates.

This is the latest issue reported Sept 6. Here are some others I found that may be related:

GATK version is 4.1.2.0

_Command:_
gatk_megs=$(head -n1 /proc/meminfo | awk '{print int(0.9*($2/1024))}');
gatk --java-options "-Xmx${gatk_megs}m" GenomicsDBImport --genomicsdb-workspace-path pon_db -V GHS_PT100006_233694007.gvcf.gz -L xgen_plus_spikein.b38.bed --batch-size 50 --reader-threads 5 --tmp-dir=./tmp2

Error log:
Using GATK jar /mnt/PoN_gvcf/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx14441m -jar /mnt/PoN_gvcf/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar GenomicsDBImport --genomicsdb-workspace-path pon_db -V GHS_PT100006_233694007.gvcf.gz -L xgen_plus_spikein.b38.bed --batch-size 50 --reader-threads 5 --tmp-dir=./tmp2
16:20:56.770 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/PoN_gvcf/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
16:20:57.244 INFO GenomicsDBImport - ------------------------------------------------------------
16:20:57.244 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.2.0
16:20:57.245 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
16:20:57.245 INFO GenomicsDBImport - Executing as s.sean.okeeffe1@ip-172-24-85-170 on Linux v4.4.0-1090-aws amd64
16:20:57.245 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10
16:20:57.245 INFO GenomicsDBImport - Start Date/Time: September 9, 2019 4:20:56 PM UTC
16:20:57.249 INFO GenomicsDBImport - ------------------------------------------------------------
16:20:57.249 INFO GenomicsDBImport - ------------------------------------------------------------
16:20:57.252 INFO GenomicsDBImport - HTSJDK Version: 2.19.0
16:20:57.252 INFO GenomicsDBImport - Picard Version: 2.19.0
16:20:57.252 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:20:57.252 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:20:57.252 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:20:57.252 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:20:57.252 INFO GenomicsDBImport - Deflater: IntelDeflater
16:20:57.253 INFO GenomicsDBImport - Inflater: IntelInflater
16:20:57.253 INFO GenomicsDBImport - GCS max retries/reopens: 20
16:20:57.253 INFO GenomicsDBImport - Requester pays: disabled
16:20:57.253 INFO GenomicsDBImport - Initializing engine
16:20:57.921 INFO FeatureManager - Using codec BEDCodec to read file file:///mnt/PoN_gvcf/xgen_plus_spikein.b38.bed
16:20:58.514 INFO IntervalArgumentCollection - Processing 38997831 bp from intervals
16:20:58.591 WARN GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.
16:20:58.594 INFO GenomicsDBImport - Done initializing engine
16:20:58.829 INFO GenomicsDBImport - Vid Map JSON file will be written to /mnt/PoN_gvcf/pon_db/vidmap.json
16:20:58.829 INFO GenomicsDBImport - Callset Map JSON file will be written to /mnt/PoN_gvcf/pon_db/callset.json
16:20:58.829 INFO GenomicsDBImport - Complete VCF Header will be written to /mnt/PoN_gvcf/pon_db/vcfheader.vcf
16:20:58.829 INFO GenomicsDBImport - Importing to array - /mnt/PoN_gvcf/pon_db/genomicsdb_array
16:20:58.830 WARN GenomicsDBImport - GenomicsDBImport cannot use multiple VCF reader threads for initialization when the number of intervals is greater than 1. Falling back to serial VCF reader initialization.
16:20:58.830 INFO ProgressMeter - Starting traversal
16:20:58.830 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
16:26:23.008 INFO GenomicsDBImport - Importing batch 1 with 1 samples
Duplicate field name BR found in vid attribute "fields"
Duplicate field name MQ found in vid attribute "fields"
Duplicate field name QD found in vid attribute "fields"
terminate called after throwing an instance of 'FileBasedVidMapperException'
what(): FileBasedVidMapperException : Duplicate fields exist in vid attribute "fields"

This Issue was generated from your [forums]
[forums]: https://gatkforums.broadinstitute.org/gatk/discussion/comment/60780#Comment_60780
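
For anyone who wants to check their inputs before running the import, here is a minimal sketch that lists field IDs declared under more than one of the INFO, FORMAT, and FILTER header categories. It assumes the "Duplicate field name" messages above come from IDs shared across those categories, which is how the replies in this thread describe it. The function names `find_shared_field_ids` and `read_header` and the example header lines are made up for illustration and are not part of GATK or GenomicsDB.

```python
import gzip
import re
from collections import defaultdict

# Matches structured header lines such as ##INFO=<ID=DP,...> and captures
# the category (INFO/FORMAT/FILTER) and the ID.
HEADER_ID = re.compile(r"^##(INFO|FORMAT|FILTER)=<ID=([^,>]+)")


def find_shared_field_ids(header_lines):
    """Return {field_id: sorted categories} for IDs declared under more than one category."""
    kinds_by_id = defaultdict(set)
    for line in header_lines:
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM line
        match = HEADER_ID.match(line)
        if match:
            kind, field_id = match.groups()
            kinds_by_id[field_id].add(kind)
    return {fid: sorted(kinds) for fid, kinds in kinds_by_id.items() if len(kinds) > 1}


def read_header(path):
    """Yield header lines from a plain-text or gzip/bgzip-compressed VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                break
            yield line.rstrip("\n")


# Hypothetical header, for illustration only.
example_header = [
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=MQ,Number=1,Type=Integer,Description="hypothetical per-sample MQ">',
    '##INFO=<ID=MQ,Number=1,Type=Float,Description="hypothetical site-level MQ">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="hypothetical allele frequency">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
]
shared = find_shared_field_ids(example_header)
print(shared)
```

To check a real file, pass `read_header("your.gvcf.gz")` to `find_shared_field_ids`. Note that DP is commonly declared under both INFO and FORMAT and would be reported too, even though the replies in this thread say DP is already handled.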

tmm211 added the Vanilla label Sep 12, 2019
@ldgauthier
Contributor

The other tickets involved poorly formatted VCFs or a combination of VCFs with different headers (for which I think using a sample map instead of multiple -V inputs might work). This one is more complicated, so I think that @nalinigans and her team might have to weigh in.
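
For reference, GenomicsDBImport can take a tab-delimited sample map through `--sample-name-map` (one `sample_name<TAB>path` line per GVCF) in place of repeated `-V` arguments. A minimal sketch for generating such a file follows. The helper name `write_sample_map` is hypothetical, and deriving the sample name from the file name is an assumption; use whatever sample names are right for your data. Whether a sample map actually avoids the header mismatch is only the suggestion made above, not something confirmed in this thread.

```python
from pathlib import Path


def write_sample_map(gvcf_paths, map_path):
    """Write 'sample_name<TAB>path' lines, one per GVCF, and return the names used."""
    names = []
    with open(map_path, "w") as out:
        for gvcf in gvcf_paths:
            # Assumption: the sample name is the file name up to the first dot.
            name = Path(gvcf).name.split(".")[0]
            if name in names:
                raise ValueError(f"duplicate sample name: {name}")
            names.append(name)
            out.write(f"{name}\t{gvcf}\n")
    return names
```

The resulting file would then be passed as `--sample-name-map <map file>` instead of listing each input with `-V`.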

@droazen
Contributor

droazen commented Sep 12, 2019

@nalinigans Could you or a member of your team please comment when you get a chance? Thanks!

@nalinigans
Collaborator

Currently, there is no support for duplicate field names. There is some hardcoded support for DP, where DP in the FORMAT fields is renamed to DP_FORMAT, but GenomicsDB will have to add support for other duplicated field names. Starting to look for a solution, but pulling in @mlathara and @kgururaj for their input.

@lbergelson
Member

It might make sense to just internally rename all the format fields to X_FORMAT instead of just DP.
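
The two renaming strategies discussed here (suffix only the FORMAT names that clash with an INFO name, as is done for DP, or suffix every FORMAT name) can be sketched as below. This is an illustration of the idea only, not GenomicsDB's actual implementation; the function `internal_field_names` and its `rename_all_format` parameter are hypothetical.

```python
def internal_field_names(info_ids, format_ids, rename_all_format=False):
    """Map (category, id) pairs to internal names that do not collide across INFO and FORMAT.

    With rename_all_format=False, only FORMAT ids that also occur in INFO get
    the _FORMAT suffix (the DP -> DP_FORMAT style of rename). With
    rename_all_format=True, every FORMAT id gets the suffix.
    Caveat: an INFO id that already ends in _FORMAT could still collide;
    this sketch does not handle that case.
    """
    info_set = set(info_ids)
    mapping = {("INFO", fid): fid for fid in info_ids}
    for fid in format_ids:
        needs_suffix = rename_all_format or fid in info_set
        mapping[("FORMAT", fid)] = f"{fid}_FORMAT" if needs_suffix else fid
    return mapping


# Field names taken from the error log in this issue, plus GT as a FORMAT-only field.
names = internal_field_names(["DP", "MQ", "QD"], ["DP", "GT", "MQ", "QD"])
print(names[("FORMAT", "MQ")], names[("FORMAT", "GT")])
```

Renaming every FORMAT field is the simpler rule because it needs no collision check, at the cost of changing names that never clashed.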

@kgururaj
Collaborator

Will fix this

droazen added this to the GATK-Priority-Backlog milestone Oct 30, 2019
@droazen
Contributor

droazen commented Jun 22, 2020

@nalinigans Can this one be closed?

droazen removed this from the GATK-Priority-Backlog milestone Jun 22, 2020
@nalinigans
Collaborator

@droazen, yes, this one can be closed. The changes are in this GenomicsDB PR. Thanks.

@droazen
Contributor

droazen commented Jun 29, 2020

Closing!

droazen closed this as completed Jun 29, 2020
7 participants