Duplicate field error in GenomicsDBImport 4.1.2.0 #6158
Comments
The other tickets involved poorly formatted VCFs or a combination of VCFs with different headers (for which I think using a sample map instead of multiple -V inputs might work). This one is more complicated, so I think that @nalinigans and her team might have to weigh in.
@nalinigans Could you or a member of your team please comment when you get a chance? Thanks!
Currently, there is no support for duplicate field names. There is some hardcoded support for DP, where DP in the FORMAT fields is renamed to DP_FORMAT, but GenomicsDB will have to add support for other duplicated field names. Starting to look for a solution, but pulling in @mlathara and @kgururaj for their input.
It might make sense to just internally rename all the FORMAT fields to X_FORMAT instead of just DP.
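To make the renaming idea above concrete, here is a minimal Python sketch. It is not GenomicsDB's implementation (which this thread does not show); the function name `rename_format_fields`, the `only_duplicates` switch, and the `(field_class, name)` tuple layout are all hypothetical and exist only for illustration.

```python
from collections import Counter


def rename_format_fields(fields, only_duplicates=False):
    """Append "_FORMAT" to FORMAT field names (hypothetical helper).

    fields: list of (field_class, name) tuples, e.g. ("INFO", "DP").
    only_duplicates: if True, rename a FORMAT field only when its name
        also appears elsewhere in the list; if False, rename every
        FORMAT field, as suggested in the comment above.
    """
    counts = Counter(name for _, name in fields)
    renamed = []
    for field_class, name in fields:
        if field_class == "FORMAT" and (not only_duplicates or counts[name] > 1):
            name = f"{name}_FORMAT"
        renamed.append((field_class, name))
    return renamed


if __name__ == "__main__":
    # Illustrative field list, not taken from the reporter's gVCF.
    example = [
        ("INFO", "DP"),
        ("FORMAT", "DP"),
        ("INFO", "MQ"),
        ("FORMAT", "MQ"),
        ("FORMAT", "GT"),
    ]
    for field_class, name in rename_format_fields(example):
        print(field_class, name)
```

Renaming every FORMAT field keeps the rule uniform, at the cost of changing names that never collided (GT becomes GT_FORMAT here); passing `only_duplicates=True` leaves non-colliding names untouched.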
Will fix this
@nalinigans Can this one be closed?
@droazen, yes this one can be closed - changes in this GenomicsDB PR. Thanks.
Closing!
Several users have run into this issue where GenomicsDBImport errors out due to duplicate fields in their Info, Format, and/or Filter fields. They want to be able to run GenomicsDBImport without having to manually alter their files to remove duplicates.
This is the latest issue reported Sept 6. Here are some others I found that may be related:
GATK version is 4.1.2.0
_Command:_
gatk_megs=$(head -n1 /proc/meminfo | awk '{print int(0.9*($2/1024))}');
gatk --java-options "-Xmx${gatk_megs}m" GenomicsDBImport --genomicsdb-workspace-path pon_db -V GHS_PT100006_233694007.gvcf.gz -L xgen_plus_spikein.b38.bed --batch-size 50 --reader-threads 5 --tmp-dir=./tmp2
Error log:
Using GATK jar /mnt/PoN_gvcf/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx14441m -jar /mnt/PoN_gvcf/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar GenomicsDBImport --genomicsdb-workspace-path pon_db -V GHS_PT100006_233694007.gvcf.gz -L xgen_plus_spikein.b38.bed --batch-size 50 --reader-threads 5 --tmp-dir=./tmp2
16:20:56.770 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/PoN_gvcf/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
16:20:57.244 INFO GenomicsDBImport - ------------------------------------------------------------
16:20:57.244 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.2.0
16:20:57.245 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
16:20:57.245 INFO GenomicsDBImport - Executing as s.sean.okeeffe1@ip-172-24-85-170 on Linux v4.4.0-1090-aws amd64
16:20:57.245 INFO GenomicsDBImport - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10
16:20:57.245 INFO GenomicsDBImport - Start Date/Time: September 9, 2019 4:20:56 PM UTC
16:20:57.249 INFO GenomicsDBImport - ------------------------------------------------------------
16:20:57.249 INFO GenomicsDBImport - ------------------------------------------------------------
16:20:57.252 INFO GenomicsDBImport - HTSJDK Version: 2.19.0
16:20:57.252 INFO GenomicsDBImport - Picard Version: 2.19.0
16:20:57.252 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:20:57.252 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:20:57.252 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:20:57.252 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:20:57.252 INFO GenomicsDBImport - Deflater: IntelDeflater
16:20:57.253 INFO GenomicsDBImport - Inflater: IntelInflater
16:20:57.253 INFO GenomicsDBImport - GCS max retries/reopens: 20
16:20:57.253 INFO GenomicsDBImport - Requester pays: disabled
16:20:57.253 INFO GenomicsDBImport - Initializing engine
16:20:57.921 INFO FeatureManager - Using codec BEDCodec to read file file:///mnt/PoN_gvcf/xgen_plus_spikein.b38.bed
16:20:58.514 INFO IntervalArgumentCollection - Processing 38997831 bp from intervals
16:20:58.591 WARN GenomicsDBImport - A large number of intervals were specified. Using more than 100 intervals in a single import is not recommended and can cause performance to suffer. If GVCF data only exists within those intervals, performance can be improved by aggregating intervals with the merge-input-intervals argument.
16:20:58.594 INFO GenomicsDBImport - Done initializing engine
16:20:58.829 INFO GenomicsDBImport - Vid Map JSON file will be written to /mnt/PoN_gvcf/pon_db/vidmap.json
16:20:58.829 INFO GenomicsDBImport - Callset Map JSON file will be written to /mnt/PoN_gvcf/pon_db/callset.json
16:20:58.829 INFO GenomicsDBImport - Complete VCF Header will be written to /mnt/PoN_gvcf/pon_db/vcfheader.vcf
16:20:58.829 INFO GenomicsDBImport - Importing to array - /mnt/PoN_gvcf/pon_db/genomicsdb_array
16:20:58.830 WARN GenomicsDBImport - GenomicsDBImport cannot use multiple VCF reader threads for initialization when the number of intervals is greater than 1. Falling back to serial VCF reader initialization.
16:20:58.830 INFO ProgressMeter - Starting traversal
16:20:58.830 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
16:26:23.008 INFO GenomicsDBImport - Importing batch 1 with 1 samples
Duplicate field name BR found in vid attribute "fields"
Duplicate field name MQ found in vid attribute "fields"
Duplicate field name QD found in vid attribute "fields"
terminate called after throwing an instance of 'FileBasedVidMapperException'
what(): FileBasedVidMapperException : Duplicate fields exist in vid attribute "fields"
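Until the importer handles such names itself, a header can be checked before running GenomicsDBImport. The sketch below is a hedged illustration in Python: it assumes that the "duplicate field name" errors above come from the same ID being declared under more than one of the INFO, FORMAT, and FILTER header classes, which is what the issue description suggests but the log does not prove. The helper names and the sample header lines are hypothetical; the reporter's actual header is not shown in this thread.

```python
import gzip
import re
from collections import defaultdict

# Matches header lines such as: ##INFO=<ID=MQ,Number=1,...>
HEADER_ID = re.compile(r"^##(INFO|FORMAT|FILTER)=<ID=([^,>]+)")


def read_header_lines(path):
    """Return the '##' meta lines of a gzip- or bgzip-compressed VCF."""
    lines = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break
            lines.append(line.rstrip("\n"))
    return lines


def find_duplicate_field_ids(header_lines):
    """Map each ID declared more than once to the classes declaring it."""
    classes_by_id = defaultdict(list)
    for line in header_lines:
        match = HEADER_ID.match(line)
        if match:
            classes_by_id[match.group(2)].append(match.group(1))
    return {name: classes for name, classes in classes_by_id.items() if len(classes) > 1}


if __name__ == "__main__":
    # Made-up header lines for illustration only.
    sample_header = [
        "##fileformat=VCFv4.2",
        '##INFO=<ID=MQ,Number=1,Type=Float,Description="Example">',
        '##FORMAT=<ID=MQ,Number=1,Type=Integer,Description="Example">',
        '##INFO=<ID=QD,Number=1,Type=Float,Description="Example">',
        '##FILTER=<ID=QD,Description="Example">',
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Example">',
    ]
    print(find_duplicate_field_ids(sample_header))
    # prints {'MQ': ['INFO', 'FORMAT'], 'QD': ['INFO', 'FILTER']}
```

For a real file, `find_duplicate_field_ids(read_header_lines("GHS_PT100006_233694007.gvcf.gz"))` would list the colliding IDs, assuming the file is readable as gzip.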
This Issue was generated from your [forums]
[forums]: https://gatkforums.broadinstitute.org/gatk/discussion/comment/60780#Comment_60780