-
Notifications
You must be signed in to change notification settings - Fork 602
PrintBGZFBlockInformation: a tool to dump information about blocks in a BGZF file #4239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jamesemery please review (after using the tool to help diagnose #4224) |
0e696e6
to
5d13c2b
Compare
Codecov Report
@@ Coverage Diff @@
## master #4239 +/- ##
===============================================
+ Coverage 87.048% 87.052% +0.004%
- Complexity 31445 31479 +34
===============================================
Files 1921 1923 +2
Lines 144977 145146 +169
Branches 16062 16081 +19
===============================================
+ Hits 126199 126352 +153
- Misses 12935 12944 +9
- Partials 5843 5850 +7
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly minor changes, I might insist though that for this to be a useful tool in diagnosing problems with (valid).vcf.gz files which have null blocks before the final block that there should at least be a check for this functionality, as the code looks like it would stop reading blocks after that point.
Also probably making a better error message for empty file
@Argument(fullName = "bgzf-file", doc = "The BGZF-format file for which to print block information", optional = false) | ||
private String bgzfPathString; | ||
|
||
@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME, doc = "File to which to write block information (if not specified, prints to standard output", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing parenthesis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
|
||
if ( ! AbstractFeatureReader.hasBlockCompressedExtension(bgzfPathString) ) { | ||
throw new UserException("File " + bgzfPathString + " does not end in a recognized BGZF file extension (" + | ||
AbstractFeatureReader.BLOCK_COMPRESSED_EXTENSIONS + ")"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit, but you should probably use a string joiner on this set as I suspect it will mangle the formatting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a UserException.BadInput
instead to be more specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@Argument(fullName = "bgzf-file", doc = "The BGZF-format file for which to print block information", optional = false) | ||
private String bgzfPathString; | ||
|
||
@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME, doc = "File to which to write block information (if not specified, prints to standard output", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing ')'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
public class PrintBGZFBlockInformationIntegrationTest extends CommandLineProgramTest { | ||
|
||
@Test | ||
public void testNormalInput() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should probably add a test asserting that this behaves rationally for bad input (eg. ending on a non-empty block or by having null blocks interspersed with occupied blocks in the file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a test with a corrupt BGZF file.
} | ||
} | ||
|
||
private BGZFBlockMetadata processNextBlock( InputStream stream, String streamSource) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it was pulled from htsjdk, this should maybe exist in some common utility method somewhere exposed in htsjdk ideally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, yes, but beyond the scope of this PR.
@@ -0,0 +1,1712 @@ | |||
Block at file offset 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps these output files should have a header pointing to the file they are summarizing to avoid confusion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, added.
@ExperimentalFeature | ||
@CommandLineProgramProperties( | ||
summary = "Print information about the compressed blocks in a BGZF format file", | ||
oneLineSummary = "Print information about the compressed blocks in a BGZF format file", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe expand this comment to more explicitly summarize the compressed and uncompressed size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessary I think -- the tool-level docs have that information.
Also, this seems to misbehave when run on a GZIP file that is NOT a block gzipped file. I get the following result which appears to be flawed in a number of ways:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments for documentation and simplification.
* A diagnostic tool that prints information about the compressed blocks in a BGZF format file, | ||
* such as a .vcf.gz file. | ||
* | ||
* The output looks like this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For properly formatted in the docgen, this requires the <p>
HTML tag (I guess).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* | ||
* The output looks like this: | ||
* | ||
* Block at file offset 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably can be in a <code>
tag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* ... | ||
* etc. | ||
* | ||
* The output can be redirected to a file using the -O option. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also <p>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
programGroup = DiagnosticsAndQCProgramGroup.class | ||
) | ||
public class PrintBGZFBlockInformation extends CommandLineProgram { | ||
private final Logger logger = LogManager.getLogger(PrintBGZFBlockInformation.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CommandLineProgram
already has a protected logger initialized for this.getClass()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point -- removed
bgzfPath = IOUtils.getPath(bgzfPathString); | ||
|
||
if ( ! Files.exists(bgzfPath) ) { | ||
throw new UserException("File " + bgzfPathString + " does not exist"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better a UserException.CouldNotReadInputFile
to narrow down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to UserException.CouldNotReadInputFile
|
||
if ( ! AbstractFeatureReader.hasBlockCompressedExtension(bgzfPathString) ) { | ||
throw new UserException("File " + bgzfPathString + " does not end in a recognized BGZF file extension (" + | ||
AbstractFeatureReader.BLOCK_COMPRESSED_EXTENSIONS + ")"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a UserException.BadInput
instead to be more specific?
try ( InputStream bgzfInputStream = Files.newInputStream(bgzfPath) ) { | ||
BGZFBlockMetadata blockInfo = null; | ||
while ( (blockInfo = processNextBlock(bgzfInputStream, bgzfPathString)) != null ) { | ||
outStream.println(String.format("Block at file offset %d", blockInfo.blockOffset)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better outStream.printf()
for removing the String.format()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
outStream.println(); | ||
} | ||
} catch ( IOException e ) { | ||
throw new UserException("Error while parsing BGZF file. Error message was: " + e.getMessage(), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better a UserException.CouldNotReadInputFile
, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, fixed.
@CommandLineProgramProperties( | ||
summary = "Print information about the compressed blocks in a BGZF format file", | ||
oneLineSummary = "Print information about the compressed blocks in a BGZF format file", | ||
programGroup = DiagnosticsAndQCProgramGroup.class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be out-of-place in DiagnosticsAndQCProgramGroup
, which currently contains things like metrics collectors, amongst other things. Other utilities like this are in OtherProgramGroup
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to OtherProgramGroup
5d13c2b
to
c57a265
Compare
@jamesemery I added a pretty comprehensive set of tests for various kinds of corrupt BGZF files (as well as a regular GZIP file), and patched the tool to report something sensible in these cases. You can see the new tool output in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do with these comments what you will, the tests are good now and this was the primary issue with this branch before so once you have acknowledged/responded 👍
} | ||
|
||
// Emit a warning at the end if we encountered any terminator blocks before the final block: | ||
if ( sawNonFinalTerminatorBlock ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would almost think that this should also print out the location of the first one found, just so its use friendly and doesn't confuse anybody about there being two empty blocks in a row at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either that or clearly separate this from the very similarly formatted output line that appears right before it in the output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done -- I added block numbers, and reference them in the error messages.
IntegrationTestSpec.assertEqualTextFiles(actualOutput, expectedOutput); | ||
} | ||
|
||
@Test(expectedExceptions= UserException.CouldNotReadInputFile.class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a brief comment explaining what this test means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added explanatory comments to all tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This output looks clearer, 👍
No description provided.