BUG: Using compress_identical_objects on transformed content duplicates differing content #3197
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

| Coverage Diff | main | #3197 | +/- |
|---|---|---|---|
| Coverage | 96.54% | 96.54% | |
| Files | 53 | 53 | |
| Lines | 8935 | 8935 | |
| Branches | 1642 | 1642 | |
| Hits | 8626 | 8626 | |
| Misses | 186 | 186 | |
| Partials | 123 | 123 | |
Thanks for the report and PR. Could you please add a corresponding test as well? The function you provided should be rather easy to transform into an integration test if you run […]

As for why […] (lines 1003 to 1018 in a548ca1): `_data` is not set unless you call `get_data()` explicitly. (`contents` is `page.get_contents()` in this case.)
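
The effect described here can be illustrated with a toy model. This is only a sketch under stated assumptions: `LazyStream`, `naive_hash`, and `safe_hash` are hypothetical names invented for illustration and are not pypdf's classes or its actual fix. The sketch shows how hashing a lazily filled cache makes two different streams look identical until the data is materialized.

```python
from __future__ import annotations

import hashlib


class LazyStream:
    """Toy stand-in for a stream whose serialized bytes are cached lazily.

    Hypothetical class for illustration only; it is not pypdf's ContentStream.
    """

    def __init__(self, operations: list[bytes]) -> None:
        self._operations = operations  # parsed form, always present
        self._data = b""  # serialized form, filled on demand

    def get_data(self) -> bytes:
        # Serialize the operations on first access and cache the result.
        if not self._data:
            self._data = b"\n".join(self._operations)
        return self._data

    def naive_hash(self) -> bytes:
        # Hashes only the cache, which may still be empty.
        return hashlib.sha1(self._data).digest()

    def safe_hash(self) -> bytes:
        # Forces serialization first, so the hash reflects the real content.
        return hashlib.sha1(self.get_data()).digest()


page1 = LazyStream([b"BT (1) Tj ET"])
page2 = LazyStream([b"BT (2) Tj ET"])

# Both caches are still empty, so the naive hashes collide.
collides = page1.naive_hash() == page2.naive_hash()
# After materializing the data, the hashes differ.
distinct = page1.safe_hash() != page2.safe_hash()
print(collides, distinct)  # True True
```

In this model, any deduplication keyed on `naive_hash` would treat the two pages' streams as the same object, which matches the symptom reported in this PR.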
@stefan6419846 That test would be nice, but I can't see how I can extract the text from the PdfWriter.

@stefan6419846 I've added a test that uses […]

The current approach looks fine to me. I have added a comment about a small code style change we should make; afterwards, I see no problem with getting this merged.
Thanks.
## What's new

### New Features (ENH)
- Add support for IndirectObject.__iter__ (#3228) by @bryan-brancotte
- Allow filtering by font when removing text (#3216) by @samuelbradshaw

### Bug Fixes (BUG)
- Add missing named destinations being ByteStringObjects (#3282) by @stefan6419846
- Get font information more reliably when removing text (#3252) by @samuelbradshaw
- T* 2D Translation consistent with PDF 1.7 Spec (#3250) by @hackowitz-af
- Add font stack to q/Q operations in layout mode (#3225) by @hackowitz-af
- Avoid completely hiding image loading issues like exceeding image size limits (#3221) by @stefan6419846
- Using compress_identical_objects on transformed content duplicates differing content (#3197) by @danio
- Consider BlackIs1 parameter for CCITTFaxDecode filter (#3196) by @stefan6419846

### Robustness (ROB)
- Deal with insufficient cm matrix during text extraction (#3283) by @stefan6419846
- Allow merging when annotations miss D entry (#3281) by @stefan6419846
- Fix merging documents if there are no Dests (#3280) by @stefan6419846
- Fix crash on malformed action in outline (#3278) by @larsga
- Fix compression issues for removed images which might be None (#3246) by @stefan6419846
- Attempt to deal with non-rectangular FlateDecode streams (#3245) by @stefan6419846
- Handle some None values for broken PDF files (#3230) by @stefan6419846

### Developer Experience (DEV)
- Multiple style improvements by @j-t-1
- Update ruff to 0.11.0 by @stefan6419846

### Maintenance (MAINT)
- Conform ASCIIHexDecode implementation to specification (#3274) by @j-t-1
- Modify comments of filters that do not use decode_parms (#3260) by @j-t-1

### Code Style (STY)
- Simplify warnings & debugging in layout mode text extraction (#3271) by @hackowitz-af
- Standardize mypy assert statements (#3276) by @j-t-1

[Full Changelog](5.4.0...5.5.0)
`compress_identical_objects()` can result in lost content if `page.add_transformation` is used.

Test that creates bad output:

Input file: two-different-pages.pdf
This contains "1" on page 1 and "2" on page 2.

Result before fix: result-before.pdf
This contains "1" on page 1 and "1" on page 2.

Result after fix: result-withfix.pdf
This contains "1" on page 1 and "2" on page 2.
The issue is around the `EncodedStreamObject`, which gets converted into a `ContentStream` after the transformation. The `ContentStream` objects on the two pages have the same `obj.hash_value()`, even though they have different contents. Printing the `_data` value for the content streams showed that it was empty after the transformation.

I tried creating a simple test to reproduce just the hash calculation for the content streams but couldn't get it to work. I'm not sure quite what is happening inside `page.add_transformation` to create the content stream with no data yet calculated. This is what I tried (inspired by `test_contentstream_arrayobject_containing_nullobject`):