Is it ok to throw non-PyPDF2 exceptions (e.g. ValueError) on malformed PDFs? #584

Google-Autofuzz · 2020-11-13T13:24:28Z

edit: This code was adjusted for PyPDF2==2.9.0:

When running the following code with the latest PyPI version of PyPDF2 on test.pdf

from PyPDF2 import PdfReader

reader = PdfReader("test.pdf", strict=False)

results in an unexpected ValueError:

Traceback (most recent call last):
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 1333, in _find_startxref_pos
    startxref = int(line)
ValueError: invalid literal for int() with base 10: b'startxref'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 270, in __init__
    self.read(stream)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 1272, in read
    startxref = self._find_startxref_pos(stream)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 1338, in _find_startxref_pos
    startxref = int(line[9:].strip())
ValueError: invalid literal for int() with base 10: b''

The text was updated successfully, but these errors were encountered:

VictorCarlquist · 2022-04-30T03:42:41Z

This exception raises when the startxref doesn't have a value.
Maybe we could consider the startxref as zero and search for xref table after the PR #493 lands.

def test_startxref_zero():
   strict = False
   should_fail = True
    with_prev_0 = False
    pdf_data = (
        b"%%PDF-1.7\n"
        b"1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj\n"
        b"2 0 obj << >> endobj\n"
        b"3 0 obj << >> endobj\n"
        b"4 0 obj << /Contents 3 0 R /CropBox [0.0 0.0 2550.0 3508.0]"
        b" /MediaBox [0.0 0.0 2550.0 3508.0] /Parent 1 0 R"
        b" /Resources << /Font << >> >>"
        b" /Rotate 0 /Type /Page >> endobj\n"
        b"5 0 obj << /Pages 1 0 R /Type /Catalog >> endobj\n"
        b"xref 1 5\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"trailer << %s/Root 5 0 R /Size 6 >>\n"
        b"startxref\n"
        b"%%%%EOF"
    )
    pdf_data = pdf_data % (
        pdf_data.find(b"1 0 obj"),
        pdf_data.find(b"2 0 obj"),
        pdf_data.find(b"3 0 obj"),
        pdf_data.find(b"4 0 obj"),
        pdf_data.find(b"5 0 obj"),
        b"/Prev 0 " if with_prev_0 else b"",
    )
    pdf_stream = io.BytesIO(pdf_data)
    PdfFileReader(pdf_stream, strict=strict)

pubpub-zz · 2022-04-30T20:26:50Z

a file without value after startxref and with a %%EOF seems very odd. the test file can not be read with acrobat. is it normal to accept such a corrupted file ?

jvoisin · 2022-05-02T12:53:29Z

The bug here is that PyPDF2 is raising a ValueError instead of a documented exception, not that the PDF isn't accepted.

pubpub-zz · 2022-05-02T13:18:31Z

The bug here is that PyPDF2 is raising a ValueError instead of a documented exception, not that the PDF isn't accepted.

As long as an exception is raised during pdf loading, it means that the loading fails. converting the exception into an other is not very helpfull. If you put the loading code in a try/except you can convert or hangle it as you wish.

To clarify my point the test.pdf provided can hardly be viewed as a pdf, and the example with no value behind startxref seems very unlikely to be produce by any software.

My proposal would be to close this issue as not relevant.

jvoisin · 2022-05-02T13:30:05Z

Are you implying that the correct way to use PyPDF2 is to wrap calls to PdfFileReader into a try:… except Exception:?

MartinThoma · 2022-05-02T20:58:22Z

Let me share some thoughts

If PyPDF2 can't read a pdf for that follows the pdf specifications, it's either a missing feature in PyPDF2 or a bug.
If PyPDF2 fails to read a broken file, it is never a bug in PyPDF2. It might still be a robustness issue of PyPDF2.
When it comes to exceptions, I like transparency where it helps the user to fix issues. Hence I like specific exception messages. If they are reasonable to group, I will group them by custom exceptions. But PyPDF2 will never capture all possible ways PDFs can be broken. No reasonably complex library does that. Just have a look at the error messages you get when you try to open an image file with a zip library or an pdf file with an audio library.
For any code dealing with parsing user data, you have to expect non-recoverable failures (eg the wrong file type, an empty file). If you want to avoid to see exceptions, you have to capture all possible exceptions.

The bug here is that PyPDF2 is raising a ValueError instead of a documented exception

I partially agree here: exceptions that are expected should be documented. However, throwing a ValueError seems fine to me, but it should be part of the docs (on read the docs)

pubpub-zz · 2022-06-21T20:34:01Z

@MartinThoma,
this issue could be closed

MartinThoma · 2022-08-06T12:17:46Z

The issue is not clearly a bug to me, but the expected behavior / good behavior isn't clear either.

For this reason, I've moved this to a discussion: #1210

MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF is-robustness-issue From a users perspective, this is about robustness labels Apr 7, 2022

MartinThoma changed the title ~~ValueError on malformed pdf~~ Is it ok to throw non-PyPDF2 exceptions (e.g. ValueError) on malformed PDFs? Aug 6, 2022

py-pdf locked and limited conversation to collaborators Aug 6, 2022

MartinThoma converted this issue into a discussion Aug 6, 2022

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Is it ok to throw non-PyPDF2 exceptions (e.g. ValueError) on malformed PDFs? #584

Is it ok to throw non-PyPDF2 exceptions (e.g. ValueError) on malformed PDFs? #584

Google-Autofuzz commented Nov 13, 2020 •

edited by MartinThoma

Loading

VictorCarlquist commented Apr 30, 2022

Uh oh!

pubpub-zz commented Apr 30, 2022

Uh oh!

jvoisin commented May 2, 2022

Uh oh!

pubpub-zz commented May 2, 2022

Uh oh!

jvoisin commented May 2, 2022

Uh oh!

MartinThoma commented May 2, 2022

Uh oh!

pubpub-zz commented Jun 21, 2022

Uh oh!

MartinThoma commented Aug 6, 2022 •

edited

Loading

Uh oh!

Is it ok to throw non-PyPDF2 exceptions (e.g. ValueError) on malformed PDFs? #584

Is it ok to throw non-PyPDF2 exceptions (e.g. ValueError) on malformed PDFs? #584

Comments

Google-Autofuzz commented Nov 13, 2020 • edited by MartinThoma Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

VictorCarlquist commented Apr 30, 2022

Uh oh!

pubpub-zz commented Apr 30, 2022

Uh oh!

jvoisin commented May 2, 2022

Uh oh!

pubpub-zz commented May 2, 2022

Uh oh!

jvoisin commented May 2, 2022

Uh oh!

MartinThoma commented May 2, 2022

Uh oh!

pubpub-zz commented Jun 21, 2022

Uh oh!

MartinThoma commented Aug 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Google-Autofuzz commented Nov 13, 2020 •

edited by MartinThoma

Loading

MartinThoma commented Aug 6, 2022 •

edited

Loading