Skip to content

Is it ok to throw non-PyPDF2 exceptions (e.g. ValueError) on malformed PDFs? #584

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Google-Autofuzz opened this issue Nov 13, 2020 · 8 comments
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF is-robustness-issue From a users perspective, this is about robustness

Comments

@Google-Autofuzz
Copy link

Google-Autofuzz commented Nov 13, 2020

edit: This code was adjusted for PyPDF2==2.9.0:

When running the following code with the latest PyPI version of PyPDF2 on test.pdf

from PyPDF2 import PdfReader

reader = PdfReader("test.pdf", strict=False)

results in an unexpected ValueError:

Traceback (most recent call last):
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 1333, in _find_startxref_pos
    startxref = int(line)
ValueError: invalid literal for int() with base 10: b'startxref'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 270, in __init__
    self.read(stream)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 1272, in read
    startxref = self._find_startxref_pos(stream)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 1338, in _find_startxref_pos
    startxref = int(line[9:].strip())
ValueError: invalid literal for int() with base 10: b''
@MartinThoma MartinThoma added is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF is-robustness-issue From a users perspective, this is about robustness labels Apr 7, 2022
@VictorCarlquist
Copy link

This exception raises when the startxref doesn't have a value.
Maybe we could consider the startxref as zero and search for xref table after the PR #493 lands.

def test_startxref_zero():
   strict = False
   should_fail = True
    with_prev_0 = False
    pdf_data = (
        b"%%PDF-1.7\n"
        b"1 0 obj << /Count 1 /Kids [4 0 R] /Type /Pages >> endobj\n"
        b"2 0 obj << >> endobj\n"
        b"3 0 obj << >> endobj\n"
        b"4 0 obj << /Contents 3 0 R /CropBox [0.0 0.0 2550.0 3508.0]"
        b" /MediaBox [0.0 0.0 2550.0 3508.0] /Parent 1 0 R"
        b" /Resources << /Font << >> >>"
        b" /Rotate 0 /Type /Page >> endobj\n"
        b"5 0 obj << /Pages 1 0 R /Type /Catalog >> endobj\n"
        b"xref 1 5\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"%010d 00000 n\n"
        b"trailer << %s/Root 5 0 R /Size 6 >>\n"
        b"startxref\n"
        b"%%%%EOF"
    )
    pdf_data = pdf_data % (
        pdf_data.find(b"1 0 obj"),
        pdf_data.find(b"2 0 obj"),
        pdf_data.find(b"3 0 obj"),
        pdf_data.find(b"4 0 obj"),
        pdf_data.find(b"5 0 obj"),
        b"/Prev 0 " if with_prev_0 else b"",
    )
    pdf_stream = io.BytesIO(pdf_data)
    PdfFileReader(pdf_stream, strict=strict)

@pubpub-zz
Copy link
Collaborator

a file without value after startxref and with a %%EOF seems very odd. the test file can not be read with acrobat. is it normal to accept such a corrupted file ?

@jvoisin
Copy link

jvoisin commented May 2, 2022

The bug here is that PyPDF2 is raising a ValueError instead of a documented exception, not that the PDF isn't accepted.

@pubpub-zz
Copy link
Collaborator

The bug here is that PyPDF2 is raising a ValueError instead of a documented exception, not that the PDF isn't accepted.

As long as an exception is raised during pdf loading, it means that the loading fails. converting the exception into an other is not very helpfull. If you put the loading code in a try/except you can convert or hangle it as you wish.

To clarify my point the test.pdf provided can hardly be viewed as a pdf, and the example with no value behind startxref seems very unlikely to be produce by any software.

My proposal would be to close this issue as not relevant.

@jvoisin
Copy link

jvoisin commented May 2, 2022

Are you implying that the correct way to use PyPDF2 is to wrap calls to PdfFileReader into a try:… except Exception:?

@MartinThoma
Copy link
Member

Let me share some thoughts

  1. If PyPDF2 can't read a pdf for that follows the pdf specifications, it's either a missing feature in PyPDF2 or a bug.
  2. If PyPDF2 fails to read a broken file, it is never a bug in PyPDF2. It might still be a robustness issue of PyPDF2.
  3. When it comes to exceptions, I like transparency where it helps the user to fix issues. Hence I like specific exception messages. If they are reasonable to group, I will group them by custom exceptions. But PyPDF2 will never capture all possible ways PDFs can be broken. No reasonably complex library does that. Just have a look at the error messages you get when you try to open an image file with a zip library or an pdf file with an audio library.
  4. For any code dealing with parsing user data, you have to expect non-recoverable failures (eg the wrong file type, an empty file). If you want to avoid to see exceptions, you have to capture all possible exceptions.

The bug here is that PyPDF2 is raising a ValueError instead of a documented exception

I partially agree here: exceptions that are expected should be documented. However, throwing a ValueError seems fine to me, but it should be part of the docs (on read the docs)

@pubpub-zz
Copy link
Collaborator

@MartinThoma,
this issue could be closed

@MartinThoma MartinThoma changed the title ValueError on malformed pdf Is it ok to throw non-PyPDF2 exceptions (e.g. ValueError) on malformed PDFs? Aug 6, 2022
@py-pdf py-pdf locked and limited conversation to collaborators Aug 6, 2022
@MartinThoma MartinThoma converted this issue into a discussion Aug 6, 2022
@MartinThoma
Copy link
Member

MartinThoma commented Aug 6, 2022

The issue is not clearly a bug to me, but the expected behavior / good behavior isn't clear either.

For this reason, I've moved this to a discussion: #1210

This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF is-robustness-issue From a users perspective, this is about robustness
Projects
None yet
Development

No branches or pull requests

5 participants