TypeError while loading some documents, caused by _cmap.py, line 93 #2286

Closed
@elhele

Description

While extracting text from some PDF documents with pypdf, I get the following error:
TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'int'

Environment

Which environment were you using when you encountered the problem?

macOS-10.16-x86_64-i386-64bit
pypdf==3.17.0, crypt_provider=('cryptography', '37.0.4'), PIL=9.0.1

It happens both locally and during Azure deployment.

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader(file)  # file: path to (or stream of) the affected PDF
pages = reader.pages

Unfortunately, I cannot share the document that causes the problem because it contains sensitive information, and I couldn't reproduce the error with other documents. However, the following adjustment after line 89 of _cmap.py solves the problem:

import pypdf
...
sp_width = compute_space_width(ft, sp, space_width)
# Resolve an indirect reference to the object it points to before dividing
if isinstance(sp_width, pypdf.generic.IndirectObject):
    sp_width = sp_width.get_object()
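To illustrate why resolving the reference first avoids the error, here is a minimal standalone sketch. It does not use pypdf itself: `FakeIndirectRef`, `resolve_number` and `half_space_width` are hypothetical names made up for this example, and `FakeIndirectRef` only imitates the `get_object()` method of an indirect object.

```python
from typing import Any


class FakeIndirectRef:
    """Hypothetical stand-in for an indirect PDF object (illustration only)."""

    def __init__(self, target: float) -> None:
        self._target = target

    def get_object(self) -> float:
        # A real indirect object would look the target up in the PDF file.
        return self._target


def resolve_number(value: Any) -> float:
    """Follow an indirect reference, if there is one, and return a float."""
    getter = getattr(value, "get_object", None)
    if callable(getter):
        value = getter()
    return float(value)


def half_space_width(sp_width: Any) -> float:
    # Same idea as float(sp_width / 2), but the reference is resolved first.
    return resolve_number(sp_width) / 2


print(half_space_width(FakeIndirectRef(500)))  # value behind a reference
print(half_space_width(250))  # plain number
```

Dividing the `FakeIndirectRef` instance directly would raise the same kind of TypeError as in the report, since the class defines no division operator.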

Traceback

This is the complete Traceback I see:

  File "..././scripts/prepdocs.py", line 262, in <module>
    loop.run_until_complete(main(file_strategy, azd_credential, args))
  File ".../opt/anaconda3/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "..././scripts/prepdocs.py", line 137, in main
    await strategy.run(search_info)
  File ".../scripts/prepdocslib/filestrategy.py", line 58, in run
    pages = [page async for page in self.pdf_parser.parse(content=file.content)]
  File ".../scripts/prepdocslib/filestrategy.py", line 58, in <listcomp>
    pages = [page async for page in self.pdf_parser.parse(content=file.content)]
  File ".../scripts/prepdocslib/pdfparser.py", line 52, in parse
    page_text = p.extract_text()
  File ".../scripts/.venv/lib/python3.9/site-packages/pypdf/_page.py", line 2284, in extract_text
    return self._extract_text(
  File ".../scripts/.venv/lib/python3.9/site-packages/pypdf/_page.py", line 1903, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File ".../scripts/.venv/lib/python3.9/site-packages/pypdf/_cmap.py", line 29, in build_char_map
    font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
  File ".../scripts/.venv/lib/python3.9/site-packages/pypdf/_cmap.py", line 93, in build_char_map_from_dict
    float(sp_width / 2),
TypeError: unsupported operand type(s) for /: 'IndirectObject' and 'int'
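Until the library handles this case, a caller-side workaround could be to catch the error per page so that one affected page does not abort the whole run. This is only a sketch and assumes that skipping a page is acceptable for the use case. `extract_text_tolerant` and the two demo classes are hypothetical names; the function only relies on each page having an `extract_text()` method, as in the traceback above.

```python
from typing import Any, Iterable, List, Optional, Tuple


def extract_text_tolerant(pages: Iterable[Any]) -> List[Tuple[int, Optional[str]]]:
    """Return (page index, text) pairs; text is None where extraction failed."""
    results: List[Tuple[int, Optional[str]]] = []
    for index, page in enumerate(pages):
        try:
            text = page.extract_text()
        except TypeError as exc:
            # Record the failure and continue with the remaining pages.
            print(f"page {index}: text extraction failed: {exc}")
            text = None
        results.append((index, text))
    return results


class _DemoGoodPage:
    """Hypothetical page whose text extraction succeeds."""

    def extract_text(self) -> str:
        return "some text"


class _DemoBrokenPage:
    """Hypothetical page that fails the way the affected document does."""

    def extract_text(self) -> str:
        raise TypeError("unsupported operand type(s) for /")


demo_results = extract_text_tolerant([_DemoGoodPage(), _DemoBrokenPage()])
```

With a real document, the call would be `extract_text_tolerant(reader.pages)`.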
