PDFTextStripper - parsing incorrectness

Hello,

I am using PDFTextStripper from the PDFBox library to extract the text from a PDF that was generated from HTML using openhtmltopdf.

Code for parsing:
final PDDocument document = PDDocument.load(pdfBytes);
final PDFTextStripper pdfTextStripper = new PDFTextStripper();
return pdfTextStripper.getText(document);

However, I am seeing a few problems:
1) Invisible, redundant text
Sometimes the PDF has invisible text in front of the actual text.
e.g.

HTML:
line1
line2
line3

PDF:
line1
line2 (<--- invisible)
line2
line3

This happens even when you just open the pdf and select / copy the text.

2) Commas are placed in the wrong position when parsed
Commas show up correctly in the PDF, but in the parsed text they appear in the wrong position.
e.g.
HTML:
hello, my name, is 

PDF:
,,hello my name is

NOTE this does not happen when you open the pdf and select / copy the text.

3) Interestingly, the comma problem goes away when I parse like this:
        final PDDocument document = PDDocument.load(pdfBytes);
        final PDFTextStripper pdfTextStripper = new PDFTextStripper();
        pdfTextStripper.setSortByPosition(true);
        return pdfTextStripper.getText(document);

However, all superscripts / subscripts then get messed up in the output
e.g. receptiońs becomes receptións
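
To make the comma behaviour concrete, here is a toy model of what I suspect is happening. It assumes the commas are written to the page before the words, so reading in written order puts them first, while sorting by position puts them back. This is only a sketch of my guess, not PDFBox code: the Chunk class, the coordinates, and the method names are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PositionSortSketch {

    // Hypothetical stand-in for a positioned piece of text; not a PDFBox type.
    public static final class Chunk {
        final String text;
        final double x;
        final double y;

        public Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Concatenates chunks in the order they were written (list order).
    public static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    // Concatenates chunks by ascending y, then ascending x (left to right).
    public static String inPositionOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return inStreamOrder(sorted);
    }

    // Made-up data: both commas are written first, but positioned after the words.
    public static List<Chunk> sample() {
        return Arrays.asList(
                new Chunk(",", 30, 0),
                new Chunk(",", 78, 0),
                new Chunk("hello", 0, 0),
                new Chunk(" my name", 34, 0),
                new Chunk(" is", 82, 0));
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder(sample()));   // ,,hello my name is
        System.out.println(inPositionOrder(sample())); // hello, my name, is
    }
}
```

If this guess is right, it would explain why setSortByPosition(true) fixes the commas, but it does not explain the invisible text or the changed accents.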

Do you know why this happens?

Thank you!


PDFTextStripper - parsing incorrectness #458

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

PDFTextStripper - parsing incorrectness #458

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions