The Arlington PDF Model #3235

j-t-1 · 2025-04-03T17:03:12Z

https://github.com/pdf-association/arlington-pdf-model/

The model is derived from the specification and uses structured data that is both machine- and human-readable.

It has a comprehensive definition of every PDF object:

Every key in every dictionary and stream
Every array element in every array

The model has applications for:

PDF readers
PDF writers
Test case generation
Code coverage

PDF-Days-2021-Arlington-PDF-model.pdf

So there are potential use cases for pypdf.

More broadly, the model + pypdf could be used to create PDFs.

Something like starting with Catalog.tsv, then for each key either selecting a value from the PossibleValues column or repeating the process with the table referenced in the Link column. There are probably lots of complications, such as the xref table and things like that.
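A minimal sketch of that walk, under loud assumptions: the two tables below are hypothetical, heavily simplified stand-ins for the real Arlington TSV files, and the column names Key, Type and Required as well as the `[A,B]` bracket syntax are assumed here rather than taken from this thread (only PossibleValues and Link are named above). It only builds a nested skeleton of required keys; it does not write a PDF.

```python
import csv
import io

# Hypothetical, simplified stand-ins for Arlington tables (not the real files).
SAMPLE_TABLES = {
    "Catalog": (
        "Key\tType\tRequired\tPossibleValues\tLink\n"
        "Type\tname\tTRUE\t[Catalog]\t\n"
        "Pages\tdictionary\tTRUE\t\t[PageTreeNodeRoot]\n"
        "Lang\tstring\tFALSE\t\t\n"
    ),
    "PageTreeNodeRoot": (
        "Key\tType\tRequired\tPossibleValues\tLink\n"
        "Type\tname\tTRUE\t[Pages]\t\n"
        "Count\tinteger\tTRUE\t\t\n"
    ),
}


def parse_list(cell):
    """Turn a cell such as '[A,B]' into ['A', 'B']; an empty cell gives []."""
    cell = cell.strip()
    if cell.startswith("[") and cell.endswith("]"):
        cell = cell[1:-1]
    return [item.strip() for item in cell.split(",") if item.strip()]


def load_rows(tsv_text):
    """Parse one tab-separated table into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def build_object(table_name, tables, depth=0, max_depth=5):
    """Build a skeleton of required keys, following Link where no value is listed."""
    if depth > max_depth:
        # Real object graphs can be cyclic, so a depth limit is needed.
        raise RecursionError(f"link chain too deep at {table_name}")
    result = {}
    for row in load_rows(tables[table_name]):
        if row["Required"] != "TRUE":
            continue  # optional keys are skipped in this sketch
        values = parse_list(row["PossibleValues"])
        links = parse_list(row["Link"])
        if values:
            result[row["Key"]] = values[0]  # pick the first listed value
        elif links:
            result[row["Key"]] = build_object(links[0], tables, depth + 1, max_depth)
        else:
            result[row["Key"]] = None  # no hint in the table; caller must supply a value
    return result


skeleton = build_object("Catalog", SAMPLE_TABLES)
print(skeleton)
```

Turning such a skeleton into actual PDF objects (indirect references, the xref table, page content) is where the complications mentioned above would start.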

Comments (and hopefully code!) welcome on how the model and pypdf could be used to create PDFs. The use case would be PDF analysis, so text for example could be from lorem ipsum.

stefan6419846 · 2025-04-03T17:35:21Z

Maintainer

Thanks for the link.

As far as I understand, the Arlington PDF model is basically for PDF files what XSD is for XML files. This means that it should enforce strict compliance with the standard itself and thus (from my experience) fail on quite a few PDF files.

Regarding pypdf, I do not see a real benefit here. We can already create PDF files to some extent, although handling objects from scratch quickly becomes quite complex (and tends to be out of scope for pypdf IMHO). Making the model a mandatory dependency would further restrict the license of pypdf, as the model definitions are subject to Apache-2.0, which we should avoid. Apart from that, it would most likely mean larger rewrites.

TL;DR: I do not see any real benefit of the Arlington model for pypdf.

Quoting j-t-1: "The use case would be PDF analysis, so text for example could be from lorem ipsum."

I honestly do not get the rationale/goal of this sentence. What are you trying to achieve? PDF analysis can be done on any PDF file in conjunction with the official specification. This is most likely what most developers with experience of the PDF internals, including myself, have done at some point and/or still do on a regular basis. Using the model (or developing with it) already requires proper experience (from their README):

"The starting assumption is that you are a software developer and already know about the PDF document object model, PDF syntax, and how PDF files generally 'work'. You also have experience in debugging valid and invalid PDFs."

j-t-1 Apr 3, 2025
Author

Although there are use cases that pypdf itself could benefit from, I mentioned this more generally. If someone were to use the Arlington model to help with pypdf, for creating test files or whatever, that would be great.

I would like to be able to automate the creation of PDFs, for example creating a file that has a specific annotation, or all types of annotation. This would not be part of pypdf, but would use it. So I am interested in this from a user perspective (and any use cases for pypdf would have been a bonus).
