-
Build pdfium statically for macOS
-
Parse document using pdfium
- Parse char
- Merge chars into CharSpans
- Merge spans into Lines
-
Layout:
- Find Layout Model and run with ORT
- Accelerate Model on ANE/GPU
- Extract Page Layout
- Preprocess pdfium image
- Postprocess tensor -> nms
- Verify labels
- Determine pages needing OCR (coverage lines/blocks)
- OCR -> Use Apple Vision on macOS (`target_os = "macos"`)
- Merge Layout with pdfium lines
- Rescale or downscale line bbox / layout bbox
- Merge intersecting lines (from pdfium and OCR) with max bbox into blocks
- Add lines to bbox based on distance
- Add remaining layout blocks to blocks based on position
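The "Postprocess tensor -> nms" step above can be sketched as a greedy non-maximum suppression over scored boxes. This is only an illustration, not the project's actual code: the `Detection` struct, the `iou` and `nms` function names, and the threshold semantics are all assumptions made for the example.

```rust
/// Hypothetical detection type: an axis-aligned box plus a confidence score.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Detection {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
    pub score: f32,
}

impl Detection {
    /// Box area; degenerate boxes count as zero.
    fn area(&self) -> f32 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }
}

/// Intersection-over-union of two boxes; 0.0 when they do not overlap.
pub fn iou(a: &Detection, b: &Detection) -> f32 {
    let w = (a.x1.min(b.x1) - a.x0.max(b.x0)).max(0.0);
    let h = (a.y1.min(b.y1) - a.y0.max(b.y0)).max(0.0);
    let inter = w * h;
    let union = a.area() + b.area() - inter;
    if union > 0.0 {
        inter / union
    } else {
        0.0
    }
}

/// Greedy NMS: visit boxes from highest to lowest score and keep a box only
/// if it does not overlap an already kept box by more than `iou_threshold`.
pub fn nms(mut dets: Vec<Detection>, iou_threshold: f32) -> Vec<Detection> {
    dets.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut kept: Vec<Detection> = Vec::new();
    for d in dets {
        if kept.iter().all(|k| iou(k, &d) <= iou_threshold) {
            kept.push(d);
        }
    }
    kept
}
```

Greedy NMS is quadratic in the number of boxes, which is usually fine for the few dozen layout regions on a page. A per-label variant would run the same loop separately for each class.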
-
Document merge:
- Group listItems into a list: find the first item and merge subsequent items
- Group caption/footer blocks with image blocks
- Group Page header / Page footer
- Move header element to the top of the page
- Process SubHeader/Titles using k-means on line heights to get the title_level
- Merge text block based on gap distance
- Run Block processors (Text, List, PageHeader)
- Get PDF Bookmarks (TOC) and reconcile detected titles with TOC and
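As a rough sketch of the "k-means on line heights" idea above, a one-dimensional Lloyd's algorithm is enough: cluster the heights, sort the centroids from tallest to shortest, and use the rank of the nearest centroid as the level. The function names `kmeans_1d` and `title_level`, the evenly spaced initialization, and the "tallest cluster is level 1" convention are assumptions for illustration only.

```rust
/// Index of the centroid closest to `h` (0 if `centroids` is empty).
fn nearest(centroids: &[f32], h: f32) -> usize {
    let mut best = 0;
    for (i, c) in centroids.iter().enumerate() {
        if (h - c).abs() < (h - centroids[best]).abs() {
            best = i;
        }
    }
    best
}

/// One-dimensional k-means (Lloyd's algorithm) over line heights.
/// Returns the centroids sorted from tallest to shortest.
pub fn kmeans_1d(heights: &[f32], k: usize, iterations: usize) -> Vec<f32> {
    if heights.is_empty() || k == 0 {
        return Vec::new();
    }
    let min = heights.iter().copied().fold(f32::INFINITY, f32::min);
    let max = heights.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Deterministic init (an assumption): spread centroids evenly over [min, max].
    let mut centroids: Vec<f32> = (0..k)
        .map(|i| {
            if k == 1 {
                (min + max) / 2.0
            } else {
                min + (max - min) * i as f32 / (k - 1) as f32
            }
        })
        .collect();
    for _ in 0..iterations {
        let mut sums = vec![0.0f32; k];
        let mut counts = vec![0usize; k];
        // Assignment step: each height goes to its nearest centroid.
        for &h in heights {
            let c = nearest(&centroids, h);
            sums[c] += h;
            counts[c] += 1;
        }
        // Update step: move each non-empty centroid to the mean of its members.
        for i in 0..k {
            if counts[i] > 0 {
                centroids[i] = sums[i] / counts[i] as f32;
            }
        }
    }
    centroids.sort_by(|a, b| b.total_cmp(a));
    centroids
}

/// Title level for a line height: 1 for the tallest cluster, 2 for the next, and so on.
pub fn title_level(sorted_centroids: &[f32], height: f32) -> usize {
    nearest(sorted_centroids, height) + 1
}
```

Choosing `k` is the open question in practice. A fixed small value such as 3 is simple, but documents with a single heading size would then get artificially split levels.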
-
Tables
- Extract tables
- Group captions with tables
-
Render Document
- JSON renderer
- Crop images and save in directory if the `--save_image` flag is set
- HTML renderer
- Markdown renderer (based on html renderer)
-
CLI ferrules
- Add variables
- Add debug flag
- Add range flag
- Configure hyperparams/execution providers
- Add export format -> JSON (default) or markdown
-
Build pdfium statically for Linux
-
Change NMS algorithm to more robust one
-
Add tracing to core
-
Configurable inference params: ORTProviders, batch_size, confidence_score, NMS ...
-
`eyre` | `thiserror` for custom errors
-
OCR: Find a good recognition model for non-macOS targets (target_os != macos)
-
Inference:
- Export ONNX layout model with dynamic `batch_size`
- Run layout on &[DynamicImage]
- Implement Linux/CUDA inference (EP)
- Batch inference on pages (for NVIDIA GPUs; batch_size on macOS didn't yield good results)
-
API
- Full OTEL + sentry tracing in API
- Clap API + Env variables
- Unify Config for env/CLI/API
- Dynamic batching of document(pages) to process
-
Optim
- Determine page orientation + deskew
- Optimize layout model for ANE -> Look at changing shapes and operators to maximize ANE perf
- ORT inference in fp16/mixed precision
- Move to other YOLO versions: yolov11s seems better with fewer params (yolo-doclaynet)
- Explore arena allocators (one per page)
- String -> CowStr
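To illustrate what the "String -> CowStr" item could buy, here is a small sketch using the standard library's `Cow<'_, str>` as a stand-in (the item may well refer to a different `CowStr` type). The helper `normalize_whitespace` is hypothetical: it borrows the input when the text is already clean and allocates only when a rewrite is needed.

```rust
use std::borrow::Cow;

/// Trim the text and collapse whitespace runs into single spaces.
/// Returns a borrowed slice when nothing has to change, so clean
/// input costs no allocation.
pub fn normalize_whitespace(text: &str) -> Cow<'_, str> {
    let needs_rewrite = text.trim().len() != text.len()
        || text.contains("  ")
        || text.chars().any(|c| c.is_whitespace() && c != ' ');
    if !needs_rewrite {
        return Cow::Borrowed(text);
    }
    Cow::Owned(text.split_whitespace().collect::<Vec<_>>().join(" "))
}
```

Most text extracted from a PDF passes through such helpers unchanged, so the borrowed path is the common case and the owned path is the exception.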
-
Benchmark
- Compare to unstructured, tika, docling