Roadmap

Build pdfium statically for macOS
Parse document using pdfium
- Parse char
- Merge chars into CharSpans
- Merge spans into Lines
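The char -> span -> line merging steps above could look roughly like the following sketch. It assumes chars arrive in content-stream (reading) order with top-left-origin bboxes. The `Char`, `CharSpan` and `Line` types, and the gap threshold expressed as a fraction of the font size, are hypothetical and not taken from ferrules or the pdfium API.

```rust
// Hypothetical types: ferrules' real structs and the pdfium bindings differ.
// Coordinates are in page space with the origin at the top-left (y grows downwards).

/// One glyph as it might be read from pdfium.
#[derive(Debug, Clone, Copy)]
pub struct Char {
    pub c: char,
    pub x0: f32,
    pub x1: f32,
    pub y0: f32,
    pub y1: f32,
    pub font_size: f32,
}

/// Consecutive chars sharing a font size and row, with no large gap between them.
#[derive(Debug, Clone)]
pub struct CharSpan {
    pub text: String,
    pub x0: f32,
    pub x1: f32,
    pub y0: f32,
    pub y1: f32,
    pub font_size: f32,
}

/// Spans that sit on the same visual row, left to right.
#[derive(Debug, Clone)]
pub struct Line {
    pub spans: Vec<CharSpan>,
}

impl Line {
    pub fn text(&self) -> String {
        let parts: Vec<&str> = self.spans.iter().map(|s| s.text.as_str()).collect();
        parts.join(" ")
    }
}

fn center_y(y0: f32, y1: f32) -> f32 {
    (y0 + y1) / 2.0
}

/// Appends each char to the current span unless the font size changes, the row changes,
/// or the horizontal gap exceeds `max_gap * font_size` (or jumps far backwards).
pub fn merge_chars(chars: &[Char], max_gap: f32) -> Vec<CharSpan> {
    let mut spans: Vec<CharSpan> = Vec::new();
    for ch in chars {
        if let Some(span) = spans.last_mut() {
            let gap = ch.x0 - span.x1;
            let same_font = (span.font_size - ch.font_size).abs() < 0.01;
            let same_row =
                (center_y(span.y0, span.y1) - center_y(ch.y0, ch.y1)).abs() < 0.5 * ch.font_size;
            if same_font && same_row && gap >= -0.5 * ch.font_size && gap <= max_gap * ch.font_size {
                span.text.push(ch.c);
                span.x1 = span.x1.max(ch.x1);
                span.y0 = span.y0.min(ch.y0);
                span.y1 = span.y1.max(ch.y1);
                continue;
            }
        }
        spans.push(CharSpan {
            text: ch.c.to_string(),
            x0: ch.x0,
            x1: ch.x1,
            y0: ch.y0,
            y1: ch.y1,
            font_size: ch.font_size,
        });
    }
    spans
}

/// Appends a span to the current line when its vertical center is within half the
/// smaller height of the previous span and it starts to the right of that span.
pub fn merge_spans(spans: Vec<CharSpan>) -> Vec<Line> {
    let mut lines: Vec<Line> = Vec::new();
    for span in spans {
        if let Some(line) = lines.last_mut() {
            let last = line.spans.last().expect("lines are never empty");
            let min_h = (last.y1 - last.y0).min(span.y1 - span.y0);
            let dy = (center_y(last.y0, last.y1) - center_y(span.y0, span.y1)).abs();
            if dy < 0.5 * min_h && span.x0 >= last.x1 {
                line.spans.push(span);
                continue;
            }
        }
        lines.push(Line { spans: vec![span] });
    }
    lines
}
```

This ignores rotated text, right-to-left scripts and multi-column reordering, all of which a real implementation would have to handle.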
Layout:
- Find Layout Model and run with ORT
- Accelerate Model on ANE/GPU
- Extract Page Layout
  - Preprocess pdfium image
  - Postprocess tensor -> nms
  - Verify labels
- Determine pages needing OCR (coverage lines/blocks)
- OCR -> Use Apple Vision on macOS (target_os = "macos")
- Merge Layout with pdfium lines
  - Rescale or downscale line bbox / layout bbox
  - Merge intersecting lines (from pdfium and OCR) into blocks, taking the max bbox
  - Add lines to bbox based on distance
  - Add remaining layout blocks to blocks based on position
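For the "Postprocess tensor -> nms" step, a minimal sketch of classic greedy, per-class non-maximum suppression over already-decoded boxes. The `Detection` type and the thresholds are hypothetical and the tensor decoding is omitted; this is the textbook algorithm, not necessarily what ferrules ships (the roadmap itself plans a more robust NMS later).

```rust
/// Hypothetical decoded detection: bbox is [x0, y0, x1, y1].
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Detection {
    pub bbox: [f32; 4],
    pub score: f32,
    pub label: usize,
}

/// Intersection over union of two axis-aligned boxes.
pub fn iou(a: &[f32; 4], b: &[f32; 4]) -> f32 {
    let ix0 = a[0].max(b[0]);
    let iy0 = a[1].max(b[1]);
    let ix1 = a[2].min(b[2]);
    let iy1 = a[3].min(b[3]);
    let inter = (ix1 - ix0).max(0.0) * (iy1 - iy0).max(0.0);
    let area = |r: &[f32; 4]| (r[2] - r[0]).max(0.0) * (r[3] - r[1]).max(0.0);
    let union = area(a) + area(b) - inter;
    if union <= 0.0 { 0.0 } else { inter / union }
}

/// Greedy NMS: drop low-confidence boxes, then keep boxes in descending score order,
/// suppressing any box whose IoU with an already-kept box of the same label is too high.
pub fn nms(mut dets: Vec<Detection>, conf_threshold: f32, iou_threshold: f32) -> Vec<Detection> {
    dets.retain(|d| d.score >= conf_threshold);
    dets.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut kept: Vec<Detection> = Vec::new();
    for d in dets {
        let suppressed = kept
            .iter()
            .any(|k| k.label == d.label && iou(&k.bbox, &d.bbox) > iou_threshold);
        if !suppressed {
            kept.push(d);
        }
    }
    kept
}
```

The loop is O(n^2) in the number of boxes that pass the confidence filter, which is usually fine for the few hundred candidates a layout model emits per page.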
Document merge:
- Group ListItems into lists: find the first item and merge subsequent items
- Group caption/footer blocks with image blocks
- Group Page header / Page footer
- Move header element to the top of the page
- Process SubHeader/Titles using k-means on line heights to get the title_level
- Merge text blocks based on gap distance
- Run Block processors (Text, List, PageHeader)
- Get PDF Bookmarks (TOC) and reconcile detected titles with TOC and
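The title-level step (k-means on line heights) can be sketched as 1D Lloyd's k-means. Assumptions: the function names are hypothetical, initialization uses evenly spaced quantiles of the sorted heights for determinism rather than random or k-means++ seeding, and the tallest cluster maps to title_level 1.

```rust
/// Index of the centroid closest to `v`.
fn nearest(centroids: &[f32], v: f32) -> usize {
    let mut best = 0;
    for (i, &c) in centroids.iter().enumerate() {
        if (v - c).abs() < (v - centroids[best]).abs() {
            best = i;
        }
    }
    best
}

/// 1D k-means (Lloyd's algorithm) over line heights.
/// Returns the final centroids sorted in descending order (tallest first).
pub fn kmeans_1d(values: &[f32], k: usize, max_iter: usize) -> Vec<f32> {
    assert!(k > 0 && !values.is_empty());
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Deterministic init: evenly spaced quantiles of the sorted values.
    let mut centroids: Vec<f32> = (0..k)
        .map(|i| {
            let idx = if k == 1 { sorted.len() / 2 } else { i * (sorted.len() - 1) / (k - 1) };
            sorted[idx]
        })
        .collect();
    for _ in 0..max_iter {
        let mut sums = vec![0.0f32; k];
        let mut counts = vec![0usize; k];
        // Assignment step: each height goes to its nearest centroid.
        for &v in values {
            let c = nearest(&centroids, v);
            sums[c] += v;
            counts[c] += 1;
        }
        // Update step: move each non-empty cluster's centroid to its mean.
        let mut changed = false;
        for c in 0..k {
            if counts[c] > 0 {
                let mean = sums[c] / counts[c] as f32;
                if (mean - centroids[c]).abs() > 1e-4 {
                    changed = true;
                }
                centroids[c] = mean;
            }
        }
        if !changed {
            break;
        }
    }
    centroids.sort_by(|a, b| b.total_cmp(a));
    centroids
}

/// title_level = 1 for the tallest cluster, 2 for the next one, and so on.
pub fn title_level(height: f32, centroids_desc: &[f32]) -> usize {
    nearest(centroids_desc, height) + 1
}
```

Choosing `k` is the hard part in practice; one option is to cap it by the number of distinct rounded heights among detected titles.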
Tables
- Extract tables
- Group captions with tables
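Grouping captions with tables can be sketched as a nearest-neighbour match on vertical distance. Hypothetical throughout: the `Block`/`Kind` types, the `max_gap` cutoff, and the rule that a caption must overlap the table horizontally. Coordinates assume y grows downwards.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Table,
    Caption,
    Text,
}

/// Hypothetical layout block; bbox is [x0, y0, x1, y1].
#[derive(Debug, Clone, Copy)]
pub struct Block {
    pub id: usize,
    pub kind: Kind,
    pub bbox: [f32; 4],
}

/// 0 if the boxes overlap vertically, otherwise the distance between their facing edges.
fn vertical_gap(a: &[f32; 4], b: &[f32; 4]) -> f32 {
    (b[1] - a[3]).max(a[1] - b[3]).max(0.0)
}

fn horizontal_overlap(a: &[f32; 4], b: &[f32; 4]) -> bool {
    a[0] < b[2] && b[0] < a[2]
}

/// Pairs each caption with the vertically closest, horizontally overlapping table
/// within `max_gap`. Returns (caption_id, table_id) pairs.
pub fn group_captions(blocks: &[Block], max_gap: f32) -> Vec<(usize, usize)> {
    let tables: Vec<&Block> = blocks.iter().filter(|b| b.kind == Kind::Table).collect();
    blocks
        .iter()
        .filter(|b| b.kind == Kind::Caption)
        .filter_map(|cap| {
            tables
                .iter()
                .filter(|t| horizontal_overlap(&cap.bbox, &t.bbox))
                .map(|t| (vertical_gap(&cap.bbox, &t.bbox), t.id))
                .filter(|(gap, _)| *gap <= max_gap)
                .min_by(|a, b| a.0.total_cmp(&b.0))
                .map(|(_, table_id)| (cap.id, table_id))
        })
        .collect()
}
```

The same matching would apply to the caption/footer to image grouping in the document-merge step, with `Kind::Table` swapped for an image kind.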
Render Document
- JSON renderer
  - Crop images and save them to a directory if the --save_image flag is set
- HTML renderer
- Markdown renderer (based on html renderer)
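A sketch of the HTML renderer over a simplified, hypothetical block enum. Escaping matters because extracted PDF text routinely contains `<` and `&`. A Markdown renderer built on top of it would consume this output; that step is not shown.

```rust
/// Hypothetical, heavily simplified document block.
#[derive(Debug, Clone)]
pub enum Block {
    Title { level: u8, text: String },
    Text(String),
    List(Vec<String>),
}

/// Escapes the characters that are significant in HTML text and attributes.
fn escape_html(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(c),
        }
    }
    out
}

/// Renders blocks in order; title levels are clamped to the valid <h1>..<h6> range.
pub fn render_html(blocks: &[Block]) -> String {
    let mut html = String::new();
    for block in blocks {
        match block {
            Block::Title { level, text } => {
                let l = (*level).clamp(1, 6);
                html.push_str(&format!("<h{l}>{}</h{l}>\n", escape_html(text)));
            }
            Block::Text(t) => html.push_str(&format!("<p>{}</p>\n", escape_html(t))),
            Block::List(items) => {
                html.push_str("<ul>\n");
                for item in items {
                    html.push_str(&format!("<li>{}</li>\n", escape_html(item)));
                }
                html.push_str("</ul>\n");
            }
        }
    }
    html
}
```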
CLI ferrules
- Add variables
- Add debug flag
- Add range flag
- Configure hyperparams/execution providers
- Add export format -> JSON (default) or markdown
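The range flag needs a parser. A minimal sketch, assuming a hypothetical syntax of `N` or `START-END` (1-based, inclusive) that is converted to a 0-based, half-open `Range` for indexing pages; the actual flag syntax is not specified in this roadmap.

```rust
use std::ops::Range;

/// Parses a hypothetical `--range` value such as "3-7" (1-based, inclusive) or "4"
/// into a 0-based, half-open page range.
pub fn parse_page_range(s: &str) -> Result<Range<usize>, String> {
    let (start, end) = match s.split_once('-') {
        Some((a, b)) => (parse_page(a)?, parse_page(b)?),
        None => {
            let p = parse_page(s)?;
            (p, p)
        }
    };
    if start > end {
        return Err(format!("invalid range {s:?}: start > end"));
    }
    // 1-based inclusive [start, end] -> 0-based half-open [start - 1, end)
    Ok(start - 1..end)
}

/// Parses a single 1-based page number, rejecting 0.
fn parse_page(s: &str) -> Result<usize, String> {
    let n: usize = s.trim().parse().map_err(|e| format!("invalid page {s:?}: {e}"))?;
    if n == 0 {
        return Err("pages are 1-based".to_string());
    }
    Ok(n)
}
```

With clap's derive API, a function of this shape can be passed as a custom `value_parser` so bad ranges are rejected at argument-parsing time.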
Build pdfium statically for Linux
Change NMS algorithm to a more robust one
Add tracing to core
Configurable inference params: ORTProviders, batch_size, confidence_score, NMS, ...
eyre | thiserror for custom errors
OCR: Find good recognition model for (target_os != macos)
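For the eyre | thiserror item: thiserror derives `Display`, `Error` and `From` impls for a library error enum, while eyre is typically used for reporting at the application boundary. The hand-written, std-only equivalent below shows roughly what such a derive would generate; the `ParseError` variants, messages and the `check_page` helper are hypothetical.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical library error type.
#[derive(Debug)]
pub enum ParseError {
    Io(std::io::Error),
    InvalidPage { page: usize, total: usize },
    Inference(String),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::Io(e) => write!(f, "I/O error: {e}"),
            ParseError::InvalidPage { page, total } => {
                write!(f, "page {page} out of range (document has {total} pages)")
            }
            ParseError::Inference(msg) => write!(f, "layout inference failed: {msg}"),
        }
    }
}

impl Error for ParseError {
    // Expose the underlying I/O error so reporters can print the cause chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            ParseError::Io(e) => Some(e),
            _ => None,
        }
    }
}

// Lets `?` convert std::io::Error into ParseError automatically.
impl From<std::io::Error> for ParseError {
    fn from(e: std::io::Error) -> Self {
        ParseError::Io(e)
    }
}

/// Hypothetical helper that validates a 0-based page index.
pub fn check_page(page: usize, total: usize) -> Result<(), ParseError> {
    if page >= total {
        return Err(ParseError::InvalidPage { page, total });
    }
    Ok(())
}
```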
Inference:
- Export onnx layout model with dynamic batch_size
- Run layout on &[DynamicImage]
- Implement Linux/CUDA inference (EP)
- Batch inference on pages (for NVIDIA GPUs; batching on macOS didn't yield good results)
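The batching item boils down to chunking pages and keeping results in page order. A sketch, assuming a hypothetical `infer_batch` callback that would wrap the ORT session run and return one result per input page.

```rust
/// Runs `infer_batch` over `pages` in chunks of `batch_size` and returns
/// one result per page, in the original page order.
pub fn run_batched<P, R, F>(pages: &[P], batch_size: usize, mut infer_batch: F) -> Vec<R>
where
    F: FnMut(&[P]) -> Vec<R>,
{
    assert!(batch_size > 0, "batch_size must be positive");
    let mut out = Vec::with_capacity(pages.len());
    // The last chunk may be smaller than batch_size; a model exported with a
    // dynamic batch dimension accepts it without padding.
    for batch in pages.chunks(batch_size) {
        let results = infer_batch(batch);
        assert_eq!(results.len(), batch.len(), "expected one result per page");
        out.extend(results);
    }
    out
}
```

Keeping `batch_size` a runtime parameter makes it easy to use 1 on macOS and a larger value on CUDA.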
API
- Full OTEL + sentry tracing in API
- Clap API + Env variables
- Unify Config for env/CLI/API
- Dynamic batching of document(pages) to process
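One way to unify configuration across env/CLI/API is layered overrides: defaults, then environment variables, then explicit CLI flags or API fields. Everything here is hypothetical: the struct, the `FERRULES_*` variable names and the default values are placeholders, not ferrules' actual settings.

```rust
/// Hypothetical inference settings shared by the CLI and the API.
#[derive(Debug, Clone, PartialEq)]
pub struct InferenceConfig {
    pub batch_size: usize,
    pub confidence_score: f32,
    pub iou_threshold: f32,
}

impl Default for InferenceConfig {
    fn default() -> Self {
        // Placeholder defaults, not values taken from ferrules.
        Self { batch_size: 1, confidence_score: 0.3, iou_threshold: 0.5 }
    }
}

/// Partial settings from one source (env, CLI flags, API request); `None` means "not set".
#[derive(Debug, Default)]
pub struct Overrides {
    pub batch_size: Option<usize>,
    pub confidence_score: Option<f32>,
    pub iou_threshold: Option<f32>,
}

impl Overrides {
    /// Reads overrides through a key lookup so it can be tested without touching the
    /// process environment. Unparseable values are ignored (treated as unset).
    pub fn from_env(get: impl Fn(&str) -> Option<String>) -> Self {
        Self {
            batch_size: get("FERRULES_BATCH_SIZE").and_then(|v| v.parse().ok()),
            confidence_score: get("FERRULES_CONFIDENCE_SCORE").and_then(|v| v.parse().ok()),
            iou_threshold: get("FERRULES_IOU_THRESHOLD").and_then(|v| v.parse().ok()),
        }
    }
}

impl InferenceConfig {
    /// Applies the set fields of `o` on top of `self`; later merges win.
    pub fn merge(mut self, o: &Overrides) -> Self {
        if let Some(v) = o.batch_size {
            self.batch_size = v;
        }
        if let Some(v) = o.confidence_score {
            self.confidence_score = v;
        }
        if let Some(v) = o.iou_threshold {
            self.iou_threshold = v;
        }
        self
    }
}
```

Pass `|k| std::env::var(k).ok()` to read the real environment. Silently dropping unparseable values keeps the sketch short; a real implementation would probably report them.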
Optim
- Determine page orientation + deskew
- Optimize layout model for ANE -> Look at changing shapes and operators to maximize ANE perf
- ORT inference in fp16/mixed precision
- Move to other YOLO versions: yolov11s (yolo-doclaynet) seems better with fewer params
- Explore arena allocators (one per page)
- String -> CowStr
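For the String -> CowStr item, the point is to borrow when text needs no change and allocate only when it does. A sketch using std's `Cow<'_, str>` as a stand-in for the `CowStr` type named above; the whitespace-normalization function is a hypothetical example.

```rust
use std::borrow::Cow;

/// Collapses every run of whitespace into a single ASCII space.
/// Borrows the input when it is already normalized, so the common case allocates nothing.
pub fn normalize_ws(s: &str) -> Cow<'_, str> {
    // First pass: detect whether any change is needed (a non-space whitespace char,
    // or a whitespace char directly following another one).
    let needs_work = {
        let mut prev_ws = false;
        s.chars().any(|c| {
            let ws = c.is_whitespace();
            let bad = ws && (prev_ws || c != ' ');
            prev_ws = ws;
            bad
        })
    };
    if !needs_work {
        return Cow::Borrowed(s);
    }
    // Second pass: build the normalized copy.
    let mut out = String::with_capacity(s.len());
    let mut prev_ws = false;
    for c in s.chars() {
        if c.is_whitespace() {
            if !prev_ws {
                out.push(' ');
            }
            prev_ws = true;
        } else {
            out.push(c);
            prev_ws = false;
        }
    }
    Cow::Owned(out)
}
```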
Benchmark
- [ ] Compare to unstructured, tika, docling
