ROADMAP.md

Roadmap

  • Build pdfium statically for macOS

  • Parse document using pdfium

    • Parse char
    • Merge chars into CharSpans
    • Merge spans into Lines
  • Layout:

    • Find Layout Model and run with ORT
    • Accelerate Model on ANE/GPU
    • Extract Page Layout
      • Preprocess pdfium image
      • Postprocess tensor -> nms
      • Verify labels
    • Determine pages needing OCR (coverage lines/blocks)
    • OCR -> Use Apple Vision when target_os is macOS
    • Merge Layout with pdfium lines
      • Rescale or downscale line bbox / layout bbox
      • Merge intersecting lines (from pdfium and OCR) with max bbox into blocks
      • Add lines to bbox based on distance
      • Add remaining layout blocks to blocks based on position
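
The merge step above (rescale boxes, then attach each line to the block it intersects most) can be sketched as follows. This is a minimal illustration, not the project's code: `BBox`, `scale`, `assign_lines_to_blocks`, and the `min_coverage` parameter are hypothetical names introduced here, and "max bbox" is read as "largest intersection area", which is an assumption.

```rust
/// Axis-aligned bounding box (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

impl BBox {
    /// Area of the box, clamped to zero for degenerate boxes.
    pub fn area(&self) -> f32 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }

    /// Area of the overlap between two boxes, 0.0 if they do not intersect.
    pub fn intersection_area(&self, other: &BBox) -> f32 {
        let w = self.x1.min(other.x1) - self.x0.max(other.x0);
        let h = self.y1.min(other.y1) - self.y0.max(other.y0);
        if w > 0.0 && h > 0.0 {
            w * h
        } else {
            0.0
        }
    }

    /// Multiply every coordinate by `factor`, e.g. to bring image-space
    /// layout boxes and PDF-space line boxes into the same coordinate system.
    pub fn scale(&self, factor: f32) -> BBox {
        BBox {
            x0: self.x0 * factor,
            y0: self.y0 * factor,
            x1: self.x1 * factor,
            y1: self.y1 * factor,
        }
    }
}

/// For each line, return the index of the block with the largest overlap,
/// or `None` when that overlap covers less than `min_coverage` of the line.
pub fn assign_lines_to_blocks(
    lines: &[BBox],
    blocks: &[BBox],
    min_coverage: f32,
) -> Vec<Option<usize>> {
    lines
        .iter()
        .map(|line| {
            let mut best: Option<(usize, f32)> = None;
            for (i, block) in blocks.iter().enumerate() {
                let inter = line.intersection_area(block);
                if inter > 0.0 && best.map_or(true, |(_, b)| inter > b) {
                    best = Some((i, inter));
                }
            }
            best.filter(|&(_, inter)| {
                line.area() > 0.0 && inter / line.area() >= min_coverage
            })
            .map(|(i, _)| i)
        })
        .collect()
}
```

Lines that come back as `None` are the ones a later pass would attach by distance, as the next roadmap item describes.
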
  • Document merge:

    • Group listItems into a list: find the first item and merge subsequent items
    • Group caption/footer blocks with image blocks
    • Group Page header / Page footer
    • Move header element to the top of the page
    • Process SubHeader/Titles using k-means on line heights to get the title_level
    • Merge text block based on gap distance
    • Run Block processors (Text, List, PageHeader)
    • Get PDF Bookmarks (TOC) and reconcile detected titles with TOC and
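
The title-level item above (k-means on line heights) could look roughly like the sketch below. It assumes a plain 1-D k-means with evenly spread initial centroids and maps the tallest cluster to level 1. The function names `title_levels` and `nearest`, and the parameters `k` and `iters`, are hypothetical and not taken from the project.

```rust
/// Cluster line heights with 1-D k-means and return a title level per line,
/// where level 1 corresponds to the cluster with the largest mean height.
pub fn title_levels(heights: &[f32], k: usize, iters: usize) -> Vec<usize> {
    if heights.is_empty() || k == 0 {
        return Vec::new();
    }
    let min = heights.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = heights.iter().cloned().fold(f32::NEG_INFINITY, f32::max);

    // Deterministic initialisation: spread centroids evenly over [min, max].
    let mut centroids: Vec<f32> = (0..k)
        .map(|i| {
            if k == 1 {
                (min + max) / 2.0
            } else {
                min + (max - min) * i as f32 / (k - 1) as f32
            }
        })
        .collect();

    let mut assign = vec![0usize; heights.len()];
    for _ in 0..iters {
        // Assignment step: attach each height to its nearest centroid.
        for (a, h) in assign.iter_mut().zip(heights.iter()) {
            *a = nearest(&centroids, *h);
        }
        // Update step: move each centroid to the mean of its members.
        for (c, centroid) in centroids.iter_mut().enumerate() {
            let mut sum = 0.0f32;
            let mut n = 0u32;
            for (h, a) in heights.iter().zip(assign.iter()) {
                if *a == c {
                    sum += *h;
                    n += 1;
                }
            }
            if n > 0 {
                *centroid = sum / n as f32;
            }
        }
    }

    // Rank clusters by centroid, tallest first, and translate to levels.
    let mut order: Vec<usize> = (0..k).collect();
    order.sort_by(|&a, &b| centroids[b].total_cmp(&centroids[a]));
    let mut level_of = vec![0usize; k];
    for (rank, &c) in order.iter().enumerate() {
        level_of[c] = rank + 1;
    }
    assign.iter().map(|&c| level_of[c]).collect()
}

/// Index of the centroid closest to `h` (first one wins on ties).
fn nearest(centroids: &[f32], h: f32) -> usize {
    let mut best = 0;
    for (i, &c) in centroids.iter().enumerate() {
        if (h - c).abs() < (h - centroids[best]).abs() {
            best = i;
        }
    }
    best
}
```

In this sketch an empty cluster keeps its initial centroid, so levels may have gaps when fewer than `k` distinct heights exist. A real implementation would probably restrict the input to lines already labelled as titles.
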
  • Tables

    • Extract tables
    • Group captions with tables
  • Render Document

    • JSON renderer
      • Crop images and save in directory if --save_image flag
    • HTML renderer
    • Markdown renderer (based on html renderer)
  • CLI ferrules

    • Add variables
    • Add debug flag
    • Add range flag
    • Configure hyperparams/execution providers
    • Add export format -> JSON (default) or markdown
  • Build pdfium statically for Linux

  • Change NMS algorithm to more robust one
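
For reference when evaluating replacements, the classic greedy, class-agnostic IoU-based NMS is sketched below. The roadmap does not say which variant is currently used, so this is only a baseline under that assumption. The `Detection` struct, `iou`, `nms`, and both thresholds are hypothetical names for illustration.

```rust
/// One detected layout box with its confidence (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Detection {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
    pub score: f32,
    pub label: usize,
}

/// Intersection over union of two boxes, 0.0 when they do not overlap.
fn iou(a: &Detection, b: &Detection) -> f32 {
    let w = (a.x1.min(b.x1) - a.x0.max(b.x0)).max(0.0);
    let h = (a.y1.min(b.y1) - a.y0.max(b.y0)).max(0.0);
    let inter = w * h;
    let area_a = (a.x1 - a.x0) * (a.y1 - a.y0);
    let area_b = (b.x1 - b.x0) * (b.y1 - b.y0);
    let union = area_a + area_b - inter;
    if union > 0.0 {
        inter / union
    } else {
        0.0
    }
}

/// Greedy NMS: drop low-confidence boxes, then walk the rest from highest to
/// lowest score and keep a box only if it does not overlap a kept box by
/// more than `iou_threshold`. Labels are ignored (class-agnostic).
pub fn nms(mut dets: Vec<Detection>, min_score: f32, iou_threshold: f32) -> Vec<Detection> {
    dets.retain(|d| d.score >= min_score);
    dets.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut kept: Vec<Detection> = Vec::new();
    for d in dets {
        if kept.iter().all(|k| iou(k, &d) <= iou_threshold) {
            kept.push(d);
        }
    }
    kept
}
```

Possible directions for a "more robust" variant include per-class suppression or score decay instead of hard removal (Soft-NMS). Which of these suits document layouts is left open here.
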

  • Add tracing to core

  • Configurable inference params: ORT providers, batch_size, confidence_score, NMS, ...

  • eyre | thiserror for custom errors

  • OCR: Find good recognition model for (target_os != macos)

  • Inference:

    • Export onnx layout model with dynamic batch_size
    • Run layout on &[DynamicImage]
    • Implement Linux/CUDA inference (EP)
    • Batch inference on pages (for NVIDIA GPUs; batch_size on macOS didn't yield good results)
  • API

    • Full OTEL + sentry tracing in API
    • Clap API + Env variables
    • Unify Config for env/CLI/API
    • Dynamic batching of document(pages) to process
  • Optim

    • Determine page orientation + deskew
    • Optimize layout model for ANE -> Look at changing shapes and operators to maximize ANE perf
    • ORT inference in fp16/mixed precision
    • Move to other YOLO versions: yolov11s seems better with fewer params (yolo-doclaynet)
    • Explore arena allocators (one per page)
    • String -> CowStr
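
To illustrate the idea behind the `String -> CowStr` item, here is a small sketch using the standard library's `Cow<str>` as a stand-in. Whether `CowStr` refers to a crate-specific type is not stated in the roadmap, and `normalize_line` is a hypothetical cleanup function. The point is that text is borrowed when no change is needed and allocated only when it is.

```rust
use std::borrow::Cow;

/// Replace non-breaking spaces with regular spaces.
/// Returns a borrowed slice when the input is already clean, so the common
/// case costs no allocation; allocates a new `String` only on change.
pub fn normalize_line(text: &str) -> Cow<'_, str> {
    if text.contains('\u{00A0}') {
        Cow::Owned(text.replace('\u{00A0}', " "))
    } else {
        Cow::Borrowed(text)
    }
}
```

Callers can treat the result as `&str` through deref, and call `into_owned()` only where an owned `String` is really required.
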
  • Benchmark

    • Compare to unstructured, tika, docling