The mission
Build a desktop JavaFX application that lets non-technical users drop in invoices (PDF or image) and export clean CSV data. Invoices come in all shapes, and your job is to make the app read them the way a human would: robust parsing, confidence scoring, and easy validation.
What you’ll build
- JavaFX desktop app (Java 17+) with a clean, responsive UI
- Invoice ingestion: PDF, PNG/JPG, multi-page, batches, drag-and-drop
- AI/OCR pipeline (choose the best fit; a hybrid is fine):
  - Classical OCR (e.g., Tesseract) + layout analysis, or
  - Cloud OCR (e.g., AWS Textract, Google Vision), or
  - LLM-assisted parsing (prompting/JSON schema) with guardrails
- Field extraction (line-items + headers): vendor, invoice #, dates, currency, taxes, subtotals/totals, PO, line descriptions, qty, unit price, amounts
- Validation & review UI: highlight zones, flag low-confidence fields, quick fixes, autocomplete
- CSV export: stable schema, locale/number/date normalization
- Rules & heuristics: vendor templates, regex fallbacks, learned patterns
- Quality metrics: confidence scores, per-field accuracy, reject reasons, simple analytics
- Operate offline where possible with optional cloud connectors
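To give a feel for the "locale/number/date normalization" item above, here is a minimal sketch in Java. It is an illustration only, not our existing code: the class name `InvoiceNormalizer`, its method names, and the list of accepted date patterns are all hypothetical, and the real target formats come from the CSV schema we provide.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper: turns locale-formatted invoice values into
// locale-neutral strings suitable for a stable CSV schema.
final class InvoiceNormalizer {

    // Assumed set of date layouts; a real app would pick these per vendor or locale.
    private static final List<DateTimeFormatter> DATE_FORMATS = List.of(
            DateTimeFormatter.ISO_LOCAL_DATE,
            DateTimeFormatter.ofPattern("dd.MM.uuuu"),
            DateTimeFormatter.ofPattern("dd/MM/uuuu"));

    // Parses an amount such as "1.234,56" (German) or "1,234.56" (US).
    // Returns empty when the whole string is not a valid number in that locale.
    static Optional<BigDecimal> parseAmount(String raw, Locale locale) {
        NumberFormat nf = NumberFormat.getNumberInstance(locale);
        if (!(nf instanceof DecimalFormat)) {
            return Optional.empty();
        }
        DecimalFormat format = (DecimalFormat) nf;
        format.setParseBigDecimal(true); // avoid double rounding on money values
        String text = raw.trim();
        ParsePosition pos = new ParsePosition(0);
        Number parsed = format.parse(text, pos);
        if (parsed == null || pos.getIndex() != text.length()) {
            return Optional.empty(); // unparsable or trailing garbage
        }
        return Optional.of((BigDecimal) parsed);
    }

    // Tries each known layout in order and returns the date as ISO-8601 (yyyy-MM-dd).
    static Optional<String> toIsoDate(String raw) {
        String text = raw.trim();
        for (DateTimeFormatter f : DATE_FORMATS) {
            try {
                return Optional.of(LocalDate.parse(text, f).toString());
            } catch (DateTimeParseException ignored) {
                // fall through to the next candidate layout
            }
        }
        return Optional.empty();
    }
}
```

A caller would pass the locale detected (or configured) for the vendor, for example `InvoiceNormalizer.parseAmount("1.234,56", Locale.GERMANY)`, and route any empty result to the review UI as a low-confidence field rather than guessing. Ambiguous layouts such as `03/04/2024` are deliberately read as day-first in this sketch; that choice is an assumption and would need to follow vendor rules in practice.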
You’re a great fit if you have
- 4+ years Java; 2+ years JavaFX building production desktop apps
- Real-world OCR/NLP or document understanding experience (invoices, receipts, forms)
- Hands-on experience with one or more of: Tesseract, Textract, Google Vision, Azure Form Recognizer, OpenCV, spaCy, LLM JSON extraction
- Comfort designing parsing pipelines: pre-processing, layout detection, table extraction, post-processing, and human-in-the-loop review
- Strong data-wrangling skills: CSV schemas, date/currency parsing, edge cases
- Solid testing habits: golden files, fixture PDFs, deterministic pipelines
Nice to have
- Prompt engineering for structured outputs with LLMs
- Vendor-specific templating and auto-learning
- Experience with Maven/Gradle, native packaging, code signing
- Knowledge of ONNX/TensorFlow Lite models for document layout
- Basic DevOps for OCR services and model hosting
Tech we expect to use (flexible)
Java 17+, JavaFX, Gradle, Tesseract/OpenCV or Textract/Vision, optional Python micro-services for ML bits, SQLite for local cache, JUnit + test fixtures, GitHub Actions CI.
Success looks like
- ≥95% header-field accuracy on a mixed test set
- ≥90% line-item recall on clear tabular invoices
- A reviewer can correct a typical invoice in the review UI in <60 seconds
- One-click CSV export that matches our schema and loads cleanly
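As a rough illustration of how the accuracy and recall targets above could be measured against labeled samples, here is a minimal Java sketch. The class `ExtractionMetrics`, the exact-match comparison, and the representation of a line item as a single string are assumptions made for the example; the authoritative definitions are whatever the acceptance tests encode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical scoring helper for comparing extracted data with labeled ground truth.
final class ExtractionMetrics {

    // Share of expected header fields whose extracted value matches exactly
    // after trimming whitespace. Missing fields count as wrong.
    static double headerAccuracy(Map<String, String> expected,
                                 Map<String, String> extracted) {
        if (expected.isEmpty()) {
            return 1.0;
        }
        long correct = expected.entrySet().stream()
                .filter(e -> e.getValue().trim()
                        .equals(extracted.getOrDefault(e.getKey(), "").trim()))
                .count();
        return (double) correct / expected.size();
    }

    // Share of expected line items that appear among the extracted ones.
    // Each extracted item can satisfy at most one expected item.
    static double lineItemRecall(List<String> expected, List<String> extracted) {
        if (expected.isEmpty()) {
            return 1.0;
        }
        List<String> remaining = new ArrayList<>(extracted);
        int found = 0;
        for (String item : expected) {
            if (remaining.remove(item)) {
                found++;
            }
        }
        return (double) found / expected.size();
    }
}
```

Under these assumptions, a run over the mixed test set would average `headerAccuracy` across invoices and compare the result with the 95% target, and do the same with `lineItemRecall` against the 90% target for clearly tabular invoices.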
What we provide
- Labeled sample invoices (PDFs/images) across vendors
- Target CSV schema + acceptance tests
- Design mocks for the core screens
- Fast feedback loop with a technical product owner