Unlimited-OCR → Core AI (on-device document OCR)

On-device document → structured-markdown OCR, end-to-end on Apple Core AI. A port of baidu/Unlimited-OCR (3B-A0.5B MoE, MIT): drop in a document image and get back markdown with tables as HTML (<table><tr><td>…), formulas as LaTeX, reading order, and <|det|> layout boxes. Japanese + English + multilingual.

Runs on the stock coreai.runtime with no engine patch: the decoder is driven directly on inputs_embeds, so this is a pure-export port (not the static-input-buffer VLM path).

Use it

▶️ Run it (source) with the ReadDoc runner (GUI + CLI, one app for every document-OCR model in the catalog):

git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ReadDoc/ReadDoc.xcodeproj
# → Run, then pick "Unlimited-OCR" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/ReadDoc
swift run readdoc-cli --model unlimited-ocr --image sample.png

💻 Build with it. This example is complete; the glue is all kit API, so it runs as pasted:

import CoreAIKit
import Foundation

// Any local page image; "sample.png" is a placeholder path.
let imageURL = URL(fileURLWithPath: "sample.png")

let reader = try await KitDocReader(catalog: "unlimited-ocr")
let markdown = try await reader.read(imageAt: imageURL)
// markdown: the document as structured text (tables as <table>/<tr>/<td>,
// <|det|> layout boxes, reading order), fully on-device

The take-home is Examples/ReadDoc/Sources/QuickStart.swift: this exact code as one typed function, no UI. The CLI is an argument shell over it, and the GUI drives the same KitDocReader(catalog:) on the image you pick. Make one read(imageAt:) call per page; split a PDF into page images first. The output keeps the model's structural markup (tables as HTML, formulas as LaTeX, <|det|> boxes); strip or render it as your app prefers.
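Since the output keeps tables as HTML, a common first post-processing step is to pull those tables out for separate rendering or indexing. The following is a minimal sketch, not part of the kit: extractTables is a hypothetical helper, and it assumes each table arrives as a well-formed, non-nested <table>…</table> span.

```swift
import Foundation

/// Splits OCR output into its HTML tables and the text around them.
/// Hypothetical helper; assumes non-nested, well-formed <table>...</table> spans.
func extractTables(from markdown: String) -> (tables: [String], text: String) {
    // Non-greedy match that may span newlines, case-insensitive.
    let regex = try! NSRegularExpression(
        pattern: "<table\\b[^>]*>.*?</table>",
        options: [.caseInsensitive, .dotMatchesLineSeparators]
    )
    let full = NSRange(markdown.startIndex..., in: markdown)
    // Collect every matched table span as its own string.
    let tables: [String] = regex.matches(in: markdown, options: [], range: full)
        .compactMap { match in
            Range(match.range, in: markdown).map { String(markdown[$0]) }
        }
    // Remove the table spans to leave the surrounding text.
    let text = regex.stringByReplacingMatches(
        in: markdown, options: [], range: full, withTemplate: ""
    )
    return (tables, text)
}
```

Call it on the string returned by read(imageAt:), then hand the tables to an HTML renderer and the remaining text to your markdown pipeline.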

Integration checklist

  • SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKit
  • Info.plist: none needed
  • Entitlements: none needed
  • First run downloads the model (4.5 GB on Mac), then it loads from the local cache in Application Support. Download progress is reported through the downloadProgress callback.
  • Measure in Release: Debug is ~3× slower on per-token host work

What's exciting (why you'd use it)

  • Private OCR: invoices, receipts, contracts, papers, forms never leave the device.
  • Structured, not just text: tables → HTML, equations → LaTeX, layout → boxes. RAG-ready ingestion.
  • Flat latency: a static-shape decode graph (data-driven KV write + fixed-buffer R-SWA mask) keeps every tensor shape constant, so the runtime compiles once and decode stays flat at 12.7 ms/token (79 tok/s on M4 Max), with no growing-cache recompilation stalls.
  • SOTA quality: the source model tops OmniDocBench v1.6 (93.92); this port is byte-faithful to the fp32 reference (decoder 0 flips at the sampled steps; vision encoder cos 1.000000).
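To relate the flat-latency figure to the quoted throughput, the arithmetic can be sketched as below. The function names and the 1,000-token page are illustrative assumptions, and the estimate covers decode only (no vision encoding or prefill).

```swift
/// Converts a per-token decode latency into throughput.
func tokensPerSecond(msPerToken: Double) -> Double {
    1000.0 / msPerToken
}

/// With a flat per-token cost, decode time grows linearly with output length.
func estimatedDecodeSeconds(tokens: Int, msPerToken: Double) -> Double {
    Double(tokens) * msPerToken / 1000.0
}

// 12.7 ms/token is the figure quoted above for M4 Max.
let rate = tokensPerSecond(msPerToken: 12.7)              // about 78.7 tok/s
// Hypothetical page that decodes to 1,000 tokens.
let pageSeconds = estimatedDecodeSeconds(tokens: 1_000, msPerToken: 12.7)
```

Because the per-token cost does not grow with the cache, a page that emits twice as many tokens should take roughly twice as long to decode.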

Bundles

| path | what | dtype | size |
| vision/unlimited_ocr_vision.aimodel | DeepEncoder (SAM-ViT + CLIP-ViT cascade) → 100 visual tokens | fp16 | 762 MB |
| decoder/unlimited_ocr_decoder.aimodel | DeepseekV2 R-SWA MoE decoder, functions prefill + decode sharing one weight set + KV state | sym8 | 3.2 GB |
| assets/embed_tokens.f16 | token embedding table [129280,1280] (host row-gather) | fp16 | 316 MB |
| assets/{image_newline,view_seperator}.f16, assets/prompt_input_ids.i32, assets/recipe.json | arrangement constants + the assembly recipe | - | tiny |
| tokenizer/ | fast tokenizer (tokenizer.json + configs) | - | - |
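The "host row-gather" on embed_tokens.f16 amounts to copying one 1280-value fp16 row per token ID out of a flat file. A minimal sketch follows, assuming the file is a headerless, row-major [129280,1280] fp16 array; that layout is an assumption here, so check it against assets/recipe.json. rowRange and gatherRows are hypothetical names, and the rows are kept as raw bytes so the code does not depend on a Float16 type.

```swift
import Foundation

let hiddenSize = 1280     // columns, from the table shape [129280, 1280]
let bytesPerValue = 2     // fp16

/// Byte range of one token's embedding row in a row-major fp16 table.
func rowRange(tokenID: Int) -> Range<Int> {
    let rowBytes = hiddenSize * bytesPerValue
    return tokenID * rowBytes ..< (tokenID + 1) * rowBytes
}

/// Gathers the rows for `tokenIDs` into one contiguous buffer of raw fp16 bytes.
/// Assumes `table` is a whole Data value whose indices start at 0.
func gatherRows(table: Data, tokenIDs: [Int]) -> Data {
    var out = Data(capacity: tokenIDs.count * hiddenSize * bytesPerValue)
    for id in tokenIDs {
        out.append(table.subdata(in: rowRange(tokenID: id)))
    }
    return out
}
```

In an app the table would typically be memory-mapped, for example with Data(contentsOf: url, options: .mappedIfSafe), so only the gathered rows are actually read. The full table is 129280 × 1280 × 2 = 330,956,800 bytes, which matches the 316 MB listed above.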

Pipeline (Base mode, 640px)

image → preprocess (pad to 640², normalize mean=std=0.5)
      → vision .aimodel                         → visual tokens [1,100,1280]
      → arrange (10×10 + image_newline per row + view_seperator) → [111,1280]
      → scatter into embed_tokens(prompt_ids)   → prefix [1,115,1280]
      → decoder: prefill(prefix) + greedy decode (no_repeat_ngram=35) → tokens
      → detokenize (keep special tokens)        → markdown
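The arrange step accounts for the jump from 100 to 111 rows: 10 rows of 10 visual tokens, one image_newline after each row, and one closing view_seperator, so 10 × (10 + 1) + 1 = 111. A minimal sketch, assuming row-major token order and exactly that placement (the authoritative order is in assets/recipe.json; arrange is a hypothetical name):

```swift
/// Arranges a grid of visual tokens into the decoder's image span:
/// each row of `grid` tokens is followed by `imageNewline`, and the
/// whole grid is closed by `viewSeparator`.
/// Assumes `visual` is in row-major order (an assumption of this sketch).
func arrange(visual: [[Float]], imageNewline: [Float], viewSeparator: [Float],
             grid: Int = 10) -> [[Float]] {
    precondition(visual.count == grid * grid, "expected grid*grid visual tokens")
    var out: [[Float]] = []
    out.reserveCapacity(grid * (grid + 1) + 1)
    for row in 0..<grid {
        // Copy one row of the grid, then mark the end of the row.
        out.append(contentsOf: visual[row * grid ..< (row + 1) * grid])
        out.append(imageNewline)
    }
    // One separator closes the view.
    out.append(viewSeparator)
    return out
}
```

With 1280-wide vectors this yields the [111,1280] block that is then scattered into the prompt embeddings to form the [1,115,1280] prefix.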

The exact, verified recipe is in assets/recipe.json. Reference implementations (Python end-to-end + a macOS app, CoreAIOCR, driving the stock runtime) are in the Core AI Model Zoo: conversion/unlimited_ocr/ and apps/CoreAIOCR/.
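The decode step in the pipeline is greedy with no_repeat_ngram=35. As an illustration only, the sketch below implements the common "no repeated n-gram" rule: a token is banned if emitting it would reproduce an n-gram already present in the history. Whether the reference implementations use exactly this rule (for example, with an additional window limit) is an assumption; bannedTokens and greedyStep are hypothetical names.

```swift
/// Tokens that would complete an n-gram already present in `history`.
func bannedTokens(history: [Int], ngram n: Int) -> Set<Int> {
    guard n > 0, history.count >= n else { return [] }
    // The last n-1 tokens are the prefix the next token would extend.
    let prefix = Array(history.suffix(n - 1))
    var banned = Set<Int>()
    for start in 0...(history.count - n) {
        // If an earlier n-gram starts with the same prefix, ban its last token.
        if Array(history[start ..< start + n - 1]) == prefix {
            banned.insert(history[start + n - 1])
        }
    }
    return banned
}

/// Greedy pick: the highest-logit token that is not banned.
/// Returns -1 if every token is banned.
func greedyStep(logits: [Float], history: [Int], ngram n: Int = 35) -> Int {
    let banned = bannedTokens(history: history, ngram: n)
    var best = -1
    for (token, logit) in logits.enumerated() where !banned.contains(token) {
        if best < 0 || logit > logits[best] { best = token }
    }
    return best
}
```

A large n such as 35 leaves ordinary repetition (table rows, repeated words) alone and only cuts off long degenerate loops.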

Notes

  • Appropriate input: clean single-page documents (invoice / paper / report / table / formula), roughly square or portrait, with text still legible when fit to 640². Very dense small-text scans (newspaper) need the tiled crop_mode vision export, which is not included here (this bundle is Base mode only).
  • Prompt is fixed to document parsing (layout + structured extraction).
  • License: MIT (inherited from baidu/Unlimited-OCR).
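To judge whether a page will stay legible at 640², it helps to compute how much it shrinks. The sketch below works out the scale-to-fit geometry and the mean=std=0.5 normalization from the pipeline. Letterbox, letterbox, and normalize are hypothetical names, and centering the content inside the padding is an assumption; the kit's own preprocessing may place it differently.

```swift
/// Scale-to-fit size plus padding offsets for a square model input.
struct Letterbox {
    let scaledWidth: Int, scaledHeight: Int
    let padLeft: Int, padTop: Int
}

func letterbox(width: Int, height: Int, side: Int = 640) -> Letterbox {
    // Uniform scale so the longer edge becomes `side`.
    let scale = Double(side) / Double(max(width, height))
    let w = max(1, Int((Double(width) * scale).rounded()))
    let h = max(1, Int((Double(height) * scale).rounded()))
    // Centering the content is an assumption of this sketch.
    return Letterbox(scaledWidth: w, scaledHeight: h,
                     padLeft: (side - w) / 2, padTop: (side - h) / 2)
}

/// Maps an 8-bit channel value to [-1, 1] using mean = std = 0.5.
func normalize(_ value: UInt8) -> Float {
    (Float(value) / 255 - 0.5) / 0.5
}
```

For example, a 1240×1754 scan scales by about 0.365 to 452×640, so text that was 20 px tall in the scan ends up roughly 7 px tall in the model input.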

Community port, not affiliated with Apple or Baidu.
