Unlimited-OCR → Core AI (on-device document OCR)

On-device document → structured-markdown OCR, end-to-end on Apple Core AI. A port of baidu/Unlimited-OCR (3B-A0.5B MoE, MIT): drop a document image, get back markdown — tables as HTML (<table><tr><td>…), formulas as LaTeX, reading order, and <|det|> layout boxes. Japanese + English + multilingual.

Runs on the stock coreai.runtime with no engine patch — the decoder is driven directly on inputs_embeds, so this is a pure-export port (not the static-input-buffer VLM path).

Use it

▶️ Run it (source) — the ReadDoc runner (GUI + CLI, one app for every document-OCR model in the catalog):

git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/ReadDoc/ReadDoc.xcodeproj
# → Run, then pick "Unlimited-OCR" in the model picker

# agents / headless (macOS):
cd coreai-kit/Examples/ReadDoc
swift run readdoc-cli --model unlimited-ocr --image sample.png

💻 Build with it: the snippet is complete and the glue is kit API, so it runs as pasted:

import CoreAIKit

let reader = try await KitDocReader(catalog: "unlimited-ocr")
let markdown = try await reader.read(imageAt: imageURL)
// markdown: the document as structured text — tables as <table>/<tr>/<td>,
// <|det|> layout boxes, reading order — fully on-device

The take-home is Examples/ReadDoc/Sources/QuickStart.swift — this exact code as one typed function, no UI; the CLI is an argument shell over it, and the GUI drives the same KitDocReader(catalog:) on the image you pick. One read(imageAt:) call per page; chunk a PDF into page images first. The output keeps the model's structural markup (tables as HTML, formulas as LaTeX, <|det|> boxes) — strip or render it as your app prefers.
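If your app wants plain markdown without the layout boxes, a post-processing pass along these lines may be enough. This is a minimal sketch, not part of CoreAIKit: `stripLayoutBoxes` is a hypothetical helper, and it assumes each box is emitted as a `<|det|>…<|/det|>` span, which the text above does not confirm. Check real output and adjust the pattern before relying on it.

```swift
import Foundation

/// Removes layout-box spans from OCR output (hypothetical helper).
/// ASSUMPTION: boxes are wrapped as <|det|>...<|/det|>; the real markup
/// may differ, so verify against actual model output.
func stripLayoutBoxes(_ markdown: String) -> String {
    // Raw string literal so the regex backslashes reach the engine unchanged.
    let pattern = #"<\|det\|>.*?<\|/det\|>"#
    return markdown.replacingOccurrences(
        of: pattern,
        with: "",
        options: .regularExpression
    )
}

// Made-up sample text, only to exercise the helper.
let sample = "# Invoice<|det|>[[12, 30, 480, 64]]<|/det|>\nTotal: 42"
let cleaned = stripLayoutBoxes(sample)
print(cleaned)
```

HTML tables and LaTeX formulas pass through untouched, so a markdown renderer can still display them.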

Integration checklist

SPM: https://github.com/john-rocky/coreai-kit → product CoreAIKit
Info.plist: none needed
Entitlements: none needed
First run downloads the model — 4.5 GB (Mac) — then it loads from the local cache (Application Support; progress via the downloadProgress callback)
Measure in Release — Debug is ~3× slower on per-token host work

What's exciting (why you'd use it)

Private OCR: invoices, receipts, contracts, papers, forms never leave the device.
Structured, not just text: tables → HTML, equations → LaTeX, layout → boxes. RAG-ready ingestion.
Flat latency: a static-shape decode graph (data-driven KV write + fixed-buffer R-SWA mask) keeps every tensor shape constant, so the runtime compiles once and decode stays flat at ~12.7 ms/token (~79 tok/s on M4 Max), with no growing-cache recompilation stalls.
SOTA quality: the source model tops OmniDocBench v1.6 (93.92); this port is byte-faithful to the fp32 reference (decoder: 0 flips at the sampled steps; vision encoder: cosine similarity 1.000000).
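The flat-latency point can be pictured with a generic fixed-capacity sliding-window cache. The sketch below is an illustration under stated assumptions, not the port's actual graph: `FixedWindowCache`, its ring-buffer slot rule, and the window semantics are all hypothetical, since the R-SWA details are not spelled out here. What it shows is the general idea that the write slot comes from data (the token position) and the mask always has the same length.

```swift
/// Fixed-capacity KV bookkeeping: every array keeps the same length for the
/// whole generation, so nothing downstream ever changes shape.
/// Hypothetical sketch; not the port's actual decode graph.
struct FixedWindowCache {
    let capacity: Int
    /// Absolute token position held by each slot, or -1 while empty.
    private(set) var slotPositions: [Int]

    init(capacity: Int) {
        self.capacity = capacity
        self.slotPositions = Array(repeating: -1, count: capacity)
    }

    /// The write slot is computed from data (the position), not from a
    /// growing tensor dimension.
    mutating func write(position: Int) -> Int {
        let slot = position % capacity
        slotPositions[slot] = position
        return slot
    }

    /// Additive attention mask of constant length `capacity`:
    /// 0 where the slot is visible, -infinity where it is empty or too old.
    func mask(queryPosition: Int, window: Int) -> [Float] {
        slotPositions.map { (stored: Int) -> Float in
            let visible = stored >= 0
                && stored <= queryPosition
                && queryPosition - stored < window
            return visible ? 0 : -Float.infinity
        }
    }
}

var cache = FixedWindowCache(capacity: 4)
for position in 0..<6 { _ = cache.write(position: position) }
let windowMask = cache.mask(queryPosition: 5, window: 3)
print(windowMask.count, windowMask)
```

Because the mask length never depends on how many tokens have been generated, a runtime that compiles per shape would only need to compile this step once.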

Bundles

| path | what | dtype | size |
|---|---|---|---|
| `vision/unlimited_ocr_vision.aimodel` | DeepEncoder (SAM-ViT + CLIP-ViT cascade) → 100 visual tokens | fp16 | 762 MB |
| `decoder/unlimited_ocr_decoder.aimodel` | DeepseekV2 R-SWA MoE decoder, functions `prefill` + `decode` sharing one weight set + KV state | sym8 | 3.2 GB |
| `assets/embed_tokens.f16` | token embedding table `[129280,1280]` (host row-gather) | fp16 | 316 MB |
| `assets/{image_newline,view_seperator}.f16`, `assets/prompt_input_ids.i32`, `assets/recipe.json` | arrangement constants + the assembly recipe | - | tiny |
| `tokenizer/` | fast tokenizer (`tokenizer.json` + configs) | - | - |
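The "host row-gather" on `assets/embed_tokens.f16` amounts to reading one 1280-value row per token id out of a flat fp16 file. A minimal sketch follows, assuming the file is row-major and little-endian with no header (neither is stated above); `gatherRow` is a hypothetical helper, not kit API. It returns raw fp16 bit patterns so it does not depend on `Float16` being available on the host.

```swift
import Foundation

/// Reads one embedding row as raw fp16 bit patterns (hypothetical helper).
/// ASSUMPTION: the table is row-major, little-endian, and has no header.
func gatherRow(tokenID: Int, table: Data, hiddenSize: Int) -> [UInt16] {
    let rowBytes = hiddenSize * 2              // 2 bytes per fp16 value
    let start = tokenID * rowBytes
    precondition(tokenID >= 0 && start + rowBytes <= table.count,
                 "token id out of range")
    let base = table.startIndex + start        // stay correct for Data slices
    return (0..<hiddenSize).map { (i: Int) -> UInt16 in
        let lo = UInt16(table[base + 2 * i])
        let hi = UInt16(table[base + 2 * i + 1])
        return lo | (hi << 8)                  // assemble little-endian
    }
}

// Tiny stand-in table: 3 rows x 4 columns holding the values 0...11.
var demoTable = Data()
for value in 0..<12 {
    let v = UInt16(value)
    demoTable.append(UInt8(v & 0xFF))
    demoTable.append(UInt8(v >> 8))
}
let demoRow = gatherRow(tokenID: 2, table: demoTable, hiddenSize: 4)
print(demoRow)
```

For the real table the same arithmetic gives 129280 × 1280 × 2 = 330,956,800 bytes, about 316 MiB, which matches the listed size. In production you would memory-map the file rather than index byte by byte.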

Pipeline (Base mode, 640px)

image → preprocess (pad to 640², normalize mean=std=0.5)
      → vision .aimodel                         → visual tokens [1,100,1280]
      → arrange (10×10 + image_newline per row + view_seperator) → [111,1280]
      → scatter into embed_tokens(prompt_ids)   → prefix [1,115,1280]
      → decoder: prefill(prefix) + greedy decode (no_repeat_ngram=35) → tokens
      → detokenize (keep special tokens)        → markdown
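The arrange step's row count can be checked with a small sketch: 100 visual tokens, plus one `image_newline` per grid row, plus one `view_seperator`, gives 111 rows. The ordering below (newline at the end of each row, separator last) is an assumption; `assets/recipe.json` is the authority. `arrange` is a hypothetical helper, and the demo uses a hidden size of 2 instead of 1280 to stay readable.

```swift
/// Interleaves a gridSide x gridSide block of visual tokens with one newline
/// embedding per row and a single trailing separator (hypothetical helper).
/// ASSUMPTION: newline closes each row and the separator comes last.
func arrange(visual: [[Float]],
             gridSide: Int,
             imageNewline: [Float],
             viewSeparator: [Float]) -> [[Float]] {
    precondition(visual.count == gridSide * gridSide, "expected a full grid")
    var rows: [[Float]] = []
    rows.reserveCapacity(gridSide * (gridSide + 1) + 1)
    for row in 0..<gridSide {
        let start = row * gridSide
        rows.append(contentsOf: visual[start..<(start + gridSide)])
        rows.append(imageNewline)
    }
    rows.append(viewSeparator)
    return rows
}

// Stand-in embeddings: token i is [i, 0]; markers use negative values.
let visualTokens: [[Float]] = (0..<100).map { [Float($0), 0] }
let newlineRow: [Float] = [-1, -1]
let separatorRow: [Float] = [-2, -2]
let arranged = arrange(visual: visualTokens,
                       gridSide: 10,
                       imageNewline: newlineRow,
                       viewSeparator: separatorRow)
print(arranged.count)
```

Since the prefix is `[1,115,1280]`, the remaining 115 − 111 = 4 rows would come from the text prompt embeddings.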

The exact, verified recipe is in assets/recipe.json. Reference implementations (Python end-to-end and a macOS app, CoreAIOCR, driving the stock runtime) are in the Core AI Model Zoo: conversion/unlimited_ocr/ and apps/CoreAIOCR/.
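The decode step of the recipe combines greedy selection with a no-repeat n-gram constraint (`no_repeat_ngram=35`). The sketch below shows the generic technique only: ban any token that would complete an n-gram already present in the history, then take the best remaining logit. Both function names are hypothetical, and whether the reference implementation also limits the search to a recent window is not stated here.

```swift
/// Tokens that would recreate an n-gram already present in `history`
/// (generic technique; hypothetical helper).
func bannedNextTokens(history: [Int], ngramSize n: Int) -> Set<Int> {
    precondition(n >= 1, "n-gram size must be positive")
    guard history.count >= n else { return [] }
    let prefix = Array(history.suffix(n - 1))   // the n-1 most recent tokens
    var banned = Set<Int>()
    for start in 0...(history.count - n) {
        // If an earlier window matches the current prefix, the token that
        // followed it would repeat the n-gram, so ban it.
        if history[start..<(start + n - 1)].elementsEqual(prefix) {
            banned.insert(history[start + n - 1])
        }
    }
    return banned
}

/// Greedy choice over the logits, skipping banned token ids.
func greedyPick(logits: [Float], banned: Set<Int>) -> Int? {
    var best: Int? = nil
    for (token, score) in logits.enumerated() where !banned.contains(token) {
        if let current = best, logits[current] >= score { continue }
        best = token
    }
    return best
}

// n = 3 keeps the demo short; the recipe above uses 35.
let demoHistory = [1, 2, 3, 1, 2]
let demoBanned = bannedNextTokens(history: demoHistory, ngramSize: 3)
let demoChoice = greedyPick(logits: [0.1, 0.2, 0.9, 5.0], banned: demoBanned)
print(demoBanned, demoChoice as Any)
```

A large n such as 35 leaves ordinary repetition (table tags, common words) alone and only interrupts long verbatim loops.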

Notes

Appropriate input: clean single-page documents (invoice / paper / report / table / formula), roughly square or portrait, with text still legible when fit to 640². Very dense small-text scans (newspaper) want the tiled crop_mode vision export (not included here; Base mode only).
Prompt is fixed to document parsing (layout + structured extraction).
License: MIT (inherited from baidu/Unlimited-OCR).
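To judge whether a scan will stay legible at 640², it can help to compute how much it shrinks. The sketch below assumes the fit preserves aspect ratio and scales the longer side to 640 before padding. The text above only says "pad to 640²", so the resize rule and the division by 255 are assumptions; the mean = std = 0.5 normalization is from the pipeline. `fitToSquare` and `normalize` are hypothetical helpers.

```swift
/// Geometry of fitting an image into a square canvas (hypothetical helper).
struct FitResult: Equatable {
    let scale: Double
    let contentWidth: Int
    let contentHeight: Int
    let padX: Int   // total horizontal padding in pixels
    let padY: Int   // total vertical padding in pixels
}

/// ASSUMPTION: aspect ratio is preserved and the longer side maps to `side`.
func fitToSquare(width: Int, height: Int, side: Int = 640) -> FitResult {
    precondition(width > 0 && height > 0, "image must be non-empty")
    let scale = Double(side) / Double(max(width, height))
    let w = min(side, Int((Double(width) * scale).rounded()))
    let h = min(side, Int((Double(height) * scale).rounded()))
    return FitResult(scale: scale, contentWidth: w, contentHeight: h,
                     padX: side - w, padY: side - h)
}

/// Maps an 8-bit channel value to [-1, 1] using mean = std = 0.5.
/// ASSUMPTION: pixels are first scaled to [0, 1] by dividing by 255.
func normalize(_ byte: UInt8) -> Float {
    (Float(byte) / 255 - 0.5) / 0.5
}

// A4 page scanned at 300 dpi (2480 x 3508 px).
let a4 = fitToSquare(width: 2480, height: 3508)
let glyphHeightAfterFit = 40.0 * a4.scale   // a 40 px tall glyph in the scan
print(a4, glyphHeightAfterFit)
```

Under these assumptions a 300 dpi A4 scan shrinks to roughly 18% of its size, so a 40 px glyph ends up near 7 px tall, which is about where dense small print starts to become hard to read.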

Community port — not affiliated with Apple or baidu.
