Omniparse

The Problem

Most LLM workflows need to ingest documents — Excel sheets, slide decks, Python source, scanned PDFs, entire directory trees. The default move is to glue together five different parsers, each with its own dependency tree, error model, and output shape. The output usually still needs cleanup before a model can read it.

What I Built

@tyroneross/omniparse is a single SDK + CLI that takes those inputs and emits clean Markdown plus structured JSON. PDF support is text-based only: scanned PDFs need OCR, which Omniparse does not do. Published to npm as v1.0.0. The CLI is `omniparse <path>`; the SDK is a typed function call.

Monorepo

| Package | Purpose | Stack |
| --- | --- | --- |
| `packages/sdk` | Core parsing + CLI binary `omniparse` | TypeScript 5, tsup, xlsx, sax, p-limit |
| `packages/web` | Web app for upload + browse | Next.js 16, React 19, better-sqlite3, Radix UI, Tailwind 4 |
| `packages/mac` | Planned native Mac wrapper | SwiftUI |

The SDK is the contract; the web app and the future Mac app are surfaces over the same parser core.

What it parses

Excel — .xlsx (plus .xls, .csv, .tsv, .ods, .xlsb), multi-sheet, formulas resolved to values
PowerPoint — slide-by-slide markdown with notes preserved
PDF — text extraction (text-based PDFs only; no OCR)
Python — module → markdown with docstrings hoisted
Directories — recursive walk, file-type-aware, single combined output

Architecture

Omniparse is a deterministic parsing pipeline, not an AI system — no LLM or model runs anywhere in the request path. The SDK detects input type from filesystem inspection (extension, isDirectory()), dispatches to one of four independent, lazily-loaded parsers, and returns a single shared result shape. Three surfaces — CLI, direct SDK import, and a Next.js web app — sit on top of that one core.

flowchart LR
  A["Input: file path or directory<br/>(CLI arg / SDK call / web upload)"] --> B{"detectInputType()<br/>extension + isDirectory()"}
  B -->|.xlsx .xls .csv .tsv .ods .xlsb| C["Excel parser<br/>xlsx (SheetJS)"]
  B -->|.pptx| D["PPTX parser<br/>sax + zlib, ZIP read once"]
  B -->|.py| E["Python parser<br/>regex static analysis"]
  B -->|.pdf| F["PDF parser<br/>hand-rolled BT/ET + zlib inflate"]
  B -->|directory| G["Recursive walk<br/>p-limit, concurrency 4"]
  G --> B
  C --> H["Unified ParseResult<br/>markdown + text + wordCount +<br/>estimatedTokens + metadata"]
  D --> H
  E --> H
  F --> H
  H --> I["CLI: stdout or -o file"]
  H --> J["SDK: returned to caller"]
  H --> K["Web: POST /api/parse"]
  K --> L[("SQLite via better-sqlite3<br/>packages/web/data/omniparse.db")]
  L --> M["SQL LIKE scan<br/>/api/documents/search"]

How it works

Input arrives as a file path or directory — a CLI argument, a direct parse() call, or a multipart upload to the web app’s /api/parse route.
detectInputType() resolves the path and reads its extension (or checks isDirectory()) to classify it as excel, pptx, python, pdf, directory, or unsupported. This is filesystem/string logic only — no content sniffing, no magic-byte detection.
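The classification step can be sketched as below. This is a minimal illustration, not the SDK's actual source: the function name `detectInputTypeSketch`, the `EXTENSION_TYPES` map, and the case-insensitive extension match are assumptions. The extension list and the six result categories come from the description above.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

type InputType = "excel" | "pptx" | "python" | "pdf" | "directory" | "unsupported";

// Extension-to-type table (hypothetical layout; the extensions are the documented ones).
const EXTENSION_TYPES: ReadonlyMap<string, InputType> = new Map<string, InputType>([
  [".xlsx", "excel"], [".xls", "excel"], [".csv", "excel"],
  [".tsv", "excel"], [".ods", "excel"], [".xlsb", "excel"],
  [".pptx", "pptx"],
  [".py", "python"],
  [".pdf", "pdf"],
]);

function detectInputTypeSketch(inputPath: string): InputType {
  const resolved = path.resolve(inputPath);
  // throwIfNoEntry: false makes statSync return undefined for a missing path.
  const stats = fs.statSync(resolved, { throwIfNoEntry: false });
  if (stats?.isDirectory()) return "directory";
  // Assumption: extensions are compared case-insensitively.
  const ext = path.extname(resolved).toLowerCase();
  return EXTENSION_TYPES.get(ext) ?? "unsupported";
}
```

Because the check never opens the file, a mislabeled file (for example a PDF renamed to `.py`) is routed by its name alone, which is the trade-off of skipping content sniffing.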
The router dispatches to one of four independent parsers, each dynamically imported so unused parsers never load into memory:
- Excel (.xlsx/.xls/.csv/.tsv/.ods/.xlsb) — the xlsx (SheetJS) library reads the workbook once; full mode additionally opens the XLSX ZIP directly to pull images, charts, comments, merged cells, hyperlinks, and named ranges.
- PowerPoint (.pptx) — the ZIP is read once; each slide and notes XML part is decompressed with Node’s built-in zlib and streamed through sax (no DOM tree built); slides parse concurrently.
- Python (.py) — regex-based static analysis pulls docstrings, imports, functions, classes, and variables. No Python runtime is invoked.
- PDF (.pdf) — a hand-rolled extractor scans the PDF byte stream for BT/ET text objects and the text-showing operators inside them, falling back to zlib.inflateSync on FlateDecode-compressed content streams when no plaintext is found. There is no pdf-parse/pdf.js dependency; text-based PDFs only, no OCR.
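To make the PDF path concrete, here is a rough sketch of the scan-then-inflate idea. It is not the SDK's extractor: the function names are hypothetical, it only handles literal strings shown with the `Tj` operator (no `TJ` arrays, hex strings, or font encodings), and it locates streams with a regex rather than the `/Length` entry.

```typescript
import * as zlib from "node:zlib";

// Collect literal strings shown by Tj inside BT ... ET text objects.
function showTextStrings(content: string): string[] {
  const found: string[] = [];
  const textObject = /\bBT\b([\s\S]*?)\bET\b/g;
  const literalTj = /\(((?:\\.|[^\\()])*)\)\s*Tj\b/g;
  let obj: RegExpExecArray | null;
  while ((obj = textObject.exec(content)) !== null) {
    const body = obj[1] ?? "";
    literalTj.lastIndex = 0;
    let shown: RegExpExecArray | null;
    while ((shown = literalTj.exec(body)) !== null) {
      // Unescape \( \) and \\ inside the literal string.
      found.push((shown[1] ?? "").replace(/\\([\\()])/g, "$1"));
    }
  }
  return found;
}

function extractPdfTextSketch(pdf: Buffer): string {
  // latin1 maps bytes to characters one-to-one, so binary data round-trips.
  const raw = pdf.toString("latin1");
  let pieces = showTextStrings(raw);
  if (pieces.length === 0) {
    // Fallback: try to inflate every stream body and scan the result.
    const streamPattern = /stream\r?\n([\s\S]*?)\r?\nendstream/g;
    let match: RegExpExecArray | null;
    while ((match = streamPattern.exec(raw)) !== null) {
      try {
        const inflated = zlib.inflateSync(Buffer.from(match[1] ?? "", "latin1"));
        pieces = pieces.concat(showTextStrings(inflated.toString("latin1")));
      } catch {
        // Not a FlateDecode stream (image, font, other filter): skip it.
      }
    }
  }
  return pieces.join(" ");
}
```

A scanned PDF contains only image streams, so both passes come back empty, which is why OCR would be needed for those files.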
Directory input recurses the tree (opt-in via recursive), filters to supported extensions, and fans the file list out through p-limit at concurrency 4 (configurable) — bounded parallelism, not a queue or worker pool.
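The walk-and-filter half of that step might look like the sketch below. The function name `collectFilesSketch`, the `SUPPORTED` set, and the sorted output are assumptions for illustration; the p-limit fan-out over the resulting list is not shown here.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Supported extensions as listed in the parser descriptions above.
const SUPPORTED = new Set([
  ".xlsx", ".xls", ".csv", ".tsv", ".ods", ".xlsb", ".pptx", ".py", ".pdf",
]);

// Returns supported files under dir; descends into subdirectories only when
// recursive is true (mirroring the opt-in behaviour described above).
function collectFilesSketch(dir: string, recursive = false): string[] {
  const found: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      if (recursive) found.push(...collectFilesSketch(full, true));
    } else if (entry.isFile() && SUPPORTED.has(path.extname(entry.name).toLowerCase())) {
      found.push(full);
    }
  }
  // Sorting is an assumption; it keeps the combined output order stable.
  return found.sort();
}
```

Collecting the full file list before parsing keeps the concurrency limit simple: the limiter only ever sees a flat array of paths.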
Every parser returns the same ParseResult shape: markdown, text, wordCount, an estimatedTokens heuristic (character-count-based, e.g. Math.ceil(text.length / 4) for PDF — not a real tokenizer call), parseTime, and parser-specific metadata.
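As a sketch of that shared shape: the field names below come from the description above, while the field types, the millisecond unit for `parseTime`, the whitespace-based word count, and the helper names are assumptions.

```typescript
// Assumed typing of the shared result; only the field names are documented.
interface ParseResultSketch {
  markdown: string;
  text: string;
  wordCount: number;
  estimatedTokens: number;
  parseTime: number; // assumed to be milliseconds
  metadata: Record<string, unknown>;
}

// Character-count heuristic quoted above: roughly four characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Assumption: words are whitespace-separated runs.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

function toParseResultSketch(
  markdown: string,
  text: string,
  startedAtMs: number,
  metadata: Record<string, unknown> = {},
): ParseResultSketch {
  return {
    markdown,
    text,
    wordCount: countWords(text),
    estimatedTokens: estimateTokens(text),
    parseTime: Date.now() - startedAtMs,
    metadata,
  };
}
```

The estimate is cheap and deterministic, but it can drift from a real tokenizer's count, especially on code or non-English text, so callers should treat it as a budget hint.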
Three surfaces consume that shared result. The CLI (bin/omniparse.ts) writes it to stdout or -o file. SDK callers get it directly from parse() / parseMultiple(). The web app’s /api/parse route writes the upload to a temp file, calls the same parse(), then persists the result.
Web persistence: better-sqlite3 opens (and auto-creates) packages/web/data/omniparse.db, running raw CREATE TABLE IF NOT EXISTS DDL for Project and Document on first access. There is no Prisma schema or ORM in the repo — a prisma/ directory exists only as a legacy path the app copies forward from if a pre-existing prisma/omniparse.db is found.
Search (/api/documents/search) runs a parameterized SQL LIKE scan across the text, markdown, and fileName columns — substring matching, not embeddings or ranked full-text search.
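A parameterized LIKE scan over those three columns could be built as in the sketch below. The `Document` table and the `text`, `markdown`, and `fileName` columns are named above; the function name, the `SELECT *` projection, and the wildcard escaping are assumptions, and the real route may differ.

```typescript
interface LikeQuery {
  sql: string;
  params: string[];
}

// Builds the statement text and bound parameters; running it is left to the
// SQLite driver (for example a prepared statement).
function buildSearchQuerySketch(term: string): LikeQuery {
  // Assumption: escape LIKE wildcards so user input matches literally.
  const escaped = term.replace(/[\\%_]/g, (ch) => `\\${ch}`);
  const pattern = `%${escaped}%`;
  const sql =
    "SELECT * FROM Document " +
    "WHERE text LIKE ? ESCAPE '\\' " +
    "OR markdown LIKE ? ESCAPE '\\' " +
    "OR fileName LIKE ? ESCAPE '\\'";
  return { sql, params: [pattern, pattern, pattern] };
}
```

Binding the pattern as a parameter keeps user input out of the SQL text. A leading `%` prevents index use, so this is a full table scan, which is acceptable for a single-user local database.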

Models

None. Omniparse runs no LLM or ML model at any stage — parsing is deterministic (library calls, regex, streaming XML, byte-stream scanning). The estimatedTokens field on every result is a character-count heuristic, not a tokenizer invocation, and exists so downstream LLM callers can budget context without needing a model themselves.

Tools & Infra

xlsx (SheetJS) — Excel/CSV/ODS parsing; vendored as a local tarball (vendor/xlsx-0.20.3.tgz) rather than pulled from npm.
sax — streaming XML parser for PPTX slide/notes XML; avoids building a DOM for large decks.
p-limit — bounded-concurrency control for directory and multi-file batch parsing (default 4).
Node zlib (built-in) — decompresses ZIP-internal XML parts (PPTX/XLSX) and FlateDecode PDF content streams.
tsup — builds the SDK to CJS + ESM + type declarations for npm publish.
Next.js 16 + React 19 — the web app (packages/web), a local upload/browse/search surface over the SDK.
better-sqlite3 — synchronous SQLite driver for local persistence; chosen over Prisma/an ORM for a single-user local app where raw prepared statements are simpler than a schema-migration layer.
Radix UI + Tailwind v4 — web app component primitives and styling.
No vector store, queue, or external service — the entire pipeline runs in-process with local disk/SQLite as the only state.

Tech stack

SDK: TypeScript 5, tsup, xlsx (SheetJS), sax, p-limit. Web surface: Next.js 16, React 19, better-sqlite3, Radix UI, Tailwind v4. No LLM or ML model anywhere.