Ingestion Pipeline
Overview
The ingestion pipeline is what turns an uploaded file into something the LLMs can search. It has two stages, with an optional human-in-the-loop step in between.
Arcanna deliberately splits the two stages because they have very different costs and very different failure modes:
- Conversion is the expensive step that turns your file into clean text. For PDFs it runs AI-based layout and table extraction (plus OCR when there's no text layer); for other file types, it parses the document's native structure.
- Indexing is the cheaper step that takes the (now approved) text and makes it searchable.
Splitting them lets you stop and review the conversion result before it becomes searchable. You get a chance to correct a bad conversion — which matters most for PDFs — without wasting the indexing work or, worse, polluting search results with wrong text.
Pipeline stages
- Markdown Conversion — Arcanna extracts the text from your file. Depending on the approval mode you've configured, the result is either approved automatically or waits for a reviewer to edit the text and approve it.
- Indexing — once the text is approved, Arcanna prepares it for semantic search and makes it available to the AI Assistant and Agentic Workflows.
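The gate between the two stages can be sketched in a few lines of Python. This is only an illustration of the idea described above, not Arcanna's implementation: `ApprovalMode`, `Document`, `convert`, `review`, and `index` are hypothetical names, and the "conversion" here is a stand-in that just trims whitespace.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ApprovalMode(Enum):
    # Hypothetical names for the two approval modes described above.
    AUTO = "auto"      # conversion result is approved automatically
    MANUAL = "manual"  # conversion result waits for a reviewer


@dataclass
class Document:
    name: str
    text: Optional[str] = None  # output of the conversion stage
    approved: bool = False
    indexed: bool = False


def convert(doc: Document, raw: str, mode: ApprovalMode) -> None:
    """Stage 1 stand-in: 'extract' text, then approve it or hold it for review."""
    doc.text = raw.strip()
    doc.approved = mode is ApprovalMode.AUTO


def review(doc: Document, edited_text: Optional[str] = None) -> None:
    """Optional human step: the reviewer may edit the text, then approves it."""
    if doc.text is None:
        raise ValueError("nothing to review: document has not been converted")
    if edited_text is not None:
        doc.text = edited_text
    doc.approved = True


def index(doc: Document) -> bool:
    """Stage 2 stand-in: runs only on approved text; returns True if indexed."""
    if not doc.approved:
        return False
    doc.indexed = True
    return True
```

In this sketch, a document converted with `ApprovalMode.MANUAL` is refused by `index` until `review` has been called, which mirrors the point of the split: unreviewed text never reaches the search index.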
High-level flow
upload ─▶ convert ─▶ text ─┬──▶ auto-approved ─▶ index ─▶ searchable
                           │
                           └──▶ in review ─▶ edit + approve ─▶ index ─▶ searchable
Each document carries two independent statuses that you'll see in the UI:
- Conversion status: pending, processing, in_review, approved, rejected, failed. Where the document is in the conversion + review flow.
- Indexing status: idle, pending, loading, success, error. Where the document is in the indexing flow.
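If you handle these statuses in your own scripts, modelling them as two separate enumerations keeps them independent, as they are in the UI. The sketch below is an assumption-laden illustration: the status values come from the lists above, but the class names and the `is_searchable` rule (conversion approved and indexing successful) are hypothetical and are not taken from Arcanna's API.

```python
from enum import Enum


class ConversionStatus(Enum):
    """Where the document is in the conversion + review flow."""
    PENDING = "pending"
    PROCESSING = "processing"
    IN_REVIEW = "in_review"
    APPROVED = "approved"
    REJECTED = "rejected"
    FAILED = "failed"


class IndexingStatus(Enum):
    """Where the document is in the indexing flow."""
    IDLE = "idle"
    PENDING = "pending"
    LOADING = "loading"
    SUCCESS = "success"
    ERROR = "error"


def is_searchable(conversion: ConversionStatus, indexing: IndexingStatus) -> bool:
    # Assumption for illustration: a document is searchable only when its
    # text was approved and the indexing stage finished successfully.
    return (
        conversion is ConversionStatus.APPROVED
        and indexing is IndexingStatus.SUCCESS
    )


# Parsing the raw strings shown in the UI into enum members.
conversion = ConversionStatus("in_review")
indexing = IndexingStatus("idle")
waiting_for_review = not is_searchable(conversion, indexing)
```

Keeping the two enumerations apart makes combinations such as "approved but indexing error" explicit instead of collapsing them into one overloaded status field.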