Indexing Pipeline

Overview

Indexing is the step that takes the approved text from the Conversion stage and makes it searchable. Once a document finishes indexing, the AI Assistant and Agentic Workflows can find and use it.

You don't need to do anything to trigger indexing — it runs automatically when you (or Arcanna) approve a document. From your point of view, an approved document simply transitions from Indexing to Indexed in the UI and starts showing up in search results.

What happens during indexing

Three things happen, and they happen automatically:

  1. The text is split into smaller passages. Arcanna doesn't index a document as one giant blob — it breaks it into passages that preserve the document's structure (sections, lists, tables stay together where possible). This is what lets retrieval find the specific paragraph answering a question, rather than just the document it lives in.
  2. Each passage is turned into a numerical representation (an embedding). Arcanna uses a single multilingual embedding model, so every language your documents are written in is handled by the same shared model.
  3. The passages and their embeddings are stored. Together with the document metadata, the passages become searchable using both keyword matching and semantic similarity.
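As an illustration of step 1, here is a minimal sketch of structure-aware splitting. It is not Arcanna's actual splitter: the `Passage` type, the `split_into_passages` function, the blank-line block boundary, and the `max_chars` budget are all assumptions made for the example. The idea it shows is that whole blocks (a paragraph, a list, a table) are grouped into passages and never cut in half.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str      # hypothetical identifier of the source document
    position: int    # order of the passage within the document
    text: str


def split_into_passages(doc_id: str, text: str, max_chars: int = 500) -> list[Passage]:
    """Group blank-line-separated blocks into passages without cutting a block."""
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    passages: list[Passage] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        # Close the current passage when adding this block would exceed the budget.
        if current and size + len(block) > max_chars:
            passages.append(Passage(doc_id, len(passages), "\n\n".join(current)))
            current, size = [], 0
        # A block larger than the budget still stays whole, in its own passage.
        current.append(block)
        size += len(block)
    if current:
        passages.append(Passage(doc_id, len(passages), "\n\n".join(current)))
    return passages


sample = "Intro paragraph.\n\n- item one\n- item two\n\nClosing paragraph."
for p in split_into_passages("doc-1", sample, max_chars=30):
    print(p.position, repr(p.text))
```

With a small budget, the two-item list ends up in a passage of its own instead of being split between its neighbours, which is the property that lets retrieval return a complete list or table.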

Languages

Arcanna uses a single multilingual model for embeddings, so you don't need to configure a language per collection or per document. Documents in mixed languages, or collections that contain a mix of English and Romanian (or any other supported language), work out of the box. Queries in one language can also retrieve relevant passages from documents written in another, because both end up in the same shared semantic space.

When does Arcanna re-index a document?

Arcanna indexes (or re-indexes) a document in three situations:

  • You edit the text and approve again. Saving new text removes the old version from search and re-indexes from the new content, so search results are never stale.
  • You retry indexing manually. If the indexing status is Error, the UI offers a retry action that re-runs only the indexing step — conversion is not redone.
  • A document is approved for the first time. Approval is what triggers the initial indexing job.

Re-indexing never re-runs the conversion stage. Arcanna treats the (possibly edited) text as the source of truth at indexing time.
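The "old version is removed, new text is indexed" behaviour can be sketched with a toy in-memory index. This is only an illustration under stated assumptions: `InMemoryIndex`, `reindex`, and `search` are hypothetical names, the search is plain substring matching, and nothing here reflects Arcanna's real storage layer. Note that `reindex` takes the approved text directly, with no conversion step involved.

```python
class InMemoryIndex:
    """Toy index: maps a document id to the passages currently searchable."""

    def __init__(self) -> None:
        self._passages: dict[str, list[str]] = {}

    def reindex(self, doc_id: str, approved_text: str) -> None:
        # Drop the previous passages first so stale text can never be returned.
        self._passages.pop(doc_id, None)
        self._passages[doc_id] = [
            p.strip() for p in approved_text.split("\n\n") if p.strip()
        ]

    def search(self, term: str) -> list[tuple[str, str]]:
        # Case-insensitive substring match, standing in for real retrieval.
        term = term.lower()
        return [
            (doc_id, passage)
            for doc_id, passages in self._passages.items()
            for passage in passages
            if term in passage.lower()
        ]


index = InMemoryIndex()
index.reindex("doc-1", "Rotate keys yearly.")
index.reindex("doc-1", "Rotate keys every 90 days.")  # edited and approved again
print(index.search("yearly"))
print(index.search("90 days"))
```

After the second call, a search for the old wording finds nothing, and only the edited text is returned.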

What this means for your search results

Two consequences of how indexing works are worth keeping in mind when you're authoring or curating content:

  • Passages preserve structure. Well-structured documents (proper headings, real lists, clean tables) split into clean passages, which retrieve cleanly. Documents that are one long wall of text split less informatively. This is another reason native formats beat PDFs.
  • Search is semantic, not just keyword-based. Arcanna finds passages that mean the same thing as the query, not just passages that share words with it. A question about "credential rotation" can retrieve a passage that talks about "password renewal", because the two end up close to each other in the shared embedding space.
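To make the combination of keyword matching and semantic similarity concrete, here is a small sketch of one way such scores could be blended. Everything in it is an assumption for illustration: the three-dimensional vectors are hand-made stand-ins for real embeddings, and the weighted sum with `alpha` is a common hybrid-ranking pattern, not a description of how Arcanna actually scores results.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query: str, passage: str) -> float:
    """Fraction of query words that also appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def rank(query: str, query_vec: list[float],
         passages: list[tuple[str, list[float]]], alpha: float = 0.5) -> list[str]:
    """Order passages by alpha * keyword score + (1 - alpha) * semantic score."""
    scored = [
        (alpha * keyword_overlap(query, text) + (1 - alpha) * cosine(query_vec, vec), text)
        for text, vec in passages
    ]
    return [text for _, text in sorted(scored, key=lambda s: s[0], reverse=True)]


# Hand-made toy vectors; a real embedding model would produce these.
passages = [
    ("office parking rules", [0.0, 0.1, 0.9]),
    ("password renewal policy", [0.8, 0.2, 0.1]),
]
print(rank("credential rotation", [0.9, 0.1, 0.0], passages))
```

Neither passage shares a word with the query, so the keyword score is zero for both, and the ordering is decided entirely by vector similarity. That is the situation the "credential rotation" example describes.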