Installation

Overview

The RAG component is installed as part of Arcanna using a dedicated profile flag:

bash arcanna_install_new.sh --profile rag

The --profile rag flag activates the RAG-specific deployment, the RAG service container, and the storage volume that holds uploaded documents. Without this flag, the RAG component is not installed and Arcanna runs as before.

important

--profile rag is not a stand-alone RAG installer. It deploys the entire Arcanna platform with the RAG component enabled, whereas the default installation deploys Arcanna without RAG. There is no installation path that provisions only the RAG container; it always ships together with the rest of Arcanna.

Why a separate installation

The RAG module is shipped as an opt-in install for two reasons:

Hardware requirements. RAG is compute-heavy. The ingestion pipeline runs document layout analysis and OCR, then generates embeddings. The retrieval pipeline runs hybrid search and an optional precision-refinement model on the results. All of these benefit significantly from a GPU; running on CPU is supported but slower.
Flexibility. Not every Arcanna deployment needs document RAG. Keeping it opt-in means customers who don't use the functionality don't have to provision the additional CPU, RAM, and (optionally) GPU resources it requires.

Hardware requirements

Recommended: GPU

The RAG container image ships with everything it needs to run on GPU pre-installed: the GPU runtime libraries, the embedding model, the precision-refinement model, and the document-understanding models. Arcanna detects GPU availability at startup and uses the GPU automatically when present.

Recommended baseline for the GPU host:

Resource	Recommendation
GPU	NVIDIA
VRAM	≥ 8 GB (all models live in VRAM concurrently)
Container toolkit	NVIDIA Container Toolkit installed and Docker configured so that containers can be started with `--gpus all`
System RAM	≥ 4 GB (in addition to GPU VRAM)
Disk	≥ 50 GB for models and runtime; storage volume scales with your document corpus
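The VRAM line of this baseline can be checked on the host before installing. The following is a minimal sketch, not part of Arcanna: the function names and the 8192 MiB threshold are illustrative assumptions, and it relies only on the standard `nvidia-smi` query interface.

```shell
#!/bin/sh
# Preflight sketch for the GPU baseline (assumption: run on the host, not in a container).

MIN_VRAM_MIB=8192   # 8 GB from the table, expressed in MiB (illustrative threshold)

# vram_ok MIB [MIN]: succeeds when the reported VRAM meets the minimum.
vram_ok() {
    [ "$1" -ge "${2:-$MIN_VRAM_MIB}" ]
}

# check_gpus: reads memory.total for every GPU and prints one OK/LOW line per GPU.
check_gpus() {
    nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits |
    while read -r mib; do
        if vram_ok "$mib"; then
            echo "OK:  GPU reports ${mib} MiB"
        else
            echo "LOW: GPU reports ${mib} MiB (minimum ${MIN_VRAM_MIB} MiB)"
        fi
    done
}

# Run the live check only when asked to: sh gpu_preflight.sh --run
if [ "${1:-}" = "--run" ]; then
    check_gpus
fi
```

Some drivers may report slightly less than the nominal card size, so lower `MIN_VRAM_MIB` if an 8 GB card is flagged as LOW.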

GPU is preferred because it accelerates every heavy stage:

Embedding generation — faster on GPU than on CPU.
Precision refinement at retrieval time — feasible on GPU at low latency; too slow on CPU to use interactively.
Document conversion — layout, OCR, and table-extraction models all run faster on GPU, dramatically reducing per-page conversion time.
End-to-end retrieval latency — typical query: sub-1000 ms on GPU; higher on CPU.

Supported: CPU

The RAG module runs on CPU-only hosts. The same container image is used; Arcanna detects that no GPU is available and falls back to the CPU profile automatically.

Recommended baseline for the CPU host:

Resource	Recommendation
CPU	≥ 8 physical cores
RAM	≥ 8 GB (all models are held in process memory)
Disk	≥ 50 GB for models and runtime; storage volume scales with your document corpus
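A CPU host can be compared against this baseline with standard Linux tools. The sketch below is illustrative only: the function names are hypothetical, and `/var/lib/docker` is assumed to be the Docker data root (Docker's default), which may differ on your host.

```shell
#!/bin/sh
# Preflight sketch for the CPU baseline. Thresholds come from the table above.

MIN_CORES=8
MIN_RAM_KB=$((8 * 1024 * 1024))    # 8 GB expressed in kB
MIN_DISK_KB=$((50 * 1024 * 1024))  # 50 GB expressed in kB

# at_least VALUE MIN: succeeds when VALUE >= MIN.
at_least() {
    [ "$1" -ge "$2" ]
}

# report LABEL VALUE MIN: prints one OK/LOW line per resource.
report() {
    if at_least "$2" "$3"; then
        echo "OK:  $1 = $2 (minimum $3)"
    else
        echo "LOW: $1 = $2 (minimum $3)"
    fi
}

# Physical cores: count unique (core, socket) pairs so hyperthreads are not counted twice.
physical_cores() {
    lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l
}

# Total RAM in kB as reported by the kernel (usually a little below the installed size).
total_ram_kb() {
    awk '/^MemTotal:/ {print $2}' /proc/meminfo
}

# Available space in kB on the filesystem that holds the given path.
free_disk_kb() {
    df -Pk "$1" | awk 'NR == 2 {print $4}'
}

preflight() {
    disk_path=${DOCKER_ROOT:-/var/lib/docker}   # assumed data root; override with DOCKER_ROOT
    [ -d "$disk_path" ] || disk_path=/
    report "physical cores" "$(physical_cores)" "$MIN_CORES"
    report "RAM (kB)" "$(total_ram_kb)" "$MIN_RAM_KB"
    report "free disk (kB)" "$(free_disk_kb "$disk_path")" "$MIN_DISK_KB"
}

# Run the live check only when asked to: sh cpu_preflight.sh --run
if [ "${1:-}" = "--run" ]; then
    preflight
fi
```

Because the kernel reserves some memory, a host with exactly 8 GB installed will report slightly less than `MIN_RAM_KB`; treat a narrow miss on the RAM line as a prompt to look closer rather than a hard failure.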

To keep CPU retrieval latency interactive, Arcanna applies a different profile when no GPU is detected:

The precision-refinement model is skipped at query time. On GPU, this model re-scores the top candidates against the query and filters out borderline-relevant ones, improving the quality of the top results. On CPU it takes seconds per query, which would make interactive retrieval unusable. Hybrid search still runs; you simply don't get the extra precision pass on top of it.
Embedding generation uses smaller batches and shorter inputs, which reduces per-call latency and memory pressure.
Retrieval returns a smaller top result set by default — large result counts aren't useful without the precision-refinement pass to filter them.

CPU vs GPU tradeoffs

Aspect	GPU	CPU
Document ingestion (OCR, tables, layout)	Fast	Significantly slower, especially for scanned PDFs
Embedding generation	Fast	Slower; uses smaller batches and shorter inputs
Retrieval latency (per query)	Sub-1000 ms end-to-end	Higher; the precision-refinement pass is skipped to compensate
Precision-refinement pass at retrieval	Enabled — improves top-result quality	Skipped to keep latency interactive
Result-set size returned by retrieval	Unrestricted	Smaller maximum
Memory profile	Models live in VRAM	Models live in system RAM

The short version: on GPU you get faster ingestion and higher-precision retrieval; on CPU you still get the same set of relevant documents back from search, but lose the precision-refinement pass that GPU adds on top, and ingestion is slower.

Vertical scaling

Both ingestion and retrieval are throughput-bound on a single machine before they're latency-bound. Arcanna lets you scale up vertically on the host by increasing three worker-pool counts in the RAG module's configuration.

The three pools you can size independently:

Conversion workers: how many documents Arcanna converts in parallel (parsing, layout analysis, OCR). Defaults to 1. The memory footprint grows linearly with the number of workers, because each worker loads its own copy of the models.
Indexing workers: how many documents Arcanna chunks, embeds, and indexes in parallel. Defaults to 1.
API workers: how many parallel API processes serve incoming requests (search, status, uploads). Increase this if you expect many concurrent users or workflows hitting Arcanna at the same time. Here too, the memory footprint grows linearly with the number of workers, because each API worker comes with its own conversion and indexing subworkers.

Memory rule of thumb: one conversion worker and one indexing worker, taken together, can transiently use 4-8 GB of RAM. When scaling, keep the total worker count within the host's available memory.
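The rule of thumb can be turned into a quick estimate before changing worker counts. This is a rough sketch under stated assumptions, not Arcanna's own sizing logic: it assumes the 4-8 GB figure covers one conversion worker plus one indexing worker together, that every API worker brings its own set of those workers, and it budgets the upper end of the range (8 GB) by default. The function names are hypothetical.

```shell
#!/bin/sh
# Sizing sketch. Assumptions (not confirmed by the product documentation):
#   - one conversion worker plus one indexing worker together peak at GB_PER_PAIR
#   - every API worker brings its own set of those workers
#   - 8 GB, the upper end of the 4-8 GB range, is the default budget

# peak_ram_gb API_WORKERS PAIRS_PER_API_WORKER [GB_PER_PAIR]
# Prints the estimated transient peak in GB.
peak_ram_gb() {
    echo $(( $1 * $2 * ${3:-8} ))
}

# fits_in_ram PEAK_GB AVAILABLE_GB: succeeds when the estimate fits.
fits_in_ram() {
    [ "$1" -le "$2" ]
}

# Example: 2 API workers, each with 2 conversion+indexing pairs, on a 64 GB host.
peak=$(peak_ram_gb 2 2)
if fits_in_ram "$peak" 64; then
    echo "estimated peak ${peak} GB fits in 64 GB"
else
    echo "estimated peak ${peak} GB exceeds 64 GB"
fi
```

With these numbers the estimate is 2 × 2 × 8 = 32 GB, which fits. The estimate covers only the workers' transient use, so leave headroom for the operating system and the rest of Arcanna.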

Prerequisites

The image is self-contained

The RAG container image is built to be fully self-contained and offline-ready. Once the image is on the host, no internet access is required at runtime. Specifically, the image bundles:

All AI models — embedding model, precision-refinement model, and the document-understanding models (layout, OCR, table extraction). These are baked into the image at build time, and Arcanna is configured at runtime to never reach out for them.
The full software runtime — everything the service needs to start and run, with no further dependencies installed on the host.
GPU runtime libraries — for GPU deployments. The container ships with the matching CUDA libraries inside it; you only need an NVIDIA driver on the host. Upgrading CUDA on the host does not affect the container, and vice versa.

In short, once the image is pulled, the container is ready to serve. No model downloads on first start, no package fetches, no outbound network calls.

GPU runtime (only for GPU deployments)

If you are deploying with GPU, Docker must be able to expose the host's NVIDIA device to the RAG container. Arcanna's automatic GPU detection only works if the container actually has access to the GPU — which requires the NVIDIA Container Toolkit to be installed on the host and registered as a Docker runtime.

Without this, the container starts but does not see a GPU, and Arcanna silently falls back to the CPU profile (no precision-refinement pass, smaller batches, smaller result sets) even though a GPU is present on the host.

Install the NVIDIA Container Toolkit

Follow NVIDIA's installation guide: Installing the NVIDIA Container Toolkit.

Register it as a Docker runtime

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

This writes the nvidia runtime into /etc/docker/daemon.json. Verify with:

docker info | grep -i nvidia

You should see nvidia listed under Runtimes.
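The same check can be scripted, and followed by the sample workload from NVIDIA's toolkit documentation to confirm that a container can actually reach the GPU. This is a sketch with hypothetical function names; the smoke test pulls the public `ubuntu` image, so it needs registry access the first time it runs.

```shell
#!/bin/sh
# has_nvidia_runtime: reads `docker info` text on stdin and succeeds
# when the "Runtimes:" line lists nvidia.
has_nvidia_runtime() {
    grep -i 'runtimes:' | grep -qiw 'nvidia'
}

# gpu_smoke_test: runs nvidia-smi inside a throwaway container using the NVIDIA runtime.
# It should print the same GPU table that nvidia-smi prints on the host.
gpu_smoke_test() {
    docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
}

# verify_gpu_runtime: combines both steps and prints a short verdict.
verify_gpu_runtime() {
    if docker info 2>/dev/null | has_nvidia_runtime; then
        echo "nvidia runtime is registered with Docker"
        gpu_smoke_test
    else
        echo "nvidia runtime is NOT registered; re-run nvidia-ctk and restart Docker"
        return 1
    fi
}

# Run the live check only when asked to: sh verify_runtime.sh --run
if [ "${1:-}" = "--run" ]; then
    verify_gpu_runtime
fi
```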

Run the container with GPU access

important

The Arcanna installer does not currently start the RAG container with GPU access automatically. Even on a host with a working NVIDIA Container Toolkit, the installer-provisioned container runs on CPU by default. Customers who want to use the GPU profile must currently start the RAG container manually with the NVIDIA runtime flags.

The required flags are --runtime=nvidia and --gpus all.

For this type of deployment, contact the Arcanna support team.

Without these flags, the container starts but does not see the GPU, and Arcanna silently falls back to the CPU profile described above, even though a GPU is present on the host.

For CPU-only deployments, none of the above applies — no NVIDIA runtime, no toolkit, no GPU flags. The default installer-provisioned container is already running on CPU.
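Because the fallback to the CPU profile is silent, it can help to confirm from the outside whether a running RAG container sees a GPU. The sketch below is an assumption-laden illustration: the container name is a placeholder you must replace, the function name is hypothetical, and it presumes that `nvidia-smi` is available inside the container, which is normally the case when the NVIDIA runtime injects its utilities.

```shell
#!/bin/sh
# container_sees_gpu NAME: succeeds when `nvidia-smi -L` inside the container lists a GPU.
# NAME is whatever `docker ps` shows for your RAG container (not specified in this guide).
container_sees_gpu() {
    docker exec "$1" nvidia-smi -L 2>/dev/null | grep -q '^GPU [0-9]'
}

# describe_profile NAME: prints which profile the container can be expected to use.
describe_profile() {
    if container_sees_gpu "$1"; then
        echo "$1: GPU visible, GPU profile expected"
    else
        echo "$1: no GPU visible, CPU profile expected"
    fi
}

# Usage: sh check_container_gpu.sh <container-name>
if [ -n "${1:-}" ]; then
    describe_profile "$1"
fi
```

A "no GPU visible" result on a GPU host usually means the container was started without `--runtime=nvidia` and `--gpus all`.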
