Pipeline States

Every resource in an OmniData container moves through a three-stage pipeline: bronze, silver, and gold. This progression tracks how far a piece of content has been processed, from raw metadata to fully searchable knowledge.

The current state is stored in the pipeline_state column on the omnidata_resources table in index.db.

The three states

Bronze (raw ingested)

A resource enters the container in the bronze state. At this point, OmniData knows the resource exists — its URI, source adapter, title, MIME type, and timestamps are recorded — but the actual content has not yet been processed.

Bronze means: “We know about this item. We have its metadata. We may have its raw binary stored in the blobs/ directory. But it has not been chunked or embedded.”

A bronze resource is discoverable by metadata queries but invisible to semantic search.

Silver (chunked)

A silver resource has had its content extracted and split into text segments stored in omnidata_chunks (inside index.db). The chunking strategy depends on content type — sliding window for prose, tree-sitter for code, message boundaries for conversations.
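As an illustration of the sliding-window strategy mentioned for prose, here is a minimal sketch. It assumes whitespace-delimited words as the unit, and the function name, window size, and overlap are hypothetical choices; OmniData's actual chunker and its parameters are not described here.

```python
def sliding_window_chunks(text: str, window: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of whitespace-delimited words.

    Hypothetical helper: the unit (words) and the defaults are assumptions.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")

    words = text.split()
    step = window - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a window boundary still appears intact in at least one chunk, at the cost of storing some text twice.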

Silver means: “The content has been broken into searchable text segments. Full-text search (FTS5) can find this resource. But vector embeddings have not yet been generated.”

A silver resource appears in FTS5 keyword searches but not in vector similarity queries.

Gold (embedded and searchable)

A gold resource is fully processed. Every chunk has a vector embedding stored as a little-endian float32 BLOB in index.db, with the embedding model recorded. The resource is now findable by both keyword search and semantic similarity.
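To make the storage format concrete, the sketch below packs a vector into a little-endian float32 byte string and reads it back using Python's standard struct module. The function names are hypothetical; only the byte layout (4 bytes per value, little-endian) comes from the description above.

```python
import struct


def encode_embedding(vector):
    """Pack a sequence of floats as consecutive little-endian float32 values."""
    return struct.pack(f"<{len(vector)}f", *vector)


def decode_embedding(blob):
    """Unpack a little-endian float32 BLOB back into a list of floats."""
    if len(blob) % 4 != 0:
        raise ValueError("BLOB length must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


blob = encode_embedding([1.0, -2.5, 0.0])
vector = decode_embedding(blob)
```

Because each value is narrowed to 32 bits, a round trip preserves only float32 precision; the vector dimension can be recovered as the BLOB length divided by 4.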

Gold means: “This resource is fully indexed. RRF (reciprocal rank fusion) search will include it in results.”
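For readers unfamiliar with reciprocal rank fusion, the sketch below shows the general technique: each result list contributes 1 / (k + rank) to an item's score, and items are sorted by their summed score. This is a generic illustration, not OmniData's implementation; the function name and the constant k = 60 (a commonly used default) are assumptions.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of IDs with reciprocal rank fusion.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (assumed value; 60 is a common default).
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))


keyword_hits = ["a", "b", "c"]  # e.g. from an FTS5 keyword query
vector_hits = ["b", "d", "a"]   # e.g. from a vector similarity query
fused = rrf_fuse([keyword_hits, vector_hits])
```

Only a gold resource can appear in both lists, which is why items such as "a" and "b" in this example outrank items found by a single method.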

Pipeline promotion

Resources move forward through the pipeline, never backward. The typical flow:

  1. Adapter sync creates a bronze resource (metadata + optional blob in blobs/)
  2. Chunker reads the content, splits it into segments, promotes to silver
  3. Embedder generates vectors for each chunk, promotes to gold

Each stage is idempotent. Re-running the chunker on an already-silver resource is a no-op. Re-running the embedder on an already-gold resource is a no-op.
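One way to get this idempotent behavior is to make the state change conditional on the current state, so a repeated call matches no rows. The sketch below uses Python's sqlite3 module against an in-memory database; the column types, the promote helper, and the example URI are assumptions for illustration, and only the table and column names come from this page.

```python
import sqlite3


def promote(conn, uri, from_state, to_state):
    """Move a resource forward only if it is currently in from_state.

    Returns True if a row changed, False if the call was a no-op.
    """
    cur = conn.execute(
        "UPDATE omnidata_resources SET pipeline_state = ? "
        "WHERE uri = ? AND pipeline_state = ? AND deleted_at IS NULL",
        (to_state, uri, from_state),
    )
    conn.commit()
    return cur.rowcount == 1


# Minimal stand-in for index.db (assumed schema, not the real one).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE omnidata_resources ("
    "uri TEXT PRIMARY KEY, title TEXT, "
    "pipeline_state TEXT NOT NULL, deleted_at TEXT)"
)
conn.execute(
    "INSERT INTO omnidata_resources VALUES ('file:///notes.md', 'Notes', 'bronze', NULL)"
)

first = promote(conn, "file:///notes.md", "bronze", "silver")   # changes the row
second = promote(conn, "file:///notes.md", "bronze", "silver")  # matches nothing
```

Because the WHERE clause names the expected starting state, the same guard also prevents a resource from moving backward.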

Querying by state

You can query resources at any pipeline state. These queries run against index.db:

-- Find resources still waiting to be chunked
SELECT uri, title FROM omnidata_resources
WHERE pipeline_state = 'bronze' AND deleted_at IS NULL;

-- Count resources at each stage
SELECT pipeline_state, COUNT(*) FROM omnidata_resources
WHERE deleted_at IS NULL
GROUP BY pipeline_state;

Why three states?

The pipeline exists because chunking and embedding are expensive operations. By separating them, the system can:

  • Ingest quickly — adapters write metadata and move on, without blocking on embedding
  • Process in bulk — a background worker promotes resources in batches
  • Resume after interruption — if the process crashes mid-embedding, bronze and silver resources are still safe and can be promoted on the next run
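The batch and resume properties listed above can be sketched as a worker that commits each promotion separately, so a crash partway through loses at most the resource being processed. Everything here other than the table and column names is hypothetical: the promote_batch function, the process callback (standing in for chunking or embedding), and the batch size.

```python
import sqlite3


def promote_batch(conn, from_state, to_state, process, batch_size=100):
    """Process up to batch_size resources in from_state and promote each one.

    Each resource is committed on its own; if process() raises, that
    resource's transaction is rolled back and earlier promotions remain.
    """
    rows = conn.execute(
        "SELECT uri FROM omnidata_resources "
        "WHERE pipeline_state = ? AND deleted_at IS NULL "
        "ORDER BY uri LIMIT ?",
        (from_state, batch_size),
    ).fetchall()

    promoted = 0
    for (uri,) in rows:
        with conn:  # commits on success, rolls back if an exception escapes
            process(uri)
            conn.execute(
                "UPDATE omnidata_resources SET pipeline_state = ? "
                "WHERE uri = ? AND pipeline_state = ?",
                (to_state, uri, from_state),
            )
        promoted += 1
    return promoted
```

After an interruption, calling promote_batch again selects only the resources still in from_state, so work already committed is not repeated.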