Chunks and Embeddings
The omnidata_chunks table stores text segments extracted from resources, along with their vector embeddings. It lives in index.db inside the .omnidata bundle. This is the table that powers both full-text search (via the companion FTS5 index) and vector similarity search.
Schema
-- index.db
CREATE TABLE omnidata_chunks (
id TEXT PRIMARY KEY,
resource_id TEXT NOT NULL REFERENCES omnidata_resources(id),
chunk_index INTEGER NOT NULL,
chunk_type TEXT NOT NULL DEFAULT 'text',
content TEXT NOT NULL,
token_count INTEGER,
embedding BLOB,
embedding_model TEXT,
metadata TEXT DEFAULT '{}',
created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
deleted_at TEXT
);
Columns
id
Text (UUID v4). Primary key for the chunk.
resource_id
Text. Foreign key to omnidata_resources.id (both tables live in index.db). Every chunk belongs to exactly one resource. A resource may have many chunks.
chunk_index
Integer. The ordinal position of this chunk within its parent resource, starting at 0. Used to reconstruct the original document order.
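Reassembling a resource's text is therefore a simple ordered query. A minimal sketch, assuming an open sqlite3 connection to index.db; the helper name reconstruct_resource_text is illustrative:

```python
import sqlite3

def reconstruct_resource_text(conn: sqlite3.Connection, resource_id: str) -> str:
    """Rebuild a resource's text by concatenating its live chunks in
    chunk_index order. Assumes `conn` is open on index.db."""
    rows = conn.execute(
        """
        SELECT content FROM omnidata_chunks
        WHERE resource_id = ? AND deleted_at IS NULL
        ORDER BY chunk_index
        """,
        (resource_id,),
    ).fetchall()
    return "\n".join(r[0] for r in rows)
```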
chunk_type
Text. Describes what kind of segment this is:
"text"— General prose, the most common type"code"— A code block or file segment"heading"— A section heading, useful for hierarchical navigation"message"— A single message in a conversation thread"transcript"— A segment of transcribed audio or video
The chunk type influences how the text is displayed and may affect chunking strategy.
content
Text. The actual text content of the chunk. This is what gets indexed by FTS5 and what gets embedded into a vector.
token_count
Integer, nullable. The number of tokens in the chunk, as counted by the embedding model’s tokenizer. Useful for prompt budgeting when an agent selects chunks to include in context.
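One way an agent might use token_count for budgeting, as a sketch; the greedy selection policy and the select_within_budget name are assumptions, not part of the schema:

```python
def select_within_budget(chunks, max_tokens):
    """Greedily take chunks (already ranked by relevance) until the token
    budget is used up. `chunks` is a list of (content, token_count) pairs.
    A chunk whose token_count is NULL (None) is skipped, since its cost
    is unknown."""
    selected, used = [], 0
    for content, token_count in chunks:
        if token_count is None:
            continue
        if used + token_count > max_tokens:
            continue  # keep scanning: a smaller later chunk may still fit
        selected.append(content)
        used += token_count
    return selected
```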
embedding
BLOB, nullable. The vector embedding of the chunk’s content, stored as a little-endian float32 array. The dimensionality is inferred at read time from len(embedding) / 4. For example, a 768-dimensional embedding occupies 3072 bytes.
NULL when the resource is in the silver pipeline state (chunked but not yet embedded).
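The write side of the format can be sketched with struct.pack; the helper names here are illustrative:

```python
import struct

def encode_embedding(vector):
    """Pack a sequence of floats into the BLOB format described above:
    a little-endian float32 array, 4 bytes per dimension."""
    return struct.pack(f"<{len(vector)}f", *vector)

def decode_embedding(blob):
    """Inverse of encode_embedding: recover the float32 values as a tuple."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)
```

A 768-dimensional vector round-trips through encode_embedding to exactly 3072 bytes, matching the sizing rule above.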
embedding_model
Text, nullable. The name of the model used to generate the embedding (e.g., "nomic-embed-text-v1.5"). Recorded so that consumers know which model to use for query embedding. Mixing models within a single container is technically possible but not recommended — cosine similarity is only meaningful between vectors from the same model.
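A sketch of model-aware similarity search over decoded rows. The helper names and the in-Python linear scan are assumptions; a real implementation might push this work into a vector index instead:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def best_match(query_vec, query_model, rows):
    """Return the id of the most similar chunk. `rows` holds
    (id, vector, embedding_model) tuples already decoded from the table.
    Chunks embedded with a different model are excluded, per the note
    above about cross-model similarity being meaningless."""
    candidates = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec, model in rows
        if model == query_model
    ]
    return max(candidates)[1] if candidates else None
```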
metadata
Text (JSON object). Chunk-level metadata. Examples: source line numbers for code chunks, speaker identity for transcript chunks, heading level for heading chunks.
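Because metadata is stored as a JSON string, consumers parse it on read. A minimal sketch; any specific keys (e.g. "speaker") are illustrative, not a documented schema:

```python
import json

def chunk_metadata(row):
    """Parse a chunk row's metadata column into a dict, treating a
    missing value as the documented default of '{}'."""
    return json.loads(row["metadata"] or "{}")
```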
FTS5 companion index
The fts_chunks virtual table mirrors the chunk text for full-text search, also in index.db:
-- index.db
CREATE VIRTUAL TABLE fts_chunks USING fts5(
content,
content='omnidata_chunks',
content_rowid='rowid'
);
FTS5 uses BM25 ranking by default. The index is kept in sync via triggers that fire on INSERT, UPDATE, and DELETE against omnidata_chunks.
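The triggers themselves are not shown in the schema above, but the standard external-content sync pattern from the SQLite FTS5 documentation looks like the following sketch; the trigger names are assumptions, not necessarily what OmniData uses:

```python
import sqlite3

# Sync triggers for an external-content FTS5 table. Deletes and updates
# must insert a special 'delete' command row into the FTS table, because
# FTS5 cannot see the old content on its own.
SYNC_TRIGGERS = """
CREATE TRIGGER chunks_ai AFTER INSERT ON omnidata_chunks BEGIN
  INSERT INTO fts_chunks(rowid, content) VALUES (new.rowid, new.content);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON omnidata_chunks BEGIN
  INSERT INTO fts_chunks(fts_chunks, rowid, content)
  VALUES ('delete', old.rowid, old.content);
END;
CREATE TRIGGER chunks_au AFTER UPDATE ON omnidata_chunks BEGIN
  INSERT INTO fts_chunks(fts_chunks, rowid, content)
  VALUES ('delete', old.rowid, old.content);
  INSERT INTO fts_chunks(rowid, content) VALUES (new.rowid, new.content);
END;
"""

def install_fts_sync(conn: sqlite3.Connection) -> None:
    """Install the sync triggers on an open index.db connection."""
    conn.executescript(SYNC_TRIGGERS)
```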
Reading embeddings
To read an embedding in Python:
import struct

blob = row["embedding"]        # packed little-endian float32 bytes
dimensions = len(blob) // 4    # 4 bytes per float32 value
vector = struct.unpack(f"<{dimensions}f", blob)  # tuple of Python floats
The < prefix specifies little-endian byte order, and the f format character reads 32-bit floats. The same decode applies in any language with binary unpacking: interpret the BLOB as a packed array of little-endian float32 values.
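If NumPy is available, the same decode is a single call; the "<f4" dtype makes the little-endian float32 interpretation explicit rather than relying on machine byte order:

```python
import numpy as np

def decode_embedding_np(blob: bytes) -> np.ndarray:
    """Interpret the BLOB as little-endian float32 values, matching the
    struct-based decode above. Returns a read-only view of the bytes."""
    return np.frombuffer(blob, dtype="<f4")
```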