Chunks and Embeddings

The omnidata_chunks table stores text segments extracted from resources, along with their vector embeddings. It lives in index.db inside the .omnidata bundle. This is the table that powers both full-text search (via the companion FTS5 index) and vector similarity search.

Schema

-- index.db
CREATE TABLE omnidata_chunks (
    id               TEXT PRIMARY KEY,
    resource_id      TEXT NOT NULL REFERENCES omnidata_resources(id),
    chunk_index      INTEGER NOT NULL,
    chunk_type       TEXT NOT NULL DEFAULT 'text',
    content          TEXT NOT NULL,
    token_count      INTEGER,
    embedding        BLOB,
    embedding_model  TEXT,
    metadata         TEXT DEFAULT '{}',
    created_at       TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
    updated_at       TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')),
    deleted_at       TEXT
);

Columns

id

Text (UUID v4). Primary key for the chunk.

resource_id

Text. Foreign key to omnidata_resources.id (both tables live in index.db). Every chunk belongs to exactly one resource. A resource may have many chunks.

chunk_index

Integer. The ordinal position of this chunk within its parent resource, starting at 0. Used to reconstruct the original document order.
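Reassembling a resource from its chunks is a matter of ordering by chunk_index. A minimal sketch, assuming a sqlite3 connection to index.db (the helper name and the double-newline joiner are illustrative, not part of the SDK):

```python
import sqlite3

def reconstruct_resource(conn: sqlite3.Connection, resource_id: str) -> str:
    """Reassemble a resource's text by walking its chunks in document order."""
    rows = conn.execute(
        "SELECT content FROM omnidata_chunks "
        "WHERE resource_id = ? AND deleted_at IS NULL "
        "ORDER BY chunk_index",
        (resource_id,),
    )
    return "\n\n".join(content for (content,) in rows)
```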

chunk_type

Text. Describes what kind of segment this is:

  • "text" — General prose, the most common type
  • "code" — A code block or file segment
  • "heading" — A section heading, useful for hierarchical navigation
  • "message" — A single message in a conversation thread
  • "transcript" — A segment of transcribed audio or video

The chunk type influences how the text is displayed and may affect chunking strategy.
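One way a consumer might dispatch on chunk_type when displaying chunks. The renderer table and its output formats below are purely illustrative assumptions, not behavior the SDK defines:

```python
# Hypothetical display helpers keyed by chunk_type; the formatting
# choices here are illustrative, not specified by OmniData.
RENDERERS = {
    "text": lambda c: c,
    "code": lambda c: f"```\n{c}\n```",
    "heading": lambda c: f"# {c}",
    "message": lambda c: f"> {c}",
    "transcript": lambda c: f"[transcript] {c}",
}

def render_chunk(chunk_type: str, content: str) -> str:
    """Render a chunk for display, falling back to plain text for unknown types."""
    return RENDERERS.get(chunk_type, RENDERERS["text"])(content)
```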

content

Text. The actual text content of the chunk. This is what gets indexed by FTS5 and what gets embedded into a vector.

token_count

Integer, nullable. The number of tokens in the chunk, as counted by the embedding model’s tokenizer. Useful for prompt budgeting when an agent selects chunks to include in context.
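A sketch of the budgeting use case: greedily packing ranked chunks into a token budget. The helper name and the decision to skip chunks with a NULL token_count are assumptions for illustration:

```python
def select_within_budget(chunks, budget: int) -> list[str]:
    """Greedily take ranked chunks until the token budget is exhausted.

    `chunks` is an iterable of (content, token_count) pairs ordered
    best-first; chunks whose token_count is None (not yet counted)
    are skipped here for simplicity.
    """
    selected, used = [], 0
    for content, token_count in chunks:
        if token_count is None:
            continue
        if used + token_count > budget:
            continue
        selected.append(content)
        used += token_count
    return selected
```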

embedding

BLOB, nullable. The vector embedding of the chunk’s content, stored as a little-endian float32 array. The dimensionality is inferred at read time from len(embedding) / 4. For example, a 768-dimensional embedding occupies 3072 bytes.

NULL when the resource is in the silver pipeline state (chunked but not yet embedded).
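The inverse of the decoding shown later in this page is serializing a vector for storage. A minimal sketch (`pack_embedding` is an illustrative helper, not part of the SDK):

```python
import struct

def pack_embedding(vector: list[float]) -> bytes:
    """Serialize a vector as consecutive little-endian float32 values."""
    return struct.pack(f"<{len(vector)}f", *vector)
```

A 768-dimensional vector packs to 3072 bytes, matching the size noted above.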

embedding_model

Text, nullable. The name of the model used to generate the embedding (e.g., "nomic-embed-text-v1.5"). Recorded so that consumers know which model to use for query embedding. Mixing models within a single bundle is technically possible but not recommended: cosine similarity is only meaningful between vectors produced by the same model.
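The same-model requirement is worth making concrete. A cosine similarity sketch over two packed embedding blobs, which only gives a meaningful score when both rows carry the same embedding_model:

```python
import math
import struct

def cosine_similarity(a: bytes, b: bytes) -> float:
    """Cosine similarity of two little-endian float32 embedding blobs.

    Caller must ensure both blobs come from the same embedding_model.
    """
    va = struct.unpack(f"<{len(a) // 4}f", a)
    vb = struct.unpack(f"<{len(b) // 4}f", b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(y * y for y in vb))
    return dot / (norm_a * norm_b)
```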

metadata

Text (JSON object). Chunk-level metadata. Examples: source line numbers for code chunks, speaker identity for transcript chunks, heading level for heading chunks.
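Since the column is a JSON text blob, consumers decode it before use. A small sketch (the helper name is illustrative; the `"{}"` fallback mirrors the column default):

```python
import json

def chunk_metadata(row) -> dict:
    """Parse a chunk row's JSON metadata, returning {} when absent."""
    return json.loads(row["metadata"] or "{}")
```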

FTS5 companion index

The fts_chunks virtual table mirrors the chunk text for full-text search, also in index.db:

-- index.db
CREATE VIRTUAL TABLE fts_chunks USING fts5(
    content,
    content='omnidata_chunks',
    content_rowid='rowid'
);

FTS5 uses BM25 ranking by default. The index is kept in sync via triggers that fire on INSERT, UPDATE, and DELETE against omnidata_chunks.
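A BM25-ranked query against the companion index might look like the following. This is a sketch assuming a sqlite3 build with FTS5 enabled; the helper name is illustrative. Note that `bm25()` returns a rank where lower is better, so results are sorted ascending:

```python
import sqlite3

def search_chunks(conn: sqlite3.Connection, query: str, limit: int = 10):
    """Full-text search over chunk content, best BM25 matches first."""
    return conn.execute(
        "SELECT c.id, c.resource_id, f.content "
        "FROM fts_chunks f "
        "JOIN omnidata_chunks c ON c.rowid = f.rowid "
        "WHERE fts_chunks MATCH ? "
        "ORDER BY bm25(fts_chunks) "
        "LIMIT ?",
        (query, limit),
    ).fetchall()
```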

Reading embeddings

To read an embedding in Python:

import struct

blob = row["embedding"]               # row: e.g. a sqlite3.Row from omnidata_chunks
dimensions = len(blob) // 4           # 4 bytes per float32
vector = struct.unpack(f"<{dimensions}f", blob)  # tuple of Python floats

The < prefix specifies little-endian byte order, and the f format character reads 32-bit floats. The same layout (consecutive little-endian float32 values) decodes identically in any language that supports binary unpacking.