Documentation Specification SDKs

Content-Addressed Blobs

When you capture a screenshot, record a voice note, or save a PDF, the raw binary content is stored on the filesystem inside the .omnidata bundle’s blobs/ directory. No binary data lives inside SQLite.

Content addressing

Every blob is named by its SHA-256 hash. This provides:

  • Deduplication: Same file captured twice, stored once
  • Integrity: The filename is the checksum – verification is free
  • Immutability: A blob’s content never changes. New content = new hash

Filesystem storage with fanout

Blobs are organized using a two-character fanout prefix, similar to Git’s object storage:

.omnidata/
└── blobs/
    ├── ab/
    │   ├── ab3f7c9e2d...sha256
    │   └── ab91f0a412...sha256
    ├── c8/
    │   └── c84d02ee7b...sha256
    └── f1/
        └── f1a8b3cc00...sha256

The first two hex characters of the hash become the directory name. This prevents any single directory from accumulating millions of entries, which degrades filesystem performance.

Store, read, verify

Storing a blob

  1. Compute SHA-256 of the content
  2. Derive the path: blobs/{hash[:2]}/{hash}
  3. Create the fanout directory if it doesn’t exist
  4. Write the file (atomic write with temp file + rename)
  5. Record the hash, size, and MIME type in index.db
import hashlib, os, tempfile

def store_blob(bundle_path, data, mime_type):
    h = hashlib.sha256(data).hexdigest()
    fanout = os.path.join(bundle_path, "blobs", h[:2])
    os.makedirs(fanout, exist_ok=True)
    blob_path = os.path.join(fanout, h)

    if not os.path.exists(blob_path):
        # Atomic write: temp file in same dir, then rename
        fd, tmp = tempfile.mkstemp(dir=fanout)
        os.write(fd, data)
        os.close(fd)
        os.rename(tmp, blob_path)

    return h  # content_hash for index.db

Reading a blob

def read_blob(bundle_path, content_hash):
    path = os.path.join(bundle_path, "blobs", content_hash[:2], content_hash)
    with open(path, "rb") as f:
        return f.read()

Verifying integrity

def verify_blob(bundle_path, content_hash):
    data = read_blob(bundle_path, content_hash)
    actual = hashlib.sha256(data).hexdigest()
    return actual == content_hash

Because the filename is the hash, verification is a single hash operation. No separate checksum database is needed.

Filesystem benefits

Storing blobs as regular files means the host filesystem does the heavy lifting:

Filesystem What you get for free
btrfs Transparent compression (zstd), CoW snapshots, block-level dedup, checksums
ZFS Compression, send/receive for replication, scrubbing, checksums
APFS Clonefile for instant copies, per-file encryption, snapshots
ext4 Universally available baseline, journaling
S3 / object storage Versioning, lifecycle policies, cross-region replication

None of these features are available when binary data is locked inside a SQLite database. By storing blobs on the filesystem, OmniData inherits whatever the host provides – without writing a single line of compression, dedup, or snapshot code.

See Filesystem as Runtime for a deeper treatment.

Deduplication

Content addressing gives dedup at the application layer: two resources that reference the same binary share one blob file. But filesystem-level dedup goes further:

  • btrfs: duperemove or online dedup finds identical blocks across different blobs
  • ZFS: Block-level dedup (RAM-intensive but effective)
  • APFS: clonefile() creates zero-cost copies that share blocks until modified

Even across separate .omnidata instances, a filesystem with block-level dedup will share storage for identical content.

What gets stored as a blob vs. chunks

Content type Blob? Chunks? Notes
Screenshots Yes No PNG/JPEG binary
Voice recordings Yes No WAV/MP3 binary
PDFs Yes Yes Original binary as blob; extracted text as chunks
Images from web pages Yes No Captured image data
Video clips Yes No Large binary files
Text documents No Yes Text goes directly into index.db chunks
Code files No Yes Text goes into chunks with tree-sitter splitting

A resource can have both a blob (the original binary) and chunks (the extracted text). The blob preserves the original; the chunks make it searchable.