Content-Addressed Blobs

When you capture a screenshot, record a voice note, or save a PDF, the raw binary content is stored on the filesystem inside the .omnidata bundle’s blobs/ directory. No binary data lives inside SQLite.

Content addressing

Every blob is named by its SHA-256 hash. This provides:

Deduplication: Same file captured twice, stored once
Integrity: The filename is the checksum – verification is free
Immutability: A blob’s content never changes. New content = new hash

Filesystem storage with fanout

Blobs are organized using a two-character fanout prefix, similar to Git’s object storage:

.omnidata/
└── blobs/
    ├── ab/
    │   ├── ab3f7c9e2d...sha256
    │   └── ab91f0a412...sha256
    ├── c8/
    │   └── c84d02ee7b...sha256
    └── f1/
        └── f1a8b3cc00...sha256

The first two hex characters of the hash become the directory name. This prevents any single directory from accumulating millions of entries, which degrades filesystem performance.

Store, read, verify

Storing a blob

Compute SHA-256 of the content
Derive the path: blobs/{hash[:2]}/{hash}
Create the fanout directory if it doesn’t exist
Write the file (atomic write with temp file + rename)
Record the hash, size, and MIME type in index.db

import hashlib, os, tempfile

def store_blob(bundle_path, data, mime_type):
    h = hashlib.sha256(data).hexdigest()
    fanout = os.path.join(bundle_path, "blobs", h[:2])
    os.makedirs(fanout, exist_ok=True)
    blob_path = os.path.join(fanout, h)

    if not os.path.exists(blob_path):
        # Atomic write: temp file in same dir, then rename
        fd, tmp = tempfile.mkstemp(dir=fanout)
        os.write(fd, data)
        os.close(fd)
        os.rename(tmp, blob_path)

    return h  # content_hash for index.db

Reading a blob

def read_blob(bundle_path, content_hash):
    path = os.path.join(bundle_path, "blobs", content_hash[:2], content_hash)
    with open(path, "rb") as f:
        return f.read()

Verifying integrity

def verify_blob(bundle_path, content_hash):
    data = read_blob(bundle_path, content_hash)
    actual = hashlib.sha256(data).hexdigest()
    return actual == content_hash

Because the filename is the hash, verification is a single hash operation. No separate checksum database is needed.

Filesystem benefits

Storing blobs as regular files means the host filesystem does the heavy lifting:

Filesystem	What you get for free
btrfs	Transparent compression (zstd), CoW snapshots, block-level dedup, checksums
ZFS	Compression, send/receive for replication, scrubbing, checksums
APFS	Clonefile for instant copies, per-file encryption, snapshots
ext4	Universally available baseline, journaling
S3 / object storage	Versioning, lifecycle policies, cross-region replication

None of these features are available when binary data is locked inside a SQLite database. By storing blobs on the filesystem, OmniData inherits whatever the host provides – without writing a single line of compression, dedup, or snapshot code.

See Filesystem as Runtime for a deeper treatment.

Deduplication

Content addressing gives dedup at the application layer: two resources that reference the same binary share one blob file. But filesystem-level dedup goes further:

btrfs: duperemove or online dedup finds identical blocks across different blobs
ZFS: Block-level dedup (RAM-intensive but effective)
APFS: clonefile() creates zero-cost copies that share blocks until modified

Even across separate .omnidata instances, a filesystem with block-level dedup will share storage for identical content.

What gets stored as a blob vs. chunks

Content type	Blob?	Chunks?	Notes
Screenshots	Yes	No	PNG/JPEG binary
Voice recordings	Yes	No	WAV/MP3 binary
PDFs	Yes	Yes	Original binary as blob; extracted text as chunks
Images from web pages	Yes	No	Captured image data
Video clips	Yes	No	Large binary files
Text documents	No	Yes	Text goes directly into `index.db` chunks
Code files	No	Yes	Text goes into chunks with tree-sitter splitting

A resource can have both a blob (the original binary) and chunks (the extracted text). The blob preserves the original; the chunks make it searchable.