Content-Addressed Blobs
When you capture a screenshot, record a voice note, or save a PDF, the raw binary content is stored on the filesystem inside the .omnidata bundle’s blobs/ directory. No binary data lives inside SQLite.
Content addressing
Every blob is named by its SHA-256 hash. This provides:
- Deduplication: Same file captured twice, stored once
- Integrity: The filename is the checksum, so verification needs no extra metadata, only a re-hash of the content
- Immutability: A blob’s content never changes. New content = new hash
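A minimal sketch of these properties, using only Python's standard hashlib. The helper name content_address is illustrative and not part of OmniData.

```python
import hashlib

def content_address(data: bytes) -> str:
    """Return the SHA-256 hex digest used as the blob's name (illustrative helper)."""
    return hashlib.sha256(data).hexdigest()

first = content_address(b"same screenshot bytes")
second = content_address(b"same screenshot bytes")
edited = content_address(b"same screenshot bytes, edited")

# Deduplication: identical content maps to the same name.
assert first == second
# Immutability: changed content gets a different name.
assert first != edited
# A SHA-256 hex digest is always 64 characters long.
assert len(first) == 64
```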
Filesystem storage with fanout
Blobs are organized using a two-character fanout prefix, similar to Git’s object storage:
.omnidata/
└── blobs/
    ├── ab/
    │   ├── ab3f7c9e2d...sha256
    │   └── ab91f0a412...sha256
    ├── c8/
    │   └── c84d02ee7b...sha256
    └── f1/
        └── f1a8b3cc00...sha256
The first two hex characters of the hash become the directory name, so there are at most 256 fanout directories. This prevents any single directory from accumulating millions of entries, which degrades performance on many filesystems.
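As a rough sketch of the layout arithmetic, the path derivation and the average directory size can be written as follows. The names blob_relpath and expected_entries_per_dir are hypothetical, and the estimate assumes hashes spread evenly across prefixes, which is what one would expect from SHA-256.

```python
import os

def blob_relpath(content_hash: str) -> str:
    """Relative path of a blob inside the bundle: blobs/<first two hex chars>/<hash>."""
    return os.path.join("blobs", content_hash[:2], content_hash)

# Two hex characters give 16 * 16 = 256 possible fanout directories.
FANOUT_DIRS = 16 ** 2

def expected_entries_per_dir(total_blobs: int) -> float:
    """Average directory size, assuming hashes spread uniformly over the prefixes."""
    return total_blobs / FANOUT_DIRS
```

Under that assumption, a bundle holding one million blobs averages about 3,900 files per fanout directory rather than one million in a single directory.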
Store, read, verify
Storing a blob
- Compute SHA-256 of the content
- Derive the path: blobs/{hash[:2]}/{hash}
- Create the fanout directory if it doesn’t exist
- Write the file (atomic write with temp file + rename)
- Record the hash, size, and MIME type in index.db
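import hashlib
import os
import tempfile

def store_blob(bundle_path, data, mime_type):
    h = hashlib.sha256(data).hexdigest()
    fanout = os.path.join(bundle_path, "blobs", h[:2])
    os.makedirs(fanout, exist_ok=True)
    blob_path = os.path.join(fanout, h)
    if not os.path.exists(blob_path):
        # Atomic write: temp file in same dir, then rename
        fd, tmp = tempfile.mkstemp(dir=fanout)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)  # unlike a single os.write, this writes the whole buffer
            os.replace(tmp, blob_path)  # atomic rename within the same directory
        except BaseException:
            os.unlink(tmp)  # don't leave a stray temp file behind
            raise
    # Recording size and mime_type in index.db is not shown here
    return h  # content_hash for index.db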
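The last step, recording the hash, size, and MIME type in index.db, might look like the sketch below. It uses Python's standard sqlite3 module. The table name blobs, its column names, and the function record_blob are assumptions made for illustration; the actual index.db schema is not described here.

```python
import sqlite3

def record_blob(index_db_path, content_hash, size, mime_type):
    """Record blob metadata; the table and column names are assumptions."""
    conn = sqlite3.connect(index_db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS blobs ("
                "content_hash TEXT PRIMARY KEY, "
                "size INTEGER NOT NULL, "
                "mime_type TEXT)"
            )
            # INSERT OR IGNORE keeps the call idempotent when the
            # same content is captured twice.
            conn.execute(
                "INSERT OR IGNORE INTO blobs (content_hash, size, mime_type) "
                "VALUES (?, ?, ?)",
                (content_hash, size, mime_type),
            )
    finally:
        conn.close()
```

Making the hash the primary key mirrors the filesystem behaviour: a second capture of the same content adds neither a second file nor a second row.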
Reading a blob
def read_blob(bundle_path, content_hash):
    path = os.path.join(bundle_path, "blobs", content_hash[:2], content_hash)
    with open(path, "rb") as f:
        return f.read()
Verifying integrity
def verify_blob(bundle_path, content_hash):
    data = read_blob(bundle_path, content_hash)
    actual = hashlib.sha256(data).hexdigest()
    return actual == content_hash
Because the filename is the hash, verification is a single hash operation. No separate checksum database is needed.
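Since every file name carries its own expected hash, a whole bundle can be checked by walking blobs/ and re-hashing each file. The sketch below is one possible way to do that. The function names hash_file and find_corrupt_blobs are hypothetical, and it assumes blob files are named by the bare hex digest, as in store_blob. Files are hashed in chunks so that large blobs such as video clips need not fit in memory.

```python
import hashlib
import os

def hash_file(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large blobs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_corrupt_blobs(bundle_path):
    """Return the names of blob files whose content no longer matches their name."""
    corrupt = []
    blobs_dir = os.path.join(bundle_path, "blobs")
    for fanout in sorted(os.listdir(blobs_dir)):
        fanout_dir = os.path.join(blobs_dir, fanout)
        if not os.path.isdir(fanout_dir):
            continue
        for name in sorted(os.listdir(fanout_dir)):
            if hash_file(os.path.join(fanout_dir, name)) != name:
                corrupt.append(name)
    return corrupt
```

Note that a temporary file left behind by an interrupted write would also be reported, because its name is not a hash.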
Filesystem benefits
Storing blobs as regular files means the host filesystem does the heavy lifting:
| Filesystem | What you get for free |
|---|---|
| btrfs | Transparent compression (zstd), CoW snapshots, block-level dedup, checksums |
| ZFS | Compression, send/receive for replication, scrubbing, checksums |
| APFS | Clonefile for instant copies, per-file encryption, snapshots |
| ext4 | Universally available baseline, journaling |
| S3 / object storage | Versioning, lifecycle policies, cross-region replication |
These features operate on files, so none of them can be applied to individual blobs when binary data is locked inside a SQLite database. By storing blobs on the filesystem, OmniData inherits whatever the host provides, without writing a single line of compression, dedup, or snapshot code.
See Filesystem as Runtime for a deeper treatment.
Deduplication
Content addressing gives dedup at the application layer: two resources that reference the same binary share one blob file. But filesystem-level dedup goes further:
- btrfs: duperemove or another out-of-band dedup tool finds identical blocks across different blobs
- ZFS: Block-level dedup (RAM-intensive but effective)
- APFS: clonefile() creates zero-cost copies that share blocks until modified
Even across separate .omnidata instances, a filesystem with block-level dedup will share storage for identical content.
What gets stored as a blob vs. chunks
| Content type | Blob? | Chunks? | Notes |
|---|---|---|---|
| Screenshots | Yes | No | PNG/JPEG binary |
| Voice recordings | Yes | No | WAV/MP3 binary |
| PDFs | Yes | Yes | Original binary as blob; extracted text as chunks |
| Images from web pages | Yes | No | Captured image data |
| Video clips | Yes | No | Large binary files |
| Text documents | No | Yes | Text goes directly into index.db chunks |
| Code files | No | Yes | Text goes into chunks with tree-sitter splitting |
A resource can have both a blob (the original binary) and chunks (the extracted text). The blob preserves the original; the chunks make it searchable.
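One way to picture this relationship is a record with an optional blob reference and a list of chunks. The Resource class below is a hypothetical in-memory model, not OmniData's actual schema; its field and property names are assumptions, and the byte strings are placeholders rather than real file contents.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Resource:
    """Hypothetical in-memory view of a captured resource."""
    name: str
    # Set when the original binary is kept as a blob.
    content_hash: Optional[str] = None
    # Extracted or original text, stored as chunks.
    chunks: List[str] = field(default_factory=list)

    @property
    def has_blob(self) -> bool:
        return self.content_hash is not None

    @property
    def is_searchable(self) -> bool:
        return bool(self.chunks)

# PDF: original binary as a blob, extracted text as chunks.
pdf = Resource(
    "report.pdf",
    content_hash=hashlib.sha256(b"placeholder PDF bytes").hexdigest(),
    chunks=["extracted page text"],
)
# Text document: chunks only, no blob.
note = Resource("notes.txt", chunks=["plain text goes straight into chunks"])
# Screenshot: blob only, no chunks.
screenshot = Resource(
    "screen.png",
    content_hash=hashlib.sha256(b"placeholder PNG bytes").hexdigest(),
)
```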