Blob Storage
The blobs/ directory stores raw binary content — PDFs, images, audio files, screenshots, anything that has a byte representation. Content is stored using content addressing (SHA-256) on the filesystem, with a fanout directory structure for performance at scale.
Directory structure
Blobs are organized by the first two characters of their SHA-256 hash, creating a two-level fanout:
```
blobs/
├── ab/
│   ├── ab3f7c8e9d...full-sha256-hash
│   └── ab91e2f4a0...full-sha256-hash
├── c7/
│   └── c7d4e8f123...full-sha256-hash
└── ff/
    └── ff0a1b2c3d...full-sha256-hash
```
Each file is named by the full lowercase hex-encoded SHA-256 hash of its contents, with no file extension. The fanout prefix directory uses the first two hex characters of the hash.
This structure prevents any single directory from accumulating too many entries, which degrades filesystem performance on most operating systems. With two hex characters of fanout, content is distributed across up to 256 subdirectories.
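The hash-to-path mapping can be sketched in a few lines (the payload bytes here are purely illustrative):

```python
import hashlib

# Illustrative payloads; each lands in the fanout directory named by
# the first two hex characters of its SHA-256 digest.
for payload in (b"a PDF's bytes", b"an image's bytes", b"an audio file's bytes"):
    digest = hashlib.sha256(payload).hexdigest()
    print(f"blobs/{digest[:2]}/{digest}")
```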
Store operation
To store a blob:
- Compute the SHA-256 hash of the complete file content.
- Derive the fanout prefix: first two characters of the hex hash.
- Create the prefix directory if it does not exist.
- Write the file to blobs/{prefix}/{full_hash}.
- If the file already exists, skip the write — the content is already stored.
```python
import hashlib
from pathlib import Path

def store_blob(container: Path, content: bytes) -> str:
    """Store content and return its SHA-256 hash."""
    content_hash = hashlib.sha256(content).hexdigest()
    prefix = content_hash[:2]
    blob_dir = container / "blobs" / prefix
    blob_dir.mkdir(parents=True, exist_ok=True)
    blob_path = blob_dir / content_hash
    if not blob_path.exists():
        blob_path.write_bytes(content)
    return content_hash
```
Read operation
To read a blob, reconstruct the path from the hash:
```python
def read_blob(container: Path, content_hash: str) -> bytes:
    """Read content by its SHA-256 hash."""
    prefix = content_hash[:2]
    blob_path = container / "blobs" / prefix / content_hash
    return blob_path.read_bytes()
```
Verify operation
To verify blob integrity, recompute the hash and compare:
```python
def verify_blob(container: Path, content_hash: str) -> bool:
    """Verify a blob's integrity against its hash."""
    content = read_blob(container, content_hash)
    actual_hash = hashlib.sha256(content).hexdigest()
    return actual_hash == content_hash
```
A blob that fails verification is corrupt and should be flagged for re-ingestion from the original source.
Linking to resources
Blobs are linked to resources through the content_hash column on omnidata_resources in index.db. The hash stored in the database maps directly to a file path on disk:
```python
# Given a resource's content_hash from index.db
prefix = content_hash[:2]
blob_path = container / "blobs" / prefix / content_hash
```
Multiple resources may reference the same content_hash. The blob is stored once and referenced many times.
Deduplication
Content addressing provides automatic deduplication. If two adapters ingest the same file, or the same image is attached to multiple emails, the SHA-256 hash will be identical. The store operation checks for existence before writing, so no duplicate bytes are stored on disk.
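To make the guarantee concrete, here is a self-contained sketch (repeating the store_blob logic from above) in which two stores of identical bytes produce a single file on disk:

```python
import hashlib
import tempfile
from pathlib import Path

def store_blob(container: Path, content: bytes) -> str:
    """Store content and return its SHA-256 hash (same logic as above)."""
    content_hash = hashlib.sha256(content).hexdigest()
    blob_dir = container / "blobs" / content_hash[:2]
    blob_dir.mkdir(parents=True, exist_ok=True)
    blob_path = blob_dir / content_hash
    if not blob_path.exists():
        blob_path.write_bytes(content)
    return content_hash

container = Path(tempfile.mkdtemp())
h1 = store_blob(container, b"same attachment bytes")
h2 = store_blob(container, b"same attachment bytes")

assert h1 == h2  # identical content, identical address
# Only one file exists on disk despite two store calls.
assert sum(1 for p in (container / "blobs").rglob("*") if p.is_file()) == 1
```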
Filesystem features
| Feature | Details |
|---|---|
| Fanout | 2-character hex prefix, up to 256 subdirectories |
| Naming | Full lowercase SHA-256 hex, no extension |
| Max file size | No limit at the format layer (filesystem limits apply) |
| Chunking | None — files are stored whole |
| Compression | None at the format layer (filesystem or future backend may compress) |
| Permissions | Recommended 0644 for blobs, 0755 for directories |
Garbage collection
A blob should only be removed when no active resource references its hash. Garbage collection is an application-layer concern:
- Query index.db for all distinct content_hash values where deleted_at IS NULL.
- Walk the blobs/ directory and collect all stored hashes.
- Any blob hash not in the active set is orphaned and can be deleted.
- Remove empty fanout directories after cleanup.
Garbage collection should not run during active ingest. Implementations should acquire an exclusive lock or use a stop-the-world approach to avoid race conditions.
Future backends
The blobs/ directory is the default local backend. The specification anticipates alternative backends for distributed or cloud deployments:
| Backend | Use case |
|---|---|
| Filesystem (default) | Local development, single-machine deployments |
| S3-compatible | Cloud storage, shared containers, archival |
| Content-addressed cache | Multi-container deduplication across instances |
Backend selection is an implementation concern. The addressing scheme (SHA-256 hash as identifier) remains the same regardless of backend. A future manifest.json field may declare the active blob backend; for now, blobs/ on the local filesystem is the only specified backend.
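One way an implementation might keep the addressing scheme backend-agnostic is a small interface; the names below are illustrative, not part of the specification:

```python
from pathlib import Path
from typing import Protocol

class BlobBackend(Protocol):
    """Hypothetical backend interface; method names are illustrative."""
    def put(self, content_hash: str, content: bytes) -> None: ...
    def get(self, content_hash: str) -> bytes: ...
    def exists(self, content_hash: str) -> bool: ...

class FilesystemBackend:
    """The default local backend, expressed against the interface above."""

    def __init__(self, container: Path) -> None:
        self.container = container

    def _path(self, content_hash: str) -> Path:
        # Same fanout scheme regardless of backend: blobs/{prefix}/{hash}
        return self.container / "blobs" / content_hash[:2] / content_hash

    def put(self, content_hash: str, content: bytes) -> None:
        path = self._path(content_hash)
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.write_bytes(content)

    def get(self, content_hash: str) -> bytes:
        return self._path(content_hash).read_bytes()

    def exists(self, content_hash: str) -> bool:
        return self._path(content_hash).exists()
```

An S3-compatible backend would implement the same three methods against object keys, leaving callers unchanged.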