
Filesystem as Runtime

OmniData stores blobs as regular files on the host filesystem. This is a deliberate architectural decision: instead of reimplementing compression, snapshots, and deduplication inside the application, OmniData delegates these to the operating system.

The filesystem becomes a runtime layer. What it provides depends on which filesystem the .omnidata bundle lives on.

Per-filesystem capabilities

btrfs (Linux)

btrfs is the strongest match for OmniData’s storage model.

Transparent compression. Mount with compress=zstd and every blob is compressed on write, decompressed on read. The application sees uncompressed data. A bundle storing PDFs and images can shrink by 40-60% with no changes to application code.
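As a sketch, an /etc/fstab entry enabling this might look like the following. The device path and mount point are placeholders; compress=zstd:3 selects zstd at its default level:

```
# Illustrative fstab entry for a btrfs volume holding .omnidata bundles.
# Replace /dev/sdb1 and /srv/omnidata with your own device and mount point.
/dev/sdb1  /srv/omnidata  btrfs  compress=zstd:3,noatime  0  2
```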

Copy-on-Write snapshots. btrfs subvolume snapshot creates an instant snapshot of an .omnidata bundle. The snapshot shares all blocks with the original and consumes additional space only as the two copies diverge. This enables:

  • Point-in-time recovery before risky operations
  • Branching a bundle for experimentation
  • Instant backup before a large ingest
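The "snapshot before a large ingest" step can be sketched in Python. The btrfs command itself is real; the bundle path and the @-suffixed snapshot naming convention are illustrative assumptions, and this only works if the bundle directory is itself a btrfs subvolume:

```python
import subprocess
from datetime import datetime, timezone

def snapshot_argv(bundle: str, label: str) -> list[str]:
    """Build the `btrfs subvolume snapshot` command for a bundle.

    -r makes the snapshot read-only, so the recovery point cannot
    drift while the ingest runs.
    """
    return ["btrfs", "subvolume", "snapshot", "-r",
            bundle, f"{bundle}@{label}"]

def snapshot_before_ingest(bundle: str) -> str:
    """Take a read-only snapshot labeled with the current UTC time."""
    label = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    subprocess.run(snapshot_argv(bundle, label), check=True)
    return f"{bundle}@{label}"
```

If the ingest goes wrong, the bundle can be restored by snapshotting the read-only copy back to a writable subvolume.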

Block-level deduplication. Tools like duperemove find identical blocks across files and collapse them into shared extents. Two .omnidata instances that ingested the same PDF share the blob’s disk blocks.

Checksums. btrfs checksums every data and metadata block. Bit rot is detected on read and can be self-healed from RAID mirrors. OmniData’s content-addressed naming provides a second layer: if the filename hash doesn’t match the content, the blob is corrupt regardless of what the filesystem reports.
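That second, application-level check needs only the standard library. A minimal sketch, assuming blobs are named by the bare SHA-256 hex digest of their content (a flat blobs/ directory with no file extensions is an assumption here):

```python
import hashlib
from pathlib import Path

def verify_blob(path: Path) -> bool:
    """Verify a content-addressed blob.

    The blob's filename must equal the SHA-256 hex digest of its
    bytes; any mismatch means the blob is corrupt, regardless of
    what the filesystem's own checksums report.
    """
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return path.name == digest
```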

ZFS (Linux, FreeBSD, macOS via OpenZFS)

send/receive. zfs send streams a snapshot to another machine or pool. This enables efficient replication of .omnidata bundles across machines: after the initial full send, only changed blocks are transferred.

Scrubbing. zfs scrub reads every block on the pool and verifies checksums. Silent corruption is detected and, on mirrored or RAIDZ pools, automatically repaired.

Compression. Like btrfs, ZFS supports transparent compression (LZ4, zstd). Blobs are compressed at the block level without application involvement.

Checksums. All data and metadata are checksummed (SHA-256, fletcher4, or Skein). Combined with OmniData’s content-addressed naming, corruption is caught at two independent layers.

APFS (macOS)

Clonefile. clonefile() creates an instant copy of a file that shares all disk blocks with the original. Copying a blob within or across .omnidata bundles on the same volume is nearly free until one copy is modified.

Per-file encryption. APFS supports per-file encryption keys, which means different .omnidata bundles on the same volume can have different encryption properties when managed by FileVault or third-party tools.

Snapshots. APFS snapshots are created automatically by Time Machine and can be created manually. They provide point-in-time recovery of the entire volume, including all .omnidata bundles.

Space sharing. Multiple APFS volumes share a single container’s free space. Multiple .omnidata instances don’t need pre-allocated partitions.

ext4 (Linux)

ext4 is the baseline. It provides:

  • Journaling for crash consistency
  • Universal availability on every Linux distribution
  • Mature tooling for backup, recovery, and monitoring

ext4 does not provide transparent compression, deduplication, or snapshots. On ext4, OmniData works correctly but does not benefit from filesystem-level optimization. The application-level dedup from content addressing still applies.
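That application-level dedup is easy to picture: a content-addressed write path behaves identically on ext4 and every other filesystem. A sketch, where the flat blobs/ layout and the .tmp staging name are assumptions:

```python
import hashlib
from pathlib import Path

def ingest_blob(blobs_dir: Path, data: bytes) -> Path:
    """Write a blob into blobs/, named by its SHA-256 hex digest.

    If a blob with that digest already exists, the write is skipped
    entirely; identical content is stored once no matter how many
    times it is ingested.
    """
    digest = hashlib.sha256(data).hexdigest()
    dest = blobs_dir / digest
    if not dest.exists():
        tmp = dest.with_suffix(".tmp")
        tmp.write_bytes(data)
        tmp.rename(dest)  # atomic on POSIX: readers never see a partial blob
    return dest
```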

S3 and object storage (future)

Object storage maps well to the blob model:

  • Each blob is an S3 object, keyed by its SHA-256 hash
  • Versioning provides history without application logic
  • Lifecycle policies can tier old blobs to Glacier or Deep Archive
  • Cross-region replication provides geographic redundancy
  • Server-side encryption (SSE-S3, SSE-KMS) encrypts at rest

The index.db and memory.db files would need a different strategy (they are not object-friendly), but the blob layer maps directly. A future adapter could sync the blobs/ directory to S3 while keeping the databases local.
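The blob-to-object mapping can be sketched as a pure function. The key prefix, bucket name, and the boto3 call in the trailing comment are hypothetical, not part of OmniData today:

```python
import hashlib

def s3_blob_key(data: bytes, prefix: str = "blobs/") -> str:
    """Derive an S3 object key for a blob from its content.

    Mirrors the on-disk content addressing: identical blobs map to
    the same key, so re-uploading a duplicate is a no-op.
    """
    return prefix + hashlib.sha256(data).hexdigest()

# Hypothetical upload with boto3:
# s3.put_object(Bucket="my-bundle", Key=s3_blob_key(data), Body=data)
```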

Why this is impossible with single-file formats

When all content lives inside a single SQLite file:

  • The OS cannot see the blobs. They are opaque rows inside a B-tree. The filesystem cannot compress, deduplicate, or snapshot individual blobs.
  • Backup is all-or-nothing. Changing one blob means the entire database file has changed. rsync and Time Machine must re-read and re-copy the full file to capture a one-blob change.
  • No incremental sync. Transferring a .omnidata file to another machine means sending the whole thing, even if only one resource was added.
  • No block sharing. Two instances with identical blobs store them independently. The filesystem has no way to know the data is duplicated.
  • WAL contention. SQLite allows only one writer at a time, even in WAL mode. Ingesting blobs pushes every byte through the write-ahead log and holds the write lock, delaying metadata updates and checkpoints.

The directory bundle format makes blobs visible to the filesystem, unlocking every optimization the host provides.
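The "changed blobs only" sync story follows directly from immutable, content-addressed filenames. A minimal sketch of what a sync tool needs to compute, assuming the flat blobs/ layout described above (no fan-out subdirectories):

```python
from pathlib import Path

def blobs_to_sync(src: Path, dst: Path) -> set[str]:
    """Return blob filenames present in src's blobs/ but not dst's.

    Because blobs are content-addressed and never modified in place,
    comparing filenames alone identifies exactly what to transfer;
    no re-hashing of unchanged files is needed.
    """
    src_names = {p.name for p in (src / "blobs").iterdir()}
    dst_names = {p.name for p in (dst / "blobs").iterdir()}
    return src_names - dst_names
```

This is the same observation rsync exploits when it skips files whose names and sizes match.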

Comparison table

| Feature | Single SQLite file | Directory bundle (btrfs) | Directory bundle (ZFS) | Directory bundle (APFS) | Directory bundle (ext4) |
|---|---|---|---|---|---|
| Transparent compression | No | zstd, lzo, zlib | lz4, zstd | No | No |
| Snapshots | No | Instant, CoW | Instant, CoW | Time Machine, manual | No |
| Block-level dedup | No | duperemove | Native (RAM-heavy) | clonefile (manual) | No |
| Checksums | No | Per-block | Per-block | No | No |
| Incremental backup | Full file | Changed files only | zfs send (delta) | Time Machine (changed files) | Changed files only |
| Incremental sync (rsync) | Full file | Changed blobs only | Changed blobs only | Changed blobs only | Changed blobs only |
| Per-blob encryption | No | No (volume-level) | Per-dataset (native) | Per-file (FileVault) | No (volume-level) |
| Self-healing | No | RAID1/RAID10 | RAIDZ, mirrors | No | No |

The “No” entries for single SQLite are not limitations of SQLite itself. They are limitations of storing binary content inside any single-file database. The data is invisible to the filesystem, so the filesystem cannot optimize it.

Practical guidance

Development machines (macOS, APFS). You get clonefile block sharing, Time Machine snapshots, and FileVault encryption. No configuration needed.

Production servers (Linux, btrfs). Mount the instances directory with compress=zstd. Run duperemove periodically. Use btrfs snapshots before large ingests.

NAS / backup targets (ZFS). Use zfs send | zfs receive for efficient replication. Enable compression on the dataset. Schedule scrubs weekly.

Minimal environments (ext4). Everything works. You lose filesystem-level compression and dedup, but content-addressed storage still provides application-level dedup, and standard backup tools handle the rest.