The Practical Guide to Data Compression

Shrink It, Store It, Restore It — The Practical Guide to Data Compression 📦✨

Compression is the unsung hero of every storage system, CDN, backup, and media app. It makes huge datasets fit into small places, saves money, speeds up transfers, and can even make things more private (less data surface). This blog explains everything you need — concepts, terminology, algorithms, a step-by-step engineering pipeline for compressing & restoring data reliably, and checkpoints before you build your own compression pipeline. Let’s dive in! 🚀

data-compression-process

TL;DR

Two main kinds: lossless (exact recovery) and lossy (approximate recovery, smaller size).
Common lossless algorithms: Run-Length, Huffman, LZ77/LZ78, LZW, DEFLATE, Brotli, Zstandard.
Common lossy transforms: DCT (JPEG), quantization, MP3/AAC for audio, video codecs (H.264/AVC, H.265/HEVC).
Build pipeline: profile → preprocess → transform → encode → package → test → backup.
Always keep checks: data integrity (checksums), retention, metadata, versioning, and monitoring.

1) Core concepts & terminology 🧠

Entropy — theoretical lower bound on average bits needed per symbol (Shannon). Lower entropy → more compressible.
Redundancy — repeated or predictable information that compression removes.
Lossless compression — original data can be perfectly reconstructed (e.g., gzip, zstd). Use for code, text, databases.
Lossy compression — some information discarded to save space for much higher compression (e.g., JPEG, MP3). Use for images/audio/video where exactness isn’t required.
Compression ratio — original_size / compressed_size (or sometimes compressed_size / original_size). Higher ratio usually better.
Bitrate — bits per second for streaming multimedia or bits per sample; trade-off between quality and size.
Dictionary / codebook — mapping of sequences to shorter tokens (LZ-family, Huffman tables).
Sliding window — technique for finding repeated sequences within a fixed-size recent buffer (LZ77).
Entropy coder — converts symbol probabilities into bitstreams (Huffman, arithmetic coding, ANS).
Transform — change domain for better compression (DCT, BWT, delta encoding).
Quantization — lossy step that reduces precision (used in JPEG, MP3).
Prefix code — no code is a prefix of another (Huffman codes).
Canonical Huffman — a standardized way to represent Huffman codes compactly.
Checksum / CRC / hash — for detecting corruption (MD5/SHA2/Crc32).
Chunking / block — dividing data into units to compress independently or in parallel.
Streaming vs block compression — streaming compressors operate on the fly (zlib), block compressors operate per-chunk (parquet + compression).

2) Fundamental principles of compression 🔍

Exploit redundancy: repeated bytes, predictable patterns, similar samples.
Model the data: better statistical model = better entropy coding. (e.g., context modeling).
Transform to highlight redundancy: e.g., delta encoding for time-series, DCT for images.
Separate stages: transform → quantize (if lossy) → entropy code. Modular design helps optimize each stage.
Balance compute vs space: high compression might be CPU-heavy. Choose algorithms according to constraints.
Design for error detection & recovery: corrupted compressed data can be fatal — include checksums and chunk boundaries.
Test on representative data: compression works differently across data types.

3) Key algorithms (with short, clear explanations) ⚙️

Lossless

Run-Length Encoding (RLE): compress repeated symbols as (symbol,count) pairs. Simple; great for long repetitions (e.g., monochrome images).
Huffman Coding: builds a variable-length prefix code using symbol frequencies. Good general-purpose entropy coding.
Arithmetic Coding: represents message as a fraction in [0,1) using symbol probabilities — often achieves better compression than Huffman for skewed distributions.
LZ77 / LZ78: dictionary-based approaches. LZ77 uses a sliding window and back-references; LZ78 builds a dictionary dynamically.
LZW: builds a dictionary of seen sequences; used historically in GIF and compress utilities.
DEFLATE: combines LZ77 + Huffman. Used by gzip and PNG.
Burrows-Wheeler Transform (BWT): reorders data so similar bytes cluster, then apply MTF (move-to-front) + RLE + entropy coding. Used in bzip2.
Zstandard (zstd) and Brotli: modern compressors combining dictionary techniques + entropy coding; tunable for speed vs ratio.

Lossy (brief)

DCT (Discrete Cosine Transform) + Quantization + entropy coding = JPEG.
Psychoacoustic models (MP3, AAC): remove audio components humans are unlikely to hear.
Video codecs: use spatial transforms (DCT), temporal prediction (motion vectors), and entropy coding (CABAC/CAVLC).

4) Step-by-step practical pipeline — compress & restore reliably 🛠️

Below is an engineering-level pipeline you can adapt.

A. Plan & profile (before writing code)

Data profiling: sample files; measure entropy, types (text, binary, media), sizes, and repeating patterns.
Set requirements: lossless vs lossy, latency, throughput, storage budget, CPU budget, random-access needs.
Regulatory & privacy checks: encryption requirements, legal retention.

B. Preprocessing (improves compression)

Filtering: remove irrelevant metadata (or separate it).
Delta encoding: store differences between successive samples (great for numeric time-series).
Byte shuffling / bit shuffling: reorder bytes to make patterns contiguous (useful for columnar numerical data).
Tokenization / normalization: canonicalize text (lowercase, normalize unicode) if acceptable.

C. Transform (optional)

BWT for text-like data, DCT for images/videos, or PCA/domain-specific transforms.
For images/audio/video, apply perceptual transforms and downsampling where acceptable.

D. (If lossy) Quantization & Thresholding

Reduce precision using quantization tables (e.g., JPEG) and remove imperceptible components.

E. Entropy coding / compression

Choose appropriate codec:
- Text/logs → zstd, Brotli (good compression + speed).
- Binary blobs → zstd or lz4 (if speed needed).
- Media → JPEG/PNG/MP3/AAC/H.264 as appropriate.
Tune parameters: compression level, window size, dictionary.

F. Containerization & metadata

Store metadata: original checksum, original size, compression algorithm + parameters, timestamp, version.
Use container formats (tar + gzip, ZIP, custom JSON metadata) or object storage metadata fields.

G. Integrity & backup

Compute and store checksums (e.g., SHA-256) for original and compressed files.
Keep at least one redundant backup before deleting originals.
For large datasets, use versioned object storage (S3 versioning) or snapshot-based backups.

H. Decompression & verification

Decompress using recorded parameters.
Verify checksums and file sizes.
If lossy, compare quality metric (PSNR, SSIM for images) to ensure acceptable loss.

I. Monitoring & maintenance

Track compression ratio, CPU/time per compression, error rates, and data corruption incidents.
Periodically re-evaluate parameters; new algorithms may give better results.

5) Example: simple compress/decompress workflows 🧩

Command-line (UNIX)

Lossless (gzip):

# compress (keeps original -k)
gzip -k -9 data.csv        # -9 max compression
# decompress
gunzip data.csv.gz

Zstandard (fast + good ratio):

# compress
zstd -19 data.bin -o data.bin.zst   # -19 high compression
# decompress
zstd -d data.bin.zst -o data.bin

Tar + compression (archive many files):

# create tar + gz
tar -czf archive.tar.gz folder/
# extract
tar -xzf archive.tar.gz

Python (lossless using zlib)

import zlib, hashlib, json

with open('data.bin','rb') as f:
    raw = f.read()
orig_hash = hashlib.sha256(raw).hexdigest()
comp = zlib.compress(raw, level=9)
with open('data.bin.z', 'wb') as g:
    g.write(comp)
# metadata
meta = {'alg':'zlib','level':9,'orig_sha256':orig_hash, 'orig_size':len(raw)}
with open('data.bin.z.meta.json','w') as m:
    json.dump(meta, m)
# decompress
decomp = zlib.decompress(open('data.bin.z','rb').read())
assert hashlib.sha256(decomp).hexdigest() == meta['orig_sha256']

6) Checkpoints before you build a production compression pipeline ✅

Understand your data: types, distributions, typical sizes. Compression behaves differently for images, logs, numeric columnar data, and media.
Decide lossless vs lossy: legal/regulatory and business needs matter (medical images -> usually lossless).
Performance constraints: CPU cost, memory use, latency, throughput, parallelization. Real-time systems need low-latency encoders.
Random access needs: Do you need to read partial files? Choose block-level compression (e.g., compress by chunk) rather than single big stream.
Error tolerance & recovery strategy: plan for partial corruption — use chunk-level checksums and independent blocks so partial recovery is possible.
Backup & retention policy: how many copies, frequency, and whether compressed files are included in backups.
Compatibility & interoperability: standard formats (gzip, zip, zstd) vs custom. Where will files be read? Ensure libraries exist for all consumers.
Security & privacy: encryption before/after compression? Note: compressing encrypted data yields poor results; often compress first, then encrypt.
Monitoring & metrics: define KPIs (compression ratio, throughput, CPU per MB, error rate).
Testing on real workloads: stress with representative data and measure both quality and performance.
Scalability: plan for parallelization, distributed chunking, and re-compression strategies as data grows.
Versioning & metadata: store algorithm/version/parameters with every file for future decompression compatibility.

7) Common pitfalls & best practices ⚠️➡️✅

Don’t compress already compressed data (e.g., JPEG, MP3, video containers) — often increases size or gives no benefit.
Compress before encryption: encrypting first hides redundancy and prevents compression.
Store algorithm metadata — otherwise future you won’t know how to decompress.
Chunk large files — enables parallel compression/decompression and reduces damage risk.
Tune for the right axis: maximum compression often costs time; choose levels appropriate for your SLA.
Monitor long-term — compression tradeoffs change as data and access patterns evolve.
Test integrity regularly — bit rot and storage corruption happen; checksums + scrubbing help catch them.

8) Metrics to evaluate (what to measure) 📊

Compression ratio (orig_size / compressed_size) or percent saved.
Throughput (MB/s) for both compress and decompress.
CPU and memory usage.
Latency (important for streaming).
Error/corruption rate.
Quality metrics for lossy: PSNR, SSIM, MOS, or domain-specific perceptual scores.

9) Real-world patterns / recipes 🧾

Logs / text archives: Use zstd with medium levels for fast compress + good ratio. Consider pre-tokenization (remove timestamps, normalize common phrases).
Columnar numeric data (Parquet): use delta encoding, bit-shuffle, then apply snappy/zstd.
Images: prefer appropriate formats — photographic images → JPEG/HEIF/AVIF (lossy), line art → PNG (lossless), or WebP/AVIF for better ratios.
Time-series: use delta + variable-byte or Gorilla compression (used in TSDBs), then pack into blocks for random access.
Backups: chunk + dedupe + compress (store chunk metadata + checksums). Use content-addressed storage (CAS) for dedup.

10) Final checklist before deployment ✅🧾

Closing notes — be pragmatic 🎯

Compression is both science and art. Start simple: pick a robust modern compressor (zstd or Brotli for general-purpose), validate with real samples, and gradually add transforms (delta, shuffle, BWT) when you need extra gains. Always preserve metadata and checksums — your future self will thank you when you need to restore data after months or years.

Lakhveer Singh Rajput