Archival Compression Comparisons
December 22, 2025
I need to archive roughly 1 TB of data as a write-once, read-almost-never redundancy layer for files I care about. S3 Glacier Deep Archive is extremely cheap (~$0.002/GB/month), but at that scale it is still worth compressing the data as aggressively as possible beforehand.
The data consists almost entirely of large CSV files containing high-entropy
numeric data. This post documents a few practical benchmarks to determine
the most sensible zstd settings for this workload.
Hardware
- CPU: Intel Core i7-10700K
- RAM: 31.26 GiB
- Disk: Crucial P3 Plus 2 TB NVMe
Dataset 1: Large, Multi-file Corpus
- 16 GiB total
- 318 CSV files
- High-entropy numeric trade data
| Compressor | Command | Time | Compression ratio | Final size |
|---|---|---|---|---|
| zstd | `-19 --long=31 -T8` | 41m 40s | 6.45× | 2.70 GiB |
| zstd (ultra) | `--ultra -22 --long=31 -T8` | 2h 48m | 6.54× | 2.67 GiB |
The `--ultra -22` setting provides only a ~1.3% reduction in size at the cost of roughly a 4× increase in runtime. This is a clear case of diminishing returns.
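For context, the corpus numbers above come from compressing everything as a single stream. Here is a minimal sketch of that kind of pipeline, assuming the CSVs live under a placeholder `data/` directory and are tarred together first; the exact invocation used for the benchmark may have differed:

```bash
# Tar the corpus and compress it as one stream so the --long=31
# window (2 GiB) can find matches across file boundaries.
tar -cf - data/ | zstd -19 --long=31 -T8 -o corpus.tar.zst

# Decompression must request the same long-distance window:
zstd -d --long=31 corpus.tar.zst -o corpus.tar
```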
Dataset 2: Single-file Benchmark
- 1 GiB total
- Single CSV file of the same type
- Tested with `-T0` and `--long=31` at every compression level (a sketch of the sweep script is below)
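For reproducibility, this is roughly the kind of sweep that produced the table below. The input file name and output directory are placeholders, and the timing here is coarse (whole seconds); the actual measurements were finer-grained:

```bash
#!/usr/bin/env bash
# Hypothetical sweep over zstd levels 1-19 on a single CSV file.
# trades.csv and the out/ directory are placeholder names.
set -euo pipefail

input=trades.csv
mkdir -p out

for level in $(seq 1 19); do
    out="out/${input}.${level}.zst"
    start=$SECONDS
    # -T0: use all cores; --long=31: 2 GiB match window
    # (decompression will also need --long=31)
    zstd -q -f "-${level}" -T0 --long=31 "$input" -o "$out"
    echo "level ${level}: $(( SECONDS - start )) s, $(du -h "$out" | cut -f1)"
done
```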
| zstd level | Time (s) | Final size | Ratio (% of original) |
|---|---|---|---|
| 1 | 1.4 | 164 MiB | 15.2% |
| 2 | 1.4 | 170 MiB | 15.9% |
| 3 | 1.5 | 177 MiB | 16.4% |
| 4 | 1.7 | 177 MiB | 16.5% |
| 5 | 2.8 | 161 MiB | 15.0% |
| 6 | 3.3 | 150 MiB | 13.9% |
| 7 | 5.0 | 149 MiB | 13.9% |
| 8 | 5.2 | 135 MiB | 12.6% |
| 9 | 6.9 | 136 MiB | 12.6% |
| 10 | 12.0 | 135 MiB | 12.6% |
| 11 | 16.9 | 134 MiB | 12.4% |
| 12 | 19.8 | 134 MiB | 12.5% |
| 13 | 17.0 | 132 MiB | 12.2% |
| 14 | 20.1 | 130 MiB | 12.1% |
| 15 | 28.4 | 130 MiB | 12.1% |
| 16 | 30.6 | 110 MiB | 10.2% |
| 17 | 55.7 | 111 MiB | 10.3% |
| 18 | 76.1 | 108 MiB | 10.0% |
| 19 | 122.3 | 107 MiB | 10.0% |
Conclusion
Strangely, levels 8 to 15 offer almost identical compression (12.6% down to 12.1%) at vastly different speeds (roughly a 5× increase in time). The two highest non-ultra levels (18 and 19) do shave off a bit more, and since this data is write-once and S3 Deep Archive has a 180-day minimum storage duration anyway, it could make sense to use them. But at a 10-20× increase in processing time over the mid levels... I'd rather not brutalise my CPU for days on end to save a dollar a year at best.
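Whatever level I settle on, the last step is the upload itself. A minimal sketch, assuming the archive name from the earlier pipeline and an already-configured AWS CLI; the bucket name is a placeholder:

```bash
# Verify the archive before shipping it off (matching --long window needed).
zstd -t --long=31 corpus.tar.zst

# Upload straight into the Deep Archive storage class.
# s3://my-archive-bucket is a placeholder bucket name.
aws s3 cp corpus.tar.zst s3://my-archive-bucket/corpus.tar.zst \
    --storage-class DEEP_ARCHIVE
```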