Archival Compression Comparisons

December 22, 2025

I need to archive roughly 1 TB of data as a write-once, read-almost-never redundancy layer for files I care about. S3 Glacier Deep Archive is extremely cheap (~$0.002/GB/month), but at that scale it is still worth compressing the data as aggressively as possible beforehand.

The data consists almost entirely of large CSV files containing high-entropy numeric data. This post documents a few practical benchmarks to determine the most sensible zstd settings for this workload.
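
For reference, the commands in the tables below are plain zstd invocations. A minimal sketch of the kind of pipeline involved looks like this; the tar step and the paths are illustrative assumptions rather than the exact setup, but the zstd flags are the ones benchmarked:

    # Assumption: the corpus is bundled with tar and piped straight into zstd.
    # /data/corpus and archive.tar.zst are placeholder names.
    tar -cf - /data/corpus \
      | zstd -19 --long=31 -T8 -o archive.tar.zst
    # -19        compression level (the non-ultra setting tested below)
    # --long=31  long-distance matching with a 2 GiB window, helps across many similar files
    # -T8        use 8 worker threads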

Hardware

Dataset 1: Large, Multi-file Corpus

Compressor      Command                      Time      Compression   Final size
zstd            -19 --long=31 -T8            41m 40s   6.45×         2.70 GiB
zstd (ultra)    --ultra -22 --long=31 -T8    2h 48m    6.54×         2.67 GiB

The --ultra setting provides only a ~1.3% reduction in size at the cost of more than a 4× increase in runtime. This is a clear case of diminishing returns.
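
One operational caveat, not from the benchmark itself but standard zstd behaviour: frames written with --long=31 use a window larger than the decoder's default limit, so the flag has to be passed again when restoring, otherwise zstd refuses to decompress and asks for a larger window or memory limit.

    # archive.tar.zst is the placeholder name from the sketch above.
    zstd -d --long=31 archive.tar.zst -o corpus.tar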

Dataset 2: Single-file Benchmark

zstd level   Time (s)   Final size   Ratio
1            1.4        164 MiB      15.2%
2            1.4        170 MiB      15.9%
3            1.5        177 MiB      16.4%
4            1.7        177 MiB      16.5%
5            2.8        161 MiB      15.0%
6            3.3        150 MiB      13.9%
7            5.0        149 MiB      13.9%
8            5.2        135 MiB      12.6%
9            6.9        136 MiB      12.6%
10           12.0       135 MiB      12.6%
11           16.9       134 MiB      12.4%
12           19.8       134 MiB      12.5%
13           17.0       132 MiB      12.2%
14           20.1       130 MiB      12.1%
15           28.4       130 MiB      12.1%
16           30.6       110 MiB      10.2%
17           55.7       111 MiB      10.3%
18           76.1       108 MiB      10.0%
19           122.3      107 MiB      10.0%
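
The sweep above is easy to reproduce with a small shell loop. The sketch below is a hypothetical harness, not necessarily the one used for these numbers; it assumes GNU time and GNU stat, and input.csv stands in for the actual file:

    # Compress at each level, printing elapsed time and output size.
    for level in $(seq 1 19); do
        /usr/bin/time -f "level $level: %e s" \
            zstd -q -f -"$level" input.csv -o input.csv.zst
        stat -c "level $level: %s bytes" input.csv.zst
    done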

Conclusion

Strangely, levels 8 to 15 offer almost identical compression (12.6% down to 12.1%) but at vastly different speeds (roughly a 5× increase in runtime). The two highest non-ultra levels (18 and 19) do shave off a bit more, and given my requirements (S3 Deep Archive requires files to be kept for at least 6 months anyway), it could make sense to use those settings. But at a 10-20× increase in processing time... I'd rather not brutalise my CPU over the next few days to save a dollar a year at best.
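
To put "a dollar a year" in perspective, here is a back-of-the-envelope estimate using the ~$0.002/GB/month figure from the intro and assuming the single-file ratios carry over to the full 1 TB (a rough extrapolation, not a measured result):

    level 15: ~12.1% of 1 TB ≈ 121 GB  →  121 GB × $0.002 × 12 ≈ $2.90/year
    level 19: ~10.0% of 1 TB ≈ 100 GB  →  100 GB × $0.002 × 12 ≈ $2.40/year
    difference ≈ $0.50/year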

