A new efficient lossless compression algorithm

Posted on August 24, 2010 9:45 AM by Andrew

Frank Wood and Nick Bartlett write:

Deplump works the same as all probabilistic lossless compressors. A datastream is fed one observation at a time into a predictor which emits both the data stream and predictions about what the next observation in the stream should be for every observation. An encoder takes this output and produces a compressed stream which can be piped over a network or to a file. A receiver then takes this stream and decompresses it by doing everything in reverse. In order to ensure that the decoder has the same information available to it that the encoder had when compressing the stream, the decoded datastream is both emitted and directed to another predictor. This second predictor’s job is to produce exactly the same predictions as the initial predictor so that the decoder has the same information at every step of the process as the encoder did.

The difference between probabilistic lossless compressors is in the prediction engine, encoding and decoding being “solved” optimally already.

Deplump compression technology is built on a probabilistic discrete sequence predictor called the sequence memoizer. The sequence memoizer has been demonstrated to be a very good predictor for discrete sequences. The advantage deplump demonstrates in comparison to other general purpose lossless compressors is largely attributable to the better guesses made by the sequence memoizer.

Deplump is algorithmically quite closely related to infinite context prediction by partial matching (PPM) although it has differences and advantages which derive from being grounded on a coherent probabilistic predictive model.

I don’t know anything about this, but it looks interesting to me. They also have a cool website where you can compress your own files. According to Frank, it gives you an compressed file that’s quite a bit smaller than what you get from gzip.