Entropy coding for Spam/Ham detection

Wat

I already built a spam filter based on Bayesian estimation which works reasonably well but

  1. is still comparatively slow
  2. stores like 62MB of data for each classification
  3. is relatively complex
  4. requires a persistent server for reasonable performance

None of those is particularly “fun”, so I decided to do something simpler that still kinda works. Enter the Huffman code.

How

Intro

On a high level, a Huffman code is a way of encoding a certain set of data so that it is “optimally” compressed, for some notion of being optimal. In this case, it’s “the smallest size you can get from a prefix-free code”. Prefix-free means that no codeword generated by a Huffman code is a prefix of another codeword generated by the same code (for example, with the codewords 0, 10 and 11, the stream 01110 can only be decoded as 0, 11, 10), so decoding can be done symbol-by-symbol without expensive state tracking. It also means that the code is fairly agnostic to what kind of data it compresses/encodes: there is no special handling like “replace runs of consecutive bytes by a count followed by the byte”.

When using a Huffman code, usually there is either one pre-generated code based on a “common” distribution of the underlying data, or a code is generated for each piece of data that needs to be compressed/encoded and is simply transferred alongside the compressed/encoded data.

(I’ll be using “encoded” from now on to mean “compressed/encoded” because that’s just easier to type and read).

Generating a code

Generating a Huffman code works like this:

  1. Count all “symbols” (individual elements the data is made up of) in the data. Usually this means individual bytes, but in my case it’s a sliding 2-byte window. This results in a mapping of symbols to their frequency (i.e. “how often did I see 0x41 in the data?”).
  2. Once all the frequencies are collected, put all of them into a data structure that allows you to quickly grab the two symbols with the lowest frequency. I chose a heap for that, but a list works as well.
  3. As long as that data structure has more than one entry in it:
    1. Remove the two entries with the lowest frequency
    2. Join them to create a new entry. The frequency of the new entry is the sum of the frequencies of its children.
    3. Add the new entry to the data structure and repeat.

The result is a tree of symbols that has the symbols with the highest frequency (i.e. the lowest entropy) near its root, and symbols with lower frequency (i.e. higher entropy) further down the tree. This is the Huffman code for the data that the frequencies were collected from.
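
Here’s roughly what that construction looks like in Go, using container/heap as the “grab the two lowest” structure. This is a minimal sketch, not the actual classifier’s code; the node and build names are mine:

package huffman

import "container/heap"

// node is either a leaf (Sym is meaningful) or an inner node with two children.
type node struct {
    Freq        uint64
    Sym         uint16 // the 2-byte symbol; only set on leaves
    Left, Right *node
}

// nodeHeap is a min-heap of nodes, ordered by frequency.
type nodeHeap []*node

func (h nodeHeap) Len() int           { return len(h) }
func (h nodeHeap) Less(i, j int) bool { return h[i].Freq < h[j].Freq }
func (h nodeHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *nodeHeap) Push(x interface{}) { *h = append(*h, x.(*node)) }
func (h *nodeHeap) Pop() interface{} {
    old := *h
    n := old[len(old)-1]
    *h = old[:len(old)-1]
    return n
}

// build turns a symbol-to-frequency mapping into a Huffman tree by
// repeatedly joining the two lowest-frequency entries.
func build(freqs map[uint16]uint64) *node {
    h := &nodeHeap{}
    for sym, f := range freqs {
        *h = append(*h, &node{Freq: f, Sym: sym})
    }
    heap.Init(h)
    if h.Len() == 0 {
        return nil
    }
    for h.Len() > 1 {
        a := heap.Pop(h).(*node)
        b := heap.Pop(h).(*node)
        heap.Push(h, &node{Freq: a.Freq + b.Freq, Left: a, Right: b})
    }
    return heap.Pop(h).(*node)
}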

To update an existing code, we can reconstruct the frequencies by taking the tree’s leaf nodes as the initial frequency counts, update them with the counts from some additional text, and re-generate the tree using the new frequencies.
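
Building on the sketch above, the update is just that: collect the leaves back into a frequency map, merge in the new counts, rebuild.

// update reconstructs the frequencies from the tree's leaves, merges in
// the counts from some additional text, and rebuilds the tree.
func update(root *node, extra map[uint16]uint64) *node {
    freqs := map[uint16]uint64{}
    var collect func(n *node)
    collect = func(n *node) {
        if n.Left == nil && n.Right == nil {
            freqs[n.Sym] = n.Freq
            return
        }
        collect(n.Left)
        collect(n.Right)
    }
    collect(root)
    for sym, f := range extra {
        freqs[sym] += f
    }
    return build(freqs)
}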

Using a code

To encode something with the code, the following strategy is used for each symbol in the input to generate its encoded representation:

  1. Start at the root of the tree.
  2. Go down one level in the tree: choose the child that has the symbol as one of its leaves. Make a note of the direction (left vs. right child) you took.
  3. If the current tree node has children, go back to step 2.
  4. Convert the right-left-right-left pattern generated by steps 2 and 3 into a binary representation by using, for example, 0 to represent “went left in the tree” and 1 to represent “went right in the tree”.

The pattern generated by this algorithm is the encoding of a symbol. Its length is determined by the entropy of the symbol in the original data (the data we collected the frequencies from). A high entropy (i.e. a low frequency: we didn’t see it that often in the original data) means that the symbol contains a lot of information: seeing it tells us more about the data we encoded than seeing a symbol with high frequency (i.e. low entropy) does.
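
Since the bit pattern for a symbol is exactly its root-to-leaf path, the encoded length of a symbol is just the depth of its leaf, and (as described further down) the lengths are all the classifier actually needs. A sketch, using the node type from above:

// codeLengths walks the tree and records the depth of every leaf, which
// is the length in bits of that symbol's encoding.
func codeLengths(root *node) map[uint16]int {
    lengths := map[uint16]int{}
    var walk func(n *node, depth int)
    walk = func(n *node, depth int) {
        if n.Left == nil && n.Right == nil {
            lengths[n.Sym] = depth
            return
        }
        walk(n.Left, depth+1)  // going left would emit a 0
        walk(n.Right, depth+1) // going right would emit a 1
    }
    walk(root, 0)
    return lengths
}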

The sum of the encoded lengths of the symbols that make up a stream of data tells us how closely the distribution of symbols in the stream matches the distribution of symbols that was used to generate the code.

Using it to classify text

The key to text classification is now to generate multiple codes representing classes of text, such as “ham” and “spam”, and to compare how well each code compresses the piece of text that we want to classify.

The class whose code compresses the text into the fewest bits is then output as the presumed class for the text, since entropy-wise the text matches the corpus that this code was trained on best.
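
A sketch of that comparison, under the same assumptions as the snippets above (the text is already split into 16-bit symbols, and each class maps to a table of per-symbol encoded lengths; the names are illustrative, not the actual tool’s API):

// classify sums the encoded length of text under each class's code and
// returns the class that compresses it into the fewest bits.
func classify(text []uint16, codes map[string]map[uint16]int) (best string, bestBits int) {
    bestBits = -1
    for class, lengths := range codes {
        bits := 0
        for _, sym := range text {
            // A real implementation needs a fallback length for symbols
            // a code has never seen; see the metadata described below.
            bits += lengths[sym]
        }
        if bestBits < 0 || bits < bestBits {
            best, bestBits = class, bits
        }
    }
    return best, bestBits
}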

An example looks like this:

; cat hello.msg | classify -class /tmp/ham.gob -class /tmp/spam.gob
spam 80408 0.505279
ham  82124 0.494721

This means that the text in hello.msg could be compressed to 80408 bits with the spam.gob classifier and to 82124 bits with the ham.gob classifier. Since the spam code yields the shorter encoding, the text is classified as spam.

Neat tricks I used

As alluded to in the section about generating a code, the classifier doesn’t actually consume single bytes as the units of data that make up a piece of text. It uses a sliding window of two bytes to generate a sequence of fragments to count/compress, in an attempt to capture the connection between neighboring elements in the text. If this weren’t done, the classifier would see no difference between the texts abcdef and fedcba, which are wildly different but have the same frequency counts.
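
Generating those fragments might look like this in Go (how the two bytes are packed into a 16-bit symbol is a guess on my part, it only has to be consistent):

// windows turns a byte stream into overlapping 2-byte symbols:
// "abcd" yields the symbols for "ab", "bc" and "cd".
func windows(data []byte) []uint16 {
    if len(data) < 2 {
        return nil
    }
    syms := make([]uint16, 0, len(data)-1)
    for i := 0; i+1 < len(data); i++ {
        syms = append(syms, uint16(data[i])<<8|uint16(data[i+1]))
    }
    return syms
}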

Since I’m only interested in the length of the encoded text, I don’t need to track the individual codewords that the code would map the text fragments to. It’s enough to store the length that the code produces for each of the 16-bit integers (two consecutive bytes) it has seen while being updated.

The stored code thus consists of two arrays of 65536 64-bit integers, one holding the encoded length and one holding the frequency of each 16-bit pattern (2 bytes) seen while generating/updating the code, plus a tiny bit of metadata: how many bytes the code has seen already, and the last byte seen during an update, so that consecutive updates can be treated as reading from a single piece of text without gaps. The code also stores the highest length of any encoded pattern seen so far (+1), which is used when the code is asked for the encoded length of a 2-byte tuple it hasn’t seen yet.

This all comes down to 131087 bytes of actual state that needs to be tracked for each code. Encoding this state with Go’s encoding/gob adds a bit of overhead, so the classifier stores roughly 140KB per code. Quite an improvement over my other classifier’s ~62MB, I’d say.
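
Spelled out as a Go struct, that state might look roughly like this (the field names are my guesses, the layout just follows the description above):

// code is the persisted per-class state.
type code struct {
    Lengths [65536]uint64 // encoded length in bits of each 2-byte pattern
    Freqs   [65536]uint64 // how often each 2-byte pattern was seen
    Seen    uint64        // how many bytes the code has seen in total
    Last    byte          // last byte of the previous update, to avoid gaps
    MaxLen  uint64        // highest encoded length seen so far, +1; fallback for unseen patterns
}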

Why

TBD

But

TBD

Code

Here ya go.

TODO

  • Terminology
  • Pictures