DSPAM does an excellent job of filtering spam out of my e-mail. I’ve been trying for two years to tweak it to do a good job of adaptive ranking of articles in RSS feeds. It hasn’t worked and now I’m trying a home-grown solution.
DSPAM does Bayesian classification (among several algorithms) and is tuned for spam filtering. Part of the problem is that classification is a yes/no decision, while I need ranking, which is a more interesting problem. That is a basic mismatch. DSPAM has also been optimized for e-mail: it recognizes e-mail headers and bodies and treats them differently, which is unnecessary for RSS articles and even detrimental. The result was wild jumps in the rankings of articles and the occasional strange result. For example, “Sponsored Link” articles had sunk to the bottom of the heap, where I wanted them, and stayed there for months. Suddenly they were scattered all through the rankings, and while I could downgrade individual items, new “Sponsored Link” articles continued to show up all over.
The new algorithm uses several ideas from DSPAM: bi-grams (word pairs as well as individual words) and the basic database structure (an article has many words/word pairs). Rather than decide ahead of time what makes a good scoring algorithm, the database stores all actions on an article along with all of its words and word pairs. The actions are:
- clicks thru
- votes up – I am more interested in this article than its current ranking
- votes down – I am less interested in this article than its current ranking
- hide – stop showing this article (e.g. duplicates)
- expires – article fell off RSS feed without ever being read.
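To make the data model concrete, here is a minimal Python sketch of pulling words and word pairs out of an article and recording actions against them. It is not the actual implementation: the names `features`, `Store`, `add_article` and `record`, the tokenizing regex, and the choice to credit an action once to each distinct word or pair in the article are all assumptions made for illustration, and a real version would sit on a database rather than in-memory dictionaries.

```python
import re
from collections import defaultdict

# Action names are hypothetical labels for the five actions listed above.
ACTIONS = ("click", "up", "down", "hide", "expire")


def features(text):
    """Return the individual words followed by adjacent word pairs (bi-grams)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs


class Store:
    """In-memory stand-in for the database (assumed layout, not the real schema)."""

    def __init__(self):
        self.articles = {}                   # article id -> list of words/pairs
        self.occurrences = defaultdict(int)  # word or pair -> times seen
        self.actions = defaultdict(lambda: defaultdict(int))  # word/pair -> action -> count

    def add_article(self, article_id, text):
        tokens = features(text)
        self.articles[article_id] = tokens
        for token in tokens:
            self.occurrences[token] += 1

    def record(self, article_id, action):
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        # Assumption: an action counts once per distinct word/pair in the article.
        for token in set(self.articles[article_id]):
            self.actions[token][action] += 1
```

Keeping the raw counts per action, rather than a precomputed score, is what allows the scoring formula to be changed later without losing history.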
Each word/word-pair also records how many times it has occurred. The current scoring algorithm is:
sum((click + (ups - downs)/2) / occurrences) / (# of word/word-pairs)
This works fairly well. It doesn’t produce wild jumps when I vote an item up or down, and articles I truly have no interest in continue to cluster at or near the bottom of the rankings.
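As a rough illustration, the formula above could be computed like this in Python. The `TokenStats` class and `score` function are hypothetical names, and two details are my assumptions rather than anything stated above: a word or pair with no recorded history contributes zero, and the divisor is the number of words and pairs in the article being scored.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    """Counts kept for one word or word pair (assumed layout)."""
    occurrences: int = 0
    clicks: int = 0
    ups: int = 0
    downs: int = 0


def score(tokens, stats):
    """Average of (clicks + (ups - downs)/2) / occurrences over an article's tokens."""
    if not tokens:
        return 0.0
    total = 0.0
    for token in tokens:
        s = stats.get(token)
        if s is None or s.occurrences == 0:
            continue  # assumption: unseen words/pairs add nothing to the sum
        total += (s.clicks + (s.ups - s.downs) / 2) / s.occurrences
    return total / len(tokens)


stats = {
    "rust": TokenStats(occurrences=4, clicks=2, ups=1, downs=0),
    "sponsored": TokenStats(occurrences=10, clicks=0, ups=0, downs=5),
}
print(score(["rust", "sponsored"], stats))  # (0.625 + -0.25) / 2 = 0.1875
```

Dividing by occurrences keeps a single vote from swinging the score of a common word very far, which fits the absence of wild jumps described above.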
Once there are several weeks of data, I will pick some stories from throughout the rankings, give them scores, and then try various curve-fitting methods to find a better ranking algorithm.