Archive for July, 2008

Amethyst dumps DSPAM

Wednesday, July 30th, 2008

DSPAM does an excellent job of filtering spam out of my e-mail. For two years I’ve also been trying to tweak it into doing a good job of adaptively ranking articles in RSS feeds. It hasn’t worked, so now I’m trying a home-grown solution.

DSPAM does Bayesian classification (among several algorithms) and is tuned for spam filtering. Part of the problem is that classification is a yes/no decision, whereas I need ranking: this article is more interesting than that one. That is a basic mismatch. DSPAM has also been optimized for e-mail: it recognizes e-mail headers and bodies and treats them differently, which is unnecessary for RSS articles and even detrimental. The result was wild jumps in article rankings and occasional strange results. For example, “Sponsored Link” articles had sunk to the bottom of the heap, where I wanted them, and stayed there for months. Then suddenly they were scattered all through the rankings, and while I could downgrade individual items, new “Sponsored Link” articles continued to show up all over.

The new algorithm uses several ideas from DSPAM: bi-grams (word pairs as well as individual words) and the basic database structure (an article has many words/word pairs). Rather than decide ahead of time what makes a good scoring algorithm, the database stores all actions on an article along with all of its words/word pairs. The actions are:

  • clicks thru
  • votes up – I am more interested in this article than its current ranking
  • votes down – I am less interested in this article than its current ranking
  • hide – stop showing this article (e.g. duplicates)
  • expires – article fell off RSS feed without ever being read

Each word/word-pair also records how many times it has occurred. The current scoring algorithm is:

score = sum((click + (ups – downs)/2) / occurrences) / (# of word/word-pairs)

This works fairly well. It doesn’t have the wild jumps on up/down voting an item and articles I truly have no interest in continue to cluster at or near the bottom of the rankings.
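
To make the formula concrete, here is a minimal Python sketch of the idea (Amethyst itself is a Rails app backed by a database, so this is an illustration, not its code). The names `TermStats`, `terms`, `record`, and `score` are made up for the example, and two details are my assumptions: each distinct word or word pair is counted once per article, and a term that has never been seen contributes zero to the sum.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TermStats:
    """Counters kept per word or word pair (hypothetical layout)."""
    occurrences: int = 0
    clicks: int = 0
    ups: int = 0
    downs: int = 0


def terms(text):
    """Return the distinct words and adjacent word pairs (bi-grams) in text."""
    words = text.lower().split()
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return set(words) | set(pairs)


def record(stats, text, clicks=0, ups=0, downs=0):
    """Count one article's terms and credit its actions to each of them."""
    for term in terms(text):
        s = stats[term]
        s.occurrences += 1
        s.clicks += clicks
        s.ups += ups
        s.downs += downs


def score(stats, text):
    """Mean of (clicks + (ups - downs)/2) / occurrences over the article's terms."""
    article_terms = terms(text)
    if not article_terms:
        return 0.0
    total = 0.0
    for term in article_terms:
        s = stats.get(term)
        if s is None or s.occurrences == 0:
            continue  # assumption: unseen terms contribute nothing
        total += (s.clicks + (s.ups - s.downs) / 2) / s.occurrences
    return total / len(article_terms)


stats = defaultdict(TermStats)
record(stats, "Sponsored Link", downs=1)
record(stats, "Ruby Rails", clicks=1, ups=1)
print(score(stats, "Sponsored Link"), score(stats, "Ruby Rails"))
```

Because every term's contribution is divided by its occurrence count and then averaged over the whole article, a single vote nudges the score rather than flipping it, which fits the absence of wild jumps described above.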

Once there are several weeks of data, I will pick some stories from throughout the rankings, give them scores, and then try various curve-fitting methods to find a better ranking algorithm.
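
As a rough idea of what the simplest such fit could look like, the sketch below keeps the shape of the current formula but replaces the fixed 1/2 on votes with a weight `w` chosen by least squares against the scores given to the picked stories. This is only one possible approach; the function name `fit_vote_weight` and the sample layout are hypothetical, and the actual fitting methods are still to be decided.

```python
def fit_vote_weight(samples):
    """Least-squares vote weight w for: target ~ click_part + w * vote_part.

    Each sample is (click_part, vote_part, target), where click_part is the
    article's mean clicks/occurrences, vote_part is its mean
    (ups - downs)/occurrences, and target is the score given to the story.
    """
    # Minimising sum((c + w*v - t)^2) over w gives w = sum(v*(t - c)) / sum(v*v).
    numerator = sum(v * (t - c) for c, v, t in samples)
    denominator = sum(v * v for _, v, _ in samples)
    if denominator == 0:
        raise ValueError("no vote data to fit against")
    return numerator / denominator


# Made-up samples that happen to be consistent with a weight of 0.5.
samples = [(1.0, 2.0, 2.0), (0.0, -1.0, -0.5), (0.5, 0.0, 0.5)]
print(fit_vote_weight(samples))
```

A fitted weight far from 0.5 would suggest that votes deserve more or less influence relative to clicks than the current formula gives them.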

Duty Cycle isn’t Panacea

Tuesday, July 22nd, 2008

I wrote Duty Cycle to throttle back CPU-intensive programs that push my laptop’s temperature beyond what I am comfortable with. It works fine for kernel builds and most other things I have tried it on.

Lately I’ve been making some changes to Amethyst, a Ruby on Rails app, that require significant changes to the database: changing the primary key of some tables, merging all uppercase/lowercase versions of a word into a single record, etc. Guess what: some of the tables have a third of a million records, and the changes take time and CPU power, i.e., the laptop heats up. So I killed the conversion program and restarted it with Duty Cycle’s default 50% duty cycle. The CPU usage dropped from 99% to 98%! Huh? Oh: the conversion program is being throttled by Duty Cycle, but it’s mostly making calls to the MySQL database server, which is doing most of the CPU-intensive work. Cutting the duty cycle back to 5% still loads the system significantly, but the temperature stays under 70°C.
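
For readers unfamiliar with the idea, one common way to duty-cycle a process on a POSIX system is to alternate SIGSTOP and SIGCONT. The Python sketch below illustrates that general technique only; it is not a description of how Duty Cycle itself is implemented, and the names `phase_lengths` and `throttle` are invented for the example.

```python
import os
import signal
import time


def phase_lengths(period, duty):
    """Split one period (seconds) into (run, pause) for a duty fraction in (0, 1]."""
    if not 0 < duty <= 1:
        raise ValueError("duty must be in (0, 1]")
    run = period * duty
    return run, period - run


def throttle(pid, period=1.0, duty=0.5, cycles=10):
    """Alternately resume and stop ONE process; other processes are untouched."""
    run, pause = phase_lengths(period, duty)
    try:
        for _ in range(cycles):
            os.kill(pid, signal.SIGCONT)  # let the target run
            time.sleep(run)
            os.kill(pid, signal.SIGSTOP)  # freeze the target
            time.sleep(pause)
    finally:
        os.kill(pid, signal.SIGCONT)  # never leave the target stopped


print(phase_lengths(1.0, 0.5))
```

The sketch makes the limitation visible: only the process named by `pid` is ever stopped. A separate server process doing work on its behalf keeps running, and is slowed only indirectly because the stopped client sends it requests less often.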