Amethyst

Wednesday, February 7th, 2007

Amethyst is my re-working of Amphetadesk, an RSS reader, in Ruby on Rails. The two major changes are aging of scores and sorting items by relevance.

Amphetadesk ranked feeds by the absolute number of click-throughs. If a feed was really compelling and then the author dropped off the face of the earth, converted to radical fruitarianism, or simply ceased to be as interesting, the feed stayed high-ranking for a long time afterwards. So every night, a job runs that multiplies the channel scores by 2^(-1/7) (about 0.906), effectively halving the value of a click with every passing week. Feeds rise and fall as I show interest in them. There is a certain amount of self-fulfilling prophecy: I am more likely to see and click through an item near the top, so the channel stays near the top. However, a little discipline to scroll down part or all of the way several times a week generally keeps some balance.
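The nightly aging can be sketched in a few lines of plain Ruby. This is only an illustration, not the Amethyst code: the `Channel` struct, the `age_scores` method, and the `DAILY_DECAY` constant are hypothetical names, and a real Rails app would update its model records instead.

```ruby
# Daily multiplier chosen so that seven nightly runs halve a score.
DAILY_DECAY = 0.5**(1.0 / 7)

# Hypothetical stand-in for a channel record (assumption, not the real model).
Channel = Struct.new(:title, :score)

# Apply one night's worth of aging to every channel score.
def age_scores(channels)
  channels.each { |channel| channel.score *= DAILY_DECAY }
end

channels = [Channel.new("Example feed", 8.0)]
7.times { age_scores(channels) }   # simulate one week of nightly runs
puts channels.first.score.round(3) # roughly half of the starting 8.0
```

Because the decay is multiplicative, a feed that stops earning clicks fades smoothly rather than dropping off a cliff, and a missed nightly run only delays the aging by a day.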

A problem is that some feeds have lots of low-interest items and rise to the top on sheer number of items. There are other feeds that have only a few posts, but interesting ones. My current solution is to value a click-through at 1 + 1/log2(N), where N is the number of items currently in the feed. Log2(N) is approximated by the number of significant bits in N, e.g. 5 has 3 significant bits and 63 has 6. A click-through on an item in a feed with 5 items is valued at 1.33; with 63 items, about 1.17. This helps but isn’t aggressive enough, partly because it is based on how many items are in a feed at once, not on how many there are in a given period of time.
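A minimal Ruby sketch of that weighting, assuming the bit count is taken with `Integer#bit_length`. The method name `click_value` and the guard for an empty feed are my own assumptions, not taken from Amethyst.

```ruby
# Value of one click-through in a feed that currently holds item_count items.
# Integer#bit_length counts significant bits, standing in for log2(N).
def click_value(item_count)
  bits = [item_count.bit_length, 1].max # assumption: treat an empty feed like 1 bit
  1.0 + 1.0 / bits
end

puts click_value(5).round(2) # 5 is binary 101, so 3 bits
```

One side effect of counting bits instead of taking the exact logarithm is that a single-item feed gets a finite value of 2.0, whereas log2(1) = 0 would mean dividing by zero.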

Besides the default display of channels sorted by clicks, there is a display of items sorted by interest. How do I measure interest, have it track my interests, and adapt as those interests change? It sounds like the problem of classifying spam, so I am using Bayesian filtering, specifically the DSPAM libraries. This works somewhat. However, spam/ham is a binary decision; I want a ranking, and false positives are not a killer. There are some parameters the library lets me tweak. The first is to turn off the doubling of the value of negatives (i.e., this word does not appear in spam). Also, I am not interested in how the items are formatted, so I strip out the HTML, CSS, and JavaScript before evaluating an item. I also score just the middle of each item to avoid headers and footers with site info, links, etc. (This is surprisingly hard to automate.)
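The clean-up step before scoring might look roughly like the following Ruby sketch. It does not touch DSPAM at all, it only prepares the text. The method names `strip_markup` and `middle_words` and the `keep_fraction` parameter are hypothetical, and the regex-based tag stripping is a crude assumption; a real HTML parser would be more robust.

```ruby
# Crude regex-based cleanup (assumption: good enough for a sketch).
def strip_markup(html)
  text = html.gsub(%r{<(script|style)\b.*?</\1>}mi, " ") # drop JavaScript and CSS blocks
  text = text.gsub(/<[^>]+>/, " ")                        # drop the remaining tags
  text.split.join(" ")                                    # collapse runs of whitespace
end

# Keep only the middle of the word list; keep_fraction is a hypothetical knob.
def middle_words(text, keep_fraction = 0.5)
  words = text.split
  skip = ((words.length * (1 - keep_fraction)) / 2).floor
  words[skip, words.length - 2 * skip] || []
end

html = "<html><head><style>p { color: red }</style><script>var x = 1;</script></head>" \
       "<body><p>Site header</p><p>Ruby feeds are fun</p><p>Site footer</p></body></html>"
puts middle_words(strip_markup(html)).join(" ")
```

A fixed fraction is the simplest possible rule for "the middle"; it will sometimes trim real content or leave in site chrome, which fits the remark that this step is hard to automate.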

So far, the interest scores are near or at the maximum, or near or at the minimum. There aren’t many items in the middle, so interesting items score nearly the same and uninteresting items score nearly the same. There is not much time savings: I still have to wade through dozens or hundreds of titles looking for the interesting things. And I find a significant number of interesting items near the bottom.

My next tweak is to change the main display to sort items within feeds in order of interest. Previously they were unsorted, which resulted in them being displayed oldest first.
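As a rough illustration of that tweak, the Ruby sketch below sorts one feed's items by descending interest and keeps the existing oldest-first order among ties. The `Item` struct, its `interest` field, and the method name `sort_by_interest` are assumptions for the example, not the actual Amethyst schema.

```ruby
# Hypothetical item record: a title plus an interest score between 0 and 1.
Item = Struct.new(:title, :interest)

# Highest interest first; the original index breaks ties so that equally
# scored items stay in their existing (oldest first) order.
def sort_by_interest(items)
  items.each_with_index
       .sort_by { |item, index| [-item.interest, index] }
       .map(&:first)
end

items = [Item.new("old", 0.1), Item.new("mid", 0.9), Item.new("new", 0.9)]
puts sort_by_interest(items).map(&:title).join(", ")
```

The explicit index is there because Ruby's `sort_by` is not guaranteed to be stable, and with scores clustered at the extremes there would be many ties.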