Archive for February, 2007

TiddlyWikis and GTD

Saturday, February 24th, 2007

David Allen wrote a very useful book called “Getting Things Done” (GTD). I am in the midst of implementing his system.

Jeremy Ruston wrote TiddlyWiki, a Wiki in a single file of HTML, CSS, and JavaScript. No server needed. You just point the browser at the file and it saves entries in the file itself. This is so brilliant that I don’t even ask myself, “Why didn’t I think of that?”

Several bright folk have taken the two concepts and run with them: a single-file Wiki that supports most of the bookkeeping features of GTD, such as action lists, projects, contexts, and calendars.

  • TiddlyWiki – the original. Great for brainstorming, assembling ideas and research on a topic. Or whatever else you use a Wiki for. Currently at version 2.1.3.
  • GTD TiddlyWiki – an early fork of TiddlyWiki for GTD.
  • GTD TiddlyWiki Plus – an adaptation of TiddlyWiki that claims to track the original so you can benefit from updates.
  • Monkey GTD – biggest feature/difference is the dashboard, a synopsis of the state of your projects, lists, etc. See also http://wiki.43folders.com/index.php/Monkey_GTD.
  • d3 – a “kinkless” GTD system. A somewhat more graceful and attractive variant. Like GTD TiddlyWiki Plus, an adaptation rather than a fork of TiddlyWiki, so you can update easily. Or perhaps more accurately described as a packaging of TiddlyWiki 2.1.3 and version 1.1.0 of the GTD plugins by Tom Otvos, which means you can install other TiddlyWiki plugins. My current favorite.

See also GTD Wannabe Reference Pages for some discussion/comparison of the last two.  TiddlyWikis fit on a USB flash drive so you can go really lightweight.

You and Your Research – Richard Hamming

Monday, February 19th, 2007

Richard Hamming gave a fascinating talk at Bell Labs many years ago on his observations about what leads scientists to do great work. There are many great ideas here, several of which I have stumbled upon too. Probably the best is seeing defects as opportunities for breakthroughs instead of just accepting them as problems to be worked around.

Amethyst and DSPAM

Monday, February 19th, 2007

As mentioned in other posts, I’m finding that e-mail spam filtering and the Amethyst adaptive RSS sorting have differing configuration needs. I can work around some with profiles, but not all parameters can be in a profile. DSPAM is really aimed at e-mail, and not all of its design decisions are appropriate for my RSS reader.

I’ve decided to fork DSPAM: one stock instance for e-mail and a hacked version for Amethyst. I could use another library or even write my own, but there is just too much code that works fine (e.g. tokenization). Many of the necessary changes can be accommodated with another configuration. The configuration file name appears to be hardcoded at library configure/build time, so just building a library instance with a different configuration file takes care of a lot of the problem. It should be fairly easy to cut out the e-mail-specific parts, like the special treatment of e-mail headers. Another possibility is to return scores from several different scoring algorithms with one call. This makes evaluating scoring algorithms simpler: just change some Ruby on Rails code, which is loaded on every invocation, instead of changing the DSPAM code and rebuilding. It makes some sense to allow sorting by scoring algorithm, so no code changes are needed. If one algorithm turns out to have much better results, dump the others.

A reason for trying other scoring algorithms is the realization that spam filtering is basically a binary decision: is it or isn’t it spam? I am looking for something more granular: how interesting is this item likely to be, on a scale of 1 to 100 or something equivalent?
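
As a rough illustration of returning several scores from one call, here is a minimal Python sketch. It is not DSPAM's API and not the Amethyst code: the function names, the `ALGORITHMS` table, and the mapping onto a 1 to 100 scale are all assumptions made for the example. The two combining rules are the well-known Graham-style and Robinson-style formulas, applied to a non-empty list of per-token probabilities that lie strictly between 0 and 1.

```python
import math

def graham(probs):
    """Graham-style naive Bayes combination of per-token probabilities."""
    p = math.prod(probs)
    q = math.prod(1 - x for x in probs)
    return p / (p + q)

def robinson(probs):
    """Robinson-style geometric-mean combination of per-token probabilities."""
    n = len(probs)
    p = 1 - math.prod(1 - x for x in probs) ** (1 / n)
    q = 1 - math.prod(probs) ** (1 / n)
    return (1 + (p - q) / (p + q)) / 2

# Hypothetical registry: add or remove algorithms here without touching callers.
ALGORITHMS = {"graham": graham, "robinson": robinson}

def score_all(probs):
    """Run every registered algorithm in one call; map each result onto 1..100."""
    return {name: 1 + round(99 * fn(probs)) for name, fn in ALGORITHMS.items()}

# Assumed inputs: probability that each token marks an interesting item.
print(score_all([0.9, 0.8, 0.3]))
```

Because every algorithm's score comes back keyed by name, a display could sort by any one of them, and a poorly performing algorithm can be dropped by deleting its entry.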

DSPAM database reset

Monday, February 19th, 2007

Looking at the results of the e-mail filtering so far and thinking about the consequences of training not working as expected, I’ve decided to dump the database contents and start over. Since little or no training has happened, the contents are essentially noise.

This also allows me to use a better database schema. DSPAM uses the CRC64 hash to identify a token instead of the token contents. Earlier versions of MySQL don’t have a 64-bit integer type, so a string with the CRC64 value is used. A real 64-bit integer type uses less space and probably compares faster. The DSPAM database gets fairly big (more than 1 million rows, even if purging after 45 days), so on a laptop with only a 30GB disk, this is helpful.
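
To give a feel for the space difference, here is a small Python sketch comparing the raw payload of a 64-bit value stored as decimal text versus as a packed integer. The value is a stand-in, not a real CRC64, and actual on-disk size in MySQL also depends on column type, indexes, and storage engine, so treat this only as a rough comparison.

```python
import struct

# Stand-in for a CRC64 token hash: the largest unsigned 64-bit value.
token_hash = 0xFFFFFFFFFFFFFFFF

as_text = str(token_hash)                   # decimal string form
as_binary = struct.pack("<Q", token_hash)   # unsigned 64-bit integer, 8 bytes

print(len(as_text), len(as_binary))         # 20 8
```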

DSPAM Problem Solved

Saturday, February 17th, 2007

The investment e-mail newsletter came out again, and again it was classified as spam. I dug into it further. DSPAM does have some debugging tools. I picked the domain name it comes from and looked at its stats: 5 spam, 1 innocent. I attempted to retrain DSPAM that it was ham (non-spam). No change in the stats.

I looked at the man pages, the Web site, and the Wiki site to no avail. Nothing jumped out at me. I did find some holes in the documentation. I am using DSPAM for both e-mail and Amethyst, my adaptive RSS reader. There are some conflicts in configuration, some of which I can work around with profiles. Several of the man pages don’t mention the “--profile=” option, though they will accept and obey it.

I kept digging and finally noticed that the example in the README.txt file uses a different order of the options. I tried it and voila, the stats changed. This is the line that works:

dspam --profile=Email --user 'jeff' --class=innocent --source=error

Nothing in the documentation mentions that option order is significant. The wrong order silently fails. This looks like a bug to me, so I’ll submit it to the DSPAM author.

DSPAM Update II

Tuesday, February 13th, 2007

It has been a month since I installed DSPAM. At this point very little spam is getting through to my inboxes. However, too much innocent mail is ending up in the slop bucket with the real spam. It is interesting to look in the headers where DSPAM dumps the guilty words. Sending from Windows is spammy (e.g., “charset=windows” is over 99% spammy). Investment is spammy (there is a rash of pump and dump stock scams currently).

There are some troubling items. The investment newsletter author’s name is 99% spammy as is the Website where it is posted. Either the ham untraining is not working or he has some unsavory business partners.

DSPAM does have an auto whitelist feature, but it is not clear if a sender can be whitelisted after being tagged as a spammer, even if the hit is later reversed.

Amethyst

Wednesday, February 7th, 2007

Amethyst is my re-working of Amphetadesk, an RSS reader, in Ruby on Rails. The two major changes are aging of scores and sorting items by relevance.

Amphetadesk ranked feeds by the absolute number of click-throughs. If a feed was really compelling and then the author dropped off the face of the earth, converted to radical fruitarianism, or simply ceased to be as interesting, the feed stayed high ranking for a long time afterwards. So every night, a job runs that multiplies the channel scores by 2^(-1/7) (about 0.906), effectively halving the value of a click with every passing week. Feeds rise and fall as I show interest in them. There is a certain amount of self-fulfilling prophecy: I am more likely to see and click through an item near the top, so the channel stays near the top. However, a little discipline to scroll down part or all of the way several times a week generally keeps some balance.
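
The aging step can be sketched in a few lines of Python. This is an illustration, not the Amethyst code (which is Ruby on Rails): the function and variable names are made up, and the daily factor of 2^(-1/7) is simply what a weekly halving implies when the job runs once a night.

```python
# Daily multiplier implied by "halve every week": seven applications give about 1/2.
DAILY_DECAY = 2 ** (-1 / 7)

def age_scores(channel_scores):
    """Return a new dict with every channel score decayed by one day."""
    return {name: score * DAILY_DECAY for name, score in channel_scores.items()}

# Hypothetical channel scores, aged through one week of nightly runs.
scores = {"feed_a": 8.0, "feed_b": 1.0}
for _ in range(7):
    scores = age_scores(scores)

print(scores)
```

After the seven simulated nights, each score is roughly half of where it started, so a feed only stays near the top if new clicks keep arriving.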

A problem is that some feeds have lots of low-interest items and rise to the top on sheer number of items. There are other feeds that have only a few posts, but interesting ones. My current solution is to value a click-through at 1+1/log2(N), where N is the number of items currently in the feed. Log2(N) is approximated by the number of significant bits, e.g. 5 has 3 significant bits, 64 has 7 significant bits. A click-through on an item in a feed with 5 items is valued at 1.33; with 64 items, 1.14. This helps but isn’t aggressive enough. This is partly because it is based on how many items are in a feed at once, not on how many there are in a given period of time.
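
The click weighting can be sketched in Python as follows. The function name is hypothetical, and using `int.bit_length()` to count significant bits is an assumption about how the approximation might be done, not a copy of the Amethyst code.

```python
def click_value(items_in_feed):
    """Value of one click-through: 1 + 1/bits, where bits stands in for log2(N)."""
    if items_in_feed < 1:
        raise ValueError("a feed needs at least one item")
    bits = items_in_feed.bit_length()   # e.g. 5 is 0b101, so 3 significant bits
    return 1 + 1 / bits

print(round(click_value(5), 2))   # 1.33
```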

Besides the default display of channels sorted by clicks, there is a display of items sorted by interest. How to measure interest and have it track my interests and adapt as those interests change? Sounds like the problem of classifying spam. So I am using Bayesian filtering, specifically the DSPAM libraries. This works somewhat. However, spam/ham is a binary decision. I want a ranking. And false positives are not a killer. There are some parameters the library lets me tweak. The first is to turn off the doubling of the value of negatives (i.e., this word does not appear in spam). Also, I am not interested in how the items are formatted, so I strip out the HTML, CSS, and JavaScript before evaluating an item. I also score just the middle, to avoid headers and footers with site info, links, etc. (This is surprisingly hard to automate.)

So far, the interest scores are near or at the maximum, or near or at the minimum. There aren’t many items in the middle. So interesting items score nearly the same and uninteresting items score nearly the same. Not much time savings: I still have to wade through dozens or hundreds of titles looking for the interesting things. And I find a significant number of interesting items near the bottom.

My next tweak is to change the main display to sort items within feeds in order of interest. Previously they were unsorted, which resulted in them being displayed oldest first.
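
A minimal Python sketch of that tweak, assuming each item carries a numeric interest score (the `interest` key, the feed names, and the data layout are all hypothetical):

```python
def sort_items_by_interest(feeds):
    """Return feeds in their existing order, each with items sorted by interest, highest first."""
    return {
        name: sorted(items, key=lambda item: item["interest"], reverse=True)
        for name, items in feeds.items()
    }

# Made-up data: items arrive oldest first, as in the unsorted display.
feeds = {
    "feed_a": [
        {"title": "oldest post", "interest": 12},
        {"title": "newer post", "interest": 87},
        {"title": "newest post", "interest": 40},
    ],
}

for item in sort_items_by_interest(feeds)["feed_a"]:
    print(item["interest"], item["title"])
```

The order of the feeds themselves is left alone, so the channel ranking by clicks still decides which feed appears first.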

USB and Nomadic Computing

Monday, February 5th, 2007

When I lived in a house, the computer stayed permanently connected to printers, UPSes, and PDA cradles. Now, the laptop comes and goes. If, after resuming from suspend to disk, I decide I need to print or back up the PDA, it is nice to be able to hotplug the device into the laptop. There are other benefits to USB, but hotplugging is very nice for computing on the go.