Archive for the ‘Amethyst’ Category

Fragment Caching Not Worth It

Tuesday, January 20th, 2009

After adding fragment caching to Amethyst, I found it did a couple of things I didn’t like. They would have been obvious had I thought more about it.  I like the “10 minutes ago” style of time stamps.  Makes it easy and fast to scan for recent or old articles.  Well, the 10 minutes ago gets cached and quickly becomes stale.  I can move it out of the cached fragment, but it looks out of place.

Article fragment caching only sped up display by a factor of 2 (i.e. 54 seconds for the full display dropped to under 30 seconds).  Paging looks really unnatural when pages are different sizes.  However, “endless paging” (appending pages as you scroll down to the end) works moderately well.  And works well with “10 minutes ago” style timestamps.

So I’ve dropped fragment caching.

Amethyst and Fragment Caching

Monday, December 29th, 2008

Amethyst quickly grows to be a CPU hog, even for a single user. I’m now working my way through speed ups. Database access has been tweaked a lot, yielding better than 10x speed ups in certain areas. (Bypassing ActiveRecord and JOIN tables for direct SQL and keeping the necessary data in the article to reconstruct the join on the fly. It works, I measured.) This was successful enough that rendering is now the bottleneck.

My first at fragment caching was by feed (each with up to 25 unread articles visible). The speed up was significant, a refresh was 3-4 times faster than the initial view. However, each channel is updated once an hour (1/12 of channels every 5 minutes). So the fragment cache quickly grows stale and is totally out of date in an hour.

The next stab was caching article fragments. Articles can persist for months and only need to be rendered once until some action is taken (click-thru, vote up/down, or hide, i.e. mark as seen but not read). The article fragment cache grows stale much slower. However, there are 12 times as many articles as channels. The speed up is less impressive, 2 times faster refreshes.

All in all, I’m sticking with the article fragment cache. (Note: all work has been done with the memory storage mechanism, essentially a hash.) I’ll be posting the details of how to cajole the 3 fragment caching mechanisms that don’t explicitly implement timestamp expiry to do it anyway (mem_cache already does). I tried posting the whole mess earlier and WordPress was barfing on permission problems. Hopefully it can handle smaller bites

Bug in FeedTools

Wednesday, September 10th, 2008

I’ve been chasing a bug in Amethyst for over a week. Sometimes accented characters cause problems, sometimes they don’t. MySQL will often complain about a duplicate key in a multi-row INSERT. The INSERT is correct, but the duplicate key doesn’t appear in any of INSERTs! With enough examples I figured out that they are all the prefix of the key from one of the INSERTS, up to the first accented character. But not all accented characters cause problems, though all accented characters on one feed do cause problems. I’ve been tracing the incoming data starting with the data on the network and working my way through the system.  In  chatting with Greg Foster of  the Consumers Union at the Lone Star Ruby Conference, the topic came up and he mentioned that in spite of various feeds proclaiming that they are UTF-8, sometimes they contain Latin-1 characters. He said he had a conversion routine in Ruby he’d send me. Sure enough, I looked closer and there they were! Depending on what I used to look at them, they might render as expected, or as a ‘\361′ sequence.

Later in the conference I became impatient to fix the problem and Googled “Latin1 conversion UTF8 Ruby”. At the top of the list was
How-to fix ruby’s FeedTools latin-1 parsing. There is a bug in FeedTools, it converts numeric HTML entities under 256 to Latin1 characters instead of UTF-8 characters.  The blog entry includes some code to monkey patch FeedTools to correct the problem. I dropped the code in, deleted the corrupt data, refreshed the feed, and voila, problem fixed.

I’m Greedy, Getting Ready wasn’t enough

Wednesday, August 20th, 2008

In the previous post I discussed how on the way to a planned re-implementation in C of some database access, achieved a nearly order of magnitude speed up by bypassing the Ruby on Rails ActiveRecord database access class and generated the SQL myself. Faster but still some performance problems. Benchmarking, profiling, and online research turned up hints that relational database joins are powerful, but slow. So I spent some time thinking about how to speed it up. There are several approaches, some I found in DSPAM, some I found online, and some that came to me in the middle of the night.

The associations are created (and destroyed) all at once. Rather than store them as many records in the database, how about one record, or even a field in the article record/row itself. DSPAM does this, and it is fast enough. Doing it in pure Ruby/Rails would involve marshaling an array or hash of string and integer pairs. Since Ruby allows the elements of an array (or hash) to not all be the same type, type information is required for each value. This could be slow. In this case all elements are the same, and the information is already in the item record, the article description itself. Can Rails/Ruby regenerate the information on the fly faster than it can read it from disk? The answer is yes. If I think outside the database and the usual way of doing things.

Associations/join table typically contain the record ID (the primary key by convention/default in Rails) of both sides of the association. But what if instead of using the ID as the lookup key, the word/token string itself is the lookup key. In MySQL, my benchmarking found that string lookup is approximately 10% slower than integer key lookup (both with indexes). With string lookup, the join table can be discarded completely.

When optimizing it must be kept in mind what is the expensive/scarce resource(s). For many projects, it is the programmer’s time (i.e, programmer salary and overhead, or time to market is the scarce resource). However, user’s time/patience must be kept in mind too. Moving away from DSPAM (in C) to a pure Rails and ActiveRecord solution put the latter on the critical path.

I have made the simplifying assumption that as few database calls as possible is a plausible way to go, keeping in mind that large joins are also expensive. With the string indexed lookup I eliminated the join table. By accumulating all changes and rolling all inserts into one SQL INSERT statement, all updates into one SQL UPDATE statement, I gained almost as much speedup as the first optimization. The combination is around 16 to 50 times, depending whether you measure elapsed time or CPU (times in seconds):

user system total real
AR 7.280000 0.230000 7.510000 ( 11.215389)
SQL 0.800000 0.140000 0.940000 ( 3.713782)
SQL2 0.150000 0.000000 0.150000 ( 0.686940)

Better than I expected. Maybe it’s time to stop bit-twiddling and get back to adding features.

When Optimization is No Longer Premature

Wednesday, August 13th, 2008

Pre-mature optimization is the root of all programming evil.

– Donald E. Knuth

The latest change to Amethyst, dropping DSPAM and using a simple frequency of use scoring, is working well. Except it is slow. I’ve tweaked the database access from Ruby on Rails about as much as is feasible (mostly combining multiple updates into a single SQL statement, e.g. UPDATE tokens SET occurrences = occurrences + 1 WHERE id IN (1,2, 3, 100, 1001, 1234)). By dropping down from ActiveRecord instances to SQL and pre-computing in the background where straightforward I’ve speeded up the Web/user interface to where the response is acceptable and the delays are where expected (i.e., searches but not click-throughs, up and down votes, etc.).

However the fetches of the RSS feeds and updates of the database take a long time. When my laptop has been off or off-line for several hours getting caught up is a problem. With DSPAM (written in C), it was possible to update in one big gulp. I’ve spread the update over four cron job invocations and it is still a problem. With all the code in Ruby, updates can take several minutes, frequently more than the five minute interval between cron jobs that do the updates. The result is multiple entries for the same blog story. I’ve add the use of a lockfile to prevent multiple instances of the cron job, but there are still occasional conflicts and unexplained errors.  And I like to not wait 20 minutes for the updates to complete.

Since I know from experience with DSPAM that dropping into C will speed it up enough and the current slowness is a problem, it seems to me that optimization is no longer pre-mature. And the place to start seems to be the DSPAM database access code without the scoring code.

Updates as details emerge. The DSPAM code is a bit of a slog, it has more options than you can shake a stick at.

lockfile

Thursday, August 7th, 2008

The change from DSPAM to a home-grown solution for ranking the RSS feed articles in Amethyst has generally been a change for the better. In spite of several bugs that skewed the statistics, it generally behaves as desired. Except it is much more CPU intensive. DSPAM is written in C and eats a fair amout of CPU time. Amethyst is entirely in Ruby and Ruby on Rails. When the laptop has been off for several hours, it really eats up the CPU cycles when up and reconnected to the Internet. So much so that I’ve had to spread the catchup over 20 minutes instead of letting it do it all in one go. The reason is that the RSS feed refreshes run as a cronjob and don’t always complete in the 5 minutes between invocations. So there are multiple copies trying to update the database at once, leading to duplicates. Not good.

So I did a little searching around. There is a Ruby Gem, lockfile, for this and similar problems. It creates a a file, the lockfile, and runs a program if the lockfile doesn’t already exist. It deletes the lockfile when the program completes. If the lockfile already exists, it can retry for an period of time or a number of retries. It is NFS filesystem safe and has rudimentary stale lockfile detection (based on the age of the lockfile).

So far it has blocked just one invocation of the RSS feed refresh cronjob and I’m not seeing the duplicate key errors I had been. So far, so good.

Amethyst dumps DSPAM

Wednesday, July 30th, 2008

DSPAM does an excellent job of filtering spam out of my e-mail. I’ve been trying for two years to tweak it to do a good job of adaptive ranking of articles in RSS feeds. It hasn’t worked and now I’m trying a home-grown solution.

DSPAM does Bayesian classification (among several algorithms) and is tweaked for spam filtering. Part of the problem is it does classification, a yes/no decision. I need ranking, this is more interesting than that. Basic mismatch. And it has been optimized for e-mail. It recognizes e-mail headers and bodies and treats them differently. Not needed and even detrimental. The result was wild jumps in rankings of articles and occasional strange result, e.g., “Sponsored Link” articles had sunk to the bottom of the heap where I wanted them and stayed there for months. Suddenly they were scattered all through the rankings and while I could downgrade individual items, new “Sponsored Link” articles continue to show up all over.

The new algorithm uses several ideas from DSPAM, bi-grams (word pairs as well as individual words) and the basic database structure (an article has many words/word pairs). Rather than decide ahead of time what makes a good scoring algorithm, the database stores all actions on an article and all word/word-pairs. The actions are:

  • clicks thru
  • votes up – I am more interested in this article than it’s current ranking
  • votes down – I am less interested in this article than it’s current ranking
  • hide – stop showing this article (e.g. duplicates)
  • expires – article fell off RSS feed without ever being read.

Each work/word-pair also records how many times it has occurred. The current scoring algorithm is:

sum((click + (ups – downs)/2)/occurrences)
# of word/word-pairs
This works fairly well. It doesn’t have the wild jumps on up/down voting an item and articles I truly have no interest in continue to cluster at or near the bottom of the rankings.

After there is several weeks data, I will pick some stories throughout the rankings give them scores and then trying various curve fitting methods to find a better ranking algorithm.

Duty Cycle isn’t Panacea

Tuesday, July 22nd, 2008

I wrote Duty Cycle to throttle back CPU intensive programs that boost my laptop’s temperatures beyond what I was comfortable with.  It works fine for kernel builds and most other things I tried it on.

Lately I’ve been making some changes to Amethyst, a Ruby on Rails app, that require significant changes to the database — changing the primary key of some tables, merging all uppercase/lowercase versions of a word into a single record, etc.  Guess what, some of the table have 1/3 million records and the changes take time and CPU power, i.e., the laptop heats up.  So I killed the conversion program, and restarted it with Duty Cycle’s default 50% duty cycle.  The CPU usage drops from 99% to 98%!  Huh!  Oh, the conversion program is being throttled by Duty Cycle, but it’s mostly making calls to the MySQL database server which is doing most of the CPU intensive work.  Cutting the duty cycle back to 5% still loads the system significantly, but the temperature stays under 70°C.

Mea Cupla, Bad Engineer – No Donut!

Monday, July 2nd, 2007

I ought to know better by now. Measure, then cut. I have been working on Amethyst, an adaptive RSS reader using Bayesian classification to (try to) train a program to show me news in most interesting to least interesting order. I have tweaked and fiddled, etc. Last night I coerced DSPAM into coughing up the most/least interesting phrases that it is using to classify. For RSS feeds with a lot of low ranking stories, the significant words and phrases are bits of HTML tags, font names, etc. I.e., it is the formatting that is being graded, not the content. Extracting the content from the HTML is a non-trivial problem. I think just a list of words to ignore may be an easier solution.

If I had looked at the actual data, I would have saved myself a lot of wasted effort.

Amethyst and DSPAM

Monday, February 19th, 2007

As mentioned in other posts, I’m finding that e-mail spam filtering and the Amethyst adaptive RSS sorting have differing configuration needs. I can work around some with profiles, but not all parameters can be in a profile. DSPAM is really aimed at e-mail and not all design decisions appropriate for my RSS reader.

I’ve decided to fork DSPAM, one stock instance for e-mail and a hacked version for Amethyst. I could use another library or even write my own, but there is just too much code that works fine (e.g. tokenization). Much of the necessary changes can accomodated with another configuration. The configuration file name appears to be hardcoded at the library configure/build time, so just making a library instance with a different configure file takes care of a lots of the problem. It should be fairly easy to cut out the e-mail specific parts like the special treatment of e-mail headers. Another possibility is to return scores from several different scoring algorithms with one call. This makes evaluating scoring algorithms simpler, just change some Ruby on Rails code which is loaded on every invocation instead of changing the DSPAM code and rebuilding. It makes some sense to allow sorting by scoring algorithm, so no code changes are needed. If one algorithm turns out to have much better results, dump the others.

A reason for trying other scoring algorithms is the realization that spam filtering is basically a binary decision, is it or isn’t it spam. I am looking for something more granular, how interesting is this item likely to be on a scale of 1 to 100 or something equivalent.