Amethyst and Fragment Caching

December 29th, 2008

Amethyst quickly grows to be a CPU hog, even for a single user. I’m now working my way through speed ups. Database access has been tweaked a lot, yielding better than 10x speed ups in certain areas. (Bypassing ActiveRecord and JOIN tables for direct SQL and keeping the necessary data in the article to reconstruct the join on the fly. It works, I measured.) This was successful enough that rendering is now the bottleneck.

My first stab at fragment caching was by feed (each with up to 25 unread articles visible). The speed up was significant: a refresh was 3-4 times faster than the initial view. However, each channel is updated once an hour (1/12 of the channels every 5 minutes), so the fragment cache quickly grows stale and is totally out of date within an hour.

The next stab was caching article fragments. Articles can persist for months and only need to be rendered once until some action is taken (click-thru, vote up/down, or hide, i.e. mark as seen but not read). The article fragment cache grows stale much more slowly. However, there are 12 times as many articles as channels, so the speed up is less impressive: refreshes are 2 times faster.

All in all, I’m sticking with the article fragment cache. (Note: all work has been done with the memory storage mechanism, essentially a hash.) I’ll be posting the details of how to cajole the 3 fragment caching mechanisms that don’t explicitly implement timestamp expiry to do it anyway (mem_cache already does). I tried posting the whole mess earlier and WordPress was barfing on permission problems. Hopefully it can handle smaller bites.
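As a rough illustration of timestamp expiry on top of a hash, here is a minimal sketch in plain Ruby. It is not the Rails memory store and not Amethyst's code: the class name ExpiringFragmentCache, the "article/42" key, and the one-hour limit are all assumptions made for the example.

```ruby
# Hypothetical stand-in for a memory fragment store: a Hash plus timestamps.
class ExpiringFragmentCache
  Entry = Struct.new(:html, :stored_at)

  def initialize(max_age_seconds)
    @max_age = max_age_seconds
    @entries = {}
  end

  # Return the cached fragment for key. The block renders the fragment
  # when it is missing or older than max_age_seconds.
  def fetch(key, now = Time.now)
    entry = @entries[key]
    if entry.nil? || now - entry.stored_at > @max_age
      entry = Entry.new(yield, now)
      @entries[key] = entry
    end
    entry.html
  end

  # Drop a fragment after an action (click-thru, vote, hide) changes it.
  def expire(key)
    @entries.delete(key)
  end
end

cache = ExpiringFragmentCache.new(3600) # assumed one-hour limit
renders = 0
2.times { cache.fetch("article/42") { renders += 1; "<li>Article 42</li>" } }
puts renders # → 1, the second fetch is served from the cache
```

Keying by article rather than by feed means an action on one article expires only that article's fragment.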

Testing Rails Apps and Off-line Indexing Search Engines

December 9th, 2008

For a variety of technical reasons, most of the full-text search engines available for Ruby on Rails do off-line indexing. (Changes to the indexed tables are added to a queue that is processed in a cronjob, i.e., changes do not show up immediately in indexes). Examples of off-line indexing search engines are Xapian, Sphinx, and Hyper-Estraier. I think all three retrieve records for indexing directly from the database. This causes problems in testing.

To speed up testing, Rails does not commit any changes to the database made from individual tests. The lot is discarded in a rollback of a transaction started at the beginning of each test. This is fast, but programs outside the Rails stack do not see the changes, even after I jumped through hoops to run the index update program within a test.

Loading the fixtures, indexing them, and then running the tests works if fixtures are all the tests search for. In MySQL, the statement “SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED” will work around the limitation, but it’s non-portable and I’d have to maintain some hacked third-party code. No thanks.

Conflict between assert{2.0} and ActiveSupport in Rails

November 16th, 2008

I really like the assert{2.0} gem for testing and a first cut at debugging my Rails code. And many of the mix-in methods in Rails’ ActiveSupport gem make programming easier and the application friendlier. However, while integrating Sphinx/Ultrasphinx, a full-text search plugin, I discovered a nasty conflict between assert{2.0} and ActiveSupport: both define the method in_groups_of. Googling found nothing, and posting on the rubyonrails-talk Google group turned up nothing, so I dug into the code.

It appears that the assert{2.0} code is a subset of the functionality in the ActiveSupport code, so I just commented out the assert{2.0} version. So far, everything seems to work okay.
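To show how two definitions of the same method name can collide, here is a small sketch. Both modules are hypothetical stand-ins written for this example, not the source of either library, and the List class exists only to avoid patching Array itself. The padding behaviour of the fuller version is modelled loosely on ActiveSupport's in_groups_of.

```ruby
# Hypothetical stand-in for the fuller implementation.
module FullVersion
  # Split into groups of `size`, padding the last group with `fill_with`.
  def in_groups_of(size, fill_with = nil)
    each_slice(size).map do |group|
      group + [fill_with] * (size - group.length)
    end
  end
end

# Hypothetical stand-in for the subset implementation: same name, no padding.
module SubsetVersion
  def in_groups_of(size)
    each_slice(size).to_a
  end
end

class List < Array
  include FullVersion
  include SubsetVersion # included last, so method lookup finds it first
end

list = List.new([1, 2, 3, 4, 5])
p list.in_groups_of(2) # → [[1, 2], [3, 4], [5]] (no nil padding)

begin
  list.in_groups_of(2, 0) # callers expecting the fuller signature break
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```

Removing the later, smaller definition leaves the fuller one in charge, which is the effect of commenting out the duplicate.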

Swimming without Shorts

November 16th, 2008

Warren Buffett is reputed to have said, “You don’t know who’s swimming naked until the tide goes out.” There is a style of hedge fund called “long-short”. Being “long” a stock means betting for it, usually by owning it. Being “short” a stock means betting against it. The idea is that the fund makes money on both the ups and downs. Done well, the returns are much more consistent. In wild bull markets the fund makes less than the market as a whole. In a downturn it makes more (usually by losing less).

In the recent downturn, several “long-short” funds that had been doing well were revealed to have really been “long-only”. Or to paraphrase Buffett’s folksy metaphor, “You don’t know who’s swimming without shorts until the tide goes out.”

Silly Money – Part 1

November 10th, 2008

The Big Picture, Barry Ritholtz’s blog, has a wonderful comedy video out of England entitled Silly Money. It is funny and truer than 99% of the “serious” reporting on the credit crisis. Be warned, it’s 48 minutes long and you won’t watch “just the first little bit”. My wife and I watched the whole thing, past when we were supposed to be doing something else.

This link claims to be part one. We will be waiting for part two.

Financial Leverage – the Basics

November 6th, 2008

As a professional programmer I expect to make money doing what I do. At first I just threw the leftovers in the corner of my 401(k)/IRA and went on programming. In time it became big enough that that felt irresponsible. So I got serious about putting it to work making more money. In the Western world almost all ways of using money to make money involve leverage (see the first comment for a different way). Leverage has been in the news more than a bit lately and I am trying to educate myself about what it is and what it means for my savings. This and subsequent posts are thinking out loud about leverage in order to understand it. You are welcome to listen in.

The basics are simple

  • raise some seed money
  • borrow additional money at a low rate
  • invest or loan it at a higher rate
  • pocket the difference

Leverage is the ratio of the borrowed money to the seed money. The total return on the seed money is the investment’s rate plus the leverage ratio times the difference between the two interest rates. More leverage equals higher return, as long as the investment rate stays above the borrowing rate.

The financial leverage idea is common to a lot of different institutions: retail banks, investment banks, stock brokerages, mutual funds, etc. However, the names of things can differ. To be consistent, I’ll call the seed money capital, the borrowed money the liabilities, and the investment or loan the asset. They must all balance out: assets minus liabilities equals capital.

As they say, show me the money. A group of ten friends pool their money, $10K each = $100K. The local bank pays 3.5% on a $10K 12 month Certificate of Deposit (CD) but 4.0% on a $100K 12 month CD. The balance sheet for this unleveraged investment is:

Assets: $100K 12 month CD
Liabilities: -$0K
Capital: $100K seed money

At year’s end the Profit and Loss (P&L sheet) is:

Income: $4K  $100K for 1 year at 4%
Expenses: -$0K
Profit: $4K

The profit can be distributed (i.e., dividends), kept for another round of investing (retained earnings), or both capital and profits handed out to everyone and the deal is done (liquidation). By pooling their capital they have made more (4%) than they could separately (3.5%). This is basic capitalism, using money (capital) to make money instead of using work (labor) to make money. If you have a savings account, CD, an IRA or 401(k), own stock, bonds, or a mutual fund, you are at least part capitalist.

Now let’s look at what the bank does with the money from the CD. The World’s Smallest Bank (WSB) was founded by ten investors who each put in $10K. (Beware: this example is decidedly not realistic.) With this equity in hand, they sold nine $100K 12 month CDs that paid 4%. They took the resulting $1,000,000 ($100K of equity plus $900K from the CDs) and made an interest-only home loan at 6% to J. Big Bucks for a McMansion on the lake. The balance sheet is (dollar amounts in balance sheets are typically in thousands):

Assets: $1,000  the McMansion
Liabilities: $900  the nine $100K 12 month CDs
Capital: $100  seed money

At the end of the first year, J. Big makes an interest payment of $60K. The P&L sheet is:

Income: $60  6% of $1M
Expenses: $36  9 $100K CDs at 4%
Profit: $24

The year-end balance sheet is:

Assets: $1,000  the McMansion
$24  profit, i.e., cash
Liabilities: $900   nine $100K 12 month CDs
Capital: $124 seed money + profit

The investors’ initial $100K has grown in one year to $124K, a return of 24%. With a leverage ratio of 9 to 1, the investors turned the 2% difference between the CDs and the home loan into a return 12 times as large: the 6% loan rate plus 9 times the 2% spread. Nice money if you can get it. A more realistic example would include employee salaries, office rent, office supplies, utilities, and other expenses.
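The arithmetic of both examples can be checked with a few lines of Ruby. This is only a sketch: the method names are made up for illustration, amounts are in thousands of dollars as in the balance sheets, and rates are Rationals so the results are exact.

```ruby
# Return on capital from the balance sheet and the two rates.
def return_on_capital(capital:, liabilities:, asset_rate:, borrow_rate:)
  assets   = capital + liabilities     # assets minus liabilities equals capital
  income   = assets * asset_rate       # interest earned on the asset
  expenses = liabilities * borrow_rate # interest paid on the borrowed money
  (income - expenses) / capital
end

# The same return via the rule of thumb:
# investment rate + leverage * (investment rate - borrowing rate).
def leveraged_return(leverage:, asset_rate:, borrow_rate:)
  asset_rate + leverage * (asset_rate - borrow_rate)
end

loan_rate = Rational(6, 100)
cd_rate   = Rational(4, 100)

# The ten friends: no borrowing, $100K in a 4% CD.
friends = return_on_capital(capital: 100, liabilities: 0,
                            asset_rate: cd_rate, borrow_rate: cd_rate)

# The World's Smallest Bank: $100K capital, $900K of CDs, a 6% loan.
wsb = return_on_capital(capital: 100, liabilities: 900,
                        asset_rate: loan_rate, borrow_rate: cd_rate)

puts format("%.1f%%", (friends * 100).to_f) # → 4.0%
puts format("%.1f%%", (wsb * 100).to_f)     # → 24.0%

# Both routes give the same answer at 9 to 1 leverage.
puts wsb == leveraged_return(leverage: 9, asset_rate: loan_rate, borrow_rate: cd_rate) # → true
```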

More leverage will boost the return even more, but leverage at most financial institutions is capped by law in the 10–15 to 1 range. The upsides and downsides of leverage will be covered in posts about the good, the bad, and the ugly. To get an idea of what the ugly looks like, in 2004 five of the largest investment companies (Lehman, Goldman Sachs, Morgan Stanley, Merrill Lynch, and Bear Stearns) convinced the Securities and Exchange Commission (SEC) that they were big enough and sophisticated enough to handle more leverage. In exchange for more oversight (which didn’t happen!) the leverage caps were removed. In the last year, Bear Stearns and Merrill Lynch were taken over by JP Morgan and Bank of America, respectively, to keep them from crashing in public. Lehman is in bankruptcy. Goldman Sachs and Morgan Stanley may yet survive, though Goldman’s stock is down by half from a year ago and Morgan Stanley is down by better than two-thirds. Leverage multiplies the returns on the way up and the losses on the way down. Stay tuned for details.

Bug in FeedTools

September 10th, 2008

I’ve been chasing a bug in Amethyst for over a week. Sometimes accented characters cause problems, sometimes they don’t. MySQL will often complain about a duplicate key in a multi-row INSERT. The INSERT is correct, but the duplicate key doesn’t appear in any of the INSERTs! With enough examples I figured out that the reported keys are all the prefix of a key from one of the INSERTs, up to the first accented character. But not all accented characters cause problems, though all accented characters on one feed do cause problems. I’ve been tracing the incoming data, starting with the data on the network and working my way through the system. While I was chatting with Greg Foster of the Consumers Union at the Lone Star Ruby Conference, the topic came up and he mentioned that in spite of various feeds proclaiming that they are UTF-8, sometimes they contain Latin-1 characters. He said he had a conversion routine in Ruby he’d send me. Sure enough, I looked closer and there they were! Depending on what I used to look at them, they might render as expected, or as a ‘\361’ sequence.

Later in the conference I became impatient to fix the problem and Googled “Latin1 conversion UTF8 Ruby”. At the top of the list was “How-to fix ruby’s FeedTools latin-1 parsing”. There is a bug in FeedTools: it converts numeric HTML entities under 256 to Latin1 characters instead of UTF-8 characters. The blog entry includes some code to monkey patch FeedTools to correct the problem. I dropped the code in, deleted the corrupt data, refreshed the feed, and voila, problem fixed.
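For illustration only, here is a sketch of the two ideas involved: decoding numeric entities into UTF-8, and converting stray Latin-1 bytes. It is not the monkey patch from that blog entry and not FeedTools code. It assumes a Ruby with string encodings (1.9 or later), and both method names are made up for the example.

```ruby
# Decode numeric HTML entities into UTF-8 characters.
# Array#pack("U") emits the UTF-8 encoding of a code point.
def decode_numeric_entities(text)
  text.gsub(/&#(\d+);/) { [Regexp.last_match(1).to_i].pack("U") }
end

# Keep bytes that are already valid UTF-8. Otherwise assume they are
# Latin-1 (ISO-8859-1) and transcode them. That assumption is a guess
# and will be wrong for feeds in other encodings.
def to_utf8(bytes)
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

puts decode_numeric_entities("Espa&#241;a") # → España

# "\361" is the single Latin-1 byte for n-tilde (octal 361, decimal 241).
puts to_utf8("Espa\361a".b) # → España
```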

Less is More

September 2nd, 2008

I am preparing for a talk at the Lone Star Ruby Conference on Ruby and Rails on NetBeans, an IDE written in Java. The slot is only ten minutes long, so it is a bit of a squeeze. With practice and dropping all the “You knows…” and such I can shorten it, but I still have a 20 minute talk. Listening to a podcast last night by the Common Craft folk, I was reminded that it is more about what you leave out than what you leave in. It was reinforced this morning by a video on the Guardian’s site on New Orleans and Hurricane Gustav. There is no voice over, just the sound of the wind and pictures of trees, sofas, and people being pushed about by the wind. The last third is President Bush boarding a plane. Again no voice over, just the whine of an idling jet. Powerful.

I’m Greedy, Getting Ready wasn’t enough

August 20th, 2008

In the previous post I discussed how, on the way to a planned re-implementation in C of some database access, I achieved a nearly order of magnitude speed up by bypassing the Ruby on Rails ActiveRecord database access class and generating the SQL myself. That was faster, but there were still some performance problems. Benchmarking, profiling, and online research turned up hints that relational database joins are powerful, but slow. So I spent some time thinking about how to speed it up. There are several approaches, some I found in DSPAM, some I found online, and some that came to me in the middle of the night.

The associations are created (and destroyed) all at once. Rather than store them as many records in the database, how about one record, or even a field in the article record/row itself? DSPAM does this, and it is fast enough. Doing it in pure Ruby/Rails would involve marshaling an array or hash of string and integer pairs. Since Ruby allows the elements of an array (or hash) to not all be the same type, type information is required for each value. This could be slow. In this case all elements are the same type, and the information is already in the item record, the article description itself. Can Rails/Ruby regenerate the information on the fly faster than it can read it from disk? The answer is yes, if I think outside the database and the usual way of doing things.

Association/join tables typically contain the record ID (the primary key by convention/default in Rails) of both sides of the association. But what if, instead of using the ID as the lookup key, the word/token string itself is the lookup key? In MySQL, my benchmarking found that string lookup is approximately 10% slower than integer key lookup (both with indexes). With string lookup, the join table can be discarded completely.
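A rough sketch of the idea, with everything in it an assumption: the tokenizing rule, the TOKEN_SCORES hash (standing in for a table indexed by the token string), and the method names are invented for illustration and are not Amethyst's code.

```ruby
# Regenerate an article's tokens from its text instead of reading
# association rows. The tokenizing rule here is a guess.
def token_counts(text)
  text.downcase.scan(/[a-z0-9]+/).each_with_object(Hash.new(0)) do |token, counts|
    counts[token] += 1
  end
end

# Hypothetical stand-in for a table indexed by the token string itself.
TOKEN_SCORES = {
  "ruby"  => 0.9,
  "rails" => 0.8,
  "golf"  => 0.1
}.freeze

# Look each token up directly by its string: no join table, no record IDs.
def scores_for(text, table = TOKEN_SCORES)
  token_counts(text).keys.each_with_object({}) do |token, found|
    found[token] = table[token] if table.key?(token)
  end
end

p token_counts("Ruby on Rails: Ruby rocks")
p scores_for("Ruby on Rails: Ruby rocks")
```

The trade is a little CPU per article (re-tokenizing the text) in exchange for dropping the join rows and the queries that read them.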

When optimizing, it must be kept in mind which resources are expensive or scarce. For many projects, it is the programmer’s time (i.e., programmer salary and overhead, or time to market). However, the users’ time and patience must be kept in mind too. Moving away from DSPAM (in C) to a pure Rails and ActiveRecord solution put the latter on the critical path.

I have made the simplifying assumption that as few database calls as possible is a plausible way to go, keeping in mind that large joins are also expensive. With the string indexed lookup I eliminated the join table. By accumulating all changes and rolling all inserts into one SQL INSERT statement and all updates into one SQL UPDATE statement, I gained almost as much speedup as the first optimization. The combined speedup is around 16 to 50 times, depending on whether you measure elapsed time or CPU (times in seconds):

          user     system      total        real
AR    7.280000   0.230000   7.510000 ( 11.215389)
SQL   0.800000   0.140000   0.940000 (  3.713782)
SQL2  0.150000   0.000000   0.150000 (  0.686940)

Better than I expected. Maybe it’s time to stop bit-twiddling and get back to adding features.
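Rolling many inserts into one statement can be sketched as below. The table and column names are hypothetical, and the quote helper is a toy for the example: a real application should use the database adapter's own quoting, since doubling single quotes alone is not safe for untrusted input (MySQL also treats backslashes as escapes by default).

```ruby
# Toy quoting for this sketch only. Not safe for untrusted input.
def quote(value)
  case value
  when Integer then value.to_s
  when nil     then "NULL"
  else "'" + value.to_s.gsub("'", "''") + "'"
  end
end

# Build one multi-row INSERT from accumulated rows, so the database
# is called once instead of once per row.
def multi_row_insert(table, columns, rows)
  values = rows.map { |row| "(" + row.map { |v| quote(v) }.join(", ") + ")" }
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{values.join(', ')}"
end

pending = [["ruby", 1, 0], ["rails", 1, 0], ["o'reilly", 0, 1]]
puts multi_row_insert("tokens", %w[token good bad], pending)
# → INSERT INTO tokens (token, good, bad) VALUES ('ruby', 1, 0), ('rails', 1, 0), ('o''reilly', 0, 1)
```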

When Getting Ready is Enough

August 14th, 2008

Years ago I heard a tale, perhaps apocryphal, of a business consultant whose client wanted to go computerized. He knew that the computer salesman was overselling and would under-deliver, but nothing he could say changed their mind: they had to computerize their accounting. So he told them they needed to regularize their accounting, clean it up, and put standard procedures in place, “for the computer”. He had been telling them that for years, but now he had a lever. So for months there was a big push to clean up their accounts so they could go “computerized”. When everything was ready for the computer, he sat down with the company owner and showed him that they were already realizing all the benefits claimed for computerization without the costs of the computer.

As I noted in a previous post, my adaptive RSS reader, Amethyst, is running too slow. So in preparation for moving the CPU/time intensive operations into C, I centralized them in one class and started working out the raw SQL that the C code would need. It’s ugly but it works, it is approximately 7-10 times faster, has the potential for even more speed, and was done in less time than researching relational database access from C would have taken. Yes, I can probably get even more speed by going to C, but right now it isn’t worth the additional time. Plus the framework I developed for a realistic benchmark can also be used by the test code. Nice.