In the previous post I discussed how on the way to a planned re-implementation in C of some database access, achieved a nearly order of magnitude speed up by bypassing the Ruby on Rails ActiveRecord database access class and generated the SQL myself. Faster but still some performance problems. Benchmarking, profiling, and online research turned up hints that relational database joins are powerful, but slow. So I spent some time thinking about how to speed it up. There are several approaches, some I found in DSPAM, some I found online, and some that came to me in the middle of the night.
The associations are created (and destroyed) all at once. Rather than store them as many records in the database, how about one record, or even a field in the article record/row itself. DSPAM does this, and it is fast enough. Doing it in pure Ruby/Rails would involve marshaling an array or hash of string and integer pairs. Since Ruby allows the elements of an array (or hash) to not all be the same type, type information is required for each value. This could be slow. In this case all elements are the same, and the information is already in the item record, the article description itself. Can Rails/Ruby regenerate the information on the fly faster than it can read it from disk? The answer is yes. If I think outside the database and the usual way of doing things.
Associations/join table typically contain the record ID (the primary key by convention/default in Rails) of both sides of the association. But what if instead of using the ID as the lookup key, the word/token string itself is the lookup key. In MySQL, my benchmarking found that string lookup is approximately 10% slower than integer key lookup (both with indexes). With string lookup, the join table can be discarded completely.
When optimizing it must be kept in mind what is the expensive/scarce resource(s). For many projects, it is the programmer’s time (i.e, programmer salary and overhead, or time to market is the scarce resource). However, user’s time/patience must be kept in mind too. Moving away from DSPAM (in C) to a pure Rails and ActiveRecord solution put the latter on the critical path.
I have made the simplifying assumption that as few database calls as possible is a plausible way to go, keeping in mind that large joins are also expensive. With the string indexed lookup I eliminated the join table. By accumulating all changes and rolling all inserts into one SQL INSERT statement, all updates into one SQL UPDATE statement, I gained almost as much speedup as the first optimization. The combination is around 16 to 50 times, depending whether you measure elapsed time or CPU (times in seconds):
|
user |
system |
total |
real |
| AR |
7.280000 |
0.230000 |
7.510000 |
( 11.215389) |
| SQL |
0.800000 |
0.140000 |
0.940000 |
( 3.713782) |
| SQL2 |
0.150000 |
0.000000 |
0.150000 |
( 0.686940) |
Better than I expected. Maybe it’s time to stop bit-twiddling and get back to adding features.