The WegoWise Development Blog

A Memcached Trick and an Old Watchdog


This article describes some code we developed to fix an issue with our Memcached environment. The fix took the form of adding a method to a class that we didn't own, so the article also describes how we used the Watchdog gem to ensure that our class extension won't cause problems further down the road.

The problem

We use Memcached both for page fragment caching and for caching large collections of objects that require significant work to generate and fetch. For example, we have a page where we need to display a dot on a scatter plot for every building owned by the user. Associated with each dot is a small amount of work to 1) get associated data out of the database and 2) perform calculations on the data to get derived values that are used in the display of the scatter plot. This work is negligible for small numbers of buildings, but for users with thousands of buildings the work adds up to unacceptably long request times (actually they hit our self-imposed server timeouts first). For these larger accounts we simply can't afford to perform all those database queries and calculations during the server request cycle.

A scatter plot showing building values

Instead we cache all that work server-side so that a request for the same data can be served up directly without the need to fetch most data from the database or perform any calculations. We also employ an aggressive cache-warming strategy so that data is usually pre-cached when a user requests it.

There's much more to say about how to implement effective caching in an enterprise Rails application. Here we just want to talk about a specific pitfall we ran into with our caching setup.

For our caching we use the built-in caching mechanism in Rails along with the dalli gem, which provides a Memcached cache store called ActiveSupport::Cache::DalliStore. This allows us to cache objects using the following simple API:

def meaning_of_the_universe
  a_bunch_of_work_to_calculate # => 42
end

Rails.cache.fetch('meaning_of_the_universe') { meaning_of_the_universe } # => 42

The way this works is that if the key "meaning_of_the_universe" is found in Memcached (called a cache hit) then we can skip the expensive calculation contained in the block. If the key isn't set (called a cache miss) then we do the work, but then send the value to Memcached so that we can avoid the work the next time around. We can also do Rails.cache.read, which will return the value on a hit and nil on a miss without doing any extra work, and Rails.cache.write, which will write the value without checking first to see if it already exists.
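To illustrate, the read/write pair does explicitly what fetch does in one call (a minimal sketch reusing the hypothetical calculation from above):

value = Rails.cache.read('meaning_of_the_universe')    # => nil on a cache miss
if value.nil?
  value = a_bunch_of_work_to_calculate                 # => 42
  Rails.cache.write('meaning_of_the_universe', value)  # cache it for next time
end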

Every time you check the value of a cache key, or write to it, you are making a request to a Memcached server (and a cache miss followed by a write is actually two requests). Unless your Memcached server is running on the same machine as your application server, there is going to be some network overhead associated with performing a lookup. The more keys you check, the more that network IO adds to the time of your request, eventually defeating the purpose of having a cache in the first place.

To avoid this it can make sense to perform a Rails.cache.read_multi operation, where you pass a list of keys and get back a hash of the keys and values that were found; keys that missed are simply omitted from the result. This was the strategy we took on our page with the thousands of buildings displayed on a scatter plot. The code looked (very roughly) something like this:

buildings_hash = Rails.cache.read_multi(*building_keys)
building_keys.each do |key|
  # read_multi omits keys that missed, so fetch (and cache) those individually;
  # expensive_building_data stands in for the real calculation
  buildings_hash[key] ||= Rails.cache.fetch(key) { expensive_building_data(key) }
end

That is, you perform a bulk read on the keys and then go back and fill in the ones that were missing, instead of checking each one.

In general this strategy has worked out well for us. The problem (finally) that we ran into is that with our particular hosted Memcached provider and very large accounts we would get strange behavior. Sometimes the page would load right away and other times it would take minutes to generate, with no rhyme or reason. When we looked into things it turned out that sometimes the #read_multi operation was failing and we were treating the result as an empty hash. This meant that the page had to go recalculate and write the values for all of those buildings, even though the values actually existed in Memcached.

The solution

For whatever reason our hosted Memcached provider couldn't handle the large bulk read request we were sending them. The obvious solution was to break up the request into smaller chunks. To this end we created the following logic:

def read_multi_chunked(*cache_keys, chunk_size: nil)
  return {} if cache_keys.empty?
  chunk_size ||= ENV.fetch('MEMCACHED_READ_MULTI_CHUNK_SIZE') { 1_000 }.to_i
  chunked_keys = cache_keys.in_groups_of(chunk_size, false)
  chunked_keys.reduce({}) do |hash, chunk|
    result = begin
      Rails.cache.read_multi(*chunk)
    rescue Dalli::RingError
      {}
    end
    hash.merge(result)
  end
end

A quick explanation of the code:

  • take in a list of cache keys
  • get the chunk size from a parameter or from an environment variable
  • chunk the keys using ActiveSupport's Array#in_groups_of (illustrated below)
  • perform read_multi operations for each chunk
  • combine the results into a single hash
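One detail worth noting is the second argument to in_groups_of: by default it pads the last group with nils, while passing false leaves the last group short, which is what we want here:

require 'active_support/core_ext/array/grouping'

[1, 2, 3, 4, 5].in_groups_of(2)        # => [[1, 2], [3, 4], [5, nil]]
[1, 2, 3, 4, 5].in_groups_of(2, false) # => [[1, 2], [3, 4], [5]]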

In practice this new method has the same method signature (when the chunk size parameter is omitted) and output as the original read_multi method.
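That makes it a drop-in replacement at call sites (the keys below are just placeholders):

Rails.cache.read_multi('key1', 'key2')              # one bulk request
read_multi_chunked('key1', 'key2')                  # same result, issued in chunks
read_multi_chunked('key1', 'key2', chunk_size: 500) # with an explicit chunk size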

Next problem

The only problem remaining was deciding where to put this method. We try to keep methods out of the global scope, so we wanted to put it on some logical module or class. We briefly discussed the possibility of overriding ActiveSupport::Cache::DalliStore#read_multi and using alias trickery to call the original read_multi inside the new implementation, but decided against it on moral grounds. We decided, however, that it would make sense to add the method to ActiveSupport::Cache::DalliStore, so that any code that was previously using Rails.cache.read_multi could just switch over to using Rails.cache.read_multi_chunked. In other words we could do this:

module ActiveSupport
  module Cache
    class DalliStore
      def read_multi_chunked(*cache_keys, chunk_size: nil)
        return {} if cache_keys.empty?
        chunk_size ||= ENV.fetch('MEMCACHED_READ_MULTI_CHUNK_SIZE') { 1_000 }.to_i
        chunked_keys = cache_keys.in_groups_of(chunk_size, false)
        chunked_keys.reduce({}) do |hash, chunk|
          result = begin
            read_multi(*chunk)
          rescue Dalli::RingError
            {}
          end
          hash.merge(result)
        end
      end
    end
  end
end

(Note that we can drop the reference to Rails.cache and just call read_multi directly now, since Rails.cache was already referring to an instance of DalliStore.)
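Which store Rails.cache returns is a matter of configuration; a typical dalli setup looks something like this (the server address is a placeholder):

# config/environments/production.rb
config.cache_store = :dalli_store, 'memcached.example.com:11211'

Rails.cache.class # => ActiveSupport::Cache::DalliStore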

We liked this because it allowed us to look for our read_multi_chunked operation in a familiar place, but we were also aware that we had opened ourselves up to a somewhat unlikely but nevertheless troubling risk that is inherent in all such extensions: what if the maintainers of Dalli decided that they needed a method that could handle chunked read_multi operations, and what if they implemented a method with the same name? We wouldn't necessarily notice the change when we updated the gem, and there might be subtle differences or implications of their implementation that could affect our application. What if the maintainers updated internal code to use their new method, which would now be calling our own implementation instead?

That might sound like a lot of ifs, and in this case it's hard to think of a way that there would be a catastrophic effect (since it's unlikely that "read_multi_chunked" could mean something radically different), but it highlights the general issue with creating extensions on classes that you don't own. If nothing else, you might want to know about and use the maintainer's version of the method, because it could be more efficient and would let you stop maintaining your own version.

The extended solution

WegoWise maintains and uses a gem called Watchdog (written by a former Wegonaut, Gabriel Horner) that you can use to put an end to all this fretting about hypotheticals involving collisions with upstream code. By now it's a pretty old dog (last updated in 2011), but that's because what it does is fairly simple: Watchdog checks that your modification to a class doesn't collide with any existing methods. It performs this check at runtime, so the moment upstream code defines a method with the same name as yours you will get an exception.
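Conceptually the check is simple; a simplified sketch of the idea (this is not Watchdog's actual implementation) might look like this:

module CollisionGuard
  # Ruby calls append_features(base) when a module is included in base,
  # so we can compare method lists before the mixin goes through
  def append_features(base)
    collisions = instance_methods & base.instance_methods
    raise "collision with #{base}: #{collisions.inspect}" if collisions.any?
    super
  end
end

module SafeExtension
  extend CollisionGuard
  def some_new_method; end
end

class Array
  include SafeExtension # raises here if Array already defines some_new_method
end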

To use Watchdog you need to convert your extension into a mixin module. This is necessary because Watchdog works by hooking into the callback methods that Ruby runs when a module is mixed in (such as extend_object, which is run when extend is called on a class). This is what our code looks like with Watchdog:

module Ext
  module ActiveSupport
    module CacheDalliStore
      extend Watchdog

      def read_multi_chunked(*cache_keys, chunk_size: nil)
        return {} if cache_keys.empty?
        chunk_size ||= ENV.fetch('MEMCACHED_READ_MULTI_CHUNK_SIZE') { 1_000 }.to_i
        chunked_keys = cache_keys.in_groups_of(chunk_size, false)
        chunked_keys.reduce({}) do |hash, chunk|
          result = begin
            read_multi(*chunk)
          rescue Dalli::RingError
            {}
          end
          hash.merge(result)
        end
      end
    end
  end
end

module ActiveSupport
  module Cache
    class DalliStore
      include Ext::ActiveSupport::CacheDalliStore
    end
  end
end

The new version is certainly more verbose, but we think that's acceptable.

If it wasn't clear by now, we are okay with occasionally modifying classes that we don't own. Sometimes monkeypatching (or freedom patching!) can be the most elegant solution to a problem, especially when you want to make a change to an interface that is called in many places throughout your codebase.

That said, we don't think that class modification should be your first, or even second, option when thinking about a software problem. There are usually better approaches, such as creating a wrapper class that you can define your own methods on, or creating a subclass of the class that you want to modify. If, after careful consideration, you still decide that you want to modify that class, we think it's reasonable for there to be some ceremony and cruft involved.
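For instance, the wrapper approach might have looked something like this in our case (a sketch; CacheWrapper and its defaults are hypothetical):

# A hypothetical wrapper: it delegates everything to the underlying store
# and defines our extra method without touching DalliStore itself.
class CacheWrapper
  def initialize(store = Rails.cache)
    @store = store
  end

  def read_multi_chunked(*cache_keys, chunk_size: 1_000)
    cache_keys.each_slice(chunk_size).reduce({}) do |hash, chunk|
      hash.merge(@store.read_multi(*chunk))
    end
  end

  # everything else falls through to the wrapped store
  def method_missing(name, *args, &block)
    @store.send(name, *args, &block)
  end

  def respond_to_missing?(name, include_private = false)
    @store.respond_to?(name, include_private) || super
  end
end

The tradeoff is that call sites have to go through the wrapper instead of Rails.cache directly, which is exactly the kind of interface-wide change we were trying to avoid.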

Watchdog makes extending an upstream class considerably less dangerous, for those occasions when a patch seems like an appropriate option.
