CACHE_RESULTS - Transparent general caching of function results

This Matlab function evaluates a specified function with the provided arguments, then saves the results to disk in a file named according to the first argument. On a subsequent call, it checks for such a file and, if the arguments match, returns the result read from file instead of re-evaluating the function. When the same function is evaluated on the same data in multiple places (or multiple times), and the evaluation is at least moderately computationally expensive, cache_results can be used as a transparent replacement that avoids redundant computation.

When the first argument is a string (for instance, a file name), this name is used (with small modifications) as the name of the cached results file (within a specific cache directory). When the first argument is some other Matlab type, it is converted into a hexadecimal hash (using Jan Simon's DataHash) which is then used as the cache file name.
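
As a rough illustration of the naming scheme described above, here is a minimal sketch. The sanitizing rule (the regexprep pattern) and the variable names are assumptions made for illustration; the exact "small modifications" applied by cache_results are not documented here. The cache/specgram directory layout is taken from the example output further down, and the hash branch assumes DataHash is on the path.

```matlab
% Sketch only: derive a cache file name from the first argument.
arg1 = 'audio/example.wav';   % first argument of the cached function

if ischar(arg1)
    % Assumed rule: replace anything outside a conservative character
    % set so that the string is safe to use as a file name
    name = regexprep(arg1, '[^A-Za-z0-9_.-]', '_');
else
    % Non-string data is summarized as a hexadecimal hash
    % (requires Jan Simon's DataHash on the Matlab path)
    name = DataHash(arg1);
end

% One directory per cached function, one .mat file per first argument
cacheFile = fullfile('cache', 'specgram', [name '.mat']);
disp(cacheFile);
```

With a string first argument the hash branch is never reached, so the sketch runs even when DataHash is not installed.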

Although (sometimes) only the first function argument is used to construct the cache file name, all the arguments to the function are recorded in the file, along with the corresponding output. When a cache file is found, cache_results performs an exact comparison of all the arguments recorded in the cache file with all the arguments provided for the current invocation, and only uses the cached result if they match. If no match is found, the provided function is executed. The new result is then added to the cache file (along with its distinct set of arguments), so that in future both results will be returned from cache. Thus, while individual cache files distinguish only the first argument, the system can handle any number of different argument sets associated with that first argument. However, the search through the argument sets within a cache file is currently linear, so the function is most efficient when each cache file holds only a few distinct argument sets.
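
The lookup described above can be pictured with a small in-memory sketch. This is not the code inside cache_results: the struct array, its field names (args, result), and the stand-in computation are all hypothetical, and a real cache file would be read with load and updated with save.

```matlab
% Hypothetical stand-in for the contents of one cache file: a struct
% array with one entry per distinct argument set.
entries = struct('args',   {{'a.wav', 512}, {'a.wav', 1024}}, ...
                 'result', {111, 222});

query = {'a.wav', 1024};   % arguments of the current call

hit = 0;                   % index of the matching entry, 0 if none
for k = 1:numel(entries)
    % isequal compares the cell arrays element by element, so every
    % argument must match exactly, not only the first one
    if isequal(entries(k).args, query)
        hit = k;
        break;
    end
end

if hit > 0
    result = entries(hit).result;   % cache hit: reuse the stored output
else
    result = numel(query);          % stand-in for evaluating the real function
    % record the new argument set so that later calls find it
    entries(end+1) = struct('args', {query}, 'result', {result});
end
```

Because the loop visits entries one at a time, the lookup cost grows with the number of argument sets stored under the same first argument.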

While largely transparent, using a results cache of this kind has a few drawbacks. If the function does not give deterministic results given its input arguments, then caching the result will "freeze" the output, changing its behavior. If the first argument refers to a file on disk, but that input file itself is changed, cache_results will not know to re-evaluate the function on the new file. If the function itself is changed (so that even when evaluated with the same input arguments, its result will be different), cache_results will still return the cached value. There's some code within the function to check the modification date of both cache file and function which could possibly force a recalculation when the function is modified, but this behavior is currently disabled.
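
The "freezing" effect on a non-deterministic function can be seen in a minimal sketch that uses an in-memory containers.Map in place of the on-disk cache. The key name and the use of rand are illustrative assumptions and do not reflect how cache_results stores its data.

```matlab
% Sketch only: a tiny in-memory cache around a non-deterministic call.
cache = containers.Map('KeyType', 'char', 'ValueType', 'any');
key = 'rand_3';            % hypothetical cache key for the call rand(3)

results = cell(1, 2);
for trial = 1:2
    if isKey(cache, key)
        r = cache(key);    % second pass: the stored value is returned
    else
        r = rand(3);       % first pass: the function is really evaluated
        cache(key) = r;
    end
    results{trial} = r;
end

% Without the cache the two matrices would differ; with it they are
% identical, so the random output has been frozen.
disp(isequal(results{1}, results{2}));
```

The same reasoning applies to a changed input file or a modified function: once a matching entry exists, the stored value is returned without checking whether it is still current.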

Example Usage
Requirements
Getting the code
Changes
Acknowledgment

Example Usage

Here's an example of using cache_results to avoid recalculating a spectrogram:

% first, clear the cache
system('rm -rf cache/specgram');

% load some data
[d,sr] = wavread('example.wav');
disp('** Raw spectrogram:');
% Here's the function we want to cache.  I'm using a really high
% overlap to make it slow to compute.
tic; D = specgram(d,512,sr,512,504); toc
subplot(211)
imagesc(20*log10(abs(D))); axis xy
% The first time we do this with the cache, it has to evaluate the
% function so it is no faster:
disp('** Cache_results first time:');
% The function is passed as a function handle (@funcname);
% arguments are passed as a cell array.
tic; D2 = cache_results(@specgram, {d,512,sr,512,504}, '', '', 1); toc
disp(['Peak diff = ',num2str(max(abs(D2(:)-D(:))))]);
% the same result

% Now, if we do it again, it's much faster
disp('** Cache_results second time:');
tic; D2 = cache_results(@specgram, {d,512,sr,512,504}, '', '', 1); toc
max(abs(D2(:)-D(:)));

% Doing it with a different first argument (even just a little bit
% different) leads to a different cache file:
disp('** Slightly different 1st arg data:');
tic; D3 = cache_results(@specgram, {d(1:end-1),512,sr,512,504}, '', '', 1); toc
tic; D3 = cache_results(@specgram, {d(1:end-1),512,sr,512,504}, '', '', 1); toc
subplot(212)
imagesc(20*log10(abs(D3))); axis xy

** Raw spectrogram:
Elapsed time is 4.795183 seconds.
** Cache_results first time:
creating ./cache/specgram ... 
saved to ./cache/specgram/78857fbda0c1f2b9367b3f0e9676106d.mat
Elapsed time is 9.924937 seconds.
Peak diff = 0
** Cache_results second time:
loading from ./cache/specgram/78857fbda0c1f2b9367b3f0e9676106d.mat
Elapsed time is 1.248805 seconds.
** Slightly different 1st arg data:
saved to ./cache/specgram/25cc52156ad6b956e1f821b630e203c6.mat
Elapsed time is 9.095687 seconds.
loading from ./cache/specgram/25cc52156ad6b956e1f821b630e203c6.mat
Elapsed time is 1.331610 seconds.

Requirements

When the first argument is not a string, or if the total argument set comprises more than 256 bytes, cache_results will use Jan Simon's DataHash to summarize the arguments, so DataHash needs to be installed. DataHash also relies on the Java Virtual Machine, so Matlab must be running with Java enabled.
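
Both requirements can be checked up front with standard Matlab calls. The snippet below is a suggested pre-flight check, not part of cache_results; the variable names and warning messages are illustrative.

```matlab
% exist(..., 'file') returns 2 when an M-file of this name is on the path
hasDataHash = (exist('DataHash', 'file') == 2);

% usejava('jvm') is true when the Java Virtual Machine is available
hasJava = usejava('jvm');

if ~hasDataHash
    warning('DataHash was not found on the Matlab path.');
end
if ~hasJava
    warning('Java is not enabled, so DataHash will not work.');
end
```

If the second warning appears, check whether Matlab was started with the -nojvm option.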

Getting the code

The latest version of this code is available at http://www.ee.columbia.edu/~dpwe/resources/matlab/cache_results/

Or you can just copy it from this link: cache_results.m.

Changes

2012-07-11 v0.1 Original release

Acknowledgment

This work was supported by DARPA under the RATS program via a subcontract from the SRI-led team SCENIC. My work was on behalf of ICSI.