Implementing a content cache

There are a million ways to implement a PHP-based content cache, so why would we need another one? The only plausible reason I can think of is that most PHP caching implementations seem to overcomplicate the matter. In this article we'll focus not specifically on PHP caching but on caching as a general system design principle and how it may be implemented. The code is written in PHP, but the concepts can be applied to any server-side script, no matter what the language.

A cache is a very simple thing. In essence, you do something, you save the result, and the next time you're asked to do the same thing you just fetch the saved result rather than working it out again. It's analogous to remembering that 4 x 4 = 16 rather than working it out every time.
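To make that idea concrete, here is a minimal sketch of the same principle in plain PHP: a hypothetical memoised multiply() function that remembers answers it has already worked out. The function name and the $memo array are purely illustrative and have nothing to do with the cache code later in the article.

<?php
$memo = array();

function multiply($a, $b)
{
    global $memo;
    $key = $a."x".$b;

    // If we've already worked this out, return the saved answer
    if (isset($memo[$key])) {
        return $memo[$key];
    }

    // Otherwise do the work once and remember the result
    $memo[$key] = $a * $b;
    return $memo[$key];
}

echo multiply(4, 4); // worked out and remembered
echo multiply(4, 4); // fetched straight from $memo
?>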

Before we get started, note that this article is about server-side caching. It's not the caching that your browser implements (so we won't be messing around with the PHP header function too much); it's the caching that you, as a PHP developer, implement without having to rely on individual web users' browser settings.

Why might we want to cache our content? It's quicker, right? Well, not necessarily. If you load a 10-line HTML document with no embedded PHP, ColdFusion, ASP &c. code in it, the time taken to create a file handle, write the contents of the file to the server, check whether a cache file exists, read the contents of the file back and then output them will largely outweigh just outputting the file contents directly. Which begs the question: what's the point of caching content?

At Celtic Productions, we always attempt to separate business logic, data and presentation as much as possible given time and budget constraints. This is in line with good software engineering practice: changes in one of those layers shouldn't necessarily mean changes in the others. As a result, we often implement custom CMSs (content management systems) where we restrict the input of, for example, HTML entities and instead opt for design-independent logical markup or objects. This makes the system completely flexible if the presentation layer needs to be changed: the data doesn't need to be rehashed, as it is, to all intents and purposes, independent of what it will eventually become. This system design principle enables the same content to be delivered over different channels (e.g. phone, internet, text-only media) once each server of the content understands the logical markup.

Note: To some extent, this is the theory behind HTML itself, i.e. the <em> tag is meant to be used to logically mark up something which requires emphasis. All the above CMSs do is take that same concept to the next level.

Eh, relevance? Well, something ultimately has to interpret the logical markup. With HTML, the browser is capable of this, but if you have complicated logical entities and objects in a content management system (for example, you might have a CMS object called 'EmployeeOfTheMonth' which renders young Jimbo in the content with all of his monthly sales conversions) you will invariably need a 'content parser'.

Imagine the scenario, then, where in order to collate all of Jimbo's sales data for the month you need to query 20 or so related database tables, each of which contains tens of thousands of records. Continuing down that line of thinking, imagine you have not just an 'EmployeeOfTheMonth' object in your CMS but also a 'ManagerOfTheMonth', 'MostRequestedItem', 'UpAndComingEmployee' &c., each of which requires a similar number of database queries to piece together. All of a sudden, your pages are taking a fairly unacceptable length of time to load!

That's where the content cache comes in. It most certainly is quicker in this instance, and in any instance where there is a large amount of processing taking place, to read a 'finished version' of the page off the file system than prepare the page 'fresh' each time it is requested. So, how do we go about doing this?

The reason I find most caching systems out there so complicated, and never really suitable for my needs, is that they often require you to write all of your content into a variable, i.e. $content = ..., whereas the code that I work on almost always consists of a mixture of the following:

1) print statements
2) echo statements
3) $content = <<<CONTENT statements
4) <?=$myvar;?> statements

Rewriting all of that to store everything in a variable before printing it out would take me decades. The solution is to take a step back and say to yourself: well, it all reaches the browser at some point, so let me just capture whatever gets sent to the browser. Fortunately, PHP gives us a very simple way to do that in the form of an output buffer.
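If you haven't used output buffering before, here is a minimal, standalone sketch of the idea (nothing cache-specific yet): ob_start() begins capturing anything that would normally be sent to the browser, ob_get_contents() returns what has been captured so far, and ob_end_flush() sends it on to the client. The variable names are just for illustration.

<?php
ob_start();

// Any mixture of output styles ends up in the buffer
print "Hello ";
echo "world, ";
$name = "Jimbo";
echo <<<CONTENT
this is $name.
CONTENT;

// Grab a copy of everything output so far...
$captured = ob_get_contents();

// ...then release the buffer to the browser as normal
ob_end_flush();

// $captured now holds "Hello world, this is Jimbo." and could
// just as easily be written to a cache file
?>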

Here's the PHP-centric theory. Create two files for inclusion in all pages that you would like to cache, cache_header.php and cache_footer.php. cache_header.php will look something like this:


<?php

// Should we turn on caching? 1 = Yes, 0 = No.
// Caching is good for production scripts but
// you might want to turn it off if you are
// still testing your script

$caching_enabled = 1;

// If caching is enabled, enter the cache block.
// I highly recommend having whatever is in the
// if statement the same in both your header
// and footer file. If they are different, there
// is of course a chance that the output buffer
// will start but not finish (blank pages anyone!)

if ($caching_enabled)
{
    // We generate the cache filename based on the
    // requested URL. Assuming you aren't caching
    // session specific pages or POST data, this is
    // a great way to quickly tie a unique request
    // to a unique cache file. To generate a unique
    // fingerprint if you are using session or POST
    // data, implode $_REQUEST / $_POST / $_SESSION
    // and md5 the result

    $filename = "%%-".md5($_SERVER['REQUEST_URI'])."-%%.html";

    // Set the $cachefile variable to the full path
    // to your cache directory, make sure it's writable!

    $cachefile = "/my/cache/directory/".$filename;

    // The next few lines are possibly the only lines
    // that you will want to change; they determine when
    // the cache is invalid. In this example we are using
    // a flat 30 minutes but this definitely requires more
    // discussion, see below...

    $cachetime = 30 * 60;

    // If our $cachefile exists and this particular page
    // has been cached within the last half an hour, the
    // cache is valid and we should use it.

    if (file_exists($cachefile) &&
        time() - $cachetime < filemtime($cachefile))
    {
        // Simply include the $cachefile here and kill
        // the script. Wow, that page loaded quickly!

        include($cachefile);
        exit;
    }

    // This is important; this is effectively an else
    // statement here, i.e. if the cache wasn't valid
    // we should start the output buffer here to collect
    // the 'cacheable' contents of the page

    ob_start();
}
?>
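As an aside, the comment above mentions building a fingerprint when POST or session data affects the page. A rough sketch of what that might look like follows; this is my own illustration rather than part of the header file, and the $fingerprint variable is just a name I've picked.

<?php
// Hypothetical sketch: combine everything that can influence
// the output into one string and md5 it, as suggested above.
// (implode() only flattens one level; for nested arrays,
// serialize() would be a more robust choice.)
$fingerprint = $_SERVER['REQUEST_URI']
             . implode(",", $_POST)
             . implode(",", isset($_SESSION) ? $_SESSION : array());

$filename = "%%-".md5($fingerprint)."-%%.html";
?>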


cache_footer.php should look something like this:


<?php
if ($caching_enabled)
{
    // $cachefile has already been set so just
    // open a write handle to it and throw the
    // contents of the output buffer into it

    $fp = fopen($cachefile, 'w+');
    fwrite($fp, ob_get_contents());

    // Close up the cache file and flush/echo/print
    // the contents of the output buffer to the client.
    // If you don't flush the contents the generated
    // page will be empty!

    fclose($fp);
    ob_end_flush();
}
?>
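One refinement you might consider, which is not in the snippet above: if two requests hit an expired page at the same moment they will both try to write the cache file at once, and a failed fopen() will also spit out warnings. A slightly more defensive version of the write, sketched here purely as an illustration, checks the handle and writes to a temporary file before renaming it into place (rename() on the same filesystem is atomic, so readers never see a half-written cache file):

<?php
if ($caching_enabled)
{
    // Write to a temporary file first...
    $tmpfile = $cachefile.".".getmypid().".tmp";
    $fp = @fopen($tmpfile, 'w');

    if ($fp !== false)
    {
        fwrite($fp, ob_get_contents());
        fclose($fp);

        // ...then atomically move it into place
        rename($tmpfile, $cachefile);
    }

    // Either way, send the buffered page to the client
    ob_end_flush();
}
?>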


Then, to enable caching on an existing page, you might have something like the following:


<?php include_once("cache_header.php"); ?>

Cacheable page content consisting of print statements, echo statements,
<?=$myvar;?> statements and even the much coveted <<<CONTENT statements.

<?php include_once("cache_footer.php"); ?>


Finally, the only thing you really need to give consideration to is how long your cache should be valid for. This depends on how often the content is going to change: content that changes less frequently can have a cache with a longer life span, while content that changes more frequently should have a fairly short-lived cache. If that explanation isn't good enough and you only want to invalidate the cache when the content has actually changed, good for you; that's probably the best implementation, but it does require a bit more effort.

Firstly, you need to consider the dependencies of the page you are caching, i.e. how the page gets its content: is it from a number of modifiable includes, from a database-driven CMS, or from a combination of a CMS and other data? The only way to be really sure that your cache is invalidated each time the content of the page is updated is to trigger a flag whenever something that could affect the final page content is added, updated or deleted.

For example, you might create a table in your database with two fields, md5_page_identifier and is_dirty. The md5 page identifier holds the md5'd version of the requested URL, while is_dirty indicates whether the page potentially needs to be re-cached. You could then set the is_dirty flag to '1' each time page-dependent data is updated (using either database triggers or programmatic triggers in your CMS). Display the cached page if is_dirty is set to '0' and re-cache the page if it is set to '1'. Just remember that when you re-cache the page you should set is_dirty back to '0' to ensure that subsequent calls use the cached version, as sketched below.
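Here is a rough sketch of how that dirty-flag check might slot into cache_header.php in place of the time-based test. It assumes a table with the two columns described above and a db_query() helper of your own; both the table name and the helper are hypothetical, so treat this as an outline rather than drop-in code.

<?php
// Hypothetical dirty-flag check, replacing the filemtime() test.
// Assumes a page_cache_status table (md5_page_identifier, is_dirty)
// and a db_query() helper that returns one row as an array.
$page_id = md5($_SERVER['REQUEST_URI']);

$row = db_query(
    "SELECT is_dirty FROM page_cache_status
     WHERE md5_page_identifier = '".$page_id."'"
);

if (file_exists($cachefile) && $row && $row['is_dirty'] == 0)
{
    // Cache is still clean, serve it and stop
    include($cachefile);
    exit;
}

// Otherwise rebuild the page; once cache_footer.php has written
// the new cache file, remember to mark the page clean again:
// UPDATE page_cache_status SET is_dirty = 0
// WHERE md5_page_identifier = '<the md5 of the URL>'

ob_start();
?>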

That's it, I hope this article has been useful. There's a lot of extra information in there and a few system design principles but if you are looking for a quick and clean way to implement a content cache, the information is right there for you in the above code snippets.

As with all articles on Celtic Productions, this article is protected by international copyright laws. It may be linked to (we are of course most grateful for links to our articles), however it may never be reproduced without the prior express permission of its owners, Celtic Productions.