Title: Implementing a content cache
Posted: Wed, 11th April 2007
Category: System Design
There are a million ways to implement a PHP based content cache,
so why would we need another one? The only plausible reason I can
think of is that most PHP caching implementations seem to
overcomplicate the matter. In this article we'll focus not
specifically on PHP caching but on caching as a general system
design principle and how it may be implemented. The code is
written in PHP but the concepts can be applied to any server side
script, no matter what the language.
A cache is a very simple thing. In essence, you do something, you
save the result and then, the next time you're asked to do the
same thing, you just fetch the result rather than working it out
again. It's analogous to remembering that 4 x 4 = 16 rather than
working it out every time.
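As a minimal sketch of that idea (the function names here are
made up for illustration, and usleep() simply stands in for real
work), remembering 4 x 4 might look like this in PHP:

```php
<?php
// Hypothetical stand-in for any expensive piece of work.
function slow_square(int $n): int
{
    usleep(1000); // pretend this takes a while
    return $n * $n;
}

// Look the answer up first; only work it out (and remember it)
// when we have not seen this input before.
function cached_square(int $n): int
{
    static $cache = [];
    if (!array_key_exists($n, $cache)) {
        $cache[$n] = slow_square($n);
    }
    return $cache[$n];
}

echo cached_square(4), "\n"; // worked out, then stored
echo cached_square(4), "\n"; // fetched from the cache
```

This version only remembers results for the life of one request.
The rest of the article is about keeping the result on disk so
that it survives between requests.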
Before we get started: this article is about server side caching.
It's not the caching that your browser implements (so we won't be
messing around with the PHP header function too much); it's the
caching that you, as a PHP developer, implement without having to
rely on individual web users' browser settings.
Why might we want to cache our content? It's quicker, right?
Well, not necessarily. If you load a 10 line HTML document with
no embedded PHP, ColdFusion, ASP &c. code in it, the time taken
to check whether a cache file exists, create a file handle, write
the page out to the cache, read the cache file back in and then
output it will far outweigh the time taken to just output the
file contents. So that raises the question: what's the point of
caching content?
At Celtic Productions, we always attempt to separate business
logic, data and presentation as much as is possible given time
and budget constraints. This is in line with good software
engineering practice, i.e. a change in one of those layers
doesn't necessarily mean changes in the others. As a result, we
often implement custom CMSs (content management systems) where we
restrict the input of, for example, HTML entities and instead opt
for design independent logical markup or objects. This leaves the
system completely flexible if the presentation layer needs to be
changed. The data doesn't need to be rehashed as it is, to all
intents and purposes, independent of what it will eventually
become. This system design principle enables the same content to
be delivered over different channels (e.g. phone, internet,
text-only media), provided each server of the content understands
the logical markup.
Note: To some extent, this is the theory behind HTML itself, i.e.
the <em> tag is meant to be used to logically mark up
something which requires emphasis. All the above CMSs do is take
that same concept to the next level.
Eh, relevance? Well, something has to ultimately interpret the
logical markups. With HTML, the browser is capable of this but if
you have complicated logical entities and objects in a content
management system (for example, you might have a CMS object
called 'EmployeeOfTheMonth' which renders young Jimbo in the
content with all of his monthly sales conversions) you will
invariably need a 'content parser'.
Imagine the scenario, then, where in order to collate all of
Jimbo's sales data for the month you need to query 20 or so
related database tables, each of which contains tens of thousands
of records. Continuing down that line of thinking, imagine you
have not just an 'EmployeeOfTheMonth' object in your CMS but also
a 'ManagerOfTheMonth', 'MostRequestedItem',
'UpAndComingEmployee' &c., each of which requires a similar
number of queries to the database in order to piece it together.
All of a sudden, your pages are taking a fairly unacceptable
length of time to load!
That's where the content cache comes in. It most certainly is
quicker in this instance, and in any instance where there is a
large amount of processing taking place, to read a 'finished
version' of the page off the file system than to prepare the page
'fresh' each time it is requested. So, how do we go about doing
this?
The reason I find most caching systems out there so complicated,
and never really suitable for my needs, is that they often
require you to write all of your content into a variable, i.e.
$content = ..., whereas the code that I work on almost always
consists of a mixture of the following:
1) print statements
2) echo statements
3) $content = <<<CONTENT statements
4) <?=$myvar;?> statements
So storing everything in a variable before printing it out would
take me decades to rewrite. The solution is to take a step back
and say to yourself: it all reaches the browser at some point, so
let me just capture what the browser gets sent. Fortunately, PHP
gives us a very simple way to do that in the form of an output
buffer.
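To make that concrete, here is a small sketch of an output buffer
swallowing a mixture of print, echo and heredoc output. The
variable names are only illustrative. It uses PHP's ob_start()
and ob_get_clean(), and nothing reaches the client until the last
line:

```php
<?php
// Start capturing: nothing below is sent to the client yet.
ob_start();

print "printed, ";
echo "echoed, ";
$content = <<<CONTENT
heredoc
CONTENT;
echo $content;

// Grab everything captured so far and stop buffering
// without sending it.
$captured = ob_get_clean();

// $captured is now an ordinary string that could be written
// to a file, sent to the client, or both.
echo strtoupper($captured), "\n"; // PRINTED, ECHOED, HEREDOC
```

The page code itself doesn't change at all, which is the whole
attraction: the capture happens around it rather than inside it.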
Here's the PHP centric theory. Create 2 files for inclusion in
all pages that you would like to cache, cache_header.php and
cache_footer.php. cache_header.php will look something like this:
<?php
// Should we turn on caching, 1 = Yes, 0 = No
// Caching is good for production scripts but
// you might want to turn it off if you are
// still testing your script
$caching_enabled = 1;

// If caching is enabled, enter the cache block
// I highly recommend having whatever is in the
// if statement the same in both your header
// and footer file. If they are different, there
// is of course a chance that the output buffer
// will start but not finish (blank pages anyone!)
if ($caching_enabled)
{
    // We generate the cache filename based on the
    // requested URL. Assuming you aren't caching
    // session specific pages or POST data, this is
    // a great way to quickly tie a unique request
    // to a unique cache file. To generate a unique
    // fingerprint if you are using session or POST
    // data, implode $_REQUEST / $_POST / $_SESSION
    // and md5 the result
    $filename = "%%-".md5($_SERVER['REQUEST_URI'])."-%%.html";

    // Set the $cachefile variable to the full path
    // to your cache directory, make sure it's writable!
    $cachefile = "/my/cache/directory/".$filename;

    // The next few lines are possibly the only lines
    // that you will want to change, they determine when
    // the cache is invalid. In this example we are using
    // a flat 30 minutes but this definitely requires more
    // discussion, see below...
    $cachetime = 30 * 60;

    // If our $cachefile exists and this particular page
    // has been cached within the last half an hour, the
    // cache is valid and we should use it.
    if (file_exists($cachefile) &&
        time() - $cachetime < filemtime($cachefile))
    {
        // Simply output the $cachefile here and kill
        // the script. Wow, that page loaded quickly!
        // readfile() sends the file exactly as stored;
        // include() would run any PHP open tag that
        // happened to appear in the cached output.
        readfile($cachefile);
        exit;
    }

    // This is important; this is effectively an else
    // statement here, i.e. if the cache wasn't valid
    // we should start the output buffer here to collect
    // the 'cacheable' contents of the page
    ob_start();
}
?>
cache_footer.php should look something like this:
<?php
if ($caching_enabled)
{
    // $cachefile has already been set so just
    // open a write handle to it and throw out
    // the contents of the output buffer into it.
    // If the file can't be opened (directory not
    // writable?) skip the write and just serve
    // the page.
    $fp = fopen($cachefile, 'w+');
    if ($fp !== false)
    {
        fwrite($fp, ob_get_contents());
        fclose($fp);
    }

    // Flush/echo/print the contents of the output
    // buffer to the client. If you don't flush the
    // contents the generated page will be empty!
    ob_end_flush();
}
?>
To enable caching on an existing page, then, you might have
something like the following:
<?php
include_once("cache_header.php");
?>
Cacheable page content consisting of print statements, echo
statements, <?=$myvar;?> statements and even the much
coveted <<<CONTENT statement.
<?php
include_once("cache_footer.php");
?>
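The comments in cache_header.php mention folding session or POST
data into the filename when the URL alone doesn't identify a
page. One possible way to do that is sketched below. The helper
name cache_filename() and its parameters are my own invention for
illustration, and it uses serialize() rather than implode()
because implode() doesn't cope with nested arrays:

```php
<?php
// Hypothetical helper: build a cache filename from everything
// that can change the page output, not just the URL.
function cache_filename(string $uri, array $post = [], array $session = []): string
{
    // serialize() handles nested arrays, which implode() does not.
    $fingerprint = md5(serialize([$uri, $post, $session]));
    return "%%-" . $fingerprint . "-%%.html";
}

// The same request data maps to the same file,
// different request data maps to a different one.
$a = cache_filename('/report.php?month=4', ['region' => 'north']);
$b = cache_filename('/report.php?month=4', ['region' => 'north']);
$c = cache_filename('/report.php?month=4', ['region' => 'south']);

var_dump($a === $b); // bool(true)
var_dump($a === $c); // bool(false)
```

In the header you would pass in $_SERVER['REQUEST_URI'], $_POST
and $_SESSION. Bear in mind that the more data you fold into the
fingerprint, the more cache files you create and the less often
each one gets reused.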
Finally, the only thing left to consider is how long your cache
should be valid for. This depends on how often the content is
going to change. Content that changes less frequently should have
a cache with a longer life span, while content that changes more
frequently should have a cache with a fairly short life span. If
that explanation isn't good enough and you only want to
invalidate the cache when the content has actually changed, good
for you; that's probably the best implementation but it does
require a bit more effort.
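For the time based approach, the freshness test from
cache_header.php can be pulled out into a small function so that
each page can pass in its own life span. This is only a sketch:
cache_is_fresh() is a name I've made up, and the demo fakes a
10 minute old cache file with touch():

```php
<?php
// Hypothetical helper: true if $cachefile exists and is
// younger than $ttl seconds.
function cache_is_fresh(string $cachefile, int $ttl): bool
{
    return is_file($cachefile)
        && (time() - filemtime($cachefile)) < $ttl;
}

// Demo: a throwaway file whose modification time is set
// to 10 minutes ago.
$file = tempnam(sys_get_temp_dir(), 'cache');
touch($file, time() - 600);

var_dump(cache_is_fresh($file, 30 * 60)); // 30 minute life span: bool(true)
var_dump(cache_is_fresh($file, 5 * 60));  // 5 minute life span: bool(false)

unlink($file);
```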
Firstly, you need to consider the dependencies of the page you
are caching, i.e. how does the page get its content? Is it from a
number of modifiable includes, from a database driven CMS, or
from a combination of a CMS and other data? The only way to be
really sure that your cache is invalidated each time the content
of the page is updated is to set a flag whenever something that
could affect the final page content is added, updated or deleted.
For example, you might create a table in your database which has
2 fields, md5_page_identifier and is_dirty. md5_page_identifier
holds the md5'd version of the requested URL, while is_dirty
indicates whether the page potentially needs to be re-cached. You
then set is_dirty to '1' each time page dependent data is updated
(using either database triggers or programmatic triggers in your
CMS), display the cached page if is_dirty is '0' and re-cache the
page if it is '1'. Just remember that when you re-cache the page,
you should set is_dirty back to '0' to ensure that subsequent
calls use the cached version of the page.
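Here is one way the dirty flag might be wired up, as a rough
sketch rather than a finished implementation. The table name
page_cache and the three helper functions are assumptions of
mine, and an in-memory SQLite database (via PDO) stands in for
whatever database your CMS really uses:

```php
<?php
// Assumption: in-memory SQLite stands in for your real database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE page_cache (
    md5_page_identifier CHAR(32) PRIMARY KEY,
    is_dirty INTEGER NOT NULL DEFAULT 1
)');

// Hypothetical helper shared by mark_dirty() and mark_clean().
function set_flag(PDO $db, string $uri, int $flag): void
{
    $db->prepare('REPLACE INTO page_cache
                  (md5_page_identifier, is_dirty) VALUES (?, ?)')
       ->execute([md5($uri), $flag]);
}

// Call whenever data the page depends on is changed.
function mark_dirty(PDO $db, string $uri): void
{
    set_flag($db, $uri, 1);
}

// Call straight after the page has been re-cached.
function mark_clean(PDO $db, string $uri): void
{
    set_flag($db, $uri, 0);
}

// A page with no row has never been cached, so treat it as dirty.
function is_dirty(PDO $db, string $uri): bool
{
    $stmt = $db->prepare('SELECT is_dirty FROM page_cache
                          WHERE md5_page_identifier = ?');
    $stmt->execute([md5($uri)]);
    $flag = $stmt->fetchColumn();
    return $flag === false || (int) $flag === 1;
}

$uri = '/employee-of-the-month.php';
var_dump(is_dirty($db, $uri)); // never cached: bool(true)
mark_clean($db, $uri);         // page has just been cached
var_dump(is_dirty($db, $uri)); // bool(false)
mark_dirty($db, $uri);         // Jimbo's sales data changed
var_dump(is_dirty($db, $uri)); // bool(true)
```

In cache_header.php the is_dirty() check would then replace the
filemtime() comparison, and cache_footer.php would call
mark_clean() once the cache file has been written.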
That's it, I hope this article has been useful. There's a lot of
extra information in there and a few system design principles but
if you are looking for a quick and clean way to implement a
content cache, the information is right there for you in the
above code snippets.
As with all articles on Celtic Productions, this article is
protected by international copyright laws. It may be linked to
(we are of course most grateful for links to our articles),
however, it may never be reproduced without the prior express
permission of its owners, Celtic Productions.