I have migrated my blogs to http://pranshujain.wordpress.com; this article now lives there.
In my previous post on cache implementation I talked about where to keep the cache. Now I will talk about how to access the cache and how to expire it.
Before we get into those topics, it is important to consider how much stale data the application can tolerate, because that tolerance determines how far we can optimize.
Let's say we cannot tolerate stale data at all: say the application sells hotel rooms priced in Thai Baht and converted to Sterling. Depending on the rate I get from my bank and the date of stay, I get a price.
So every time anyone requests a room, I have to check the price with something like:
PriceInGBP = PriceInTHB * (SELECT ExchangeRate FROM ExchangeRate
                           WHERE fromCurrency = 'THB' AND ToCurrency = 'GBP'
                           AND ValidityFromDate <= '12-Jan' AND ValidityToDate >= '12-Jan'
                           AND isLive = True)
Since I am selling a stay in the future, I have to look up the rate that will be valid on that date.
Now if I cannot tolerate stale data, the best I can do is to put a trigger on the exchange rate table that updates an ExchangeRateVersionNumber table. The version number is bumped on any change anywhere in the exchange rate table.
Thus my application changes to:
IF CachedExchangeRateVersionNumber = (SELECT VersionNumber FROM ExchangeRateVersionNumber)
    use the cached exchange rate
ELSE
    run the query above to fetch a fresh rate (and remember the new version number)
Here the query fired against the database is much cheaper, so the load on the database is much smaller, which aids scalability.
However, at least one query still needs to be fired every time.
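If I sketch this pattern in C# (GetVersionFromDb and QueryExchangeRate are illustrative placeholders, not real APIs), it could look like:

using System;

static class ExchangeRateCache
{
    private static decimal _cachedRate;
    private static long _cachedVersion = -1;

    public static decimal GetExchangeRate(DateTime stayDate)
    {
        // One cheap query per call: SELECT VersionNumber FROM ExchangeRateVersionNumber
        long dbVersion = GetVersionFromDb();
        if (dbVersion != _cachedVersion)
        {
            // Version changed: re-run the full exchange rate query shown above.
            // A real cache would key the rates by date and currency pair.
            _cachedRate = QueryExchangeRate(stayDate);
            _cachedVersion = dbVersion;
        }
        return _cachedRate;
    }

    // Illustrative placeholders for the actual database calls.
    private static long GetVersionFromDb() { return 0; }
    private static decimal QueryExchangeRate(DateTime stayDate) { return 0m; }
}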
Now let's assume that we can tolerate stale data for one minute. We could change getExchangeRate to:
IF CachedExchangeRateTimestamp >= CurrentDateTime - 1 minute
    use the cached rate
ELSE
    run the query above
This results in significantly faster execution times, as it saves even the round trip to the database on most calls.
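The same method, sketched in C# with a one-minute tolerance (again, QueryExchangeRate is an illustrative placeholder):

using System;

static class TimedExchangeRateCache
{
    private static decimal _cachedRate;
    private static DateTime _cachedAt = DateTime.MinValue;

    public static decimal GetExchangeRate(DateTime stayDate)
    {
        if (_cachedAt <= DateTime.UtcNow.AddMinutes(-1))
        {
            // Cached value is more than a minute old: refresh from the database.
            _cachedRate = QueryExchangeRate(stayDate);
            _cachedAt = DateTime.UtcNow;
        }
        return _cachedRate; // fast path: no database round trip at all
    }

    private static decimal QueryExchangeRate(DateTime stayDate) { return 0m; } // placeholder
}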
Let's now take the scenario where the exchange rate table is updated by a feed from my bank instead of being entered manually. Here, if I cache, I have to tolerate stale data, even if only for a minute: there can be no trigger (unless my bank supports one) to help check validity.
Now let's come back to accessing the cache. We want multiple threads to read from the cache simultaneously; however, when the cache is being updated, we need all the readers to wait.
While implementing such a lookup, we need to be careful not to lock or synchronize plain reads, i.e. we must not force concurrent readers to queue up behind one another.
In a typical implementation I would prefer a Hashtable, as it already provides this kind of thread safety (provided there is only one writing thread).
The granularity of this lock should match the granularity of the cache update.
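A minimal sketch of such a cache on .NET (the class and its contents are illustrative):

using System.Collections;

static class RateCache
{
    // A plain Hashtable is already safe for many readers plus a single
    // writer; the Synchronized wrapper additionally serializes writers.
    private static readonly Hashtable _cache = Hashtable.Synchronized(new Hashtable());

    public static object Get(object key)
    {
        return _cache[key]; // readers are not serialized against each other
    }

    public static void Put(object key, object value)
    {
        _cache[key] = value; // the single writer thread refreshes entries
    }
}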
Now looking at expiry:
We have already discussed that in some cases we need a garbage-collector-like daemon clearing out cached items. This is also required for an LRU cache.
If the number of items in the cache is limited, we do not need such a daemon; the expiry check can happen while fetching from the cache.
In some other cases, we will expire and re-seed the cache at pre-defined times.
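As a sketch, such a daemon could be a background thread sweeping out entries older than their time-to-live (all names here are illustrative, not a definitive implementation):

using System;
using System.Collections;
using System.Threading;

class ExpiringCache
{
    private readonly Hashtable _items = Hashtable.Synchronized(new Hashtable());
    private readonly TimeSpan _ttl;

    public ExpiringCache(TimeSpan ttl, TimeSpan sweepInterval)
    {
        _ttl = ttl;
        Thread sweeper = new Thread(delegate()
        {
            while (true) // garbage-collector-like sweep loop
            {
                Thread.Sleep(sweepInterval);
                lock (_items.SyncRoot) // serialize against the writer during the sweep
                {
                    ArrayList expired = new ArrayList();
                    foreach (DictionaryEntry e in _items)
                        if (((CachedItem)e.Value).LoadedAt + _ttl < DateTime.UtcNow)
                            expired.Add(e.Key);
                    foreach (object key in expired)
                        _items.Remove(key);
                }
            }
        });
        sweeper.IsBackground = true; // dies with the process
        sweeper.Start();
    }

    public void Put(object key, object value)
    {
        _items[key] = new CachedItem(value, DateTime.UtcNow);
    }

    public object Get(object key)
    {
        CachedItem item = (CachedItem)_items[key];
        return item == null ? null : item.Value;
    }

    private class CachedItem
    {
        public readonly object Value;
        public readonly DateTime LoadedAt;
        public CachedItem(object value, DateTime loadedAt) { Value = value; LoadedAt = loadedAt; }
    }
}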
Here are some questions that I answered in a recent post on the MSDN forums:
1. At what moment should we invalidate the cache: at the start of the transaction, just before updating the database, or at some other time?
It really depends on what kind of data you are caching. If you are caching flight information, as in the Expedia example above, the updating transaction is not in your control, so you have no option but to tolerate slightly stale data.
Now assume your own system is updating the data that is being cached, and that the data lives in a common datastore such as a SQL database. If I also assume a clustered environment, there are multiple instances of the cache: one per app pool per machine. The component updating the underlying data therefore has no reliable way of marking the cache invalid across all the different in-memory instances.
This forces each cache instance to have its own "listener" checking for cache updates. Setting patterns aside and talking pure implementation: if your application can tolerate stale data, you could have a background thread updating the cache periodically from the database. If your application cannot tolerate a stale cache, you will have to check the database status before each read from the cache. In the database, you could use triggers to maintain the cache status in a single-cell table (say, a last-updated timestamp), which the cache compares with its own copy, refreshing itself if required. This way, the load on the database is much lower than if it had to retrieve the entire cached dataset each time.
So in this case, updating the cache-invalid flag as part of the atomic transaction will help.
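A minimal sketch of the single-cell status table and trigger, assuming SQL Server (table and column names are illustrative):

-- One-row status table that the caches poll
CREATE TABLE CacheStatus (LastUpdated DATETIME NOT NULL);
INSERT INTO CacheStatus (LastUpdated) VALUES (GETDATE());

-- Any change to the cached table bumps the timestamp
CREATE TRIGGER trg_ExchangeRate_Touch
ON ExchangeRate
AFTER INSERT, UPDATE, DELETE
AS
    UPDATE CacheStatus SET LastUpdated = GETDATE();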
2. When should we lock the cache?
In either case, we should lock the cache while an update check or an update is happening. The granularity of the lock then depends on the granularity at which we want to refresh the data.
3. How should we overcome the data-retrieval latency that ultimately leaves stale data in the cache for some time?
For externally maintained data, it may not be possible. For self-maintained data, a sample implementation is discussed above.
4. If a thread is making changes to data in the database, how can we make sure the cache gets refreshed so that other threads get the fresh value?
I think I see your dilemma now. I would keep the cache as a static Hashtable or equivalent and always get data from it in all your threads (something like (MyObject)Cache.getFromCache(itemid)), and lock the getFromCache method while you are checking for updates.
However, if you are looking at code that is definitely never going to be clustered, you could keep data within the threads and use events and delegates to trigger cache refreshes: an implementation of the observer pattern or a state machine. A state machine fits if you have a dependency chain like
Thread A has cached data -> class B depends on that state of class A -> class C depends on the state of class B
where you want the change to propagate all the way to C before you release the locks.
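A minimal sketch of the events-and-delegates route (illustrative names; the subscribers play the roles of class B and class C above):

using System;
using System.Collections;

class ObservableCache
{
    // Dependents subscribe to learn when the cached data has been reloaded.
    public event EventHandler CacheRefreshed;

    private readonly Hashtable _items = Hashtable.Synchronized(new Hashtable());

    public object Get(object key)
    {
        return _items[key];
    }

    public void Refresh(IDictionary freshData)
    {
        lock (_items.SyncRoot)
        {
            _items.Clear();
            foreach (DictionaryEntry e in freshData)
                _items[e.Key] = e.Value;
        }
        // Notify class B, which can in turn notify class C.
        EventHandler handler = CacheRefreshed;
        if (handler != null)
            handler(this, EventArgs.Empty);
    }
}

// Usage: cache.CacheRefreshed += delegate { RecomputeStateOfB(); };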
5. Should we update the cache inside the transaction or after the transaction completes (in both the stale-data-tolerant and stale-data-intolerant scenarios)?
It should be a part of the atomic transaction in either case.
6. If one thread is reading or writing a value in the cache, how do we block other threads from accessing that cache object?
Keep the cache in one static class and implement a lock on the read-from-cache method **only when an update or a check for update is happening** (I am not talking about synchronous, one-at-a-time access). You could take an explicit lock on the object/method, or you could use a structure like Hashtable.Synchronized, which already has this implemented: getters are blocked while setters are running.
Based on what I think your requirements are (self-maintained data in a database, potential clustering, potentially no tolerance for stale data, TTL rather than LRU), I would say the following will fit your purpose best:
1) Choose something like a static synchronized Hashtable for your cache.
2) For a cluster, you will need to poll the database to pick up changes.
3) Poll for changes on a timeout or on every get, depending on whether you can tolerate stale data.
4) Define a coarse granularity of cache expiry (otherwise polling for cache expiry can offset any gains from having the cache in the first place).
If I were writing a generic cache, I would probably go for three different implementations: externally maintained items, self-maintained items, and an LRU cache, and maybe a fourth for non-clustered applications such as games.
Do post a comment and I will try to keep up with responses.
August 26, 2006
Cache - the concepts
I have migrated my blogs to http://pranshujain.wordpress.com; this entry now lives there.
In the early days of the Internet, when memory was expensive and CPUs were less powerful, and the dreams were big and the budgets were small, caching was perhaps the biggest buzzword in software architecture. Today we have CPUs at least 10x more powerful, memory at least 10 times cheaper, and hardware and software that can be scaled many times over (who would have imagined Windows supporting 64 processors?) - and cache is still there, even supported out of the box in some web programming languages!
I came across a question on the MSDN architecture forum which triggered this note.
There are three distinct kinds of scenarios, which need different kinds of caching and cache expiry.
1) Most Frequently Used: Imagine that you are building an application like Amazon.co.uk. You have millions of items in the catalogue, and you can fetch details about any of them at runtime from the database. We know that book descriptions do not change very often, hence we do not need to hit the database every time we show the details of a book or other item. However, the sheer size of the catalogue is so large that we cannot (no, not even now) contemplate keeping all of it in the application server's memory. The solution is to keep the most frequently used items in the cache and remove the rest. The expiry algorithm here removes the Least Recently Used item from the cache, so these are called LRU caches. It is funny how the cache is named after its expiry mechanism and not how elements get into it.
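A minimal LRU sketch in C#, using a dictionary for O(1) lookup plus a linked list to track recency (illustrative, not production code):

using System.Collections.Generic;

class LruCache<TKey, TValue>
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _map;
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _order; // front = most recent

    public LruCache(int capacity)
    {
        _capacity = capacity;
        _map = new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>(capacity);
        _order = new LinkedList<KeyValuePair<TKey, TValue>>();
    }

    public bool TryGet(TKey key, out TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (_map.TryGetValue(key, out node))
        {
            _order.Remove(node);   // touch: move to the front
            _order.AddFirst(node);
            value = node.Value.Value;
            return true;
        }
        value = default(TValue);
        return false;
    }

    public void Put(TKey key, TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (_map.TryGetValue(key, out node))
        {
            _order.Remove(node);   // replace an existing entry
            _map.Remove(key);
        }
        else if (_map.Count >= _capacity)
        {
            _map.Remove(_order.Last.Value.Key); // evict the least recently used
            _order.RemoveLast();
        }
        node = new LinkedListNode<KeyValuePair<TKey, TValue>>(
            new KeyValuePair<TKey, TValue>(key, value));
        _order.AddFirst(node);
        _map.Add(key, node);
    }
}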
If we think a bit more, we will realize how old this concept is. Computer architecture, memory management, virtual memory: a quick look at Wikipedia for Virtual Memory shows that this concept has been around since 1959! Phew!
2) Time to Live: For a lot of data the resources are abundant and we can cache and manage everything in memory. If you have subscribed to RSS feeds via any of the readers, like http://www.live.com/ or www.google.co.uk/ig, you know that the data you see when you launch these pages may not be the latest. These readers do not go to the source very frequently for updates; they fetch the feeds and keep them in memory for about 2 hours, or whatever time you specify. This kind of caching is named "Time to Live" or TTL, again because each item in memory has a certain time to live before it expires. You can see this kind of cache on eBay (you will notice that while viewing a list of items, the bid price does not always reflect what is happening inside the auction, especially for items that are ending soon).
3) Event based: Here we cache the data until an external event forces the cache to expire. Some online travel sites cache the hotel availability information until a booking attempt fails with "no rooms available"; that error is the trigger for cache expiry. Similarly, B2C news sites have an explicit "publish" action which clears the cache and lets the users see the latest articles.
In all three scenarios, the characteristics of the cache are in fact the characteristics of cache expiry.
The next thing to consider is the granularity of the cache.
Many times, information comes in packets, and hence the cache should also expire in packets.
Let's say you have an application showing the departure boards of all London airports. You will likely get a feed every "x" minutes from each airport, and when it arrives you want to expire the status of all flights for that airport and reload them. It may be too much effort to compare individual flights and update selectively; in some cases that processing could outweigh the benefit of the cache in the first place.
It is usually easy to identify the parameters for caching, and for each distinct combination of those parameters we can assign a different "cache set". For example, search results depend on the query the user typed; they may also depend on the language option and any other structured search field the user selected. Caching in that case takes all the structured search options as parameters, and if any of them changes, the cached value cannot be used and a fresh query has to run, as sketched below.
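A minimal sketch of deriving such a cache-set key (the parameters here are illustrative):

static class SearchCacheKeys
{
    // Any change in a parameter yields a different key, hence a different cache set.
    public static string BuildCacheKey(string queryText, string language, string category)
    {
        // Normalize so logically identical searches map to the same cache set.
        return string.Join("|", new string[] {
            queryText.Trim().ToLowerInvariant(),
            language.ToLowerInvariant(),
            category.ToLowerInvariant()
        });
    }
}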
I will talk more about caching and cache implementation in my next post.