I haven't gone far down the design path yet, but I'd like to put this out there, because I'm wondering (a) whether anyone is doing something similar and (b) what you guys think is the best way to handle it.
This is really about doing caching intelligently. A while back, I spent a long time experimenting with how NH handles caching and invalidation, because I never found much documentation about it. I meant to write it up and then didn't, so I'll note a few findings here (maybe I got some of this wrong; I'm not an NH insider, and I'm hoping someone on the inside can enlighten me) before actually asking my question.
General Lessons Learned
1. The hard thing about caching of any kind is usually invalidating the cache -- making sure you do it often enough, but not unnecessarily.
2. Entity Caching is fairly straightforward / simple. Collection Caching is not. (By "Collections" I mean both mapped Persistent Collections and HQL / Criteria queries.) NH caches entities separately from collections, meaning that cached collections are usually just a list of IDs, which NH uses to create proxies that will eventually be initialized. That means that when a collection is loaded from the cache, NH essentially calls Session.Load() for every item in the collection. This is very different from NH's approach to retrieving an uncached collection from the DB, which is to load both the collection and the associated entities at the same time. The risk here is that caching collections can actually hurt performance if you don't think it through. Two inferences follow (a mapping sketch follows this list):
- You should usually enable Entity Caching on any entities that are contained in cached collections.
- Batch loading is very helpful when caching collections / queries.
3. Although people say that Queries are a viable replacement for Collections, Persistent Collections and Queries are apples and oranges.
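To make item 2's inferences concrete, here's roughly what I mean in mapping terms. This is just a sketch: Foo, Bar, and the columns are made up, and the cache usage / regions obviously depend on your setup.

    <!-- Hypothetical Foo / Bar mapping. The point: when a collection (or query)
         cache hit hands back only IDs, the entity-level cache plus batch-size
         are what keep hydration from turning into one SELECT per row. -->
    <class name="Foo" table="Foo" batch-size="20">
      <cache usage="read-write"/>
      <id name="Id" column="Id">
        <generator class="native"/>
      </id>
      <set name="Bars" inverse="true" batch-size="20">
        <cache usage="read-write"/>  <!-- the collection cache stores only Bar IDs -->
        <key column="FooId"/>
        <one-to-many class="Bar"/>
      </set>
    </class>

    <class name="Bar" table="Bar" batch-size="20">
      <cache usage="read-write"/>    <!-- entity cache for the Bars referenced above -->
      <id name="Id" column="Id">
        <generator class="native"/>
      </id>
      <many-to-one name="Foo" column="FooId" class="Foo"/>
    </class>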
Batch Loading
I haven't seen many people talk about batch loading, which is a shame, because it's great. It works like this: the NH session keeps track of every proxy it generates, and when one of them is initialized, it loads a batch of them at the same time, using a load query like "select * from Foo where id in (1,2,3,4,5,...)".
So if:
1. I've set the Foo class to have a batch size of 20
2. I've got a collection that will contain 100 Foo instances when loaded
Then:
1. Loading my collection from the DB will (depending on fetch strategy) probably issue an outer join query that loads 100 Foos in 1 SQL statement
2. If I load the collection from the cache and the Foos themselves are not in the cache (and that does happen), it will load 100 Foos using 5 SQL statements, whereas without batch loading, NH would issue 100 SQL statements, loading the entities one at a time.
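Here's roughly what that looks like from code, assuming a mapping like the sketch above (Foo / Bar, fooId, and Name are made up, and a sessionFactory is in scope):

    using (var session = sessionFactory.OpenSession())
    {
        var myFoo = session.Get<Foo>(fooId);

        // Suppose myFoo.Bars (100 items) comes out of the second-level cache as a
        // list of IDs: each element starts life as an uninitialized proxy.
        foreach (var bar in myFoo.Bars)
        {
            // Initializing the first proxy pulls a batch of 20 via
            // "select ... from Bar where Id in (...)", so 100 Bars cost 5 selects
            // instead of 100. Without batch-size it's one select per Bar.
            Console.WriteLine(bar.Name);
        }
    }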
FYI - I noticed a bug in NH a while back in that the NH Query Cache was not doing any batch loading after retrieving cached query results. This is a huge deal, IMO. I logged a bug and a patch for it, but it appears to have stalled and was not addressed in the new NH 2.0GA release, so if you care about this issue, you may want to take a look at:
http://jira.nhibernate.org/browse/NH-1247

IQuery / IFilter / ICriteria / ISQLQuery Caching
Let's just call this "query" caching.
NH's support for query caching is great. It can cache paged queries (very important for us) and lets you specify custom cache regions.
NH is fairly sophisticated about invalidating cached queries. It keeps track of which "spaces" a query got data from (e.g. the Foo space), and whenever changes are made in that space, all related query caches are invalidated. For example, if I have a query that loads Foo instances, its cache will be invalidated every time anyone saves or updates a Foo. That means there is a very low chance of me getting stale data, but it also means a lower probability of my cache being hit.
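For reference, this is the shape of query caching I'm talking about. Foo and the region name are made up, and it assumes the query cache is enabled in config (cache.use_query_cache) with a cache provider set up:

    // Cached, paged query with a custom region; the query cache stores the
    // matching IDs, not the entities themselves.
    IList<Foo> page = session.CreateQuery("from Foo f order by f.Id")
        .SetFirstResult(40)            // paging and caching work together
        .SetMaxResults(20)
        .SetCacheable(true)
        .SetCacheRegion("Foo.Pages")   // optional custom region
        .List<Foo>();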
Unfortunately there are some drawbacks to query caching:
1. Cache invalidation happens very often, usually more often than desired
2. If I want to explicitly manage invalidating the cache myself, there appears to be no way to tell NH to NOT invalidate all my Foo queries every time I modify anything related to Foo.
3. There is no way to explicitly manage results, i.e. adding / removing items from a specific set of cached query results. All you can do is invalidate the cache and requery the DB (a small eviction sketch follows this list). This is a fairly minor complaint most of the time, but it's important to understand when comparing against the benefits of persistent collections.
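The only explicit handle I know of is evicting a whole region (or the default query cache) yourself, which is still all-or-nothing. Region name here is the made-up one from the earlier snippet:

    // Drops every cached result in that region; the next query that uses the
    // region goes back to the DB. There's no per-result control.
    sessionFactory.EvictQueries("Foo.Pages");

    // Or, with no argument, the default query cache region:
    sessionFactory.EvictQueries();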
Persistent Collection Caching
NHibernate's Persistent Collections (i.e. Bag, List, Set) are the most efficient way to cache a query and manage changes to it, because you can explicitly add/remove items in entity1.Collection without invalidating the cached entity2.Collection. This makes for some really effective caching.
The drawbacks of persistent collection caching are:
1. Cached Persistent Collections (including inverse ones) must be explicitly updated or they will get stale. This isn't that big a drawback, and is arguably a strength, but it's important to understand the implications. Some people think they don't need to update the inverse side of a relationship, and in a short-session app like a web app they can often get away with it; that just won't work if the collections are cached. Similarly, NH persistent collections ignore changes in their "entity space". In other words, if myFoo.Bars is a cached, inverse collection and I create and save a new myBar = new Bar { Foo = myFoo }, that will not result in my collection being updated. I have to explicitly call myFoo.Bars.Add(myBar) in order to update the collection (see the sketch after this list). Usually this is not a big deal, but if I make a change that affects an unknown number of Bar collections, I should probably just tell the SessionFactory to evict the entire cache for that collection definition.
2. Cached Persistent Collections cannot currently retrieve data in pages (by specifying startRow, maxResults). Yes, I can create an IFilter around a collection and get pages that way, but it basically turns into query caching at that point, rather than Persistent Collection caching. For that reason I say that true Persistent Collection Caching does not support paging data.
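To spell out drawback 1 (and the IFilter workaround from drawback 2) in code. Foo / Bar and the role name are placeholders, and I'm going from memory on the role-name format:

    // Keeping a cached, inverse collection honest: update BOTH sides.
    var myBar = new Bar { Foo = myFoo };   // satisfies the FK / inverse mapping
    myFoo.Bars.Add(myBar);                 // keeps the cached myFoo.Bars current
    session.Save(myBar);

    // If a change touches an unknown number of Bar collections, punt and evict the
    // whole collection region (role name = full class name + "." + property):
    sessionFactory.EvictCollection("MyApp.Domain.Foo.Bars");

    // The IFilter route to paging a collection -- it works, but at that point it's
    // really query caching, not collection caching:
    var firstTwenty = session.CreateFilter(myFoo.Bars, "")
        .SetFirstResult(0)
        .SetMaxResults(20)
        .List<Bar>();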
Wish List
In a nutshell, I want / need a cached collection where I can:
- Get (cached) paged (or you could call it "batched") results, AND
- Explicitly add/remove items directly in the cached collection without requiring a DB query to reload it, AND
- Do a configurable "contains" query that checks the cached result pages to see if the entity is there and, if it still isn't found, issues a DB query in the form of either (1) a select that returns a single row in order to confirm containment, or (2) a paged query that incrementally loads pages of the remaining items until it's found.
Tell me if that sounds crazy. I think it's a legitimate need, and it sounds feasible to me. Also, most of those features exist in some form distributed throughout NH, but there's no one place where I get all of them, unless I'm missing something (am I??).
The big question is how best to implement it. I could either do it inside of NH or outside of it.
1. Outside of NH, I could just write a little CachingCollectionHelper that uses IQuery, ICriteria, or IFilters to load paged ID sets / datasets and then takes care of the caching / hydrating itself, rather than relying on NH (a rough skeleton follows this list). Whenever I update inverse collections, I would need to explicitly add/remove items through my CachingCollectionHelper. The drawbacks are that I'd have to explicitly manage all my helpers at the application layer, and I might have to write an IInterceptor in order to really honor the Session lifecycle with transactions, flushing, etc. But it might be easier than the alternative:
2. Write something that runs inside of NH. The NH team seems to recommend not using Persistent Collections when performance is important (http://www.hibernate.org/117.html#A10), but when it comes to intelligent caching I think there's more of an argument for something that's plugged into the whole NH ecosphere than for a helper class that invokes a bunch of queries. Inside of NH, I could try to write my own CollectionPersister, which might be the right way to go, but it sounds incredibly daunting. I'm wondering if anyone has ever done this, or if there's a good sample / starting point for me to look at. Obviously it would be difficult (or impossible?) to provide mapping support for it, so I'd probably end up with a lot of custom classes extending it.
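To make option 1 less hand-wavy, here's the rough skeleton I have in mind. It's purely a sketch: the in-memory dictionary stands in for a real cache provider, the names are made up, and I haven't proven any of it out.

    using System.Collections.Generic;
    using NHibernate;

    // Sketch of option 1: cache pages of IDs outside NH and hydrate them through
    // the session, so the entity cache and class-level batch-size still apply.
    public class CachingCollectionHelper<T> where T : class
    {
        // Stand-in for whatever cache provider would really be used (not thread-safe).
        private static readonly IDictionary<string, IList<object>> Cache =
            new Dictionary<string, IList<object>>();

        public IList<T> GetPage(ISession session, IQuery idQuery,
                                string cacheKey, int startRow, int pageSize)
        {
            string pageKey = cacheKey + ":" + startRow + ":" + pageSize;

            IList<object> ids;
            if (!Cache.TryGetValue(pageKey, out ids))
            {
                // Cache only identifiers -- the same trick NH's collection cache uses.
                // idQuery is expected to select just IDs, e.g. "select f.Id from Foo f".
                ids = idQuery.SetFirstResult(startRow)
                             .SetMaxResults(pageSize)
                             .List<object>();
                Cache[pageKey] = ids;
            }

            // Load() hands back proxies, so hydration still benefits from batch-size
            // and from the entity-level second-level cache.
            var page = new List<T>(ids.Count);
            foreach (object id in ids)
                page.Add(session.Load<T>(id));
            return page;
        }

        // The "explicitly manage the cached collection" part: callers add/remove IDs
        // instead of evicting and re-querying. Stubs here; the real versions would
        // need to respect the Session lifecycle (probably via an IInterceptor).
        public void Add(string cacheKey, object id) { /* append id to the cached pages */ }
        public void Remove(string cacheKey, object id) { /* drop id from the cached pages */ }
    }

The "contains" piece from the wish list would sit on top of this: scan the cached ID pages first, then fall back to either a single-row existence select or paging through the remaining rows.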
Thoughts anyone?