Updating a single value across a large number of Documents

mschipperheyn · **Joined:** Wed Nov 05, 2003 7:22 pm **Posts:** 211

The scenario is this. Imagine a Facebook-like Wall where users can post thousands of messages, and with each post their image is shown. The image filename is stored in a field of the Post document. Now imagine the user uploads a new photo.

The standard way to make sure that photo field gets updated is to have a OneToMany relation between the User and the Post entity, with the corresponding Post entity having a ContainedIn for the User reference.
But do you really want to instantiate thousands of objects in order to update one field? And do you want to risk Hibernate's magic deciding it needs to instantiate this collection at any point?

The MassIndexer and the corresponding filter functionality (HSEARCH-499), hopefully part of 4.1.0.Final, offer a good alternative in this scenario. You can just run the MassIndexer against the Post index, filtered on the user. However, that will still generate quite a lot of database traffic.

So I'm wondering whether a more fine-tuned, lower-level approach is possible: just updating the single field in one fell swoop across all of the Lucene Documents that match user.id:x. In Lucene I believe this is achieved by retrieving the document, deleting the field and adding it with the new value.

Does this mesh with Hibernate Search?

Cheers,
Marc

sanne.grinovero · **Posted:** Thu Mar 29, 2012 7:21 pm

Hi Marc,
Well, that would be really, really nice. But Lucene doesn't allow you to update a single field: you need the original (pre-tokenized) values for all the fields, and you have to rebuild the complete Document instances.
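To make the point concrete, here is a minimal toy sketch of those update semantics. It is an in-memory model, not Lucene's API: `ToyIndex`, `post` and `example` are hypothetical names invented for illustration. In Lucene itself the corresponding call is `IndexWriter.updateDocument(Term, Document)`, which deletes the documents matching the term and then adds the new, complete document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Toy in-memory stand-in for an index. NOT Lucene's API: it only mimics
// the "delete the old document, add a complete new one" update semantics.
class ToyIndex {
    final List<Map<String, String>> docs = new ArrayList<>();

    void add(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
    }

    // Remove every document whose field equals value; returns how many were removed.
    int deleteWhere(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> Objects.equals(value, d.get(field)));
        return before - docs.size();
    }

    // There is no "set one field" operation: the caller must hand over the
    // whole replacement document, built from the original source values.
    void update(String idField, Map<String, String> fullDoc) {
        deleteWhere(idField, fullDoc.get(idField));
        add(fullDoc);
    }

    // Hypothetical helper that builds the full document for one post.
    static Map<String, String> post(String id, String userId, String text, String photo) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", id);
        doc.put("user.id", userId);
        doc.put("text", text);
        doc.put("user.photo", photo);
        return doc;
    }

    // Usage: user 42 uploads a new photo, so each of that user's posts is
    // rebuilt in full, even though only one field actually changed.
    static ToyIndex example() {
        ToyIndex index = new ToyIndex();
        index.add(post("1", "42", "first post", "old.png"));
        index.add(post("2", "42", "second post", "old.png"));
        index.add(post("3", "7", "someone else", "other.png"));
        index.update("id", post("1", "42", "first post", "new.png"));
        index.update("id", post("2", "42", "second post", "new.png"));
        return index;
    }
}
```

Note that the replacement documents need the post text again, which is why the entities have to be loaded from the database (or a cache) in the first place.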

So very fine grained is not possible, but you can still get some good efficiency:

Quote:

But do you really want to instantiate thousands of objects in order to update one field? And do you want to risk Hibernate's magic deciding it needs to instantiate this collection at any point?

Enable a second level cache. You should really have all relations lazily loaded, if possible, and with a second level cache you'll not reload the same value more than once. You'll generate some temporary objects in memory, but they are very short lived and the GC knows how to optimize that. Unfortunately Lucene requires all the objects, so that's what we feed it.

Quote:

The MassIndexer and the corresponding filter functionality (HSEARCH-499), hopefully part of 4.1.0.Final

Could you explain that? I'm missing something..

Lucene developers are working on a special kind of join which would let you split a Document in two (but not more than two), so that only one of the two parts needs to be updated. Once that is ready, we'll think about how to expose it in a way that makes sense, without the user having to know all the details of the Lucene index format.