Is Hibernate Search too complex for clustered applications?

mdoraisamy · **Joined:** Sun Jun 11, 2006 12:28 pm **Posts:** 2

Context: I had made some comments on TheServerSide about choosing a database instead of JMS for storing the Lucene index, since I felt that JMS introduces too much complexity.
http://www.theserverside.com/news/threa ... d_id=47238
This post is meant to be a follow-up discussion of the same thread.

1) If the indexes are stored as tuples (unlike a file system index, which gets locked completely), I do not see any reason why it can't scale.
2) If you are justifying a JMS architecture to scale search indexing (which could be less than 10% of the persistence needs in an app), why are we not doing that for all persistence needs (including Hibernate)?
3) For all this complexity, basic transactional integrity is not guaranteed. While Hibernate Search talks about transactions, if the transaction fails, integrity goes for a toss (with a Lucene lock exception) and indexing for subsequent users/threads fails.

thanks,
mani

emmanuel · **Posted:** Sat Oct 27, 2007 1:02 pm

Hi Mani
Concerning your first point, I don't know of a single implementation of this tuple approach. AFAIK, the JDBCDirectory implementations out there all depend on a pessimistic lock at the global level and store each segment in a blob, so they essentially mimic the file structure. Feel free to give it a try, but I suspect that you will need to change how Lucene interacts with the Directory. It might not even be possible, since Lucene does more than just store the documents: it computes statistics, creates its dictionary, etc., and stores this metadata. So some sort of aggregation is necessary.
Also remember that a given index reader is guaranteed to see the state at the time of opening (and not the subsequent data changes). This means you might have to go up to the serializable isolation level, depending on how your database structure is modeled, which would defeat your scalability goal.
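
To make the "state at the time of opening" guarantee concrete, here is a minimal sketch in plain Java. It is not Lucene code: `SnapshotIndexSketch`, `add` and `openReader` are hypothetical names, and the "reader" is just an immutable copy of a list. It only illustrates why a store backing such readers has to offer a stable point-in-time view.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy index (hypothetical, not the Lucene API) with point-in-time readers. */
final class SnapshotIndexSketch {
    private final List<String> documents = new ArrayList<>();

    /** Writer side: appends a document to the live state. */
    synchronized void add(String document) {
        documents.add(document);
    }

    /** Reader side: returns an immutable copy, so later writes stay invisible to it. */
    synchronized List<String> openReader() {
        return Collections.unmodifiableList(new ArrayList<>(documents));
    }

    public static void main(String[] args) {
        SnapshotIndexSketch index = new SnapshotIndexSketch();
        index.add("doc-1");
        List<String> reader = index.openReader();
        index.add("doc-2");                     // written after the reader was opened
        System.out.println(reader);             // prints [doc-1]
        System.out.println(index.openReader()); // prints [doc-1, doc-2]
    }
}
```

A database-backed Directory would have to give every open reader the same kind of frozen view across all the rows it touches, which is where the isolation level concern comes from.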

2. First of all, I consider the index a non-vital piece of information, and practically speaking most uses of a full-text search engine can tolerate some delay between the data change and its propagation to the index. I am also not a big fan of not being able to store my data (a customer order) just because my auxiliary indexing system is unavailable.
But imagine a bank system: I cannot afford a delay between a withdrawal operation and reading the account balance to verify whether another withdrawal is allowed, hence I will not use this asynchronous architecture.
Plus, contrary to Lucene indexes, most data is accessed with the read committed or repeatable read isolation level, which is much more scalable than the serializable isolation level Lucene relies on. There is no need to go asynchronous to scale up.

3. I am not sure I follow you.
Do you mean that Lucene indexing can fail due to lock exceptions? Actually no, because this architecture prevents concurrent access to IndexWriters.
But let's say something else fails, like not being able to access the file system. In this case the JMS message will be rolled back and reapplied later on. So no indexing operation is lost.
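
The rollback-and-redeliver behaviour can be sketched without any JMS dependency. The class below is a toy stand-in for a transacted queue, not the JMS API: `RedeliveryQueueSketch`, `send` and `deliverNext` are hypothetical names, and the "index" is a plain list. It only shows the idea that a failed indexing attempt leaves the message queued for a later retry.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

/** Toy transacted queue (hypothetical, not the JMS API). */
final class RedeliveryQueueSketch {
    private final Deque<String> pending = new ArrayDeque<>();

    void send(String indexingWork) {
        pending.addLast(indexingWork);
    }

    int size() {
        return pending.size();
    }

    /**
     * Offers the oldest message to the consumer. If the consumer throws, the
     * message stays at the head of the queue ("rollback") and is offered again
     * on the next call. Returns true only when the message was consumed.
     */
    boolean deliverNext(Consumer<String> consumer) {
        String message = pending.peekFirst();
        if (message == null) {
            return false; // nothing to deliver
        }
        try {
            consumer.accept(message);
        } catch (RuntimeException e) {
            return false; // rollback: message is kept for redelivery
        }
        pending.removeFirst(); // commit: message is gone for good
        return true;
    }

    public static void main(String[] args) {
        RedeliveryQueueSketch queue = new RedeliveryQueueSketch();
        List<String> index = new ArrayList<>();
        boolean[] fileSystemUp = {false};

        // Stand-in for the indexing consumer; fails while the directory is unreachable.
        Consumer<String> indexer = work -> {
            if (!fileSystemUp[0]) {
                throw new IllegalStateException("index directory unavailable");
            }
            index.add(work);
        };

        queue.send("add Order#42");
        System.out.println(queue.deliverNext(indexer)); // prints false
        fileSystemUp[0] = true;
        System.out.println(queue.deliverNext(indexer)); // prints true
        System.out.println(index);                      // prints [add Order#42]
    }
}
```

A real broker adds redelivery delays, retry limits and dead-letter handling on top of this, and those details depend on the provider.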

So to summarize, JDBCDirectories as they have been implemented so far do not fundamentally scale out because of the global lock Lucene mandates. Most use blobs, which are also not known as the most scalable part of an RDBMS (when they even work). Data and index (meta)data are different beasts wrt their importance; there is nothing wrong with dealing with them differently. And frankly, setting up a JMS queue and an MDB does not take long; I don't see the complexity here.

Lath · **Joined:** Tue Aug 21, 2007 5:57 pm **Posts:** 7

I think that there are many applications where users expect search to return results that are up to date. If I create a new piece of information, then search for it (or tell my friend to search for it), I expect it to be found, and if it's not, I'll get worried.

Is there a way to use Hibernate Search that is Distributed (clustered), Scalable (plenty of updates and queries), and Up-to-Date (live, current search results)?

The way I see it, you have to choose 2 out of the 3.

A) Scalable and Up-to-Date: Use a local file index on a single box. Can do plenty of updates / queries per second and is always up to date. Search will scale with your app just fine, however your app might not scale too far if confined to one box.
B) Distributed and Scalable: Use mirrored master/slave files on separate boxes to store your index with a JMS asynchronous update system. However, your searches may be 30 minutes behind reality.
C) Up-to-Date and Distributed: Use a database to store your index. However, reading and writing large BLOBs (from what I've read) won't scale too far before slowing down on many updates or queries.

Is there a way to store indexes with HSearch that is (D) Up-to-Date, Scalable, and Distributed?

emmanuel · **Posted:** Wed Oct 31, 2007 1:02 am

Can you list your many applications, or at least the fields, where a full-text search feature needs to be up to date in a "transactional way"? Reminder: your SQL queries will be up to date.

Your question is actually not specific to Hibernate Search but applies more generally to Lucene. Let's put a disclaimer first: you can deal with plenty of queries and updates on a single Lucene box. To define "plenty" we would have to implement your app and run it in your real environment; any other assumptions are more or less BS.

That being said, theoretically speaking you cannot have scalable (plenty of updates) and up-to-date at the same time in Lucene, since you can only have one writer at a time and the beast is thread safe.
So A is not exactly right.

B is right, but depending on how long the copy takes, 30 mins could be 1 min.

C: using a DB (regardless of blobs) faces the same problem as using a shared file system: you need a pessimistic lock mechanism and you need remote communication. So you face the scalability issue potentially and your queries will be slower (due to the remote IO).

Now, can you name me a single RDBMS system that is D? So really the search for D is a mind exercise.

The reasons behind the asynchronous model with local Directory replication are:
- query performance (local IO)
- user response time (do not penalize the main process with indexing work)
- file based locks and NFS don't always play well together
It happens to be a system that can be made distributed easily.

Lath · **Joined:** Tue Aug 21, 2007 5:57 pm **Posts:** 7

Thanks for your quick response, Emmanuel.

You're correct that I should have given more specific information on the requirements of the application I'm working on. I think a single Lucene box would absolutely be able to scale to handle the quantity of updates and queries that I expect. I'm not looking for something that necessarily stores the index in a distributed fashion, but something that would integrate with a distributed application using Hibernate.

I will try to outline our requirements more specifically. We have an application that is running distributed on several boxes. We're using Hibernate for persistence. Right now we're handling some text queries via SQL clauses like WHERE someField LIKE '%XYZ%'. As you pointed out, the SQL query is always up to date, which is great. However, this query is already slow and I anticipate it will become slower as our dataset grows.

I'm looking for a search solution that will be faster (say <200ms) as well as up to date (say within 5 seconds). Hibernate Search looks great in how it plugs in to the existing persistence layer, updates the data on successful transactions, and will run on a distributed application. The only place where it seems to fail our requirements is in having up to date queries.

One thing I'm considering is using Hibernate Search to manage distributed updates, using the JMS solution to update a master Lucene file based index, but using a different layer to run queries. For a query put to any box, the app would pass the query to some service on the box with the master index, perhaps Solr. This way, the index data doesn't need to be transferred (as it would for a JDBCDirectory), merely the query and results.

Alternatively, I could separate the query system entirely. I could run a box with a service like Solr. When a transaction is committed, I could use the callbacks to submit updates to the service, and later perform queries via the service. Unfortunately, it seems like I would be duplicating a portion of what has already been done with Hibernate Search.

So it would be great to have a query engine that is not itself distributed, but that could serve queries and integrate with an existing, distributed, Hibernate-based application. I would love to hear other suggestions as well.

Thanks for all the hard work you've put into this, Emmanuel.

emmanuel · **Posted:** Wed Oct 31, 2007 6:19 pm

Hibernate Search is quite flexible and can be adjusted to a lot of different architectures (even if I only promote 2 of them for simplicity).

I would give this solution a try:

Use the event system to capture all changes and propagate them through JMS (as you said).
Instead of using the Directory replication (really a couple of specialized DirectoryProviders), just use the simple providers and point all the slaves to the master directory. For pure queries, there is no lock involved. Your queries will be slower than with a local directory but it might just work perfectly fine for your requirements. Hibernate Search caches the IndexReader for better performance out of the box, but you can alter the caching strategy to fit your needs in a more fine grained way (let's say refresh the indexReader every 5 secs :) )
http://www.hibernate.org/hib_docs/search/reference/en/html_single/#search-architecture-readerstrategy
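
The "refresh every 5 secs" idea boils down to a time-bounded cache around whatever opens the reader. Below is a minimal sketch in plain Java under stated assumptions: `TimedRefreshSketch` is a hypothetical name, the reader is any object produced by a supplier, and the clock is injected so the behaviour can be checked. It is not the Hibernate Search reader strategy contract; a real strategy would plug into the extension point documented at the link above and would also have to close readers it no longer hands out.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical time-bounded reader cache; not the Hibernate Search API. */
final class TimedRefreshSketch<R> {
    private final Supplier<R> opener;       // opens a fresh reader on the shared directory
    private final LongSupplier clockMillis; // injected clock, e.g. System::currentTimeMillis
    private final long maxAgeMillis;        // how stale a cached reader may get
    private R current;
    private long openedAt;

    TimedRefreshSketch(Supplier<R> opener, LongSupplier clockMillis, long maxAgeMillis) {
        this.opener = opener;
        this.clockMillis = clockMillis;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Returns the cached reader, reopening it once it is older than maxAgeMillis. */
    synchronized R get() {
        long now = clockMillis.getAsLong();
        if (current == null || now - openedAt >= maxAgeMillis) {
            // Note: this sketch does not close the previous reader.
            current = opener.get();
            openedAt = now;
        }
        return current;
    }

    public static void main(String[] args) {
        long[] fakeNow = {0L};
        int[] opens = {0};
        TimedRefreshSketch<String> readers =
                new TimedRefreshSketch<>(() -> "reader#" + (++opens[0]), () -> fakeNow[0], 5_000L);

        System.out.println(readers.get()); // prints reader#1
        fakeNow[0] = 4_999L;
        System.out.println(readers.get()); // prints reader#1 (still fresh)
        fakeNow[0] = 5_000L;
        System.out.println(readers.get()); // prints reader#2 (reopened)
    }
}
```

With a 5 second bound, a query on any slave sees changes at most about 5 seconds after the master has written them, plus whatever time the JMS hop takes.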

Lath · **Joined:** Tue Aug 21, 2007 5:57 pm **Posts:** 7

Thanks for the suggestion. I will give that a try.