Async queue implementation in HS

FroMage · **Joined:** Tue Apr 01, 2008 11:10 am **Posts:** 69

Hello,

Looking at the code (investigating the behaviour) of HS's DocumentBuilder I notice that even when the queue is asynchronous, it's only the Lucene processing which is asynchronous: the entity is scanned immediately in the current transaction. DocumentBuilder.addWorkToQueue loads the Document from the entity synchronously.

Is there a reason why this can't be postponed asynchronously too? Just recording which entity ID needs to be reindexed asynchronously and loading it later from the worker thread?

sanne.grinovero · **Posted:** Fri May 15, 2009 6:25 am

Hi,
the Document is not built immediately but at transaction commit, if the commit is successful.
This way the state of the object being indexed is coherent with the state of the current transaction (which is about to be closed), and many objects may already be initialized. If you postponed this operation asynchronously, you would have to start a new transaction and reload the information from the database.

Also, the Document being built is not analyzed at commit time but in the backend (asynchronously if you want it to be), so nothing expensive actually happens besides making sure that all needed data is loaded and bridged to string form. The "bridge" operation can't be postponed, as it might need to load more lazy fields, so it needs to stay in the same transaction.
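
A toy model of the split described above may help. It assumes nothing about Hibernate Search's real classes: `TwoPhaseIndexing`, `onCommit` and `processNext` are hypothetical names, the "bridge" is reduced to `String.valueOf`, and "analysis" to naive whitespace tokenizing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical illustration only; not Hibernate Search's actual API. */
final class TwoPhaseIndexing {
    /** Work queued at commit: plain strings, detached from the entity and its session. */
    private final BlockingQueue<Map<String, String>> backendQueue = new LinkedBlockingQueue<>();

    /** Phase 1, inside the transaction: read each needed property and bridge it to a string. */
    void onCommit(Map<String, Object> entityProperties) {
        Map<String, String> document = new LinkedHashMap<>();
        for (Map.Entry<String, Object> property : entityProperties.entrySet()) {
            document.put(property.getKey(), String.valueOf(property.getValue()));
        }
        backendQueue.add(document);
    }

    /** Phase 2, on a backend thread: the analysis step, which no longer needs the transaction. */
    List<String> processNext() {
        List<String> tokens = new ArrayList<>();
        Map<String, String> document = backendQueue.poll();
        if (document == null) {
            return tokens; // nothing queued
        }
        for (String value : document.values()) {
            tokens.addAll(Arrays.asList(value.toLowerCase().split("\\s+")));
        }
        return tokens;
    }
}
```

The point of the sketch is that phase 1 is the only part that touches entity state, which is why it cannot leave the transaction without a reload.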

FroMage · **Joined:** Tue Apr 01, 2008 11:10 am **Posts:** 69

Well, I understand why it may be desirable to load things from the same transaction, but this is not asynchronous: I currently have a loop which inserts several objects through a DAO, each in its own transaction, and for each object I get to pay the HS cost. Not asynchronous to me. It is also not asynchronous for a single call, since the indexing is done at the end of the transaction. I'm not 100% sure here, but when called from a web services endpoint that presumably means before the reply is sent (since it's in the same transaction, and the reply should not be sent before the transaction is complete).

A real asynchronous backend would even be able to delay the indexing such that consecutive updates of the same entity in X minutes would only be indexed once, thereby saving tons of pointless indexing (think about search engines reindexing your web page every time you save your html file).

Obviously my problem is not general, doesn't apply to every case, possibly even only applies in a relatively small number of cases, but it may still be useful to make that an option. I thought when I told HS to be asynchronous it would be, but it's not what I call asynchronous in my case.
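
To make the "index once per X minutes" idea concrete, here is a minimal sketch, assuming a hypothetical `CoalescingReindexQueue` that only records entity IDs. It is not an existing HS feature, and the reindexing itself is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: remembers dirty entity IDs and hands each one out once per drain. */
final class CoalescingReindexQueue {
    private final Set<Long> dirtyIds = new LinkedHashSet<>();

    /** Called from the transaction: cheap, no entity loading, no Lucene work. */
    synchronized void markDirty(long entityId) {
        dirtyIds.add(entityId); // repeated updates of the same ID collapse into one entry
    }

    /** Called later from a worker: returns the pending IDs in first-seen order and resets the queue. */
    synchronized List<Long> drain() {
        List<Long> batch = new ArrayList<>(dirtyIds);
        dirtyIds.clear();
        return batch;
    }
}
```

A worker could call `drain()` every X minutes and reindex each ID in a transaction of its own, at the price of reloading the entities from the database.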

sanne.grinovero · **Posted:** Tue May 19, 2009 3:30 am

Quote:

I currently have a loop which inserts several objects through a DAO, each in its own transaction, and for each object I get to pay the HS cost.

So they are new objects; are they linked to existing objects which are going to be indexedEmbedded? If not, then all the cost you are paying is the conversion into string form, and all the rest of the processing is async; otherwise you are loading objects and should tune that (caching?).
Did you profile the application? Can you tell what kind of "cost" you are speaking about?
In the version in trunk you can keep the indexing work enabled (the in-sync part of the "cost") but disable the backend (the actual index writes, which are async); I added this only to let people verify the impact of the in-sync work.
It should be negligible, unless the entities contain huge and hard-to-convert information like whole books in PDF or office formats (there are solutions for this case too, using a lazy fields bridge).

Quote:

A real asynchronous backend would even be able to delay the indexing such that consecutive updates of the same entity in X minutes would only be indexed once, thereby saving tons of pointless indexing

You're right, expect improvements on this topic in the next release: we are experimenting with some across-transaction, whole-queue optimizations. This kind of optimization is already applied within a single transaction, so I'd suggest you try saving more than one new object per transaction. This is recommended anyway for all kinds of batch processing in Hibernate.

FroMage · **Joined:** Tue Apr 01, 2008 11:10 am **Posts:** 69

s.grinovero wrote:

So they are new objects; are they linked to existing objects which are going to be indexedEmbedded? If not, then all the cost you are paying is the conversion into string form, and all the rest of the processing is async; otherwise you are loading objects and should tune that (caching?).

@IndexedEmbedded and @ContainedIn.
The point is not that HS shouldn't be loading all these objects, the cost has to be paid obviously since we want to index. I'm just saying this isn't very asynchronous, especially if it triggers lots of DB fetching.

s.grinovero wrote:

Did you profile the application? can you tell what kind of "cost" you are speaking about?

I'd say it's not related. I noticed the synchronous cost of HS for two distinct reasons in two distinct places. In the first case, tons of SELECT statements while doing a simple REST call. In the second case, a deadlock whose stack trace shows it happening while HS is indexing my things in my transaction.

I don't believe the deadlock is HS's fault but rather related to something which went wrong in my transaction, but the point is that the deadlock is gone when I disable HS, so now I have to find out what it was. I'm not entirely sure but I suspect the problem would not occur if HS did this indexing in a transaction I didn't fuck up.

The cost of the synchronous indexing is not even an issue, it's just not asynchronous. All that is asynchronous is the Lucene index update; all the model/DB fetching/analysis + bridge invocation is synchronous.

s.grinovero wrote:

Quote:

A real asynchronous backend would even be able to delay the indexing such that consecutive updates of the same entity in X minutes would only be indexed once, thereby saving tons of pointless indexing

You're right, expect improvements on this topic in the next release: we are experimenting with some across-transaction, whole-queue optimizations. This kind of optimization is already applied within a single transaction, so I'd suggest you try saving more than one new object per transaction. This is recommended anyway for all kinds of batch processing in Hibernate.

Glad to hear there's going to be some work on that, is there a JIRA issue somewhere I can follow?

Saving more than one object per transaction is entirely beside the point. Think of a highly concurrent REST endpoint receiving hundreds of calls per second, many operating on the same entity: it's just madness to index the entity for every update. HS should have an option to wait for things to settle down before indexing (async with some delay and grouping).

Thanks for your replies in any case :)

FroMage · **Joined:** Tue Apr 01, 2008 11:10 am **Posts:** 69

I'm back with numbers on what I'm doing.

I'm inserting 1000 entities and following the batch insert advice in the Hibernate manual by flushing and clearing my session every 100 entities. I get a rate of 30 entities per second with HS disabled entirely. When I enable it (with the same batch_size of 100) I get delays when flushing my session while HS is indexing, and my insertion rate drops to 15 entities per second.

Maybe there's a setting I've fucked up (I tried setting the HS batch_size greater than the JDBC batch size, up to 1000, and it did get me up to 18 entities per second), but otherwise I think I'll keep on patching my HS (already super patched with speed improvements and new features that are in JIRA but never got picked up) and add a real async option myself if I can manage it.

Alternatively is there a way to disable HS indexing entirely in a single Session? This way I can then schedule a timer with the list of IDs to index asynchronously.
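
The timer idea could look roughly like the sketch below. Everything here is an assumption for illustration: `DeferredIndexing` is a hypothetical class, the `indexer` callback stands in for whatever would reload and index the entities, and the switch that disables HS for one Session (the actual question) is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: collects IDs during a batch and indexes them later on a timer thread. */
final class DeferredIndexing {
    private final List<Long> insertedIds = new ArrayList<>();

    /** Called once per inserted entity while the batch runs; only the ID is kept. */
    void recordInsert(long entityId) {
        insertedIds.add(entityId);
    }

    /** Called after the batch commits: hands a snapshot of the IDs to the indexer after a delay. */
    ScheduledFuture<?> scheduleIndexing(ScheduledExecutorService timer,
                                        long delayMillis,
                                        Consumer<List<Long>> indexer) {
        List<Long> snapshot = new ArrayList<>(insertedIds);
        insertedIds.clear();
        return timer.schedule(() -> {
            indexer.accept(snapshot); // runs outside the original transaction
        }, delayMillis, TimeUnit.MILLISECONDS);
    }
}
```

Scheduling only after the commit matters: if the batch rolls back, the recorded IDs can simply be discarded and nothing is indexed.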

sanne.grinovero · **Posted:** Fri May 22, 2009 12:14 pm

Hi Stéphane,
one year ago I was also complaining about some performance issues. Once I managed to make my case to Emmanuel (project leader) for some patches I wanted integrated, I became a contributor and, after a while, a committer.
I've since contributed more than 50 bugfixes and improvements, but I have a "day job" to manage too, so sorry if I'm sometimes slow to answer. I'm very interested in your findings if you share them.

If you have code patches for open JIRA issues it would be great if you could attach them. I'm sure Emmanuel will consider them for integration, or give some feedback when he is back (from a much-deserved vacation).

If you are digging into the Search code I'd really suggest checking out trunk, where new features and speed improvements are coming. These patches are about performance, and I have several experiments which are going to be committed during June:
http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-327
http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-218 (partially committed already, discussions and test welcome)

I think you could help with this one, as it's not complex and you can look at the RAMDirectoryProvider for an example of similar code:
http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-275
https://www.hibernate.org/462.html

Quote:

When I enable it (with the same batch_size of 100) I get delays when flushing my session while HS is indexing, and my insertion rate drops to 15 entities per second.

You're loading additional data for @IndexedEmbedded, right? That looks reasonable: you're limited by the delay the database introduces when it has to execute the second query. This doesn't mean you're limited to 15 entities/second, of course. If you start 10 different tests in parallel you're probably going to scale to 150 entities/second, because your system is idle while it waits for the database answer; you're not burning resources. I exploit this concept to get some impressive numbers for the automatic indexing routines (HSEARCH-218): 12000 entities/second for a simple graph, 4000 entities/second with 7 kinds of collections embedded (on a laptop).
You could try "isolating" the performance using the new blackhole backend in trunk; it has no practical purpose other than making this kind of measurement. It's easily backported too: http://fisheye.labs.jboss.com/viewrep/Hibernate/search/trunk/src/main/java/org/hibernate/search/backend/impl/blackhole/BlackHoleBackendQueueProcessorFactory.java?r=16310
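
The latency-bound argument above can be tried out with a small simulation. This is only a sketch: `LatencyBoundInserts` is a hypothetical class, `Thread.sleep` stands in for the database round trip, and no real inserts or indexing happen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical simulation: each "insert" just waits on a pretend database. */
final class LatencyBoundInserts {
    /** Runs the simulated inserts on a fixed pool and returns the elapsed time in milliseconds. */
    static long runInserts(int inserts, int threads, long dbLatencyMillis) {
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int i = 0; i < inserts; i++) {
            tasks.add(() -> {
                Thread.sleep(dbLatencyMillis); // stand-in for waiting on the database
                return null;
            });
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            pool.invokeAll(tasks); // blocks until every task has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for inserts", e);
        } finally {
            pool.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Comparing `runInserts(20, 1, 10)` with `runInserts(20, 10, 10)` shows the effect: the waits overlap, so total time shrinks while each individual insert is no faster.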

To express ideas about new features and improvements, it's better if you share them on the hibernate-dev list, so that we are sure that others will see them too.

FroMage · **Joined:** Tue Apr 01, 2008 11:10 am **Posts:** 69

s.grinovero wrote:

Hi Stéphane,
one year ago I was also complaining about some performance issues. Once I managed to make my case to Emmanuel (project leader) for some patches I wanted integrated, I became a contributor and, after a while, a committer.
I've since contributed more than 50 bugfixes and improvements, but I have a "day job" to manage too, so sorry if I'm sometimes slow to answer. I'm very interested in your findings if you share them.

Hey, I'm never complaining when volunteers reply, don't apologize :)

s.grinovero wrote:

If you have code patches for open JIRA issues it would be great if you could attach them. I'm sure Emmanuel will consider them for integration, or give some feedback when he is back (from a much-deserved vacation).

I have already sent several patches that helped us a lot:

- http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-5 in order to stop propagating index changes to @ContainedIn entities when the @IndexedEmbedded's index hasn't changed. This one made our code go from O(N) to O(1) when modifying non-indexed properties in @ContainedIn entities. This effectively means we simply cannot use HS without this patch, because if we have N objects in our database, adding a new one will cost N :(
- http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-183 fixes a bug where @IndexedEmbedded entities with no prefix would cause our index to be corrupted and random index documents to be deleted
- http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-185 which enables us to select what we embed; otherwise we end up with corrupted indexes.
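
The idea behind the first patch can be sketched as a dirty check. This is not the HSEARCH-5 patch itself: `ContainedInPropagation` and its property maps are hypothetical, and a real implementation would get the indexed property names from the mapping metadata.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical illustration: skip container reindexing when no indexed property changed. */
final class ContainedInPropagation {
    /**
     * Returns true only if at least one indexed property differs between the two states,
     * so changes to non-indexed properties never touch the containing entities.
     */
    static boolean needsPropagation(Set<String> indexedProperties,
                                    Map<String, Object> before,
                                    Map<String, Object> after) {
        for (String property : indexedProperties) {
            if (!Objects.equals(before.get(property), after.get(property))) {
                return true;
            }
        }
        return false;
    }
}
```

With such a check, an update that only changes a non-indexed property costs one comparison per indexed property instead of a reindex of every containing entity.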

In effect, all these patches I sent a while ago never got picked up by HS, and I've had several talks about them with Emmanuel, who had arguments against them. We cannot use HS without those patches, so we have a forked version in house which everyone uses. I have always sent my patches to open source projects, but without any feedback in return there is little motivation for me to keep sending patches in, or to modify those patches so that they are accepted. Indeed, a discussion about why they still are not in and what I have to change to get them included would be a very good sign.

s.grinovero wrote:

If you are digging into the Search code I'd really suggest checking out trunk, where new features and speed improvements are coming. These patches are about performance, and I have several experiments which are going to be committed during June:
http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-327
http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-218 (partially committed already, discussions and test welcome)

I think you could help with this one, as it's not complex and you can look at the RAMDirectoryProvider for an example of similar code:
http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-275
https://www.hibernate.org/462.html

I'd be happy to help, but I've looked at them and they do not seem directly related to my problem. My employer allows me to contribute to open source projects in my work time provided the work is tightly linked with our problems at hand. I'm going to try your fixes since they look very useful, but what I need right now is to be able to stop HS indexing while I do my batch upload to make it really asynchronous. I don't need to make it go faster, but to get out of the way ;)

Free time is something I just don't have anymore :(

s.grinovero wrote:

Quote:

When I enable it (with the same batch_size of 100) I get delays when flushing my session while HS is indexing, and my insertion rate drops to 15 entities per second.

You're loading additional data for @IndexedEmbedded right?

Not even in this case!

s.grinovero wrote:

looks reasonable: you're limited by the delay the database introduces when it has to execute the second query. This doesn't mean you're limited to 15 entities/second, of course. If you start 10 different tests in parallel you're probably going to scale to 150 entities/second, because your system is idle while it waits for the database answer; you're not burning resources. I exploit this concept to get some impressive numbers for the automatic indexing routines (HSEARCH-218): 12000 entities/second for a simple graph, 4000 entities/second with 7 kinds of collections embedded (on a laptop).

This is a user uploading an Excel spreadsheet where rows map to DB entities. This is J2EE and unfortunately I don't think it's a wise place to start putting threads. Also, this runs in a single transaction, which I need in case I have to roll back. I am very interested in your automatic indexing though: when we upgrade our web application, reindexing takes about 20 minutes even though our production DB is fairly small (100k records tops). But once again, I hope HS is going to manage the threads itself, as it shouldn't be a user problem.

s.grinovero wrote:

You could try "isolating" the performance using the new blackhole backend in trunk; it has no practical purpose other than making this kind of measurement. It's easily backported too: http://fisheye.labs.jboss.com/viewrep/Hibernate/search/trunk/src/main/java/org/hibernate/search/backend/impl/blackhole/BlackHoleBackendQueueProcessorFactory.java?r=16310

I can try, but I have a real suspicion that it's the frontend bothering me, not the backend ;)

s.grinovero wrote:

To express ideas about new features and improvements, it's better if you share them on the hibernate-dev list, so that we are sure that others will see them too.

Ah, perhaps this is why I had so little feedback from JIRA ;)
I'll look there, thanks a lot for all your answers once again :)