Indexing Hangs Thread in addWorkToQueue on TX Commit

jnadler · **Posted:** Tue Jul 27, 2010 4:23 pm

HSearch version 3.2.0.Final

Very odd. At one of my customers' sites this problem is consistently reproducible. A process runs that adds a lot of new @Indexed entities, each of which has at least one @IndexedEmbedded entity. Every time, one thread hangs while using 100% of one CPU core, and it appears to stay stuck that way forever.

If they restart their server, the next time the process runs we get stuck in the exact same place.

Here's the thread dump in question in case anyone has a suggestion:

Code:

"schedulerFactoryBean_Worker-5" prio=10 tid=0x000000000d6d9800 nid=0x5ca3 runnable [0x0000000042786000..0x0000000042787c10]
   java.lang.Thread.State: RUNNABLE
   at org.hibernate.search.engine.DocumentBuilderIndexedEntity.addWorkToQueue(DocumentBuilderIndexedEntity.java:319)
   at org.hibernate.search.engine.DocumentBuilderContainedEntity.addWorkForEmbeddedValue(DocumentBuilderContainedEntity.java:726)
   at org.hibernate.search.engine.DocumentBuilderContainedEntity.processSingleContainedInInstance(DocumentBuilderContainedEntity.java:709)
   at org.hibernate.search.engine.DocumentBuilderContainedEntity.processContainedInInstances(DocumentBuilderContainedEntity.java:664)
   at org.hibernate.search.engine.DocumentBuilderContainedEntity.processSingleContainedInInstance(DocumentBuilderContainedEntity.java:705)
   at org.hibernate.search.engine.DocumentBuilderContainedEntity.processContainedInInstances(DocumentBuilderContainedEntity.java:659)
   at org.hibernate.search.engine.DocumentBuilderContainedEntity.addWorkToQueue(DocumentBuilderContainedEntity.java:612)
   at org.hibernate.search.backend.impl.BatchedQueueingProcessor.addWorkToBuilderQueue(BatchedQueueingProcessor.java:270)
   at org.hibernate.search.backend.impl.BatchedQueueingProcessor.processWorkByLayer(BatchedQueueingProcessor.java:248)
   at org.hibernate.search.backend.impl.BatchedQueueingProcessor.prepareWorks(BatchedQueueingProcessor.java:147)
   at org.hibernate.search.backend.impl.PostTransactionWorkQueueSynchronization.beforeCompletion(PostTransactionWorkQueueSynchronization.java:70)
   at org.hibernate.search.backend.impl.EventSourceTransactionContext$DelegateToSynchronizationOnBeforeTx.doBeforeTransactionCompletion(EventSourceTransactionContext.java:144)
   at org.hibernate.engine.ActionQueue$BeforeTransactionCompletionProcessQueue.beforeTransactionCompletion(ActionQueue.java:530)
   at org.hibernate.engine.ActionQueue.beforeTransactionCompletion(ActionQueue.java:211)
   at org.hibernate.impl.SessionImpl.beforeTransactionCompletion(SessionImpl.java:563)
   at org.hibernate.jdbc.JDBCContext.beforeTransactionCompletion(JDBCContext.java:229)
   at org.hibernate.transaction.JDBCTransaction.commit(JDBCTransaction.java:142)
   at org.hibernate.ejb.TransactionImpl.commit(TransactionImpl.java:76)
   at org.springframework.orm.jpa.JpaTransactionManager.doCommit(JpaTransactionManager.java:467)
   at org.springframework.transaction.support.AbstractPlatformTransactionManager.processCommit(AbstractPlatformTransactionManager.java:754)
   at org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:723)
   at org.springframework.transaction.interceptor.TransactionAspectSupport.commitTransactionAfterReturning(TransactionAspectSupport.java:375)
   at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:120)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
   at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:89)
   at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
   at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
   at $Proxy103.processEvent(Unknown Source)
   at com.attensa.core.job.AggregationJobImpl.executeInternal(AggregationJobImpl.java:75)
   at org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:86)
   at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
   at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525)

sanne.grinovero · **Posted:** Wed Jul 28, 2010 12:06 pm

Quote:

DocumentBuilderIndexedEntity.java:319

Sorry, there's no code at that line in 3.2.0.Final. Please explain: if there's an issue, I'd like to track it.

jnadler · **Posted:** Wed Jul 28, 2010 6:25 pm

Well, I didn't expect that. I checked and you're correct. Perhaps this customer somehow has an older HSearch jar on their classpath; I'll check that possibility first.

jnadler · **Posted:** Fri Jul 30, 2010 3:45 pm

We verified that the correct version of Hibernate Search is in use:
hibernate-search-3.2.0.Final.jar 457,393

We stopped the server, redeployed it with a fresh copy of the app, and ran the scheduled job where we see the issue. Same thing: the thread is still stuck on line 319.

Thread hung, 100% CPU, stack trace:

Code:

"schedulerFactoryBean_Worker-2" prio=10 tid=0x00002aab793e2800 nid=0x596a runnable [0x0000000042fe4000]
   java.lang.Thread.State: RUNNABLE
   at org.hibernate.search.engine.DocumentBuilderIndexedEntity.addWorkToQueue(DocumentBuilderIndexedEntity.java:319)
   at org.hibernate.search.engine.DocumentBuilderContainedEntity.addWorkForEmbeddedValue(DocumentBuilderContainedEntity.java:726)
   at org.hibernate.search.engine.DocumentBuilderContainedEntity.processSingleContainedInInstance(DocumentBuilderContainedEntity.java:709)
   at org.hibernate.search.engine.DocumentBuilderContainedEntity.processContainedInInstances(DocumentBuilderContainedEntity.java:664)
   at org.hibernate.search.engine.DocumentBuilderContainedEntity.processSingleContainedInInstance(DocumentBuilderContainedEntity.java:705)
   at org.hibernate.search.engine.DocumentBuilderContainedEntity.processContainedInInstances(DocumentBuilderContainedEntity.java:659)
   at org.hibernate.search.engine.DocumentBuilderContainedEntity.addWorkToQueue(DocumentBuilderContainedEntity.java:612)
   at org.hibernate.search.backend.impl.BatchedQueueingProcessor.addWorkToBuilderQueue(BatchedQueueingProcessor.java:270)
   at org.hibernate.search.backend.impl.BatchedQueueingProcessor.processWorkByLayer(BatchedQueueingProcessor.java:248)
   at org.hibernate.search.backend.impl.BatchedQueueingProcessor.prepareWorks(BatchedQueueingProcessor.java:147)
   at org.hibernate.search.backend.impl.PostTransactionWorkQueueSynchronization.beforeCompletion(PostTransactionWorkQueueSynchronization.java:70)
   at org.hibernate.search.backend.impl.EventSourceTransactionContext$DelegateToSynchronizationOnBeforeTx.doBeforeTransactionCompletion(EventSourceTransactionContext.java:144)
   at org.hibernate.engine.ActionQueue$BeforeTransactionCompletionProcessQueue.beforeTransactionCompletion(ActionQueue.java:530)
   at org.hibernate.engine.ActionQueue.beforeTransactionCompletion(ActionQueue.java:211)
   at org.hibernate.impl.SessionImpl.beforeTransactionCompletion(SessionImpl.java:563)
        ...

jnadler · **Posted:** Fri Jul 30, 2010 7:44 pm

We've narrowed this down a bit. The problem only occurs in transactions that add a large amount of LuceneWork.

In one case we reproduced the problem by updating about 1000 @Indexed entities (each containing on average two @IndexedEmbedded entities) in a single transaction.

The problem appears to be that the algorithm used in addWorkToQueue scales poorly for large N. In our case it gets so bad that it runs for days.

If I understand correctly, this seems to be a known issue per this comment in the source code:

Code:

//TODO with the caller loop we are in a n^2: optimize it using a HashMap for work recognition
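To make the TODO concrete, here is a minimal sketch (not Hibernate Search code) contrasting the two ways of recognizing already-queued work: scanning the whole list for every incoming item versus a single HashMap lookup keyed on entity class and id. `Work`, `DedupCost` and both methods are hypothetical. With n distinct items the scan performs n(n-1)/2 comparisons, i.e. 499,500 for n = 1000, while the map performs n lookups. This only illustrates how the cost grows; it says nothing about the actual timings reported in this thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a queued unit of index work; only the entity
// class and id matter for recognizing duplicates in this sketch.
final class Work {
    final Class<?> entityClass;
    final Object id;

    Work(Class<?> entityClass, Object id) {
        this.entityClass = entityClass;
        this.id = id;
    }
}

final class DedupCost {
    // List-based recognition: every incoming item scans the queue, so the
    // caller loop plus this scan costs O(n^2) comparisons overall.
    static long listComparisons(int n) {
        List<Work> queue = new ArrayList<>();
        long comparisons = 0;
        for (int i = 0; i < n; i++) {
            Work incoming = new Work(String.class, i);
            boolean found = false;
            for (Work existing : queue) {
                comparisons++;
                if (existing.entityClass == incoming.entityClass
                        && existing.id.equals(incoming.id)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                queue.add(incoming);
            }
        }
        return comparisons;
    }

    // HashMap-based recognition: one expected O(1) lookup per item, O(n) overall.
    static long mapLookups(int n) {
        Map<List<Object>, Work> index = new HashMap<>();
        long lookups = 0;
        for (int i = 0; i < n; i++) {
            Work incoming = new Work(String.class, i);
            lookups++;
            index.putIfAbsent(List.of(incoming.entityClass, incoming.id), incoming);
        }
        return lookups;
    }

    public static void main(String[] args) {
        System.out.println(listComparisons(1000)); // 499500
        System.out.println(mapLookups(1000));      // 1000
    }
}
```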

Of course, a possible workaround for us is to break the work into multiple smaller transactions. This would be a massive change to our system, but it might be worthwhile anyway because of other issues (like locking and concurrency) introduced by transactions that are too large.

sanne.grinovero · **Posted:** Sat Jul 31, 2010 5:50 am

Thanks for the valuable insight. I've opened an issue to track this: http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-570

If you happen to take a closer look at the addWorkToQueue method, feel free to suggest a patch ;)

jnadler · **Posted:** Sat Jul 31, 2010 5:31 pm

We should discuss the design a bit. I might be able to provide a patch, but my initial idea is a very big change, and because I'm not familiar with HSearch internals I'm hesitant to attempt it.

I think List<LuceneWork> may not be the best data structure, given the way this data is used. To make it easier to refactor in the future, it would be best to create a new class, LuceneWorkQueue, that abstracts operations on a queue and hides the storage implementation.

Of course, doing this would have a very big impact.

Based on the logic in addWorkToQueue it seems that the best data structure might be:
Map<Class, Map<Serializable, LuceneWork>>

The outer map key is the entity class and the inner map key is the id. No looping would be needed in the main part of DocumentBuilderIndexedEntity. What do you think? Do you have an easier suggestion?

jnadler · **Posted:** Sat Jul 31, 2010 7:46 pm

Also, in the plan I've described, LuceneWorkQueue would need to maintain both the Map of Maps and a List internally; the List is needed to preserve ordering.
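A rough sketch of the proposed LuceneWorkQueue, assuming it only needs to add work, recognize duplicates by (entity class, id), and return the work in insertion order. Both `LuceneWorkQueue` here and `QueuedWork` are hypothetical, not existing Hibernate Search classes. Real LuceneWork carries more state, and what should happen when a second item for the same entity arrives (this sketch simply rejects it) is exactly the merging logic addWorkToQueue would have to keep.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical work item; real LuceneWork carries more state.
final class QueuedWork {
    final Class<?> entityClass;
    final Serializable id;

    QueuedWork(Class<?> entityClass, Serializable id) {
        this.entityClass = entityClass;
        this.id = id;
    }
}

// Sketch of the proposed queue: the nested map answers "is this entity
// already queued?", while the list preserves the order work was added.
final class LuceneWorkQueue {
    private final List<QueuedWork> ordered = new ArrayList<>();
    private final Map<Class<?>, Map<Serializable, QueuedWork>> index = new HashMap<>();

    // Queues the work unless work for the same (class, id) is already present.
    boolean addIfAbsent(QueuedWork work) {
        Map<Serializable, QueuedWork> byId =
            index.computeIfAbsent(work.entityClass, k -> new HashMap<>());
        if (byId.containsKey(work.id)) {
            return false;
        }
        byId.put(work.id, work);
        ordered.add(work);
        return true;
    }

    boolean contains(Class<?> entityClass, Serializable id) {
        Map<Serializable, QueuedWork> byId = index.get(entityClass);
        return byId != null && byId.containsKey(id);
    }

    // Read-only view in insertion order.
    List<QueuedWork> asList() {
        return Collections.unmodifiableList(ordered);
    }

    public static void main(String[] args) {
        LuceneWorkQueue queue = new LuceneWorkQueue();
        queue.addIfAbsent(new QueuedWork(String.class, 1L));
        queue.addIfAbsent(new QueuedWork(String.class, 1L)); // rejected as a duplicate
        queue.addIfAbsent(new QueuedWork(Integer.class, 1L));
        System.out.println(queue.asList().size()); // 2
    }
}
```

A single LinkedHashMap keyed on a (class, id) pair would provide lookup and ordering in one structure; the separate List simply mirrors the design described above.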

emmanuel · **Posted:** Fri Oct 08, 2010 7:50 am

Hi jnadler,
Sorry we dropped the ball on this for so long. If you are still in the game, I would appreciate your ideas, and even better a patch :)

List<LuceneWork> is a semi-public type used when jobs are serialized/deserialized and in a few other semi-public APIs like WorkQueue. But internally we can have intermediate structures, and it seems you want the optimized structure within BatchedQueueingProcessor#prepareWorks.

We have a copy of Hibernate Search on GitHub, and it's trivial to fork it and play with the source code: http://github.com/emmanuelbernard/hibernate-search

Let me know if you are still interested.

Emmanuel

jnadler · **Posted:** Fri Oct 08, 2010 12:12 pm

Hi Emmanuel,

Thanks for the follow-up. I'm happy to work on this, but it will take me some time: my wife is due to have a baby any day now, and I'll be away from work for a bit.

If I recall correctly, my idea was that before this loop starts, we transform the data structure into Map<Class, Map<Serializable, LuceneWork>>: the outer map key is the entity class, the inner map key is the id.

Once this is done, no looping would be needed in the main part of DocumentBuilderIndexedEntity; it's just a couple of map lookups. For a given class and a given key, get its LuceneWork.

Does this make sense, at least abstractly? I'm not confident about the HSearch internal design, so I'd love some validation from you guys before I build the patch.
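A sketch of the "transform before the loop" step, assuming the existing List is indexed once into the nested map and the loop then does two lookups instead of a scan. `SimpleWork` and `WorkIndex` are hypothetical names, not Hibernate Search classes. If the list holds several items for the same (class, id), this sketch keeps the last one; real code would need whatever precedence rules addWorkToQueue currently applies.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for LuceneWork with just the fields the lookup needs.
final class SimpleWork {
    final Class<?> entityClass;
    final Serializable id;
    final String kind; // illustrative label such as "add" or "delete"

    SimpleWork(Class<?> entityClass, Serializable id, String kind) {
        this.entityClass = entityClass;
        this.id = id;
        this.kind = kind;
    }
}

// Map<Class, Map<Serializable, work>>: outer key entity class, inner key id.
final class WorkIndex {
    private final Map<Class<?>, Map<Serializable, SimpleWork>> byClassAndId = new HashMap<>();

    // Built once from the existing list, before the per-entity loop runs.
    // If several items share a (class, id), the last one wins in this sketch.
    WorkIndex(List<SimpleWork> queue) {
        for (SimpleWork work : queue) {
            byClassAndId
                .computeIfAbsent(work.entityClass, k -> new HashMap<>())
                .put(work.id, work);
        }
    }

    // Two map lookups instead of a scan over the whole queue; null if absent.
    SimpleWork find(Class<?> entityClass, Serializable id) {
        Map<Serializable, SimpleWork> byId = byClassAndId.get(entityClass);
        return byId == null ? null : byId.get(id);
    }

    public static void main(String[] args) {
        WorkIndex index = new WorkIndex(List.of(
            new SimpleWork(String.class, 1L, "add"),
            new SimpleWork(Integer.class, 1L, "delete")));
        System.out.println(index.find(String.class, 1L).kind);  // add
        System.out.println(index.find(Integer.class, 1L).kind); // delete
        System.out.println(index.find(String.class, 2L));       // null
    }
}
```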

Thanks again,

Jeff

jnadler · **Posted:** Fri Oct 08, 2010 12:14 pm

For anyone else having a similar problem: I worked around this in my application by breaking the work up into multiple smaller transactions. Needless to say, this is easier on the DB as well. I just wanted to make it clear that for most apps it probably isn't strictly necessary to run such large transactions.

In my case the only downside is the need to catch any exceptions and do some specialized clean-up to preserve atomicity.
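For readers taking the same route, a minimal sketch of splitting one large job into fixed-size batches, each run in its own transaction, with a hook for the compensating clean-up mentioned above. `BatchedWork`, `runInOwnTransaction` and `compensate` are hypothetical; in a Spring application the per-batch callback could, for example, wrap Spring's `TransactionTemplate.execute`, but that wiring is not shown. The batch size is the application's choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class BatchedWork {
    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }

    // Runs each batch through runInOwnTransaction (hypothetical hook). If a
    // batch fails, the batches already committed are passed to compensate so
    // the application can undo them, and the failure is rethrown.
    static <T> void run(List<T> items, int batchSize,
                        Consumer<List<T>> runInOwnTransaction,
                        Consumer<List<List<T>>> compensate) {
        List<List<T>> committed = new ArrayList<>();
        for (List<T> batch : partition(items, batchSize)) {
            try {
                runInOwnTransaction.accept(batch);
            } catch (RuntimeException e) {
                compensate.accept(committed);
                throw e;
            }
            committed.add(batch);
        }
    }

    public static void main(String[] args) {
        System.out.println(partition(List.of(1, 2, 3, 4, 5), 2)); // [[1, 2], [3, 4], [5]]
    }
}
```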

emmanuel · **Posted:** Fri Oct 08, 2010 12:22 pm

Yes it made sense to me at least :)

I wonder if we should have something that kicks in only for big lists of work, i.e. the regular approach for small and medium lists and the extra data structure creation for big lists. I'm a bit concerned that building the structure adds overhead and no benefit for most use cases.
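One way to sketch this idea: keep the plain scan below a size threshold and build a hashed structure only above it. `HybridDedup`, its `Key` type and the threshold of 32 are arbitrary assumptions for illustration, not measured values. Both branches rely on the same `Key.equals`/`hashCode`, so the rule for what counts as "the same work" lives in one place, which limits how far the two code paths can drift apart.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

final class HybridDedup {
    // Arbitrary illustrative cutoff; a real value would have to be measured.
    static final int INDEX_THRESHOLD = 32;

    // Hypothetical identity of a work item: entity class plus id. Both code
    // paths below rely on these equals/hashCode methods.
    static final class Key {
        final Class<?> entityClass;
        final Object id;

        Key(Class<?> entityClass, Object id) {
            this.entityClass = entityClass;
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return entityClass == other.entityClass && id.equals(other.id);
        }

        @Override
        public int hashCode() {
            return Objects.hash(entityClass, id);
        }
    }

    // Removes duplicates, keeping the first occurrence and the original order.
    static List<Key> dedup(List<Key> works) {
        if (works.size() < INDEX_THRESHOLD) {
            // Small queue: a nested scan is cheap and builds no extra structure.
            List<Key> result = new ArrayList<>();
            for (Key work : works) {
                if (!result.contains(work)) {
                    result.add(work);
                }
            }
            return result;
        }
        // Large queue: pay once for a hashed, insertion-ordered set.
        return new ArrayList<>(new LinkedHashSet<>(works));
    }
}
```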

jnadler · **Posted:** Fri Oct 08, 2010 12:29 pm

Makes sense. I'm a little nervous about having two distinct code paths for this core functionality: it's always easy for someone to change the 'main' path in the future and forget about the 'big data' path.

Still, I understand the concern about overhead for the typical transaction with perhaps 10 or fewer LuceneWork items.