Hi Sanne,
Sorry for the slow reply, but the issue seems to manifest itself slightly differently every time, and the exceptions are numerous.
Again, all I'm doing is re-indexing all of the documents. The app consuming from JMS is doing all the writing, and I do a soft kill of one of the read-only nodes. The same patterns occur if a node goes suspect simply because the system the process is running on is under high load (I might need to do some JGroups tuning there).
To simplify things I've reduced the number of threads reading from JMS to 1, and I've also changed the async worker thread pool to 1 (though I think that's on the side publishing index changes to JMS, not on the side applying them).
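In case it helps to see concretely what I mean, the worker settings I'm describing boil down to something like the sketch below. This is only illustrative (property names as I understand them from the Hibernate Search docs, values reflecting the reduced-thread test setup, JMS names are placeholders), not a copy of my actual config:

import java.util.Properties;

public class SlaveWorkerSettings {
    // Rough sketch of the read-only (slave) side: index data lives in
    // Infinispan, index work is published to JMS, and the async worker
    // pool is cut down to a single thread while chasing this problem.
    public static Properties workerProperties() {
        Properties p = new Properties();
        p.setProperty("hibernate.search.default.directory_provider", "infinispan");
        p.setProperty("hibernate.search.default.worker.backend", "jms");
        p.setProperty("hibernate.search.default.worker.execution", "async");
        p.setProperty("hibernate.search.default.worker.thread_pool.size", "1");
        // Placeholders; the real JNDI names differ in my environment.
        p.setProperty("hibernate.search.default.worker.jms.connection_factory", "ConnectionFactory");
        p.setProperty("hibernate.search.default.worker.jms.queue", "queue/hibernatesearch");
        return p;
    }
}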
Things start off after the kill with a SuspectException, usually along with an EOF exception on the side writing to the index:
Quote:
06 Dec 2011 21:13:35,014 ERROR @Lucene Merge Thread #3 org.hibernate.search.exception.impl.LogErrorHandler Exception occurred org.apache.lucene.index.MergePolicy$MergeException: org.infinispan.remoting.transport.jgroups.SuspectException: Suspected member: m1229-16089
org.apache.lucene.index.MergePolicy$MergeException: org.infinispan.remoting.transport.jgroups.SuspectException: Suspected member: m1229-16089
at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:517)
at org.hibernate.search.backend.impl.lucene.overrides.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:49)
Caused by: org.infinispan.remoting.transport.jgroups.SuspectException: Suspected member: m1229-16089
06 Dec 2011 21:13:58,430 ERROR @Lucene Merge Thread #1 org.hibernate.search.exception.impl.LogErrorHandler Exception occurred org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: Read past EOF: Chunk value could not be found for key _17.cfs|0|NetworkInformation
Then, for as long as I leave the writing app running, it seems to loop indefinitely, continuing re-indexing work without ever finishing. Just endless logs along the lines of:
Quote:
06 Dec 2011 23:09:56,720 TRACE @Lucene Merge Thread #16 org.infinispan.util.logging.AbstractLogImpl Invoked with command GetKeyValueCommand {key=_3p.fdt|76|NIC, flags=[SKIP_LOCKING]} and InvocationContext [NonTxInvocationContext{flags=[SKIP_LOCKING]}]
06 Dec 2011 23:09:56,721 TRACE @Lucene Merge Thread #17 org.infinispan.util.logging.AbstractLogImpl Number of entries in context: 1
06 Dec 2011 23:09:56,721 TRACE @Hibernate Search: Directory writer-1 org.infinispan.util.logging.AbstractLogImpl Entry for key _2c.tis|18|Location is null, not calling commitUpdate
06 Dec 2011 23:09:56,722 TRACE @Lucene Merge Thread #17 org.infinispan.util.logging.AbstractLogImpl Updating entry (key=_67.fdt|29|NIC removed=false valid=true changed=true created=true value=[B@3cfbe5d7]
06 Dec 2011 23:09:56,721 TRACE @Lucene Merge Thread #16 org.infinispan.util.logging.AbstractLogImpl Key _3p.fdt|76|NIC is not in context, fetching from container.
06 Dec 2011 23:09:56,722 TRACE @Hibernate Search: Directory writer-1 org.infinispan.lucene.InfinispanIndexOutput Storing segment chunk: _68.fdt|55|NIC
Again, I'd expect it to wrap up the in-progress work pretty quickly, since the app that was pushing the index work to JMS is the one that was killed. In any case, if I try to start the read-only app up again, a series of exceptions usually occurs on the app responsible for writing to the index (the one that has been up the whole time):
Quote:
06 Dec 2011 23:10:30,720 ERROR @transport-thread-17 org.infinispan.remoting.rpc.RpcManagerImpl unexpected error while replicating
org.infinispan.manager.NamedCacheNotFoundException: Cache: LuceneIndexesData
at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:123)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:159)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:144)
Quote:
java.util.concurrent.TimeoutException: Hibernate Search: Directory writer-1 could not obtain shared processing lock after 50 seconds. Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@8d06677[Read locks = 12] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@29f71326[Unlocked]
Quote:
06 Dec 2011 23:12:42,839 ERROR @Hibernate Search: Directory writer-2 org.hibernate.search.backend.impl.lucene.PerDPQueueProcessor Unexpected error in Lucene Backend:
org.hibernate.search.SearchException: Unable to remove class com.ssn.dm.types.location.ServicePoint#2312193 from index.
at org.hibernate.search.backend.impl.lucene.works.DeleteWorkDelegate.performWork(DeleteWorkDelegate.java:91)
at org.hibernate.search.backend.impl.lucene.PerDPQueueProcessor.run(PerDPQueueProcessor.java:106)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
at org.hibernate.search.backend.impl.lucene.works.DeleteWorkDelegate.performWork(DeleteWorkDelegate.java:87)
... 4 more
It's almost as if the read-only app that is trying to re-join overwrites some shared state with invalid or uninitialized data. Having a named cache like LuceneIndexesData suddenly disappear when re-joining seems really odd, but I'm not sure if that's the root cause given all the errors involved.
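For what it's worth, my mental model of what the rejoining read-only node does at startup is roughly the sketch below: boot the cache manager and touch the three index caches by name before Hibernate Search starts using the Infinispan directory. This is a stripped-down illustration, not my literal startup code, and the config file name is a placeholder:

import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class ReadOnlyNodeStartup {
    // Sketch of the rejoin path: start the cache manager, then eagerly
    // obtain the caches backing the Lucene index so they are running
    // before any index reads happen.
    public static EmbeddedCacheManager start() throws Exception {
        EmbeddedCacheManager manager =
                new DefaultCacheManager("infinispan-config.xml"); // placeholder file name
        manager.getCache("LuceneIndexesMetadata");
        manager.getCache("LuceneIndexesData");
        manager.getCache("LuceneIndexesLocking");
        return manager;
    }
}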
Another way this sometimes goes is that I just get endless errors on the writing app side that never seem to resolve:
org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: Read past EOF: Chunk value could not be found for key
Regardless, the joining app is never able to rejoin, usually with the lock timeout error mentioned above in my first post.
I've also seen some logs about 'DummyTransactions' and SKIP_LOCKING flags, which makes me worry I've got something mis-configured and not transactionally sound when interacting with Infinispan.
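To make that question concrete, here is the kind of thing I mean by pinning the transaction setup down explicitly, sketched with the Infinispan 5.1 ConfigurationBuilder API and the default index cache names. It is illustrative only, not my actual configuration, and whether these caches should even be transactional is exactly what I'm unsure about:

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.manager.EmbeddedCacheManager;
import org.infinispan.transaction.TransactionMode;
import org.infinispan.transaction.lookup.GenericTransactionManagerLookup;

public class IndexCacheDefinitions {
    // Sketch: define the three index caches as synchronously replicated and
    // explicitly transactional, with a real TransactionManager lookup instead
    // of the dummy one, so the locking semantics are unambiguous.
    public static void defineIndexCaches(EmbeddedCacheManager manager) {
        ConfigurationBuilder builder = new ConfigurationBuilder();
        builder.clustering().cacheMode(CacheMode.REPL_SYNC);
        builder.transaction()
               .transactionMode(TransactionMode.TRANSACTIONAL)
               .transactionManagerLookup(new GenericTransactionManagerLookup());
        Configuration cfg = builder.build();
        manager.defineConfiguration("LuceneIndexesMetadata", cfg);
        manager.defineConfiguration("LuceneIndexesData", cfg);
        manager.defineConfiguration("LuceneIndexesLocking", cfg);
    }
}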
I know this is a lot of info, but does it make you suspicious of any misconfiguration, or ring any other bells?
Thanks,
Zach