Hi Sanne,
Sorry for the slow reply, but the issue seems to manifest itself slightly differently every time, and the exceptions are numerous.
Again, all I'm doing is re-indexing all of the documents. The app consuming from JMS is doing all the writing, and I do a soft kill of one of the read-only nodes. The same patterns occur if a node goes suspect simply because the system the process is running on is under high load (I might need to do some JGroups tuning there).
To simplify things I've reduced the number of threads reading from JMS to 1, and I've also changed the async worker thread pool to 1 (though I think that's on the side publishing index changes to JMS, not on the side applying them).
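In case it helps to see concretely what I mean, the worker settings I'm describing boil down to something like the sketch below. This is only illustrative (property names as I understand them from the Hibernate Search docs, values reflecting the reduced-thread test setup, JMS names are placeholders), not a copy of my actual config:

import java.util.Properties;

public class SlaveWorkerSettings {
    // Rough sketch of the read-only (slave) side: index data lives in
    // Infinispan, index work is published to JMS, and the async worker
    // pool is cut down to a single thread while chasing this problem.
    public static Properties workerProperties() {
        Properties p = new Properties();
        p.setProperty("hibernate.search.default.directory_provider", "infinispan");
        p.setProperty("hibernate.search.default.worker.backend", "jms");
        p.setProperty("hibernate.search.default.worker.execution", "async");
        p.setProperty("hibernate.search.default.worker.thread_pool.size", "1");
        // Placeholders; the real JNDI names differ in my environment.
        p.setProperty("hibernate.search.default.worker.jms.connection_factory", "ConnectionFactory");
        p.setProperty("hibernate.search.default.worker.jms.queue", "queue/hibernatesearch");
        return p;
    }
}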
Things start off after the kill with a SuspectException, usually along with an EOF exception on the side writing to the index:
Quote:
06 Dec 2011 21:13:35,014 ERROR @Lucene Merge Thread #3 org.hibernate.search.exception.impl.LogErrorHandler Exception occurred org.apache.lucene.index.MergePolicy$MergeException: org.infinispan.remoting.transport.jgroups.SuspectException: Suspected member: m1229-16089
org.apache.lucene.index.MergePolicy$MergeException: org.infinispan.remoting.transport.jgroups.SuspectException: Suspected member: m1229-16089
at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:517)
at org.hibernate.search.backend.impl.lucene.overrides.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:49)
Caused by: org.infinispan.remoting.transport.jgroups.SuspectException: Suspected member: m1229-16089
06 Dec 2011 21:13:58,430 ERROR @Lucene Merge Thread #1 org.hibernate.search.exception.impl.LogErrorHandler Exception occurred org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: Read past EOF: Chunk value could not be found for key _17.cfs|0|NetworkInformation
Then, for as long as I leave the writing app running, it seems to loop indefinitely, continuing re-indexing work without ever finishing. Just endless logs along the lines of:
Quote:
06 Dec 2011 23:09:56,720 TRACE @Lucene Merge Thread #16 org.infinispan.util.logging.AbstractLogImpl Invoked with command GetKeyValueCommand {key=_3p.fdt|76|NIC, flags=[SKIP_LOCKING]} and InvocationContext [NonTxInvocationContext{flags=[SKIP_LOCKING]}]
06 Dec 2011 23:09:56,721 TRACE @Lucene Merge Thread #17 org.infinispan.util.logging.AbstractLogImpl Number of entries in context: 1
06 Dec 2011 23:09:56,721 TRACE @Hibernate Search: Directory writer-1 org.infinispan.util.logging.AbstractLogImpl Entry for key _2c.tis|18|Location is null, not calling commitUpdate
06 Dec 2011 23:09:56,722 TRACE @Lucene Merge Thread #17 org.infinispan.util.logging.AbstractLogImpl Updating entry (key=_67.fdt|29|NIC removed=false valid=true changed=true created=true value=[B@3cfbe5d7]
06 Dec 2011 23:09:56,721 TRACE @Lucene Merge Thread #16 org.infinispan.util.logging.AbstractLogImpl Key _3p.fdt|76|NIC is not in context, fetching from container.
06 Dec 2011 23:09:56,722 TRACE @Hibernate Search: Directory writer-1 org.infinispan.lucene.InfinispanIndexOutput Storing segment chunk: _68.fdt|55|NIC
Again, I'd expect it to wrap up the in-progress work pretty quickly, since the app that was pushing the index work to JMS is the one that was killed. In any case, if I try to start the read-only app up again, a series of exceptions usually occurs on the app responsible for writing to the index (the one that has been up the whole time):
Quote:
06 Dec 2011 23:10:30,720 ERROR @transport-thread-17 org.infinispan.remoting.rpc.RpcManagerImpl unexpected error while replicating
org.infinispan.manager.NamedCacheNotFoundException: Cache: LuceneIndexesData
at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:123)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommand(CommandAwareRpcDispatcher.java:159)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:144)
Quote:
java.util.concurrent.TimeoutException: Hibernate Search: Directory writer-1 could not obtain shared processing lock after 50 seconds. Locks in question are java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock@8d06677[Read locks = 12] and java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock@29f71326[Unlocked]
Quote:
06 Dec 2011 23:12:42,839 ERROR @Hibernate Search: Directory writer-2 org.hibernate.search.backend.impl.lucene.PerDPQueueProcessor Unexpected error in Lucene Backend:
org.hibernate.search.SearchException: Unable to remove class com.ssn.dm.types.location.ServicePoint#2312193 from index.
at org.hibernate.search.backend.impl.lucene.works.DeleteWorkDelegate.performWork(DeleteWorkDelegate.java:91)
at org.hibernate.search.backend.impl.lucene.PerDPQueueProcessor.run(PerDPQueueProcessor.java:106)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
at org.hibernate.search.backend.impl.lucene.works.DeleteWorkDelegate.performWork(DeleteWorkDelegate.java:87)
... 4 more
It's almost as if the read-only app that is trying to re-join overwrites some shared state with invalid or uninitialized data. Having a named cache like LuceneIndexesData suddenly disappear when re-joining seems really odd, but I'm not sure if that's the root cause given all the errors involved.
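For what it's worth, my mental model of what the rejoining read-only node does at startup is roughly the sketch below: boot the cache manager and touch the three index caches by name before Hibernate Search starts using the Infinispan directory. This is a stripped-down illustration, not my literal startup code, and the config file name is a placeholder:

import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class ReadOnlyNodeStartup {
    // Sketch of the rejoin path: start the cache manager, then eagerly
    // obtain the caches backing the Lucene index so they are running
    // before any index reads happen.
    public static EmbeddedCacheManager start() throws Exception {
        EmbeddedCacheManager manager =
                new DefaultCacheManager("infinispan-config.xml"); // placeholder file name
        manager.getCache("LuceneIndexesMetadata");
        manager.getCache("LuceneIndexesData");
        manager.getCache("LuceneIndexesLocking");
        return manager;
    }
}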
Another way this sometimes goes is that I just get endless errors on the writing app side that never seem to resolve:
org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: Read past EOF: Chunk value could not be found for key
Regardless, the joining app is never able to rejoin, usually with the lock timeout error mentioned above in my first post.
I've also seen some logs about 'DummyTransactions' and SKIP_LOCKING flags, which makes me worry I've got something mis-configured and not transactionally sound when interacting with Infinispan.
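To make that question concrete, here is the kind of thing I mean by pinning the transaction setup down explicitly, sketched with the Infinispan 5.1 ConfigurationBuilder API and the default index cache names. It is illustrative only, not my actual configuration, and whether these caches should even be transactional is exactly what I'm unsure about:

import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.manager.EmbeddedCacheManager;
import org.infinispan.transaction.TransactionMode;
import org.infinispan.transaction.lookup.GenericTransactionManagerLookup;

public class IndexCacheDefinitions {
    // Sketch: define the three index caches as synchronously replicated and
    // explicitly transactional, with a real TransactionManager lookup instead
    // of the dummy one, so the locking semantics are unambiguous.
    public static void defineIndexCaches(EmbeddedCacheManager manager) {
        ConfigurationBuilder builder = new ConfigurationBuilder();
        builder.clustering().cacheMode(CacheMode.REPL_SYNC);
        builder.transaction()
               .transactionMode(TransactionMode.TRANSACTIONAL)
               .transactionManagerLookup(new GenericTransactionManagerLookup());
        Configuration cfg = builder.build();
        manager.defineConfiguration("LuceneIndexesMetadata", cfg);
        manager.defineConfiguration("LuceneIndexesData", cfg);
        manager.defineConfiguration("LuceneIndexesLocking", cfg);
    }
}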
I know this is a lot of info, but does it make you suspicious of any misconfiguration, or ring any other bells?
Thanks,
Zach