Hi,
I'm using Hibernate (4.0.0.Final) with Infinispan and JGroups to create a replicated Lucene index cache across a variable number of machines (typically three, but it depends on availability). Currently, the logic in the service layer is to index all of the required search entities - using MassIndexer.startAndWait() - only if the data does not already exist in the cache. Typically this will be because all nodes have been taken offline for an upgrade. The re-indexing process may take up to two minutes. Any node started after the index has been created should not perform any re-indexing, as that would be unnecessary. There is no need (or desire) for file-system persistence.
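For reference, that service-layer step looks roughly like this (a simplified sketch; the class and method names and the indexAlreadyPopulated flag are just illustrative - how that flag is determined is the subject of Question 2 below):
Code:
import org.hibernate.Session;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;

public class IndexBootstrap {

    // Simplified sketch: re-index only when no index data is already present
    // in the replicated caches (detecting that is Question 2 below).
    public void rebuildIndexIfNeeded(Session session, boolean indexAlreadyPopulated)
            throws InterruptedException {
        if (!indexAlreadyPopulated) {
            FullTextSession fullTextSession = Search.getFullTextSession(session);
            // Indexes every @Indexed entity and blocks until complete
            // (this is the step that takes up to ~2 minutes).
            fullTextSession.createIndexer().startAndWait();
        }
    }
}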
Everything starts up fine if I can guarantee the order and timing in which the clustered nodes start. If I start node one, wait until the indexing has completed and then start the second node, everything appears to work. If I start them at roughly the same time, a number of errors are raised that appear to be related to synchronizing the nodes/clustered caches. On the node that was started first:
Quote:
ERROR: ISPN000136: Execution error
org.infinispan.util.concurrent.TimeoutException: Replication timeout for FMDX2646-52133
at org.infinispan.remoting.transport.AbstractTransport.parseResponseAndAddToResponseList(AbstractTransport.java:71)
org.infinispan.util.concurrent.TimeoutException: Replication timeout for FMDX2646-52133
at org.infinispan.remoting.transport.AbstractTransport.parseResponseAndAddToResponseList(AbstractTransport.java:71)
On the node that was started second:
Quote:
-------------------------------------------------------------------
GMS: address=FMDX2646-52133, cluster=HibernateSearch-Infinispan-cluster, physical address=10.214.32.67:7800
-------------------------------------------------------------------
03-Jan-2013 15:30:57 org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted
INFO: ISPN000094: Received new cluster view: [FMF-V3-0790-52063|2] [FMF-V3-0790-52063, FMF-V3-0790-63321, FMDX2646-52133]
03-Jan-2013 15:30:57 org.infinispan.remoting.transport.jgroups.JGroupsTransport startJGroupsChannelIfNeeded
INFO: ISPN000079: Cache local address is FMDX2646-52133, physical addresses are [10.214.32.67:7800]
03-Jan-2013 15:30:57 org.infinispan.factories.AbstractComponentRegistry internalStart
INFO: ISPN000128: Infinispan version: Infinispan 'Pagoa' 5.0.1.FINAL
03-Jan-2013 15:30:57 org.infinispan.remoting.rpc.RpcManagerImpl retrieveState
INFO: ISPN000074: Trying to fetch state from FMF-V3-0790-52063
03-Jan-2013 15:31:16 org.infinispan.remoting.rpc.RpcManagerImpl retrieveState
INFO: ISPN000076: Successfully retrieved and applied state from FMF-V3-0790-52063
03-Jan-2013 15:31:16 org.infinispan.factories.AbstractComponentRegistry internalStart
INFO: ISPN000128: Infinispan version: Infinispan 'Pagoa' 5.0.1.FINAL
03-Jan-2013 15:31:16 org.infinispan.manager.DefaultCacheManager getCache
WARN: ISPN000156: You are not starting all your caches at the same time. This can lead to problems as asymmetric clusters are not supported, see ISPN-658. We recommend using EmbeddedCacheManager.startCaches() to start all your caches upfront.
03-Jan-2013 15:31:16 org.infinispan.util.logging.Log_$logger info
INFO: Using a batchMode transaction manager
03-Jan-2013 15:31:16 org.infinispan.remoting.InboundInvocationHandlerImpl handle
INFO: ISPN000067: Will try and wait for the cache LuceneIndexesLocking to start
03-Jan-2013 15:31:16 org.infinispan.remoting.InboundInvocationHandlerImpl handle
INFO: ISPN000067: Will try and wait for the cache LuceneIndexesLocking to start
03-Jan-2013 15:31:16 org.infinispan.remoting.InboundInvocationHandlerImpl handle
INFO: ISPN000067: Will try and wait for the cache LuceneIndexesData to start
03-Jan-2013 15:31:16 org.infinispan.remoting.rpc.RpcManagerImpl retrieveState
INFO: ISPN000074: Trying to fetch state from FMF-V3-0790-52063
03-Jan-2013 15:31:16 org.infinispan.remoting.InboundInvocationHandlerImpl handle
INFO: ISPN000067: Will try and wait for the cache LuceneIndexesData to start
03-Jan-2013 15:31:36 org.infinispan.remoting.rpc.RpcManagerImpl retrieveState
INFO: ISPN000076: Successfully retrieved and applied state from FMF-V3-0790-52063
03-Jan-2013 15:31:36 org.infinispan.factories.AbstractComponentRegistry internalStart
INFO: ISPN000128: Infinispan version: Infinispan 'Pagoa' 5.0.1.FINAL
03-Jan-2013 15:31:36 org.infinispan.manager.DefaultCacheManager getCache
WARN: ISPN000156: You are not starting all your caches at the same time. This can lead to problems as asymmetric clusters are not supported, see ISPN-658. We recommend using EmbeddedCacheManager.startCaches() to start all your caches upfront.
03-Jan-2013 15:31:36 org.infinispan.remoting.InboundInvocationHandlerImpl handle
INFO: ISPN000067: Will try and wait for the cache LuceneIndexesLocking to start
03-Jan-2013 15:31:36 org.infinispan.util.logging.Log_$logger info
INFO: Using a batchMode transaction manager
03-Jan-2013 15:31:36 org.infinispan.remoting.InboundInvocationHandlerImpl handle
INFO: ISPN000067: Will try and wait for the cache LuceneIndexesLocking to start
03-Jan-2013 15:31:36 org.infinispan.remoting.rpc.RpcManagerImpl retrieveState
INFO: ISPN000074: Trying to fetch state from FMF-V3-0790-52063
03-Jan-2013 15:31:45 org.infinispan.remoting.transport.jgroups.JGroupsTransport setState
ERROR: ISPN000096: Caught while requesting or applying state
org.infinispan.statetransfer.StateTransferException: java.net.SocketException: Connection reset
at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:292)
at org.infinispan.remoting.InboundInvocationHandlerImpl.applyState(InboundInvocationHandlerImpl.java:244)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.setState(JGroupsTransport.java:607)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:711)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)
I have a number of questions:
Question 1: Is there a "correct" way to start multiple nodes - a variable number, depending on how many happen to be available at the time - so that all nodes reliably share their state at start-up and these "Replication timeout" errors are avoided?
Question 2: What is the best way to determine that the cache already contains index data and that re-indexing is not worthwhile? At present I'm checking whether the size of the "LuceneIndexesLocking" cache is greater than 0. I've looked at a number of the other exposed properties but cannot see anything obvious to query.
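In code, that check currently looks something like this (sketch; cacheManager is assumed to be the EmbeddedCacheManager backing the Infinispan directory provider):
Code:
import org.infinispan.Cache;
import org.infinispan.manager.EmbeddedCacheManager;

public class IndexPresenceCheck {

    // Current heuristic: treat a non-empty LuceneIndexesLocking cache as
    // "index already exists, skip re-indexing".
    public static boolean indexAppearsPopulated(EmbeddedCacheManager cacheManager) {
        Cache<?, ?> lockingCache = cacheManager.getCache("LuceneIndexesLocking");
        return lockingCache.size() > 0;
    }
}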
Question 3: Currently the caches are created automatically via the Hibernate/Infinispan configuration files. With this approach I get the warning:
Quote:
WARN: ISPN000156: You are not starting all your caches at the same time. This can lead to problems as asymmetric clusters are not supported, see ISPN-658....
Is it preferable to create the CacheManager programmatically, or should I let Hibernate/Infinispan create it as part of start-up?
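To make the question concrete, this is roughly what I imagine the programmatic route would look like (an untested sketch: it builds the CacheManager from the same XML file and starts all three index caches up front, as the warning recommends):
Code:
import java.io.IOException;

import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class CacheManagerBootstrap {

    // Untested sketch: create the CacheManager ourselves and start the three
    // index caches up front (per the ISPN000156 advice), rather than letting
    // Hibernate Search / Infinispan create them lazily.
    public static EmbeddedCacheManager start() throws IOException {
        EmbeddedCacheManager cacheManager =
                new DefaultCacheManager("hibernatesearch-infinispan.xml");
        cacheManager.startCaches("LuceneIndexesMetadata",
                                 "LuceneIndexesData",
                                 "LuceneIndexesLocking");
        return cacheManager;
    }
}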
Thanks in advance,
Rob.
Hibernate.cfg.xml:
Quote:
<?xml version='1.0' encoding='utf-8'?>
<!-- Configuration for ePortal static data Hibernate layer. -->
<!DOCTYPE hibernate-configuration PUBLIC
"-//Hibernate/Hibernate Configuration DTD 3.0//EN"
"classpath://hibernate-configuration-3.0.dtd">
<hibernate-configuration>
<session-factory>
<!-- Database connection settings -->
<property name="connection.datasource">java:/comp/env/jdbc/blah</property>
<!-- JDBC connection pool (use the built-in) -->
<property name="connection.pool_size">1</property>
<!-- SQL dialect -->
<property name="dialect">org.hibernate.dialect.Oracle10gDialect</property>
<!-- Disable the second-level cache -->
<property name="cache.provider_class">org.hibernate.cache.NoCacheProvider</property>
<!-- Echo all executed SQL to stdout -->
<property name="show_sql">true</property>
<!-- Control sessions at thread level -->
<property name="current_session_context_class">thread</property>
<!-- HibernateSearch settings -->
<property name="hibernate.search.default.directory_provider">infinispan</property>
<!-- Infinispan settings -->
<property name="hibernate.search.infinispan.configuration_resourcename">hibernatesearch-infinispan.xml</property>
<property name="hibernate.search.default.locking_cachename">LuceneIndexesLocking</property>
<property name="hibernate.search.default.data_cachename">LuceneIndexesData</property>
<property name="hibernate.search.default.metadata_cachename">LuceneIndexesMetadata</property>
<property name="hibernate.search.default.exclusive_index_use">false</property>
<property name="hibernate.search.default.locking_strategy">native</property>
</session-factory>
</hibernate-configuration>
hibernatesearch-infinispan.xml:
Quote:
<?xml version="1.0" encoding="UTF-8"?>
<infinispan
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:infinispan:config:5.0 http://www.infinispan.org/schemas/infinispan-config-5.0.xsd"
xmlns="urn:infinispan:config:5.0">
<!-- *************************** -->
<!-- System-wide global settings -->
<!-- *************************** -->
<global>
<!-- Duplicate domains are allowed so that multiple deployments with default configuration
of Hibernate Search applications work - if possible it would be better to use JNDI to share
the CacheManager across applications -->
<globalJmxStatistics
enabled="true"
cacheManagerName="HibernateSearch"
allowDuplicateDomains="true" />
<!-- If the transport is omitted, there is no way to create distributed or clustered
caches. There is no added cost to defining a transport but not creating a cache that uses one,
since the transport is created and initialized lazily. -->
<transport
clusterName="HibernateSearch-Infinispan-cluster"
distributedSyncTimeout="20000">
<!-- Note that the JGroups transport uses sensible defaults if no configuration
property is defined. See the JGroupsTransport javadocs for more flags -->
<properties>
<property name="configurationFile" value="infinispan-jgroups.xml" />
</properties>
</transport>
<!-- Used to register JVM shutdown hooks. hookBehavior: DEFAULT, REGISTER, DONT_REGISTER.
Hibernate Search takes care to stop the CacheManager so registering is not needed -->
<shutdown
hookBehavior="DONT_REGISTER" />
</global>
<!-- *************************** -->
<!-- Default "template" settings -->
<!-- *************************** -->
<default>
<locking
lockAcquisitionTimeout="20000"
writeSkewCheck="false"
concurrencyLevel="500"
useLockStriping="false" />
<!-- Invocation batching is required for use with the Lucene Directory -->
<invocationBatching
enabled="true" />
<!-- This element specifies that the cache is clustered. modes supported: distribution
(d), replication (r) or invalidation (i). Don't use invalidation to store Lucene indexes (as
with Hibernate Search DirectoryProvider). Replication is recommended for best performance of
Lucene indexes, but make sure you have enough memory to store the index in your heap.
Also distribution scales much better than replication on high number of nodes in the cluster. -->
<clustering
mode="replication">
<!-- Prefer loading all data at startup than later -->
<stateRetrieval
timeout="20000"
logFlushTimeout="30000"
fetchInMemoryState="true"
alwaysProvideInMemoryState="true" />
<!-- Network calls are synchronous by default -->
<sync
replTimeout="20000" />
</clustering>
<jmxStatistics
enabled="false" />
<eviction
maxEntries="-1"
strategy="NONE" />
<expiration
maxIdle="-1" />
</default>
<!-- ******************************************************************************* -->
<!-- Individually configured "named" caches. -->
<!-- -->
<!-- While default configuration happens to be fine with similar settings across the -->
<!-- three caches, they should generally be different in a production environment. -->
<!-- -->
<!-- Current settings could easily lead to OutOfMemory exception as a CacheStore -->
<!-- should be enabled, and maybe distribution is desired. -->
<!-- ******************************************************************************* -->
<!-- *************************************** -->
<!-- Cache to store Lucene's file metadata -->
<!-- *************************************** -->
<namedCache
name="LuceneIndexesMetadata">
<clustering
mode="replication">
<stateRetrieval
fetchInMemoryState="true"
logFlushTimeout="30000" />
<sync
replTimeout="20000" />
</clustering>
</namedCache>
<!-- **************************** -->
<!-- Cache to store Lucene data -->
<!-- **************************** -->
<namedCache
name="LuceneIndexesData">
<clustering
mode="replication">
<stateRetrieval
fetchInMemoryState="true"
logFlushTimeout="30000" />
<sync
replTimeout="20000" />
</clustering>
</namedCache>
<!-- ***************************** -->
<!-- Cache to store Lucene locks -->
<!-- ***************************** -->
<namedCache
name="LuceneIndexesLocking">
<clustering
mode="replication">
<stateRetrieval
fetchInMemoryState="true"
logFlushTimeout="30000" />
<sync
replTimeout="20000" />
</clustering>
</namedCache>
</infinispan>
infinispan-jgroups.xml:
Quote:
<!-- TCP based stack, with flow control and message bundling. This is usually used when IP
multicasting cannot be used in a network, e.g. because it is disabled (routers discard multicast).
Note that TCP.bind_addr and TCPPING.initial_hosts should be set, possibly via system properties, e.g.
-Djgroups.bind_addr=192.168.5.2 and -Djgroups.tcpping.initial_hosts=192.168.5.2[7800]
author: Bela Ban
version: $Id: tcp.xml,v 1.25 2008/09/25 14:26:07 belaban Exp $
-->
<config>
<TCP bind_port="7800" loopback="true" recv_buf_size="20000000"
send_buf_size="640000" discard_incompatible_packets="true"
max_bundle_size="64000" max_bundle_timeout="30" enable_bundling="true"
use_send_queues="false" sock_conn_timeout="300"
skip_suspected_members="true" thread_pool.enabled="true"
thread_pool.min_threads="1" thread_pool.max_threads="25"
thread_pool.keep_alive_time="5000" thread_pool.queue_enabled="false"
thread_pool.queue_max_size="100" thread_pool.rejection_policy="run"
oob_thread_pool.enabled="true" oob_thread_pool.min_threads="1"
oob_thread_pool.max_threads="8" oob_thread_pool.keep_alive_time="5000"
oob_thread_pool.queue_enabled="false"
oob_thread_pool.queue_max_size="100"
oob_thread_pool.rejection_policy="run" />
<TCPPING timeout="3000"
initial_hosts="${jgroups.tcpping.initial_hosts:10.214.32.67[7800],10.184.217.38[7800],10.184.217.38[7801]}"
port_range="2" num_initial_members="3" />
<MERGE2 max_interval="100000" min_interval="20000" />
<FD_SOCK />
<FD timeout="10000" max_tries="5" shun="true" />
<VERIFY_SUSPECT timeout="1500" />
<BARRIER />
<pbcast.NAKACK use_mcast_xmit="false" gc_lag="0"
retransmit_timeout="300,600,1200,2400,4800"
discard_delivered_msgs="true" />
<UNICAST timeout="300,600,1200" />
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
max_bytes="400000" />
<VIEW_SYNC avg_send_interval="60000" />
<pbcast.GMS print_local_addr="true" join_timeout="3000" shun="true"
view_bundling="true" />
<FC max_credits="2000000" min_threshold="0.10" />
<FRAG2 frag_size="60000" />
<pbcast.STREAMING_STATE_TRANSFER />
</config>