Hi,
I'm using Hibernate (4.0.0.Final) with Infinispan and JGroups to create a replicated Lucene index cache across a variable number of machines (typically three, but it depends on availability). Currently, the logic in the service layer is to index all of the required search entities - using MassIndexer.startAndWait() - only if the data does not already exist in the cache. Typically this will be because all nodes have been taken offline for an upgrade. The re-indexing process may take up to two minutes. Any node started after the index has been created should not perform any re-indexing, as that would be unnecessary. There is no need (or desire) for file-system persistence.
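For reference, that service-layer step looks roughly like this (a simplified sketch; the class and method names and the indexAlreadyPopulated flag are just illustrative - how that flag is determined is the subject of Question 2 below):
Code:
import org.hibernate.Session;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;

public class IndexBootstrap {

    // Simplified sketch: re-index only when no index data is already present
    // in the replicated caches (detecting that is Question 2 below).
    public void rebuildIndexIfNeeded(Session session, boolean indexAlreadyPopulated)
            throws InterruptedException {
        if (!indexAlreadyPopulated) {
            FullTextSession fullTextSession = Search.getFullTextSession(session);
            // Indexes every @Indexed entity and blocks until complete
            // (this is the step that takes up to ~2 minutes).
            fullTextSession.createIndexer().startAndWait();
        }
    }
}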
Everything starts up fine if I can guarantee the order and timing in which the clustered nodes start. If I start node one, wait until the indexing has completed and then start the second node, everything appears to work. If I start them at roughly the same time, a number of errors are raised that appear to be related to synchronizing the nodes/clustered caches. On the node that was started first:
Quote:
ERROR: ISPN000136: Execution error
org.infinispan.util.concurrent.TimeoutException: Replication timeout for FMDX2646-52133
at org.infinispan.remoting.transport.AbstractTransport.parseResponseAndAddToResponseList(AbstractTransport.java:71)
org.infinispan.util.concurrent.TimeoutException: Replication timeout for FMDX2646-52133
at org.infinispan.remoting.transport.AbstractTransport.parseResponseAndAddToResponseList(AbstractTransport.java:71)
On the node that was started second:
Quote:
-------------------------------------------------------------------
GMS: address=FMDX2646-52133, cluster=HibernateSearch-Infinispan-cluster, physical address=10.214.32.67:7800
-------------------------------------------------------------------
03-Jan-2013 15:30:57 org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted
INFO: ISPN000094: Received new cluster view: [FMF-V3-0790-52063|2] [FMF-V3-0790-52063, FMF-V3-0790-63321, FMDX2646-52133]
03-Jan-2013 15:30:57 org.infinispan.remoting.transport.jgroups.JGroupsTransport startJGroupsChannelIfNeeded
INFO: ISPN000079: Cache local address is FMDX2646-52133, physical addresses are [10.214.32.67:7800]
03-Jan-2013 15:30:57 org.infinispan.factories.AbstractComponentRegistry internalStart
INFO: ISPN000128: Infinispan version: Infinispan 'Pagoa' 5.0.1.FINAL
03-Jan-2013 15:30:57 org.infinispan.remoting.rpc.RpcManagerImpl retrieveState
INFO: ISPN000074: Trying to fetch state from FMF-V3-0790-52063
03-Jan-2013 15:31:16 org.infinispan.remoting.rpc.RpcManagerImpl retrieveState
INFO: ISPN000076: Successfully retrieved and applied state from FMF-V3-0790-52063
03-Jan-2013 15:31:16 org.infinispan.factories.AbstractComponentRegistry internalStart
INFO: ISPN000128: Infinispan version: Infinispan 'Pagoa' 5.0.1.FINAL
03-Jan-2013 15:31:16 org.infinispan.manager.DefaultCacheManager getCache
WARN: ISPN000156: You are not starting all your caches at the same time. This can lead to problems as asymmetric clusters are not supported, see ISPN-658. We recommend using EmbeddedCacheManager.startCaches() to start all your caches upfront.
03-Jan-2013 15:31:16 org.infinispan.util.logging.Log_$logger info
INFO: Using a batchMode transaction manager
03-Jan-2013 15:31:16 org.infinispan.remoting.InboundInvocationHandlerImpl handle
INFO: ISPN000067: Will try and wait for the cache LuceneIndexesLocking to start
03-Jan-2013 15:31:16 org.infinispan.remoting.InboundInvocationHandlerImpl handle
INFO: ISPN000067: Will try and wait for the cache LuceneIndexesLocking to start
03-Jan-2013 15:31:16 org.infinispan.remoting.InboundInvocationHandlerImpl handle
INFO: ISPN000067: Will try and wait for the cache LuceneIndexesData to start
03-Jan-2013 15:31:16 org.infinispan.remoting.rpc.RpcManagerImpl retrieveState
INFO: ISPN000074: Trying to fetch state from FMF-V3-0790-52063
03-Jan-2013 15:31:16 org.infinispan.remoting.InboundInvocationHandlerImpl handle
INFO: ISPN000067: Will try and wait for the cache LuceneIndexesData to start
03-Jan-2013 15:31:36 org.infinispan.remoting.rpc.RpcManagerImpl retrieveState
INFO: ISPN000076: Successfully retrieved and applied state from FMF-V3-0790-52063
03-Jan-2013 15:31:36 org.infinispan.factories.AbstractComponentRegistry internalStart
INFO: ISPN000128: Infinispan version: Infinispan 'Pagoa' 5.0.1.FINAL
03-Jan-2013 15:31:36 org.infinispan.manager.DefaultCacheManager getCache
WARN: ISPN000156: You are not starting all your caches at the same time. This can lead to problems as asymmetric clusters are not supported, see ISPN-658. We recommend using EmbeddedCacheManager.startCaches() to start all your caches upfront.
03-Jan-2013 15:31:36 org.infinispan.remoting.InboundInvocationHandlerImpl handle
INFO: ISPN000067: Will try and wait for the cache LuceneIndexesLocking to start
03-Jan-2013 15:31:36 org.infinispan.util.logging.Log_$logger info
INFO: Using a batchMode transaction manager
03-Jan-2013 15:31:36 org.infinispan.remoting.InboundInvocationHandlerImpl handle
INFO: ISPN000067: Will try and wait for the cache LuceneIndexesLocking to start
03-Jan-2013 15:31:36 org.infinispan.remoting.rpc.RpcManagerImpl retrieveState
INFO: ISPN000074: Trying to fetch state from FMF-V3-0790-52063
03-Jan-2013 15:31:45 org.infinispan.remoting.transport.jgroups.JGroupsTransport setState
ERROR: ISPN000096: Caught while requesting or applying state
org.infinispan.statetransfer.StateTransferException: java.net.SocketException: Connection reset
at org.infinispan.statetransfer.StateTransferManagerImpl.applyState(StateTransferManagerImpl.java:292)
at org.infinispan.remoting.InboundInvocationHandlerImpl.applyState(InboundInvocationHandlerImpl.java:244)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.setState(JGroupsTransport.java:607)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:711)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)
I have a number of questions:
Question 1: Is there a "correct" way to start multiple nodes - a variable number, depending on how many happen to be available at the time - so that all nodes reliably share their state at start-up and these "Replication timeout" errors are avoided?
Question 2: What is the best way to determine that the cache already contains index data and that re-indexing is not worthwhile? At present I'm checking whether the size of the "LuceneIndexesLocking" cache is greater than 0. I've looked at a number of the other exposed properties but cannot see anything obvious to query.
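In code, that check currently looks something like this (sketch; cacheManager is assumed to be the EmbeddedCacheManager backing the Infinispan directory provider):
Code:
import org.infinispan.Cache;
import org.infinispan.manager.EmbeddedCacheManager;

public class IndexPresenceCheck {

    // Current heuristic: treat a non-empty LuceneIndexesLocking cache as
    // "index already exists, skip re-indexing".
    public static boolean indexAppearsPopulated(EmbeddedCacheManager cacheManager) {
        Cache<?, ?> lockingCache = cacheManager.getCache("LuceneIndexesLocking");
        return lockingCache.size() > 0;
    }
}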
Question 3: Currently the caches are created automatically via the Hibernate/Infinispan configuration files. With this approach I get the warning:
Quote:
WARN: ISPN000156: You are not starting all your caches at the same time. This can lead to problems as asymmetric clusters are not supported, see ISPN-658....
Is it preferable to create the CacheManager programmatically, or should I let Hibernate/Infinispan create it as part of start-up?
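To make the question concrete, this is roughly what I imagine the programmatic route would look like (an untested sketch: it builds the CacheManager from the same XML file and starts all three index caches up front, as the warning recommends):
Code:
import java.io.IOException;

import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class CacheManagerBootstrap {

    // Untested sketch: create the CacheManager ourselves and start the three
    // index caches up front (per the ISPN000156 advice), rather than letting
    // Hibernate Search / Infinispan create them lazily.
    public static EmbeddedCacheManager start() throws IOException {
        EmbeddedCacheManager cacheManager =
                new DefaultCacheManager("hibernatesearch-infinispan.xml");
        cacheManager.startCaches("LuceneIndexesMetadata",
                                 "LuceneIndexesData",
                                 "LuceneIndexesLocking");
        return cacheManager;
    }
}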
Thanks in advance,
Rob.
Hibernate.cfg.xml:
Quote:
<?xml version='1.0' encoding='utf-8'?>
<!-- Configuration for ePortal static data Hibernate layer. -->
<!DOCTYPE hibernate-configuration PUBLIC
"-//Hibernate/Hibernate Configuration DTD 3.0//EN"
"classpath://hibernate-configuration-3.0.dtd">
<hibernate-configuration>
<session-factory>
<!-- Database connection settings -->
<property name="connection.datasource">java:/comp/env/jdbc/blah</property>
<!-- JDBC connection pool (use the built-in) -->
<property name="connection.pool_size">1</property>
<!-- SQL dialect -->
<property name="dialect">org.hibernate.dialect.Oracle10gDialect</property>
<!-- Disable the second-level cache -->
<property name="cache.provider_class">org.hibernate.cache.NoCacheProvider</property>
<!-- Echo all executed SQL to stdout -->
<property name="show_sql">true</property>
<!-- Control sessions at thread level -->
<property name="current_session_context_class">thread</property>
<!-- HibernateSearch settings -->
<property name="hibernate.search.default.directory_provider">infinispan</property>
<!-- Infinispan settings -->
<property name="hibernate.search.infinispan.configuration_resourcename">hibernatesearch-infinispan.xml</property>
<property name="hibernate.search.default.locking_cachename">LuceneIndexesLocking</property>
<property name="hibernate.search.default.data_cachename">LuceneIndexesData</property>
<property name="hibernate.search.default.metadata_cachename">LuceneIndexesMetadata</property>
<property name="hibernate.search.default.exclusive_index_use">false</property>
<property name="hibernate.search.default.locking_strategy">native</property>
</session-factory>
</hibernate-configuration>
hibernatesearch-infinispan.xml:
Quote:
<?xml version="1.0" encoding="UTF-8"?>
<infinispan
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:infinispan:config:5.0 http://www.infinispan.org/schemas/infinispan-config-5.0.xsd"
xmlns="urn:infinispan:config:5.0">
<!-- *************************** -->
<!-- System-wide global settings -->
<!-- *************************** -->
<global>
<!-- Duplicate domains are allowed so that multiple deployments with default configuration
of Hibernate Search applications work - if possible it would be better to use JNDI to share
the CacheManager across applications -->
<globalJmxStatistics
enabled="true"
cacheManagerName="HibernateSearch"
allowDuplicateDomains="true" />
<!-- If the transport is omitted, there is no way to create distributed or clustered
caches. There is no added cost to defining a transport but not creating a cache that uses one,
since the transport is created and initialized lazily. -->
<transport
clusterName="HibernateSearch-Infinispan-cluster"
distributedSyncTimeout="20000">
<!-- Note that the JGroups transport uses sensible defaults if no configuration
property is defined. See the JGroupsTransport javadocs for more flags -->
<properties>
<property name="configurationFile" value="infinispan-jgroups.xml" />
</properties>
</transport>
<!-- Used to register JVM shutdown hooks. hookBehavior: DEFAULT, REGISTER, DONT_REGISTER.
Hibernate Search takes care to stop the CacheManager so registering is not needed -->
<shutdown
hookBehavior="DONT_REGISTER" />
</global>
<!-- *************************** -->
<!-- Default "template" settings -->
<!-- *************************** -->
<default>
<locking
lockAcquisitionTimeout="20000"
writeSkewCheck="false"
concurrencyLevel="500"
useLockStriping="false" />
<!-- Invocation batching is required for use with the Lucene Directory -->
<invocationBatching
enabled="true" />
<!-- This element specifies that the cache is clustered. modes supported: distribution
(d), replication (r) or invalidation (i). Don't use invalidation to store Lucene indexes (as
with Hibernate Search DirectoryProvider). Replication is recommended for best performance of
Lucene indexes, but make sure you have enough memory to store the index in your heap.
Also distribution scales much better than replication on high number of nodes in the cluster. -->
<clustering
mode="replication">
<!-- Prefer loading all data at startup than later -->
<stateRetrieval
timeout="20000"
logFlushTimeout="30000"
fetchInMemoryState="true"
alwaysProvideInMemoryState="true" />
<!-- Network calls are synchronous by default -->
<sync
replTimeout="20000" />
</clustering>
<jmxStatistics
enabled="false" />
<eviction
maxEntries="-1"
strategy="NONE" />
<expiration
maxIdle="-1" />
</default>
<!-- ******************************************************************************* -->
<!-- Individually configured "named" caches. -->
<!-- -->
<!-- While default configuration happens to be fine with similar settings across the -->
<!-- three caches, they should generally be different in a production environment. -->
<!-- -->
<!-- Current settings could easily lead to OutOfMemory exception as a CacheStore -->
<!-- should be enabled, and maybe distribution is desired. -->
<!-- ******************************************************************************* -->
<!-- *************************************** -->
<!-- Cache to store Lucene's file metadata -->
<!-- *************************************** -->
<namedCache
name="LuceneIndexesMetadata">
<clustering
mode="replication">
<stateRetrieval
fetchInMemoryState="true"
logFlushTimeout="30000" />
<sync
replTimeout="20000" />
</clustering>
</namedCache>
<!-- **************************** -->
<!-- Cache to store Lucene data -->
<!-- **************************** -->
<namedCache
name="LuceneIndexesData">
<clustering
mode="replication">
<stateRetrieval
fetchInMemoryState="true"
logFlushTimeout="30000" />
<sync
replTimeout="20000" />
</clustering>
</namedCache>
<!-- ***************************** -->
<!-- Cache to store Lucene locks -->
<!-- ***************************** -->
<namedCache
name="LuceneIndexesLocking">
<clustering
mode="replication">
<stateRetrieval
fetchInMemoryState="true"
logFlushTimeout="30000" />
<sync
replTimeout="20000" />
</clustering>
</namedCache>
</infinispan>
infinispan-jgroups.xml:
Quote:
<!-- TCP based stack, with flow control and message bundling. This is usually used when IP
multicasting cannot be used in a network, e.g. because it is disabled (routers discard multicast).
Note that TCP.bind_addr and TCPPING.initial_hosts should be set, possibly via system properties, e.g.
-Djgroups.bind_addr=192.168.5.2 and -Djgroups.tcpping.initial_hosts=192.168.5.2[7800]
author: Bela Ban
version: $Id: tcp.xml,v 1.25 2008/09/25 14:26:07 belaban Exp $
-->
<config>
<TCP bind_port="7800" loopback="true" recv_buf_size="20000000"
send_buf_size="640000" discard_incompatible_packets="true"
max_bundle_size="64000" max_bundle_timeout="30" enable_bundling="true"
use_send_queues="false" sock_conn_timeout="300"
skip_suspected_members="true" thread_pool.enabled="true"
thread_pool.min_threads="1" thread_pool.max_threads="25"
thread_pool.keep_alive_time="5000" thread_pool.queue_enabled="false"
thread_pool.queue_max_size="100" thread_pool.rejection_policy="run"
oob_thread_pool.enabled="true" oob_thread_pool.min_threads="1"
oob_thread_pool.max_threads="8" oob_thread_pool.keep_alive_time="5000"
oob_thread_pool.queue_enabled="false"
oob_thread_pool.queue_max_size="100"
oob_thread_pool.rejection_policy="run" />
<TCPPING timeout="3000"
initial_hosts="${jgroups.tcpping.initial_hosts:10.214.32.67[7800],10.184.217.38[7800],10.184.217.38[7801]}"
port_range="2" num_initial_members="3" />
<MERGE2 max_interval="100000" min_interval="20000" />
<FD_SOCK />
<FD timeout="10000" max_tries="5" shun="true" />
<VERIFY_SUSPECT timeout="1500" />
<BARRIER />
<pbcast.NAKACK use_mcast_xmit="false" gc_lag="0"
retransmit_timeout="300,600,1200,2400,4800"
discard_delivered_msgs="true" />
<UNICAST timeout="300,600,1200" />
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
max_bytes="400000" />
<VIEW_SYNC avg_send_interval="60000" />
<pbcast.GMS print_local_addr="true" join_timeout="3000" shun="true"
view_bundling="true" />
<FC max_credits="2000000" min_threshold="0.10" />
<FRAG2 frag_size="60000" />
<pbcast.STREAMING_STATE_TRANSFER />
</config>