Rebuild a large index in a distributed way?

bshf · **Joined:** Tue Jun 23, 2009 1:36 pm **Posts:** 2

Hi,

I need to rebuild a large Lucene index after installing a new version of a third-party database. I have multiple nodes to distribute the indexing work across. The result should be a single index (or possibly a sharded one).

What is the best way to set this up with Hibernate Search? I've read "Hibernate Search in Action", but I didn't find a solution for this kind of problem.

I am thinking of creating one central program that puts primary key values on a queue. Each slave reads PK values from the queue, looks up the data in the database, and writes to a local index. Finally, the individual indexes are sent to a central node, which merges them into one final index (or a sharded index), again using JMS, as described in the book.

All suggestions are welcome!
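
The queue-and-workers idea above can be sketched in plain Java. This is only an illustration under loud assumptions: an in-JVM `LinkedBlockingQueue` stands in for the JMS queue, a `Map` stands in for each slave's local Lucene index, and `loadRow`, `run`, and `merge` are hypothetical names, not Hibernate Search or Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch only: an in-JVM queue replaces JMS, a Map replaces a local Lucene index.
class PkQueueSketch {
    private static final long POISON = -1L; // sentinel that tells a worker to stop

    // Hypothetical stand-in for the database lookup by primary key.
    static String loadRow(long pk) {
        return "row-" + pk;
    }

    // Puts primary keys 1..pkCount on a queue and lets `workers` threads consume them.
    // Returns one "local index" per worker.
    static List<Map<Long, String>> run(int pkCount, int workers) {
        BlockingQueue<Long> queue = new LinkedBlockingQueue<>();
        List<Map<Long, String>> localIndexes = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();

        for (int w = 0; w < workers; w++) {
            Map<Long, String> local = new HashMap<>(); // written by one worker only
            localIndexes.add(local);
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        long pk = queue.take();
                        if (pk == POISON) {
                            return;
                        }
                        local.put(pk, loadRow(pk));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads.add(t);
            t.start();
        }

        try {
            for (long pk = 1; pk <= pkCount; pk++) {
                queue.put(pk);
            }
            for (int w = 0; w < workers; w++) {
                queue.put(POISON); // one stop signal per worker
            }
            for (Thread t : threads) {
                t.join(); // join makes each worker's writes visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return localIndexes;
    }

    // The "central node" step: combine the per-worker results into one.
    static Map<Long, String> merge(List<Map<Long, String>> localIndexes) {
        Map<Long, String> merged = new HashMap<>();
        for (Map<Long, String> local : localIndexes) {
            merged.putAll(local);
        }
        return merged;
    }
}
```

Calling `PkQueueSketch.merge(PkQueueSketch.run(100, 4))` spreads 100 keys over 4 workers and combines the results. In a real setup the merge step would operate on Lucene index directories rather than maps, and that is where the extra cost of this architecture sits.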

Bram

bshf · **Joined:** Tue Jun 23, 2009 1:36 pm **Posts:** 2

I found some code to merge multiple indexes into one:
http://www.asteriosk.gr/blog/2009/03/31 ... e-indexes/

or just combine the files, as described here:
http://www.opensubscriber.com/message/l ... 03308.html

So I could let each "indexing slave" build its own index, then bring the files together and merge them with the code mentioned above.

The question is: which architecture (this one or the one in my previous post) performs best?

sanne.grinovero · **Posted:** Fri Jun 26, 2009 4:56 am

Hi bshf,
I've seen your situation several times and have been working for a long time on patches for Hibernate Search to rebuild the indexes as efficiently as I could, but I never considered the "multiple nodes" idea.
I'm doing it with threads on a single node, so I don't have merging problems. I got the idea when profiling the job: it's clear that the database round-trip is the major bottleneck, so even on a dual-core machine, using something like 30 parallel threads makes sense and gives a noticeable speed-up.
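
To illustrate why many threads help when the database round-trip dominates, here is a minimal sketch. It is not the patch discussed in this thread: `loadEntity` and `reindex` are hypothetical names, the `Thread.sleep` call merely simulates database latency, and a `ConcurrentHashMap` stands in for the single shared index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: one node, many threads, one shared "index", so nothing to merge.
class ThreadedReindexSketch {

    // Hypothetical stand-in for a slow database fetch; the sleep simulates latency.
    static String loadEntity(long id) {
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "entity-" + id;
    }

    // Loads entities 1..entityCount using a fixed pool of `threads` workers.
    static Map<Long, String> reindex(long entityCount, int threads) {
        Map<Long, String> index = new ConcurrentHashMap<>(); // safe for concurrent writes
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (long id = 1; id <= entityCount; id++) {
                final long current = id;
                pending.add(pool.submit(() -> {
                    // While this thread waits on the "database", others keep working.
                    index.put(current, loadEntity(current));
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any failure
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return index;
    }
}
```

For example, `ThreadedReindexSketch.reindex(300, 30)` uses the thread count mentioned above. Because each worker spends most of its time waiting rather than computing, the pool can usefully be much larger than the number of cores.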

The code is ready and working fine on our systems (in production), but it is not yet committed, as I'm still missing some unit tests (damn hard to design...).
If you would like to try it I can send you the patch, or I'll attach it to JIRA.
It requires the latest trunk of both Search and its dependencies (so hibernate-core 3.5-SNAPSHOT).

It would be very cool if you could try it: as I said, it works in our environment on our model, but nobody else has tested it. Would you like to beta-test it and give me feedback? I'll be glad to improve it if needed.

Sanne

sanne.grinovero · **Posted:** Mon Jun 29, 2009 6:29 pm

It's committed in trunk now, no need to send patches.