Hibernate Search - Infinispan - CheckIndex - How to recover?

muhammadilyas · **Joined:** Sun Aug 16, 2015 3:21 am **Posts:** 27

Hi,

If we use Infinispan and store the index in a DB, can we run CheckIndex if any segments get corrupted?

Secondly, if we are able to remove the corrupted segments, can we see which keys we need to re-index?

And which versions have this feature?

The problem is that if you have millions of documents and some segment gets corrupted because of an EOFException or anything else, then when you restart your node (master or slave) it will load from the underlying cache store and fail to even start.

With an RDBMS, if some records are faulty we can at least get up and running and then fix them... How can we achieve something similar here, so that we go live and then fix the segments somehow, e.g. with Lucene CheckIndex?

I might be missing something very fundamental; please explain if this is the case.

Thanks in advance!

sanne.grinovero · **Posted:** Sat Jan 14, 2017 7:01 pm

Hello,

unlike an RDBMS, the index needs to be strictly in sync with your database to be useful. Data loss in the index is acceptable, as it can be rebuilt from the RDBMS: that puts it in a very different light from the database, whose job is to work hard to avoid data loss.

The best strategy to maintain this synchronization is to rebuild the index from the database on errors: recovering a corrupt index would still imply you had some downtime in which updates might have been missed.

So I recommend wiping your index and rebuilding it using the MassIndexer. Something I can't stress enough is to make sure you have tuned the MassIndexer so that recovery can be performed quickly: dedicating a couple of engineering days to tuning the MassIndexer properly is always very useful, both for further development (you need to rebuild the index when updating your indexing options) and as a proper plan for disaster recovery.

HTH

muhammadilyas · **Joined:** Sun Aug 16, 2015 3:21 am **Posts:** 27

Hmmmm

I am sure the MassIndexer does not always have to rebuild everything from scratch. For example, if we have a backup of the indexes taken when we last restarted the server (let's say a month ago), then we can restore the backup and run the MassIndexer only for changes since that restart date, on top of the existing index...

We can do that easily by restricting data load queries to only consider changes since that date.

This means we can be up and running more quickly if we don't need to wipe the whole index, which can hold a billion documents. Reloading a month's changes would rarely amount to more than 100K-300K documents.

I am waiting for your reply, as it will help us design this properly, so a quick response would be appreciated if possible. Many thanks in advance.

sanne.grinovero · **Posted:** Mon Jan 16, 2017 11:02 am

Quote:

For example, if we have a backup of the indexes taken when we last restarted the server (let's say a month ago), then we can restore the backup and run the MassIndexer only for changes since that restart date, on top of the existing index...

I'm not sure which MassIndexer can do that. Not the one I implemented which is now in Hibernate Search, unless you're restricting what Hibernate can see, or mapping a filtered view for these purposes?

That might be an interesting experiment, but I'm not sure how to help you to guarantee that the resulting index will be in sync, unless you can prevent other changes from happening concurrently. We've had some discussions on the mailing list about using changeset ids, timestamps or transaction ids, but no single solution is safe for general purpose usage. I agree you might be able to build something which works fine for your specific requirements.

Back to your original question about using CheckIndex: sure, you can run it, but you'll have to restore your backup in case you find non-recoverable issues, as it's not possible to identify from the segment id or the filename which keys need to be reindexed. I suspect one possible solution would be to check, for each key in the database, whether there's a matching document in any other segment, and skip the key if there is, as there should never be a duplicate. I'm not sure you can implement this efficiently enough to be faster than reindexing it all though, as you'll still need to iterate over at least all ids: this pre-filtering approach could be a good idea if the indexer has to produce complex Lucene Documents and/or load complex object graphs.
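A minimal sketch of that pre-filtering step, in plain Java. It assumes you have already collected the set of ids that still have a document in the surviving segments (how you collect them is not shown); `ReindexPreFilter` and `idsToReindex` are hypothetical names for illustration, not part of Hibernate Search or Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of Hibernate Search or Lucene. */
final class ReindexPreFilter {

    private ReindexPreFilter() {
    }

    /**
     * Walks over every database id and keeps only those without a matching
     * document in the surviving segments. The result preserves the order in
     * which the database ids were supplied.
     */
    static <K> List<K> idsToReindex(Iterable<K> allDatabaseIds, Set<K> idsStillIndexed) {
        List<K> missing = new ArrayList<>();
        for (K id : allDatabaseIds) {
            // An id that is still indexed is skipped: there should never be a duplicate.
            if (!idsStillIndexed.contains(id)) {
                missing.add(id);
            }
        }
        return missing;
    }
}
```

Note that the loop still visits every id once, so any saving comes only from not loading entities and not building Documents for ids that are already indexed.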

HTH

muhammadilyas · **Joined:** Sun Aug 16, 2015 3:21 am **Posts:** 27

Quote:

I'm not sure which MassIndexer can do that. Not the one I implemented which is now in Hibernate Search, unless you're restricting what Hibernate can see, or mapping a filtered view for these purposes?

Yes, we are using DB views, and those can prevent Hibernate from seeing data outside the given date range.

Quote:

That might be an interesting experiment, but I'm not sure how to help you to guarantee that the resulting index will be in sync, unless you can prevent other changes from happening concurrently

Hmmmm... In this case it doesn't matter, as the translog table (which we created to keep the index log - JMS messages) only contains the keys of changed documents. Regardless of whether a new change comes in or not, it will load up-to-date data from the transactional database. If someone changes a document while the MassIndexer is running, it will be indexed twice, but the same data will be indexed. I hope the MassIndexer can run in parallel with normal index and search actions....
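To illustrate why replaying keys is safe to repeat, here is a minimal plain-Java sketch. The index is modelled as a `Map`, and `TranslogReplay`, `replay` and `loadCurrent` are hypothetical names, not Hibernate Search APIs. Removing entries whose row no longer exists is an extra assumption that is not described in the post.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical model of a key-only translog replay; the index is a plain Map here. */
final class TranslogReplay {

    private TranslogReplay() {
    }

    /**
     * Re-indexes each distinct changed key using the state the database holds
     * right now, and returns the number of distinct keys processed.
     */
    static <K, V> int replay(List<K> changedKeys, Function<K, V> loadCurrent, Map<K, V> index) {
        // The translog may list the same key several times; process each key once.
        Set<K> distinct = new LinkedHashSet<>(changedKeys);
        for (K key : distinct) {
            V current = loadCurrent.apply(key);
            if (current == null) {
                // Assumption: a missing row means the document should be removed.
                index.remove(key);
            } else {
                // Writing the current state is idempotent: indexing twice gives the same entry.
                index.put(key, current);
            }
        }
        return distinct.size();
    }
}
```

Because only keys are logged and the values are always read fresh, replaying a key a second time rewrites the same entry rather than applying a stale change.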

sanne.grinovero · **Posted:** Mon Jan 16, 2017 2:30 pm

Yes, that sounds good. And yes, the MassIndexer can run in parallel with normal operations. Just make sure to enable the option to not clear the index on job start (`purgeAllOnStart(false)`).

Interesting setup. Thanks for the feedback!