Hi Davide,
The near-real-time strategy might buffer writes indefinitely: it flushes when its buffers are full, but there is no time limit forcing a flush to happen.
Is it possible that for 3 weeks there was almost no write activity on your system?
Any significant write activity would have caused a flush to disk.
You might be interested in this feature, if you can live with asynchronous indexing:
- https://hibernate.atlassian.net/browse/HSEARCH-1693

That policy will trigger a flush periodically. We don't have a policy which applies the NRT optimisations and yet also flushes to disk periodically; it would be quite easy to add one though (a rough sketch follows after this list). Have a look at:
- org.hibernate.search.backend.impl.lucene.ScheduledCommitPolicy
- org.hibernate.search.backend.impl.lucene.NRTCommitPolicy
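To illustrate what such a hybrid policy would do, here is a rough sketch written directly against the Lucene API (Lucene 4/5 signatures) rather than against the actual CommitPolicy contract; all class and method names are made up for the example:

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

/**
 * Sketch only: keep the cheap NRT reads, but have a background task
 * commit to disk at a fixed interval so the un-flushed window is bounded.
 */
public class PeriodicallyCommittingNrt implements AutoCloseable {

	private final IndexWriter writer;
	private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

	public PeriodicallyCommittingNrt(IndexWriter writer, long intervalSeconds) {
		this.writer = writer;
		// Flush buffered changes to durable storage periodically,
		// no matter how little the RAM buffer is filled:
		scheduler.scheduleWithFixedDelay( () -> {
			try {
				writer.commit();
			}
			catch (IOException e) {
				// a real implementation should delegate to the backend error handler
				e.printStackTrace();
			}
		}, intervalSeconds, intervalSeconds, TimeUnit.SECONDS );
	}

	public DirectoryReader openNrtReader() throws IOException {
		// NRT reader: sees the buffered writes without forcing a disk flush
		return DirectoryReader.open( writer, true );
	}

	@Override
	public void close() throws IOException {
		scheduler.shutdown();
		writer.commit();
		writer.close();
	}
}

The interval gives you an upper bound on how much work can be lost, while searches still get the NRT benefits between commits.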
I am not sure this would have solved your problem though: the data which is not flushed to disk could contain critical metadata, and losing that might force Lucene to treat a large segment as potentially corrupted.
If you need the best reliability, don't use NRT; of course it will be quite a bit slower, as it will flush to disk a lot.
BTW the option "thread_pool.size" doesn't need 20 threads: with the latest design, 1 thread should be more than enough when using NRT.
Your "ram_buffer_size" is very low though; consider allowing it 128 MB or more, and then even the non-NRT backend might be quite fast.
Quote:
We thought that "near real time" is something close to "right now" and we're so surprised it could have involved results older than 3 weeks.
The "near real time" name is referring to its write performance, as perceived. To achieve this efficiency, it does avoid disk flushes which might lose data.
But I agree that 3 weeks is rather extreme; I had never heard of that, so it might be a case of bad luck with the wrong metadata being lost, combined with your application for some reason never needing to flush that little 1 MB buffer.
Also: wouldn't you have had to reindex everything even if you had lost only 1 minute of changes?
Thanks for the feedback!