Hi Davide,
The near-real-time strategy might buffer writes indefinitely: it flushes when its buffers are full, but there is no time limit forcing a flush to happen.
Is it possible that for 3 weeks there was almost no write activity on your system?
Any significant write activity would have caused a flush to disk.
You might be interested in this feature, if you can live with asynchronous indexing:
- https://hibernate.atlassian.net/browse/HSEARCH-1693

That policy will trigger a flush periodically. We don't have a policy which applies the NRT optimisations and yet also flushes to disk periodically; it would be quite easy to add one though (a rough sketch follows after this list). Have a look at:
- org.hibernate.search.backend.impl.lucene.ScheduledCommitPolicy
- org.hibernate.search.backend.impl.lucene.NRTCommitPolicy
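To illustrate what such a hybrid policy would do, here is a rough sketch written directly against the Lucene API (Lucene 4/5 signatures) rather than against the actual CommitPolicy contract; all class and method names are made up for the example:

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

/**
 * Sketch only: keep the cheap NRT reads, but have a background task
 * commit to disk at a fixed interval so the un-flushed window is bounded.
 */
public class PeriodicallyCommittingNrt implements AutoCloseable {

	private final IndexWriter writer;
	private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

	public PeriodicallyCommittingNrt(IndexWriter writer, long intervalSeconds) {
		this.writer = writer;
		// Flush buffered changes to durable storage periodically,
		// no matter how little the RAM buffer is filled:
		scheduler.scheduleWithFixedDelay( () -> {
			try {
				writer.commit();
			}
			catch (IOException e) {
				// a real implementation should delegate to the backend error handler
				e.printStackTrace();
			}
		}, intervalSeconds, intervalSeconds, TimeUnit.SECONDS );
	}

	public DirectoryReader openNrtReader() throws IOException {
		// NRT reader: sees the buffered writes without forcing a disk flush
		return DirectoryReader.open( writer, true );
	}

	@Override
	public void close() throws IOException {
		scheduler.shutdown();
		writer.commit();
		writer.close();
	}
}

The interval gives you an upper bound on how much work can be lost, while searches still get the NRT benefits between commits.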
I am not sure this would have solved your problem though: the data which is not flushed to disk could contain critical metadata, and losing that might force Lucene to treat a large segment as potentially corrupted.
If you need the best reliability, don't use NRT; of course it will be quite a bit slower, as it will flush to disk a lot.
BTW the option "thread_pool.size" doesn't need 20 threads: with the latest design, 1 thread should be more than enough when using NRT.
Your "ram_buffer_size" is very low though; consider allowing it 128 MB or more, and then even the non-NRT backend might be quite fast.
Quote:
We thought that "near real time" is something close to "right now" and we're so surprised it could have involved results older than 3 weeks.
The "near real time" name is referring to its write performance, as perceived. To achieve this efficiency, it does avoid disk flushes which might lose data.
But I agree that 3 weeks is rather extreme; I had never heard of that, so it might be a case of bad luck with the wrong metadata being lost, combined with your application for some reason never needing to flush that little 1 MB buffer.
Also: wouldn't you have had to reindex everything even if you had lost only 1 minute of changes?
Thanks for the feedback!