 Post subject: Search indexing performance?
PostPosted: Mon Dec 15, 2008 2:55 pm 
Regular

Joined: Wed Dec 17, 2003 1:58 pm
Posts: 102
Hi there, I am trying to understand the performance implications of doing a full re-indexing of my documents with Hibernate Search. I have roughly 11,000 documents, each containing between 5 and 20 fields (the dominant documents have fewer than 10); a few are tokenized short strings, most are un-tokenized numbers. None are stored.

I have found that on a very fast machine (quad-core 3.5 GHz, 4 GB of RAM, 2x Raptors in RAID-0) a full reindex using FSDirectory takes almost 3 minutes, with a RAMDirectory taking slightly less than half that. This seems very high to me, especially considering this indexing load is purely for testing and I expect orders of magnitude more documents on the deployed server.

Do these numbers sound about like what others are seeing? I am very concerned about the performance and scalability of this Lucene back end when there are tens to hundreds of millions of documents being indexed.

Thanks!


 Post subject:
PostPosted: Mon Dec 15, 2008 3:46 pm 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Quote:
Do these numbers sound about like what others are seeing?

I've seen people with much worse numbers complaining here, so this is a good start.
Personally I am going way faster, but we can't compare the numbers: the cost depends mostly on the complexity of your object graph and on how much of it can be cached.
Just as a reference, my object graph is quite complex and reindexing 6 million documents takes around 20 seconds on my dual-core laptop.

What is the code you're using to rebuild the index? What is described in the book is good, and it is full of tips. I hope you've also seen the indexwriter settings in the reference documentation.

You should try to understand what is slowing you down using a profiler; there are three main candidates:
1) Your object loading from the DB. Study better caching and fetching strategies; some helper API will be added, hopefully soon, but I can't make promises as it depends on lots of other changes.
2) Your garbage collector. Make sure your JVM has well-sized memory pools and that you are not holding references during the whole indexing process.
3) Hibernate Search's backend, which is constantly improving on this side; another good speedup should arrive in 3.1.1 if Emmanuel accepts my proposals. Are you using automatic optimization strategies? (There's currently a bug that makes them trigger even during indexing.)

In any case, it usually depends too much on your own entities to provide a general-purpose magical speedup, so you should definitely check your object load timings and make sure the data you need is fetched with minimal DB round trips (try enabling SQL logging if you can't profile).
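
For reference, SQL logging can usually be enabled with the standard Hibernate properties, shown here in the same Spring session-factory prop style used later in this thread (verify the names against your Hibernate version):

Code:
<prop key="hibernate.show_sql">true</prop>
<prop key="hibernate.format_sql">true</prop>
<prop key="hibernate.use_sql_comments">true</prop>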

_________________
Sanne
http://in.relation.to/


 Post subject:
PostPosted: Mon Dec 15, 2008 6:36 pm 
Regular

Joined: Wed Dec 17, 2003 1:58 pm
Posts: 102
Hi Sanne, thank you for the reply.

s.grinovero wrote:
Quote:
Do these numbers sound about like what others are seeing?

I've seen people with much worse numbers complaining here, so this is a good start.
Personally I am going way faster, but we can't compare the numbers: the cost depends mostly on the complexity of your object graph and on how much of it can be cached.
Just as a reference, my object graph is quite complex and reindexing 6 million documents takes around 20 seconds on my dual-core laptop.

Wow! That is very fast. Can you comment on how many fields you are indexing per document? And how deep is your object graph, i.e. when you load one object via Hibernate, how many sub-objects are there?

Quote:
What is the code you're using to rebuild the index? What is described in the book is good, and it is full of tips. I hope you've also seen the indexwriter settings in the reference documentation.


Here is the current code I am using; it runs in a class extending the Spring DAO helper base class. I have verified that I do not have the n+1 selects problem: everything is loaded by a single SQL select, and the selects are, for all intents and purposes, instant. My object graph is also not very deep; generally all properties are simple (int/string/long), with one or two one-to-many or many-to-many associations. As mentioned, I index between 5 and 20 fields, and the dominant documents only have around 10 fields indexed, with maybe 25% of the fields using the default @Field annotation and the rest using @Field combined with un-tokenized indexing (a rough sketch of such a mapping follows after the code).

Code:
  @Override
  public void buildSearchIndex() {
    // If we are not an indexed class return
    if (this.eventClass.getAnnotation(Indexed.class) == null) {
      return;
    }

    getHibernateTemplate().execute(new HibernateCallback() {

      @Override
      public Object doInHibernate(Session session) throws HibernateException,
          SQLException {

        FullTextSession fullTextSession = Search.getFullTextSession(session);

        Criteria crit = fullTextSession.createCriteria(eventClass);

        int flushSize = 10000;
        int pageSize = 1000;
        int i = 0;
        List<E> results = null;
        do {
          crit.setFirstResult(i);
          crit.setMaxResults(pageSize);
          results = crit.list();
         
          for (E entity : results) {
            fullTextSession.index(entity);
          }
         
          // flush the index changes to disk so we don't hold until a commit
          if (pageSize % flushSize == 0) {
            fullTextSession.flushToIndexes();
          }
         
          i += pageSize;
        } while (results.size() > 0);
       
        return null;
      }     
    });   
  }
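
For context, a field mapping along the lines described above would look roughly like the following sketch. The entity and field names are hypothetical, shown only to illustrate the Hibernate Search 3.x annotation style (@Indexed, @DocumentId, @Field with un-tokenized indexing):

Code:
import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Index;
import org.hibernate.search.annotations.Indexed;

// Hypothetical entity: mostly un-tokenized numbers, a couple of tokenized short strings.
@Entity
@Indexed
public class ExampleEvent {

  @Id
  @DocumentId
  private Long id;

  // tokenized short string, default @Field settings (nothing stored)
  @Field
  private String title;

  // un-tokenized number, indexed for exact matching only
  @Field(index = Index.UN_TOKENIZED)
  private Long timestamp;

  // getters/setters omitted
}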


Quote:
You should try to understand what is slowing you down using a profiler; there are three main candidates:
1) Your object loading from the DB. Study better caching and fetching strategies; some helper API will be added, hopefully soon, but I can't make promises as it depends on lots of other changes.


I believe that this part is very fast. I had some problems getting my profiler to run last night, but I hope to solve that this evening and try to understand where the slowness is.

Quote:
2) Your garbage collector. Make sure your JVM has well-sized memory pools and that you are not holding references during the whole indexing process.


I will check this. Can you suggest any easy ways to check whether your JVM is experiencing performance problems due to memory parameters?

Quote:
3) Hibernate Search's backend, which is constantly improving on this side; another good speedup should arrive in 3.1.1 if Emmanuel accepts my proposals. Are you using automatic optimization strategies? (There's currently a bug that makes them trigger even during indexing.)


Not that I am aware of. When using Luke I have noticed that one or two of my indexes are listed as optimized, but some are not.

Quote:
In any case, it usually depends too much on your own entities to provide a general-purpose magical speedup, so you should definitely check your object load timings and make sure the data you need is fetched with minimal DB round trips (try enabling SQL logging if you can't profile).


I believe the DB fetching part is very fast: these tests run against a database on the same localhost, and as mentioned previously the data is loaded by a single SQL statement per batch, on indexed primary keys.

Almost everything I am running is stock; here are the only parameters I have tried changing:

Code:
<prop key="hibernate.search.default.indexwriter.batch.ram_buffer_size">64</prop>
<prop key="hibernate.search.default.indexwriter.transaction.ram_buffer_size">64</prop>


This did give me some speedup; I'd be very interested to know what some optimal configuration parameters are, however.
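
For anyone tuning the same knobs, the batch/transaction indexwriter family exposes a few more properties. The names below are taken from the Hibernate Search reference of that era and the values are placeholders rather than recommendations, so double-check them against your version:

Code:
<prop key="hibernate.search.default.indexwriter.batch.merge_factor">20</prop>
<prop key="hibernate.search.default.indexwriter.batch.max_buffered_docs">1000</prop>
<prop key="hibernate.search.default.indexwriter.transaction.merge_factor">10</prop>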

I have also noticed a lot of disk activity during indexing, which surprises me: the index it spits out is only a bit over a megabyte, so it seems crazy that it takes multiple minutes and tons of disk activity to create.

Thanks!
-David


 Post subject:
PostPosted: Mon Dec 15, 2008 6:40 pm 
Regular

Joined: Wed Dec 17, 2003 1:58 pm
Posts: 102
I just realized there is a slight bug in the code:

Code:
          if (pageSize % flushSize == 0) {
            fullTextSession.flushToIndexes();
          }


should be

Code:
          if (i % flushSize == 0) {
            fullTextSession.flushToIndexes();
          }


However, I don't believe this affects the results: since we have fewer than 10k documents the flush should never get called anyway.


 Post subject:
PostPosted: Tue Dec 16, 2008 2:41 am 
Regular

Joined: Wed Dec 17, 2003 1:58 pm
Posts: 102
OK, I have made some good progress. I read through the indexing performance section again and noticed the sentence about the need to perform everything in a transaction, otherwise index writers have to be set up and torn down on each index() call. The following code decreased my indexing time on ~20k documents to about 7 seconds on my machine:

Code:
  @Override
  public void buildSearchIndex() {
    // If we are not an indexed class return
    if (this.eventClass.getAnnotation(Indexed.class) == null) {
      return;
    }

    getHibernateTemplate().execute(new HibernateCallback() {

      @Override
      public Object doInHibernate(Session session) throws HibernateException,
          SQLException {

        int batchSize = 10000;
        int pageSize = 1000;

        FullTextSession fullTextSession = Search.getFullTextSession(session);
       
        Transaction tx = fullTextSession.beginTransaction();

        Criteria crit = fullTextSession.createCriteria(eventClass)
          .setResultTransformer(CriteriaSpecification.DISTINCT_ROOT_ENTITY)
          .setCacheMode(CacheMode.IGNORE)
          .setFetchSize(pageSize)
          .setFlushMode(FlushMode.MANUAL);

        int i = 0;
        List<E> results = null;
        do {
          crit = crit.setFirstResult(i)
            .setMaxResults(pageSize);
          results = crit.list();
         
          for (E entity : results) {
            fullTextSession.index(entity);
          }
         
          // flush the index changes to disk so we don't hold until a commit
          if (i % batchSize == 0) {
            fullTextSession.flushToIndexes();
            fullTextSession.clear();
          }
         
          i += pageSize;
        } while (results.size() > 0);

        tx.commit();
       
        return null;
      }     
    });   
  }


Any other obvious suggestions or things I should check? I still have a feeling it should be even faster.


 Post subject:
PostPosted: Tue Dec 16, 2008 5:52 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Hi, nice posts!

Quote:
I will check this, can you suggest any easy ways to check if your JVM is experiencing performance problems due to memory parameters?

A profiler would be best, or use jconsole to connect to your own JVM. This very cool tool is part of the JDK, so you already have it: just type jconsole at a shell. As this indexing operation generates lots of garbage, it is usually worth taking a look. There are also JVM parameters to print each automatic garbage collection to the console: they print timings so you get an idea of how much time you lose in GC (by default the collector is NOT concurrent with your application, so if you lose 3 seconds doing GC your application is frozen for 3 seconds).
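
For reference, on the Sun JVMs of that era the GC logging flags were typically along these lines (MyIndexingJob is a hypothetical main class, and the exact flags vary by JVM version):

Code:
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xmx512m MyIndexingJob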


Quote:
using Luke I have noticed that one or two of my indexes lists it as being optimized, but some do not.

You definitely want to find out why it is optimizing some indexes: optimization is a costly operation and you want it performed only once, at the end. All writing to the index is blocked while it runs.
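
For reference, the automatic optimization strategies mentioned above are configured through properties along these lines in Hibernate Search 3.x (names and values are only indicative; check your version's reference):

Code:
<prop key="hibernate.search.default.optimizer.operation_limit.max">1000</prop>
<prop key="hibernate.search.default.optimizer.transaction_limit.max">100</prop>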

Quote:
since we have fewer than 10k documents the flush should never get called anyway

You are most probably right, but it is still possible your flush size is too big and gets you into trouble with available memory: until you flush and clear, all entities are kept in your persistence context.

Quote:
...noticed the sentence about the need to perform everything in a transaction...

Yes that is very important.

Quote:
I believe the DB fetching part is very fast: these tests run against a database on the same localhost, and as mentioned previously the data is loaded by a single SQL statement per batch, on indexed primary keys.

You might be right, but you may want to play with your batch size; did you check that all other needed entities are loaded in the same query as the root entity? It is not easy to reach the point where the bottleneck is Search's backend, so you really need to find out how much time you are spending in object loading alone.

Are you using the "async" option?

It doesn't make much sense to take measurements in Java on a 7-second run; you'll need to make your test case longer (at least 100x the number of documents). Most internal structures are created at first use, so to predict how indexing times scale you need to discard the first flushes, and the GC may never trigger in short runs, so you might fail to predict its slowing effect on longer (production-like) runs. Ideally you should test with runs lasting several hours, but of course that only makes sense once you feel you have found the right settings.

Also keep in mind the backend will get another speedup in a few weeks, so you should invest your time in object loading optimization and memory sizing: 64 MB is not very much, I usually set something between 256 MB and 1 GB, but your size should fit your little test case at the moment.

_________________
Sanne
http://in.relation.to/


 Post subject:
PostPosted: Tue Dec 16, 2008 1:50 pm 
Regular

Joined: Wed Dec 17, 2003 1:58 pm
Posts: 102
Great, thanks for the reply. I'll have a look at the optimization issue and see if I can dig up why it is happening.

I'm actually not using async right now, should I be? It occurred to me last night that I could probably index each persistent class in parallel, since they each have their own index directory; would that make sense?
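
A minimal sketch of that "one thread per indexed class" idea, assuming each task opens its own Session from a shared SessionFactory (Sessions are not thread-safe, so they must not be shared); the class name, the fixed-size pool, and the missing paging/flushing are simplifications for illustration:

Code:
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;

// Hypothetical driver: one reindexing task per @Indexed class, each with its own Session.
public class ParallelReindexer {

  private final SessionFactory sessionFactory;

  public ParallelReindexer(SessionFactory sessionFactory) {
    this.sessionFactory = sessionFactory;
  }

  public void reindex(List<Class<?>> indexedClasses) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(indexedClasses.size());
    for (final Class<?> clazz : indexedClasses) {
      pool.submit(new Runnable() {
        public void run() {
          Session session = sessionFactory.openSession();
          try {
            FullTextSession fullTextSession = Search.getFullTextSession(session);
            Transaction tx = fullTextSession.beginTransaction();
            for (Object entity : fullTextSession.createCriteria(clazz).list()) {
              fullTextSession.index(entity); // paging and periodic flushToIndexes() omitted
            }
            tx.commit();
          } finally {
            session.close();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}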


 Post subject:
PostPosted: Tue Dec 16, 2008 2:08 pm 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Async means you just load the data, and don't need to wait for it to be written before going on to load more data.

Different indexes can be updated in parallel, that's right.
Actually, when the data loading suffers from high latency or many n+1 queries you may even want to load the same entity in parallel; this concept is more or less what I am working on, but it requires several changes in Search's code to provide some benefit.

_________________
Sanne
http://in.relation.to/


 Post subject:
PostPosted: Tue Dec 16, 2008 2:16 pm 
Regular

Joined: Wed Dec 17, 2003 1:58 pm
Posts: 102
s.grinovero wrote:
Async means you just load the data, and don't need to wait for it to be written before going on to load more data.

Different indexes can be updated in parallel, that's right.
Actually, when the data loading suffers from high latency or many n+1 queries you may even want to load the same entity in parallel; this concept is more or less what I am working on, but it requires several changes in Search's code to provide some benefit.


So would you actually have two different sessions, one doing the loading and one doing the indexing, each in its own thread? Kind of a producer-consumer model?


 Post subject:
PostPosted: Tue Dec 16, 2008 2:24 pm 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Quote:
Kind of a producer-consumer model?

Yes, but there are not two sessions: you use one, and push the fetched data to Search's internal thread pool, which doesn't use Sessions at all to write to the index.
There is no change required in your code, just configure it to async.
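
For reference, the async backend is configured through the worker properties, roughly as below for Hibernate Search 3.x (pool and queue sizes are placeholders; check the reference for your version):

Code:
<prop key="hibernate.search.worker.execution">async</prop>
<prop key="hibernate.search.worker.thread_pool.size">2</prop>
<prop key="hibernate.search.worker.buffer_queue.max">100</prop>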

_________________
Sanne
http://in.relation.to/


 Post subject:
PostPosted: Tue Dec 16, 2008 3:22 pm 
Regular

Joined: Wed Dec 17, 2003 1:58 pm
Posts: 102
s.grinovero wrote:
Yes, but there are not two sessions: you use one, and push the fetched data to Search's internal thread pool, which doesn't use Sessions at all to write to the index.
There is no change required in your code, just configure it to async.


Ah ok, I'll give this a whirl and see if it makes a difference.

Out of curiosity, is the code, or an example of how you are managing 6 million docs indexed in 20 seconds, available anywhere? I think this would be a _very_ useful example to have for comparison purposes.


 Post subject:
PostPosted: Wed Dec 17, 2008 3:18 pm 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Quote:
Out of curiosity, is the code, or an example of how you are managing 6 million docs indexed in 20 seconds, available anywhere? I think this would be a _very_ useful example to have for comparison purposes.


Currently not; I can't publish the 10 GB of privately held data ;-)

Even worse, I don't work there anymore, and this is hindering further improvements, as my test benchmark is gone.

I could show some code, but I still hope to release it all during the new-year holidays. One of the points that really slows me down is the rule "don't break public APIs"; if I could prepare some "rule-breaking" preview around Christmas, would you be interested in testing the patches (and improving them with better ideas)?
Testing is time-consuming as it needs long runs, different data, different machine types and network latencies (and possibly a profiler), so I could definitely use some help to see whether it also speeds things up with databases other than my reference ;-)

_________________
Sanne
http://in.relation.to/


 Post subject:
PostPosted: Wed Dec 17, 2008 3:24 pm 
Regular

Joined: Wed Dec 17, 2003 1:58 pm
Posts: 102
Sure, I'd be happy to help where I can. At the moment our site is in testing, and we hope to grow the dataset so we can see what happens when we have a lot more data to index and search.

