 Post subject: Hibernate Search - Very slow indexing
PostPosted: Wed Jun 13, 2007 10:29 am 
Newbie

Joined: Wed Jun 13, 2007 5:44 am
Posts: 16
Hello,

I'm having a problem with Hibernate Search. I'm using JBoss 4.0.5.GA with the Seam 1.2.1.GA framework.
What I'm trying to do is index Books, especially a String property indicating which topics each book relates to.

Here's my code:
Code:
@Entity
@Indexed // required so Hibernate Search indexes this entity (presumably in the elided code)
public class Book implements Serializable {

   @Id
   @GeneratedValue
   @DocumentId
   private Long id;

   @Column(name="ch_mots")
   @Field(name="topicString", index=Index.TOKENIZED)
   private String topicString;
...


Code:
...
@PersistenceContext
EntityManager em;
...
@org.jboss.annotation.ejb.TransactionTimeout(10000000)
@TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
public String createBookIndexes() {
   int lastIndex = 0;
   int iteration = 1;
   List<Book> bookList;

   FullTextSession fullTextSession = Search.createFullTextSession(((HibernateEntityManager) em.getDelegate()).getSession());

   Transaction transaction = fullTextSession.beginTransaction();

   while (iteration < 10) { // I'm limiting the number of Books to index in order to test more easily.
      bookList = em.createQuery("select b from Book b").setFirstResult(lastIndex).setMaxResults(10000).getResultList();

      if (bookList.isEmpty()) {
         break; // stop when the query returns no more Books
      } else {
         for (Book b : bookList) {
            fullTextSession.index(b);
         }
         lastIndex += 10000;
         iteration++;
      }
   }
   transaction.commit();


Note that I have to cut the query into several parts, because 3,000,000 Books are far more than the Java heap can handle; that's why I have that little while loop issuing multiple queries to the database. Algorithm-wise, everything is supposed to work.

Actually, it doesn't create the index files when I add Books to the FullTextSession, nor when I commit the transaction. The index files seem to be created long after the code has finished running (apart from the Book directory and the two 20-byte base files), depending on how many Books I indexed. And it can apparently take ages: indexing about 80,000 Books (there are over 3,000,000 in the database) already takes several minutes before the index files are actually created. Meanwhile, the web page that triggered the code is still waiting for an answer from the server.
I find this behaviour really strange, because with plain Lucene code it doesn't take long. In fact, it takes almost no time at all: files are created and updated as soon as I add a Book to the IndexWriter. I know Hibernate Search works with batches, but it looks really strange to me that file creation isn't started as soon as the transaction is committed.
Anyway, the index files do get created eventually; I've already indexed about 50-60k Books successfully and I can search those files without a problem.
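
For reference, here is roughly what my plain Lucene version looks like (a simplified sketch; the index path and the Book getters are placeholders, not my real code):

Code:
// Lucene 2.x-era API: open a writer on a directory, add one Document per Book
IndexWriter writer = new IndexWriter("/home/admin/lucene/plain-indexes", new StandardAnalyzer(), true);
for (Book b : bookList) {
   Document doc = new Document();
   doc.add(new Field("id", String.valueOf(b.getId()), Field.Store.YES, Field.Index.UN_TOKENIZED));
   doc.add(new Field("topicString", b.getTopicString(), Field.Store.NO, Field.Index.TOKENIZED));
   writer.addDocument(doc); // index files get created/updated while documents are added
}
writer.optimize();
writer.close();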


 Post subject:
PostPosted: Wed Jun 13, 2007 11:29 am 
Hibernate Team

Joined: Mon Aug 25, 2003 9:11 pm
Posts: 4592
Location: Switzerland
Your code looks incomplete and broken. This is my indexing routine:

Code:
    public int batchSize = 50;

    /**
     * Runs asynchronously and re-indexes the given entity class after purging the index.
     *
     * @param entityClass the class to purge and re-index
     * @param progress a value holder that is continuously updated while the asynchronous procedure runs
     */
    @Asynchronous
    public void rebuildIndex(Class entityClass, Progress progress) {
        log.info("asynchronously rebuilding Lucene index for entity: " + entityClass);

        UserTransaction userTx = null;

        try {
            progress.setStatus("Purging index");
            log.debug("deleting indexed documents");
            userTx = (UserTransaction)org.jboss.seam.Component.getInstance("org.jboss.seam.transaction.transaction");
            userTx.begin();

            EntityManager em = (EntityManager) Component.getInstance("entityManager");
            Session session = (Session) em.getDelegate();

            // Delete all documents with "_hibernate_class" term of the selected entity
            DirectoryProvider dirProvider = ContextHelper.getSearchFactory(session).getDirectoryProvider(entityClass);
            IndexReader reader = IndexReader.open(dirProvider.getDirectory());

            // TODO: This is using an internal term of HSearch
            reader.deleteDocuments(new Term("_hibernate_class", entityClass.getName()));
            reader.close();

            // Optimize index
            progress.setStatus("Optimizing index");
            log.debug("optimizing index (merging segments)");
            Search.createFullTextSession(session).getSearchFactory().optimize(entityClass);

            userTx.commit();

            progress.setStatus("Building index");
            log.debug("indexing documents in batches of: " + batchSize);

            // Now re-index with HSearch
            em = (EntityManager) Component.getInstance("entityManager");
            session = (Session) em.getDelegate();
            FullTextSession ftSession = org.hibernate.search.Search.createFullTextSession(session);

            userTx.begin();

            // Use HQL instead of Criteria to eager fetch lazy properties
            ScrollableResults cursor = session.createQuery("select o from " + entityClass.getName() + " o fetch all properties").scroll();

            cursor.last();
            int count = cursor.getRowNumber() + 1;
            log.debug("total documents in database: " + count);

            cursor.first(); // Reset to first result row
            int i = 0;
            while (true) {
                i++;
                Object o = cursor.get(0);
                log.debug("indexing: " + o);
                ftSession.index(o);
                if (i % batchSize == 0) session.clear(); // Clear persistence context for each batch

                progress.setPercentComplete( (i * 100) / count ); // multiply before dividing; (100/count)*i truncates to 0 for count > 100
                log.debug("percent of index update complete: " + progress);

                if (cursor.isLast())
                    break;
                else
                    cursor.next();
            }
            cursor.close();
            userTx.commit();

            progress.setStatus(Progress.COMPLETE);
            log.debug("indexing complete of entity class: " + entityClass);

        } catch (Exception ex) {
            try {
                if (userTx != null) userTx.rollback();
            } catch (Exception rbEx) {
                rbEx.printStackTrace();
            }
            throw new RuntimeException(ex);
        }

    }


It executes asynchronously and I poll the Progress object with Seam Remoting on a page. The batch size is the same as the configured worker.batch_size of Hibernate Search (important).

_________________
JAVA PERSISTENCE WITH HIBERNATE
http://jpwh.org
Get the book, training, and consulting for your Hibernate team.


 Post subject:
PostPosted: Thu Jun 14, 2007 4:48 am 
Newbie

Joined: Wed Jun 13, 2007 5:44 am
Posts: 16
[EDIT]

Ok, I changed my code a bit, and I think I have configured the batch_size correctly (to 50, like in your example; maybe that's too low?)

Code:
EntityManager em = (EntityManager) Component.getInstance("entityManager");
Session session = (Session) em.getDelegate();
FullTextSession fullTextSession = Search.createFullTextSession(session);
Transaction transaction = fullTextSession.beginTransaction();
int lastIndex = 0;
int iteration = 1;
int i = 0;
while (iteration < 11) {
   ScrollableResults results = fullTextSession.createCriteria(Book.class).setFirstResult(lastIndex).setMaxResults(10000).scroll(ScrollMode.FORWARD_ONLY);
   while (results.next()) {
      i++;
      fullTextSession.index(results.get(0));
   }
   results.close();
   lastIndex += 10000;
   iteration++;
}
transaction.commit();



It still doesn't work. The actual file creation happens way after the commit. Maybe that's normal, in which case it wouldn't be a problem, but it seems so far from plain Lucene's indexing time that I doubt it.
With plain Lucene, indexing 20,000 Books takes me about 12s, whereas it takes about 1min22s with my Hibernate Search process. The thing that's actually bothering me is that no files are created for a long time, but after all, maybe that's just the way Hibernate Search works.


 Post subject:
PostPosted: Thu Jun 14, 2007 10:20 am 
Hibernate Team

Joined: Sun Sep 14, 2003 3:54 am
Posts: 7256
Location: Paris, France
Assuming you're not using the JMS backend, the file creation should happen right after either the transaction commit or when batch_size is reached (i.e. before the tx commit).

I think batch_size = 1000 would be good enough (actually, as much as your memory can handle).

In my tests, the speed is similar to a JDBC read plus the Lucene work.

PS: in your routine, you don't clear the session.
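
Something along these lines (just a sketch of your loop with the clear added; I'm assuming batch_size = 1000 as suggested above):

Code:
Transaction transaction = fullTextSession.beginTransaction();
int i = 0;
while (iteration < 11) {
   ScrollableResults results = fullTextSession.createCriteria(Book.class)
         .setFirstResult(lastIndex).setMaxResults(10000)
         .scroll(ScrollMode.FORWARD_ONLY);
   while (results.next()) {
      i++;
      fullTextSession.index(results.get(0));
      if (i % 1000 == 0) fullTextSession.clear(); // release indexed Books so the persistence context stays small
   }
   results.close();
   lastIndex += 10000;
   iteration++;
}
transaction.commit();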

_________________
Emmanuel


 Post subject:
PostPosted: Thu Jun 14, 2007 1:25 pm 
Hibernate Team

Joined: Mon Aug 25, 2003 9:11 pm
Posts: 4592
Location: Switzerland
If you are using Seam, I would greatly appreciate it if you could test my method 1:1 on your dataset. I don't have a large dataset around right now, so I don't know if it will perform well. That way we might both get what we want: you get a working routine that is at least correct from an API standpoint (not clearing the Session is definitely wrong), and I get a test :)

_________________
JAVA PERSISTENCE WITH HIBERNATE
http://jpwh.org
Get the book, training, and consulting for your Hibernate team.


 Post subject:
PostPosted: Fri Jun 15, 2007 3:32 am 
Newbie

Joined: Wed Jun 13, 2007 5:44 am
Posts: 16
Thanks for your answers.

I now believe the problem might be simpler (or more of a newbie mistake) than anything else. I think I haven't configured the batch_size property correctly, since the session.clear part doesn't seem to have any effect.
Here's what I put in my persistence.xml file:

Code:
<persistence-unit name="LrbThesaurus">
      <provider>org.hibernate.ejb.HibernatePersistence</provider>
      <jta-data-source>java:/LrbThesaurusDatasource</jta-data-source>
      <properties>
         <property name="hibernate.hbm2ddl.auto" value="update"/>
         <property name="hibernate.cache.use_query_cache" value="true"/>
         <property name="hibernate.show_sql" value="false"/>
         <property name="jboss.entity.manager.factory.jndi.name" value="java:/LrbThesaurusEntityManagerFactory"/>
          <property name="hibernate.search.default.directory_provider" value="org.hibernate.search.store.FSDirectoryProvider"/>
       <property name="hibernate.search.default.indexBase" value="/home/admin/lucene/Thesaurus-indexes"/>
          <property name="hibernate.ejb.event.post-update" value="org.hibernate.search.event.FullTextIndexEventListener"/>
         <property name="hibernate.ejb.event.post-insert" value="org.hibernate.search.event.FullTextIndexEventListener"/>
         <property name="hibernate.ejb.event.post-delete" value="org.hibernate.search.event.FullTextIndexEventListener"/>
          <property name="hibernate.worker.batch_size" value="1000"/>
      </properties>
   </persistence-unit>


Anyway, I'll test your code as soon as I get the batch_size property configured, which I think is the problem here.


 Post subject:
PostPosted: Fri Jun 15, 2007 1:07 pm 
Hibernate Team

Joined: Sun Sep 14, 2003 3:54 am
Posts: 7256
Location: Paris, France
It's hibernate.search.worker.batch_size.
And be sure to use beta3 or above.
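
i.e. in your persistence.xml (with the 1000 I suggested earlier):

Code:
<property name="hibernate.search.worker.batch_size" value="1000"/>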

_________________
Emmanuel


 Post subject:
PostPosted: Fri Jun 15, 2007 3:48 pm 
Newbie

Joined: Wed Jun 13, 2007 5:44 am
Posts: 16
Yep, my bad, I mistyped it!
I just installed beta3 though, so I guess it didn't work before because of that. And it didn't work just now because the 'search' part was missing from the property name (silly me!).
I can't test it until Monday though, so stay tuned =)


 Post subject:
PostPosted: Mon Jun 18, 2007 3:47 am 
Newbie

Joined: Wed Jun 13, 2007 5:44 am
Posts: 16
Ok, just like I thought, everything is working now!
It apparently takes about 100min to index my 3,000,000 entities.

So Christian, I tried your routine like I promised.
I have two problems with your algorithm.
First, the UserTransaction lookup doesn't work:
Code:
         userTx = (UserTransaction)org.jboss.seam.Component.getInstance("org.jboss.seam.transaction.transaction");
         userTx.begin();

Apparently userTx is null, so I get a NullPointerException on the 'begin' line.
So instead, I used this code:
Code:
EntityManager em = (EntityManager) Component.getInstance("entityManager");
Session session = (Session) em.getDelegate();
Transaction tx = session.getTransaction();
tx.begin();

With that transaction, it works.

Second problem: the database query line.
With my 3,000,000 records, it simply throws a Java heap out-of-memory error. It's apparently impossible (at least on my machine) to retrieve all the records at once, so I had to divide it into several queries (with setFirstResult and setMaxResults).
With both these changes, it works fine.

There's just one improvement that seems to work for me.
I replaced this code:
Code:
ScrollableResults cursor = session.createQuery("select o from Book o fetch all properties").setFirstResult(firstID).setMaxResults(50000).scroll();

with this one:
Code:
ScrollableResults cursor = ftSession.createCriteria( Book.class ).setFirstResult(firstID).setMaxResults(50000).scroll();


I don't know why, but the latter seems to be about twice as fast at indexing 10,000 Books (that's my batch_size).

One last question though.
If I want to improve indexing performance, I guess I should first maximize the number of items retrieved at a time, but what should I tweak after that? Should I increase or decrease batch_size? Is there anything else I should change?


 Post subject:
PostPosted: Mon Jun 18, 2007 4:15 am 
Hibernate Team

Joined: Mon Aug 25, 2003 9:11 pm
Posts: 4592
Location: Switzerland
You need Seam 1.3 for the transaction component. In older versions you need to look up the UserTransaction in JNDI instead.
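
Something like this (a sketch; "UserTransaction" is the usual JNDI name on JBoss):

Code:
import javax.naming.InitialContext;
import javax.transaction.UserTransaction;

// Look up the container-managed UserTransaction directly in JNDI
UserTransaction userTx = (UserTransaction) new InitialContext().lookup("UserTransaction");
userTx.begin();
// ... purge / index ...
userTx.commit();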

If you get an OOME on the query, your JDBC driver probably does not support cursors properly and reads the whole result set into memory instead of just opening a database-side cursor. At least that's my guess; you could try a different JDBC driver or find out whether your combination has known issues with cursors.
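
For example, MySQL's Connector/J buffers the entire result set by default; if that turns out to be your driver, streaming typically requires a forward-only cursor and a fetch size of Integer.MIN_VALUE, something like:

Code:
// Hint the driver to stream rows instead of buffering the whole result set
// (Integer.MIN_VALUE is Connector/J's documented streaming fetch size).
ScrollableResults cursor = session
      .createQuery("select o from Book o fetch all properties")
      .setFetchSize(Integer.MIN_VALUE)
      .scroll(ScrollMode.FORWARD_ONLY);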

Increasing the performance might be as simple as increasing the batch size. Or it might not be, if memory is a concern. This is what you really need to test.

_________________
JAVA PERSISTENCE WITH HIBERNATE
http://jpwh.org
Get the book, training, and consulting for your Hibernate team.


 Post subject:
PostPosted: Mon Jun 18, 2007 5:04 am 
Newbie

Joined: Wed Jun 13, 2007 5:44 am
Posts: 16
Ok, thanks.

I'm currently using Seam 1.2.x, so I guess you're right about the UserTransaction. I'll also try to find out what the problem with my JDBC driver is. Maybe it's related to the fact that I'm not using a UserTransaction?
If I manage to work that out, it would mean your code works fine on my 3,000,000-record database.


 Post subject:
PostPosted: Mon Jun 18, 2007 10:56 am 
Hibernate Team

Joined: Sun Sep 14, 2003 3:54 am
Posts: 7256
Location: Paris, France
Quote:
Ok, just like I thought, everything is working now!
It apparently takes about 100min to index my 3,000,000 entities.


Interesting.
Can you describe your architecture?
Which DB?
Is the DB on the same machine?
How many CPUs/cores?

_________________
Emmanuel


 Post subject:
PostPosted: Tue Jun 19, 2007 4:59 am 
Newbie

Joined: Wed Jun 13, 2007 5:44 am
Posts: 16
I'm currently running the application on a single-core Pentium 4 at 3GHz. My database server is MySQL, running on the same machine as JBoss.

Actually, it's more likely to take about 90min, I think.

