All times are UTC - 5 hours [ DST ]



 Post subject: Hibernate Search index building takes very long
Posted: Fri Sep 19, 2008 10:29 am
Beginner

Joined: Tue Aug 12, 2008 9:06 am
Posts: 22
Location: Fort Washington, PA
I am manually building my Lucene index for Hibernate Search, and it's working correctly. My database structure is the following:

Recording - Transcript < Line < Utterance

My issue is that writing the index takes a very long time. We currently have around 2 million utterances among 10,000 transcripts that we are indexing. I compile the program into an executable jar that is run on our server. I also wanted to write it in such a way that it can be stopped and re-run without leaving an incomplete or unreadable index.

Does anyone see a way I can optimize this code so it doesn't take days and days to index my data? Thanks so much. I'm open to any suggestions.

Main Class
Code:
public class Main {
    public static void main(String[] args) {
        LuceneIndexer luceneIndexer = new LuceneIndexer();
        luceneIndexer.addUtteranceToIndex();
    }
}


Indexer Class
Code:
class LuceneIndexer {
    public LuceneIndexer() {
    }

    public void addUtteranceToIndex() {
        Session session = getSessionFactory().openSession();
        FullTextSession fts = Search.createFullTextSession(session);
        File idx = new File("./RecordingSearchIndex");

        System.out.println("Index Start Time: " + now());
        List<Integer> transcriptIDs = session.createQuery("select UID from TempTranscript").list();
        System.out.println("Number of Transcripts: " + transcriptIDs.size());

        //foreach Transcript
        for (Integer transcriptID : transcriptIDs) {
            TempTranscript transcript = (TempTranscript) session.createQuery("from TempTranscript where UID = " + transcriptID).uniqueResult();
            List<Integer> turnIDs = session.createQuery("select UID from Turn where tempTranscript = " + transcript.getUID()).list();
            System.out.println("Transcript Start Time: " + transcript.getUID() + " : " + now());

            for (Integer turnID : turnIDs) {
                Turn turn = (Turn) session.createQuery("from Turn where UID = " + turnID).uniqueResult();
                List<Utterance> utterances = turn.getUtterances();
                for (Utterance utterance : utterances) {
                    org.hibernate.Query query = fts.createFullTextQuery(this.buildQuery(Integer.toString(utterance.getUID())));
                    if (query.list().size() == 0) {
                        fts.getTransaction().begin();
                        fts.index(utterance);
                        fts.getTransaction().commit();
                    }
                }
            }
            System.out.println("Transcript Completed Time: " + transcript.getUID() + " : " + now());
        }
        session.close();
        System.out.println("Index Completed Time: " + now());
    }

    // Helper methods used above (buildQuery, now, getSessionFactory) are not shown in the post.
}



Posted: Tue Sep 23, 2008 11:19 am
Hibernate Team

Joined: Thu Apr 05, 2007 5:52 am
Posts: 1689
Location: Sweden
Hi,

Indexing performance can be influenced by many factors, not least the actual hardware configuration. Of course, there are also quite a few configuration options. I suggest you check the online documentation for Hibernate Search. Maybe this thread is of interest as well: http://forum.hibernate.org/viewtopic.php?t=989833.

Have you determined yet whether it is the indexing itself that takes so much time, or the retrieval of the objects? I see that you have quite a few nested loops and queries going there.
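
One rough way to answer that question is to time the two phases separately. The sketch below is only an illustration, not Hibernate API: `PhaseTimer`, `loader` and `indexer` are hypothetical names, and it assumes Java 8 or later. The loader would wrap whatever fetches an entity, and the indexer whatever writes it to the index.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

/** Accumulates time spent loading versus indexing. Both callbacks are hypothetical stand-ins. */
final class PhaseTimer {
    long loadNanos;   // total time spent fetching entities
    long indexNanos;  // total time spent writing them to the index

    <I, E> void run(List<I> ids, Function<I, E> loader, Consumer<E> indexer) {
        for (I id : ids) {
            long t0 = System.nanoTime();
            E entity = loader.apply(id);   // stand-in for the database fetch
            long t1 = System.nanoTime();
            indexer.accept(entity);        // stand-in for the index write
            long t2 = System.nanoTime();
            loadNanos += t1 - t0;
            indexNanos += t2 - t1;
        }
    }

    /** Summarises both totals in milliseconds. */
    String report() {
        return String.format("load=%d ms, index=%d ms",
                loadNanos / 1_000_000, indexNanos / 1_000_000);
    }
}
```

Running it over a few hundred IDs and printing `report()` should show which side dominates before any tuning is attempted.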

BTW, why do you query the index before you index the utterance? If the utterance is already indexed and you index it again, it will automatically get updated (actually deleted and re-added in Lucene). And performance-wise, it is bad to index each utterance in its own transaction. You should be batch indexing. This might actually be the main reason for your grief.
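
To make the batching idea concrete, here is a minimal sketch of the loop shape: one transaction for the whole run, with a periodic flush so memory stays bounded. `IndexingSession` and `BatchIndexer` are hypothetical stand-ins so the example runs on its own. With a real `FullTextSession` the corresponding calls would presumably be `index(...)`, then `flushToIndexes()` followed by `clear()` (in versions that provide `flushToIndexes()`), and finally a commit on the transaction.

```java
import java.util.Iterator;

/** Hypothetical stand-in for the few session operations the loop needs. */
interface IndexingSession<E> {
    void begin();          // start a transaction
    void index(E entity);  // queue one entity for indexing
    void flushAndClear();  // push queued work to the index and release memory
    void commit();         // commit the transaction, applying any remaining work
}

final class BatchIndexer {
    /** Indexes everything in a single transaction, flushing every batchSize items. */
    static <E> int indexAll(Iterator<E> entities, IndexingSession<E> session, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int count = 0;
        session.begin();
        while (entities.hasNext()) {
            session.index(entities.next());
            if (++count % batchSize == 0) {
                session.flushAndClear();  // bound memory use without committing
            }
        }
        session.commit();                 // one commit instead of one per entity
        return count;
    }
}
```

Compared with the original loop, this drops the per-utterance index query and replaces millions of small transactions with a single one. The batch size is a tuning knob to experiment with.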

--Hardy


Posted: Fri Sep 26, 2008 11:20 am
Beginner

Joined: Tue Aug 12, 2008 9:06 am
Posts: 22
Location: Fort Washington, PA
Hey Hardy, thanks for the reply.

I was checking if the Utterance was already in the index mainly because I wanted to keep a really clean index, without "deleted" documents in there, but I guess that was just a personal preference born of vanity and not really needed.

Another reason was that I figured the writing portion of the indexing, stretched over several million documents, would take the biggest hit in performance and completion time, so I was trying to write to the index only if the document was not already there. My ideal world was to be able to rewrite the index every night to keep it pristine, but I don't think it can write that fast.

Does the indexer compare the value stored in the index with the current value and update only if it has changed, or does it just mark the old document as deleted and re-index it either way? If that's the case, then I'm probably checking twice, which doesn't make much sense. :)

Thanks.


Posted: Sat Sep 27, 2008 2:36 am
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
It's probably much faster if you remove all data from the index first, then optimize and reinsert it all.
You are hurting performance very much by checking. Following the tips found in the reference documentation or in the book (see banners), you can get several million entities indexed per hour. It really depends on several factors, so I can't promise anything of course, but it's possible to go very fast;
just try to follow the guidelines.
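
As a toy illustration of why purge-then-rebuild leaves a clean index, the sketch below models an index in which a delete only marks a document until an optimize pass reclaims it. `ToyIndex` is a made-up in-memory class, not Lucene or Hibernate Search code. In Hibernate Search the corresponding steps would presumably be `purgeAll(...)` on the `FullTextSession` and `optimize()` on its `SearchFactory`, followed by a batch re-index.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy in-memory model: deletes only mark documents until optimize() runs. */
final class ToyIndex {
    private final Set<String> liveIds = new LinkedHashSet<>();
    private int deletedSlots = 0;

    /** Adds a document. An existing id is marked deleted first (update = delete + re-add). */
    void addOrUpdate(String id) {
        if (liveIds.remove(id)) {
            deletedSlots++;
        }
        liveIds.add(id);
    }

    /** Marks every live document as deleted. */
    void purgeAll() {
        deletedSlots += liveIds.size();
        liveIds.clear();
    }

    /** Reclaims the space held by deleted documents. */
    void optimize() {
        deletedSlots = 0;
    }

    int liveCount() { return liveIds.size(); }
    int deletedCount() { return deletedSlots; }

    /** Purge, optimize, then reinsert everything with no per-document existence check. */
    static void rebuild(ToyIndex index, List<String> allIds) {
        index.purgeAll();
        index.optimize();
        for (String id : allIds) {
            index.addOrUpdate(id);
        }
    }
}
```

After `rebuild`, the model holds no deleted entries, which is the "pristine" state asked about earlier in the thread, and it gets there without querying the index for each document.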

_________________
Sanne
http://in.relation.to/

