
All times are UTC - 5 hours [ DST ]



Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 9 posts ] 
 Post subject: Best ID generation strategy for large batch loading?
PostPosted: Sat Feb 14, 2004 5:54 pm 
Expert

Joined: Thu Jan 08, 2004 6:17 pm
Posts: 278
We have one table that contains consumer records. There will be vastly more consumers than anything else in our system (millions), and we sometimes load consumer data in batches of hundreds of thousands.

Our current consumer mapping uses an id field like so:
Code:
    <class name="com.nimblefish.core.data.Consumer">
        <meta attribute="generated-class">com.nimblefish.core.dto.ConsumerDTO</meta>
        <meta attribute="concrete-generated-class">true</meta>

        <id name="id" type="long" unsaved-value="null">
            <generator class="native"/>
        </id>

        <property name="prefix" type="string"/>
        <property name="fname" type="string"/>
        <property name="lname" type="string"/>
        ...
    </class>


And our saving code (this is just a prototype!) is:
Code:
            for (int i = 0; i < numRecordsToHandle; i++) {
                Vector record = reader.read();
                Consumer consumer = new Consumer(list); // 'list' comes from earlier in the prototype

                consumer.setPrefix((String)record.get(0));
                consumer.setFname((String)record.get(1));
                consumer.setLname((String)record.get(2));
                ...
                // one database hit per consumer?!  sigh....
                // will this batch?!  how do we tell?
                session.save(consumer);
            }

Clearly this calls out for some kind of optimized JDBC.

Or does it? What's the best thing to do here? Should we be doing our own ID assignment, and just doing massive bulk JDBC inserts?

It's our impression that with Postgres and native id generation, each session.save is doing at least two JDBC calls (to get a new sequence value and to insert the new consumer record), and possibly more. Is there a more efficient ID generator for this case?

Is there anything else we should be aware of if we wind up doing our own ID assignment? I envision some strategy such as allocating IDs in blocks of a thousand (in a separate transaction so as not to lock the consumer-ID table for the whole duration of the insert), and then just doing a mass set of inserts. If an insert block fails, that block of IDs might get lost since more IDs might have been allocated in the meantime (we are going to be doing an awful lot of consumer loading in our job handling system), and it would be a lot of trouble to track allocation. Are there any other good patterns here?
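To make the block-allocation idea concrete, here is a minimal, hypothetical sketch (class and variable names are made up, and the per-block database fetch is stubbed out as a supplier — in a real system it would run in its own short transaction against a hi-value table or sequence):

```java
import java.util.function.LongSupplier;

// Hypothetical hilo-style allocator: fetch a "hi" value once per block,
// then hand out blockSize contiguous ids before fetching again.
class HiLoIdAllocator {
    private final LongSupplier nextHi;  // stands in for the per-block DB round trip
    private final long blockSize;
    private long hi;
    private long lo;

    HiLoIdAllocator(LongSupplier nextHi, long blockSize) {
        this.nextHi = nextHi;
        this.blockSize = blockSize;
        this.lo = blockSize;            // force a fetch on the first call
    }

    synchronized long nextId() {
        if (lo >= blockSize) {          // current block exhausted
            hi = nextHi.getAsLong();    // the only "database hit" per blockSize ids
            lo = 0;
        }
        return hi * blockSize + lo++;
    }
}
```

With a block size of 1000, a loader that crashes mid-block simply abandons the remainder of its current block — at most 999 lost ids per failure, which matches the trade-off described above.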

Cheers!
Rob


 Post subject:
PostPosted: Sat Feb 14, 2004 6:08 pm 
Hibernate Team

Joined: Tue Sep 09, 2003 2:10 pm
Posts: 3246
Location: Passau, Germany
Well, using HiLo with a max_lo of about 1000 would definitely perform better; it would only do one database hit per 1000 inserts. You could also use the increment or uuid generators, which won't do any database hits for inserting.
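For reference, a hilo mapping along those lines might look something like this (the table and column names here are made up — check the Hibernate reference documentation for your version):

```xml
<id name="id" type="long" unsaved-value="null">
    <!-- hilo: one round trip to the hi-value table per max_lo inserts -->
    <generator class="hilo">
        <param name="table">hi_value</param>
        <param name="column">next_value</param>
        <param name="max_lo">1000</param>
    </generator>
</id>
```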


 Post subject:
PostPosted: Sat Feb 14, 2004 6:15 pm 
Senior

Joined: Sun Jan 04, 2004 2:46 pm
Posts: 147
Don't know about Postgres and its native id generation, but if it is indeed native then I'd be surprised if you need to go to the db for the id every time you save a row. MS SQL Server uses an AUTO_INCREMENT flag on the primary key when you create the table, which internally handles id allocation.

I'd check this first, maybe use a profiler on your db to see what is actually being called.

Anyway, if it turns out that is happening (two calls per save), then you can change the id generator to something like hilo, which allocates IDs in blocks of whatever size you fancy and only hits the database when it needs the next block of IDs. You can lose IDs using this method, but generally only if the blocks are very, very large or you are starting up and shutting down Hibernate frequently. If you start Hibernate, on the first save it will get the first block of IDs (say 1000) from the database. These IDs are then effectively gone, and only that instance of Hibernate can use them. If you process 10 rows and the system falls over needing a restart, then you lose 990 IDs. It depends on whether your system is likely to fail totally; if it can survive a failed insert and just carry on, it shouldn't be much of a problem. You can also make the block allocation smaller to mitigate the problem.

Cheers.

Myk.


Top
 Profile  
 
 Post subject:
PostPosted: Sat Feb 14, 2004 7:14 pm 
Expert

Joined: Thu Jan 08, 2004 6:17 pm
Posts: 278
The Hibernate manual mentions that hilo is not usable with JTA connections since it needs a separate transaction to update the hilo table. Is this also true for seqhilo? -- i.e. with Postgres, and inside an EJB container using the container's transaction context, is seqhilo unsafe to use due to Hibernate's inability to create a separate transaction?

If hilo and seqhilo are not usable with JTA transactions (all this code is in a message bean with transaction-type="Required"), then the reference manual further suggests using a separate stateless session bean. In that case I would basically need to do my own hilo implementation and assign my own IDs, correct?

Thanks for the help :-)
Cheers!
Rob


 Post subject:
PostPosted: Sat Feb 14, 2004 7:17 pm 
Expert

Joined: Thu Jan 08, 2004 6:17 pm
Posts: 278
Oh yes, and gloeglm: we can't use increment because we will be running this code in a cluster ultimately. (Actually, we will have a rack of server machines all running message beans performing bulk inserts in parallel, handling different and overlapping jobs.)

Cheers!
Rob


 Post subject:
PostPosted: Sat Feb 14, 2004 10:16 pm 
Hibernate Team

Joined: Tue Aug 26, 2003 12:50 pm
Posts: 5130
Location: Melbourne, Australia
The same is not true for seqhilo, since sequence numbers are assigned nontransactionally. It is the nature of a sequence.
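For what it's worth, a seqhilo mapping for Postgres would then look something like this (the sequence name here is invented for illustration):

```xml
<id name="id" type="long" unsaved-value="null">
    <!-- seqhilo: one sequence fetch per max_lo inserts; sequence values
         are assigned nontransactionally, so this is usable under JTA -->
    <generator class="seqhilo">
        <param name="sequence">consumer_hi_seq</param>
        <param name="max_lo">1000</param>
    </generator>
</id>
```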


 Post subject:
PostPosted: Tue Feb 17, 2004 6:13 pm 
Expert

Joined: Thu Jan 08, 2004 6:17 pm
Posts: 278
Well, it turns out that MySQL doesn't support "seqhilo" (no big surprise there). Our app wants to run on both MySQL and Postgres, with efficient bulk loading on both.

So now the question is, what's the best way to optimize bulk loading across different databases which don't all support the same ID generation?

Should we:

- create different versions of our Consumer mapping for databases which support "seqhilo" versus databases that don't? (If we have two different mappings, wouldn't that mung up our configuration?) And is there any ID generator that works with MySQL inside an EJB container, *and* that doesn't do multiple database round trips per object you save to Hibernate?

- implement our own hilo generator using a stateless session bean (since that will definitely work across multiple databases)?

Remember this is required to work inside an EJB container, hence can't just use the regular "hilo" generator....

Cheers,
Rob


 Post subject:
PostPosted: Tue Feb 17, 2004 6:16 pm 
Expert

Joined: Thu Jan 08, 2004 6:17 pm
Posts: 278
Myk wrote:
Don't know about Postgres and its native id generation, but if it is indeed native then I'd be surprised if you need to go to the db for the id every time you save a row. MS SQL Server uses an AUTO_INCREMENT flag on the primary key when you create the table, which internally handles id allocation.


What you say about SQL Server is true, but what you're forgetting (or not mentioning) is that Hibernate needs to *read back* the ID that SQL Server auto-incremented, immediately after you do the insert.

So Hibernate does still do two round trips, only rather than "get new sequence (and update in-memory id field); insert object with new sequence" it's "insert object with auto_increment; read back new object id (to update in-memory id field)".

Cheers,
Rob


 Post subject:
PostPosted: Tue Feb 17, 2004 7:52 pm 
Hibernate Team

Joined: Tue Sep 09, 2003 2:10 pm
Posts: 3246
Location: Passau, Germany
With the newest Hibernate you can have it use getGeneratedKeys() for the identity generator if your driver supports it. The newest version also uses "insert table(...) values(...) select SCOPE_IDENTITY()" on MS SQL identity to avoid database roundtrips.





© Copyright 2014, Red Hat Inc. All rights reserved. JBoss and Hibernate are registered trademarks and servicemarks of Red Hat, Inc.