-->
These old forums are deprecated now and set to read-only. We are waiting for you on our new forums!
More modern, Discourse-based and with GitHub/Google/Twitter authentication built-in.

All times are UTC - 5 hours [ DST ]



Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 6 posts ] 
Author Message
 Post subject: batch processing question
PostPosted: Wed Jun 07, 2006 11:03 am 
Newbie

Joined: Wed Jun 07, 2006 10:07 am
Posts: 5
Hello, I read the article on batch processing, http://www.hibernate.org/hib_docs/v3/re ... batch.html but I have a little doubt.

I have a 60.000 line file I have to load to the database, but I have to load the new data only. How can I accomplish that? I'm currently querying the database first to check if the data is already loaded, and if not, I call save().

I know it's really ugly but I'm not sure what's the proper way to do it.

Here's my code, it's a method called saveIfNew(), which I call from a loop, that is, I'm opening 60.000 different transactions, and I'm not sure if that's the way to do it:
Code:
void saveIfNew(String modelo) {
    Session session = HibernateUtil.getSessionFactory().getCurrentSession();
    session.beginTransaction();

    tamaño = (Integer)
      session.createQuery("select count(*) from Modelo as mod where mod.modelo ='" + modelo +
                          "'").uniqueResult();

    session.getTransaction().commit();

    if (tamaño == 0)
    {
      session = HibernateUtil.getSessionFactory().getCurrentSession();
      session.beginTransaction();
      Modelo modtmp = new Modelo(modelo);

      session.save(modtmp);

      session.getTransaction().commit();

    }
}

Best regards,
Hernan


Top
 Profile  
 
 Post subject:
PostPosted: Wed Jun 07, 2006 8:07 pm 
Beginner
Beginner

Joined: Sat Jun 03, 2006 6:23 pm
Posts: 28
I'm really not an expert, couple ideas that's all. I really hope you come back and tell us what solution worked best.

1. You should absolutley run the SELECT statement as a parametrized query, instead of building it dinamically. I think with this many statements, the fact that your database need not reparse the whole statement is a huge plus. I'd even try and go as far as reusing a Query object, only setting the parameter different each time, maybe it can reuse the underlying PreparedStatement, and not even the query string needs to be sent to the database (I'm not sure this works, though).

2. You are running at least 60.000 transactions, plus one for each insert. I definetly think you should do this in one transaction, or at least in bigger chunks. Anyway, checking in one transaction, then inserting in another, is not really good if others also access that table.

3. You should consider turning off cache interaction for these queries, unless you want all those records in the cache. Maybe even go as far as using the StatelessSession instead of Session.

4. If you stay with the Session interface, you should know, that the session is flushed before every Query by default. Maybe you should turn that off, and flush manually after you inserted the same amount of new rows as the jdbc batch size.

5. I'd consider loading the data into a temporary table without checking anything, using StatlessSession interface for absolute minimum overhead. Then use insert into (OriginalTable) ... select .... from (TempTable) (see 13.4. DML-style operations) to insert values which are not duplicates. It should be possible to formulate such a query maybe using left outer join or a subselect. The success of this approach is very much dependent on how much of that 60.000 rows you need to actually insert, and how often. Obviously, if only very few of the records usually needs to be inserted, this will not be the best solution.

Roland


Top
 Profile  
 
 Post subject:
PostPosted: Wed Jun 07, 2006 8:34 pm 
Senior
Senior

Joined: Sun Jun 04, 2006 1:58 am
Posts: 136
is mod.modelo a string??


Top
 Profile  
 
 Post subject:
PostPosted: Wed Jun 07, 2006 11:21 pm 
Newbie

Joined: Wed Jun 07, 2006 10:07 am
Posts: 5
to scarface: yes, mod.modelo is a string (it's the model number, as in "DVD2032/34R")

to szroland: well what I'm inserting is the model list, the company sends me the latest model list once a month. sometimes there are 20 new models, sometimes none. the 60000 record number is because they send me the part list and model list on the same excel file, a dump of their PartNumberByModelNumber table.

this table is sorted by model number so I save a LOT of queries with a simple "if lastModel != currentModel", so this whole update takes about 10 seconds to complete because I get about 6000 (faked) transactions per second .

the problem happens when I have to load the part numbers (which aren't sorted), and the number drops to 50-100 transactions per second. and the memory usage jumps to 350MB or more (according to windows' task manager) when I try to do that in one transaction alone. very weird because I'm constantly flush() ing. I tried calling System.gc() after flush() and the memory usage dropped significantly (to about 200MB), but the performance went to 20-30 lines/sec. I'm not sure if it's supposed to use this huge amount of memory, but hey it's Java :D. what drives me crazy is that I'm currently loading this same list in Access and it only takes a couple of seconds to load the whole file. it then whines about integrity (UNIQUE index to avoid duplicates), but it's fine by me.

a question: what does flush() do? I understand that it actually sends the data to the database, but that is not written into the database until commit(), is that the way it works? how can I use Hibernate in a non-transactional mode? like auto-commit?

anyway I don't care too much about the performance. this process runs only once a month and after office hours. but I'd like to make this faster than the 5 to 8 minutes it currently takes on my Sempron 2200+ with 1GB RAM


Top
 Profile  
 
 Post subject:
PostPosted: Wed Jun 07, 2006 11:54 pm 
Expert
Expert

Joined: Thu Dec 23, 2004 9:08 pm
Posts: 2008
Flush synchronizes the hibernate session cache, sending any changes made down to the database. It doesn't affect any transactions, so the changes are applied (by the DBMS) only inside the current transaction.

You can turn on autocommit mode like this:
Code:
session.connection().setAutoCommit(true);
But don't, it's not a good thing to do. You should use batching: something to the effect of flushing after every 20 saves and committing at the end. Optionally, you can commit after (e.g.) every 200 saves, but then you risk ending up with a partially-updates DB. That probably isn't too bad in this case, as your code is obviously handles importing the same file more than once, so you can simply re-import it.

One thing that will probably improve performance is to run this as a two-step process. First, outside of a transaction (or in a read-only transaction), get all your existing (in DB) models, and use them to whittle down the list of models to be imported, so that you end up with only the 0 to 20 new models to write. You can even use a StatelessSession to speed this up a little bit. Then save the (relatively) tiny list of new models.

_________________
Code tags are your friend. Know them and use them.


Top
 Profile  
 
 Post subject:
PostPosted: Thu Jun 08, 2006 7:00 pm 
Newbie

Joined: Wed Jun 07, 2006 10:07 am
Posts: 5
I was thinking of retrieving the whole list of parts from the database at once, load it on a HashSet and iterate through the new list with .contains(). That will take a lot of memory at first but I'll have a local copy of the parts table on memory, and hashset is fast enough to check the 60.000 records and add the newer 20 or so. I checked the parts table and it has 14000 records (it won't grow much bigger than that)

I'll try that and I'll let you know what the numbers look like after that modification.


Top
 Profile  
 
Display posts from previous:  Sort by  
Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 6 posts ] 

All times are UTC - 5 hours [ DST ]


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum

Search for:
© Copyright 2014, Red Hat Inc. All rights reserved. JBoss and Hibernate are registered trademarks and servicemarks of Red Hat, Inc.