Hi,
I've got a database table that contains around 6 million rows; each row includes a place name (literally the name of a place, such as a town).
I've got the table mapped to an EJB3 entity using annotations and I'm running in JBoss 4.0.5.GA using EJB3 stateless session beans for my application.
Most of this is fine :)
I have a problem with working through a large result set using the Query setMaxResults and setFirstResult methods...
Basically, I want to go through each place name and check it for "strange" unicode characters - it is more convenient to do this in Java than in a stored procedure as I have some code that defines what "strange" is.
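For what it's worth, the check itself is roughly along these lines - just an illustrative sketch (the class name and the exact definition of "strange" are made up for this post; my real code is more involved):
Code:
// Illustrative only - the real visitor has a more involved definition of "strange".
public class StrangeCharacterChecker {

    // Flags control characters and anything outside Latin-1 as "strange".
    public boolean isStrange(String placeName) {
        for (int i = 0; i < placeName.length(); i++) {
            char c = placeName.charAt(i);
            if (Character.getType(c) == Character.CONTROL || c > 0x00FF) {
                return true;
            }
        }
        return false;
    }
}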
I have some code that selects all the rows from the place table and loops through them 1000 at a time:
Code:
// countQry, resultsQry, visitor, _em, MAX_RESULTS and log are set up earlier;
// BigDecimal and RoundingMode come from java.math.
Long rowCount = (Long) countQry.getSingleResult();
System.out.println("Row Count: " + rowCount.longValue());
int calls = (int) (rowCount.longValue() / MAX_RESULTS);
System.out.println("calls: " + calls);

int offset = 0;
int loopsCounted = 0;
resultsQry.setMaxResults(MAX_RESULTS);

while (loopsCounted <= (calls + 1)) {
    long loopStart = System.currentTimeMillis();

    // fetch the next page of MAX_RESULTS rows
    resultsQry.setFirstResult(offset);
    List rows = resultsQry.getResultList();
    long dbFinished = System.currentTimeMillis();

    // run the "strange character" check over this page
    visitor.process(rows);
    long visitorProcessingFinished = System.currentTimeMillis();

    _em.clear(); // let's not run out of memory
    long clearedEm = System.currentTimeMillis();

    loopsCounted = loopsCounted + 1;
    offset = offset + MAX_RESULTS;

    // crude profiling
    long totalTime = clearedEm - loopStart;
    long dbTime = dbFinished - loopStart;
    long procTime = visitorProcessingFinished - dbFinished;
    long emClear = clearedEm - visitorProcessingFinished;
    long remainingTime = (calls - loopsCounted) * totalTime;
    BigDecimal secondsToComplete = (new BigDecimal(remainingTime)).divide(new BigDecimal(1000), 2, RoundingMode.DOWN);

    log.info("Loop: " + totalTime + ", DB: " + dbTime + ", Visitor: " + procTime
            + ", em.clear(): " + emClear + ". Remaining (s): " + secondsToComplete + ". Next offset: " + offset);
}
For a short while this code runs OK, with the crude profiling in the code showing each call to resultsQry.getResultList() taking 50ms to 150ms to complete. By the time 1.2 million rows have been processed (7 hours later), the time taken for the resultsQry.getResultList() call to complete has risen to about 5 seconds...
To convince myself that it's not an environment problem, I re-wrote the process as 2 steps:
1) select all the PK values from the place table (ordered ascending)
2) use an HQL query to retrieve 1000 places at a time, bounding the results to ids greater than minid and less than maxid (roughly as sketched below)
With this approach the time for resultsQry.getResultList() to complete stays constant at between 50ms and 150ms, and the whole process finishes in about 15 minutes.
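In case it helps anyone follow along, the second approach looks roughly like this (the Place entity, its id field and the variable names are simplified for the post - treat it as a sketch rather than the exact code):
Code:
// Simplified sketch of the 2-step approach.
// Step 1: pull back just the primary keys, ordered ascending.
List ids = _em.createQuery("select p.id from Place p order by p.id").getResultList();

// Step 2: fetch the places MAX_RESULTS (1000) at a time, bounded by id rather than by offset.
for (int i = 0; i < ids.size(); i += MAX_RESULTS) {
    Long minId = (Long) ids.get(i);
    Long maxId = (Long) ids.get(Math.min(i + MAX_RESULTS, ids.size()) - 1);

    List places = _em.createQuery(
            "from Place p where p.id >= :minId and p.id <= :maxId order by p.id")
            .setParameter("minId", minId)
            .setParameter("maxId", maxId)
            .getResultList();

    visitor.process(places);
    _em.clear(); // detach the batch so the persistence context stays small
}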
Does anyone have any idea what might be going on?
My suspicion is that something bad is happening inside the Oracle JDBC driver around the ROWNUM values and caching on the client side :(
I'd be interested in any thoughts/suggestions/explanations.
Cheers,
Andy