bluesky wrote:
It looks through many layers of objects for each account. Throughout this process it is composing a result set of problem data.
I played with this some more and found I get the same speed increase just by calling session.clear() and session.flush() after each account is audited. So the only problem with breaking it up into lots of new sessions is that I believe I am detaching all my objects.
I find it odd that this is necessary, though; it doesn't really seem like that much data to me. Is there a way I could measure exactly how many objects are open in a session at any given time?
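On the side question about measuring: yes, Session.getStatistics() returns a SessionStatistics you can ask for the number of entities and collections currently held in the first-level cache, so you can print it after each account and watch it grow (or stay flat once you clear). A quick sketch; the helper class and label are just for illustration:

import org.hibernate.Session;
import org.hibernate.stat.SessionStatistics;

public class SessionSizeProbe {

    // Prints how many entities and collections the Session is currently
    // holding in its first-level cache. Call it after each account is
    // audited to see how the cache is growing.
    public static void dumpSessionSize(Session session, String label) {
        SessionStatistics stats = session.getStatistics();
        System.out.println(label
                + ": entities=" + stats.getEntityCount()
                + ", collections=" + stats.getCollectionCount());
    }
}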
First, I don't think you are "detaching" your objects, because your Session is never closed on them and you never use those objects with a different Session instance.
Second, session.clear() will drop from the first-level cache the objects that are not dirty. So anything you touched read-only can be discarded, since lazy loading can always go and fetch it again (from SQL). Proxied objects will still work, provided the Session stays open.
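In other words, the pattern you stumbled on (flush, then clear, after each account) is the usual way to keep the first-level cache from growing without bound. A rough sketch, where Account and auditOneAccount() are placeholders for your real entity and audit logic:

import java.util.List;
import org.hibernate.Session;

public class AccountAuditLoop {

    // Walks a list of account ids, loads each Account, runs the audit,
    // then flushes and clears so the first-level cache never holds much
    // more than one account's object graph at a time.
    public static void auditAll(Session session, List accountIds) {
        for (Object idObj : accountIds) {
            Account account = (Account) session.get(Account.class, (Integer) idObj);
            auditOneAccount(account);   // your real multi-layer inspection

            session.flush();            // push any pending changes first
            session.clear();            // then forget everything we loaded
        }
    }

    private static void auditOneAccount(Account account) {
        // placeholder for the real audit logic
    }

    public static class Account {       // placeholder for your mapped entity
    }
}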
Third, maybe you want to consider how you get your 1,000,000-account record list; I don't think you documented exactly how you did it. Generally speaking, if you use Query.list() (or a similar HQL interface), all 1 million objects might be loaded from the SQL server into RAM before you get to work on your first object. This does not scale. However, if you use an integral ID number, blindly start at 1, and issue Session.get(Account.class, Integer.valueOf(thisId)), then you may not be subject to this bottleneck. Maybe you can chunk your HQL.
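For example, one way to chunk the HQL so only a page of accounts is ever in RAM (the page size of 500 and the Account entity name are placeholders):

import java.util.List;
import org.hibernate.Query;
import org.hibernate.Session;

public class ChunkedAccountReader {

    // Pages through the accounts instead of list()-ing all of them at
    // once, so only one page's worth of objects is held at a time.
    public static void processInChunks(Session session) {
        final int pageSize = 500;                 // arbitrary; tune to your heap
        int offset = 0;
        while (true) {
            Query query = session.createQuery("from Account a order by a.id");
            query.setFirstResult(offset);
            query.setMaxResults(pageSize);
            List page = query.list();
            if (page.isEmpty()) {
                break;
            }
            for (Object account : page) {
                // inspect this account here
            }
            session.clear();                      // drop the page before fetching the next
            offset += pageSize;
        }
    }
}

If your JDBC driver can stream results, Query.scroll(ScrollMode.FORWARD_ONLY) is another way to avoid materialising the whole list, but the paging above is the simplest to reason about.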
It sounds like there is a big difference between the data set sizes for:
* the number of objects touched to decide whether we are interested in an account (i.e. whether it belongs in the result set), and
* the number of objects needed to remember all the accounts we did find interesting.
I originally thought you were modifying a large amount of data, and I suspect you are modifying something along the way; otherwise I think tx.commit() would be unnecessary and Session.flush() plus Session.clear() would have done the trick.
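If you are modifying accounts as you go, the usual compromise is to commit in chunks rather than once per row or once at the very end. A sketch only; the chunk size and loadAndFixAccount() are placeholders:

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class ChunkedCommit {

    // Commits every 'chunkSize' accounts so neither the Session nor the
    // database transaction grows without bound.
    public static void fixAccounts(SessionFactory factory, int[] accountIds) {
        final int chunkSize = 100;
        Session session = factory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            for (int i = 0; i < accountIds.length; i++) {
                loadAndFixAccount(session, accountIds[i]);
                if ((i + 1) % chunkSize == 0) {
                    tx.commit();                  // push this chunk to the database
                    session.clear();              // and forget its objects
                    tx = session.beginTransaction();
                }
            }
            tx.commit();                          // final partial chunk
        } finally {
            session.close();
        }
    }

    private static void loadAndFixAccount(Session session, int accountId) {
        // placeholder for the real load-and-modify logic
    }
}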
Different people will have different opinions on whether a result set of possibly 1 million records is too big to work from RAM. If you have 4 GB of RAM and the only purpose of the application server is to service that one user, then you might not consider your method outrageous.
Someone else might take the tack that if we only want to know the problem account numbers (or some other such reference), then every account is still inspected to find out whether we are interested, but we simply record the accountId to form a minimal result set (during a first pass).
As you can guess, this is a much lighter-weight result set, which you could (if you really needed to) offload into a file and do away with any real RAM overhead (for really huge batch jobs); see the sketch below.
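Something like this is all the "result set" has to be; the file name and the Integer id type are assumptions standing in for whatever reference you actually use:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class ProblemAccountLog {

    // Records only the ids of the accounts we found interesting, straight
    // to a file, so the result set costs almost no RAM. A second pass
    // (or a report) can re-load each account by id later.
    private final PrintWriter out;

    public ProblemAccountLog(String fileName) throws IOException {
        this.out = new PrintWriter(new FileWriter(fileName));
    }

    public void record(Integer accountId) {
        out.println(accountId);   // just the reference, not the object graph
    }

    public void close() {
        out.close();
    }
}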
I generally find enterprise computing to be a trade-off between resources, and it sounds like you might want to stand back a moment and decide what your environment parameters are. For me this might mean: any one of 30 users might run this process at any time of the working day, and two people might run it at the same time, so I might put a 64 MB limit on the task and work out my algorithm with that in the specification, knowing there are up to 90 users on the same box with maybe only 10 active at once. These metrics directly impact my choice of algorithm for that task.
I rarely find any "batch" job can be programmed in a simplistic / naive way, as the metrics don't scale well or allow fair use by the other parts of the system that share those resources at the same time.
HTH YMMV