Hibernate Shards not applying setMaxResults as a SQL LIMIT

rjrodger · **Joined:** Tue Jan 22, 2008 6:09 am **Posts:** 1

Hi -

I have come accross a problem with Hibernate Shards and queries on very large tables. Shards does not include a SQL LIMIT clause (using MySQL) in the actual query sent to the database. It only seems to cut the result list once it has collected all the shard result sets.

This presents a problem when the Sharded tables are massive (millions of rows).

I've taken a look at the Shards code and have come up with the following solution. In org.hibernate.shards.criteria.ShardedCriteriaImpl the setMaxResults method originally looks like this:

Code:

  
  public Criteria setMaxResults(int maxResults) {
    criteriaCollector.setMaxResults(maxResults);      
    return this;
  }

Which means that a limit is only applied as an exit strategy.

However, modifying the method like so:

Code:

  
  public Criteria setMaxResults(int maxResults) {
    criteriaCollector.setMaxResults(maxResults);
    
    // LIMIT FIX
    CriteriaEvent event = new SetMaxResultsEvent(maxResults);
    for (Shard shard : shards) {
      if (shard.getCriteriaById(criteriaId) != null) {
        shard.getCriteriaById(criteriaId).setMaxResults(maxResults);
      } 
      else {
        shard.addCriteriaEvent(criteriaId, event);
      }
    }
    
    return this;
  }

causes Shards to apply the limit as part of the SQL.

My question is: what will this break, if anything?

I can't see that it could cause a problem. For example, for a sorted query, the SQL statements will execute on each shard, return the top 100 results (with LIMIT 100, say). Shards then does the final merge and sort, and only keeps the top final 100. You still get the correct answer (i.e. same as for a non-sharded table).

Obviously each Shard returns more than it has too, but not *all* rows, which means that queries on large tables can actually work in a reasonable time.

Any opinions on this solution?

MaxCRoss · **Joined:** Wed Apr 27, 2005 10:22 pm **Posts:** 8

I think you're absolutely right - there's no reason to pull back the contents of the entire table. If you're limiting your query to X results, the first X results may not be in the top X across all shards, but there's no way the X + 1th could possibly be in the top X across all shards because it's not even in the top X on a single shard.

Let me look at your solution a bit longer in the context of the the rest of the ShardedCriteriaImpl methods, but at first glance it looks solid.

Thanks!

Max

gmax · **Joined:** Fri Feb 11, 2005 8:15 pm **Posts:** 31

Gah, I remember the issue now!

If you're just doing a LIMIT with a sort then everything you say is true, and pulling back the contents of the entire table is completely foolish. However, if you're doing a LIMIT with a sort and setting a value for firstResult then we have no choice but to pull back the entire contents of the table and do the merge in memory.

Suppose you have a PERSON table with a column called FIRST_NAME:

Shard 1
ALEX
BILL
CARL
DAVE

Shard 2
BOB
FRED
GARY
HENRY

Now suppose you want to issue a query that returns all users, ordered by FIRST_NAME, starting with the second result, with a limit of 2 results.

Code:

Criteria crit = session.createCriteria(Person.class).setFirstResult(2).setMaxResults(2).addOrder(Order.asc("firstName"))list();

What answer would you expect? Well, Shards tries to make your data look like it all lives in the same place, and if all this data was in the same place the result of the select and the order by would be this:
ALEX
BILL
BOB
CARL
DAVE
FRED
GARY
HENRY

We'd then start with the result at index 2 (BOB) and return 2 results: BOB and CARL.

Now suppose we apply the first result and max result constraints before we return data to the application. From the first Shard we would get:
CARL
DAVE

And from the second Shard we would get:
GARY
HENRY

And from there, well, I'm not even sure what we would do to merge those results, but one thing is for certain, we're not going to be able to return BOB and CARL because BOB isn't even part of the result set.

I've thought through this pretty extensively just now and I think I buy it. Do you? :-)

Max