I'm having a pretty severe performance problem with postgres/hibernate fetching where the query takes around 250ms and I'd rather it takes around 20 or 30ms if possible since this query is used the most in the entire system.
Here is the HQL. As you can see, I'm fetching a collection called shipment.statusMessages, which usually contains just 2 or 3 rows. However, this query typically returns 100 or 150 results, so it does add up. Since this pulls up duplicate results, I have to add them to a linked hash set to maintain order but remove the duplicates.
Code:
Query query = session.createQuery(
"select shipment " +
"from Shipment shipment " +
" inner join shipment.cargoControlNumber.carrierCode.carrier.persons person " +
" inner join shipment.currentStatus currentStatus " +
" inner join currentStatus.releaseCode releaseCode " +
" left join fetch shipment.currentStatus " +
" left join fetch shipment.statusMessages " +
"where " +
" person.id = :personId and " +
" shipment.isPurged = false and " +
" releaseCode.number = :releaseCodeNumber and " +
" currentStatus is not null and " +
" currentStatus.date >= current_date - 7 " +
"order by currentStatus.date desc"
);
query.setParameter( "personId", personId );
query.setParameter( "releaseCodeNumber", releaseCodeNumber );
query.setFirstResult( firstResult );
query.setMaxResults( maxResults );
List list = query.list();
Set distinctShipments = new LinkedHashSet( list );
return Arrays.asList( distinctShipments.toArray() );
Here is the generated SQL (I replaced the question marks with actual data so I could run it outside of the application). I also took out the stuff in the select clause because it occupies too much space and is messy. It's not very important anyway.
Code:
select [shipment0_.*, .... ]
from shipment shipment0_
inner join shipment_status shipmentst5_ on shipment0_.current_status_id = shipmentst5_.id
inner join release_code releasecod6_ on shipmentst5_.release_code_id=releasecod6_.id
inner join carrier_code carriercod1_ on shipment0_.carrier_code_id=carriercod1_.id
inner join carrier carrier2_ on carriercod1_.carrier_id=carrier2_.id
inner join carrier_to_person persons3_ on carrier2_.id=persons3_.carrier_id
inner join person person4_ on persons3_.person_id=person4_.id
left outer join shipment_status statusmess7_ on shipment0_.id=statusmess7_.shipment_id
where (person4_.id=255 )
and(shipment0_.is_purged=false )
and(releasecod6_.number='4' )
and(shipmentst5_.id is not null )
and(shipmentst5_.date>=current_date-7 )
order by shipmentst5_.date desc
limit 100
Using Dan Tow's diagram method, The query should really start with the person since it has 0.0045 filter ration (122 person rows). The shipment table has about 18,000 rows and shipment_status has about 33,000, which is the main reason the query takes so long. The rest of the rows are modest in size and won't get large.
The filter ratio on shipment_status is 0.083, which is very good considering there are 33k rows in that table. I would like to filter those out as fast as possible.
The filter ration on release_code is 0.125, but there are only 8 entries in this table. Ironically, this filter isn't very good because when the number = '4', it will match up to 54% on average of the rows in shipment_status anyway.
The filter on shipment.is_purge is very bad, roughly being 0.98, hence sequencial scans the optimizer is doing.
Here is the explain on this query with the current indexes I'm using.
Code:
Limit (cost=1090.48..1090.56 rows=30 width=112)
-> Sort (cost=1090.48..1090.56 rows=30 width=112)
Sort Key: shipmentst5_.date
-> Nested Loop Left Join (cost=615.95..1089.75 rows=30 width=112)
-> Nested Loop (cost=615.95..982.99 rows=17 width=87)
-> Index Scan using person_pkey on person person4_ (cost=0.00..5.61 rows=1 width=4)
Index Cond: (id = 255)
-> Hash Join (cost=615.95..977.21 rows=17 width=91)
Hash Cond: ("outer".release_code_id = "inner".id)
-> Hash Join (cost=614.84..975.28 rows=131 width=91)
Hash Cond: ("outer".id = "inner".current_status_id)
-> Index Scan Backward using current_status_date_idx on shipment_status shipmentst5_ (cost=0.00..305.46 rows=10735 width=21)
Index Cond: (date >= (('now'::text)::date - 7))
Filter: (id IS NOT NULL)
-> Hash (cost=613.86..613.86 rows=393 width=70)
-> Hash Join (cost=7.43..613.86 rows=393 width=70)
Hash Cond: ("outer".carrier_code_id = "inner".id)
-> Seq Scan on shipment shipment0_ (cost=0.00..509.10 rows=17908 width=66)
Filter: (is_purged = false)
-> Hash (cost=7.42..7.42 rows=2 width=8)
-> Hash Join (cost=4.74..7.42 rows=2 width=8)
Hash Cond: ("outer".id = "inner".carrier_id)
-> Seq Scan on carrier carrier2_ (cost=0.00..2.44 rows=44 width=4)
-> Hash (cost=4.74..4.74 rows=2 width=16)
-> Hash Join (cost=3.03..4.74 rows=2 width=16)
Hash Cond: ("outer".carrier_id = "inner".carrier_id)
-> Seq Scan on carrier_code carriercod1_ (cost=0.00..1.46 rows=46 width=8)
-> Hash (cost=3.02..3.02 rows=1 width=8)
-> Index Scan using carrier_to_person_person_id_idx on carrier_to_person persons3_ (cost=0.00..3.02 rows=1 width=8)
Index Cond: (255 = person_id)
-> Hash (cost=1.10..1.10 rows=1 width=4)
-> Seq Scan on release_code releasecod6_ (cost=0.00..1.10 rows=1 width=4)
Filter: ((number)::text = '4'::text)
-> Index Scan using shipment_status_shipment_id_release_code_id_ukey on shipment_status statusmess7_ (cost=0.00..6.26 rows=2 width=25)
Index Cond: ("outer".id = statusmess7_.shipment_id)
However, I don't think this plan resembles the optimal plan using Dan Tow's method and I don't know how to fix it. The plan should be
person -> carrier_to_person -> carrier -> carrier_code -> shipment -> current_status -> release_code -> shipment_status (the fetch join)
Or, if that is not performant enough, I would probably want this one:
currnet_status -> release_code -> shipment -> carrier_code -> carrier -> carrier_to_person -> person -> shipment_status (the fetch join)
I'm not very good at reading plans in postgresql, but it doesn't look like either of these plans are being used. I see that it's using an index on shipment_status that covers release_code_id and shipment_id. Even though Dan Tow's method does not say to use this index, it speeds the query from 650ms to 250ms.
Can anyone please help me out? I'm an object guy, not a dba. But there is nobody here to ask. Thanks.