2 letter search performance

sourabh · **Joined:** Thu Jan 08, 2009 9:23 am **Posts:** 9

Hi,

We face a serious performance issue when users do two-letter searches, e.g. ho, jo, pa ma, um ar, ma fi, etc. The time taken is between 10 and 15 seconds. The search runs on 7 fields, with a PrefixQuery on every field, combined as an AND search.

We show only the top 100 documents.

Our index size is 300 MB.

We use StandardAnalyzer and StandardTokenizer for indexing and searching.

Please let me know how we can improve the performance.

Regards,
Sourabh

sanne.grinovero · **Posted:** Sun Feb 01, 2009 6:08 am

Hi,
are you combining the "two letter search" with other constraints, or do you always run only this kind of search?

You are currently building the worst-case query for Lucene IMHO.
This is more of a Lucene forum question, but I think you could solve it by building a custom analyzer which emits "two letter" tokens: in this case the number of matching Terms would be high, possibly increasing the index size, but the index would be highly optimized for your kind of search, bringing you back to the usual blazing-fast queries.
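To make the idea concrete, here is a rough sketch in plain Java. It is not Lucene or Hibernate Search API: the class `TwoLetterTokenIndex` and its methods are made up for illustration, and the tokenization (lowercase, split on non-alphanumerics) is only an assumption standing in for a real analyzer. The point is that once the two-letter prefix is stored as its own token, a two-letter search becomes an exact term lookup instead of a prefix expansion.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index, only to illustrate the "emit two-letter tokens" idea.
final class TwoLetterTokenIndex {
    // token -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Lowercase, split on non-alphanumerics, and emit each word
    // followed by its two-letter prefix (for words longer than two letters).
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            tokens.add(word);
            if (word.length() > 2) {
                tokens.add(word.substring(0, 2));
            }
        }
        return tokens;
    }

    void add(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // AND search: every term must match a stored token exactly.
    Set<Integer> searchAll(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings.getOrDefault(
                    term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        TwoLetterTokenIndex index = new TwoLetterTokenIndex();
        index.add(1, "John Howard");
        index.add(2, "Joanna Parker");
        index.add(3, "Holly Jones");
        System.out.println(index.searchAll("ho", "jo")); // prints [1, 3]
    }
}
```

Note that in this sketch the prefix tokens share one map with the full words, so "jo" would also match a document containing the literal word "jo". Keeping the prefixes in a separate field avoids that.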

You may also try an additional field to combine the 7 ANDed fields into one at indexing time.

These suggestions break down if you also want to support other kinds of searches; in that case you may want to keep two sets of fields in the index, so you can use the best one for each search type and possibly combine them.
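A minimal sketch of what "use the best fields depending on search type" could look like, again in plain Java rather than real Lucene API. The field names `name_prefix2` and `name`, the `Clause` type, and the rule "exactly two letters goes to the prefix field" are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical query planner: picks a field per search term.
final class SearchFieldRouter {
    // Assumed field names; a real index would define its own.
    static final String PREFIX_FIELD = "name_prefix2";
    static final String FULL_FIELD = "name";

    enum Kind { EXACT_TERM, PREFIX }

    static final class Clause {
        final String field;
        final String text;
        final Kind kind;

        Clause(String field, String text, Kind kind) {
            this.field = field;
            this.text = text;
            this.kind = kind;
        }

        @Override
        public String toString() {
            return field + ":" + text + (kind == Kind.PREFIX ? "*" : "");
        }
    }

    // Two-letter terms become an exact match on the prefix field;
    // everything else stays a prefix match on the full field.
    static Clause route(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        if (t.length() == 2) {
            return new Clause(PREFIX_FIELD, t, Kind.EXACT_TERM);
        }
        return new Clause(FULL_FIELD, t, Kind.PREFIX);
    }

    // Split the user input on whitespace and route each term.
    static List<Clause> plan(String userInput) {
        List<Clause> clauses = new ArrayList<>();
        for (String term : userInput.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                clauses.add(route(term));
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(plan("ho john")); // prints [name_prefix2:ho, name:john*]
    }
}
```

One-letter terms still fall through to the prefix branch here, so they would stay expensive unless they are handled separately or rejected.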

sourabh · **Joined:** Thu Jan 08, 2009 9:23 am **Posts:** 9

Hi,

From the Lucene documentation I found two ways to improve performance:

1. Sorting the documents to be retrieved by docID before loading them improves performance.
2. We can avoid loading all fields of a document by using FieldSelectorResult.LAZY_LOAD.

Please correct me if I am wrong. I want to implement these suggestions and would like to know whether Hibernate Search supports them.

sourabh

amin-mc · **Joined:** Wed Oct 03, 2007 2:31 pm **Posts:** 205

Hi

Please find info from Erick Erickson on prefix queries:

Quote:

Prefix queries are expensive here. The problem is
that each one forms a very large OR clause on all
the terms that start with those two letters. For instance,
if a field in your index contained
mine
milanta
mica

a prefix search on "mi" would form
mine OR milanta OR mica.

Doing this across seven fields could get expensive.

Two things:
1> what is the problem you are trying to solve? Perhaps some
of the folks on the list can give you some suggestions. You can
think about many strategies depending upon what you want
to accomplish. A 300M index isn't very big, so you could, for
instance, think about indexing a separate field that contains only
the two beginning letters and search *that* in this case. I'll
assume that three letter prefix queries are OK.

2> How are you measuring query time? If you're measuring the
time it takes when you first start a searcher, be aware that the
first few queries are usually slow because the caches haven't
been filled. Further, are you measuring total response time or
are you measuring *just* the query time? It's possible that the
time is being spent assembling the response in your code
rather than actual searching. You might insert some timers
to determine that.

This is taken from the Lucene mailing list.
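For anyone who wants to see the expansion described in the quote, here is a small plain-Java sketch. It only models the term dictionary as a sorted set and is not how Lucene is implemented; `PrefixExpansionDemo`, `expand` and `totalClauses` are names invented for the example, and the extra terms "mango" and "moss" are added just to show that non-matching terms are skipped.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical model of prefix expansion over a sorted term dictionary.
final class PrefixExpansionDemo {
    // Every string that starts with the prefix sorts inside [prefix, prefix + '\uffff').
    // (A term containing U+FFFF right after the prefix would be missed; ignored here.)
    static NavigableSet<String> expand(NavigableSet<String> terms, String prefix) {
        return terms.subSet(prefix, true, prefix + Character.MAX_VALUE, false);
    }

    // One OR clause per matching term, summed over all searched fields.
    static int totalClauses(Collection<? extends NavigableSet<String>> termsPerField,
                            String prefix) {
        int total = 0;
        for (NavigableSet<String> terms : termsPerField) {
            total += expand(terms, prefix).size();
        }
        return total;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms =
                new TreeSet<>(Arrays.asList("mine", "milanta", "mica", "mango", "moss"));

        // prints mica OR milanta OR mine
        System.out.println(String.join(" OR ", expand(terms, "mi")));

        // Seven fields with the same terms: 3 clauses per field, 21 in total.
        System.out.println(totalClauses(Collections.nCopies(7, terms), "mi")); // prints 21
    }
}
```

With a real index the number of terms sharing a two-letter prefix can be far larger than three, which is why the clause count grows so quickly across seven fields.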

HTH

amin-mc · **Joined:** Wed Oct 03, 2007 2:31 pm **Posts:** 205

OK, sorry folks, I just realised that the quote I copied was actually addressed to the original author of this post.

Sorry again!

hardy.ferentschik · **Posted:** Tue Feb 03, 2009 10:33 am

hi,

I think the best approach is either Sanne's custom analyzer, which emits/adds the two-letter tokens to the token stream, or indexing the first two letters into a separate field, as suggested in the quoted email.

You definitely want to get away from the expensive PrefixQuery and complex BooleanQueries.

All this is possible in Hibernate Search right now.

--Hardy