What analyzer is good for my situation?

yiwong2001 · **Joined:** Mon Oct 27, 2008 6:26 am **Posts:** 36

Hi guys,

We are running a search app for books.

The Book entity is defined as follows:

@Entity
@Indexed
public class Book {
    @DocumentId
    private Integer UID;

    @Field
    private String title;

    @Field
    private String description;
    ...
}

If a user searches for a book name, say they enter Microsoft access 2007, books whose title or description contains microsoft, access or 2007 are returned. That is what we expected. However, some of the books are totally unrelated and match only because of the keyword 2007. I am looking for a way to take the importance of each keyword into account. In this case, 2007 is less important to the search, but currently the search makes no difference between microsoft, access and 2007.

The second use case: is there a good analyzer that can be used for indexing and querying to support multi-word phrases? I thought the default analyzer of Hibernate Search just tokenizes the search words into single words?

If the search words are microsoft access 2007, results should get the best score if they contain "microsoft access".

Another search example: "salt lake city", "united states". Results that match only salt, lake or city are not what we expect; at the very least, they should rank behind results containing "salt lake city".

Can anyone offer me some clues?

thanks!

sanne.grinovero · **Posted:** Fri Jun 03, 2011 9:00 am

Quote:

If a user searches for a book name, say they enter Microsoft access 2007, books whose title or description contains microsoft, access or 2007 are returned. That is what we expected. However, some of the books are totally unrelated and match only because of the keyword 2007. I am looking for a way to take the importance of each keyword into account. In this case, 2007 is less important to the search, but currently the search makes no difference between microsoft, access and 2007.

Are you testing on a small corpus? The importance of each token is relative to how frequently it is used: if you have many books mentioning "2007", then "2007" will automatically carry little weight in the calculation of the score. So basically, Lucene should solve this for you without any explicit direction.
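To make the frequency argument concrete, here is a small plain-Java sketch of the inverse document frequency idea. It uses the formula I recall from Lucene's classic default similarity, 1 + ln(numDocs / (docFreq + 1)); the class name, the corpus size and the document counts are all made up for illustration, and this is not the code Lucene actually runs.

```java
// Toy illustration of inverse document frequency (idf); not Lucene's real code path.
class IdfSketch {

    // Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1)).
    // The more documents contain a term, the smaller its weight.
    static double idf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 10_000; // hypothetical number of indexed books

        // Hypothetical document frequencies: "2007" is common, "access" is rare.
        System.out.println("idf(2007)   = " + idf(numDocs, 4_000));
        System.out.println("idf(access) = " + idf(numDocs, 50));
    }
}
```

On a tiny test index every term appears in roughly the same number of documents, so the weights come out nearly equal, which would explain why "2007" looks as important as the other words.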

Quote:

The second use case: is there a good analyzer that can be used for indexing and querying to support multi-word phrases? I thought the default analyzer of Hibernate Search just tokenizes the search words into single words?

That's correct, that's the default. There are many alternatives, like org.apache.lucene.search.MultiPhraseQuery and org.apache.lucene.search.PhraseQuery, both supported by the QueryBuilder DSL; this is taken from the test suite (which you can find in the sources):

Code:

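// Imports the snippet relies on
import org.apache.lucene.search.Query;
import org.hibernate.search.query.dsl.QueryBuilder;

// Build a QueryBuilder bound to the indexed entity
final QueryBuilder monthQb = fullTextSession.getSearchFactory()
      .buildQueryBuilder().forEntity( Month.class ).get();

// Match the words as a phrase in the "mythology" field
Query query = monthQb
      .phrase()
      .onField( "mythology" )
      .sentence( "colder and whitening" )
      .createQuery();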

sanne.grinovero · **Posted:** Fri Jun 03, 2011 9:08 am

(forgot a question)

Quote:

Another search example: "salt lake city", "united states". Results that match only salt, lake or city are not what we expect; at the very least, they should rank behind results containing "salt lake city".

You can play with the Slop parameter of Phrase Queries, and maybe even help by pre-splitting the field on the comma. You could also try splitting the query yourself in different parts (before comma / after comma), apply different options, requirements, boosting to them and then combine the two queries with a boolean query.
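As a rough picture of the "split on the comma, boost, then combine" idea, here is a toy scorer in plain Java. It is only a sketch of the ranking effect, not Lucene or Hibernate Search code: the class, the boost value and the scoring rule are all invented for illustration. In a real setup the exact-phrase part would be a boosted phrase query and the single words would be optional clauses of a boolean query.

```java
import java.util.Locale;

// Toy scorer: exact phrase hits outrank documents that match only single words.
class PhraseBoostSketch {

    // Hypothetical extra weight given to a full phrase match.
    static final double PHRASE_BOOST = 5.0;

    static double score(String query, String document) {
        // Pad with spaces so that contains() only matches whole words.
        String doc = " " + normalize(document) + " ";
        double score = 0.0;
        // Split the user input on the comma and treat each part as a phrase.
        for (String part : query.split(",")) {
            String phrase = normalize(part);
            if (phrase.isEmpty()) {
                continue;
            }
            if (doc.contains(" " + phrase + " ")) {
                score += PHRASE_BOOST;      // whole phrase found
            }
            for (String term : phrase.split(" ")) {
                if (doc.contains(" " + term + " ")) {
                    score += 1.0;           // single word found
                }
            }
        }
        return score;
    }

    // Lower-case the text and collapse anything that is not a letter or digit into one space.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
    }

    public static void main(String[] args) {
        String query = "salt lake city, united states";
        System.out.println(score(query, "Hiking near Salt Lake City"));
        System.out.println(score(query, "Salt and pepper in the city"));
    }
}
```

The document containing the whole phrase gets the boost on top of its single-word matches, so it sorts ahead of the one that merely mentions "salt" and "city".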

If what worries you more is "salt lake" being divided into two common terms, then you want to look into org.apache.lucene.analysis.shingle.ShingleFilter, which combines adjacent words into single tokens.
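To show what a shingle is, here is a plain-Java sketch that builds two-word shingles from a string. It only imitates the idea; it is not the ShingleFilter implementation, and the class and method names are made up. As far as I remember, the real filter by default also emits the single words alongside the shingles, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy word-shingle generator; only an imitation of what a shingle filter produces.
class ShingleSketch {

    // Returns every run of `size` adjacent words, joined by a single space.
    static List<String> shingles(String text, int size) {
        String[] words = text.trim().toLowerCase().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [salt lake, lake city]
        System.out.println(shingles("Salt Lake City", 2));
    }
}
```

With "salt lake" indexed as one token, it is much rarer than "salt" or "lake" alone, so a match on it counts for more in the score.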