
All times are UTC - 5 hours [ DST ]



Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 3 posts ] 
 Post subject: What analyzer is good for my situation?
Posted: Thu Jun 02, 2011 11:07 pm 
Beginner

Joined: Mon Oct 27, 2008 6:26 am
Posts: 36
Hi guys,

We are running a search app for books.

The Book entity is defined as follows:

@Entity
@Indexed
public class Book {
    @DocumentId
    private Integer UID;

    @Field
    private String title;

    @Field
    private String description;
    ...
}

If a user searches for a book name, say they enter "Microsoft access 2007", then books whose title or description contains microsoft, access or 2007 are returned. That is what we expected. However, some of the books are totally unrelated and match only because of the keyword 2007. I am looking for a way to take the importance of each keyword into account. In this case, 2007 is less important to the search, but at the moment the search makes no distinction between microsoft, access and 2007.

The second use case: is there a good analyzer that can be used for indexing and querying to support multi-word phrases? I thought the default analyzer of Hibernate Search just tokenizes the search words into single words?

If the search words are microsoft access 2007, results should get the best score if they contain "microsoft access".

Another search example: "salt lake city", "united states". Results that match only salt, city or lake are not what we expect; at the very least, they should rank behind results containing "salt lake city".

Can anyone offer me some clues?

Thanks!


 Post subject: Re: What analyzer is good for my situation?
Posted: Fri Jun 03, 2011 9:00 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Quote:
If a user searches for a book name, say they enter "Microsoft access 2007", then books whose title or description contains microsoft, access or 2007 are returned. That is what we expected. However, some of the books are totally unrelated and match only because of the keyword 2007. I am looking for a way to take the importance of each keyword into account. In this case, 2007 is less important to the search, but at the moment the search makes no distinction between microsoft, access and 2007.

Are you testing on a small corpus? The importance of each token is relative to how frequently it is used: if you have many books mentioning "2007", then "2007" will automatically carry little weight in the calculation of the score. So basically, Lucene should solve this for you without any explicit direction.
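
To make the frequency effect concrete, here is a minimal plain-Java sketch of the inverse document frequency (IDF) factor used by Lucene's classic TF-IDF scoring (DefaultSimilarity in the Lucene versions of that time). It does not call Lucene at all; the class name and the document counts are made up purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class IdfSketch {

    // Classic Lucene TF-IDF inverse document frequency:
    // idf = 1 + ln(numDocs / (docFreq + 1))
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 10_000; // hypothetical index size

        // Hypothetical number of books containing each term.
        Map<String, Integer> docFreq = new LinkedHashMap<>();
        docFreq.put("microsoft", 300);
        docFreq.put("access", 150);
        docFreq.put("2007", 4_000);

        // Terms that appear in many documents get a smaller idf,
        // so they contribute less to the final score.
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            System.out.printf("%-10s idf=%.3f%n", e.getKey(), idf(e.getValue(), numDocs));
        }
    }
}
```

With these invented counts, "2007" ends up with a clearly lower weight than "microsoft" or "access". On a small test corpus the counts are too close together for the difference to show, which is why the corpus size matters.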

Quote:
The second use case: is there a good analyzer that can be used for indexing and querying to support multi-word phrases? I thought the default analyzer of Hibernate Search just tokenizes the search words into single words?
That's correct, that's the default. There are many alternatives, like org.apache.lucene.search.MultiPhraseQuery and
org.apache.lucene.search.PhraseQuery, both supported by the QueryBuilder DSL; this example is taken from the test suite (which you can find in the sources):

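Code:
import org.apache.lucene.search.Query;
import org.hibernate.search.query.dsl.QueryBuilder;

// fullTextSession is an already opened org.hibernate.search.FullTextSession
final QueryBuilder monthQb = fullTextSession.getSearchFactory()
      .buildQueryBuilder().forEntity( Month.class ).get();

// Build a phrase query on the "mythology" field
Query query = monthQb
      .phrase()
      .onField( "mythology" )
      .sentence( "colder and whitening" )
      .createQuery();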

_________________
Sanne
http://in.relation.to/


 Post subject: Re: What analyzer is good for my situation?
Posted: Fri Jun 03, 2011 9:08 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
(forgot a question)
Quote:
Another search example: "salt lake city", "united states". Results that match only salt, city or lake are not what we expect; at the very least, they should rank behind results containing "salt lake city".

You can play with the slop parameter of phrase queries, and maybe even help by pre-splitting the field on the comma. You could also try splitting the query yourself into different parts (before the comma / after the comma), applying different options, requirements and boosts to each part, and then combining the two queries with a boolean query.
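
As a rough illustration of the "split the query yourself" step, the plain-Java sketch below cuts the user input on commas into separate phrases. The class and method names are hypothetical, and it deliberately stops before any Lucene or Hibernate Search call: each resulting phrase would then be turned into its own phrase query and combined in a boolean query, as described above.

```java
import java.util.ArrayList;
import java.util.List;

class QuerySplitSketch {

    // Splits raw user input on commas and returns the trimmed,
    // non-empty parts; each part is meant to become one phrase query.
    static List<String> splitPhrases(String input) {
        List<String> phrases = new ArrayList<>();
        for (String part : input.split(",")) {
            String phrase = part.trim();
            if (!phrase.isEmpty()) {
                phrases.add(phrase);
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        // One phrase per comma-separated part of the input.
        for (String phrase : splitPhrases("salt lake city, united states")) {
            System.out.println("phrase: " + phrase);
        }
    }
}
```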

If what worries you more is the division of "salt lake" into two common terms, then you want to look into using org.apache.lucene.analysis.shingle.ShingleFilter.
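
For readers unfamiliar with the term, a shingle is a token made of several adjacent words. The sketch below only illustrates the idea in plain Java; it is not the ShingleFilter API, the names are hypothetical, and it emits only the multi-word shingles, whereas the real filter can also keep the single words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ShingleSketch {

    // Returns all shingles (runs of adjacent words) of 2..maxSize words,
    // lower-cased and joined by a single space.
    static List<String> shingles(String text, int maxSize) {
        String[] words = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int size = 1; size <= maxSize && start + size <= words.length; size++) {
                if (size > 1) {
                    sb.append(' ');
                }
                sb.append(words[start + size - 1]);
                if (size >= 2) {
                    out.add(sb.toString());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // With maxSize 2, "salt lake city" yields the shingles
        // "salt lake" and "lake city".
        System.out.println(shingles("salt lake city", 2));
    }
}
```

Because "salt lake" is indexed as a single token, a document containing the words next to each other can match on that token, while a document that only mentions "salt" or "lake" separately cannot.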

_________________
Sanne
http://in.relation.to/


© Copyright 2014, Red Hat Inc. All rights reserved. JBoss and Hibernate are registered trademarks and servicemarks of Red Hat, Inc.