-->
All times are UTC - 5 hours [ DST ]



Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 6 posts ] 
Author Message
 Post subject: about analyzer for searching location
PostPosted: Wed Apr 14, 2010 11:46 am 
Beginner

Joined: Mon Oct 27, 2008 6:26 am
Posts: 36
Hi All,

I am implementing a search function for addresses. The class definition is as follows:

@Indexed
public class Address implements Cloneable
{
    @DocumentId
    private int id;
    @Field
    private String addrCountry;
    private String addrDesc;
    @Field
    private String addrLineOne;
    private String addrLineTwo;
    @Field
    private String addrCity;
    ......

As you can see, addrCountry, addrLineOne and addrCity are the searchable fields. I am using the default analyzer for both indexing and searching, so I think a country name like United States is indexed as two terms: united and states.

Likewise, at search time a keyword such as United States or Salt Lake City is tokenized into two or three single words.

As a result, any address whose fields contain united or city is returned, for example United Kingdom, when what I actually want is United States.

To improve search quality, I wonder whether another analyzer could meet this requirement. A dictionary-based solution would be ideal, since I could then manage the search terms that consist of multiple words.

thanks

Ian


 Post subject: Re: about analyzer for searching location
PostPosted: Thu Apr 15, 2010 4:52 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Sorry, I couldn't understand your issue; could we try with some examples?
Quote:
As result, any address fields contain united, city would be returned. like United Kingdom, but actually I want to get a result of united states

So if someone searches for "united", it should return "united states" but not "united kingdom"?

_________________
Sanne
http://in.relation.to/


 Post subject: Re: about analyzer for searching location
PostPosted: Thu Apr 15, 2010 6:02 am 
Beginner

Joined: Mon Oct 27, 2008 6:26 am
Posts: 36
My expected results are as follows:

If someone searches for "united", it should return both "united states" and "united kingdom".

If someone searches for "united states", it should return "united states" but not "united kingdom".

I hope the analyzer can generate terms that span multiple words, so that united states is kept as the single term united states. I think StandardAnalyzer would split united states into united and states?

A different example: if the search keyword is parking lot in Salt Lake City, the generated search terms should be parking lot and Salt Lake City, not parking, lot, salt, lake and city.
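
The dictionary-based grouping asked for here can be sketched in plain Java as a greedy longest-match pass over the words of the input. This is only an illustration of the idea, not a Lucene analyzer: the class name PhraseChunker and its methods are hypothetical, and the dictionary contents are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Hibernate Search.
class PhraseChunker {
    private final Set<String> phrases = new HashSet<>(); // known multi-word terms, lower-cased
    private final int maxWords;                          // length in words of the longest entry

    PhraseChunker(List<String> dictionary) {
        int longest = 1;
        for (String entry : dictionary) {
            String[] words = entry.trim().toLowerCase(Locale.ROOT).split("\\s+");
            phrases.add(String.join(" ", words));
            longest = Math.max(longest, words.length);
        }
        maxWords = longest;
    }

    // Greedy longest match: at each position try the widest window first,
    // and fall back to a single word when no dictionary entry fits.
    List<String> chunk(String input) {
        String[] words = input.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int matched = 1;
            for (int len = Math.min(maxWords, words.length - i); len > 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                if (phrases.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + matched)));
            i += matched;
        }
        return result;
    }
}
```

With a dictionary containing "parking lot" and "Salt Lake City", chunk("parking lot in Salt Lake City") yields the three chunks parking lot, in, and salt lake city. The stop word "in" is left in place by this sketch; removing it would be a separate step.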


 Post subject: Re: about analyzer for searching location
PostPosted: Thu Apr 15, 2010 6:30 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
OK, now I get it :)

It's much simpler than that: the QueryParser defaults to the "OR" operator, so parsing a query like "United States" results in a query like "united OR states".
If you want "AND" as the default behaviour, just set the option on the QueryParser instance:
Code:
QueryParser parser = ...
parser.setDefaultOperator( org.apache.lucene.queryParser.QueryParser.Operator.AND );
parser.parse( "united states " );


Just keep in mind that the "OR" operator is quite useful with relevance sorting (the default), because the documents that match the most terms are returned first. When the query is more complex (say five terms) and no document matches all five, you won't get an empty result; you will get the documents that matched the most terms at the top, descending to documents matching just one term. This is normally useful, since users do make typos and mistakes.
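
The difference between the two default operators can be illustrated without Lucene. The sketch below is a toy model only: the class OperatorDemo and its methods are hypothetical, and it ranks by the plain count of matched query terms, whereas Lucene's real relevance scoring is more elaborate.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of OR vs AND matching; not Lucene code.
class OperatorDemo {
    // Lower-case the text and split it into distinct single-word terms.
    static Set<String> tokenize(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                terms.add(word);
            }
        }
        return terms;
    }

    // Number of distinct query terms that also occur in the document.
    static int matchCount(String document, String query) {
        Set<String> docTerms = tokenize(document);
        int count = 0;
        for (String term : tokenize(query)) {
            if (docTerms.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // requireAll = true models AND; false models OR.
    // Hits are ordered by matched-term count, highest first (stable sort).
    static List<String> search(List<String> documents, String query, boolean requireAll) {
        int queryTerms = tokenize(query).size();
        List<String> hits = new ArrayList<>();
        for (String doc : documents) {
            int matches = matchCount(doc, query);
            if (requireAll ? matches == queryTerms : matches > 0) {
                hits.add(doc);
            }
        }
        hits.sort((a, b) -> Integer.compare(matchCount(b, query), matchCount(a, query)));
        return hits;
    }
}
```

For the documents "United Kingdom", "United States" and "Canada", the query "united states" in OR mode returns United States first and United Kingdom second; in AND mode it returns only United States. The query "united" returns both United entries in either mode.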

_________________
Sanne
http://in.relation.to/


 Post subject: Re: about analyzer for searching location
PostPosted: Thu Apr 15, 2010 6:50 am 
Beginner

Joined: Mon Oct 27, 2008 6:26 am
Posts: 36
Thanks, s.grinovero.

But I would expect an analyzer to be able to do this job. If the search keyword is "parking lot in salt lake city" and a generated term can contain multiple words, I could build queries like:

"parking lot" OR "salt lake city"

"parking lot" AND "salt lake city"

I think fewer search terms would be more efficient: in my case, 2 term queries versus 5.

In addition, I could use salt lake city in a term query directly:

String loc = "salt lake city";

locationQuery.add(new TermQuery(new Term("vendorAddress.addrName", loc)), BooleanClause.Occur.SHOULD);
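
One caveat, assuming the field was indexed with the default analyzer as described in the first post: a TermQuery is not analyzed, so it looks for one index term that equals the whole string, and an index holding only single-word terms contains no term "salt lake city". Matching consecutive words is what a phrase query does (PhraseQuery in Lucene). The toy positional index below illustrates the difference in plain Java; the class ToyPositionalIndex and its methods are hypothetical and only mimic the idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy positional inverted index; not Lucene code.
class ToyPositionalIndex {
    // term -> (document id -> positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Index a document as lower-cased single-word terms with their positions.
    void add(int docId, String text) {
        String[] words = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            postings.computeIfAbsent(words[pos], k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Exact lookup of one index term; the argument is not tokenized.
    Set<Integer> termLookup(String term) {
        Set<Integer> result = new TreeSet<>();
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs != null) {
            result.addAll(docs.keySet());
        }
        return result;
    }

    // Documents in which the words of the phrase occur at consecutive positions.
    Set<Integer> phraseLookup(String phrase) {
        String[] words = phrase.trim().toLowerCase(Locale.ROOT).split("\\s+");
        Set<Integer> result = new TreeSet<>();
        for (int docId : termLookup(words[0])) {
            for (int start : postings.get(words[0]).get(docId)) {
                boolean ok = true;
                for (int k = 1; k < words.length && ok; k++) {
                    Map<Integer, List<Integer>> docs = postings.get(words[k]);
                    ok = docs != null && docs.containsKey(docId)
                            && docs.get(docId).contains(start + k);
                }
                if (ok) {
                    result.add(docId);
                    break;
                }
            }
        }
        return result;
    }
}
```

After indexing "Salt Lake City" as document 1, termLookup("salt lake city") finds nothing, while phraseLookup("salt lake city") finds document 1. A single-term lookup with a multi-word value would only work against a field that was indexed without tokenization.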





 Post subject: Re: about analyzer for searching location
PostPosted: Fri Apr 16, 2010 3:47 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Quote:
locationQuery.add(new TermQuery(new Term("vendorAddress.addrName", loc)), BooleanClause.Occur.SHOULD);

So you are planning to avoid the QueryParser and build your query programmatically. That's OK, but then you need to parse the user input yourself and decide how to combine the terms. It's error-prone work, but if you know what you're doing, that's fine.

Quote:
If search keyword is "parking lot in salt lake city" and generated term can have multiple words

So how would you decide which terms to combine with a SHOULD clause and which with a MUST clause? Would you need to recognize parts of speech? That's very hard, and not handled by Lucene directly. You might need to use an entity recognizer, with something like Apache UIMA. AFAIK there are some clever Analyzers capable of doing this, but you need to plug in another framework.

I have a hint for you: avoid this complexity and be clever. Relax the requirements a bit and check whether the results are good enough once you've indexed a large amount of text. If you have many documents containing "parking in salt lake", then a search for "parking near salt lake" will always rank them before "parking in canada", even with SHOULD instead of MUST, because they are more relevant; and you'll only show the top results anyway.

_________________
Sanne
http://in.relation.to/


© Copyright 2014, Red Hat Inc. All rights reserved. JBoss and Hibernate are registered trademarks and servicemarks of Red Hat, Inc.