About an analyzer for searching locations

yiwong2001 · **Joined:** Mon Oct 27, 2008 6:26 am **Posts:** 36

Hi All,

I am implementing a search function for addresses. The class definition is as follows:

@Indexed
public class Address implements Cloneable
{
    @DocumentId
    private int id;
    @Field
    private String addrCountry;
    private String addrDesc;
    @Field
    private String addrLineOne;
    private String addrLineTwo;
    @Field
    private String addrCity;
    ......

As you can see, addrCountry, addrLineOne and addrCity are the searchable fields. I am using the default analyzer for both indexing and searching, so I think a country name like United States would be indexed as two terms, United and States.

In addition, during search, a keyword like United States or Salt Lake City would be tokenized into two or three single words.

As a result, any address whose fields contain united or city would be returned, such as United Kingdom, when what I actually want is United States.

To improve search quality, I wonder whether another analyzer could help me meet this requirement. A dictionary-based solution would be best, so that I can manage search terms made up of multiple words.

thanks

Ian

sanne.grinovero · **Posted:** Thu Apr 15, 2010 4:52 am

Sorry, I couldn't understand your issue; could we try with some examples?

Quote:

As result, any address fields contain united, city would be returned. like United Kingdom, but actually I want to get a result of united states

So if someone searches for "united" it should return "united states" but not "united kingdom"?

yiwong2001 · **Joined:** Mon Oct 27, 2008 6:26 am **Posts:** 36

My expected results are as follows:

if someone searches for "united" it should return "united states" and "united kingdom".

if someone searches for "united states" it should return "united states", and not "united kingdom".

I hope an analyzer can generate terms made of multiple words, so that united states stays as the single term united states. I think StandardAnalyzer would split united states into united and states?

A different example: if the search keyword is parking lot in Salt Lake City, the generated search terms need to be parking lot and Salt Lake City, not parking, lot, salt, lake and city.
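
The dictionary-based idea can be sketched independently of any analyzer. What follows is a minimal, hypothetical illustration in plain Java: the class `DictionaryPhraseTokenizer` and its methods are made up for this example and are not part of Lucene or Hibernate Search. It assumes a small hand-maintained list of multi-word terms and does a greedy longest match over the input words; words outside the dictionary (including stop words like "in") fall through as single-word terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not a Lucene or Hibernate Search API.
class DictionaryPhraseTokenizer {
    private final Set<String> phrases = new HashSet<String>();
    private int maxWords = 1; // length (in words) of the longest dictionary entry

    DictionaryPhraseTokenizer(List<String> dictionary) {
        for (String phrase : dictionary) {
            // Normalize case and whitespace so lookups are consistent.
            String normalized = phrase.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
            if (normalized.isEmpty()) {
                continue;
            }
            phrases.add(normalized);
            maxWords = Math.max(maxWords, normalized.split(" ").length);
        }
    }

    // Splits the input into terms, preferring the longest dictionary phrase at each position.
    List<String> tokenize(String input) {
        List<String> terms = new ArrayList<String>();
        String trimmed = input.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return terms;
        }
        String[] words = trimmed.split("\\s+");
        int i = 0;
        while (i < words.length) {
            int matched = 1; // fall back to a single word when no phrase matches
            for (int len = Math.min(maxWords, words.length - i); len > 1; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                if (phrases.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            terms.add(String.join(" ", Arrays.copyOfRange(words, i, i + matched)));
            i += matched;
        }
        return terms;
    }

    public static void main(String[] args) {
        DictionaryPhraseTokenizer tokenizer = new DictionaryPhraseTokenizer(
                Arrays.asList("united states", "united kingdom", "salt lake city", "parking lot"));
        System.out.println(tokenizer.tokenize("parking lot in Salt Lake City"));
    }
}
```

For such terms to match anything, the index would have to be built with the same tokenization, which this sketch does not cover.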

sanne.grinovero · **Posted:** Thu Apr 15, 2010 6:30 am

ok, now I got it :)

It's much simpler than that: the QueryParser defaults to the "OR" operator, so parsing a query like "United States" results in a query like "united OR states".
You want "AND" as the default behaviour, so just set that option on the QueryParser instance:

Code:

QueryParser parser = ...
parser.setDefaultOperator( org.apache.lucene.queryParser.QueryParser.Operator.AND );
parser.parse( "united states " );

Just keep in mind that the "OR" operator is quite useful with relevance sorting (the default), because the documents matching the most terms are returned on top. When the query is more complex (say five terms) and no document matches all five, you won't get an empty result; you will get the documents that matched the most terms first, descending down to documents matching just one term. This is normally useful, as users do enter typos or mistakes.
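
The difference between the two operators can be illustrated with a small simulation in plain Java. This is only a sketch under simplifying assumptions: the class `OperatorDemo` is hypothetical, it splits on whitespace rather than using a real analyzer, and it ranks purely by the number of matched query terms, whereas Lucene's actual scoring takes more factors into account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical simulation for illustration; this is not how Lucene computes scores.
class OperatorDemo {
    // Lowercases and splits on whitespace, a rough stand-in for an analyzer.
    static Set<String> tokens(String text) {
        return new LinkedHashSet<String>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    // Number of query terms that occur in the document.
    static int matchCount(String document, Set<String> queryTerms) {
        Set<String> docTerms = tokens(document);
        int count = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // requireAll = true behaves like AND, false behaves like OR.
    static List<String> search(List<String> documents, String query, boolean requireAll) {
        final Set<String> queryTerms = tokens(query);
        List<String> hits = new ArrayList<String>();
        for (String doc : documents) {
            int matched = matchCount(doc, queryTerms);
            boolean accept = requireAll ? matched == queryTerms.size() : matched > 0;
            if (accept) {
                hits.add(doc);
            }
        }
        // Documents matching more terms come first; the sort is stable for ties.
        hits.sort((a, b) -> Integer.compare(matchCount(b, queryTerms), matchCount(a, queryTerms)));
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("United Kingdom", "United States", "Canada");
        System.out.println(search(docs, "united states", false)); // OR
        System.out.println(search(docs, "united states", true));  // AND
    }
}
```

With OR, "United Kingdom" is still returned for the query "united states", but below "United States"; with AND it is dropped entirely.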

yiwong2001 · **Joined:** Mon Oct 27, 2008 6:26 am **Posts:** 36

Thanks, s.grinovero.

But I would expect an analyzer to be able to do this job. If the search keyword is "parking lot in salt lake city" and a generated term can have multiple words, I could have queries like:

"parking lot" OR "salt lake city"

"parking lot" AND "salt lake city"

I think fewer search terms would be more efficient. In my case, 2 queries vs 5 queries.

In addition, I could use salt lake city directly in a term query:

String loc = "salt lake city";

locationQuery.add(new TermQuery(new Term("vendorAddress.addrName", loc)), BooleanClause.Occur.SHOULD);

sanne.grinovero · **Posted:** Fri Apr 16, 2010 3:47 am

Quote:

locationQuery.add(new TermQuery(new Term("vendorAddress.addrName", loc)), BooleanClause.Occur.SHOULD);

So you are planning to avoid the QueryParser and build your query programmatically. That's OK, but then you need to parse the query yourself and work out how to combine the terms. It's error-prone work, but if you know what you're doing that's fine.
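
One pitfall with hand-built term queries is that a TermQuery is not analyzed: it looks for the exact term as stored in the index. The sketch below illustrates this with a toy inverted index in plain Java. The class `TinyInvertedIndex` is hypothetical and only mimics the idea; it assumes the field was indexed as lowercase single-word terms, as a default analyzer would roughly do.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index for illustration only; not a Lucene class.
class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    private static String[] words(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    // Indexes the text as lowercase single-word terms.
    void add(int docId, String text) {
        for (String word : words(text)) {
            if (word.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(word, k -> new TreeSet<Integer>()).add(docId);
        }
    }

    // Exact lookup of one term, with no analysis applied to the argument.
    Set<Integer> lookupTerm(String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? new TreeSet<Integer>() : new TreeSet<Integer>(docs);
    }

    // Splits the text into words and intersects their postings (all words required).
    Set<Integer> lookupAllWords(String text) {
        Set<Integer> result = null;
        for (String word : words(text)) {
            Set<Integer> docs = lookupTerm(word);
            if (result == null) {
                result = docs;
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Salt Lake City");
        index.add(2, "Lake Placid");
        System.out.println(index.lookupTerm("salt lake city"));     // whole string as one term
        System.out.println(index.lookupAllWords("salt lake city")); // one lookup per word
    }
}
```

In this toy index the multi-word string finds nothing as a single term, because only "salt", "lake" and "city" were stored; the per-word lookup finds document 1.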

Quote:

If search keyword is "parking lot in salt lake city" and generated term can have multiple words

So how would you decide which terms to combine with a SHOULD clause and which with a MUST clause? You would need to recognize parts of speech, which is very hard and not handled by Lucene directly. You might need an entity recognizer, like Apache UIMA. AFAIK there are some clever Analyzers capable of doing this, but you need to plug in another framework.

I have a hint for you: avoid this complexity and relax the requirements a bit, then see whether the results are good enough once you've indexed a large amount of text. If you have many documents containing "parking in salt lake", then for a search like "parking near salt lake" they will always show up before "parking in canada", even if you're using SHOULD instead of MUST, because they are more relevant; and you'll only show the top results anyway.