
All times are UTC - 5 hours [ DST ]



Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 5 posts ] 
 Post subject: Search matching terms with EdgeNGramFilterFactory
PostPosted: Thu Mar 12, 2015 8:58 am 
Beginner

Joined: Mon Feb 16, 2015 6:41 am
Posts: 32
Location: Lodz, Poland
I know this probably comes from my lack of knowledge of how to properly configure analyzers and filters, so I would really appreciate your help.

My goal is to create a search that produces results that make sense starting from one-letter queries, such as when you search for applications in the launcher on Ubuntu. The search should be performed over several fields, e.g. title, description, keywords.

My idea was to tokenize the fields using the StandardTokenizerFactory to separate each word in the field and then apply ASCIIFoldingFilterFactory and LowerCaseFilterFactory to make my life easier. That's when the problems start. I decided to use the EdgeNGramFilterFactory to match parts of the fields, but setting the minimum gram size to one seems to make little sense, since single letters then become tokens.
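To see what a minimum gram size of one actually produces, it can help to list the tokens. The following is a plain-Java imitation of the chain described above (split into words, fold accents, lowercase, edge n-grams). It does not use Lucene; the class and method names are made up for illustration, and the regex-based word splitting is only a crude stand-in for StandardTokenizerFactory.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: imitates tokenizer + ASCII folding + lowercase + edge n-grams.
class EdgeNGramSketch {

    // Emits, for each word, the prefixes whose length lies in [minGram, maxGram].
    static List<String> analyze(String text, int minGram, int maxGram) {
        List<String> tokens = new ArrayList<>();
        // Strip diacritics (rough equivalent of ASCII folding), then lowercase.
        String folded = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "")
                .toLowerCase(Locale.ROOT);
        // Split on anything that is not a letter or digit (crude word tokenizer).
        for (String word : folded.split("[^\\p{L}\\p{N}]+")) {
            int longest = Math.min(maxGram, word.length());
            for (int len = minGram; len <= longest; len++) {
                tokens.add(word.substring(0, len));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [s, st, ste, stev, steve, y, yz, yze, yzer, yzerm]
        System.out.println(analyze("Steve Yzerman", 1, 5));
    }
}
```

Only prefixes of each word are emitted, so with a minimum of one the single-letter tokens are the first letters of the words, not every letter of the field.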

I want to give higher priority to results which start with the search term given by the user, but also include those where the term appears somewhere inside the field. What is more, if the field no longer contains the search term provided by the user, the result should be dropped. This is a bit unclear, so let me follow up with an example:

Item 1: Title: "Steve Yzerman scores an amazing goal"
Item 2: Title: "Another great season by Steve Yzerman"
Item 3: Title: "Saku Koivu to miss three games due to injury"

Example 1: User input "s"
Result order: 1, 3, 2

Example 2: User input: "st"
Result order: 1, 2

Example 3: User input: "steven" (same as "steveodhgsajghsdkfg")
Result order: null

I think I should look at KeywordTokenizerFactory to be able to apply the filters to the entire field, instead of chopping it up into smaller pieces. However, I don't understand how the query then works to indicate that the start of the field should have precedence over the words inside the field.

Also, how do I make sure that if the user inputs too many characters (example 3), no results are returned? In the example, the 'steven' input would be tokenized into 's', 'st', 'ste', etc., so basically all the items would be returned.
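The over-matching described here appears when the user's input goes through the same edge n-gram chain as the indexed text, because any shared gram then counts as a hit. Below is a plain-Java sketch (no Lucene; all names are hypothetical) that contrasts this with an input that is only lowercased, which is roughly what a query-time analyzer without the n-gram filter would produce.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of query-side n-grams versus a plain lowercased query term.
class QuerySideGrams {

    // All prefixes of every whitespace-separated word, lowercased (edge n-grams, min size 1).
    static Set<String> grams(String text) {
        Set<String> out = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            for (int len = 1; len <= word.length(); len++) {
                out.add(word.substring(0, len));
            }
        }
        return out;
    }

    // Input analyzed WITH n-grams: a single shared gram such as "s" is enough to match.
    static boolean matchesWithQueryGrams(String title, String input) {
        Set<String> shared = grams(title);
        shared.retainAll(grams(input));
        return !shared.isEmpty();
    }

    // Input only lowercased: the whole input must itself be one of the indexed grams.
    static boolean matchesWholeInput(String title, String input) {
        return grams(title).contains(input.toLowerCase(Locale.ROOT));
    }
}
```

With the three example titles, `matchesWithQueryGrams` accepts "steven" for all of them through the shared gram "s", while `matchesWholeInput` rejects "steven" everywhere and still accepts "st" for items 1 and 2.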


 Post subject: Re: Search matching terms with EdgeNGramFilterFactory
PostPosted: Thu Mar 12, 2015 9:12 am 
Beginner

Joined: Mon Feb 16, 2015 6:41 am
Posts: 32
Location: Lodz, Poland
A thing that I believe is connected to my initial question:

If the user types in 'n' characters, is it possible to limit the results to those that match at least the characters provided and not fewer?

In short: match ALL OF the n characters from the beginning of the field, possibly followed by more characters.

In the examples above, if the user enters 'scuba', all fields will match because 'scuba' is tokenized into 's', 'sc', 'scu', etc. and so are the words 'Steve', 'Saku', 'season', etc. Am I right to think that n-grams are not the way to go here?


 Post subject: Re: Search matching terms with EdgeNGramFilterFactory
PostPosted: Thu Mar 12, 2015 11:01 am 
Beginner

Joined: Mon Feb 16, 2015 6:41 am
Posts: 32
Location: Lodz, Poland
Some progress :) I took advantage of the ignoreAnalyzer() method when building the query:

Code:
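// qb is the Hibernate Search QueryBuilder for the indexed entity
org.apache.lucene.search.Query keywordQuery = (term == null || term.isEmpty())
      ? null
      : qb.keyword()
            .onFields("description", "title", "momentHashtags.hashtag", "momentKeywords.keyword")
            .ignoreAnalyzer()   // the search term is used as-is, without n-gramming
            .matching(term)
            .createQuery();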


The problem now is that I am unable to search for the entire name, e.g. "Steve Yzerman". This returns an empty result set. I tried switching over to a phrase query, but then I'm unable to ignore the analyzer, and when I type in "Steve Yzerman" I can see in the result's Explanation object that the first letter "s" was what caused the match.
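One likely reason for the empty result: with the analyzer ignored, the input "Steve Yzerman" is looked up as one literal term, capital letters and space included, while the index only holds lowercased single-word grams. The plain-Java sketch below (no Lucene; names are hypothetical) shows that effect next to a possible middle path, which is an assumption and not something confirmed in this thread: lowercase and split the input without n-gramming it, and require every word to be among the indexed grams.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration: raw multi-word input versus per-word lookup.
class MultiWordLookup {

    // Index side: lowercased edge n-grams (min size 1) of every word in the title.
    static Set<String> indexedGrams(String title) {
        Set<String> grams = new HashSet<>();
        for (String word : title.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            for (int len = 1; len <= word.length(); len++) {
                grams.add(word.substring(0, len));
            }
        }
        return grams;
    }

    // Analyzer ignored: the raw input is compared as one literal term.
    static boolean rawLookup(String title, String input) {
        return indexedGrams(title).contains(input);
    }

    // Input lowercased and split into words; every word must be an indexed gram.
    static boolean perWordLookup(String title, String input) {
        Set<String> grams = indexedGrams(title);
        String normalized = input.toLowerCase(Locale.ROOT).trim();
        if (normalized.isEmpty()) {
            return false;
        }
        for (String word : normalized.split("\\s+")) {
            if (!grams.contains(word)) {
                return false;
            }
        }
        return true;
    }
}
```

For the first example title, `rawLookup` fails on "Steve Yzerman" but succeeds on "st", whereas `perWordLookup` accepts both "Steve Yzerman" and "steve yz".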


 Post subject: Re: Search matching terms with EdgeNGramFilterFactory
PostPosted: Fri Mar 13, 2015 11:55 am 
Hibernate Team
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Hi Pawel,
for effects like this you often wouldn't rely on a single analyzer choice or a single query. You can index the same field multiple times using the @Fields (plural) annotation.

For example, you could index an entity which has "description" and "title" properties into the fields "description_full" and "description_ngrams_3", "title_full" and "title_ngrams_3", using keyword-style encoding (not analyzed) for the _full fields and n-grams (n=3) for the others.
When you build a query, you then target both fields with the boolean operator "should", so that results which score well on the combination of fields are ranked highest.

This way you can run a prefix query "n*" on the _full fields, but also benefit from n-gram scoring.
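As a rough illustration of how such a combination could order the three example titles from the first post, here is a plain-Java imitation. It is not Lucene scoring: the boost of 10 for a match at the start of the whole title and the one point per matching word are arbitrary values picked for this sketch, and simple word prefixes stand in for the n-gram field.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of combining a "_full" prefix match with per-word matches.
class CombinedRanking {

    // Arbitrary weight for a hit on the un-analyzed "_full" style field.
    static final int FULL_FIELD_BOOST = 10;

    static int score(String title, String input) {
        String t = title.toLowerCase(Locale.ROOT);
        String q = input.toLowerCase(Locale.ROOT).trim();
        if (q.isEmpty()) return 0;
        int score = 0;
        // "Should" clause 1: prefix match on the whole, un-tokenized title.
        if (t.startsWith(q)) score += FULL_FIELD_BOOST;
        // "Should" clause 2: one point per word that starts with the input.
        for (String word : t.split("\\s+")) {
            if (word.startsWith(q)) score++;
        }
        return score;
    }

    // Returns the ids of matching titles, best score first; non-matches are dropped.
    static List<Integer> rank(Map<Integer, String> titles, String input) {
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, String> e : titles.entrySet()) {
            if (score(e.getValue(), input) > 0) ids.add(e.getKey());
        }
        ids.sort(Comparator.comparingInt((Integer id) -> score(titles.get(id), input)).reversed());
        return ids;
    }

    public static void main(String[] args) {
        Map<Integer, String> titles = new LinkedHashMap<>();
        titles.put(1, "Steve Yzerman scores an amazing goal");
        titles.put(2, "Another great season by Steve Yzerman");
        titles.put(3, "Saku Koivu to miss three games due to injury");
        System.out.println(rank(titles, "s"));      // [1, 3, 2]
        System.out.println(rank(titles, "st"));     // [1, 2]
        System.out.println(rank(titles, "steven")); // []
    }
}
```

The two clauses are additive, so a title that only matches inside the field is still returned, just below titles that also match from the start.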

I often need to index the same property in several different ways. For example, to get the sorting right you'll probably sort only on the "description_full" field, not on the one you apply n-grams to.

_________________
Sanne
http://in.relation.to/


 Post subject: Re: Search matching terms with EdgeNGramFilterFactory
PostPosted: Fri Mar 13, 2015 12:09 pm 
Beginner

Joined: Mon Feb 16, 2015 6:41 am
Posts: 32
Location: Lodz, Poland
Hi Sanne,

I understand the idea of indexing the same property in several ways, but I'm not sure I completely understand how the keyword-encoded field is used in the query.

If I use the keyword analyzer (no splitting) on my field and in the search term I only submit the first two letters (matching the start of the field), will the result be returned? I probably need to play around with the many possibilities, but the fact that there are so many filters, analyzers and tokenizers makes it very confusing for beginners :)
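On this question, a plain-Java sketch may help (no Lucene involved; class and method names are made up). With keyword-style analysis the whole field value is stored as a single term, so an exact term lookup with two letters finds nothing, whereas a prefix lookup such as "st*" does. The lowercasing on both sides is an assumption of this sketch; wildcard and prefix queries are generally not analyzed, so the input has to be normalized by hand in the same way the field was at index time.

```java
import java.util.Locale;

// Hypothetical illustration of term versus prefix lookup on a keyword-style field.
class KeywordFieldLookup {

    // Keyword-style analysis: the entire value becomes one lowercased term.
    static String indexTerm(String fieldValue) {
        return fieldValue.toLowerCase(Locale.ROOT);
    }

    // Exact term query: the input must equal the single stored term.
    static boolean termQuery(String fieldValue, String input) {
        return indexTerm(fieldValue).equals(input.toLowerCase(Locale.ROOT));
    }

    // Prefix query ("input*"): the stored term must start with the input.
    static boolean prefixQuery(String fieldValue, String input) {
        return indexTerm(fieldValue).startsWith(input.toLowerCase(Locale.ROOT));
    }
}
```

For "Steve Yzerman scores an amazing goal", `termQuery` with "st" is false and `prefixQuery` with "st" is true; for "Another great season by Steve Yzerman" both are false, which is why a second, word-level field is still needed to find matches inside the field.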


© Copyright 2014, Red Hat Inc. All rights reserved. JBoss and Hibernate are registered trademarks and servicemarks of Red Hat, Inc.