search for a field (tokenized) with multiple words

swapna_here · **Joined:** Tue Aug 04, 2009 6:46 am **Posts:** 17

Hi friends,
I have a field in the database whose values are separated by commas,
e.g. (US Cities,chicago,United States,florida....)
I have indexed this field (using TOKENIZED) with my own analyzer, which uses WordSplitterTokenizer to split the field at the commas and index each piece as one term, so a term can contain several words.

My Luke tool shows the following in the field:

>>US Cities
>> United States
>> chicago
>>florida

When searching, I am unable to get results for text containing "blah blah United States blah blah".
I am not even getting results for "blah blah States blah blah".
But I am able to get results for "blah blah chicago blah" and for "blah blah florida blah".
Which analyzer do I need to use to solve this problem?

Thanks in advance.

hibernatingworkert · **Joined:** Wed Jul 15, 2009 12:34 pm **Posts:** 18

As I said in your previous thread, you have to use matching analyzers for indexing and searching. Which analyzer to use for searching depends on how your WordSplitterTokenizer was implemented.

Do you really have to tokenize your database field this way, using commas? Why not use SimpleAnalyzer, for example, to analyze your fields? This way you would be able to query for "blah blah United States blah blah ".
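
A minimal sketch of why mismatched tokenization loses "United States", written in plain Java rather than against the Lucene API. It assumes the field is split on commas at index time and that the query text is split on whitespace before lookup, which is what Lucene's classic QueryParser does before it hands text to the analyzer. The names `AnalyzerMismatchSketch`, `indexTerms` and `matches` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AnalyzerMismatchSketch {

    // Index side: one term per comma-separated value, as a comma tokenizer would emit.
    static Set<String> indexTerms(String fieldValue) {
        return new HashSet<>(Arrays.asList(fieldValue.split(",")));
    }

    // Query side (assumed): the text is broken on whitespace, so every word is its own term.
    static List<String> matches(Set<String> index, String queryText) {
        List<String> hits = new ArrayList<>();
        for (String word : queryText.trim().split("\\s+")) {
            if (index.contains(word)) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> index = indexTerms("US Cities,chicago,United States,florida");
        System.out.println(matches(index, "blah blah chicago blah"));            // prints [chicago]
        System.out.println(matches(index, "blah blah United States blah blah")); // prints []
    }
}
```

Single-word values such as "chicago" survive both tokenizations unchanged, which is why they match; "United" and "States" are never indexed on their own, so neither query word finds the term "United States".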

swapna_here · **Joined:** Tue Aug 04, 2009 6:46 am **Posts:** 17

Thanks for your reply.
Yes, I have to use commas in the database field.
If I use SimpleAnalyzer, Lucene will index United and States as separate terms, which I don't want:
I would get results for both "blah blah United States blah blah" and "blah blah States blah blah".

I need results only for "blah blah United States blah blah".

My Analyzer class is:

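import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

// Analyzer that emits one token per comma-separated value.
public class WordSplitterAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new WordSplitterTokenizer(reader);
    }

    // Reuse the previously saved tokenizer instead of creating a new one per call.
    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader)
            throws IOException {
        Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
        if (tokenizer == null) {
            tokenizer = new WordSplitterTokenizer(reader);
            setPreviousTokenStream(tokenizer);
        } else {
            tokenizer.reset(reader);
        }
        return tokenizer;
    }
}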

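and the tokenizer class is:

import java.io.Reader;

import org.apache.lucene.analysis.CharTokenizer;

// Tokenizer that treats every character except a splitter as part of the token.
public class WordSplitterTokenizer extends CharTokenizer {
    protected static final char[] DEFAULT_WORD_SPLITTER = new char[] {','};
    private char[] wordSplitter = DEFAULT_WORD_SPLITTER;

    public WordSplitterTokenizer(Reader in) {
        super(in);
    }

    @Override
    protected boolean isTokenChar(char c) {
        // Check every splitter; returning inside the loop would test only the first one.
        for (char ws : wordSplitter) {
            if (ws == c) {
                return false;
            }
        }
        return true;
    }
}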

Which analyzer do I need to use to get results only for text containing "United States"?

hibernatingworkert · **Joined:** Wed Jul 15, 2009 12:34 pm **Posts:** 18

If you want search results for "blah blah United States blah blah", you'll have to create an Analyzer that parses this text into "United States"; otherwise you won't get results. I don't think any default Analyzer would do that without changing the query. If you used KeywordAnalyzer with the query "United States" (quotes included, meaning you're doing a phrase search), you would probably get the results.

swapna_here · **Joined:** Tue Aug 04, 2009 6:46 am **Posts:** 17

Thanks for your immediate response.

Actually, if I use SimpleAnalyzer, the database field "New York, chicago" will be tokenized into New, York and chicago.
If my text contains "New" (e.g. "New book is released"), the "New York, chicago" row will be returned, which I don't want.
That's why I decided to use WordSplitterAnalyzer.
But while searching, the analyzer is unable to match the two-word token "New York" in the search text.

My search text length is around 100 to 300 every time,
so I can't use KeywordAnalyzer with a phrase query.

I have decided to use SimpleAnalyzer; I don't want to complicate my code further.
I am going to filter results based on the score, but I am unable to fetch the score from
org.hibernate.search.FullTextQuery results.

How do I get the score for a hit in Hibernate Search?

Once again, thanks for your speedy posts.
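
The false positive described in this post can be reproduced with a small plain-Java sketch. It imitates per-word tokenization by lowercasing and splitting at non-letters, which is roughly what SimpleAnalyzer does; the class and method names are hypothetical and nothing here calls Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class PerWordFalsePositiveSketch {

    // Rough stand-in for per-word analysis: lowercase, then split at non-letters.
    static Set<String> wordTerms(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        Set<String> terms = new HashSet<>(Arrays.asList(lower.split("[^\\p{L}]+")));
        terms.remove(""); // split() can yield an empty leading piece
        return terms;
    }

    // A row matches as soon as it shares any single term with the search text.
    static boolean rowMatches(String fieldValue, String searchText) {
        Set<String> shared = wordTerms(fieldValue);
        shared.retainAll(wordTerms(searchText));
        return !shared.isEmpty();
    }

    public static void main(String[] args) {
        // prints true: the unwanted hit on the shared term "new"
        System.out.println(rowMatches("New York, chicago", "New book is released"));
        // prints false: no term in common
        System.out.println(rowMatches("New York, chicago", "Old book is released"));
    }
}
```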

hibernatingworkert · **Joined:** Wed Jul 15, 2009 12:34 pm **Posts:** 18

There you go, directly from the docs:

Quote:

Projection is useful for another kind of use cases. Lucene provides some metadata information to the user about the results. By using some special placeholders, the projection mechanism can retrieve them:

Example 5.10. Using projection in order to retrieve meta data

Code:

org.hibernate.search.FullTextQuery query = s.createFullTextQuery( luceneQuery, Book.class );
query.setProjection( FullTextQuery.SCORE, FullTextQuery.THIS, "mainAuthor.name" );
List results = query.list();
// each row is an Object[] holding the projected values in the order requested
Object[] firstResult = (Object[]) results.get(0);
float score = (Float) firstResult[0];
Book book = (Book) firstResult[1];
String authorName = (String) firstResult[2];

You can mix and match regular fields and special placeholders. Here is the list of available placeholders:

* FullTextQuery.THIS: returns the initialized and managed entity (as a non projected query would have done).
* FullTextQuery.DOCUMENT: returns the Lucene Document related to the object projected.
* FullTextQuery.OBJECT_CLASS: returns the class of the indexed entity.
* FullTextQuery.SCORE: returns the document score in the query. Scores are handy to compare one result against another for a given query but are useless when comparing the results of different queries.
* FullTextQuery.ID: the id property value of the projected object.
* FullTextQuery.DOCUMENT_ID: the Lucene document id. Careful, the Lucene document id can change over time between two different IndexReader openings (this feature is experimental).
* FullTextQuery.EXPLANATION: returns the Lucene Explanation object for the matching object/document in the given query. Do not use if you retrieve a lot of data. Running explanation typically is as costly as running the whole Lucene query per matching element. Make sure you use projection!
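
Once SCORE is part of the projection, filtering hits by score is ordinary list processing. The sketch below is plain Java and assumes each row is an `Object[]` whose first element is the `Float` score, matching a projection that lists SCORE first; `ScoreFilterSketch`, `keepAtLeast` and the threshold value are hypothetical. Because scores are only comparable within one query, a fixed cutoff is a heuristic at best.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ScoreFilterSketch {

    // Keep only rows whose projected score (element 0) reaches the threshold.
    static List<Object[]> keepAtLeast(List<Object[]> rows, float minScore) {
        List<Object[]> kept = new ArrayList<>();
        for (Object[] row : rows) {
            float score = (Float) row[0];
            if (score >= minScore) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in rows shaped like a (SCORE, THIS) projection; strings replace real entities.
        List<Object[]> rows = Arrays.asList(
                new Object[] {0.92f, "row A"},
                new Object[] {0.15f, "row B"});
        for (Object[] row : keepAtLeast(rows, 0.5f)) {
            System.out.println(row[1] + " scored " + row[0]);
        }
    }
}
```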

Now let me see if I understood your use case: you have a database with a field that stores a comma-separated list of city names, and you have a bunch of text chunks that you're using as queries to see whether they match any of the cities stored in the database. Is that it?

kinkon23 · **Joined:** Mon Aug 10, 2009 5:03 pm **Posts:** 3

If you use KeywordAnalyzer with the query "USA" (quotes included, meaning you're doing a phrase search), you would probably get the results.

swapna_here · **Joined:** Tue Aug 04, 2009 6:46 am **Posts:** 17

Thanks a lot, hibernatingworkert.

Is there any way to boost a term while indexing?
For example, if I am indexing a field containing "new york",
I want to give a boost of 0.5 to the term "new" and 0.5 to "york".

sanne.grinovero · **Posted:** Thu Aug 13, 2009 10:20 am

So you need to define a boost per term? That's generally very hard to manage; are you sure that's what you want? What's your use case?
If you look at the Similarity formulas and IDF, you'll see that most of the time you shouldn't bother with these details, as Lucene automatically weights terms by the inverse of their document frequency: in a big data set the word "new" is likely to appear in more documents, so it will contribute less than the term "York" when relevance is calculated.

In Search 3.2 (trunk, not yet released) there's support for dynamic boosting: you can define a different boost for each entity instance. To document that better, it would be nice to know how you plan to use it. (I'm using it already, but that's a secret for now ;-)
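
To make the IDF point concrete, here is a small sketch of the idf factor as defined by Lucene's classic DefaultSimilarity, idf = 1 + ln(numDocs / (docFreq + 1)). The document counts are invented for illustration, and the full score also involves term frequency, norms and other factors that are not modelled here.

```java
class IdfSketch {

    // idf as in Lucene's classic DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 10_000;     // hypothetical index size
        int docsWithNew = 2_500;  // hypothetical: "new" appears in many documents
        int docsWithYork = 40;    // hypothetical: "york" appears in few documents
        System.out.printf("idf(new)  = %.3f%n", idf(docsWithNew, numDocs));
        System.out.printf("idf(york) = %.3f%n", idf(docsWithYork, numDocs));
    }
}
```

With these made-up counts the rarer term gets the larger idf, so a match on "york" weighs more than a match on "new" without any manual per-term boost.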

swapna_here · **Joined:** Tue Aug 04, 2009 6:46 am **Posts:** 17

Thanks a lot for your response, s.grinovero.

Yes, it is very difficult to boost a term while indexing, but I came to this scenario after long research on analyzers.
I hope you can solve my issue.

Is there any analyzer that matches the text against the indexed field using 1 (min) to 3 (max) terms at a time?
For example, my search text contains "abc def ghi jkl mno abc def ghi jkl mno abc def New Yorkghi jkl mno abc def ghi jkl mno abc def ghi jkl mno abc def ghi jkl mno "
Here I can't use the Standard or Simple analyzers (because these two break the search text into single terms), nor the Keyword analyzer (I can't use a phrase query since the text is large),
and I have an indexed field with two terms in it: New York.
Which analyzer (one that splits the search text into groups of several words and searches the index with them) do I need to use to match this field?
Read my posts above for more details (I still haven't solved this issue).

Thanks in advance.

sanne.grinovero · **Posted:** Fri Aug 14, 2009 1:52 pm

So you want to extract "York" from "Yorkghi" and remove the rest of the garbage?
That's quite hard if you don't have a list of valid terms. Alternatively, you could try to improve your chances of getting the right answer:
not totally deterministic (you don't remove the other terms), but you make sure they will probably be less relevant than the words you want.
You could get some effective matches using NGram filters; did you try that?

Also remember that most Analyzers lowercase, so unless you also analyze the query you need to search for "new york", not "New York".
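
As a rough picture of what an n-gram filter contributes, the plain-Java sketch below lowercases a token and emits every character n-gram of a chosen length. It is not Lucene's NGram filter; `NGramSketch` and `ngrams` are hypothetical names. It shows that a fused token such as "Yorkghi" still yields the gram "york", at the cost of extra grams that can also match unrelated text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class NGramSketch {

    // All character n-grams of length n from the lowercased token, left to right.
    static List<String> ngrams(String token, int n) {
        String lower = token.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= lower.length(); i++) {
            grams.add(lower.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("Yorkghi", 4)); // prints [york, orkg, rkgh, kghi]
    }
}
```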

swapna_here · **Joined:** Tue Aug 04, 2009 6:46 am **Posts:** 17

Sorry, I had not noticed that typo.
The actual string is
"abc def ghi jkl mno abc def ghi jkl mno abc def New York ghi jkl mno abc def ghi jkl mno abc def ghi jkl mno abc def ghi jkl mno"
From this I have to match "New York" against the indexed "new york" (2 terms).
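
One way to approach this, sketched here in plain Java, is to break the search text into overlapping word groups of one to three words (often called shingles), lowercase them, and look each one up among the indexed comma-separated values. Lucene ships a ShingleFilter built around the same idea, but this sketch does not use it; `ShingleMatchSketch`, `shingles` and `matchIndexed` are hypothetical names, and the indexed values are assumed to have been trimmed and lowercased at index time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ShingleMatchSketch {

    // Every run of minWords..maxWords consecutive words, lowercased and joined by one space.
    static List<String> shingles(String text, int minWords, int maxWords) {
        String[] words = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int len = 1; len <= maxWords && start + len <= words.length; len++) {
                if (len > 1) {
                    sb.append(' ');
                }
                sb.append(words[start + len - 1]);
                if (len >= minWords) {
                    out.add(sb.toString());
                }
            }
        }
        return out;
    }

    // Indexed values that occur in the text as a whole word group, in order of first appearance.
    static Set<String> matchIndexed(Set<String> indexedValues, String text) {
        Set<String> hits = new LinkedHashSet<>();
        for (String shingle : shingles(text, 1, 3)) {
            if (indexedValues.contains(shingle)) {
                hits.add(shingle);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<>(Arrays.asList("new york", "chicago"));
        System.out.println(matchIndexed(indexed, "abc def New York ghi jkl")); // prints [new york]
        System.out.println(matchIndexed(indexed, "New book is released"));     // prints []
    }
}
```

Because each indexed value stays a single term, "New book is released" no longer hits the "new york" entry, while a text that really contains "New York" does.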