-->

All times are UTC - 5 hours [ DST ]



 Post subject: NGramAnalyzer problem
PostPosted: Mon Aug 17, 2009 12:38 pm 
Newbie

Joined: Thu Jan 24, 2008 2:00 pm
Posts: 2
I'm trying to use this analyzer via annotations, as suggested in Hibernate Search in Action.

The problem is that the Lucene query I obtain from hitory (i.e. a misspelling of history) is
Code:
title:hitory authors:hitory subject:hitory isbn:hitory publisher:hitory titleNGram:"hit ito tor ory"

To work properly (and it does, as testing it in Luke proved), it should be:

Code:
title:hitory authors:hitory subject:hitory isbn:hitory publisher:hitory titleNGram:hit ito tor ory

that is, without the double quotes (") around the n-grams.

This is the analyzer definition:
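To make the difference between the two query shapes concrete, here is a minimal, library-free sketch in plain Java. It only splits a word into 3-grams and prints both forms; it does not call Lucene at all. The class and method names (`NGramQueryShapes`, `ngrams`, `asPhrase`, `asShouldTerms`) are made up for illustration, and the second form is written with the field repeated on every term, which is roughly how a boolean query of optional terms would print.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene or Hibernate Search.
class NGramQueryShapes {

    // Slide a window of `size` characters over the word.
    static List<String> ngrams(String word, int size) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + size <= word.length(); i++) {
            grams.add(word.substring(i, i + size));
        }
        return grams;
    }

    // Phrase form: all grams must occur next to each other, in this order.
    static String asPhrase(String field, List<String> grams) {
        return field + ":\"" + String.join(" ", grams) + "\"";
    }

    // Optional-terms form: each gram is an independent clause on the field.
    static String asShouldTerms(String field, List<String> grams) {
        List<String> clauses = new ArrayList<>();
        for (String gram : grams) {
            clauses.add(field + ":" + gram);
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        List<String> grams = ngrams("hitory", 3);
        System.out.println(asPhrase("titleNGram", grams));
        System.out.println(asShouldTerms("titleNGram", grams));
    }
}
```

The phrase form can only match a title whose indexed grams contain hit, ito, tor and ory consecutively, so a misspelled word will not find the correctly spelled title; the optional-terms form matches on any shared gram.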

Code:
@AnalyzerDef(
   name=Book.NGRAM_ANALYZER,
   tokenizer=@TokenizerDef(factory=StandardTokenizerFactory.class),
   filters={
       @TokenFilterDef(factory=NGramFilterFactory.class,
              params={
         @Parameter(name="minGramSize",value="3"),
         @Parameter(name="maxGramSize",value="3")           
       }),
       @TokenFilterDef(factory=StandardFilterFactory.class),
       @TokenFilterDef(factory=LowerCaseFilterFactory.class)      
//       @TokenFilterDef(factory=StopFilterFactory.class,
//                 params={
//               @Parameter(name="words",value="stopwords.txt"),
//                                   @Parameter(name="ignoreCase", value="true")
//                        }),
           
})

This is the query code.
Code:
public List<Book> fullTextSearchBooks(String searchQuery) {
   FullTextSession ftSession = Search.getFullTextSession(hibernateTemplate.getSessionFactory().getCurrentSession());
   String[] productFields = {"title", "authors", "subject", "isbn", "publisher", "titleNGram"};

   Analyzer analyzer = ftSession.getSearchFactory().getAnalyzer(Book.class);

   QueryParser parser = new MultiFieldQueryParser(productFields, analyzer);

   org.apache.lucene.search.Query luceneQuery;
   try {
       luceneQuery = parser.parse(searchQuery);
   } catch (ParseException e) {
       throw new RuntimeException("Unable to parse query: " + searchQuery, e);
   }
   return (List<Book>) ftSession.createFullTextQuery(luceneQuery, Book.class).list();
}


What should I do to obtain the correct query?


 Post subject: Re: NGramAnalyzer problem
PostPosted: Mon Aug 17, 2009 10:40 pm 
Newbie

Joined: Thu Jan 24, 2008 2:00 pm
Posts: 2
OK, I solved it myself.

I found the following solution/workaround, though I really don't understand why the NGramTokenFilter works only during index building and not at query time.

This is my solution:
Code:
    @Override
    @SuppressWarnings("unchecked")
    public List<Book> fullTextSearchBooks(String searchQuery) {
   String[] productFields =  { "title", "authors", "subject", "isbn", "publisher" };

   
   FullTextSession ftSession = Search.getFullTextSession(hibernateTemplate.getSessionFactory().getCurrentSession());
   
   Analyzer analyzer = ftSession.getSearchFactory().getAnalyzer(Book.class);

   //create a query with all the fields and all the tokens using the analyzer used at
   //index time for entity Book
   
   QueryParser parser = new MultiFieldQueryParser(productFields, analyzer);
   org.apache.lucene.search.Query luceneQuery;
   try {
       luceneQuery = parser.parse(searchQuery);
   } catch (ParseException e) {
       throw new RuntimeException("Unable to parse query: " + searchQuery, e);
   }
   //create a query to use the NGram index on titleNGram field
   org.apache.lucene.search.Query luceneNGramQuery;
   try {
       luceneNGramQuery = buildNGramQuery(searchQuery);
   } catch (Exception e) {
       //fail fast: otherwise a null clause would be added to the BooleanQuery below
       throw new RuntimeException("Unable to build NGram query: " + searchQuery, e);
   }
   //Finally combine the two queries as two SHOULD clauses and execute them.
   BooleanQuery finalQuery = new BooleanQuery();
   finalQuery.add(luceneQuery, Occur.SHOULD);
   finalQuery.add(luceneNGramQuery, Occur.SHOULD);
   
   return (List<Book>) ftSession.createFullTextQuery(finalQuery, Book.class).list();
    }

    private org.apache.lucene.search.Query buildNGramQuery(String search) throws Exception {
   Reader reader = new StringReader(search);
   Analyzer analyzer = new StandardAnalyzer();
   TokenStream stream = analyzer.tokenStream("titleNGram", reader);
   NGramTokenFilter ngramFilter = new NGramTokenFilter(stream, Book.getTitleMinGram(), Book.getTitleMaxGram());
   Token token = new Token();
   token = ngramFilter.next(token);
   BooleanQuery query = new BooleanQuery();
   while (token != null) {
       if (token.termLength() != 0) {
      String term = new String(token.termBuffer(), 0, token.termLength());
      // add it to the query by creating a TermQuery
      query.add(new TermQuery(new Term("titleNGram", term)), Occur.SHOULD);
       }
       token = ngramFilter.next(token);
   }
   return query;
    }
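Why independent SHOULD clauses tolerate the typo can be seen without Lucene. The sketch below is plain Java and only compares sets of 3-grams; the names `NGramOverlap`, `trigrams` and `shared` are hypothetical, and real Lucene scoring is more involved than counting shared grams.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration; it models term overlap only, not Lucene scoring.
class NGramOverlap {

    // Lowercase the word and collect its distinct 3-grams in order of appearance.
    static Set<String> trigrams(String word) {
        Set<String> grams = new LinkedHashSet<>();
        String lower = word.toLowerCase();
        for (int i = 0; i + 3 <= lower.length(); i++) {
            grams.add(lower.substring(i, i + 3));
        }
        return grams;
    }

    // Grams of the query that also occur among the grams of the indexed word.
    static Set<String> shared(String query, String indexed) {
        Set<String> common = trigrams(query);
        common.retainAll(trigrams(indexed));
        return common;
    }

    public static void main(String[] args) {
        // "hitory" -> hit, ito, tor, ory; "history" -> his, ist, sto, tor, ory
        System.out.println(shared("hitory", "history")); // prints [tor, ory]
    }
}
```

Two of the four query grams exist in the indexed title, so a boolean query made of optional term clauses still matches the book, while a phrase made of all four grams cannot.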


Hope this helps.


 Post subject: Re: NGramAnalyzer problem
PostPosted: Fri Sep 11, 2009 5:15 pm 
Newbie

Joined: Fri Sep 11, 2009 4:58 pm
Posts: 3
Can someone within the Hibernate Search community comment on this? This would seem to be a bug in the NGramTokenFilter class.


 Post subject: Re: NGramAnalyzer problem
PostPosted: Sat Sep 12, 2009 6:16 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Hi, sorry for the late answer (vacations).
You're right, the NGramTokenFilter is not behaving as it should. I had a similar problem and used the same workaround, but I didn't investigate further at the time.

You might want to verify you're using the correct "edition" of org.apache.lucene.analysis.ngram.NGramTokenFilter,
as there is one provided by solr-lucene-analyzers-1.3.0.jar and another by lucene-analyzers-2.4.1.jar: make sure
you don't have the Solr jar on the classpath too.

Let me know, I'm also going to look into it.

_________________
Sanne
http://in.relation.to/


 Post subject: Re: NGramAnalyzer problem
PostPosted: Sat Sep 12, 2009 8:57 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
I had a short chat with Emmanuel, and it looks like the solution you used is the recommended approach. I'm going to propose hiding this complexity in the new API we are planning for the next release: basically an API for building queries that should be much less verbose than the Lucene API, so the method that builds a QGram query could take care of this and keep it simple to use.
In the meantime, you're welcome to blog about your solution, as IMHO it's the way to go.

_________________
Sanne
http://in.relation.to/


 Post subject: Re: NGramAnalyzer problem
PostPosted: Mon Sep 28, 2009 4:32 pm 
Newbie

Joined: Fri Sep 11, 2009 4:58 pm
Posts: 3
Thank you, Sanne, for looking into this. I verified that I am using lucene-analyzers-2.4.1.jar, and that solr-lucene-analyzers-1.3.0.jar is nowhere on my classpath. One issue I had with the approach given here is that it does not lend itself well to the PerFieldAnalyzerWrapper, which I am using for all of my search terms:

Code:
        Analyzer standardAnalyzer = searchFactory.getAnalyzer(Study.class);
        Analyzer phoneticAnalyzer = searchFactory.getAnalyzer("nameanalyzer");
        Analyzer freetextAnalyzer = searchFactory.getAnalyzer("freetextanalyzer"); 
        Analyzer ngramAnalyzer = searchFactory.getAnalyzer("3gramanalyzer"); 
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(standardAnalyzer);
        wrapper.addAnalyzer("name_phonetic", phoneticAnalyzer);
        //wrapper.addAnalyzer("name_ngram", ngramAnalyzer);
        wrapper.addAnalyzer("referrer_phonetic", phoneticAnalyzer);
        wrapper.addAnalyzer("description", freetextAnalyzer);
        //wrapper.addAnalyzer("description_ngram", ngramAnalyzer);


I would like to be able to use the NGramTokenFilter in the same manner that I am using the Phonetic and Stemming filters here. It seems to me that the NGramTokenFilter should support this usage.

I plan to have a look at the source code and perhaps propose a fix there. Naively it looks as though the filter should emit (thi his) for this instead of "thi his", but the actual fix may be more complex than that. I would appreciate any further thoughts you have.
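As far as I can tell, the choice between (thi his) and "thi his" is made by the query parser from the position increments of the tokens the analyzer returns for a single query term, rather than by the filter alone. The sketch below is a simplified plain-Java model of that decision based on my reading of the QueryParser; it is an assumption, may differ between Lucene versions, and every name in it (`QueryShapeModel`, `Shape`, `shapeFor`) is hypothetical.

```java
// Simplified, hypothetical model of how a query parser could pick a query shape
// from the position increments of the tokens produced for one query term.
class QueryShapeModel {

    enum Shape { NONE, TERM, OR_GROUP, PHRASE, MULTI_PHRASE }

    static Shape shapeFor(int... positionIncrements) {
        if (positionIncrements.length == 0) {
            return Shape.NONE;   // the analyzer removed everything
        }
        if (positionIncrements.length == 1) {
            return Shape.TERM;   // a single token becomes a plain term query
        }
        int positionCount = 0;
        boolean stacked = false; // true if some token shares a position with its predecessor
        for (int increment : positionIncrements) {
            if (increment != 0) {
                positionCount += increment;
            } else {
                stacked = true;
            }
        }
        if (!stacked) {
            return Shape.PHRASE; // tokens at consecutive positions: "thi his"
        }
        // All tokens at one position behave like synonyms: (thi his)
        return positionCount == 1 ? Shape.OR_GROUP : Shape.MULTI_PHRASE;
    }

    public static void main(String[] args) {
        System.out.println(shapeFor(1, 1)); // each gram advances the position
        System.out.println(shapeFor(1, 0)); // second gram stacked on the first
    }
}
```

Under this model, and assuming each gram arrives with the default position increment of 1 (which is consistent with the phrase output reported in this thread), the parser builds a phrase. Grams emitted at the same position would instead produce the OR group asked for here.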


 Post subject: Re: NGramAnalyzer problem
PostPosted: Mon Sep 28, 2009 7:19 pm 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Quote:
Naively it looks as though the filter should emit (thi his) for this instead of "thi his", but the actual fix may be more complex than that.

Personally I agree with you, but to get a good answer I'd recommend posting to the Lucene mailing list; they should be able to tell you whether this is a bug or not, and if it is they should accept your patch. It would be very cool if you could fix it (or report back with an explanation).
Thanks!

_________________
Sanne
http://in.relation.to/


© Copyright 2014, Red Hat Inc. All rights reserved. JBoss and Hibernate are registered trademarks and servicemarks of Red Hat, Inc.