NGramAnalyzer problem

seide · **Joined:** Thu Jan 24, 2008 2:00 pm **Posts:** 2

I'm trying to use this analyzer with annotations, as suggested in Hibernate Search in Action.

The problem is that the Lucene query I obtain from hitory (i.e. a misspelling of history) is

Code:

title:hitory authors:hitory subject:hitory isbn:hitory publisher:hitory titleNGram:"hit ito tor ory"

To work properly (and it does, as testing it in Luke proved), it should be:

Code:

 title:hitory authors:hitory subject:hitory isbn:hitory publisher:hitory titleNGram:hit ito tor ory

without the quotes (").
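To see why the unquoted form can still find history from the misspelled input, here is a small plain-Java sketch. It does not use Lucene, and the class and method names (TrigramOverlapSketch, grams, shared) are made up for illustration. It splits both words into 3-grams and lists the ones they have in common:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java illustration of character n-grams; this is not Lucene's NGramTokenFilter. */
final class TrigramOverlapSketch {

    /** Returns every substring of length n, in order of appearance. */
    static List<String> grams(String word, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            result.add(word.substring(i, i + n));
        }
        return result;
    }

    /** Returns the grams of the query word that also occur in the indexed word. */
    static Set<String> shared(String indexed, String query, int n) {
        Set<String> common = new LinkedHashSet<>(grams(query, n));
        common.retainAll(grams(indexed, n));
        return common;
    }

    public static void main(String[] args) {
        System.out.println(grams("hitory", 3));             // [hit, ito, tor, ory]
        System.out.println(shared("history", "hitory", 3)); // [tor, ory]
    }
}
```

Only two of the four grams are shared, so a query that treats the grams as independent optional terms can match, while one that requires all four in sequence cannot.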

This is the analyzer definition:

Code:

@AnalyzerDef(
    name = Book.NGRAM_ANALYZER,
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = NGramFilterFactory.class,
            params = {
                @Parameter(name = "minGramSize", value = "3"),
                @Parameter(name = "maxGramSize", value = "3")
            }),
        @TokenFilterDef(factory = StandardFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class)
//      @TokenFilterDef(factory = StopFilterFactory.class,
//          params = {
//              @Parameter(name = "words", value = "stopwords.txt"),
//              @Parameter(name = "ignoreCase", value = "true")
//          }),
    })

This is the query code.

Code:

public List<Book> fullTextSearchBooks(String searchQuery) {
    FullTextSession ftSession = Search.getFullTextSession(hibernateTemplate.getSessionFactory().getCurrentSession());
    String[] productFields = {"title", "authors", "subject", "isbn", "publisher", "titleNGram"};

    Analyzer analyzer = ftSession.getSearchFactory().getAnalyzer(Book.class);

    QueryParser parser = new MultiFieldQueryParser(productFields, analyzer);

    org.apache.lucene.search.Query luceneQuery;
    try {
        luceneQuery = parser.parse(searchQuery);
    } catch (ParseException e) {
        throw new RuntimeException("Unable to parse query: " + searchQuery, e);
    }
    return (List<Book>) ftSession.createFullTextQuery(luceneQuery, Book.class).list();
}

What should I do to obtain the correct query?

seide · **Joined:** Thu Jan 24, 2008 2:00 pm **Posts:** 2

OK, I solved it myself.

I found the following solution/workaround, although I really don't understand why the NGramTokenFilter works only when building the index and not at query time.

This is my solution:

Code:

@Override
@SuppressWarnings("unchecked")
public List<Book> fullTextSearchBooks(String searchQuery) {
    String[] productFields = { "title", "authors", "subject", "isbn", "publisher" };

    FullTextSession ftSession = Search.getFullTextSession(hibernateTemplate.getSessionFactory().getCurrentSession());

    Analyzer analyzer = ftSession.getSearchFactory().getAnalyzer(Book.class);

    // Create a query over all the regular fields, using the analyzer that was
    // used at index time for the Book entity.
    QueryParser parser = new MultiFieldQueryParser(productFields, analyzer);
    org.apache.lucene.search.Query luceneQuery;
    try {
        luceneQuery = parser.parse(searchQuery);
    } catch (ParseException e) {
        throw new RuntimeException("Unable to parse query: " + searchQuery, e);
    }
    // Create a query that uses the n-gram index on the titleNGram field.
    org.apache.lucene.search.Query luceneNGramQuery;
    try {
        luceneNGramQuery = buildNGramQuery(searchQuery);
    } catch (Exception e) {
        // Fail fast instead of passing a null query on to the BooleanQuery below.
        throw new RuntimeException("Unable to build n-gram query: " + searchQuery, e);
    }
    // Finally, combine the two queries as two SHOULD clauses and execute them.
    BooleanQuery finalQuery = new BooleanQuery();
    finalQuery.add(luceneQuery, Occur.SHOULD);
    finalQuery.add(luceneNGramQuery, Occur.SHOULD);

    return (List<Book>) ftSession.createFullTextQuery(finalQuery, Book.class).list();
}

private org.apache.lucene.search.Query buildNGramQuery(String search) throws Exception {
    Reader reader = new StringReader(search);
    Analyzer analyzer = new StandardAnalyzer();
    TokenStream stream = analyzer.tokenStream("titleNGram", reader);
    NGramTokenFilter ngramFilter = new NGramTokenFilter(stream, Book.getTitleMinGram(), Book.getTitleMaxGram());
    Token token = new Token();
    token = ngramFilter.next(token);
    BooleanQuery query = new BooleanQuery();
    while (token != null) {
        if (token.termLength() != 0) {
            String term = new String(token.termBuffer(), 0, token.termLength());
            // Add each n-gram to the query as an optional TermQuery.
            query.add(new TermQuery(new Term("titleNGram", term)), Occur.SHOULD);
        }
        token = ngramFilter.next(token);
    }
    return query;
}

Hope this helps.

jordan002 · **Joined:** Fri Sep 11, 2009 4:58 pm **Posts:** 3

Can someone within the Hibernate Search community comment on this? This would seem to be a bug in the NGramTokenFilter class.

sanne.grinovero · **Posted:** Sat Sep 12, 2009 6:16 am

Hi, sorry for the late answer (vacation).
You're right, the NGramTokenFilter is not behaving as it should. I've had a similar problem and used the same workaround, but I didn't investigate further at the time.

You might want to verify that you're using the correct "edition" of org.apache.lucene.analysis.ngram.NGramTokenFilter, as there is one provided by solr-lucene-analyzers-1.3.0.jar and another by lucene-analyzers-2.4.1.jar: make sure you don't have the Solr jar on the classpath too.
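One way to check which jar a class was actually loaded from is to ask the JVM for its code source. The following is a minimal sketch that assumes nothing beyond the JDK. The names ClassOriginSketch and originOf are illustrative, and the class passed in main is only a placeholder: in the application you would pass NGramTokenFilter.class instead.

```java
import java.security.CodeSource;

/** Reports where the JVM loaded a class from, to help spot duplicate jars. */
final class ClassOriginSketch {

    /** Returns the jar or directory a class came from, or a note if the JVM does not expose it. */
    static String originOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(no code source: bootstrap or generated class)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Placeholder argument; replace with NGramTokenFilter.class in the real application.
        System.out.println(originOf(ClassOriginSketch.class));
    }
}
```

If the printed location points at the Solr jar rather than the Lucene analyzers jar, the wrong copy of the class is winning on the classpath.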

Let me know, I'm also going to look into it.

sanne.grinovero · **Posted:** Sat Sep 12, 2009 8:57 am

I had a short chat with Emmanuel, and it looks like the solution you used is the recommended approach. I'm going to propose hiding this complexity in the new API we are planning for the next release: basically an API for building queries that should be much less verbose than the Lucene API, so the method that builds a QGram query could take care of this and keep it simple to use.
In the meantime, you're welcome to blog about your solution, as IMHO it's the way to go.

jordan002 · **Joined:** Fri Sep 11, 2009 4:58 pm **Posts:** 3

Thank you, Sanne, for looking into this. I verified that I am using lucene-analyzers-2.4.1.jar, and that solr-lucene-analyzers-1.3.0.jar is nowhere on my classpath. One issue that I had with the approach given here is that it does not lend itself well to the PerFieldAnalyzerWrapper, which I am using for all of my search terms:

Code:

        Analyzer standardAnalyzer = searchFactory.getAnalyzer(Study.class);
        Analyzer phoneticAnalyzer = searchFactory.getAnalyzer("nameanalyzer");
        Analyzer freetextAnalyzer = searchFactory.getAnalyzer("freetextanalyzer");  
        Analyzer ngramAnalyzer = searchFactory.getAnalyzer("3gramanalyzer");  
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(standardAnalyzer);
        wrapper.addAnalyzer("name_phonetic", phoneticAnalyzer);
        //wrapper.addAnalyzer("name_ngram", ngramAnalyzer);
        wrapper.addAnalyzer("referrer_phonetic", phoneticAnalyzer);
        wrapper.addAnalyzer("description", freetextAnalyzer);
        //wrapper.addAnalyzer("description_ngram", ngramAnalyzer);

I would like to be able to use the NGramTokenFilter in the same manner that I am using the phonetic and stemming filters here. It seems to me that the NGramTokenFilter should support this usage.

I plan to have a look at the source code and perhaps propose a fix there. Naively it looks as though the filter should emit (thi his) for this instead of "thi his", but the actual fix may be more complex than that. I would appreciate any more of your thoughts.
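The difference between the two forms can be illustrated with a toy model in plain Java. This is not Lucene code, and the names PhraseVersusAnySketch, matchesPhrase and matchesAny are hypothetical. It assumes, as the quoted query output in this thread suggests, that the several tokens produced from one word end up in a phrase query, which requires them to appear consecutively and in order, whereas the parenthesized form treats each gram as an optional term:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of phrase matching versus OR matching over a token list; not Lucene code. */
final class PhraseVersusAnySketch {

    /** Phrase semantics: all query tokens must appear consecutively and in order. */
    static boolean matchesPhrase(List<String> indexed, List<String> query) {
        return !query.isEmpty() && Collections.indexOfSubList(indexed, query) >= 0;
    }

    /** Optional-clause (SHOULD) semantics: a single shared token is enough. */
    static boolean matchesAny(List<String> indexed, List<String> query) {
        return !Collections.disjoint(indexed, query);
    }

    public static void main(String[] args) {
        List<String> indexed = Arrays.asList("his", "ist", "sto", "tor", "ory"); // 3-grams of "history"
        List<String> query = Arrays.asList("hit", "ito", "tor", "ory");          // 3-grams of "hitory"
        System.out.println(matchesPhrase(indexed, query)); // false
        System.out.println(matchesAny(indexed, query));    // true
    }
}
```

Under this model the phrase form can only match an exact, correctly spelled run of grams, which defeats the purpose of an n-gram field for misspellings.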

sanne.grinovero · **Posted:** Mon Sep 28, 2009 7:19 pm

Quote:

Naively it looks as though the filter should emit (thi his) for this instead of "thi his", but the actual fix may be more complex than that.

Personally I agree with you, but to get a good answer I'd recommend posting to the Lucene mailing list. They should be able to tell you whether this is a bug, and if it is, they should accept your patch. It would be very cool if you could fix it (or report back with an explanation).
Thanks!