Internationalized searching

mschipperheyn · **Joined:** Wed Nov 05, 2003 7:22 pm **Posts:** 211

A heads-up for people working with European-style text (é, è, ã, etc.).
Hibernate Search uses Lucene's StandardAnalyzer by default to analyze text, which means accented characters are not normalized. This is documented, but it may still come as a surprise.
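
To make the symptom concrete, here is a minimal sketch of what "not normalized" means for matching. It does not use Lucene at all: it approximates accent folding with the JDK's java.text.Normalizer, which is an illustration only and not how ASCIIFoldingFilter is implemented. The class and method names are made up for this example.

```java
import java.text.Normalizer;

public class AccentFoldingSketch {

    // Decompose accented characters (NFD), then drop the combining marks,
    // so that e-acute becomes a plain 'e'. Hypothetical helper for illustration.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String indexedTerm = "caf\u00e9"; // "cafe" with e-acute, as stored without folding
        String queryTerm = "cafe";        // what a user is likely to type

        // Without folding, the two terms are different strings and do not match.
        System.out.println(indexedTerm.equals(queryTerm));       // false

        // With folding applied at index time (and query time), they match.
        System.out.println(fold(indexedTerm).equals(queryTerm)); // true
    }
}
```

The same folding has to be applied both when indexing and when querying, which is why it belongs in the analyzer rather than in application code.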

To work with normalized text you can easily write a custom analyzer such as the one below. It's almost identical to the StandardAnalyzer, but it folds é and è to e.
[properties]
hibernate.search.analyzer=nl.myproject.dao.hibernate.search.MyStandardAnalyzer

Code:

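// Imports assume the Lucene 3.x package layout.
import java.io.Reader;

import org.apache.lucene.analysis.ASCIIFoldingFilter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Constants.LUCENE_VERSION is a project-specific constant holding the Lucene Version to match.
public class MyStandardAnalyzer extends Analyzer {

   @Override
   public TokenStream tokenStream(String fieldName, Reader reader) {
      // Tokenize, apply the standard clean-up, lowercase, then fold accents (é, è -> e).
      // No try/catch: swallowing Throwable here could return a null or half-built chain.
      TokenStream result = new StandardTokenizer(Constants.LUCENE_VERSION, reader);
      result = new StandardFilter(Constants.LUCENE_VERSION, result);
      result = new LowerCaseFilter(Constants.LUCENE_VERSION, result);
      result = new ASCIIFoldingFilter(result);
      return result;
   }
}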

Cheers,
Marc

sanne.grinovero · **Posted:** Mon Aug 22, 2011 4:45 pm

Hi Marc,
thanks for posting this.
Do you think a more prominent notice should be placed in the docs? We appreciate patches to the documentation too. Docs are never good enough: either something is missing, or they are too long for people to read. It's hard for us to judge the right balance ;)

About your implementation: generally I think it is preferable to use the declarative approach (annotations) to define a custom Analyzer. If you do, the components are reused across invocations, rather than new instances being created on every call to tokenStream.
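
For readers who want to see what the declarative approach might look like, here is a sketch of an annotation-based definition that mirrors the filter chain posted above. It assumes Hibernate Search 3.x with the Solr analysis factories on the classpath; the entity name Article, the field title, and the definition name foldingAnalyzer are invented for the example.

```java
import javax.persistence.Entity;
import javax.persistence.Id;

import org.apache.solr.analysis.ASCIIFoldingFilterFactory;
import org.apache.solr.analysis.LowerCaseFilterFactory;
import org.apache.solr.analysis.StandardFilterFactory;
import org.apache.solr.analysis.StandardTokenizerFactory;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

// Hypothetical entity; only the analyzer definition matters here.
@Entity
@Indexed
@AnalyzerDef(name = "foldingAnalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = StandardFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        // Folds accented characters such as é and è to their ASCII equivalents.
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class)
    })
public class Article {

    @Id
    @DocumentId
    private Long id;

    // Use the named definition for this field at index time.
    @Field
    @Analyzer(definition = "foldingAnalyzer")
    private String title;
}
```

The definition is declared once and referenced by name, so the framework builds the analyzer and manages its components.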

mschipperheyn · **Joined:** Wed Nov 05, 2003 7:22 pm **Posts:** 211

It makes sense to mention this separately in the docs, because I think people will want ASCII folding for many, if not most, use cases. I'm almost tempted to say it should be the default, but since it's so easy to implement, pointing it out should be good enough.

Thanks for the tip on annotations (this might also deserve mention in the docs when/if the previous item is described). I will do that.

BTW, I'm surprised it's more performant to use annotations than to use the hibernate.search.analyzer property.

sanne.grinovero · **Posted:** Tue Aug 23, 2011 11:11 am

Quote:

BTW, I'm surprised it's more performant to use annotations than to use the hibernate.search.analyzer property.

No, sorry for my rushed explanation, that's not what I meant to say. Reading an annotation is indeed an "expensive" operation, and a property should do better, but I'm referring to the resulting implementation. We read annotations only once, and the Analyzer is reused at runtime for each query and each time a document is indexed. So how we define an analyzer is not a performance issue, but how the defined analyzer performs is critical.
What I'm saying is not optimal is the way you coded the custom analyzer: it creates a new StandardTokenizer, a new StandardFilter, a new LowerCaseFilter and a new ASCIIFoldingFilter on each invocation, while these components are safe to reuse. You could write a custom analyzer that is functionally equivalent to the one you posted and reuses the components, but that makes the code far more complex, hence my recommendation to let the framework do the dirty work.