
All times are UTC - 5 hours [ DST ]



Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 4 posts ] 
 Post subject: Internationalized searching
PostPosted: Mon Aug 22, 2011 11:13 am 
Pro

Joined: Wed Nov 05, 2003 7:22 pm
Posts: 211
A heads-up for people working with European-style texts containing characters such as é, è, ã, etc.
Hibernate Search uses the StandardAnalyzer to analyze text by default, which means these characters are not normalized. This is documented, but it may still catch you by surprise.

If you want to work with normalized text, you can easily write a custom analyzer such as the one below. It is almost identical to the StandardAnalyzer, but it also folds é and è to e.
[properties]
hibernate.search.analyzer=nl.myproject.dao.hibernate.search.MyStandardAnalyzer

Code:
public class MyStandardAnalyzer extends Analyzer {

   @Override
   public TokenStream tokenStream(String fieldName, Reader reader) {
      // Same chain as the StandardAnalyzer, with an ASCIIFoldingFilter
      // appended so accented characters (é, è, ã, ...) are folded to
      // their ASCII form. None of these constructors throw checked
      // exceptions, so no try/catch is needed.
      TokenStream result = new StandardTokenizer(Constants.LUCENE_VERSION, reader);
      result = new StandardFilter(Constants.LUCENE_VERSION, result);
      result = new LowerCaseFilter(Constants.LUCENE_VERSION, result);
      result = new ASCIIFoldingFilter(result);
      return result;
   }
}


Cheers,
Marc


 Post subject: Re: Internationalized searching
PostPosted: Mon Aug 22, 2011 4:45 pm 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Hi Marc,
thanks for posting this.
Do you think a better notice should be placed in the docs? We appreciate patches to the documentation too: docs are never good enough. Either some things are missing, or they are too long for people to read; it's hard for us to judge the right balance ;)

About your implementation: generally I think it would be preferable to use the declarative approach (annotations) to define a custom analyzer. If you do, the components will be reused for each invocation, rather than new instances being created on each call to tokenStream.
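As a sketch of the declarative approach (untested; it assumes Hibernate Search 3.x with the Solr analyzer factories on the classpath, and the definition name "folding" and the Book entity are made up for illustration), the equivalent of the custom analyzer above could be declared like this:

```java
import javax.persistence.Entity;
import javax.persistence.Id;

import org.apache.solr.analysis.ASCIIFoldingFilterFactory;
import org.apache.solr.analysis.LowerCaseFilterFactory;
import org.apache.solr.analysis.StandardFilterFactory;
import org.apache.solr.analysis.StandardTokenizerFactory;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

// Declares the same chain as the hand-written analyzer: standard
// tokenizer, standard filter, lowercasing, then ASCII folding.
@Entity
@Indexed
@AnalyzerDef(name = "folding",
   tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
   filters = {
      @TokenFilterDef(factory = StandardFilterFactory.class),
      @TokenFilterDef(factory = LowerCaseFilterFactory.class),
      @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class)
   })
public class Book {

   @Id
   private Long id;

   // The analyzer is referenced by the name given in @AnalyzerDef.
   @Field
   @Analyzer(definition = "folding")
   private String title;
}
```

Hibernate Search builds the analyzer once from this definition and reuses it for both indexing and querying.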

_________________
Sanne
http://in.relation.to/


 Post subject: Re: Internationalized searching
PostPosted: Tue Aug 23, 2011 6:13 am 
Pro

Joined: Wed Nov 05, 2003 7:22 pm
Posts: 211
It makes sense to mention this separately in the docs, because I think for many (if not most) use cases people will want ASCII folding. I'm almost tempted to say it should be the default, but since it's so easy to implement once it's pointed out, documenting it should be good enough.

Thanks for the tip on annotations (this might also deserve a mention in the docs if/when the previous item is described). I will do that.

BTW, I'm surprised it's more performant to use annotations than to use the hibernate.search.analyzer property.


 Post subject: Re: Internationalized searching
PostPosted: Tue Aug 23, 2011 11:11 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Quote:
BTW, I'm surprised it's more performant to use annotations than to use the hibernate.search.analyzer property.


No, sorry for my rushed explanation; that's not what I meant to say. Reading an annotation is indeed an "expensive" operation, and a property would do better, but I'm referring to the resulting implementation: we read the annotations only once, and the Analyzer is reused at runtime for each query and each time a document is indexed. So how you define an analyzer is not a performance issue, but how the defined analyzer performs is critical for performance.
What I'm saying is not optimal is the way you coded the custom analyzer: it creates a new StandardTokenizer, StandardFilter, LowerCaseFilter and ASCIIFoldingFilter on every invocation, while these components are safe to reuse. You could write a custom analyzer functionally equivalent to the one you posted that reuses the components, but that makes the code considerably more complex, hence my recommendation to let the framework do the dirty work.
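To make the difference concrete, here is a rough sketch (untested; it assumes the Lucene 3.x reusableTokenStream API and reuses the Constants.LUCENE_VERSION constant from the post above) of what manual component reuse would look like:

```java
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.ASCIIFoldingFilter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyReusableStandardAnalyzer extends Analyzer {

   // Holds the chain built on first use so the same thread can reuse it.
   private static class SavedStreams {
      StandardTokenizer source;
      TokenStream result;
   }

   @Override
   public TokenStream tokenStream(String fieldName, Reader reader) {
      // Fallback path: builds a fresh chain on every call, which is
      // exactly what the original analyzer does.
      TokenStream result = new StandardTokenizer(Constants.LUCENE_VERSION, reader);
      result = new StandardFilter(Constants.LUCENE_VERSION, result);
      result = new LowerCaseFilter(Constants.LUCENE_VERSION, result);
      return new ASCIIFoldingFilter(result);
   }

   @Override
   public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
      SavedStreams streams = (SavedStreams) getPreviousTokenStream();
      if (streams == null) {
         // First call on this thread: build the chain once and remember it.
         streams = new SavedStreams();
         streams.source = new StandardTokenizer(Constants.LUCENE_VERSION, reader);
         streams.result = new StandardFilter(Constants.LUCENE_VERSION, streams.source);
         streams.result = new LowerCaseFilter(Constants.LUCENE_VERSION, streams.result);
         streams.result = new ASCIIFoldingFilter(streams.result);
         setPreviousTokenStream(streams);
      } else {
         // Later calls: just point the existing tokenizer at the new input.
         streams.source.reset(reader);
      }
      return streams.result;
   }
}
```

Noticeably more code for the same tokens, which is the argument for letting the framework handle the reuse.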

_________________
Sanne
http://in.relation.to/




