All times are UTC - 5 hours [ DST ]



Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 3 posts ] 
 Post subject: Analyzer problem with "_"
PostPosted: Wed Mar 21, 2012 4:55 am 
Newbie

Joined: Thu Nov 10, 2011 9:41 am
Posts: 2
Hi all,

I have a small problem indexing multiple fields that contain the names of projects, companies and other organizational units.
These names can contain any characters, including underscores.
The fields are indexed using annotations. I am using Hibernate Search 3.4.1 and Lucene 3.1.0.

What I want is for the fields to be split into terms at every _. I noticed that when StandardAnalyzer is used programmatically with Version.LUCENE_31, nothing is split. With Version.LUCENE_30, on the other hand, the analyzer splits in the same way as for the annotated fields:
the annotated values are split only if they do not contain any digits.

Input; expected value
asdf_asdf; asdf, asdf
asdf_333; asdf, 333

Is there any way to get the expected result?
It would be great if anybody has a solution.

Thanks
Pat


 Post subject: Re: Analyzer problem with "_"
PostPosted: Wed Mar 21, 2012 5:39 am 
Hibernate Team

Joined: Thu Apr 05, 2007 5:52 am
Posts: 1689
Location: Sweden
Hi,

As of Lucene 3.1, StandardAnalyzer uses a new version of StandardTokenizer that implements Unicode Standard Annex #29. The old version of StandardTokenizer is now called ClassicTokenizer.

ClassicTokenizer has always treated tokens containing numbers differently. Its documentation says, e.g.: "Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split." I assume that is the behavior you are seeing.
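
To make the quoted rule concrete, here is a minimal plain-Java sketch that emulates it for the two inputs from the question. It is a simplification under stated assumptions, not Lucene's actual tokenizer grammar: the class `ClassicRuleSketch` and the helper `tokenize` are hypothetical names, and underscores are assumed to be treated like hyphens, as the original post describes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified emulation of the quoted rule; this is NOT Lucene's real grammar.
class ClassicRuleSketch {

    // Hypothetical helper: splits a whitespace-free token at '_' or '-',
    // unless the token contains a digit, in which case it is kept whole.
    static List<String> tokenize(String token) {
        boolean hasDigit = token.chars().anyMatch(Character::isDigit);
        if (hasDigit) {
            // "Product number" case: the whole token stays one term.
            return Collections.singletonList(token);
        }
        List<String> terms = new ArrayList<>();
        for (String part : token.split("[_-]")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("asdf_asdf")); // [asdf, asdf]
        System.out.println(tokenize("asdf_333"));  // [asdf_333]
    }
}
```

The second input stays in one piece, which matches the observation that values containing digits are not split.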

As a solution, you can always create your own tokenizer, e.g. by starting from the code of ClassicTokenizer. Have a look at this thread as well: http://lucene.472066.n3.nabble.com/Inco ... 34767.html

--Hardy


 Post subject: Re: Analyzer problem with "_"
PostPosted: Wed Mar 21, 2012 12:17 pm 
Newbie

Joined: Thu Nov 10, 2011 9:41 am
Posts: 2
Hi,

Thanks a lot.
In the meantime, though, I found another solution using WordDelimiterFilterFactory.
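
The post does not show the configuration that was used, so the following is only a plain-Java sketch of the general idea behind delimiter-based splitting: every character that is neither a letter nor a digit ends the current term. The class `DelimiterSplitSketch` and the helper `split` are hypothetical names. The real filter has further options (for example splitting on case changes or letter/digit transitions) that are not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java emulation of delimiter-based splitting; not the Solr filter itself.
class DelimiterSplitSketch {

    // Hypothetical helper: treats every character that is neither a letter
    // nor a digit (including '_') as a delimiter and drops empty parts.
    static List<String> split(String value) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // Delimiter reached: emit the term collected so far.
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(split("asdf_asdf")); // [asdf, asdf]
        System.out.println(split("asdf_333"));  // [asdf, 333]
    }
}
```

Both inputs from the first post now produce the expected terms, including the one that contains digits.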

Pat


© Copyright 2014, Red Hat Inc. All rights reserved. JBoss and Hibernate are registered trademarks and servicemarks of Red Hat, Inc.