-->

All times are UTC - 5 hours [ DST ]



Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 3 posts ] 
 Post subject: Search: Exclusions of Pattern from tokenization
PostPosted: Wed Jun 08, 2011 4:12 pm 
Newbie

Joined: Fri Dec 17, 2010 11:14 am
Posts: 11
I've got some serial-number-like fields in my entities that contain characters such as "/" or "-". The StandardAnalyzer tokenizes on these characters, and I'd like it not to. For these specific fields I can use a KeywordAnalyzer, but my users sometimes reference these items in free-text areas. Is there a way to enhance the StandardAnalyzer so that it does not tokenize these items when it finds them? I've looked at the PatternTokenizerFactory, but that seems to either extract the matches or split on them, which (unless I'm misunderstanding it) means it would only keep tokens that match the pattern. I'd like to keep those tokens as is and apply the standard tokenizer rules to the rest of the document.

Essentially I'd like these business IDs to be retained, similar to the way the StandardAnalyzer keeps email addresses, hostnames, etc.

Something like:
Code:
   @AnalyzerDef(name = "customAnalyzer",
         tokenizer = @TokenizerDef(
               factory = PatternTokenizerFactory.class,
               params = @Parameter(name = "pattern", value = "<myPatternMatchingAllMyBusinessIds>")),
         filters = {
               @TokenFilterDef(factory = StandardFilterFactory.class),
               @TokenFilterDef(factory = LowerCaseFilterFactory.class),
               @TokenFilterDef(factory = StopFilterFactory.class) })


Is this possible, or do I have to do as the StandardTokenizer says and create my own grammar-based tokenizer?


In addition, I'd like to make this the default analyzer. Do I just define it in any class and then reference it in my properties file, like hibernate.search.analyzer=customAnalyzer? It just seems a bit random as to where the definition lives.


 Post subject: Re: Search: Exclusions of Pattern from tokenization
PostPosted: Thu Jun 09, 2011 2:14 pm 
Newbie

Joined: Fri Dec 17, 2010 11:14 am
Posts: 11
I've wound up modifying the ClassicTokenizer.jflex from Lucene to do this. The StandardTokenizer.jflex no longer seemed to handle email addresses or hostnames; I googled around a bit to see what was going on there but couldn't find anything explaining it.

Is there any recommended place to put analyzer definitions? It seems odd to put my global default analyzer on one of my specific entities, although it does seem to get picked up regardless.

Edit: if anyone is interested in the .jflex solution, let me know and I'll post some snippets.
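
For readers who only want to see the intended behaviour (ID-like tokens kept whole, everything else split on punctuation), here is a minimal sketch in plain Java using java.util.regex. It is not the JFlex grammar mentioned above and does not use any Lucene API. The class name and the ID shape (two capital letters, a dash, four digits, an optional "/digits" suffix) are hypothetical and would need to be replaced with the real business ID pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProtectedIdTokenizer {

    // Hypothetical business ID shape, e.g. "AB-1234" or "AB-1234/7".
    private static final String ID = "\\b[A-Z]{2}-\\d{4}(?:/\\d+)?\\b";

    // Alternation order matters: the ID pattern is tried first at each position,
    // and only if it fails does the plain alphanumeric word rule apply.
    private static final Pattern TOKEN =
            Pattern.compile("(" + ID + ")|([A-Za-z0-9]+)");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Lower-case every token, roughly what a lower-case filter would do.
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Please ship AB-1234/7 to the north-west depot."));
    }
}
```

With the sample sentence, the ID survives as the single token "ab-1234/7", while "north-west" is still split into "north" and "west". A grammar-based tokenizer does the same kind of thing by giving the ID rule priority over the generic word rule.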


 Post subject: Re: Search: Exclusions of Pattern from tokenization
PostPosted: Thu Jun 09, 2011 7:51 pm 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Quote:
In addition, I'd like to make this the default analyzer. Do I just define it in any class and then reference it in my properties file, like hibernate.search.analyzer=customAnalyzer? It just seems a bit random as to where the definition lives.

I agree it's odd, but that's the current state. Most people use an existing analyzer as their global analyzer; until now I have only seen @AnalyzerDef used for special-purpose analyzers used in a single class.

There's an open issue to allow annotating packages instead of classes: http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-633. You could contribute that. Alternatively, you could define all your custom objects in a factory class, creating your configuration in code instead of through annotations: look for "hibernate.search.model_mapping" in the reference documentation.
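
As a rough sketch of the factory-class option, assuming the programmatic mapping API of Hibernate Search 3.x and the Solr analysis factories it shipped with at the time: the package and class name are hypothetical, and StandardTokenizerFactory is only a stand-in for whatever custom tokenizer factory ends up being used. Package names may differ in other versions, so check the reference documentation for the release in use.

```java
package com.example.search; // hypothetical package

import org.apache.solr.analysis.LowerCaseFilterFactory;
import org.apache.solr.analysis.StandardFilterFactory;
import org.apache.solr.analysis.StandardTokenizerFactory;
import org.apache.solr.analysis.StopFilterFactory;
import org.hibernate.search.annotations.Factory;
import org.hibernate.search.cfg.SearchMapping;

// Hypothetical factory class. It would be referenced from the configuration as:
//   hibernate.search.model_mapping=com.example.search.SearchMappingFactory
public class SearchMappingFactory {

    @Factory
    public SearchMapping getSearchMapping() {
        SearchMapping mapping = new SearchMapping();
        mapping
            // Same name as in the annotation-based definition from the question.
            // StandardTokenizerFactory is a placeholder for the custom tokenizer factory.
            .analyzerDef("customAnalyzer", StandardTokenizerFactory.class)
                .filter(StandardFilterFactory.class)
                .filter(LowerCaseFilterFactory.class)
                .filter(StopFilterFactory.class);
        return mapping;
    }
}
```

This keeps the analyzer definition in one dedicated place instead of on an arbitrary entity class.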

_________________
Sanne
http://in.relation.to/


© Copyright 2014, Red Hat Inc. All rights reserved. JBoss and Hibernate are registered trademarks and servicemarks of Red Hat, Inc.