Search: Exclusions of Pattern from tokenization

ialpert · **Joined:** Fri Dec 17, 2010 11:14 am **Posts:** 11

I've got some serial-number-like fields in my entities that contain characters such as "/" or "-". The StandardAnalyzer tokenizes on these characters, and I'd like it not to. For these specific fields I can use a KeywordAnalyzer, but my users sometimes reference these items in free-text areas. Is there a way to enhance the StandardAnalyzer so that it does not tokenize these items when it finds them? I've looked at the PatternTokenizerFactory, but that seems to either extract the matches or split on them, which (unless I'm misunderstanding it) means it would keep only the tokens that match the pattern. I'd essentially like to keep those tokens as is, and apply the standard tokenizer rules to the rest of the document.

Essentially, I'd like these business IDs to be retained, similar to the way that StandardAnalyzer keeps email addresses, hostnames, etc.

Something like:

Code:

   @AnalyzerDef(name = "customAnalyzer",
         tokenizer = @TokenizerDef(
               factory = PatternTokenizerFactory.class,
               params = @Parameter(name = "pattern", value = "<myPatternMatchingAllMyBusinessIds>")),
         filters = {
               @TokenFilterDef(factory = StandardFilterFactory.class),
               @TokenFilterDef(factory = LowerCaseFilterFactory.class),
               @TokenFilterDef(factory = StopFilterFactory.class) })

Is this possible, or do I have to do as the StandardTokenizer documentation suggests and create my own grammar-based tokenizer?

In addition, I'd like to make this the default analyzer. Do I just define it on any class and then reference it in my properties file, like hibernate.search.analyzer=customAnalyzer? It just seems a bit random where the definition lives.
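For the first question, the desired behaviour (keep ID-like strings whole, tokenize everything else normally) can be sketched in plain Java. This is only an illustration of the idea, not a Lucene tokenizer and not the poster's eventual solution: the class name, the ID pattern, and the "runs of letters or digits" rule standing in for standard tokenization are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: keeps business IDs intact and splits the rest into words. */
class IdPreservingTokenizer {

    // Assumed ID shape for illustration only, e.g. "AB-1234/56".
    private static final Pattern BUSINESS_ID =
            Pattern.compile("[A-Z]{2}-\\d+(?:/\\d+)?");

    // Simplified stand-in for standard tokenization: runs of letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher ids = BUSINESS_ID.matcher(text);
        int pos = 0;
        while (ids.find()) {
            // Text before the ID goes through the ordinary word rule.
            addWords(text.substring(pos, ids.start()), tokens);
            // The ID itself is emitted as one unmodified token.
            tokens.add(ids.group());
            pos = ids.end();
        }
        // Remaining text after the last ID.
        addWords(text.substring(pos), tokens);
        return tokens;
    }

    private static void addWords(String chunk, List<String> tokens) {
        Matcher words = WORD.matcher(chunk);
        while (words.find()) {
            tokens.add(words.group().toLowerCase(Locale.ROOT));
        }
    }
}
```

Calling `IdPreservingTokenizer.tokenize("Please check AB-1234/56 before shipping.")` keeps `AB-1234/56` as a single token while the surrounding words are split and lowercased. A real analyzer would need the same two-pass logic inside a Lucene Tokenizer, which is roughly what a custom grammar achieves.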

ialpert · **Joined:** Fri Dec 17, 2010 11:14 am **Posts:** 11

I've wound up modifying the ClassicTokenizer.jflex from Lucene to do this. (The StandardTokenizer.jflex seemed to not handle email addresses/hosts anymore; I googled around a bit to see what was going on there but couldn't find anything explaining it.)

Is there any recommended place to put analyzer definitions? It seems odd to put my global default analyzer on one of my specific entities, though it does seem to get picked up regardless.

Edit: if anyone is interested in the .jflex solution, let me know and I'll post some snippets.

sanne.grinovero · **Posted:** Thu Jun 09, 2011 7:51 pm

Quote:

In addition, I'd like to make this the default analyzer. Do I just define it on any class and then reference it in my properties file, like hibernate.search.analyzer=customAnalyzer? It just seems a bit random where the definition lives.

I agree it's odd, but that's the current state. Most people use an existing analyzer as the global analyzer; until now I have only seen @AnalyzerDef used for special-purpose analyzers that apply to a single class.

There's an open issue to allow annotating packages instead of classes: http://opensource.atlassian.com/projects/hibernate/browse/HSEARCH-633. You could contribute that. Alternatively, you could define all your custom objects via a factory class, creating your configuration in code instead of by annotations: look for "hibernate.search.model_mapping" in the reference documentation.
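A rough sketch of what such a factory class might look like, assuming the programmatic mapping API (SearchMapping and @Factory) described in the reference documentation and the Solr analysis factories bundled with Hibernate Search at the time. The class name is hypothetical, and StandardTokenizerFactory is only a placeholder for whatever custom tokenizer factory you end up with; it needs the Hibernate Search and Solr analysis jars on the classpath and has not been tested against any particular version.

```java
import org.apache.solr.analysis.LowerCaseFilterFactory;
import org.apache.solr.analysis.StandardFilterFactory;
import org.apache.solr.analysis.StandardTokenizerFactory;
import org.apache.solr.analysis.StopFilterFactory;
import org.hibernate.search.annotations.Factory;
import org.hibernate.search.cfg.SearchMapping;

// Hypothetical class name; referenced from configuration by its
// fully qualified name via the hibernate.search.model_mapping property.
public class CustomSearchMappingFactory {

    @Factory
    public SearchMapping getSearchMapping() {
        SearchMapping mapping = new SearchMapping();
        // Same analyzer name as in the annotation example above;
        // the tokenizer factory is a stand-in, not the custom one.
        mapping.analyzerDef("customAnalyzer", StandardTokenizerFactory.class)
                .filter(StandardFilterFactory.class)
                .filter(LowerCaseFilterFactory.class)
                .filter(StopFilterFactory.class);
        return mapping;
    }
}
```

This keeps the analyzer definition in one dedicated class rather than on an arbitrary entity, which addresses the "where does the definition live" concern.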