Some of my entities have serial-number-like fields containing characters such as "/" or "-", which the StandardAnalyzer splits on. I'd like these values not to be tokenized. For the fields themselves I can use a KeywordAnalyzer, but my users sometimes reference these items in free-text areas. Is there a way to enhance the StandardAnalyzer so that it does not tokenize these items when it finds them? I've looked at PatternTokenizerFactory, but it seems to either split on the pattern or emit only the matches, which (unless I'm misunderstanding it) means it would keep only the tokens that match the pattern. What I want is to keep those tokens as is and apply the standard tokenizer rules to the rest of the document.
Essentially, I'd like these business IDs to be retained, similar to the way the StandardAnalyzer keeps email addresses, hostnames, etc.
Something like:
Code:
@AnalyzerDef(name = "customAnalyzer",
    tokenizer = @TokenizerDef(
        factory = PatternTokenizerFactory.class,
        params = @Parameter(name = "pattern", value = "<myPatternMatchingAllMyBusinessIds>")),
    filters = {
        @TokenFilterDef(factory = StandardFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = StopFilterFactory.class) })
Is this possible, or do I have to do as the StandardTokenizer Javadoc suggests and create my own grammar-based tokenizer?
In addition, I'd like to make this the default analyzer. Do I just define it on any class and then reference it in my properties file, like hibernate.search.analyzer=customAnalyzer? Where the definition lives seems a bit arbitrary.
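To make the behaviour I'm after concrete, here is a rough plain-Java sketch using only java.util.regex, with no Lucene or Hibernate Search classes involved. The class name BusinessIdTokenizerSketch and the ID pattern are made up for illustration; my real IDs look different, and this only approximates what the StandardTokenizer does for ordinary words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BusinessIdTokenizerSketch {
    // Hypothetical ID shape: alphanumeric groups joined by '/' or '-', e.g. "AB-1234/7".
    // Caveat: this also keeps ordinary hyphenated words such as "well-known" whole.
    private static final String ID = "[A-Za-z0-9]+(?:[/-][A-Za-z0-9]+)+";

    // Alternation order matters: try the ID first, then fall back to a plain word.
    private static final Pattern TOKEN =
            Pattern.compile("(" + ID + ")|([A-Za-z0-9]+)");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Lower-case every token, roughly what a lower-case filter would do.
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // IDs survive as single tokens; everything else is split on punctuation and spaces.
        System.out.println(tokenize("Please check AB-1234/7, it failed."));
    }
}
```

So the ID stays intact as one token while the surrounding free text is still broken into individual words, which is the result I'd like from the analyzer.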