Hi Sanne,
Thanks for the reply. I might look into it in the future. For now I went with a really simple approach that basically ignores the languages:
Code:
@AnalyzerDef(name = "simpleAnalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = NGramFilterFactory.class, params = {
            @Parameter(name = "minGramSize", value = "5"),
            @Parameter(name = "maxGramSize", value = "5") })
    })
Since we're using English, Finnish and Polish, this should give satisfactory results: ASCIIFoldingFilterFactory simply folds the language-specific characters to their ASCII equivalents ;) We'll see how it works in the production environment and will probably tweak it in the future. Good to hear that sharding doesn't carry a big performance penalty.
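To give a feel for what this chain does to a single word, here is a rough plain-Java sketch with no Lucene dependency. It is only an approximation and not the real filters: the class name SimpleAnalyzerSketch and its methods are made up for illustration, whitespace splitting stands in for StandardTokenizer, and the folding step only strips combining marks, so letters without a Unicode decomposition (such as the Polish "ł") are left untouched here.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Hibernate Search or Lucene.
public class SimpleAnalyzerSketch {

    // Rough stand-in for ASCII folding: decompose accented characters (NFD)
    // and drop the combining marks. Letters with no decomposition are kept as-is.
    public static String fold(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Emit every substring of exactly `size` characters, left to right.
    public static List<String> ngrams(String token, int size) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + size <= token.length(); i++) {
            grams.add(token.substring(i, i + size));
        }
        return grams;
    }

    // Fold, lowercase, then cut each token into 5-grams (min = max = 5).
    public static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        // Whitespace splitting is a crude substitute for a real tokenizer.
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = fold(raw).toLowerCase(Locale.ROOT);
            out.addAll(ngrams(token, 5));
        }
        return out;
    }

    public static void main(String[] args) {
        // "\u00e4" is the Finnish letter a-with-diaeresis.
        System.out.println(analyze("J\u00e4rjestelm\u00e4"));
        // A four-letter word yields no grams in this sketch.
        System.out.println(analyze("Auto"));
    }
}
```

In this sketch, a word shorter than five characters produces no grams at all, so it's probably worth checking how the real NGramFilterFactory treats short words with minGramSize = 5 before relying on it in production.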