Hibernate Search with fields in arbitrary languages

pawel.predki · **Posted:** Mon Feb 16, 2015 7:52 am

Hey everyone. I'm just starting my adventure with Hibernate Search and so far I see a lot of potential. Thanks to Sanne Grinovero's help, I managed to add Hibernate Search 5.0.1 to my Wildfly installation and I did some simple tests with it.

The problem I'm facing now is that our application uses a 'description' field in one of the entities, which may be in one of several languages. While it would be possible for my application to know which language is being used when the description is being posted (dynamic sharding, I assume?), I would like to be able to search over the 'description' field and match the entries in all the languages.

A good example would be what YouTube does, although I obviously don't know how their backend is organized. Anyway, you can search for videos which can be in many languages and it somehow works :)

Two solutions I thought about are:

1. Use dynamic sharding and, when searching, search in all the languages separately and combine the result sets.
2. Don't use advanced analyzers, just tokenize and index the fields as simply as possible and use one search request to search for everything.

The problem with solution 1 is that it might take a lot of time if there are many languages, so it's not scalable at all.
In the case of solution 2, I wouldn't be able to use stemming, match synonyms, etc.

Is there a middle ground, which would make it fast and would allow me to use more advanced techniques? We can assume that the list of supported languages is known.

Thanks for any suggestions,
Cheers,
Pawel

sanne.grinovero · **Posted:** Thu Feb 19, 2015 8:22 am

Hi Pawel,
I once developed an application using Search which had similar requirements, and back then we created org.hibernate.search.annotations.AnalyzerDiscriminator to achieve it.

In my case I would have a "lang" attribute in my entity, and my custom discriminator would switch on that property to return the name of the Analyzer I'd want to use to index that entity.
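
As a rough illustration of such a discriminator, here is a minimal sketch. To keep it compilable on its own it declares a local stand-in for the org.hibernate.search.analyzer.Discriminator interface; in a real project you would implement the library's interface instead. The analyzer names ("englishAnalyzer" and so on) and the language codes are assumptions, and each name would have to match an @AnalyzerDef declared elsewhere.

```java
import java.util.Locale;
import java.util.Map;

// Local stand-in mirroring org.hibernate.search.analyzer.Discriminator,
// so the sketch compiles without Hibernate Search on the classpath.
interface Discriminator {
    String getAnalyzerDefinitionName(Object value, Object entity, String field);
}

// Hypothetical discriminator: switches on the value of a "lang" property
// and returns the name of the analyzer definition to use for that entity.
class LanguageDiscriminator implements Discriminator {

    // Assumed analyzer definition names; they are not from the thread.
    private static final Map<String, String> ANALYZERS = Map.of(
            "en", "englishAnalyzer",
            "fi", "finnishAnalyzer",
            "pl", "polishAnalyzer");

    @Override
    public String getAnalyzerDefinitionName(Object value, Object entity, String field) {
        if (value == null) {
            return null; // null means: do not override the default analyzer
        }
        // Unknown languages also yield null and fall back to the default.
        return ANALYZERS.get(value.toString().toLowerCase(Locale.ROOT));
    }
}
```

The discriminator would then be attached to the entity's language property with something like @AnalyzerDiscriminator(impl = LanguageDiscriminator.class).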

At query time things get a bit more complex, of course, and you need to keep the per-language analysis in mind. For example, when parsing a user's query string you might need to know which language it is in, and you probably want a dedicated index (via sharding) for each language, or a separate field per language.

Among your proposals, #1 is actually not bad. Let Search do the query on all shards; you'll be surprised how little the query performance suffers.
But generally speaking such an approach would "mix" keywords which are actually unrelated, so the quality of the query wouldn't be as good as when you can fully separate them.
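
For comparison, here is what combining per-language result sets by hand could look like, as opposed to letting Search fan the query out over the shards itself. This is only a sketch in plain Java: LanguageHit and ResultMerger are hypothetical names, and the hits are assumed to come from separate per-language queries. Note that scores from separate indexes are computed from each index's own term statistics, so sorting them together is only a rough ranking, which is one form of the "mixing" mentioned above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical hit type: which entity matched, in which language, how well.
class LanguageHit {
    final long entityId;
    final String lang;
    final float score;

    LanguageHit(long entityId, String lang, float score) {
        this.entityId = entityId;
        this.lang = lang;
        this.score = score;
    }
}

class ResultMerger {

    // Flattens the per-language hit lists and keeps the best 'limit' hits,
    // ordered by descending score.
    static List<LanguageHit> merge(Map<String, List<LanguageHit>> hitsByLanguage, int limit) {
        List<LanguageHit> all = new ArrayList<>();
        for (List<LanguageHit> hits : hitsByLanguage.values()) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((LanguageHit h) -> h.score).reversed());
        int end = Math.max(0, Math.min(limit, all.size()));
        return new ArrayList<>(all.subList(0, end));
    }
}
```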

pawel.predki · **Posted:** Thu Feb 19, 2015 8:29 am

Hi Sanne,

Thanks for the reply. I might look into it in the future. Right now I went with a really simple approach which basically ignores the languages:

Code:

@AnalyzerDef(name = "simpleAnalyzer", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
      @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
      @TokenFilterDef(factory = LowerCaseFilterFactory.class),
      @TokenFilterDef(factory = NGramFilterFactory.class, params = {
            @Parameter(name = "minGramSize", value = "5"),
            @Parameter(name = "maxGramSize", value = "5") }) })

Since we're using English, Finnish and Polish, this should give satisfactory results. We just ignore the language-specific characters using ASCIIFoldingFilterFactory ;) We'll see how it works in the production environment and probably tweak it in the future. Good to hear that the sharding doesn't give a big performance penalty.
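
To see why this works reasonably well across inflected forms, the plain-Java sketch below imitates the chain above: fold to ASCII, lowercase, then cut each token into 5-character grams. It is an approximation, not the Lucene filters themselves: folding is done with java.text.Normalizer plus a hand-written mapping for "ł" (which has no Unicode decomposition), whereas ASCIIFoldingFilter covers many more characters. NGramSketch and its methods are hypothetical names.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class NGramSketch {

    // Rough equivalent of ASCII folding + lowercasing: decompose accented
    // letters, drop the combining marks, map l-with-stroke by hand.
    static String fold(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "")
                .replace('\u0142', 'l')   // lowercase l with stroke
                .replace('\u0141', 'L')   // uppercase L with stroke
                .toLowerCase(Locale.ROOT);
    }

    // All substrings of exactly 'size' characters of the folded token.
    // In this sketch, tokens shorter than 'size' produce no grams at all.
    static List<String> grams(String token, int size) {
        String folded = fold(token);
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= folded.length(); i++) {
            out.add(folded.substring(i, i + size));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two inflected forms of the Polish word for "book".
        System.out.println(grams("ksi\u0105\u017Cka", 5));
        System.out.println(grams("ksi\u0105\u017Cki", 5));
    }
}
```

The two Polish forms share the grams "ksiaz" and "siazk", so a query for one form can match the other without any stemmer. One thing to check against the Lucene version in use: in this sketch a token shorter than five characters yields nothing, and if the real filter behaves the same way, short words would not be searchable with minGramSize set to 5.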

sanne.grinovero · **Posted:** Thu Feb 19, 2015 8:36 am

Right, that's a valid approach for most use cases.
The dynamic analyzers stuff is only useful if you get into very specific linguistic processing.