All times are UTC - 5 hours [ DST ]



 Post subject: Hibernate Search with fields in arbitrary languages
PostPosted: Mon Feb 16, 2015 7:52 am 
Beginner

Joined: Mon Feb 16, 2015 6:41 am
Posts: 32
Location: Lodz, Poland
Hey everyone. I'm just starting my adventure with Hibernate Search and so far I see a lot of potential. Thanks to Sanne Grinovero's help, I managed to add Hibernate Search 5.0.1 to my WildFly installation and ran some simple tests with it.

The problem I'm facing now is that our application uses a 'description' field in one of the entities, which may be in one of several languages. While it would be possible for my application to know which language is being used when the description is being posted (dynamic sharding, I assume?), I would like to be able to search over the 'description' field and match the entries in all the languages.

A good example would be what YouTube does, although I obviously don't know how their backend is organized. Anyway, you can search for videos which can be in many languages and it somehow works :)

Two solutions I thought about are:

1. Use dynamic sharding and, when searching, search in each language separately and combine the result sets.
2. Don't use advanced analyzers; just tokenize and index the fields as simply as possible and use one search request to search for everything.

The problem with solution 1 is that it might take a lot of time if there are many languages, so it's not scalable at all.
In case of solution 2, I wouldn't be able to use stemming and match synonyms, etc.

Is there a middle ground, which would make it fast and would allow me to use more advanced techniques? We can assume that the list of supported languages is known.

Thanks for any suggestions,
Cheers,
Pawel


 Post subject: Re: Hibernate Search with fields in arbitrary languages
PostPosted: Thu Feb 19, 2015 8:22 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Hi Pawel,
I once developed an application using Search which had similar requirements, and back then we created org.hibernate.search.annotations.AnalyzerDiscriminator to achieve it.

In my case I would have a "lang" attribute in my entity, and my custom discriminator would switch on that property to return the name of the Analyzer I'd want to use to index that entity.
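A minimal sketch of what such a discriminator could look like. The analyzer names (englishAnalyzer, finnishAnalyzer, polishAnalyzer) and the language codes are assumptions for illustration, not something from the original application. To keep the snippet compilable without Hibernate Search on the classpath, the class only mirrors the method of org.hibernate.search.analyzer.Discriminator instead of implementing it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch only. In a real application this class would implement
// org.hibernate.search.analyzer.Discriminator and be registered with
// @AnalyzerDiscriminator(impl = LanguageDiscriminator.class) on the "lang" property.
final class LanguageDiscriminator {

    // Hypothetical analyzer names; each one would need a matching @AnalyzerDef.
    private static final Map<String, String> ANALYZERS = new HashMap<>();
    static {
        ANALYZERS.put("en", "englishAnalyzer");
        ANALYZERS.put("fi", "finnishAnalyzer");
        ANALYZERS.put("pl", "polishAnalyzer");
    }

    // Same shape as Discriminator#getAnalyzerDefinitionName(value, entity, field).
    // "value" is the value of the annotated property, here the language code.
    // Returning null leaves the default analyzer in place.
    public String getAnalyzerDefinitionName(Object value, Object entity, String field) {
        if (!(value instanceof String)) {
            return null;
        }
        return ANALYZERS.get(((String) value).toLowerCase(Locale.ROOT));
    }
}
```

With this in place, an entity whose "lang" is "fi" would be indexed with finnishAnalyzer, while an unknown or missing language falls back to the default analyzer.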

At query time things get a bit more complex, of course, and you'd need to keep that in mind. For example, when parsing a user's search string you might need to know which language it is in, and you probably want a dedicated index (via sharding) for each language, or a different field per language.

Among your proposals, #1 is actually not bad. Let Search run the query on all shards; you may be surprised that query performance doesn't get much slower.
But generally speaking such an approach would "mix" keywords which are actually unrelated, so the quality of the query wouldn't be as good as when you can fully separate them.

_________________
Sanne
http://in.relation.to/


 Post subject: Re: Hibernate Search with fields in arbitrary languages
PostPosted: Thu Feb 19, 2015 8:29 am 
Beginner

Joined: Mon Feb 16, 2015 6:41 am
Posts: 32
Location: Lodz, Poland
Hi Sanne,

Thanks for the reply. I might look into it in the future. Right now I went with a really simple approach which basically ignores the languages:

Code:
// Language-agnostic analyzer: tokenize, fold accented characters to ASCII,
// lowercase, then index fixed-size 5-character n-grams.
@AnalyzerDef(name = "simpleAnalyzer",
      tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
      filters = {
            @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = NGramFilterFactory.class, params = {
                  @Parameter(name = "minGramSize", value = "5"),
                  @Parameter(name = "maxGramSize", value = "5") }) })


Since we're using English, Finnish and Polish, this should give satisfactory results. The language-specific characters are simply folded away by ASCIIFoldingFilterFactory ;) We'll see how it works in the production environment and probably tweak it in the future. Good to hear that sharding doesn't carry a big performance penalty.
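To get a feel for what this filter chain does to a single token, here is a rough plain-Java approximation. It is not Lucene code: the class and method names are made up for illustration, and the folding step only strips combining accents plus the Polish "ł", whereas ASCIIFoldingFilter handles many more characters.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java approximation of the "simpleAnalyzer" chain for one token:
// fold accents, lowercase, then emit every substring of the given size.
final class SimpleAnalyzerSketch {

    static String fold(String token) {
        // Decompose accented letters and drop the combining marks.
        String stripped = Normalizer.normalize(token, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        // The Polish l-with-stroke has no Unicode decomposition, so map it by hand.
        return stripped.replace('\u0142', 'l')
                .replace('\u0141', 'L')
                .toLowerCase(Locale.ROOT);
    }

    static List<String> grams(String token, int size) {
        String folded = fold(token);
        List<String> result = new ArrayList<>();
        // A token shorter than the gram size yields no grams in this sketch.
        for (int i = 0; i + size <= folded.length(); i++) {
            result.add(folded.substring(i, i + size));
        }
        return result;
    }
}
```

For example, grams("Päivää", 5) gives "paiva" and "aivaa", so a query typed without the umlauts still shares grams with the indexed text. In this sketch a short word such as "Łódź" (folded to "lodz") produces no grams at all; it is worth checking how NGramFilterFactory treats tokens shorter than minGramSize in the Lucene version you are using.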


 Post subject: Re: Hibernate Search with fields in arbitrary languages
PostPosted: Thu Feb 19, 2015 8:36 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Right, that's a valid approach for most use cases.
The dynamic analyzers stuff is only useful if you get into very specific linguistic processing.

_________________
Sanne
http://in.relation.to/

