Hibernate search multiple languages

sore · **Joined:** Sun Nov 15, 2009 8:22 pm **Posts:** 19

Hello!
I have tried to set up a multi-language application using Hibernate Search. I am having some trouble and would be very thankful if you could have a look at it and give me some help!

The Arrangement class

Code:

@Indexed
public class Arrangement  {

   private Set<Text> summarytexts = new HashSet<Text>();
   private Set<Tag> tags = new HashSet<Tag>();
   
   public Arrangement() {}


   @ManyToMany(targetEntity=Tag.class, cascade={CascadeType.PERSIST, CascadeType.MERGE}, fetch=FetchType.LAZY)
   @JoinTable(name="ARRANGEMENT_TAG", joinColumns={@JoinColumn(name="arrangement_id")}, inverseJoinColumns={@JoinColumn(name="arrangementtag_id")})
   @IndexedEmbedded
   @Boost(2.5f)
   public Set<Tag> getTags() {
      return tags;
   }

   public void setTags(Set<Tag> arrangementtags) {
      this.tags = arrangementtags;
   }
   

   @OneToMany(cascade=CascadeType.ALL, fetch=FetchType.EAGER)
   @OrderBy("language ASC")
   @JoinTable(name="ARRANGEMENT_SUMMARY", joinColumns={@JoinColumn(name="arrangement_id")}, inverseJoinColumns={@JoinColumn(name="text_id")})
   @Boost(1.3f)
   @Field(
      name="summary",
      index=Index.TOKENIZED,
      store=Store.YES,
      bridge = @FieldBridge(impl=I18FieldBridge.class,
         params = @Parameter(name="prefix", value="summary")))
   public Set<Text> getSummarytexts() {
      return summarytexts;
   }
   
   public void setSummarytexts(Set<Text> summarytexts) {
      this.summarytexts = summarytexts;
   }
}

A fieldbridge

Code:

public class I18FieldBridge implements FieldBridge, ParameterizedBridge {


   public void set(String name, Object value, Document document, LuceneOptions luceneOptions) {
      
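      if (value == null) {
         return;
      }

      @SuppressWarnings("unchecked")
      Set<Text> texts = (Set<Text>) value;
      
      for (Text text : texts) {
      
         // Skip empty entries instead of aborting the rest of the set;
         // Lucene's Field constructor rejects a null value.
         if (text == null || text.getWord() == null) {
            continue;
         }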
         
          Field field = new Field(
                prefix + "_" + text.getLanguage(), 
                text.getWord(),
                  luceneOptions.getStore(), 
                  luceneOptions.getIndex(),
                  luceneOptions.getTermVector());
          Float boost = luceneOptions.getBoost();
          field.setBoost(boost);
          document.add(field);
      }
   }

   
   private String prefix;

    public void setParameterValues(Map parameters) {
        this.prefix = (String) parameters.get("prefix");
    }

}

The Text class.

Code:

@Table(name="TEXT")
@AnalyzerDefs({
   @AnalyzerDef(name = "SWE", 
      tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
      @TokenFilterDef(factory = LowerCaseFilterFactory.class),
      @TokenFilterDef(factory = StopFilterFactory.class, params = {
         @Parameter(name = "words", value = "stopwords_swe.properties"),
         @Parameter(name = "ignoreCase", value = "true") }),
      @TokenFilterDef(factory = SnowballPorterFilterFactory.class, params = { 
            @Parameter(name = "language", value = "Swedish") })
   }), 
   @AnalyzerDef(name = "ENG", 
         tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
         @TokenFilterDef(factory = ISOLatin1AccentFilterFactory.class),
         @TokenFilterDef(factory = LowerCaseFilterFactory.class),
         @TokenFilterDef(factory = StopFilterFactory.class, params = {
            @Parameter(name = "words", value = "stopwords_eng.properties"),
            @Parameter(name = "ignoreCase", value = "true") }),
         @TokenFilterDef(factory = SnowballPorterFilterFactory.class, params = { 
            @Parameter(name = "language", value = "English") })
      }), 
   @AnalyzerDef(name = "onsearchAnalyzerSWE", 
            tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = {
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = SynonymFactory.class, params = {
               @Parameter(name = "ignoreCase", value = "true"),
                    @Parameter(name = "expand", value = "true"),
                    @Parameter(name = "synonyms", value = "synonyms_swe.properties")}),
            
            @TokenFilterDef(factory = SnowballPorterFilterFactory.class, params = { 
               @Parameter(name = "language", value = "Swedish") })
         }), 
})
@AnalyzerDiscriminator(impl = LanguageDiscriminator.class)
public class Text extends Base {

   private String word;
   private Language language;


   @Enumerated(EnumType.STRING)
   public Language getLanguage() {
      return language;
   }
   public void setLanguage(Language language) {
      this.language = language;
   }


   @Column(nullable=true)
   public String getWord() {
      return word;
   }
   public void setWord(String word) {
      this.word = word;
   }
}

The LanguageDiscriminator

Code:

public class LanguageDiscriminator implements Discriminator {

    public String getAnanyzerDefinitionName(Object value, Object entity, String field) {
          return ((Text) entity).getLanguage().name();
    }
}

So, I have simplified it a bit, but for example an Arrangement has two Text objects in it: one with an English summary and one with a Swedish one.
The field bridge sees to it that each Text object is indexed under a field such as summary_SWE or summary_ENG, and when I search I specify which summary field should be searched depending on the language of the search word.
The language discriminator is used to put different analyzers on the different Text objects depending on the value of the language variable.
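To make the naming convention concrete, here is a minimal sketch in plain Java, without any Hibernate Search or Lucene types. The class SummaryFieldNames, its nested Lang enum and the sample summaries are stand-ins invented for illustration; only the prefix_LANGUAGE pattern comes from the setup above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hibernate Search.
final class SummaryFieldNames {

    // Stand-in for the Language enum used by Text (assumed values).
    enum Lang { SWE, ENG }

    // Same convention as the field bridge: prefix + "_" + language.
    static String fieldName(String prefix, Lang language) {
        return prefix + "_" + language.name();
    }

    // Builds the field-name -> text pairs that one Arrangement would contribute.
    static Map<String, String> fieldsFor(String prefix, Map<Lang, String> summaries) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<Lang, String> e : summaries.entrySet()) {
            fields.put(fieldName(prefix, e.getKey()), e.getValue());
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<Lang, String> summaries = new LinkedHashMap<>();
        summaries.put(Lang.SWE, "En kort sammanfattning");
        summaries.put(Lang.ENG, "A short summary");
        Map<String, String> fields = fieldsFor("summary", summaries);
        // A Swedish search word would be matched against summary_SWE only.
        System.out.println(fields.get(fieldName("summary", Lang.SWE)));
    }
}
```

At search time the same fieldName call would pick the field to query, so indexing and searching cannot drift apart on the naming.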
So far so good. I would like to know if I am using the correct approach here, but my problem is:

It seems that the SnowballPorterFilterFactory isn't applied at indexing time; it is only applied when I search and use the "onsearchAnalyzerSWE" analyzer. The other filters in the SWE and ENG analyzer definitions seem to be used, but not the snowball filter.

Can anyone explain why? Or what I should check?

Thankful for any help, I have been trying for quite a long time.

hardy.ferentschik · **Posted:** Mon Nov 16, 2009 9:00 am

Hi,

looks ok at first sight. Why do you think that the snowball analyzers are not applied? Have you looked at the generated index with Luke?
Is there anything in the log file?
Maybe add some debug trace to your discriminator. Is "((Text) entity).getLanguage().name();" really returning "ENG" or "SWE"?
It might look like "the other analyzers get applied" since the StandardAnalyzer is applied if null is returned. Unless you are saying that you can show that stopwords_swe.properties and stopwords_eng.properties get applied.
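A debug trace could look roughly like the following. This is only a sketch in plain Java: TextStub and LanguageStub stand in for your Text and Language, and analyzerNameFor mirrors the body of the discriminator method rather than implementing the real Discriminator interface.

```java
import java.util.logging.Logger;

// Hypothetical stand-alone sketch; none of these types belong to Hibernate Search.
final class DiscriminatorTraceSketch {

    enum LanguageStub { SWE, ENG }

    static final class TextStub {
        private final LanguageStub language;
        TextStub(LanguageStub language) { this.language = language; }
        LanguageStub getLanguage() { return language; }
    }

    private static final Logger LOG =
            Logger.getLogger(DiscriminatorTraceSketch.class.getName());

    // Returns the analyzer definition name, or null when it cannot be determined
    // (per the note above, null means the default analyzer ends up being used).
    static String analyzerNameFor(Object entity) {
        if (!(entity instanceof TextStub)) {
            LOG.info("discriminator called with a non-Text entity: " + entity);
            return null;
        }
        LanguageStub language = ((TextStub) entity).getLanguage();
        String name = (language == null) ? null : language.name();
        LOG.info("discriminator picked analyzer definition: " + name);
        return name;
    }

    public static void main(String[] args) {
        System.out.println(analyzerNameFor(new TextStub(LanguageStub.SWE)));
        System.out.println(analyzerNameFor(new TextStub(null)));
    }
}
```

If the log never shows the method being called at all, the discriminator is not being consulted for that field, which is a different problem from it returning the wrong name.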

--Hardy

sanne.grinovero · **Posted:** Mon Nov 16, 2009 9:11 am

Hi sore,
the problem is that by using a FieldBridge you override any annotations on the type, so you now have the responsibility to provide the mapping to the index, and consequently the AnalyzerDiscriminator is ignored.
You should select the analyzer on "Set<Text> getSummarytexts()".

sanne.grinovero · **Posted:** Mon Nov 16, 2009 9:11 am

oops hi Hardy, sorry we were writing at the same time.

hardy.ferentschik · **Posted:** Mon Nov 16, 2009 9:20 am

Seems though you had the better answer :)

sore · **Joined:** Sun Nov 15, 2009 8:22 pm **Posts:** 19

Thank you!
Nice to be getting somewhere. If you could just give me some more pointers :)
Should I use the AnalyzerDiscriminator on getSummarytexts()?

Code:

@AnalyzerDiscriminator(impl = LanguageDiscriminator.class)
public Set<Text> getSummarytexts() {
   return summarytexts;
}

I don't know how to solve it that way, because I then get a Set<Text> passed into the LanguageDiscriminator and can't "discriminate" which analyzer to use, since the set holds Text objects with different languages.

Or should I somehow set the analyzer in the fieldbridge? If so, how could I do that?

hardy.ferentschik · **Posted:** Mon Nov 16, 2009 2:19 pm

Have you tried to use @IndexedEmbedded on "Set<Text> getSummarytexts" and adding a ClassBridge to the class Text?
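The idea, as far as I understand it, is that a bridge on Text would then be invoked once per Text instead of once for the whole Set, so each call deals with exactly one language. A plain-Java sketch of that shape follows; DocStub, TextStub and addTo are made-up stand-ins, not Hibernate Search or Lucene types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a per-element bridge; not the real ClassBridge API.
final class PerTextBridgeSketch {

    // Stand-in for a Lucene document: an ordered list of "name=value" fields.
    static final class DocStub {
        final List<String> fields = new ArrayList<>();
        void add(String name, String value) { fields.add(name + "=" + value); }
    }

    // Stand-in for the Text entity (language kept as a plain String here).
    static final class TextStub {
        final String language;
        final String word;
        TextStub(String language, String word) {
            this.language = language;
            this.word = word;
        }
    }

    // Called once per Text, so the language is unambiguous for each field.
    static void addTo(DocStub doc, String prefix, TextStub text) {
        if (text == null || text.word == null || text.language == null) {
            return; // nothing to index for this element
        }
        doc.add(prefix + "_" + text.language, text.word);
    }

    public static void main(String[] args) {
        DocStub doc = new DocStub();
        addTo(doc, "SUMMARY", new TextStub("SWE", "sammanfattning"));
        addTo(doc, "SUMMARY", new TextStub("ENG", "summary"));
        System.out.println(doc.fields);
    }
}
```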

sanne.grinovero · **Posted:** Mon Nov 16, 2009 5:19 pm

As Hardy said you need a ClassBridge to dynamically set your field names, but you also need to map your fields to analyzers.
I needed something like this, not very clean code but you can reuse it on many entities (or even configure it as your global Analyzer) if you follow a convention to label your fields per language:

Code:

public class LocalizedAnalyzer extends Analyzer {
   
   private final Analyzer italianAnalyzer = ...
   private final Analyzer englishAnalyzer = ...
   ...
   private final Analyzer globalAnalyzer = ...

   @Override
   public TokenStream tokenStream(String fieldName, Reader reader) {
      if (fieldName.endsWith( "_IT" )) return italianAnalyzer.tokenStream(fieldName, reader);
      if (fieldName.endsWith( "_UK" )) return englishAnalyzer.tokenStream(fieldName, reader);
      ...
      return globalAnalyzer.tokenStream(fieldName, reader);
   }

}
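Stripped of the Lucene types, the dispatch is just a lookup on the field-name suffix. The following plain-Java sketch uses UnaryOperator<String> as a stand-in for an analyzer; the class name SuffixDispatchSketch and its methods are invented for illustration and are not part of Lucene or Hibernate Search.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: picks a per-language "analyzer" from the field-name suffix.
final class SuffixDispatchSketch {

    private final Map<String, UnaryOperator<String>> bySuffix = new LinkedHashMap<>();
    private final UnaryOperator<String> fallback;

    SuffixDispatchSketch(UnaryOperator<String> fallback) {
        this.fallback = fallback;
    }

    // Registers the stand-in analyzer to use for fields ending in the given suffix.
    SuffixDispatchSketch register(String suffix, UnaryOperator<String> analyzer) {
        bySuffix.put(suffix, analyzer);
        return this;
    }

    // First registered suffix that matches wins; otherwise the fallback is used.
    String analyze(String fieldName, String text) {
        for (Map.Entry<String, UnaryOperator<String>> e : bySuffix.entrySet()) {
            if (fieldName.endsWith(e.getKey())) {
                return e.getValue().apply(text);
            }
        }
        return fallback.apply(text);
    }

    public static void main(String[] args) {
        SuffixDispatchSketch dispatch = new SuffixDispatchSketch(s -> "global:" + s)
                .register("_IT", s -> "it:" + s)
                .register("_UK", s -> "uk:" + s);
        System.out.println(dispatch.analyze("summary_IT", "ciao"));
        System.out.println(dispatch.analyze("title", "hello"));
    }
}
```

The same instance has to be used for indexing and for query parsing, otherwise the terms in the index and the terms in the query are produced by different analyzers.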

sore · **Joined:** Sun Nov 15, 2009 8:22 pm **Posts:** 19

Thanks again for your help.
I seem to be getting closer to a solution, but still some trouble...

I have made an analyzer like you suggested, s.grinovero, and a class bridge. They both work separately, but if I use the class bridge the analyzer never gets called. This is my setup now.

The Arrangement class

Code:

class Arrangement {

...

@IndexedEmbedded
public Set<Text> getSummarytexts() {
      return summarytexts;
}

...

}

The Text class

Code:

@Analyzer(impl = LocalizedAnalyzer.class)
@ClassBridge(name="textbridge", index=Index.UN_TOKENIZED, impl=TextBridge.class )
class Text {

...

@Column(nullable=true)
public String getWord() {
   return word;
}

...

}

In this setup the fields get the correct names, i.e. SUMMARY_SWE or SUMMARY_ENG, but the analyzer is never called (I checked this with a breakpoint and with Luke).
It doesn't make a difference if I put the analyzer on getSummarytexts().

Feels like I'm close at least :)

hardy.ferentschik · **Posted:** Tue Nov 17, 2009 12:04 pm

hi,

have you tried to set the analyzer parameter in the @ClassBridge annotation? By just setting it on the entity you are setting the default analyzer for the case where @Field is used. In your case you want to set the analyzer for the ClassBridge.

--Hardy

sore · **Joined:** Sun Nov 15, 2009 8:22 pm **Posts:** 19

Thanks Hardy,
do you mean like this:

Code:

@ClassBridge(name="textbridge", 
      index=Index.UN_TOKENIZED, 
      impl=TextBridge.class, 
      analyzer= @Analyzer(impl = LocalizedAnalyzer.class))
public class Text extends Base {


   private String word;
...

   @Column(nullable=true)
   public String getWord() {
      return word;
   }
...
}

The Arrangement has no changes.

That doesn't work :(
It still doesn't use the analyzer. Any ideas?
There is no @ManyToOne mapping of Arrangement in the Text object (so the Text class doesn't know which Arrangement collection it is in). Could that be a problem?

sanne.grinovero · **Posted:** Tue Nov 17, 2009 1:06 pm

I was not expecting this, and I'm surprised as I did something similar. Version used?
Could you open an issue and attach a testcase? Much easier to inspect :-)

sore · **Joined:** Sun Nov 15, 2009 8:22 pm **Posts:** 19

I have tried both Hibernate Core 3.3.2 GA and 3.2.6 GA (and followed the version recommendations in the compatibility matrix).
I will try debugging it some more, and if I don't find the error I will try to find some time to make a test case.

sanne.grinovero · **Posted:** Wed Nov 18, 2009 6:58 am

And what about the Hibernate Search version?

sore · **Joined:** Sun Nov 15, 2009 8:22 pm **Posts:** 19

I followed this
https://www.hibernate.org/30.html#A3

so I have tried
3.0.1 GA with core 3.2.6 GA
3.1.1 GA with core 3.3.2 GA