Creating Multiple Indexes based on language

david2011 · **Joined:** Tue May 17, 2011 1:45 am **Posts:** 52

Hi,

I need to index documents for the following languages

1. English
2. Chinese
3. Japanese
4. Korean
5. German
6. French
7. Spanish
8. Dutch

My questions are as follows

1. Can Lucene index documents in the above languages?
2. How can Hibernate Search be used to index a document in all these languages? ---> An example with code for any two languages will be very helpful
3. During Manual Indexing, how can Hibernate Search be configured/coded to ensure that the same document has different indexes based on different language, or rather during indexing process can Hibernate index the document in all the languages? How?
4. During query time, how do we determine which index to use, (given that during query time, the language is known and passed as a parameter to the DAO) ?

Thanks
David

sanne.grinovero · **Posted:** Tue May 17, 2011 6:38 am

Quote:

1. Can Lucene index documents in the above languages?

Yes. You could index them all the same way with a StandardAnalyzer (I'm not sure how well that works for Japanese), or you could use some of the additional Lucene libraries that do language-specialized analysis. The Snowball project is considered very good at this; it covers the European languages you listed (English, German, French, Spanish, Dutch) but not Chinese, Japanese or Korean, for which Lucene ships a separate CJKAnalyzer. You don't need to use Snowball for all of them; you can pick the analyzer of your choice, or write your own, for each language.
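To make the per-language choice concrete, here is a minimal plain-Java sketch that maps ISO 639-1 codes to a kind of analysis. The class `AnalysisChoice`, its `Kind` enum and the language codes are hypothetical names introduced for illustration, not Lucene types. The grouping is an assumption: Snowball stemming for the five European languages, a CJK-style analyzer for Chinese, Japanese and Korean, and standard analysis as the fallback.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: decides which kind of analysis to apply per language. */
final class AnalysisChoice {

    /** Illustrative categories only; these are not Lucene classes. */
    enum Kind { SNOWBALL_STEMMER, CJK_BIGRAMS, STANDARD }

    private static final Map<String, Kind> BY_LANGUAGE = new LinkedHashMap<>();
    static {
        // Assumption: European languages from the question get a Snowball stemmer.
        for (String code : new String[] {"en", "de", "fr", "es", "nl"}) {
            BY_LANGUAGE.put(code, Kind.SNOWBALL_STEMMER);
        }
        // Assumption: Chinese, Japanese and Korean get CJK-style analysis.
        for (String code : new String[] {"zh", "ja", "ko"}) {
            BY_LANGUAGE.put(code, Kind.CJK_BIGRAMS);
        }
    }

    /** Returns the analysis kind for an ISO 639-1 code, or STANDARD if unknown. */
    static Kind forLanguage(String isoCode) {
        if (isoCode == null) {
            return Kind.STANDARD;
        }
        Kind kind = BY_LANGUAGE.get(isoCode.toLowerCase(Locale.ROOT));
        return kind != null ? kind : Kind.STANDARD;
    }
}
```

In a real setup each `Kind` would correspond to an actual analyzer instance or analyzer definition; the table is only meant to show that one strategy per language is a simple lookup.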

Quote:

2. How can Hibernate Search be used to index a document in all these languages? ---> An example with code for any two languages will be very helpful

Simple solution: use the same analyzer for all documents.
Better solution: pick the proper analyzer for each entity instance, based on its language;
there is an example in the docs: http://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#d0e3385
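Assuming the linked section is the one on analyzer discriminators (`@AnalyzerDiscriminator`), the per-instance choice comes down to one method that turns the entity's language value into the name of an analyzer definition. Below is a rough sketch of that logic in plain Java. The class name is hypothetical, the definition names "en" and "de" are assumed to be registered elsewhere, and the class deliberately does not implement `org.hibernate.search.analyzer.Discriminator` so that it compiles without Hibernate Search on the classpath.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Sketch of per-instance analyzer selection. In a real project this class
 * would implement Hibernate Search's Discriminator interface; here it only
 * mirrors the shape of its single method.
 */
final class LanguageDiscriminatorSketch {

    // Assumption: analyzer definitions with these names exist in the mapping.
    private static final Set<String> DEFINED = new HashSet<>(Arrays.asList("en", "de"));

    /**
     * @param value  value of the property that carries the language code
     * @param entity the entity being indexed (unused in this sketch)
     * @param field  the document field name (unused in this sketch)
     * @return the analyzer definition name, or null to keep the default analyzer
     */
    public String getAnalyzerDefinitionName(Object value, Object entity, String field) {
        if (!(value instanceof String)) {
            return null;
        }
        String code = ((String) value).trim().toLowerCase(Locale.ROOT);
        return DEFINED.contains(code) ? code : null;
    }
}
```

Returning null for an unknown language keeps the default analyzer, so documents in a language without its own definition are still indexed.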

Quote:

3. During Manual Indexing, how can Hibernate Search be configured/coded to ensure that the same document has different indexes based on different language, or rather during indexing process can Hibernate index the document in all the languages? How?

Also have a look at index sharding: you can define a strategy that keeps each language in its own index. During search, you'll transparently search across all indexes, or you can use a custom filter as explained here:
http://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#query-filter-shard
This filter can "pick" the appropriate index according to the filter parameter (e.g. to implement something like "search for documents about hibernate in the Dutch language").
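The idea of one shard per language can be modelled in a few lines of plain Java. This is only a sketch of the selection logic: `LanguageShards` and its method names are hypothetical and do not mirror the signatures of Hibernate Search's `IndexShardingStrategy`, which is where such logic would live in a real application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical model of "one shard per language"; not a Hibernate Search class. */
final class LanguageShards {

    // Position in this list is the shard number for that language.
    private final List<String> languages;

    LanguageShards(String... languages) {
        this.languages = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(languages)));
    }

    /** Shard that receives a document written in the given language. */
    int shardForAddition(String language) {
        int shard = languages.indexOf(normalize(language));
        if (shard < 0) {
            throw new IllegalArgumentException("No shard configured for language: " + language);
        }
        return shard;
    }

    /** Shards to search: only the matching one when a language filter is set, otherwise all. */
    List<Integer> shardsForQuery(String languageFilter) {
        List<Integer> shards = new ArrayList<>();
        if (languageFilter != null) {
            shards.add(shardForAddition(languageFilter));
            return shards;
        }
        for (int i = 0; i < languages.size(); i++) {
            shards.add(i);
        }
        return shards;
    }

    private static String normalize(String language) {
        return language == null ? "" : language.toLowerCase(Locale.ROOT);
    }
}
```

With shards configured as `new LanguageShards("en", "nl", "de")`, a DAO that receives the language as a parameter would pass it as the filter value and only the Dutch shard would be searched; passing no filter searches every shard.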

Quote:

4. During query time, how do we determine which index to use, (given that during query time, the language is known and passed as a parameter to the DAO) ?

as above ;)

Let me know if you need more pointers. When you have it working, blog about it! Many people ask about this, so it would be nice to be able to find your instructions.