Hi all,
I have to create a whole new module for my web application. I have an entity Book that has a relation to the Format entity. One book can have several formats (HTML, PDF, MS, ...).
At the moment I have a metadata search that returns the entities matching the search criteria (e.g. title, author, subject...). What I want now is to be able to search inside the content of a book's formats. So I am thinking about indexing the format table together with the content, using a lazy field as shown in https://community.jboss.org/wiki/HibernateSearchAndOfflineTextExtraction and the Apache Tika content extraction library. Sometimes I may have a PDF with 300 pages, and it may take some time to index the content. Could I run into problems with big files?
I am using Hibernate Search 3.4 and JBoss AS 6, but recently I saw this link http://planet.jboss.org/post/progress_on_hibernate_search_4_2_tika_text_extraction_and_sort_by_distance and maybe I should start thinking about upgrading to the latest version.
Furthermore, through the format index, I would like to have:
- content concordance searching
- highlighted content search
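To make clear what I mean by concordance and highlighting, here is a rough sketch in plain Java. It is not Hibernate Search or Lucene code, the class and method names (ConcordanceSketch, concordance) are made up for illustration, and it works on an in-memory string instead of an index. It only shows the kind of output I would like to get: every occurrence of a term with some surrounding context and the match marked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, only to illustrate the desired result format.
public class ConcordanceSketch {

    // Returns one snippet per case-insensitive occurrence of `term`,
    // with up to `window` characters of context on each side and the
    // matched text wrapped in square brackets.
    public static List<String> concordance(String text, String term, int window) {
        List<String> snippets = new ArrayList<>();
        if (term.isEmpty()) {
            return snippets; // nothing sensible to search for
        }
        // Pattern.quote treats the term as a literal, not as a regex.
        Pattern pattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            int start = matcher.start();
            int end = matcher.end();
            int left = Math.max(0, start - window);
            int right = Math.min(text.length(), end + window);
            snippets.add(text.substring(left, start)
                    + "[" + text.substring(start, end) + "]"
                    + text.substring(end, right));
        }
        return snippets;
    }

    public static void main(String[] args) {
        String content = "Hibernate Search indexes text. Search is fast.";
        for (String snippet : concordance(content, "search", 5)) {
            System.out.println(snippet);
        }
    }
}
```

In the real module I would expect the extracted text to come from the index rather than from a string held in memory, so this is only about the shape of the result.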
To sum up: do you think I should upgrade Hibernate Search? Is this the right way to extract content with Hibernate Search? What about the efficiency?
Thanks in advance,
Hibernator