extract content indexing

hibernator_11 · **Joined:** Thu Jun 16, 2011 12:03 pm **Posts:** 94

Hi all,

I have to create a whole new module for my web application. I have an entity Book that has a relation with the Format entity. One book can have several formats (HTML, PDF, MS, ...).

At the moment I have a metadata search that returns the entities matching the search criteria (e.g. title, author, subject...). What I want now is to be able to search inside the formats of a book. So I am thinking about indexing the format table together with its content, using a lazy field as shown in https://community.jboss.org/wiki/HibernateSearchAndOfflineTextExtraction and the Apache Tika content extraction library. Sometimes I will have a PDF with 300 pages, and indexing its content may take some time... could I run into problems with big files?

I am using Hibernate Search 3.4 and JBoss AS 6, but I recently saw this link http://planet.jboss.org/post/progress_on_hibernate_search_4_2_tika_text_extraction_and_sort_by_distance and maybe I should start thinking about updating to the latest version.

Furthermore, through the format index, I would like to have:

- content concordance searching
- highlight content search
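
To make clear what I mean by those two points, here is a rough plain-Java sketch of the behaviour I am after on the extracted text. It does not use Hibernate Search or Lucene at all; the class `ConcordanceSketch` and its method are names I made up for illustration, and a real solution would presumably rely on the index rather than scanning strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Hibernate Search or Lucene API.
final class ConcordanceSketch {

    // Returns one snippet per case-insensitive occurrence of `term` in `text`.
    // Each snippet keeps up to `context` characters on either side of the hit
    // and wraps the hit itself in <b> tags (the "highlight").
    static List<String> concordance(String text, String term, int context) {
        List<String> snippets = new ArrayList<>();
        if (term.isEmpty()) {
            return snippets; // nothing to look for
        }
        int len = term.length();
        int i = 0;
        while (i + len <= text.length()) {
            if (text.regionMatches(true, i, term, 0, len)) {
                int start = Math.max(0, i - context);
                int end = Math.min(text.length(), i + len + context);
                snippets.add(text.substring(start, i)
                        + "<b>" + text.substring(i, i + len) + "</b>"
                        + text.substring(i + len, end));
                i += len; // continue after this hit, so hits never overlap
            } else {
                i++;
            }
        }
        return snippets;
    }
}
```

So for each book format I would like to get back the matching passages with the search term marked, not only the fact that the document matched.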

To sum up: do you think I should change the version of Hibernate Search? Is this the right way to extract content with Hibernate Search? What about efficiency?

Thanks in advance,

Hibernator,

sanne.grinovero · **Posted:** Wed Feb 13, 2013 9:04 am

Hi Hibernator,
yes, I think you should upgrade; there are many other good reasons to do so, including better performance across the board.
Note that for Tika we don't yet support asynchronous indexing out of the box: only synchronous. If you need to index PDFs in a separate thread, we'll need you to run some experiments.
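
As a starting point for such an experiment, the extraction step could be pushed onto a worker thread along these lines. This is only a sketch in plain `java.util.concurrent`, not a Hibernate Search feature: the `extractor` function is a placeholder for whatever performs the actual text extraction (for example a Tika call), and the class and method names are made up for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of Hibernate Search or Tika.
final class BackgroundExtractionSketch {

    // Runs the (assumed slow) extractor on a thread from `pool` and returns
    // immediately with a future that completes with the extracted text.
    static CompletableFuture<String> extractInBackground(
            ExecutorService pool,
            Function<byte[], String> extractor,
            byte[] document) {
        return CompletableFuture.supplyAsync(() -> extractor.apply(document), pool);
    }
}
```

The caller would then decide what to do once the future completes, for instance storing the text so that the regular synchronous indexing can pick it up. Whether that works well with your transactions and with 300-page PDFs is exactly the part that needs testing.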