Indexing PDF/HTML documents

hibernator_11 · **Joined:** Thu Jun 16, 2011 12:03 pm **Posts:** 94

Hi all,

I would like to add a new module to my Hibernate Search (HS) application to index PDF and HTML files. I don't want to store the whole text in my database; I just want to store it in the index. I have been searching the forums for alternatives, such as:

https://community.jboss.org/wiki/HibernateSearchAndOfflineTextExtraction
http://twproject.blogspot.com.es/2007/11/using-hibernate-search-with-complex.html

I think the first link is the right approach, but before I start implementing it, I would like to know whether anybody has had the same problem and how they solved it.

I have about 300,000 records, and each one has 1, 2 or 3 PDF/HTML files, so I think off-line extraction is a good idea.
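
To make the off-line idea concrete, here is a minimal sketch of extracting text in a batch, outside the indexing transaction. It uses only the JDK and is an assumption about how such a step could look, not something taken from the linked articles or from Hibernate Search itself: `OfflineExtractionSketch`, `TextExtractor`, `stripHtml`, `htmlExtractor` and `extractAll` are all hypothetical names, and the HTML handling is a naive tag stripper standing in for a real PDF/HTML parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names come from Hibernate Search.
class OfflineExtractionSketch {

    /** Pluggable extractor; a real one would delegate to a PDF/HTML parser. */
    interface TextExtractor {
        String extract(byte[] content) throws Exception;
    }

    /** Naive HTML-to-text conversion, only good enough for illustration. */
    static String stripHtml(String html) {
        return html.replaceAll("(?s)<[^>]*>", " ")  // replace every tag with a space
                   .replaceAll("\\s+", " ")         // collapse runs of whitespace
                   .trim();
    }

    /** Extractor that decodes the bytes as UTF-8 and strips the markup. */
    static TextExtractor htmlExtractor() {
        return content -> stripHtml(new String(content, StandardCharsets.UTF_8));
    }

    /**
     * Extracts text for every file in parallel and returns recordId -> text,
     * in the same order as the input map. The result would then be handed to
     * the indexer instead of being stored in the database.
     */
    static Map<Long, String> extractAll(Map<Long, byte[]> files,
                                        TextExtractor extractor,
                                        int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one extraction task per file.
            Map<Long, Future<String>> pending = new LinkedHashMap<>();
            for (Map.Entry<Long, byte[]> entry : files.entrySet()) {
                byte[] content = entry.getValue();
                Callable<String> task = () -> extractor.extract(content);
                pending.put(entry.getKey(), pool.submit(task));
            }
            // Wait for each task and collect its text.
            Map<Long, String> texts = new LinkedHashMap<>();
            for (Map.Entry<Long, Future<String>> entry : pending.entrySet()) {
                texts.put(entry.getKey(), entry.getValue().get());
            }
            return texts;
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling `OfflineExtractionSketch.extractAll(files, OfflineExtractionSketch.htmlExtractor(), 4)` would return the plain text per record id. With 300,000 records one would presumably feed the files in chunks rather than hold everything in one map.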

Let me know any ideas...

Thanks in advance,

Hibernator

sanne.grinovero · **Posted:** Wed Sep 19, 2012 4:18 pm

Hi Hibernator,
The current Hibernate Search master includes Apache Tika integration, so it supports PDF parsing and text extraction directly. We haven't tagged a release with it, so I'd suggest checking out the sources and building a snapshot.