-->

All times are UTC - 5 hours [ DST ]



Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 2 posts ] 
 Post subject: Indexing pdf/html documents
PostPosted: Wed Sep 19, 2012 10:47 am 
Regular
Hi all,

I would like to add a new module to my Hibernate Search (HS) application to index PDF and HTML files. I don't want to store the whole text in my database; I just want to store it in the index. I have been searching the forums and found alternatives such as:

https://community.jboss.org/wiki/HibernateSearchAndOfflineTextExtraction
http://twproject.blogspot.com.es/2007/11/using-hibernate-search-with-complex.html


I think the first link is the right approach, but before I start implementing it, I would like to know whether anybody has had the same problem and how they solved it.

I have about 300,000 records, and each one has 1, 2 or 3 PDF/HTML files, so I think offline extraction is a good idea.

Let me know if you have any ideas.

Thanks in advance,

Hibernator,


 Post subject: Re: Indexing pdf/html documents
PostPosted: Wed Sep 19, 2012 4:18 pm 
Hibernate Team
Hi Hibernator,
the current Hibernate Search master includes Apache Tika integration, so it supports PDF parsing and text extraction directly. We haven't tagged a release with it yet, so I'd suggest checking out the sources and building a snapshot.

_________________
Sanne
http://in.relation.to/
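
For readers who want to see what the text-extraction step discussed in this thread looks like in isolation, here is a minimal sketch for the HTML case only, using nothing but the JDK's built-in Swing HTML parser. It is not the Tika integration mentioned in the reply above and it does not handle PDF; the class name `HtmlTextExtractor` and its methods are hypothetical helpers invented for illustration, not part of Hibernate Search or Tika. A real setup would more likely rely on Tika, which covers many more formats.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper (an assumption for illustration), JDK-only. */
final class HtmlTextExtractor {

    private HtmlTextExtractor() {
    }

    /** Returns the character data of an HTML document with the tags removed. */
    static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Invoked for each run of character data found between tags.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        // Third argument: ignore any charset declared inside the document,
        // because the caller already supplied a decoded Reader.
        new ParserDelegator().parse(html, callback, true);
        return text.toString();
    }

    /** Convenience overload for HTML that is already held in memory. */
    static String extractText(String html) {
        try {
            return extractText(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned string is the kind of value that could be handed to the indexer at indexing time rather than persisted in the database, which is what the original question asks for. How it gets wired into a Hibernate Search field (a custom bridge, or the Tika integration from master) is left open here.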

