
All times are UTC - 5 hours [ DST ]



 Post subject: extract content indexing
Posted: Tue Feb 12, 2013 12:24 pm
Regular

Joined: Thu Jun 16, 2011 12:03 pm
Posts: 94
Hi all,

I have to create a whole new module for my web application. I have a Book entity that has a relation to a Format entity. One book can have several formats (HTML, PDF, MS, ...).

At the moment I have a metadata search that returns the entities matching the search criteria (e.g. title, author, subject...). What I want now is to be able to search inside the formats of a book. So I am thinking about indexing the format table together with its content, using a lazy field as shown in https://community.jboss.org/wiki/HibernateSearchAndOfflineTextExtraction and the Apache Tika content extraction library. Sometimes I will have a PDF with 300 pages, and indexing its content may take some time. Could I run into problems with big files?

I am using Hibernate Search 3.4 and JBoss AS 6, but I recently saw this post http://planet.jboss.org/post/progress_on_hibernate_search_4_2_tika_text_extraction_and_sort_by_distance and maybe I should start thinking about upgrading to the latest version.

Furthermore, through the format index, I would like to have:

- concordance searching over the content
- highlighting of search matches in the content
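
To make the two items above concrete: a concordance (keyword-in-context) listing and a highlighted match both come down to showing each hit together with some surrounding text. The sketch below is plain Java and does not use Hibernate Search or Lucene at all; the class name `Concordance`, the method `keywordInContext`, and the `[ ]` markers are hypothetical names chosen for illustration. It assumes the extracted text is already available as a `String` and that lower-casing does not change its length (true for ASCII text).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class Concordance {

    /**
     * Returns one context line per case-insensitive occurrence of term.
     * Each line holds up to `window` characters on either side of the hit,
     * with the hit itself wrapped in [ ] as a stand-in for highlighting.
     */
    static List<String> keywordInContext(String text, String term, int window) {
        List<String> lines = new ArrayList<>();
        if (term.isEmpty()) {
            return lines; // nothing to search for
        }
        // Assumption: lower-casing keeps character offsets aligned with `text`.
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = term.toLowerCase(Locale.ROOT);
        int from = 0;
        int hit;
        while ((hit = haystack.indexOf(needle, from)) >= 0) {
            int hitEnd = hit + needle.length();
            int start = Math.max(0, hit - window);
            int end = Math.min(text.length(), hitEnd + window);
            lines.add(text.substring(start, hit)
                    + "[" + text.substring(hit, hitEnd) + "]"
                    + text.substring(hitEnd, end));
            from = hitEnd; // continue after this occurrence
        }
        return lines;
    }

    public static void main(String[] args) {
        String content = "Lucene indexes text. Tika extracts text from PDF files.";
        for (String line : keywordInContext(content, "text", 10)) {
            System.out.println(line);
        }
    }
}
```

In a real index-backed setup the hits would come from the search engine rather than from a linear scan, so this only shows the shape of the output, not how to produce it efficiently for 300-page documents.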

To sum up: do you think I should change the version of Hibernate Search? Is this the right way to do content extraction with Hibernate Search? What about efficiency?

Thanks in advance,

Hibernator,


 Post subject: Re: extract content indexing
Posted: Wed Feb 13, 2013 9:04 am
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Hi Hibernator,
yes, I think you should upgrade; there are many other good reasons to do so, including better performance in many different areas.
Note that for Tika we don't yet support asynchronous indexing out of the box, only synchronous. If you need to index PDFs in a separate thread, we'll need you to run some experiments.
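
As a starting point for such an experiment, here is a minimal sketch of moving a slow extraction step onto a separate thread with a plain `ExecutorService`. It is not a Hibernate Search feature and does not call Tika: `extractText` is a hypothetical stand-in for the real extraction call, and `extractInBackground` is a name made up for this example. How the extracted text is then handed to the indexer is left open, since that is exactly the part that needs experimenting.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class BackgroundExtraction {

    // Hypothetical stand-in for the slow step (for example, parsing a large PDF).
    // Here it simply decodes the bytes as UTF-8 text.
    static String extractText(byte[] document) {
        return new String(document, StandardCharsets.UTF_8);
    }

    // Submits the extraction to the given worker and returns immediately.
    static Future<String> extractInBackground(ExecutorService worker, final byte[] document) {
        Callable<String> task = () -> extractText(document);
        return worker.submit(task);
    }

    public static void main(String[] args) throws Exception {
        // One worker thread so large documents are processed one at a time.
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            byte[] document = "pretend this is a PDF".getBytes(StandardCharsets.UTF_8);
            Future<String> pending = extractInBackground(worker, document);

            // The calling thread is free to do other work here.

            // Wait for the result, but give up after a bounded time.
            String text = pending.get(30, TimeUnit.SECONDS);
            System.out.println("Extracted " + text.length() + " characters");
        } finally {
            worker.shutdown(); // let the JVM exit once queued work is done
        }
    }
}
```

A single-threaded executor is used on purpose: it keeps memory use predictable when several big files arrive at once, at the cost of processing them sequentially.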

_________________
Sanne
http://in.relation.to/


© Copyright 2014, Red Hat Inc. All rights reserved. JBoss and Hibernate are registered trademarks and servicemarks of Red Hat, Inc.