
All times are UTC - 5 hours [ DST ]



Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 2 posts ] 
Author Message
 Post subject: Full text search on html or pdf files
PostPosted: Tue Oct 01, 2013 9:06 am 
Regular

Joined: Thu Jun 16, 2011 12:03 pm
Posts: 94
Hi all,

I would like to know if it is possible to create several Lucene documents from one record in the database. I have the file URL in the database and I want to index every paragraph together with its anchor (<a name="example"></a>) so that a search returns both the match and the paragraph position.

For example, suppose I have this record in my database:

File database table
FileName

example1.html

with the following text:

Code:
....
the book is fine <a name="ex1"></a>
i love trees <a name="ex2"></a>
play with me <a name="ex3"></a>
.....


I would like to index the following:
url                      text
example1.html#ex1        the book is fine
example1.html#ex2        i love trees
example1.html#ex3        play with me


I have already implemented the process that creates the paragraphs, but I still have to figure out how to index the three documents from the single database record.
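
For readers following along, the url/text mapping described above can be sketched in plain Java. This is only an illustration of the splitting step, not the poster's actual implementation: the class name AnchorSplitter, the method split, and the regular expression are hypothetical, and it assumes each paragraph's text precedes an empty <a name="..."></a> anchor, as in the sample.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Hibernate Search or Lucene.
final class AnchorSplitter {

    // Matches an empty named anchor such as <a name="ex1"></a> and captures the name.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\s+name=\"([^\"]+)\"\\s*>\\s*</a>", Pattern.CASE_INSENSITIVE);

    private AnchorSplitter() {
    }

    // Returns an ordered map of "fileName#anchor" -> text preceding that anchor.
    // Text after the last anchor is ignored in this sketch.
    static Map<String, String> split(String fileName, String html) {
        Map<String, String> paragraphs = new LinkedHashMap<>();
        Matcher matcher = ANCHOR.matcher(html);
        int start = 0;
        while (matcher.find()) {
            String text = html.substring(start, matcher.start()).trim();
            paragraphs.put(fileName + "#" + matcher.group(1), text);
            start = matcher.end();
        }
        return paragraphs;
    }
}
```

Each entry of the returned map corresponds to one row of the url/text listing, so it is the unit that would become one Lucene document (or one field value) at indexing time.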

I am thinking about using Lucene natively:

Code:
org.apache.lucene.store.Directory directory = searchFactory.getDirectoryProviders(FileEntity.class)[0].getDirectory();


Any ideas?

Thanks!


 Post subject: Re: Full text search on html or pdf files
PostPosted: Thu Oct 10, 2013 2:21 pm 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Hi,
did you consider implementing a FieldBridge?

A FieldBridge is normally a small function, but technically there is no limit to what you can do in it: you are free to open the external files, load them, extract the text and write it to the Lucene Document.
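
As a rough sketch of the "open, load, extract" part of that suggestion, the plain-Java helper below reads an HTML file and reduces it to indexable text. The names HtmlTextLoader, loadText and stripTags are hypothetical, the tag stripping is deliberately naive, and UTF-8 is assumed. The bridge itself is not shown; assuming Hibernate Search 4.x, its set(String name, Object value, Document document, LuceneOptions luceneOptions) method would call a helper like this and hand the result to the Document.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; a FieldBridge implementation would call it
// from its set(...) method and then write the text to the Lucene Document,
// for example with luceneOptions.addFieldToDocument(name, text, document).
final class HtmlTextLoader {

    private HtmlTextLoader() {
    }

    // Reads baseDir/fileName (assumed to be UTF-8) and returns its text without markup.
    static String loadText(Path baseDir, String fileName) {
        try {
            byte[] bytes = Files.readAllBytes(baseDir.resolve(fileName));
            return stripTags(new String(bytes, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Naive extraction: replaces every tag with a space and collapses whitespace.
    // A real HTML or PDF parser would be more robust than this regular expression.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}
```

Here the FileName column value would be the object passed to the bridge, and baseDir is an assumed location where the files are stored.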

_________________
Sanne
http://in.relation.to/


© Copyright 2014, Red Hat Inc. All rights reserved. JBoss and Hibernate are registered trademarks and servicemarks of Red Hat, Inc.