Hibernate Search: projection of document id? (highlighter)

wiedmann · **Joined:** Wed Jan 09, 2008 4:00 pm **Posts:** 11

I'm trying to implement search highlighting as shown in the Lucene Highlighter example code using TokenSources.getAnyTokenStream(IndexReader reader, int docId, String field, Analyzer analyzer).

It seems like there currently is no way to retrieve the docId from Hibernate Search for use in this call. Are there any plans to project the docId from the hit?

Has anyone else run into this issue and solved it in a different way? Are there any drawbacks to just retrieving the token stream out of the field in the document (by projecting it)?

emmanuel · **Posted:** Wed Jan 09, 2008 7:50 pm

Christian uses the highlighter in the Seam wiki (check the example directory of the Seam 2 distro). I suspect he uses plain Lucene.
I am interested in feedback, though, to help build a better HSearch integration, either by exposing the actual docId or by providing something higher level.

hardy.ferentschik · **Posted:** Thu Jan 10, 2008 9:55 am

Yes, match highlighting would be high on my list of features as well. Is there a feature request for this already?

emmanuel · **Posted:** Thu Jan 10, 2008 10:25 am

nope, go ahead.

wiedmann · **Joined:** Wed Jan 09, 2008 4:00 pm **Posts:** 11

FYI, Christian's code appears to just reparse the indexed text, rather than trying to get the token stream out of Lucene.

Code:

            // Use the same analyzer as the indexer!
            TokenStream tokenStream = new StandardAnalyzer().tokenStream(null, new StringReader(indexedText));

            String unescapedFragements =
                    highlighter.getBestFragments(tokenStream, indexedText, numOfFragments, getFragmentSeparator());

I think projecting the document id would allow use of TokenSources.getAnyTokenStream to get a TokenStream without having to reparse the full text when it is already available in Lucene.

emmanuel · **Posted:** Fri Jan 11, 2008 6:31 am

Open a JIRA issue then.
But the fact that TokenSources.getAnyTokenStream takes an analyzer as a parameter tells me that Lucene actually reparses the text as well. So it does no more and no less work than what Christian did.

wiedmann · **Joined:** Wed Jan 09, 2008 4:00 pm **Posts:** 11

Actually getAnyTokenStream may reparse the document, but only does so if it has to. If it can reconstruct the token stream without reparsing (if the term positions are stored, as I understand it), it does so. I will open a JIRA issue.
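
The two paths described above can be sketched with a toy model. This is plain Java with hypothetical names (`TermVectorTokenSketch`, `fromStoredOffsets`, `byReanalysis`, `anyTokenStream`); it does not use Lucene's classes. It assumes only that a term vector stored with offsets maps each term to its character spans, and that the fallback analyzer lowercases and splits on non-alphanumeric characters.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model only: every name here is hypothetical and NOT part of Lucene.
class TermVectorTokenSketch {

    /** A token plus its character offsets in the original text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + "@" + start + "-" + end;
        }
    }

    /** Path 1: rebuild the stream from stored offsets; no analysis runs. */
    static List<Token> fromStoredOffsets(Map<String, int[][]> termVector) {
        List<Token> tokens = new ArrayList<>();
        for (Map.Entry<String, int[][]> entry : termVector.entrySet()) {
            for (int[] span : entry.getValue()) {
                tokens.add(new Token(entry.getKey(), span[0], span[1]));
            }
        }
        // Term vectors are grouped by term, so restore document order.
        tokens.sort(Comparator.comparingInt(t -> t.start));
        return tokens;
    }

    /** Path 2: fallback, re-tokenize the stored text. */
    static List<Token> byReanalysis(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            tokens.add(new Token(text.substring(start, i).toLowerCase(Locale.ROOT), start, i));
        }
        return tokens;
    }

    /** Prefer stored offsets; reanalyze only when the field has none. */
    static List<Token> anyTokenStream(Map<String, int[][]> termVectorOrNull, String storedText) {
        return termVectorOrNull != null
                ? fromStoredOffsets(termVectorOrNull)
                : byReanalysis(storedText);
    }
}
```

Both paths yield the same tokens when the stored offsets were produced by the same analysis, which is why the analyzer passed to the real call only matters on the fallback path.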

Marx2 · **Joined:** Thu Feb 28, 2008 4:58 am **Posts:** 37

Hello
My search is based on HS and works fine.
I see that HS has a projection for DOCUMENT_ID.
I tried to use highlighting as described above.
In the debugger I can see that StreamTokenizer gets its data correctly, but there is a problem with the score: nothing is highlighted because the scores of the text fragments are always 0 (even though the scores returned by the projection look fine).
My code is below.

Code:

String searchPattern = "text*";
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser(searchPattern, analyzer);
Query luceneQuery = parser.parse(searchPattern);

Session session = (Session) em.getDelegate();
FullTextSession fts = Search.createFullTextSession(session);
Transaction tx = fts.beginTransaction();
FullTextQuery query = fts.createFullTextQuery(luceneQuery, Person.class);

query.setProjection(FullTextQuery.THIS, FullTextQuery.DOCUMENT_ID, FullTextQuery.DOCUMENT, FullTextQuery.SCORE, FullTextQuery.BOOST);
Collection<Object[]> lista = query.list();

SearchFactory searchFactory = fts.getSearchFactory();

String fragmentSeparator = "...";
Fragmenter fragmenter = new SimpleFragmenter();
int numOfFragments = 5;

ReaderProvider readerProvider = searchFactory.getReaderProvider();
DirectoryProvider directoryProvider = searchFactory.getDirectoryProviders(Person.class)[0];
IndexReader reader = readerProvider.openReader(directoryProvider);

luceneQuery.rewrite(reader);
Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(luceneQuery));
highlighter.setTextFragmenter(fragmenter);

for (Object[] zlozony : lista) {
  int docId = Integer.parseInt(zlozony[1].toString());
  Document document = (Document) zlozony[2];
  TokenStream tokenStream = TokenSources.getAnyTokenStream(reader, docId, "name", analyzer);
  org.apache.lucene.document.Field field = document.getField("name");
  String highlight = highlighter.getBestFragments(tokenStream, field.stringValue(), numOfFragments, fragmentSeparator);
}
readerProvider.closeReader(reader);

tx.commit();
fts.close(); //session.close();

Marx2 · **Joined:** Thu Feb 28, 2008 4:58 am **Posts:** 37

I've found the problem: I was using the metacharacter "*" in the query.
Is highlighting supposed to work with metacharacters? (I know it's more of a Lucene problem, but Lucene doesn't have a regular newsgroup or forum, only a mailing list.)

wiedmann · **Joined:** Wed Jan 09, 2008 4:00 pm **Posts:** 11

I'm not an expert, but as I understand it the luceneQuery.rewrite() is supposed to take care of expanding the terms when you're using wildcards. Are you sure you don't have an Analyzer mismatch between your query and your index? I believe that the Analyzer must match in order for the query to work right.
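
Why the rewrite matters for highlighting can be illustrated with a toy model. This is plain Java with hypothetical names (`WildcardHighlightSketch`, `rewrite`, `highlight`), not Lucene's API. It assumes a highlighter that marks a token only when it equals one of the query's terms, and a rewrite step that expands a trailing `*` against the sorted terms of the index.

```java
import java.util.Locale;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: every name here is hypothetical and NOT part of Lucene.
class WildcardHighlightSketch {

    /**
     * "Rewrite": expand a trailing-* pattern into the concrete index terms
     * it matches. The result is returned; the input is left untouched.
     */
    static Set<String> rewrite(String pattern, NavigableSet<String> indexTerms) {
        Set<String> expanded = new TreeSet<>();
        if (!pattern.endsWith("*")) {
            expanded.add(pattern);
            return expanded;
        }
        String prefix = pattern.substring(0, pattern.length() - 1);
        for (String term : indexTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: we are past the prefix range
            }
            expanded.add(term);
        }
        return expanded;
    }

    /** Wrap every space-separated word that equals a query term in <B> tags. */
    static String highlight(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        for (String word : text.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            boolean hit = queryTerms.contains(word.toLowerCase(Locale.ROOT));
            out.append(hit ? "<B>" + word + "</B>" : word);
        }
        return out.toString();
    }
}
```

With the unexpanded pattern `text*` as the only query term, no word in the text is equal to it, so nothing is marked; with the expanded terms the matching words are. In Lucene itself, `Query.rewrite(IndexReader)` returns the rewritten query instead of changing the original in place, so the returned object is the one to pass to `QueryScorer`.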

Marx2 · **Joined:** Thu Feb 28, 2008 4:58 am **Posts:** 37

I use StandardAnalyzer with Polish stop words in both cases (indexing and querying). I will try a simpler analyzer, maybe stopwordAnalyzer?

amin-mc · **Joined:** Wed Oct 03, 2007 2:31 pm **Posts:** 205

Is there a new feature request for search highlighting? I recently did a demo using Hibernate Search, and some of my colleagues said that highlighting would be a great feature. I understand you can do it natively with Lucene, but doing it via Hibernate seems like a nice approach too.

Thanks

emmanuel · **Posted:** Tue Jun 03, 2008 5:39 pm

If someone shows me a compelling API to do it (i.e. much better than the Lucene one), why not :)

eliezerreis · **Joined:** Mon Feb 08, 2010 3:47 pm **Posts:** 3

wiedmann wrote:

I'm not an expert, but as I understand it the luceneQuery.rewrite() is supposed to take care of expanding the terms when you're using wildcards. Are you sure you don't have an Analyzer mismatch between your query and your index? I believe that the Analyzer must match in order for the query to work right.

Well, according to this link http://www.gossamer-threads.com/lists/l ... -dev/42240 the rewrite is necessary. I'm doing everything else correctly, but if I don't rewrite the query after parsing it, highlighting doesn't work.

So, is there some method that Hibernate uses to call rewrite automatically? I don't much like having to retrieve the IndexReader and perform these operations on it myself. I tried flush(), but it doesn't do any rewrite.

If anyone has news about this issue, please share it with us.

sanne.grinovero · **Posted:** Wed Feb 10, 2010 5:30 am

Hello,
I got this working some time ago, but honestly I don't remember how I did it.
If you could show some test code and point out what you're expecting, I'd be more than happy to look into it; a real unit test would be great, but some pseudocode will do.

The sources are packed with very simple examples of working unit tests; pick one as an example:
http://fisheye.jboss.org/browse/Hibernate/search/trunk/src/test/java/org/hibernate/search/test/RamDirectoryTest.java?r=18737