Indexing Unidirectional Associations

neilac333 · **Joined:** Fri Oct 05, 2007 1:17 pm **Posts:** 78

I have a unidirectional many-to-one association between Item and TechnicalCategory classes. An Item contains a single instance of TechnicalCategory, but the same TechnicalCategory can be applied to multiple items. It is possible someone may wish to search on an Item by TechnicalCategory. However, my tests are returning no hits when there absolutely should be.

Here is the association as mapped in my Item class:

Code:

@ManyToOne
@JoinColumn(name = "TECHNICAL_CATEGORY_ID", nullable = false)
@Field(index = Index.TOKENIZED, store = Store.NO)
@IndexedEmbedded
@FieldBridge(impl = TechnicalCategoryStringBridge.class)
public TechnicalCategory getTechnicalCategory() {
   return technicalCategory;
}

In case it matters, TechnicalCategoryStringBridge simply provides a way to represent TechnicalCategory as a String by calling its toString method, which returns the name of the category.

I have added all the trimmings to my TechnicalCategory class like @Indexed, @DocumentId, and @Field on the name, which is what will be searched on.

Here are my questions:

1) Is there anything obviously wrong with my setup?

2) Where do I add @ContainedIn? What other annotations must I add if any?

3) Do I need to make this association bi-directional? In other words, must TechnicalCategory have a reference to Item as well? Such a thing is unnecessary for the purposes of my application, but I am willing to do it if it is necessary for Hibernate Search.

Any insight is appreciated.

Thanks.

emmanuel · **Posted:** Wed Nov 21, 2007 11:23 am

What does your query look like?
Can you check the index (with Luke, for example) to see whether the field is properly indexed?

neilac333 · **Joined:** Fri Oct 05, 2007 1:17 pm **Posts:** 78

I had never heard of Luke, but it is a pretty nifty little tool. It has given me some insight, but I am still at a loss.

Here are the queries I have tried in my test:

Code:

technicalCategory:"Electrical Properties"~0.4
technicalCategory.name:"Electrical Properties"~0.4

And according to Luke, here is an excerpt of the indexes for my Item class (where the two values on each line are the "Field" and "Text" values respectively):

Code:

<_hibernate_class>           mypackage.Item
<technicalCategory>          properties
<technicalCategory.name>     properties
<technicalCategory>          electrical
<technicalCategory.name>     electrical

All of this looks pretty good to me, but of course I also don't know what the indexes are supposed to look like. Perhaps Emmanuel or his team or any other expert on Lucene, which I most certainly am not, can provide some insight.

Thanks.

emmanuel · **Posted:** Wed Nov 21, 2007 7:45 pm

Yes, it looks good.
Note that you don't really need @Field and @FieldBridge when you use @IndexedEmbedded; as it stands, the data ends up being indexed twice, once in technicalCategory and once in technicalCategory.name.

The query should work when you remove ~0.4.
Look for Proximity Searches in http://lucene.apache.org/java/docs/queryparsersyntax.html for more info.
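
To make the idea of a proximity search concrete, here is a minimal plain-Java sketch (no Lucene dependency) of in-order phrase matching with a slop value. It is a simplification: the class and method names are hypothetical, real Lucene slop also tolerates some reordering, and the sketch does not try to reproduce how the query parser treats a fractional value such as 0.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene or Hibernate Search.
final class PhraseSlopSketch {

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // True if the phrase terms occur in order with at most `slop`
    // extra tokens in total between consecutive terms.
    static boolean matches(String text, String phrase, int slop) {
        List<String> doc = tokens(text);
        List<String> terms = tokens(phrase);
        if (terms.isEmpty()) {
            return false;
        }
        for (int start = 0; start < doc.size(); start++) {
            if (!doc.get(start).equals(terms.get(0))) {
                continue;
            }
            int pos = start;   // position of the last matched term
            int gap = 0;       // extra tokens skipped so far
            boolean found = true;
            for (int i = 1; i < terms.size(); i++) {
                int next = -1;
                // Take the earliest later occurrence of the next term.
                for (int j = pos + 1; j < doc.size(); j++) {
                    if (doc.get(j).equals(terms.get(i))) {
                        next = j;
                        break;
                    }
                }
                if (next < 0) {
                    found = false;
                    break;
                }
                gap += next - pos - 1;
                pos = next;
            }
            if (found && gap <= slop) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, matches("Electrical Properties 1234", "Electrical Properties", 0) is true, while "Electrical and Thermal Properties" only matches the same phrase once the slop is raised to 2.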

neilac333 · **Joined:** Fri Oct 05, 2007 1:17 pm **Posts:** 78

I think this goes to show why testing is critical. The ~0.4 came from my setting the "fuzziness" of a fuzzy query, which works for single-term queries. Applying the same suffix to a phrase query, as I do in my test, inadvertently generates a proximity search, which of course will fail.

I suppose then my question becomes how best to build a phrase query. I am aware of the Lucene PhraseQuery class, but I am still getting no results. I am clearly using the class incorrectly.

Given the search I indicated above, let's say there are name values in the database of the form "Electrical Properties 1234" and "Electrical Properties are fun 5678." But...the user does a name search and provides only "Electrical Properties." How do I use the Lucene query API to create a query such that a search on merely "Electrical Properties" yields the two results "Electrical Properties 1234" and "Electrical Properties are fun 5678."

Thanks, and I hope everyone had a wonderful Thanksgiving.

neilac333 · **Joined:** Fri Oct 05, 2007 1:17 pm **Posts:** 78

OK...promise this is the last Lucene question. I have made some headway with phrases, and I have come across something curious.

What is the difference between the following:

name:"Electrical Properties" (+name:Electrical ~0.4 +name:Properties~0.4)

and

+name:"Electrical Properties" (+name:Electrical ~0.4 +name:Properties~0.4)

Basically, to use pseudocode and plain English, I am saying give me everything that has...

"Electrical Properties" in it in precisely that way with the two words side-by-side
OR
Everything with "Electrical" in it (within some fuzziness) AND everything with "Properties" in it (within some fuzziness)

I would think the second query above does this, but that query returns no results while the first one returns what it should. And I should mention that the expected results do indeed have "Electrical Properties" specified in precisely that way, which is why the result is so puzzling.

Thanks.

neilac333 · **Joined:** Fri Oct 05, 2007 1:17 pm **Posts:** 78

I forgot to mention...simply annotating the owner of the association with IndexedEmbedded on the appropriate property and then applying the appropriate annotations to the association object itself did the trick.

Thanks again, Emmanuel.

jgriffin · **Joined:** Fri Mar 04, 2005 4:27 pm **Posts:** 13

Hi,

I see that your data is tokenized, but I don't see which analyzer you are using. Make sure the index is built with the same analyzer you use for querying. Be wary of the StandardAnalyzer; it does funny things behind the scenes. I would start with the SimpleAnalyzer and work from there (being mindful of numbers and case and which analyzers do what).

Use Query.toString() to look at your search tokens and make sure you're doing what you think you are doing.

I think this may be your problem.
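
As a rough illustration of why numbers and case matter, here is a plain-Java sketch that imitates what SimpleAnalyzer does as I understand it: split at every non-letter character and lowercase the rest. The class name is hypothetical and this is not Lucene code; other analyzers (StandardAnalyzer, StopAnalyzer) tokenize differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; imitates letter-only, lowercasing tokenization.
final class LetterLowercaseTokens {

    // Split at every non-letter character and lowercase what remains.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Under these rules, "Electrical Properties 1234" becomes the two terms electrical and properties: the digits are dropped and the case is folded, so a query term such as "Electrical" or "1234" would not match anything unless the query text goes through the same analysis.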

You say:
---------
Given the search I indicated above, let's say there are name values in the database of the form "Electrical Properties 1234" and "Electrical Properties are fun 5678." But...the user does a name search and provides only "Electrical Properties." How do I use the Lucene query API to create a query such that a search on merely "Electrical Properties" yields the two results "Electrical Properties 1234" and "Electrical Properties are fun 5678."
---------

This could be done with a prefix query.

The problem is that since Lucene splits content into individual terms, searching by "Electrical" will return anything with "Electrical" in the content, and the same goes for "Properties".

Cheers

neilac333 · **Joined:** Fri Oct 05, 2007 1:17 pm **Posts:** 78

I didn't even think to look at the analyzer. Thanks for pointing me in that direction.

I studied up on analyzers and decided to try things out with StopAnalyzer. After running a test on the phrase "Carbon Steel," I obtained the following results:

Code:

Bessemer Steel (Simulated) 0.1 % Carbon
Carbon Steel, 0.4 C
Carbon Steel, 0.1 C
Carbon Steel, 1.0 C
Carbon Steel, 0.5 C
Carbon Steel (AISI 1211)

In this case, the results are precisely what I want...but with the "wrong" relevance. I want the first item "Bessemer Steel" to be last. After all, since the user queried on "Carbon Steel," I want the results with precisely that phrase to be ranked first followed by items that contain the two words but not juxtaposed--like the "Bessemer" entry.

Here is my query as generated through the query API:

name:"Carbon Steel" (+name:Carbon~0.4 +name:Steel~0.4)

Please let me know if I should choose a different analyzer or if I should rework my query.

Thanks.

jgriffin · **Joined:** Fri Mar 04, 2005 4:27 pm **Posts:** 13

Hello again,

Now that you understand the analyzer problem let's look at your specific problem.

If I understand you correctly what you want Lucene to do can't be done in a single statement. If you want the phrase "Carbon Steel" to appear first, as is, then you will have to use a PhraseQuery to get those 'Hits'. Save them.

Next, do a BooleanQuery (this is really just separate term queries) for the terms entered, specifying that all terms MUST be present. This will give you a Hits object with results containing both 'carbon' and 'steel' in any order.

If there is a combination of these terms you don't want (e.g. steel followed by carbon), you will have to iterate through the second 'Hits' and skip those entries. Sorry, but that's what you have to work with.
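
The combining step of this two-search approach can be sketched in plain Java. The sketch assumes each search has already been run and reduced to an ordered list of document identifiers (strings here, purely for illustration); the class and method names are hypothetical, and no Lucene API is involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; merges the results of the two searches described above.
final class TwoPassMerge {

    // Phrase hits keep their order and come first. All-terms hits follow,
    // skipping any document that already appeared among the phrase hits.
    static List<String> merge(List<String> phraseHits, List<String> allTermsHits) {
        Set<String> ordered = new LinkedHashSet<>(phraseHits);
        ordered.addAll(allTermsHits); // duplicates keep their earlier position
        return new ArrayList<>(ordered);
    }
}
```

A LinkedHashSet is used because it preserves insertion order while dropping repeats, so documents that match both searches are listed once, in the phrase-hit position.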

Hope this helps.
Please rate the answer.

jgriffin · **Joined:** Fri Mar 04, 2005 4:27 pm **Posts:** 13

Hi,

It is possible that this will work for you. Use a PhraseQuery as follows:

Code:

// Terms must be lowercase to match what SimpleAnalyzer puts in the index.
PhraseQuery query = new PhraseQuery();
query.add(new Term("content", "carbon"));
query.add(new Term("content", "steel"));
query.setSlop(0);

Hits hits = indexSearcher.search(query);

The PhraseQuery matches terms in the order they were added, i.e. "carbon steel" above. setSlop(0) means no words in between and in this order.

When you build the index, use:

Code:

IndexWriter writer =
   new IndexWriter("...", new SimpleAnalyzer(), true);

Document doc = new Document();
// Use the same field name that the query searches ("content" above).
doc.add(new Field("content",
   "Carbon Steel",
   Field.Store.YES,
   Field.Index.TOKENIZED));

The SimpleAnalyzer lowercases every token, which is why the search terms must be converted to lowercase as well.

If you need more you're back to multiple searches and combining.

Cheers
John G.
Please rate the answer.

neilac333 · **Joined:** Fri Oct 05, 2007 1:17 pm **Posts:** 78

Thanks for the tip. I ended up using the StandardAnalyzer and a BooleanQuery that combines a PhraseQuery with several TermQuery instances. Things seem to be working just fine for me now. I am passing all my unit tests.
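
For readers wondering why combining a phrase clause with term clauses can put exact-phrase matches first, here is a plain-Java sketch of the idea. It is not Lucene's scoring formula and not the code used in this thread: the class name, the flat score of 1 for "all terms present", and the phraseBoost parameter are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; a toy stand-in for combined query scoring.
final class PhraseFirstRanking {

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // 0 if any query term is missing; otherwise 1, plus phraseBoost when
    // the query terms appear side by side in the given order.
    static double score(String name, String query, double phraseBoost) {
        List<String> doc = tokens(name);
        List<String> terms = tokens(query);
        if (terms.isEmpty() || !doc.containsAll(terms)) {
            return 0.0;
        }
        boolean adjacent = Collections.indexOfSubList(doc, terms) >= 0;
        return 1.0 + (adjacent ? phraseBoost : 0.0);
    }

    // Keep only matching names and sort them by descending score.
    // List.sort is stable, so equal scores keep their input order.
    static List<String> rank(List<String> names, String query, double phraseBoost) {
        List<String> hits = new ArrayList<>();
        for (String n : names) {
            if (score(n, query, phraseBoost) > 0) {
                hits.add(n);
            }
        }
        hits.sort((a, b) -> Double.compare(
                score(b, query, phraseBoost), score(a, query, phraseBoost)));
        return hits;
    }
}
```

Ranking the six "Carbon Steel" names from earlier in the thread with any positive phraseBoost moves "Bessemer Steel (Simulated) 0.1 % Carbon" to the end, because it contains both words but not as an adjacent phrase.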

Of course, I may need your help again once the customers run the acceptance tests and everything comes out all wrong.

I appreciate your help. Thanks again.