Thank you for the response Sanne. It certainly seems like there's a bug somewhere... I'll try that TermVector.YES thing, have to figure out what that is and where to use it :). So here are some very strange things I noticed with Hibernate Search over the weekend. I'll start with more of my Name entity:
Code:
@Fields({
    @Field(boost = @Boost(2.0f)),
    @Field(name = "firstPhonetic", analyzer = @Analyzer(definition = "phonetic"))
})
private String first;

@Fields({
    @Field(name = "full", boost = @Boost(2.0f)),
    @Field(name = "fullPhonetic", analyzer = @Analyzer(definition = "phonetic"))
})
private String fullLowercase;
"first", and other fields like it, get indexed perfectly. For each person, there's a names.first that has a list of all the first names. But the odd thing is that full and fullPhonetic each contain the list of full names TWICE! For example, if a person has 2 names, "Susan Anthony" and "Bruce Willis", the index will contain "names.first:susan bruce", but it will also contain "names.full:susan anthony susan anthony bruce willis bruce willis".
Now this doubling doesn't really cause too much of a problem... the term frequency for a matching term is always double what it should be of course, but otherwise not too big of a deal. However, that bug might have something to do with the far more damaging bug causing the scores to go out of whack. Why is this happening? We have 5 other fields with similar regular and phonetic indexes and the doubling problem doesn't occur. Only on the "full" and "fullPhonetic" fields are the values doubled. Unfortunately, those are our default fields to search on...
The only difference I can see between full and fullPhonetic versus first and firstPhonetic is that I name the full field. Could that be causing problems? This whole big scoring problem doesn't occur if I search for names.first:<first_name> names.last:<last_name>. So there's definitely something fishy going on with full and fullPhonetic...
In Luke, when I hit the "explain" button on the top search result for "names.full:bruce names.full:willis" I get the following. Note that there are 10 names for this top result, 6 of which contain the word bruce and ZERO of which contain the word willis.
Code:
1095709.8750 product of:
  2191419.7500 sum of:
    2191419.7500 weight(names.full:bruce in 3705465), product of:
      0.7018 queryWeight(names.full:bruce), product of:
        7.8596 idf(docFreq=11631, maxDocs=11084675)
        0.0893 queryNorm
      3122531.0000 fieldWeight(names.full:bruce in 3705465), product of:
        3.4641 tf(termFreq(names.full:bruce)=12)
        7.8596 idf(docFreq=11631, maxDocs=11084675)
        114688.0000 fieldNorm(field=names.full, doc=3705465)
  0.5000 coord(1/2)
Ranked #50, below the result above, is a result with only 1 name, "willis bruce":
Code:
31.6756 sum of:
  15.6013 weight(names.full:bruce in 3913601), product of:
    0.7018 queryWeight(names.full:bruce), product of:
      7.8596 idf(docFreq=11631, maxDocs=11084675)
      0.0893 queryNorm
    22.2302 fieldWeight(names.full:bruce in 3913601), product of:
      1.4142 tf(termFreq(names.full:bruce)=2)
      7.8596 idf(docFreq=11631, maxDocs=11084675)
      2.0000 fieldNorm(field=names.full, doc=3913601)
  16.0742 weight(names.full:willis in 3913601), product of:
    0.7124 queryWeight(names.full:willis), product of:
      7.9778 idf(docFreq=10334, maxDocs=11084675)
      0.0893 queryNorm
    22.5646 fieldWeight(names.full:willis in 3913601), product of:
      1.4142 tf(termFreq(names.full:willis)=2)
      7.9778 idf(docFreq=10334, maxDocs=11084675)
      2.0000 fieldNorm(field=names.full, doc=3913601)
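As a sanity check on the two explain outputs, here is a minimal sketch that recomputes tf, idf and fieldWeight from the raw numbers. It assumes Lucene's classic default formulas, tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the class and method names are mine and purely illustrative.

```java
class ExplainCheck {
    // Assumed classic Lucene tf: square root of the raw term frequency.
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // Assumed classic Lucene idf: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(long docFreq, long maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // fieldWeight as broken down by explain: tf * idf * fieldNorm.
    static double fieldWeight(int termFreq, long docFreq, long maxDocs, double fieldNorm) {
        return tf(termFreq) * idf(docFreq, maxDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        long maxDocs = 11084675L;
        // Top result: termFreq=12, fieldNorm=114688 (values copied from the explain output).
        System.out.printf("top result fieldWeight(bruce): %.4f%n",
                fieldWeight(12, 11631, maxDocs, 114688.0));
        // Result ranked #50: termFreq=2, fieldNorm=2.
        System.out.printf("#50 result fieldWeight(bruce): %.4f%n",
                fieldWeight(2, 11631, maxDocs, 2.0));
        System.out.printf("#50 result fieldWeight(willis): %.4f%n",
                fieldWeight(2, 10334, maxDocs, 2.0));
    }
}
```

If those formulas are right, tf and idf come out close to the values in both outputs, so the only factor that differs wildly between the two documents is fieldNorm.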
Clearly the big issue here is the fieldNorm value: it is an astronomical 114688.0000 in the first result, versus 2.0000 in the second, and it pushes the fieldWeight up to 3122531.0000. What's going on with this fieldNorm? I just glanced at chapter 12 of Hibernate Search in Action and it's a bit vague on this. fieldNorm seems to be composed mostly of lengthNorm, about which the book says "its effect is to decrease the scoring contribution of fields with many terms and to increase scoring for shorter fields." Here it seems to be having the exact opposite effect, though not quite: the biggest factor in determining this score is how many names are in the Person's name collection, not how long those names are or how many words each name contains. So my guess is that this is something Hibernate Search does to collections at index time.
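One hypothesis that would fit both fieldNorm values: as far as I understand the Lucene Javadoc for Field.setBoost, when a document has several fields with the same name, their boosts are multiplied together. If every name in the collection is added as its own "names.full" field with boost 2.0, and each is added twice because of the doubling, the combined boost would grow as 2.0 to the power of (2 x number of names). The sketch below is only a toy model of that assumption, not Hibernate Search or Lucene code; the class, method and parameter names are made up, and it ignores the fact that Lucene stores norms in a single lossy byte.

```java
class NormSketch {
    // Hypothetical model: every field instance with the same name contributes
    // its boost as a multiplier, and the length norm is 1 / sqrt(total terms).
    static double fieldNorm(int fieldInstances, double boostPerInstance, int totalTerms) {
        double combinedBoost = Math.pow(boostPerInstance, fieldInstances);
        double lengthNorm = 1.0 / Math.sqrt(totalTerms);
        return combinedBoost * lengthNorm;
    }

    public static void main(String[] args) {
        // One two-word name indexed twice: 2 field instances, 4 terms.
        System.out.println(fieldNorm(2, 2.0, 4));
        // Ten names indexed twice, ASSUMING two words per name: 20 instances, 40 terms.
        System.out.println(fieldNorm(20, 2.0, 40));
    }
}
```

Under this model the single-name document gets exactly the 2.0 seen in the second explain output, and a ten-name document lands in the hundreds of thousands, the same order of magnitude as 114688. The word count per name is a guess on my part, so the second number is only indicative, and I have not verified that this is what actually happens at index time.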