Search results not sorted properly after 5 million documents

mueller · **Joined:** Mon Mar 10, 2008 6:40 pm **Posts:** 114

I don't know whether this is a problem with Hibernate Search or Lucene... Whenever the number of documents gets above roughly 5 million, the results are no longer sorted by relevance. Normally I give a boost to exact matches and no boost to phonetic matches, but past some threshold the results come back in a pretty random order. Everything works perfectly until too many documents are added to the index, and then the results are scrambled...

The first time this happened I figured there was just some corruption in the index. But I've tried many times now, on different machines. I also don't know the exact threshold at which the results stop coming back in order.

Any ideas what's going on?? I'm using Hibernate Search 3.2. I assume Hibernate Search does in fact support >5 million documents... Is there something special I need to do?

mueller · **Joined:** Mon Mar 10, 2008 6:40 pm **Posts:** 114

Is anyone else using Hibernate Search with >5 million instances of an entity? I cannot figure a way to make this work. It seems everything just breaks down at this level. I know companies are using Lucene for large indexes, so I can't imagine the problem is with Lucene... Maybe it's with my use of MultiFieldQueryParser?

Code:

    MultiFieldQueryParser.parse(Version.LUCENE_29, query,
        new String[] {"names.full", "names.fullPhonetic"},
        new BooleanClause.Occur[] {BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD},
        analyzer);

So it seems that the first 5 million or so entities maintain their proper ranking, but then subsequent entities being indexed can go in front of the first 5 million for a particular search without regard to boosts or accuracy (not all words have to be present). The + qualifier works somewhat... not perfect, but close. So if I put the + in front of every term in the query, I get a results page similar to the old results page.

My guesses as to what's happening:

trying to think... I'm lost, I have no idea what could be the issue...

I guess it's time to figure out Luke and see if Luke gives back the same invalid results.

mueller · **Joined:** Mon Mar 10, 2008 6:40 pm **Posts:** 114

Ok, I used Luke and grabbed a snapshot of our search index to my local machine. I think I see what's going on, though I'm not sure why it's doing this.

So our Person entity has multiple names in its names collection:

Code:

    @IndexedEmbedded
    private Set<Name> names;

When I search our database for a name like "bruce willis" it'll return a Person that has many Names like "father douglas", "sam harris" and "bruce lee". That example entity might show up #1, even above an entity with the sole name "bruce willis". It seems to get an enormous score simply because it has 3 names; it doesn't matter that 2 of those names are totally irrelevant and that even the matching name contains only 1 of the search terms. In the examples I've tried, the score correlates directly with the number of names! Why the heck the number of names matters I have no idea...

To put this into context, the above example with 4 irrelevant names and 1 partially matching name has a score of 571, while a person with only 1 fully matching name of "bruce willis" has a score of 32. If there are 5 irrelevant names with 1 partially matching name, the score jumps into the thousands. 6 irrelevant names and 1 partial match give a score of 24,000 (in 1 example I have). So there's an exponential relationship here. Oh, and if I only search for "bruce" in the example above, the scores triple.

So how does all this work exactly? How do I get this to work the expected way? On page 105 of Hibernate Search in Action, it seemed like Lucene would just have a series of strings for all the names...

sanne.grinovero · **Posted:** Mon Jun 07, 2010 12:47 pm

Hi mueller,
sorry, I'm traveling so I can't confirm my thoughts, but maybe I can give some ideas.
Chapter 12 should explain the scoring formulas. As far as I remember it should be the other way around (the case with only one match should score highest), but I might be wrong. Otherwise you might have hit a bug in Lucene (which version?), or are you using a custom Similarity?
If not, you might actually try customizing the Similarity implementation.
Also, please try the effect of using TermVector.YES.

mueller · **Joined:** Mon Mar 10, 2008 6:40 pm **Posts:** 114

Thank you for the response Sanne. It certainly seems like there's a bug somewhere... I'll try that TermVector.YES thing, have to figure out what that is and where to use it :). So here are some very strange things I noticed with Hibernate Search over the weekend. I'll start with more of my Name entity:

Code:

    @Fields({@Field(boost = @Boost(2.0f)),
             @Field(name = "firstPhonetic", analyzer = @Analyzer(definition = "phonetic"))})
    private String first;
    @Fields({@Field(name = "full", boost = @Boost(2.0f)),
             @Field(name = "fullPhonetic", analyzer = @Analyzer(definition = "phonetic"))})
    private String fullLowercase;

"first", and other fields like it, get indexed perfectly. For each person, there's a names.first that has a list of all the first names. But the odd thing is that full and fullPhonetic have the list of full names TWICE! For example, if a person has 2 names, "Susan Anthony" and "Bruce Willis", the index will contain "names.first:susan bruce", but it will also contain "names.full:susan anthony susan anthony bruce willis bruce willis".

Now this doubling doesn't really cause too much of a problem... the term frequency for a matching term is always double what it should be of course, but otherwise not too big of a deal. However, that bug might have something to do with the far more damaging bug causing the scores to go out of whack. Why is this happening? We have 5 other fields with similar regular and phonetic indexes and the doubling problem doesn't occur. Only on the "full" and "fullPhonetic" fields are the values doubled. Unfortunately, those are our default fields to search on...

The only difference I can see between full and fullPhonetic versus first and firstPhonetic is that I name the full field. Could that be causing problems? This whole big scoring problem doesn't occur if I search for names.first:<first_name> names.last:<last_name>. So there's definitely something fishy going on with full and fullPhonetic...

In Luke, when I hit the "explain" button on the top search result for "names.full:bruce names.full:willis" I get the following. Note that there are 10 names for this top result, 6 of which contain the word bruce and ZERO of which contain the word willis.

Code:

1095709.8750 product of:
  2191419.7500 sum of:
    2191419.7500 weight(names.full:bruce in 3705465), product of:
      0.7018 queryWeight(names.full:bruce), product of:
        7.8596 idf(docFreq=11631, maxDocs=11084675)
        0.0893 queryNorm
      3122531.0000 fieldWeight(names.full:bruce in 3705465), product of:
        3.4641 tf(termFreq(names.full:bruce)=12)
        7.8596 idf(docFreq=11631, maxDocs=11084675)
        114688.0000 fieldNorm(field=names.full, doc=3705465)
  0.5000 coord(1/2)

Ranked #50 below this result we have this result with only 1 name of "willis bruce":

Code:

31.6756 sum of:
  15.6013 weight(names.full:bruce in 3913601), product of:
    0.7018 queryWeight(names.full:bruce), product of:
      7.8596 idf(docFreq=11631, maxDocs=11084675)
      0.0893 queryNorm
    22.2302 fieldWeight(names.full:bruce in 3913601), product of:
      1.4142 tf(termFreq(names.full:bruce)=2)
      7.8596 idf(docFreq=11631, maxDocs=11084675)
      2.0000 fieldNorm(field=names.full, doc=3913601)
  16.0742 weight(names.full:willis in 3913601), product of:
    0.7124 queryWeight(names.full:willis), product of:
      7.9778 idf(docFreq=10334, maxDocs=11084675)
      0.0893 queryNorm
    22.5646 fieldWeight(names.full:willis in 3913601), product of:
      1.4142 tf(termFreq(names.full:willis)=2)
      7.9778 idf(docFreq=10334, maxDocs=11084675)
      2.0000 fieldNorm(field=names.full, doc=3913601)

Clearly the big issue here is the fieldNorm value... it is an astronomical 114688.0000 in the first result (versus 2.0000 in the second), which pushes the fieldWeight to 3122531.0000. What's going on with this fieldNorm? I just glanced at chapter 12 of Hibernate Search in Action and it's a bit vague on this. It seems to be mostly composed of lengthNorm, which is described like this: "its effect is to decrease the scoring contribution of fields with many terms and to increase scoring for shorter fields." This seems to be having the exact opposite effect, but not really... again, the biggest factor in determining this score is how many names are in the Person's name collection, not necessarily how long those names are or how many words there are in each name. So my guess is that this is something Hibernate Search is doing to collections at index time.
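To make the arithmetic in the two explain outputs easier to follow, here is a minimal sketch that recomputes both scores from the numbers Luke printed. It assumes Lucene's default scoring (tf is the square root of the term frequency, and each term contributes queryWeight times fieldWeight); the class and method names are made up for illustration and are not part of Lucene or Hibernate Search.

```java
// Hypothetical helper: recomputes the Luke "explain" numbers quoted above.
class ExplainCheck {
    // Default Lucene tf: square root of the raw term frequency.
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // fieldWeight = tf * idf * fieldNorm, as laid out in the explain tree.
    static double fieldWeight(int termFreq, double idf, double fieldNorm) {
        return tf(termFreq) * idf * fieldNorm;
    }

    // One term's contribution: queryWeight (idf * queryNorm) * fieldWeight.
    static double termScore(int termFreq, double idf, double queryNorm, double fieldNorm) {
        return idf * queryNorm * fieldWeight(termFreq, idf, fieldNorm);
    }

    public static void main(String[] args) {
        double queryNorm = 0.0893;
        // Top hit: only "bruce" matches, so coord(1/2) = 0.5 is applied.
        double top = 0.5 * termScore(12, 7.8596, queryNorm, 114688.0);
        // Hit #50: both terms match, each with a fieldNorm of 2.0.
        double lower = termScore(2, 7.8596, queryNorm, 2.0)
                     + termScore(2, 7.9778, queryNorm, 2.0);
        System.out.printf("top=%.1f lower=%.4f ratio=%.0f%n", top, lower, top / lower);
    }
}
```

The inputs are rounded to four decimals, so the results only match the explain output to within a fraction of a percent. The point is that tf and idf are nearly the same for both hits; the difference in score comes almost entirely from the fieldNorm of 114688 versus 2.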

sanne.grinovero · **Posted:** Mon Jun 07, 2010 3:53 pm

I agree it looks fishy. It would be of great value if you could
* open an issue so that I won't forget this when I'm back
* attach a minimal test case so that we can reproduce - if we can't reproduce it's hard to fix

mueller · **Joined:** Mon Mar 10, 2008 6:40 pm **Posts:** 114

Ok, so we finally figured out some things including a workaround! I'm happy about that... Basically it all comes down to those @Boost annotations. They screw everything up for us. So if we take them out and simply add the Boosts to the search query at run-time, all is well!
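The thread does not show the actual query code used for this workaround, so the following is only a sketch of one way to apply boosts at query time: build the query string with Lucene's `term^boost` syntax instead of relying on index-time @Boost. The class name, method name, and boost values are hypothetical, and the sketch assumes the input contains plain words with no query-syntax characters that would need escaping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: expands user input into per-field clauses with query-time boosts.
class BoostedQueryBuilder {
    // Produces "field:term^boost" for every term/field pair; a boost of 1.0 is omitted.
    static String build(String userInput, Map<String, Float> fieldBoosts) {
        StringBuilder sb = new StringBuilder();
        for (String term : userInput.trim().toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input yields an empty query string
            }
            for (Map.Entry<String, Float> e : fieldBoosts.entrySet()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(e.getKey()).append(':').append(term);
                if (e.getValue() != 1.0f) {
                    sb.append('^').append(e.getValue());
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<String, Float>();
        boosts.put("names.full", 2.0f);         // exact matches count double
        boosts.put("names.fullPhonetic", 1.0f); // phonetic matches are not boosted
        System.out.println(build("Bruce Willis", boosts));
    }
}
```

The resulting string can be handed to a regular query parser. Lucene's MultiFieldQueryParser also has a constructor that accepts a map of per-field boosts, which may be a tidier route than string building if it fits the existing code.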

Is there a bug in the @Boost annotations with Hibernate Search? Interestingly, what happens is it only screws things up if we add multiple names to a person. It doesn't matter if we have only 1 name but with many words. For example, we might have a person with these names:

Code:

name 1: Bruce Willis
name 2: Bruce Willies
name 3: Brucester Willius

@Boost annotations make those collections' scores all screwy.

However, this name:

Code:

name 1: Bruce Willis Willies Brucester Willius

That works perfectly fine. So Hibernate Search does something special to the index beyond simply concatenating all the words of the names in a collection. And that something special appears to be the bug.

We will update the JIRA.

sanne.grinovero · **Posted:** Fri Jun 18, 2010 9:38 pm

Hi mueller,

Quote:

We will update the JIRA.

Did you create one?

Quote:

So Hibernate Search does something special to the index beyond simply concatenating all the words of the names in a collection.

Hibernate Search will just add the different values several times using the same field name (that's possible in Lucene and should be similar to concatenating; I guess it differs regarding positions and term vectors). Splitting each name into terms is then done by the configured Analyzer as usual.
So while I'm still planning to review this and would appreciate a test, I think this issue might need to be forwarded to Lucene's issue tracker.
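One possible explanation for the numbers seen earlier in the thread, offered as an assumption rather than a confirmed diagnosis: the Lucene Javadoc for field boosts states that when a document has several fields with the same name, their boosts are multiplied together. If each name in the collection is added as its own `names.full` field instance with a boost of 2.0, the combined boost would grow as a power of two with the number of names. The sketch below models that; the class and method names are hypothetical, and the model ignores the fact that Lucene stores norms in a single byte with coarse precision.

```java
// Hypothetical model: per-instance boosts multiply, lengthNorm = 1 / sqrt(total terms).
class FieldNormSketch {
    static double fieldNorm(double boostPerInstance, int instances, int totalTerms) {
        return Math.pow(boostPerInstance, instances) / Math.sqrt(totalTerms);
    }

    public static void main(String[] args) {
        // Assumes two-word names and the doubling reported above (each name indexed twice).
        for (int names = 1; names <= 10; names++) {
            int instances = 2 * names;
            int totalTerms = 2 * instances;
            System.out.printf("%2d names -> fieldNorm ~ %.1f%n",
                    names, fieldNorm(2.0, instances, totalTerms));
        }
    }
}
```

Under these assumptions a person with a single name gets 2^2 / sqrt(4) = 2.0, which matches the fieldNorm of 2.0000 in the second explain output, and a person with ten names lands in the hundred-thousands, the same order of magnitude as the 114688 in the first. That would also be consistent with a single long name scoring normally while several short names do not.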

eborix13 · **Joined:** Sat Feb 14, 2009 7:49 am **Posts:** 8

Hi Sanne,

Here is the JIRA link with a test case attached to it.
http://opensource.atlassian.com/project ... SEARCH-542