This problem is caused by a bug in Java's native I/O code. There is currently no fix (see the Sun Java bug report linked below).
Below I've included my response on the Lucene mailing list. The work-around is to patch the Lucene code so that read() reads the data in smaller chunks (50 MB works) and then assembles the results into the full buffer; a sketch of that approach follows the quoted message.
----------------------------------------------------
I now know the cause of the problem. Increasing heap space actually breaks Lucene when
reading large indexes.
Details on why can be found here:
http://bugs.sun.com/bugdatabase/view_bu ... id=6478546
Lucene is trying to read a huge block (about 280 MB) at the point of failure. It allocates the
required byte array in MultiSegmentReader.norms(), line 334, just fine; it is only when it
passes this array to RandomAccessFile.readBytes() that it hits an OutOfMemoryError. This is
caused by a bug in the native code behind Java I/O.
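To make the failure pattern concrete, the failing call has roughly the following shape (the file name and array size here are illustrative, not taken from the Lucene source):

import java.io.IOException;
import java.io.RandomAccessFile;

public class SingleHugeRead {
    public static void main(String[] args) throws IOException {
        // The allocation itself succeeds...
        byte[] norms = new byte[280 * 1024 * 1024]; // ~280 MB
        try (RandomAccessFile raf = new RandomAccessFile("segments.nrm", "r")) {
            // ...but a single native read of the whole array is where the
            // OutOfMemoryError appears when the heap is large.
            raf.read(norms, 0, norms.length);
        }
    }
}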
As observed in the bug report, a large heap is actually what triggers the bug. When I reduced
my heap from 1200M to 1000M, the exception was no longer thrown, the code completed correctly,
and the Hibernate Search version of my program reported the correct number of search hits.
This isn't good - I need as much memory as possible because I intend to run my search as
a web service.
The work-around would be to read the file in small chunks, but I am not familiar with the
Lucene code, so I am unsure how that should be done more generally (i.e., does
MultiSegmentReader really need to allocate a buffer of that size?).
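For reference, here is a minimal sketch of the chunked-read work-around described above. The class and method names are illustrative and do not correspond to the actual Lucene patch; the point is simply that one huge native read is replaced by a loop that fills the target array in fixed-size pieces (50 MB here), so the native I/O never sees the full request at once.

import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkedReader {

    // Upper bound per native read; 50 MB worked in my tests.
    private static final int CHUNK_SIZE = 50 * 1024 * 1024;

    // Fills buffer[offset .. offset+length) from the file, issuing the read
    // in CHUNK_SIZE pieces instead of one large call.
    public static void readInChunks(RandomAccessFile file, byte[] buffer,
                                    int offset, int length) throws IOException {
        int remaining = length;
        int position = offset;
        while (remaining > 0) {
            int toRead = Math.min(CHUNK_SIZE, remaining);
            int read = file.read(buffer, position, toRead);
            if (read < 0) {
                throw new IOException("unexpected end of file");
            }
            position += read;
            remaining -= read;
        }
    }
}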