This problem is caused by a bug in Java's native I/O code. There is currently no fix (see the Sun Java bug report linked below).
Below I've included my response on the Lucene mailing list. The work-around is to patch the Lucene code so that read() reads the data in smaller chunks (50 MB works) and then assembles the results into the full buffer; a sketch of that approach follows the quoted message.
----------------------------------------------------
I now know the cause of the problem. Increasing heap space actually breaks Lucene when
reading large indexes.
Details on why can be found here:
http://bugs.sun.com/bugdatabase/view_bu ... id=6478546
Lucene is trying to read a huge block (about 280 MB) at the point of failure. It allocates the
required byte array in MultiSegmentReader.norms(), line 334, just fine; it is only when it
passes this array to RandomAccessFile.readBytes() that it hits an OutOfMemoryError. This is
caused by a bug in the native code behind Java I/O.
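To make the failure pattern concrete, the failing call has roughly the following shape (the file name and array size here are illustrative, not taken from the Lucene source):

import java.io.IOException;
import java.io.RandomAccessFile;

public class SingleHugeRead {
    public static void main(String[] args) throws IOException {
        // The allocation itself succeeds...
        byte[] norms = new byte[280 * 1024 * 1024]; // ~280 MB
        try (RandomAccessFile raf = new RandomAccessFile("segments.nrm", "r")) {
            // ...but a single native read of the whole array is where the
            // OutOfMemoryError appears when the heap is large.
            raf.read(norms, 0, norms.length);
        }
    }
}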
As observed in the bug report, a large heap is actually what triggers the bug. When I reduced
my heap from 1200M to 1000M, the exception was no longer thrown, the code completed correctly,
and the Hibernate Search version of my program reported the correct number of search hits.
This isn't good - I need as much memory as possible because I intend to run my search as
a web service.
The work-around would be to read the file in small chunks, but I am not familiar with the
Lucene code, so I am unsure how that should be done more generally (i.e., does
MultiSegmentReader really need to allocate a buffer of that size?).
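For reference, here is a minimal sketch of the chunked-read work-around described above. The class and method names are illustrative and do not correspond to the actual Lucene patch; the point is simply that one huge native read is replaced by a loop that fills the target array in fixed-size pieces (50 MB here), so the native I/O never sees the full request at once.

import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkedReader {

    // Upper bound per native read; 50 MB worked in my tests.
    private static final int CHUNK_SIZE = 50 * 1024 * 1024;

    // Fills buffer[offset .. offset+length) from the file, issuing the read
    // in CHUNK_SIZE pieces instead of one large call.
    public static void readInChunks(RandomAccessFile file, byte[] buffer,
                                    int offset, int length) throws IOException {
        int remaining = length;
        int position = offset;
        while (remaining > 0) {
            int toRead = Math.min(CHUNK_SIZE, remaining);
            int read = file.read(buffer, position, toRead);
            if (read < 0) {
                throw new IOException("unexpected end of file");
            }
            position += read;
            remaining -= read;
        }
    }
}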