File Descriptor Leak

lovelyliatroim · **Joined:** Thu Oct 08, 2009 10:34 am **Posts:** 55

Also to add on to this.

I also see doubling when I do the following:

1. Create the index for class Foo (using the mass indexer). Let's say it has a size of 13MB.
2. Modify 100 records of type Foo and reindex just the 100.
3. Call optimize for that index.
4. I now see my index has doubled to 26MB although no new records were added, just modified, and optimize has been called!!

When I open the index in Luke I can see "deletions/optimized" = "No/No".

So no deletions are present in the index, but Luke tells me it isn't optimized!!

sanne.grinovero · **Posted:** Mon Aug 06, 2012 10:44 am

Hi LL,
HSEARCH-842 only affects Hibernate Search 3.x; in Hibernate Search 4.x it can't be the same problem.

In version 4.1.0 there was a file leak, which was the reason for us to release 4.1.1.
If you are already running 4.1.0, you don't need to make any changes to upgrade to 4.1.1.

http://in.relation.to/Bloggers/HibernateSearch411FinalFixFileHandleLeakAndMore

Regarding optimising the index, the weirdness you see is likely caused by the file leak. In any case, it's no longer recommended to optimise the index regularly. We still support the option, mostly so that the choice remains yours, but keep in mind that the latest mergers, segment readers and caches do a very nice job transparently. I'll open a JIRA to clarify this in the docs.

lovelyliatroim · **Joined:** Thu Oct 08, 2009 10:34 am **Posts:** 55

Hi Sanne,

I see it also with the version 4.1.1 !!

LL

sanne.grinovero · **Posted:** Tue Aug 07, 2012 5:40 am

lovelyliatroim wrote:

Hi Sanne,

I see it also with the version 4.1.1 !!

LL

Hi LL, what problem exactly are you seeing? Sorry, I'm getting confused by the different details of this thread.

Keep in mind that I would not be concerned about the index not being optimised: this is likely related to the way Lucene now behaves, and I guess even the Luke tool might get confused about it. Anyway, even if it's not being completely optimised, that is not going to affect performance.

But if you have a file leak, I would be glad to look at it, as that would be embarrassing. Are you aware that the exclusive_index_use option is now enabled by default? With this option, the IndexWriter is kept open until you shut down the SessionFactory; that provides a very significant performance boost at the cost of keeping file descriptors open until shutdown. So you might be seeing these?
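
One way to tell such intentionally open files apart from a leak is to switch the option off for a test run and compare the open descriptors. A minimal sketch, assuming the per-index property key `hibernate.search.default.exclusive_index_use` from the Hibernate Search configuration reference; the class and method names are made up for illustration:

```java
import java.util.Properties;

class ExclusiveIndexUseConfig {

    // Returns a copy of the given settings with exclusive index use switched off
    // for the "default" index configuration. The key is taken from the Hibernate
    // Search configuration reference; verify it against the version you run.
    static Properties withNonExclusiveIndexUse(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        props.setProperty("hibernate.search.default.exclusive_index_use", "false");
        return props;
    }

    public static void main(String[] args) {
        Properties props = withNonExclusiveIndexUse(new Properties());
        System.out.println(props.getProperty("hibernate.search.default.exclusive_index_use"));
    }
}
```

The resulting `Properties` can be passed to `Persistence.createEntityManagerFactory` in the same way as the other settings. This trades indexing throughput for descriptors that are released earlier, so it is meant as a diagnostic setting rather than a recommendation.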

lovelyliatroim · **Joined:** Thu Oct 08, 2009 10:34 am **Posts:** 55

Hi Sanne,

Maybe it is just a misunderstanding of how I expect HS to work.

OK, to refresh you on this: let's say I have 500 docs which get indexed with a master configuration. On the first run, the index starts empty. A batch index occurs and the master index is, say, 163K. I can do multiple batch index runs in the same lifecycle, and the master index still stays at 163K at the end of the run.

Now I shut down and do a second run, this time starting with the master index from the first run, with a size of 163K. I do a batch index of 500 records again, but at the end of this run the master index has a size of 319K. Almost double!! My question is why?? I would have expected the index to be roughly the same size as after the first run.

Here is the test case adapted for 4.1.1 so you can see if I have an error in my thinking....

Code:

import java.util.Properties;
import java.util.UUID;

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

import org.hibernate.Session;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;
import org.junit.Before;
import org.junit.Test;


public class FileLeakTest {

   private EntityManagerFactory emf = null;
   
   @Before
   public void setup(){
      emf = Persistence.createEntityManagerFactory("cors",getProperties());
   }
   
   private static Properties getProperties(){
      Properties props = new Properties();
      props.put("hibernate.connection.driver_class","org.h2.Driver");
      props.put("hibernate.dialect","org.hibernate.dialect.H2Dialect");
      props.put("hibernate.cache.provider_class","org.hibernate.cache.NoCacheProvider");
      props.put("hibernate.jdbc.charSet","UTF-8");
      props.put("hibernate.hbm2ddl.auto","create-drop");
      
      props.put("hibernate.connection.url","jdbc:h2:mem:test");
      props.put("hibernate.connection.username","sa");
      props.put("hibernate.connection.password","");

      props.put("hibernate.search.default.sourceBase","C:/lucene/fileleaktest/shared");
      props.put("hibernate.search.default.indexBase","C:/lucene/fileleaktest/master");
      props.put("hibernate.search.default.refresh","300");
      props.put("hibernate.search.default.directory_provider","filesystem-master");
      
      return props;
   }
   
   
   public FullTextSession getFulltextSession(){
      EntityManager em = emf.createEntityManager();
      Session session = (Session) em.getDelegate();
      FullTextSession fullTextSession = Search.getFullTextSession(session);
      return fullTextSession;
   }
   
   
   public void createDocuments(int size){
      FullTextSession session = getFulltextSession();
      try{
         session.getTransaction().begin();
         for(int i = 0; i < size; i++){
            HibDocument d = new HibDocument(UUID.randomUUID().toString(),UUID.randomUUID().toString(),UUID.randomUUID().toString());
            session.persist(d);
         }
         session.flush();
         session.getTransaction().commit();

      }finally{
         session.close();
      }
   }
   
   
   public void batchIndex(){
      System.out.println("About to reindex");
      FullTextSession session = getFulltextSession();
      try {
         // tried with and without a transaction: same effect
         session.getTransaction().begin();
         session.createIndexer().batchSizeToLoadObjects(30)
         .threadsForSubsequentFetching(4)
         .threadsToLoadObjects(2)
         .startAndWait();
         session.getTransaction().commit();
         
      } catch (InterruptedException e) {
         e.printStackTrace();
      }finally{
         session.close();
      }
   }
   
   
   @Test
   public void testForLeaks(){
      // On a second run the documents are re-created (in-memory DB). Not ideal; a file-based DB could be used here, but I don't see this as an issue and can adjust it.
      createDocuments(500);
      System.out.println("Created Docs");
      int loops = 5;
      int counter = 0;
      while(counter < loops){
         try {
            batchIndex();
            Thread.sleep(5000); // sleep() is static, so call it on the class
         } catch (InterruptedException e) {
            e.printStackTrace();
         }
         counter++;
      }
      System.out.println("Finito");
   }

   
}

When I run the above once with an empty index, I see a master size of

Quote:

$ du -h .
163K .

ls -l
total 163
----------+ 1 0 mk 38896 Aug 8 14:05 _5.fdt
----------+ 1 0 mk 4004 Aug 8 14:05 _5.fdx
----------+ 1 0 mk 51 Aug 8 14:05 _5.fnm
----------+ 1 0 mk 15536 Aug 8 14:05 _5.frq
----------+ 1 0 mk 1504 Aug 8 14:05 _5.nrm
----------+ 1 0 mk 8500 Aug 8 14:05 _5.prx
----------+ 1 0 mk 991 Aug 8 14:05 _5.tii
----------+ 1 0 mk 78875 Aug 8 14:05 _5.tis
----------+ 1 0 mk 20 Aug 8 14:05 segments.gen
----------+ 1 0 mk 240 Aug 8 14:05 segments_c
----------+ 1 0 mk 0 Aug 8 14:05 write.lock

Now I run the test case for the second time; note that the master index from the first run already exists. After it completes I see

Quote:

du -h .
319K .

ls -l
total 319
----------+ 1 0 mk 38896 Aug 8 14:05 _5.fdt
----------+ 1 0 mk 4004 Aug 8 14:05 _5.fdx
----------+ 1 0 mk 15536 Aug 8 14:05 _5.frq
----------+ 1 0 mk 1504 Aug 8 14:05 _5.nrm
----------+ 1 0 mk 8500 Aug 8 14:05 _5.prx
----------+ 1 0 mk 78875 Aug 8 14:05 _5.tis
----------+ 1 0 mk 38896 Aug 8 14:09 _b.fdt
----------+ 1 0 mk 4004 Aug 8 14:09 _b.fdx
----------+ 1 0 mk 51 Aug 8 14:09 _b.fnm
----------+ 1 0 mk 15547 Aug 8 14:09 _b.frq
----------+ 1 0 mk 1504 Aug 8 14:09 _b.nrm
----------+ 1 0 mk 8500 Aug 8 14:09 _b.prx
----------+ 1 0 mk 1008 Aug 8 14:09 _b.tii
----------+ 1 0 mk 78888 Aug 8 14:09 _b.tis
----------+ 1 0 mk 20 Aug 8 14:09 segments.gen
----------+ 1 0 mk 240 Aug 8 14:09 segments_n
----------+ 1 0 mk 0 Aug 8 14:05 write.lock

Now I would have expected the master index size to stay around 163K and not nearly double. (Also bear in mind that I have trimmed down this scenario; I see this with a 10g index being doubled to 20g, which is not ideal.)
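
To see where the extra space sits, the file sizes from the second listing can be added up per segment. The following is only a sketch: it assumes that per-segment files are named `_<name>.<ext>` as in the listing, and the class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class SegmentSizes {

    // Assumption: per-segment files are named "_<name>.<ext>", as in the listing.
    // Anything else (segments_N, segments.gen, write.lock) is grouped as "(other)".
    static String segmentOf(String fileName) {
        if (!fileName.startsWith("_")) {
            return "(other)";
        }
        int end = 1;
        while (end < fileName.length() && Character.isLetterOrDigit(fileName.charAt(end))) {
            end++;
        }
        return fileName.substring(0, end);
    }

    // Sums file sizes (in bytes) per segment name.
    static Map<String, Long> bytesPerSegment(Map<String, Long> fileSizes) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> entry : fileSizes.entrySet()) {
            totals.merge(segmentOf(entry.getKey()), entry.getValue(), Long::sum);
        }
        return totals;
    }

    // File names and sizes copied from the second "ls -l" listing.
    static Map<String, Long> secondRunListing() {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("_5.fdt", 38896L);
        files.put("_5.fdx", 4004L);
        files.put("_5.frq", 15536L);
        files.put("_5.nrm", 1504L);
        files.put("_5.prx", 8500L);
        files.put("_5.tis", 78875L);
        files.put("_b.fdt", 38896L);
        files.put("_b.fdx", 4004L);
        files.put("_b.fnm", 51L);
        files.put("_b.frq", 15547L);
        files.put("_b.nrm", 1504L);
        files.put("_b.prx", 8500L);
        files.put("_b.tii", 1008L);
        files.put("_b.tis", 78888L);
        files.put("segments.gen", 20L);
        files.put("segments_n", 240L);
        files.put("write.lock", 0L);
        return files;
    }

    public static void main(String[] args) {
        bytesPerSegment(secondRunListing())
                .forEach((segment, bytes) -> System.out.println(segment + ": " + bytes + " bytes"));
    }
}
```

For the second listing this gives 147315 bytes for `_5` and 148398 bytes for `_b`, so the directory holds two almost equally sized sets of segment files. The sketch only adds up sizes; it does not explain why the older `_5` files are still on disk.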

Is this expected behaviour from HS or is this a bug? As I said, my thinking is that if I index 500 records, the index should be roughly the same size every time (assuming no big fluctuation in what the 500 records store).

Hope that's clearer for you.

Thanks for the support,
LL

sanne.grinovero · **Posted:** Wed Aug 08, 2012 11:01 am

Hi,
so you're adding 500 elements as the first step of your test. These are stored in the database, and indexed.

When you shutdown the first test run, the data is gone as you are using an in memory database, but the index is not cleaned up so it still contains the 500 elements.

Then you start it again, run the same test: you have now 500 elements in the database, but 1000 in the index: note that you're using UUIDs - I guess for the DocumentId too - so the ids are different than your first run and the new batch of 500 isn't replacing the original set of documents.

lovelyliatroim · **Joined:** Thu Oct 08, 2009 10:34 am **Posts:** 55

Quote:

When you shutdown the first test run, the data is gone as you are using an in memory database, but the index is not cleaned up so it still contains the 500 elements.

Yes

Quote:

Then you start it again, run the same test: you have now 500 elements in the database, but 1000 in the index: note that you're using UUIDs - I guess for the DocumentId too - so the ids are different than your first run and the new batch of 500 isn't replacing the original set of documents.

Yes, that's what I thought, but when I ran it a third time, what would you expect to see?? It stays the same size as after two runs, so it's not growing every time I restart it. I assumed this was because a new mass index purges everything that has gone before and starts with a clean slate. That's the only way I could explain it not growing with each new run after the second. Maybe I was wrong with this assumption; I didn't look under the hood. I would also assume that, since the id is generated, a clean restart of H2 would use the same sequence.

I will adjust the test case to use "fixed ids" just for fun and I'll report back if it changes, but going in the direction you're suggesting, I would expect the size to increase again on the third run, and then on the 4th, etc. I don't see that!!
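
A possible way to get such "fixed ids" is to derive each one from the loop index instead of calling `UUID.randomUUID()`. This is a sketch only: `fixedId` and the `"doc-"` prefix are made up, and since `HibDocument` is not shown in the thread, which of its constructor arguments (if any) is the document id is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class FixedIds {

    // Name-based UUID: the same input bytes always produce the same UUID,
    // so document i gets the same id on every run of the test.
    static String fixedId(int i) {
        byte[] name = ("doc-" + i).getBytes(StandardCharsets.UTF_8);
        return UUID.nameUUIDFromBytes(name).toString();
    }

    public static void main(String[] args) {
        // Stable across calls and across JVM restarts.
        System.out.println(fixedId(0) + " / " + fixedId(0));
        // Random UUIDs, as used in createDocuments(), differ on every call.
        System.out.println(UUID.randomUUID() + " / " + UUID.randomUUID());
    }
}
```

In `createDocuments` the id argument would then become `FixedIds.fixedId(i)`, so that a second run writes documents with the same ids as the first.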

sanne.grinovero · **Posted:** Wed Aug 08, 2012 12:29 pm

You're right: the MassIndexer starts its job with a purge operation to clean up the existing index, so you should not be accumulating the documents of the previous run. They can still take up disk space, though, as it was previously reserved for the index.

This might be useful to read, especially the note about "additional transient disk usage"
http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/store/MMapDirectory.html

The memory-mapped Directory aggressively reserves space but is not eager to give it back to the system. When the MassIndexer starts, the space is certainly not reclaimed immediately, even though a purge is performed. To attempt to reclaim it, Hibernate Search issues an optimize command to the backend after the initial purge, but this might not be effective on all systems.

I wouldn't be too concerned about Lucene occasionally taking twice the expected space: that's a requirement you have anyway to be able to merge new segments appropriately. Of course this is a problem if you observe 3x, 4x, and higher disk space consumption, but do you? I can't reproduce it on my Linux workstation. Are you on Windows?

lovelyliatroim · **Joined:** Thu Oct 08, 2009 10:34 am **Posts:** 55

Quote:

I wouldn't be too concerned about Lucene occasionally taking twice the expected space: that's a requirement you have anyway to be able to merge new segments appropriately.

OK, this is probably what I see. However, when I optimize the index I would expect it to go back to roughly its original size. Or maybe this is where my understanding is wrong. I take it this isn't the case???

Also, in the test case's first run I do a mass index 5 times, but when it is finished the index is always the same size. So there is no issue with merging and extra space needed during the first run. What I mean is that I would expect segments to need merging when doing the batch index over and over, yet at the end of the run the size is always the same.

Quote:

Of course this is a problem if you observe 3x, 4x, and higher disk space consumption, but do you?

No, just double.

Quote:

I can't reproduce it on my Linux workstation. Are you on Windows?

I'm on Windows, but our test/prod environments are Linux. I shall run the test case on our Linux box just to see it for myself, but what I do see in our Linux environments is that our index of 15g doubles to 30g over a period of time. That's why I'm looking at this. I would expect the master to expand when updates and new entries are applied, but once optimized I would expect it to be not too far off the original index size.

lovelyliatroim · **Joined:** Thu Oct 08, 2009 10:34 am **Posts:** 55

Quote:

I can't reproduce it on my Linux workstation.

I gave it a run on the Mac and you're right, I don't see it. So it seems the test case only shows the problem on Windows. Like I said, I see doubling of our index on our Linux boxes and I'm just trying to understand why. I thought the test case was a lead, but it's back to the drawing board. I will see if I can reproduce this behaviour on one of our Linux boxes.

Cheers Sanne,
LL

sanne.grinovero · **Posted:** Thu Aug 09, 2012 12:50 pm

thanks LL!

Before you dig too deep, did you already try 4.1.1 in production? As I mentioned, we *had* a leak before.