We've narrowed this down a bit. The problem only occurs during transactions that queue a large amount of luceneWork.
In one case we were able to reproduce the problem by updating about 1000 @Indexed entities (each with an average of two @IndexedEmbedded entities) in a single transaction.
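The problem appears to be that the algorithm used in addWorkToQueue is inefficient for large N: the work done per added item seems to grow with the size of the queue, so the total cost grows quadratically. In our case it can get so bad that this runs for days.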
If I understand correctly, this seems to be a known issue per this comment in the source code:
Code:
//TODO with the caller loop we are in a n^2: optimize it using a HashMap for work recognition
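To illustrate what that TODO seems to be describing, here is a minimal sketch of the two approaches. This is not the Hibernate Search code: the Work class, its key() method, and both method names are made up for illustration, and I am assuming that "work recognition" means finding an already-queued work for the same entity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WorkDedupSketch {

    // Hypothetical stand-in for one queued unit of index work.
    static final class Work {
        final String entityType;
        final long id;
        final String payload;

        Work(String entityType, long id, String payload) {
            this.entityType = entityType;
            this.id = id;
            this.payload = payload;
        }

        // Identity of the entity this work refers to (assumed format).
        String key() {
            return entityType + "#" + id;
        }
    }

    // Quadratic variant: every added work scans the whole queue
    // to see whether the same entity is already queued.
    static List<Work> addAllByScan(List<Work> incoming) {
        List<Work> queue = new ArrayList<>();
        for (Work w : incoming) {
            boolean replaced = false;
            for (int i = 0; i < queue.size(); i++) {
                if (queue.get(i).key().equals(w.key())) {
                    queue.set(i, w); // the later work wins
                    replaced = true;
                    break;
                }
            }
            if (!replaced) {
                queue.add(w);
            }
        }
        return queue;
    }

    // Map-based variant: one expected constant-time lookup per added work.
    // LinkedHashMap keeps the position of the first insertion for each key.
    static List<Work> addAllByMap(List<Work> incoming) {
        Map<String, Work> byKey = new LinkedHashMap<>();
        for (Work w : incoming) {
            byKey.put(w.key(), w); // the later work wins
        }
        return new ArrayList<>(byKey.values());
    }

    public static void main(String[] args) {
        List<Work> incoming = new ArrayList<>();
        for (long i = 0; i < 3000; i++) {
            incoming.add(new Work("Book", i % 1000, "v" + i));
        }
        System.out.println(addAllByScan(incoming).size());
        System.out.println(addAllByMap(incoming).size());
    }
}
```

Both variants produce the same queue in the same order; the difference is only in how the existing entry is found.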
Of course, one possible workaround for us is to break the work into multiple smaller transactions. This would be a massive change to our system, but it might be worthwhile in any case because of other issues (such as locking and concurrency) introduced by transactions that are too large.
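For what it's worth, the splitting itself would look roughly like the sketch below. The processInBatches helper and the inOneTransaction callback are hypothetical names; the callback stands in for whatever opens a transaction, applies the updates, and commits in our system. The batch size of 100 is just a guess and would need tuning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchSketch {

    // Splits items into consecutive chunks of at most batchSize and hands
    // each chunk to the callback. Returns the number of chunks processed.
    static <T> int processInBatches(List<T> items, int batchSize,
                                    Consumer<List<T>> inOneTransaction) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            // Copy the chunk so the callback does not hold a view of the full list.
            inOneTransaction.accept(new ArrayList<>(items.subList(from, to)));
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> entityIds = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            entityIds.add(i);
        }
        // Placeholder callback: a real one would begin a transaction,
        // update the entities in the chunk, and commit.
        int batches = processInBatches(entityIds, 100,
                chunk -> System.out.println("would commit " + chunk.size() + " entities"));
        System.out.println(batches + " transactions");
    }
}
```

The obvious cost is that the update is no longer atomic as a whole: if a later batch fails, the earlier ones are already committed, which is part of why this would be such a large change for us.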