Quote:
In your MassIndexer you commented out the method threadsForIndexWriter(int); however I saw that when you initialize the LuceneBatchBackend you hardcode the number of threads for "concurrent_writers" to 2. You also state that you've seen a performance gain in unusual setups. Can you elaborate on what you meant by "unusual"?
Yes, this parameter is not currently configurable, but we could make it so if we find a good use case.
By "unusual" I mean a situation in which the index-writing and analysis phases are much slower than the data loading for all entities and the work preparation (bridges, text extraction). As the index-writing phase is very fast, that would be a very unusual setup: I can hardly imagine anybody spending a fortune on a super-grid database and then not affording a fast disk. It might happen with very large PDF attachments, where the CPU cost of text analysis could be high, but even then I would expect extracting the text from the PDF to be the more expensive step.
So in the end the final stage is usually not the bottleneck; to be fair I could have hardcoded it to one thread, but two seems fine.
If your mileage varies much, please send feedback.
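For reference, here is a sketch of how the configurable knobs mentioned in this thread are typically set on the MassIndexer. The entity class `Book` and the specific numbers are placeholders; tune them against your own database and hardware:

```java
// Sketch of MassIndexer tuning (Hibernate Search).
// Note: threadsForIndexWriter(int) is the method discussed above and is
// not available; "concurrent_writers" stays at its hardcoded value of 2.
FullTextSession fullTextSession = Search.getFullTextSession(session);
fullTextSession.createIndexer(Book.class)
    .threadsToLoadObjects(4)           // entity-loading threads
    .threadsForSubsequentFetching(8)   // threads for the next pipeline phase
    .batchSizeToLoadObjects(30)        // entities loaded per batch
    .cacheMode(CacheMode.IGNORE)       // don't pollute the 2nd-level cache
    .startAndWait();                   // blocks until indexing completes
```

This is a configuration fragment, not a runnable program: it assumes an open Hibernate `session` and mapped, indexed entities.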
Quote:
Also, when running a profiler I see the entity-loader threads sometimes spend the bulk of their time waiting; what can I do to improve this?
An entity-loader thread might block for several reasons:
- It is waiting for the database to return entities during a query -> you might need to add loading threads, but only if the database can handle the extra concurrency.
- It has finished its share of the work and is waiting for the other threads to end.
- It is faster than the next stage in the pipeline and the pipeline's queue is full -> you might need to add threads to the next phase (threadsForSubsequentFetching(X)), or reduce the number of entity-loading threads (threadsToLoadObjects(Y)).
- During the first seconds it is likely blocked waiting for the primary keys to be loaded.
Generally it's fine that it blocks frequently, as the queue acts as an active buffer between the phases: while the model is fixed, the actual data being fetched varies, so a single phase might be the bottleneck for some seconds and be too fast (blocked, waiting for the others) at other times.
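That buffering behaviour can be modelled with a plain bounded queue. The sketch below is illustrative only (it is not the actual Hibernate Search pipeline): a "loader" thread produces items into a small queue while a slower "next stage" consumes them, so the loader spends most of its time blocked in `put()` — exactly the waiting a profiler would report, and exactly as harmless:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class PipelineDemo {

    // Toy model of the indexing pipeline: a bounded queue sits between
    // the entity-loading stage and the next stage. When downstream is
    // slower, the loader blocks in put(); when the loader is slower,
    // the consumer blocks in take(). Either way all items get through.
    static int runPipeline(int items, int queueCapacity) throws InterruptedException {
        BlockingQueue<Integer> pipeline = new ArrayBlockingQueue<>(queueCapacity);
        AtomicInteger consumed = new AtomicInteger();

        Thread loader = new Thread(() -> {
            for (int id = 0; id < items; id++) {
                try {
                    pipeline.put(id); // blocks while the queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        Thread nextStage = new Thread(() -> {
            for (int i = 0; i < items; i++) {
                try {
                    pipeline.take();  // blocks while the queue is empty
                    Thread.sleep(1);  // simulate slower downstream work
                    consumed.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        loader.start();
        nextStage.start();
        loader.join();
        nextStage.join();
        return consumed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("processed: " + runPipeline(20, 4));
    }
}
```

The point of the sketch: despite the loader blocking most of the time, throughput is governed by the slowest stage, so "loader is waiting" is usually a symptom of a downstream bottleneck, not a loader problem.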
Threads are cheap: just make sure you have enough of them, but not so many that most sit blocked uselessly, and that you aren't killing 1) your database with too many concurrent requests or 2) your memory with far too many threads.
If some threads block frequently, that's ok.