Hi Sanne, thank you for the reply.
s.grinovero wrote:
Quote:
Do these numbers sound about like what others are seeing?
I've seen people with much worse numbers complaining here, so this is a good start.
Personally I am going way faster, but we can't compare the numbers, as most of the cost depends on your object graph's complexity and the possibility of caching parts of it.
Just as a reference, my object graph is quite complex and the time needed to reindex 6 million documents is around 20 seconds on my dual-core laptop.
Wow! That is very fast. Can you comment on how many fields you are indexing per document? And how deep is your object graph, i.e., when you load one object via Hibernate, how many sub-objects are there?
Quote:
What is the code you're using to rebuild the index? What is described in the book is good and full of tips. I hope you've seen the IndexWriter settings in the reference too.
Here is the current code I am using; it runs in a class extending the Spring DAO helper base class. I have verified that I do not have the n+1 selects problem: everything is loaded from a single SQL select, and the selects are, for all intents and purposes, instant. My object graph is also not very deep; generally all properties are simple (int/string/long), with one or two one-to-many or many-to-many associations. As I mentioned, I index between 5 and 20 fields per document; the dominant documents only have around 10 fields indexed, with maybe 25% of the fields using the default @Field annotation and the rest using @Field combined with Index.UN_TOKENIZED.
Code:
@Override
public void buildSearchIndex() {
    // If we are not an indexed class return
    if (this.eventClass.getAnnotation(Indexed.class) == null) {
        return;
    }
    getHibernateTemplate().execute(new HibernateCallback() {
        @Override
        public Object doInHibernate(Session session) throws HibernateException,
                SQLException {
            FullTextSession fullTextSession = Search.getFullTextSession(session);
            Criteria crit = fullTextSession.createCriteria(eventClass);
            int flushSize = 10000;
            int pageSize = 1000;
            int i = 0;
            List<E> results;
            do {
                crit.setFirstResult(i);
                crit.setMaxResults(pageSize);
                results = crit.list();
                for (E entity : results) {
                    fullTextSession.index(entity);
                }
                i += pageSize;
                // Flush every flushSize entities so index changes reach disk
                // before the commit (i, not pageSize, is the running total)
                if (i % flushSize == 0) {
                    fullTextSession.flushToIndexes();
                    // Release the indexed entities from the session so the
                    // persistence context does not grow across the whole run
                    fullTextSession.clear();
                }
            } while (results.size() > 0);
            return null;
        }
    });
}
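For what it's worth, the page-and-flush cadence in that loop can be sanity-checked in isolation. The following is a minimal standalone sketch, not Hibernate code: `fetchPage` is a hypothetical stand-in for the Criteria query, and a counter stands in for `flushToIndexes()`. It assumes pages of 1,000 and a flush every 10,000 entities:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the page-and-flush loop; fetchPage() stands in for
// the Criteria query, and a counter stands in for flushToIndexes().
public class PagedIndexing {
    static final int PAGE_SIZE = 1000;
    static final int FLUSH_SIZE = 10000;

    // Hypothetical stand-in for crit.setFirstResult(first).setMaxResults(max).list()
    static List<Integer> fetchPage(List<Integer> all, int first, int max) {
        if (first >= all.size()) {
            return new ArrayList<Integer>();
        }
        return all.subList(first, Math.min(first + max, all.size()));
    }

    // Runs the loop over 'total' fake entities; returns how often a flush fired
    static int run(int total) {
        List<Integer> entities = new ArrayList<Integer>();
        for (int n = 0; n < total; n++) {
            entities.add(n);
        }
        int flushes = 0;
        int i = 0;
        List<Integer> page;
        do {
            page = fetchPage(entities, i, PAGE_SIZE);
            // ... each entity in 'page' would be indexed here ...
            i += page.size();
            // flush once per FLUSH_SIZE indexed entities
            if (!page.isEmpty() && i % FLUSH_SIZE == 0) {
                flushes++;
            }
        } while (!page.isEmpty());
        return flushes;
    }

    public static void main(String[] args) {
        System.out.println("flushes for 25000 entities: " + run(25_000));
        // → flushes for 25000 entities: 2
    }
}
```

Testing the running total `i` gives one flush per ten pages here; a condition based on `pageSize % flushSize` alone would never fire with these values, so every pending change would be held until the final commit.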
Quote:
You should try to understand what is slowing you down using a profiler; there are 3 main candidates:
1) Your object loading from the DB. Study better caching and fetching strategies; some helper API stuff will be added, hopefully soon, but I can't make promises as it depends on lots of other changes.
I believe that this part is very fast. I had some problems getting my profiler to run last night, but I hope to solve that this evening and then work out where the slowness is.
Quote:
2) Your garbage collector. Make sure your JVM has good sized memory pools and you are not holding references during the whole indexing process.
I will check this. Can you suggest any easy ways to tell whether a JVM is suffering performance problems due to its memory parameters?
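One low-effort check, in case the profiler stays stubborn: the JDK's standard management beans can report heap usage and per-collector GC counts and times from inside the running JVM. This is plain JDK API (nothing Hibernate-specific); if cumulative collection time is a large fraction of wall-clock time during reindexing, the memory pools are worth tuning:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Prints current heap usage and cumulative GC activity for this JVM
public class GcCheck {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println("heap used=" + heap.getUsed()
                + " committed=" + heap.getCommitted()
                + " max=" + heap.getMax());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // counts and times are cumulative since JVM start (-1 if unsupported)
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
    }
}
```

Calling something like this just before and after the reindex shows how much GC ran during it; `-verbose:gc` on the command line gives the same information without any code changes.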
Quote:
3) H.Search's backend is constantly improving on this side; another good speedup should come in 3.1.1 if Emmanuel accepts my proposals. Are you using automatic optimization strategies? (There's currently a bug that makes them trigger even during indexing.)
Not that I am aware of. When browsing with Luke I have noticed that one or two of my indexes are listed as optimized, but some are not.
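For reference, if automatic optimization were enabled it would be configured explicitly in the properties; something like the following (property names as documented in the Hibernate Search reference; the threshold values here are purely illustrative):

```xml
<!-- Illustrative thresholds: optimize after 1000 index operations or 100 transactions -->
<prop key="hibernate.search.default.optimizer.operation_limit.max">1000</prop>
<prop key="hibernate.search.default.optimizer.transaction_limit.max">100</prop>
```

If neither property is set, the differing "optimized" flags seen in Luke most likely just reflect how each index was last written.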
Quote:
In any case it usually depends too much on your own entities to be able to provide a general-purpose magical speedup, so you should definitely check your object load timings and make sure the data you need is fetched with minimal DB roundtrips (try enabling SQL logging if you can't profile).
I believe the DB fetching part is very fast; I am running these tests against a database on localhost, and, as mentioned previously, each batch is loaded by a single SQL statement against indexed primary keys.
Almost everything I am running is stock; here are the only parameters I have tried changing:
Code:
<prop key="hibernate.search.default.indexwriter.batch.ram_buffer_size">64</prop>
<prop key="hibernate.search.default.indexwriter.transaction.ram_buffer_size">64</prop>
This did give me some speedup. I'd be very interested to know what some optimal configuration parameters are, however.
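Besides `ram_buffer_size`, the same indexwriter section of the reference documents a few other knobs that may be worth experimenting with. The values below are illustrative starting points, not recommendations; they would need benchmarking against the actual data:

```xml
<!-- Illustrative values only; benchmark before adopting -->
<prop key="hibernate.search.default.indexwriter.batch.max_buffered_docs">1000</prop>
<prop key="hibernate.search.default.indexwriter.batch.merge_factor">20</prop>
<prop key="hibernate.search.default.indexwriter.batch.max_merge_docs">100000</prop>
```

Raising `merge_factor` in batch mode trades more open segments (and a later optimize) for fewer merges during the bulk rebuild, which is usually the right trade for mass indexing.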
I have also noticed a lot of disk activity during indexing, which surprises me: the index it produces is just over a megabyte, so it seems crazy that creating it takes multiple minutes and generates so much disk traffic.
Thanks!
-David