Unwanted initialization of non-dirty lazy references on update

mschipperheyn · **Joined:** Wed Nov 05, 2003 7:22 pm **Posts:** 211

Hi,

I have a problem where storing a Poll leads to its already indexed owning entity (Post) being fully reinitialized when updating the Poll object.

When I debug this, I basically see Hibernate correctly identifying the dirty relationships and updating them.
However, when we hit Hibernate Search
PostTransactionWorkQueueSynchronization.beforeCompletion(84) | Processing Transaction's beforeCompletion() phase: org.hibernate.search.backend.impl.PostTransactionWorkQueueSynchronization

Boom! Everything gets initialized, including all the lazy and extra-lazy, non-dirty items. It's basically a full reindex instead of an update.
In my case this is especially disastrous, because the extra-lazy collections are extra-lazy for a reason.

Really puzzling and it looks like incorrect behaviour. Any suggestions are appreciated.

Code:

@Entity
@Table
@Indexed
public class Post{   
   
   @ManyToOne(cascade=CascadeType.ALL,fetch=FetchType.LAZY)
   @JoinColumn(name="FK_PollId", nullable=true,updatable=false)
   @IndexedEmbedded
   public Poll getPoll() {
      return poll;
   }

}

@Entity
@Table
public class Poll {

   @OneToOne(optional=false,fetch=FetchType.LAZY,mappedBy="poll")
   @ContainedIn
   public Post getPost(){
      return post;
   }


   @OneToMany(cascade=CascadeType.ALL,fetch=FetchType.LAZY,mappedBy="poll",orphanRemoval=true)
   @IndexedEmbedded(prefix="question.")
   public List<Question> getQuestions() {
      return questions;
   }

}

@Entity
@Table
public class Question {

   @Column
   @Field(index=Index.NO,store=Store.YES)
   public int getCount() {
      return count;
   }
}

I posted part of this post in another thread but since the title of that thread does not accurately reflect the content of the message, I'm reposting here.

Kind regards,
Marc

sanne.grinovero · **Posted:** Mon Sep 03, 2012 8:58 am

Hi Marc,

Quote:

It's basically a full reindex instead of an update.

If you'll forgive the far-fetched comparison, consider that Lucene indexes are something like a one-way data structure. The information written into the index cannot be extracted from the index! One of the consequences is that, since Lucene doesn't have any form of update either, if a single indexed element is potentially dirty, Hibernate Search needs to reload all fields to re-index them.

It was discussed to re-load needed information from the index itself, but this bloats the index size and is less efficient than a RDBMS, as a database is designed to perform this kind of operation.
Hibernate Search has a special dirty-checking optimization so that it will skip indexing if not needed, but this is only effective if all indexed fields are found non-dirty, and if there are no dynamic analyzers or custom bridges, which would make it unreliable.

You have to work around the need for such large updates with any or all of the following:
- index fewer fields
- use conditional indexing, to have your application hint when an index operation could be skipped
- make clever use of Hibernate's second level cache, to avoid hitting the database for frequently used values

Second level caching is by far more effective than other hacks on the Lucene structure, and is a pretty solid technology.
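As a rough illustration of the conditional-indexing hint in the list above, here is a minimal sketch in plain Java. The `Decision` enum, the `UpdateInterceptor` interface, the `Post` entity with its `published` flag, and the `Indexer` class are all hypothetical stand-ins invented for this example; Hibernate Search's actual conditional indexing API is not shown here. The point is only that when the application vetoes an update, the delete-and-re-add never happens, so nothing needs to be loaded.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-ins, NOT Hibernate Search types: a hypothetical interceptor
// contract that lets the application veto an index update.
class ConditionalIndexingSketch {

    enum Decision { APPLY, SKIP }

    interface UpdateInterceptor<T> {
        Decision onUpdate(T entity);
    }

    // Hypothetical entity: we assume only published posts need to be searchable.
    static final class Post {
        final long id;
        final boolean published;

        Post(long id, boolean published) {
            this.id = id;
            this.published = published;
        }
    }

    // Toy backend that records which entities were actually re-indexed.
    static final class Indexer {
        final List<Long> reindexed = new ArrayList<>();
        private final UpdateInterceptor<Post> interceptor;

        Indexer(UpdateInterceptor<Post> interceptor) {
            this.interceptor = interceptor;
        }

        void update(Post post) {
            if (interceptor.onUpdate(post) == Decision.SKIP) {
                return; // no delete + re-add, so no associations are loaded
            }
            reindexed.add(post.id);
        }
    }

    static List<Long> demo() {
        Indexer indexer = new Indexer(p -> p.published ? Decision.APPLY : Decision.SKIP);
        indexer.update(new Post(1L, true));
        indexer.update(new Post(2L, false)); // draft: skipped
        indexer.update(new Post(3L, true));
        return indexer.reindexed;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [1, 3]
    }
}
```

This only helps when skipping is acceptable. It does not reduce the cost of an update that really has to happen.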

mschipperheyn · **Joined:** Wed Nov 05, 2003 7:22 pm **Posts:** 211

Hi Sanne,

Thanks for the extensive reply. In my case, skipping indexing is not the issue, the object is in fact dirty. It's just that the indexed object represents a whole bunch of expensive database reads if it has to be fully reindexed when only one field changes.

Ok, so every update to an indexed field in an object leads to a full re-index. At first glance, it makes sense to read everything from the database and reindexing that. But at second glance, to me, it doesn't. Why not, as you say read the document, just add/overwrite the dirty fields and write again? I don't see how this bloats the index size.

It would only bloat the index size if you assume that a read/write scenario would require the index to return the same content as the database. But why would that be a requirement? That's up to the developer. All I care about is that whatever is in the index gets updated with the dirty fields.

The second level cache option is an option, yes. But in my scenario I avoid it altogether by just relying on the index. And with Extra Lazy content, I don't want to use the cache because it would require me to add a lot more memory for a corner case (when I should be able to optimize this).

Indexing fewer fields is not an option for me. I'm already extremely selective (thanks to the includePaths option ;-) )

sanne.grinovero · **Posted:** Mon Sep 03, 2012 11:11 am

Quote:

Ok, so every update to an indexed field in an object leads to a full re-index. At first glance, it makes sense to read everything from the database and reindexing that. But at second glance, to me, it doesn't. Why not, as you say read the document, just add/overwrite the dirty fields and write again?

You can't work on a per-field detail. You have to delete the full Document, and insert a brand new one, including all fields as in the original version.
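To make the delete-and-reinsert point concrete, here is a minimal sketch of a toy inverted index in plain Java. It is not Lucene's API: the class `DeleteThenAddSketch` and its methods are invented for illustration, and tokenization is just lowercasing and splitting on whitespace. It shows that an "update" which supplies only the dirty field silently drops the other fields from the index, which is why the caller has to reload everything first.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index (not Lucene): "field:term" -> ids of matching documents.
class DeleteThenAddSketch {
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    void add(int docId, Map<String, String> fields) {
        for (Map.Entry<String, String> f : fields.entrySet()) {
            for (String term : f.getValue().toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(f.getKey() + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    void delete(int docId) {
        for (Set<Integer> ids : postings.values()) {
            ids.remove(docId);
        }
    }

    // An update is a delete followed by an add of whatever the caller supplies.
    void update(int docId, Map<String, String> fields) {
        delete(docId);
        add(docId, fields);
    }

    boolean matches(String field, String term, int docId) {
        Set<Integer> ids = postings.get(field + ":" + term);
        return ids != null && ids.contains(docId);
    }

    static boolean[] demo() {
        DeleteThenAddSketch index = new DeleteThenAddSketch();
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "lazy loading");
        doc.put("body", "hibernate search");
        index.add(1, doc);

        // Naive partial update: only the dirty field is supplied.
        Map<String, String> onlyDirty = new HashMap<>();
        onlyDirty.put("title", "eager loading");
        index.update(1, onlyDirty);
        boolean bodyLost = !index.matches("body", "hibernate", 1);

        // Full update: every field is supplied again from the source of truth.
        doc.put("title", "eager loading");
        index.update(1, doc);
        boolean bodyRestored = index.matches("body", "hibernate", 1);

        return new boolean[] { bodyLost, bodyRestored };
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("body lost after partial update: " + r[0]);
        System.out.println("body searchable after full update: " + r[1]);
    }
}
```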

Quote:

I don't see how this bloats the index size.

I mean, if you wanted to reload the values from the index, you would need to use STORE=YES on all fields. Which massively bloats the index size (like from 600MB to 50GB in one experiment I ran once..).

Quote:

It would only bloat the index size if you assume that a read/write scenario would require the index to return the same content as the database. But why would that be a requirement? That's up to the developer.

Not really up to the developer. If you don't feed to Lucene the same input as the first time, you're not indexing the same contents..

Quote:

The second level cache option is an option, yes. But in my scenario I avoid it altogether by just relying on the index. And with Extra Lazy content, I don't want to use the cache because it would require me to add a lot more memory for a corner case (when I should be able to optimize this).

Yes you *could* optimize this: https://issues.apache.org/jira/browse/LUCENE-3837
but still it's extremely complex and tricky, and I really doubt that it will be faster than a hashmap lookup (as 2nd level caches basically are implemented)

You should be able to find a good compromise in using 2nd level cache on key entities, tuning them for your use case. Measure & tune, you won't need gazillions of memory, actually it will require much less memory than by extracting values from Lucene's index or attempting to update them.

I'll be very happy if you can prove me wrong, but I got good results following these rules..

mschipperheyn · **Joined:** Wed Nov 05, 2003 7:22 pm **Posts:** 211

I've added a Jira for this
https://hibernate.onjira.com/browse/HSEARCH-1185

sanne.grinovero · **Posted:** Mon Sep 03, 2012 11:30 am

Why, if it's impossible?
This is just the design of Lucene: to get super fast queries, the data structure is optimized for queries. It's not optimized for writes, nor updates.

mschipperheyn · **Joined:** Wed Nov 05, 2003 7:22 pm **Posts:** 211

Quote:

Not really up to the developer. If you don't feed to Lucene the same input as the first time, you're not indexing the same contents..

This is my point. If you read the indexed document, and add the changes, then save, you are indexing the same contents with just the changes. Perhaps I just don't get what you are saying, or misunderstand the nature of how Lucene works, but I don't see why you would have to Store.YES all fields for this scenario.

Perhaps what I'm not realizing is that the problem is not the Store.YES fields, but the Index.YES, Store.NO fields. Is that the point: that Lucene has no way of reindexing those values in the scenario that I'm suggesting?

sanne.grinovero · **Posted:** Mon Sep 03, 2012 12:30 pm

Quote:

This is my point. If you read the indexed document, and add the changes, then save, you are indexing the same contents with just the changes.

I think we identified the core misunderstanding :)
You can't "read the indexed document". Once you have written the document to the index, it's only going to store statistics about its contained terms.

When you run a Query, the Lucene API only tells you "this query matched documents number 1, 6 and 36": it just returns a sorted list of integers, with rankings about the relevancy (a float); using these integers as a document identifier, you can then extract only what you have stored in the document with Store.YES, nothing more!
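The distinction between "indexed" and "stored" can be sketched with a toy index in plain Java. This is a simplification and not Lucene's API: `StoredVsIndexedSketch`, its `store` flag and its whitespace tokenizer are assumptions made for the example. A search hands back document numbers only, and a field value can be read back only if it was explicitly stored alongside the postings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model (not Lucene): postings answer "which documents match",
// a separate store answers "what was the original value", if it was kept.
class StoredVsIndexedSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    void add(int docId, String field, String text, boolean store) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
        }
        if (store) {
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(field, text);
        }
    }

    // A search returns document numbers, nothing else.
    SortedSet<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, new TreeSet<>());
    }

    // Returns null when the field was indexed without being stored.
    String storedValue(int docId, String field) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? null : fields.get(field);
    }

    static StoredVsIndexedSketch demoIndex() {
        StoredVsIndexedSketch index = new StoredVsIndexedSketch();
        index.add(1, "nickname", "Marc", true);                // stored
        index.add(1, "body", "Lucene indexes are one way", false); // indexed only
        return index;
    }

    public static void main(String[] args) {
        StoredVsIndexedSketch index = demoIndex();
        System.out.println(index.search("body", "lucene"));
        System.out.println(index.storedValue(1, "body"));
        System.out.println(index.storedValue(1, "nickname"));
    }
}
```

In this model the `body` field is searchable, but its original text is gone: the postings hold lowercased terms without their order, so there is nothing to feed back into a re-index.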

On top of this, each field for each and every document you're attempting to extract (to reverse a Store.YES) is incredibly expensive and will likely involve multiple disk seeks, while a database can deal with load operation in much smarter ways. Hibernate Search goes a long way to optimize the loading of the fields it needs to retrieve the full entity from the database.

Quote:

Perhaps what I'm not realizing is that the problem is not the Store.YES fields, but the Index.YES, Store.NO fields. Is that the point: that Lucene has no way of re-indexing those values in the scenario that I'm suggesting?

That's it. There might be ways, but computationally it's way too complex. Not the right tool for doing this.

mschipperheyn · **Joined:** Wed Nov 05, 2003 7:22 pm **Posts:** 211

This one keeps bothering me. I've seen that
https://issues.apache.org/jira/browse/LUCENE-3837
may provide a fundamental solution to this issue but it doesn't exactly look "around the corner", so I'm left with the question of how to deal with scenarios where there are a high number of 1-n relations with a "@ContainedIn" style requirement from the n-side.

The best example is the facebook wall.
Let's say you have the following objects

Code:

User
   @OneToOne
   Photo photo;

WallPost
   @ManyToOne
   User owner;

   @OneToMany
   List<User> likes;

WallComment
   @ManyToOne
   User owner;

   @ManyToOne
   WallPost post;

   @OneToMany
   List<User> likes;

Let's say that all fields for the WallPost are indexed and stored with the WallPost and WallComment for performance purposes, including the nickname of the user. Because we're talking facebook, realtime response is important.

So, let's say we have objects with 10000 comments and likes. And then one of those users changes his nickname. This would lead to a cascade of database reads because all the affected Documents would have to be deleted and reinserted resulting in full reads for all the related documents.
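A back-of-the-envelope sketch of that fan-out, in plain Java. The `WallPostDoc` class and the cost model (one entity load for the post plus one per comment) are assumptions made for illustration, not measurements. It compares how many entity loads a nickname change triggers when the nickname is embedded in every wall post document versus when only the user id is kept there.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Rough cost model, not a benchmark: counts entity loads triggered by a rename.
class RenameFanOutSketch {

    static final class WallPostDoc {
        final long ownerId;
        final List<Long> commenterIds = new ArrayList<>();

        WallPostDoc(long ownerId, Long... commenters) {
            this.ownerId = ownerId;
            this.commenterIds.addAll(Arrays.asList(commenters));
        }

        boolean references(long userId) {
            return ownerId == userId || commenterIds.contains(userId);
        }

        // Rebuilding the document means reloading the post and all its comments.
        int entitiesToReload() {
            return 1 + commenterIds.size();
        }
    }

    // Nickname copied into every document that mentions the user.
    static int reloadsWhenNicknameEmbedded(List<WallPostDoc> docs, long renamedUserId) {
        int reloads = 0;
        for (WallPostDoc doc : docs) {
            if (doc.references(renamedUserId)) {
                reloads += doc.entitiesToReload();
            }
        }
        return reloads;
    }

    // Only the user id is kept in wall post documents: just the user document changes.
    static int reloadsWhenOnlyIdStored() {
        return 1;
    }

    static List<WallPostDoc> sampleDocs() {
        return Arrays.asList(
                new WallPostDoc(1L, 2L, 3L), // owned by user 1, two comments
                new WallPostDoc(2L, 1L),     // user 1 commented here
                new WallPostDoc(3L));        // unrelated to user 1
    }

    public static void main(String[] args) {
        System.out.println(reloadsWhenNicknameEmbedded(sampleDocs(), 1L));
        System.out.println(reloadsWhenOnlyIdStored());
    }
}
```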

I don't really see a way out of this problem other than updateable documents. In that case, instead of melting down into a smoking pile of rubble, the server would just shrug. But I'd be real happy to find another way other than "avoid architectures like this".

Cheers,
Marc

sanne.grinovero · **Posted:** Wed Nov 14, 2012 8:12 am

Hi,
I agree that it would be nice to find a good solution for this but as you say it's not trivial. If you want to experiment with things we can discuss that and maybe suggest some approaches.

Quote:

Let's say that all fields for the WallPost are indexed and stored with the WallPost and WallComment for performance purposes, including the nickname of the user. Because we're talking facebook, realtime response is important.

Sorry I disagree on that, wall posts on facebook are definitely not requiring realtime updates. Their chat terminal might need quick responses (but not realtime) but that one doesn't need fulltext indexing.. they have to workaround the same limitations.
Finally the relation Person->[many]WallPost can easily be remapped as WallPost->[one]Person and avoid the problem completely, again by limiting what kind of queries are available. Think about gmail: their search engine on emails is very bad, still I don't think it's fair to say that Google doesn't know how to implement a search engine. I'm pretty sure they face a similar technical limitation due to aggregation and write frequency.

Quote:

So, let's say we have objects with 10000 comments and likes. And then one of those users changes his nickname. This would lead to a cascade of database reads because all the affected Documents would have to be deleted and reinserted resulting in full reads for all the related documents.

Good observation, and likely one of the reasons for which most services don't allow users to change their nickname. Many new websites allow that, but they might be taking advantage of the fact people don't change nickname every day.
On twitter I'm confident the @MyName tag is remapped to an internal identifier, so that I can still find my stuff in case my nick changes: if you change nick, you're not reset to zero followers.

Quote:

I don't really see a way out of this problem other than updateable documents. In that case, instead of melting down into a smoking pile of rubble, the server would just shrug. But I'd be real happy to find another way other than "avoid architectures like this".

I'm happy to discuss some solutions :)
One idea would be to use the "join" operation in Lucene indexes which splits the document in two main blocks: http://www.searchworkings.org/blog/-/blogs/412000
I assume this could mean you don't need to reindex one of them, but I'm not sure; it would be cool if you could investigate that.

mschipperheyn · **Joined:** Wed Nov 05, 2003 7:22 pm **Posts:** 211

Hey Sanne,

Thanks for your extensive reply. It gave me some new ideas with the BlockJoinQuery (I was so fixated on the document update that I didn't consider other options). Initially this seemed like the ideal candidate for the described use case as the documents stay atomic and are "joined" in a block at indexing time.

A Block Join is an index time aspect so it would impact the way HSearch executes indexing, searching and annotations.

@Annotations
I think a Block Join style index could be considered as an alternative to includePaths. Instead of indexing a related entity's fields, you're creating a block between the parent and one or more child documents.
@IndexedEmbedded(indexAsBlock=true)

Indexing
Block indexed documents would have to be indexed using the addDocuments API of the IndexWriter

Searching
In order to search against a Blocked group of documents, you need to use the BlockJoinQuery. So this would impact the Query API and the underlying Query engine. A BlockJoinQuery requires you to create separate queries for both the parent and children in order.

Code:

qb.block().parent(
   qb.keyword().onField("site.id").matching(site.getId()).createQuery()
).child(
   qb.keyword().onField("user.id").matching(user.getId()).createQuery()
)

I'm not sure what the impact will be on search performance.

Updating
Since Blocks just create relations between existing documents, you would think that updating the username or photo of the User in the described use case would not impact all of the Posts. In terms of cascading queries, this would provide significant performance improvements for HSearch.

However, according to article below

"If you need to re-index a parent document or any of its child documents, or delete or add a child, then the entire block must be re-indexed. This is a big problem in some cases, for example if you index "user reviews" as child documents then whenever a user adds a review you'll have to re-index that shirt as well as all its SKUs and user reviews. "

Which is precisely what we're hoping to avoid.
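The limitation quoted above can be sketched with a toy model in plain Java. `BlockReindexSketch` is invented for illustration and is not Lucene's API. It only mirrors the idea that a block is a contiguous run of child documents followed by their parent, so any change to the block means writing the whole run again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of block indexing (not Lucene): children first, parent last,
// always written and replaced as one contiguous unit.
class BlockReindexSketch {

    static final class Doc {
        final String blockKey;
        final String content;
        final boolean parent;

        Doc(String blockKey, String content, boolean parent) {
            this.blockKey = blockKey;
            this.content = content;
            this.parent = parent;
        }
    }

    final List<Doc> segment = new ArrayList<>();
    int docsWritten = 0;

    void addBlock(String blockKey, List<String> children, String parentContent) {
        for (String child : children) {
            segment.add(new Doc(blockKey, child, false));
            docsWritten++;
        }
        segment.add(new Doc(blockKey, parentContent, true));
        docsWritten++;
    }

    // There is no "append one child": the old block is dropped and rewritten.
    void replaceBlock(String blockKey, List<String> children, String parentContent) {
        segment.removeIf(d -> d.blockKey.equals(blockKey));
        addBlock(blockKey, children, parentContent);
    }

    static int demo() {
        BlockReindexSketch index = new BlockReindexSketch();
        index.addBlock("post-1", Arrays.asList("comment a", "comment b", "comment c"), "wall post");
        index.docsWritten = 0; // count only the cost of the next change
        index.replaceBlock("post-1",
                Arrays.asList("comment a", "comment b", "comment c", "comment d"), "wall post");
        return index.docsWritten;
    }

    public static void main(String[] args) {
        System.out.println("documents written for one new comment: " + demo());
    }
}
```

In this model, adding a single comment to a post with three existing comments writes five documents, and every one of them needs its source data reloaded first.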

Document retrieval
In the use case I described document retrieval was key. I wanted an updateable display name and photo. So I need to retrieve both User related fields and the entire Post.
Using BlockJoinQuery you will primarily just retrieve the Post and not the User. In order to retrieve the User related field that you are interested in, you need to add a BlockJoinCollectorQuery to the BlockJoinQuery.

Related info
http://java.dzone.com/articles/searching-relational-content

So, for the moment what I think I need to do is extract the User id from the Post and retrieve the associated fields in a second query. Not the approach I am aiming for of course but it should still be manageable.

mschipperheyn · **Joined:** Wed Nov 05, 2003 7:22 pm **Posts:** 211

Some additional thoughts

Quote:

Sorry I disagree on that, wall posts on facebook are definitely not requiring realtime updates.

Ok, realtime is not the right word. What I mean is that if a User changes a photo or nickname, you have to have a reasonably quick reflection of these changes in all their WallPosts.

Quote:

Finally the relation Person->[many]WallPost can easily be remapped as WallPost->[one]Person and avoid the problem completely, again by limiting what kind of queries are available. Think about gmail: their search engine on emails is very bad, still I don't think it's fair to say that Google doesn't know how to implement a search engine. I'm pretty sure they face a similar technical limitation due to aggregation and write frequency.

Yes, it can and I did. I'm personally not familiar with Hadoop or whether it alleviates any of these issues. However, the User is basically @ContainedIn the WallPost. So if the user changes, all the WallPosts will have to change, assuming you index this user data with the WallPost. Obviously, I'm going to have to do some refactoring to change this, as I don't see a way out. But you can see how this kind of sucks. I see the duplication of storing user data with a WallPost as a small price to pay for the performance. But if I refactor the User data out and get it separately, I will have to run an extra query that retrieves all the users in the result of WallPost and all of its Comments and then splice that data into the result again. It just seems like a very ugly way to deal with this.

Caching User objects will alleviate some of this pain, but still.

I guess if there is no updateable document or workable join concept available, a simpler workaround might be to use a "fetch style join". HSearch could retrieve the associated User document during the query process. This way the User document could stay atomic and reflect its actual state.

The question is how to implement such a fetch operation. HSearch would have to retain a list of @DocumentIds and Entities that are requested this way during the query process, retrieve the associated Documents separately and integrate them with the search result.

Annotations
A developer could just annotate the User reference with @IndexedEmbedded(includePaths={"id"}) to make sure the @DocumentId is stored.
Otherwise, I don't think anything would have to change.

HSearch would have to know which entities should be retrieved this way. This could be part of the FullTextQuery

Code:

FullTextQuery ftq = fts
.createFullTextQuery(mj.createQuery(), WallPost.class)
.setFetchMode("user", FetchMode.JOIN)
.setProjection(ProjectionConstants.DOCUMENT)
.setResultTransformer(new WallPostResultTransformer())
.setFirstResult(start)
.setMaxResults(max);

By indicating setFetchMode, HSearch would retrieve the entire associated Document and insert this as the user Tuple in the WallPost Document in case of projection or as an entity in case of entity retrieval.

Of course, HSearch would have to make sure that this retrieval goes as efficiently as possible, getting all the associated Users at the end of the query process.

mschipperheyn · **Joined:** Wed Nov 05, 2003 7:22 pm **Posts:** 211

I also looked at the idea of using a result transformer for this, but the transformTuple is per object and transformList only executes on entity retrievals.

Basically, you would need something that allows you to transform the list of tuples and retrieve the JOIN Documents before the result transformer executes.

Of course, one could say that this is starting to sound more and more like the kind of thing a database is good at. One could also just project all the properties through a regular database query, but when it comes to something like a Wall, I really like the read performance an index gives me, especially since the index in my scenario is on the server and the db is on the network. It's an interesting question whether in this scenario, with a high number of reads AND writes (comments, likes), the issue of full document initialization will actually make HSearch perform far worse than Hibernate ORM. In fact, it probably does.

mschipperheyn · **Joined:** Wed Nov 05, 2003 7:22 pm **Posts:** 211

So, what I ended up doing was just storing the userId on the Post index and retrieving the user separately with a customized ResultTransformer.
Basically,
* I retrieve the full ProjectionConstants.DOCUMENT without transforming the result and retrieve all userIds from the result list
* retrieve all related users from the User index
* execute a customized result transformer that sets the user object on the relevant Post and Post children.
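The three steps above can be sketched in plain Java as follows. This is not the actual transformer: `PostHit`, `fetchNicknames` and the in-memory `userIndex` map are hypothetical stand-ins for the projected Post documents and the User index, and only a nickname is spliced in to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the two-phase lookup; all types here are illustrative stand-ins.
class TwoPhaseUserLookupSketch {

    // What a projected Post document is assumed to carry: ids only, no user data.
    static final class PostHit {
        final long postId;
        final long userId;
        String nickname; // filled in during the second phase

        PostHit(long postId, long userId) {
            this.postId = postId;
            this.userId = userId;
        }
    }

    // Stand-in for one batched query against the User index.
    static Map<Long, String> fetchNicknames(Set<Long> userIds, Map<Long, String> userIndex) {
        Map<Long, String> found = new HashMap<>();
        for (Long id : userIds) {
            String nickname = userIndex.get(id);
            if (nickname != null) {
                found.put(id, nickname);
            }
        }
        return found;
    }

    static List<PostHit> attachUsers(List<PostHit> hits, Map<Long, String> userIndex) {
        // 1. Collect the distinct user ids from the untransformed result list.
        Set<Long> userIds = new LinkedHashSet<>();
        for (PostHit hit : hits) {
            userIds.add(hit.userId);
        }
        // 2. Retrieve all related users in a single lookup.
        Map<Long, String> nicknames = fetchNicknames(userIds, userIndex);
        // 3. Splice the user data back into each hit.
        for (PostHit hit : hits) {
            hit.nickname = nicknames.get(hit.userId);
        }
        return hits;
    }

    static List<PostHit> demo() {
        Map<Long, String> userIndex = new HashMap<>();
        userIndex.put(1L, "marc");
        userIndex.put(2L, "sanne");

        List<PostHit> hits = new ArrayList<>();
        hits.add(new PostHit(10L, 1L));
        hits.add(new PostHit(11L, 2L));
        hits.add(new PostHit(12L, 1L));
        return attachUsers(hits, userIndex);
    }

    public static void main(String[] args) {
        for (PostHit hit : demo()) {
            System.out.println(hit.postId + " by " + hit.nickname);
        }
    }
}
```

Because the nickname lives only in the user entry, a rename touches one document and every later query picks it up in step 2.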

It's not an ideal solution, especially because I have to loop over the result set to retrieve all the user ids. But it's still fast.

So, I removed a number of fields from the WallPost and retrieved them from its own index. This works in the case of the user object because we're interested in reading values such as photo and nickname.

However, it doesn't solve the situation where you actually need to store an indexedEmbedded field that you use as part of a querybuilder selection.

sanne.grinovero · **Posted:** Tue Nov 20, 2012 12:52 pm

Hi,
I got a bit confused on your last post. Are you saying that you are loading the user ids from a separate Lucene index rather than the database?

That surprises me a bit for two reasons:
- usually loading any stored field from the index is quite slow, it's CPU intensive and also quite IO heavy. People generally prefer to load from a database for such simple data extractions. Is this performing better in your case? We had recently introduced a FieldSelector to reduce the index extraction overhead: I'm wondering if that's helping with this.
- How do you deal with transactions? Index operations might be stale (async) or generally just not as reliable as the database.