HibernateSearch: possible to search "across relationships"?

alexaverbuch · **Joined:** Wed Mar 21, 2012 2:56 pm **Posts:** 6

Hi,
I'm new to the entire Hibernate/Search/Lucene world, so please let me know if I need to rephrase my question such that it makes sense.

*** my problem ***
take an example system that tracks student applications to colleges

assume there are 3 tables:
STUDENT(s_id [pk], s_name, s_gpa)
APPLIED(s_id, c_name)
COLLEGE(c_name [pk], c_state, c_enrollment)

assume the following query is performed very often:
"find all STUDENTS with GPA>x, who APPLIED to a COLLEGE that has ENROLLMENT>y"

*** question 1 ***
is it possible to configure HibernateSearch such that it efficiently indexes & searches "across relationships".
in other words, when I add a new application to the database, is HibernateSearch smart enough to realize that it should index more than just the APPLIED table, but should actually index the entire "relationship" (i.e. STUDENT-APPLIED-COLLEGE) that spans three tables?

*** question 2 ***
this is a bit of a generalization of the first question.
does the database schema have any direct impact on the efficiency of search in HibernateSearch?
or, is it possible to easily define "policies" that index the data I insert into the database in any way I wish?

Thanks in advance for all help!

Best Regards,
Alex

sanne.grinovero · **Posted:** Mon Mar 26, 2012 2:30 pm

Hi, welcome

Quote:

*** question 1 ***
is it possible to configure HibernateSearch such that it efficiently indexes & searches "across relationships".
in other words, when I add a new application to the database, is HibernateSearch smart enough to realize that it should index more than just the APPLIED table, but should actually index the entire "relationship" (i.e. STUDENT-APPLIED-COLLEGE) that spans three tables?

That really depends on the query. It is "smart enough" for many queries, but some queries can't be resolved. Your example query is a good example of something it can solve easily. To understand what it cannot solve, you have to understand the way objects are mapped to the index: every possible "hit" of a query needs to match a single Document in the index. So you could have a Document with these fields:

- student name
- GPA
- college name
- enrollment

In addition, each of these fields can have multiple values (as opposed to a relational database), BUT you can't co-relate values from different fields.
Example: assuming you allow your students to be registered in multiple colleges, you can search for "students with some name, studying some kind of enrollment at college A, B or C", but you CAN NOT do "students with some name enrolling on X at college B", as the X would match as well if it was studied at college C by the same student.

What it usually boils down to is that via Search you solve most queries, and some use cases will need a relational query or post-filtering of results.

Quote:

*** question 2 ***
this is a bit of a generalization of the first question.
does the database schema have any direct impact on the efficiency of search in HibernateSearch?
or, is it possible to easily define "policies" that index the data I insert into the database in any way I wish?

Generally you can map any schema to any index without significant performance impact. The main difference to consider is that Lucene operations don't allow for partial updates, so if your student changes name, you'll have to load college and enrollment to reindex the whole relation. If you're really extreme on performance you might want to load the needed graph in a fetch join, or optimize the schema for it, but I think that's really far-fetched, since most operations would likely not change these values, and a 2nd level cache would be more effective anyway (and good to have even if you don't use Search).
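To illustrate the "no partial updates" point, here is a minimal plain-Java sketch. It does not use the Lucene or Hibernate Search API: `StudentRecord`, `CollegeRecord` and `ReplaceOnlyIndex` are hypothetical names, and the enrollment number is made up. The only thing it shows is that a rename forces the whole document, college fields included, to be rebuilt from the full object graph.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical entities standing in for the COLLEGE and STUDENT tables.
class CollegeRecord {
    final String name;
    final int enrollment;
    CollegeRecord(String name, int enrollment) { this.name = name; this.enrollment = enrollment; }
}

class StudentRecord {
    final int id;
    String name;
    final List<CollegeRecord> appliedTo = new ArrayList<>();
    StudentRecord(int id, String name) { this.id = id; this.name = name; }
}

// Hypothetical index that can only replace whole documents, never patch one field.
class ReplaceOnlyIndex {
    private final Map<Integer, Map<String, List<String>>> documents = new HashMap<>();

    // Rebuilds the student's document from the full graph and swaps it in.
    void reindex(StudentRecord student) {
        Map<String, List<String>> doc = new HashMap<>();
        doc.put("student_name", Collections.singletonList(student.name));
        List<String> colleges = new ArrayList<>();
        List<String> enrollments = new ArrayList<>();
        for (CollegeRecord c : student.appliedTo) { // the colleges must be loaded too
            colleges.add(c.name);
            enrollments.add(String.valueOf(c.enrollment));
        }
        doc.put("college_name", colleges);
        doc.put("college_enrollment", enrollments);
        documents.put(student.id, Collections.unmodifiableMap(doc));
    }

    // Returns the values of one field, or an empty list if absent.
    List<String> values(int studentId, String field) {
        Map<String, List<String>> doc = documents.get(studentId);
        if (doc == null) return Collections.emptyList();
        List<String> v = doc.get(field);
        return v == null ? Collections.<String>emptyList() : v;
    }
}

class ReindexDemo {
    public static void main(String[] args) {
        StudentRecord s = new StudentRecord(42, "ford");
        s.appliedTo.add(new CollegeRecord("mit", 11000)); // made-up enrollment
        ReplaceOnlyIndex index = new ReplaceOnlyIndex();
        index.reindex(s);
        s.name = "prefect";   // only the name changes...
        index.reindex(s);     // ...but the whole document is rebuilt
        System.out.println(index.values(42, "student_name")); // [prefect]
        System.out.println(index.values(42, "college_name")); // [mit]
    }
}
```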

alexaverbuch · **Joined:** Wed Mar 21, 2012 2:56 pm **Posts:** 6

Hi Sanne,
Thanks a lot for the detailed response!

First of all, sorry my example "schema" wasn't explained clearly.
By enrollment I meant "number of students at that college".
I actually had no concept of "course/program" in the schema (and I guess that's what you understood as "enrollment").

Quote:

To understand what it can not solve, you have to understand the way objects are mapped to the index: every possible "hit" of a query needs to match a single Document in the index.
you can't co-relate values from different fields.
assuming you allow your students to be registered in multiple colleges
you CAN search for: "students with some name, studying some kind of enrollment at college A, B or C"
you CAN NOT search for: "students with some name, enrolling on X at college B" as, the X would match as well if it was studied at college C for the same student.

What exactly do you mean by co-relate?
Does that mean that queries can express an _OR_ operator but not an _AND_, i.e. "course X _AND_ college B" is not possible?

Also, did you mean the following?
you CAN search for "students with ANY name; studying ANY enrollment/course; at college A, B or C"
you CAN NOT search for "students with ANY name; studying EXACTLY enrollment X; at college EXACTLY B"

Quote:

Generally you can map any schema to any index without significant performance impact. The main difference to consider is that Lucene operations don't allow for partial updates, so if your student changes name, you'll have to load college and enrollment to reindex the whole relation; if you're really extreme on performance you might want to load the needed graph in a fetch join, or optimize the schema for it, but I think that's really far fetched since most operations would likely not change these values

My question was more about expressiveness than performance (at this stage anyway).
From your answer I understood that any SCHEMA-to-INDEX mapping is functionally possible. Great.
FYI, the application will be very read heavy (at first), so I don't mind sacrificing write performance if it's not drastic.

Quote:

A 2nd level cache would be more effective anyway (and good to have even if you don't use Search)

What do you mean by a 2nd level cache?
My understanding was that HibernateSearch would allow me to place the entire Search index in RAM (assuming sufficient RAM).
Is that not the case?
Can a caching layer be enabled in the Hibernate/HibernateSearch configuration, or are you thinking more along the lines of a product like Memcached?

Thanks again,
Alex

sanne.grinovero · **Posted:** Mon Mar 26, 2012 5:52 pm

Quote:

By enrollment I meant "number of students at that college".
I actually had no concept of "course/program" in the schema (and I guess that's what you understood as "enrollment").

Yes, I made that mistake. I hope my examples still make sense under my interpretation ;)

Quote:

What exactly do you mean by co-relate?
Does that mean that queries can express an _OR_ operator but not an _AND_, i.e. "course X _AND_ college B" is not possible?

No, both OR and AND (and many more) operators are possible. The problem is when one student is linked to multiple colleges, and each has attributes to test for. All attributes are "in the same bag" for that student (Document), so you can check for and/or/etc. on those attributes, but you can't relate them to, for example, the same college instance.

The point is you have to understand that you only have Documents with named fields, and each field can have multiple unordered values: that's all you have in the index to create your queries.
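A small plain-Java sketch of the "same bag" problem may help. It is not the Lucene API: `FlatDocument` and `SameBagDemo` are hypothetical names, and the sample values (ford, stanford, mit) are illustrative. It assumes one document per student, with every college's attributes added to the same multi-valued fields.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an index document: named fields, each holding
// an unordered set of values.
class FlatDocument {
    private final Map<String, Set<String>> fields = new HashMap<>();

    void add(String field, String value) {
        fields.computeIfAbsent(field, k -> new HashSet<>()).add(value);
    }

    // True if the named field contains the value.
    boolean has(String field, String value) {
        return fields.getOrDefault(field, Collections.emptySet()).contains(value);
    }
}

class SameBagDemo {
    // One document per student; all colleges share the same fields.
    static FlatDocument studentDocument() {
        FlatDocument doc = new FlatDocument();
        doc.add("student_name", "ford");
        doc.add("college_name", "stanford");
        doc.add("college_city", "palo alto");
        doc.add("college_name", "mit");
        doc.add("college_city", "cambridge");
        return doc;
    }

    // "college_name = name AND college_city = city", evaluated field by field.
    static boolean matches(FlatDocument doc, String name, String city) {
        return doc.has("college_name", name) && doc.has("college_city", city);
    }

    public static void main(String[] args) {
        FlatDocument doc = studentDocument();
        System.out.println(matches(doc, "mit", "cambridge")); // true, as intended
        System.out.println(matches(doc, "mit", "palo alto")); // true, a false positive
    }
}
```

The AND itself works fine. What is lost is the information that "palo alto" belongs to stanford rather than to mit.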

To make an additional example, you cannot search on enrollment unless you were storing an enrollment attribute specifically in the College entity: I mean you cannot express a Count() operation attempting to evaluate the value of enrollment dynamically.

Of course you still have the database available as well, so whatever is not solvable by a full-text query can be solved old-school.

Quote:

My question was more about expressiveness than performance (at this stage anyway).
From your answer I understood that any SCHEMA-to-INDEX mapping is functionally possible. Great..
FYI, the application will be very read heavy (at first), so I don't mind sacrificing write performance if it's not drastic

Yes, nothing to worry about then. Perfect fit for Hibernate Search.

Quote:

What do you mean by a 2nd level cache?
My understanding was that HibernateSearch would allow me to place the entire Search index in RAM (assuming sufficient RAM).
Is that not the case?
Can a caching layer be enabled in the Hibernate/HibernateSearch configuration, or are you thinking more along the lines of a product like Memcached?

I'm talking about Hibernate (Core) 2nd level cache. That's a feature to avoid unnecessary loading of entities from the database, often used.
Lucene queries target the index indeed, but then load from the database unless you use projection. So it's often effective to use a second level cache as well.

About "entire Search index in RAM": make sure you test performance if that is what you're after. The disk-based index is often more efficient and can be extremely fast.

alexaverbuch · **Joined:** Wed Mar 21, 2012 2:56 pm **Posts:** 6

Quote:

The problem is when one student is linked to multiple colleges, and each has attributes to test for. All attributes are "in the same bag" for that student (Document) so you can check for and/or/etc.. on those attributes, but not relating them to - for example- the same college instance.

Do you mean "all attributes (name, city, etc.) for a given college are in the same document AND that document may also contain multiple colleges", i.e. a document has multiple "college_name" attributes, multiple "college_city" attributes, etc.?

{
document_id : 1
student_id : 42
student_name : ford
college_name : stanford, mit
college_city : palo alto, cambridge
}

If so, I assume a "solution" would be to define a document as storing the data for only one college application (1 student + 1 college), and creating a new document each time an application is made to a college?

Quote:

The point is you have to understand that you only have Documents with named fields, and each field can have multiple unordered values..

I guess this is very useful for storing documents with user-generated "tags"?

Quote:

I'm talking about Hibernate (Core) 2nd level cache. That's a feature to avoid unnecessary loading of entities from the database, often used.
Lucene queries target the index indeed, but then load from the database unless you use projection. So it's often effective to use a second level cache as well.

Great! I hadn't come across this feature yet. Unless it weakens some transactional guarantees, I can't imagine why anyone would choose to not use it...

sanne.grinovero · **Posted:** Mon Mar 26, 2012 7:06 pm

Quote:

If so, I assume a "solution" would be to define a document as storing the data for only one college application (1 student + 1 college), and creating a new document each time an application is made to a college?

Yes, that's what happens when you mark the "college application" entity as @Indexed. I'm just pointing out the basics; I hope the @IndexedEmbedded and @ContainedIn examples in the docs, plus some unit tests, help you with the details.
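As a rough plain-Java illustration of the "one document per application" idea (not Hibernate Search annotations or the Lucene API; `ApplicationDocument` and `PerApplicationDemo` are hypothetical names and the sample values are illustrative), each document holds exactly one student and one college, so a college's name and city stay together:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical index document for a single application (1 student + 1 college).
class ApplicationDocument {
    final String studentName;
    final String collegeName;
    final String collegeCity;

    ApplicationDocument(String studentName, String collegeName, String collegeCity) {
        this.studentName = studentName;
        this.collegeName = collegeName;
        this.collegeCity = collegeCity;
    }
}

class PerApplicationDemo {
    // Two applications by the same student become two separate documents.
    static List<ApplicationDocument> index() {
        List<ApplicationDocument> docs = new ArrayList<>();
        docs.add(new ApplicationDocument("ford", "stanford", "palo alto"));
        docs.add(new ApplicationDocument("ford", "mit", "cambridge"));
        return docs;
    }

    // Students with an application to the named college located in the given city.
    static Set<String> studentsApplying(List<ApplicationDocument> docs, String college, String city) {
        Set<String> students = new LinkedHashSet<>(); // de-duplicates repeated hits
        for (ApplicationDocument d : docs) {
            if (d.collegeName.equals(college) && d.collegeCity.equals(city)) {
                students.add(d.studentName);
            }
        }
        return students;
    }

    public static void main(String[] args) {
        List<ApplicationDocument> docs = index();
        System.out.println(studentsApplying(docs, "mit", "cambridge")); // [ford]
        System.out.println(studentsApplying(docs, "mit", "palo alto")); // []
    }
}
```

One consequence of this layout is that each hit is an application, not a student, so a student with several matching applications shows up several times unless the results are de-duplicated, as the set does here.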

Quote:

I guess this is very useful for storing documents with user-generated "tags"?

Yes, but it's much more than that. Old-school apps based on a relational database (only) need tags to be specified by users. With a Lucene index you can treat all the terms of the text you index as a "tag cloud", and still match them efficiently; i.e. you don't have to ask users to tag stuff manually (explicitly) to be able to implement tag clouds, more-like-this or other automated classifiers.
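The "every indexed term is a tag" idea can be sketched with a toy inverted index in plain Java. This is a simplification, not Lucene: `TermIndex` is a hypothetical name, the tokenizer just lowercases and splits on non-alphanumeric characters (real analyzers also handle stop words, stemming and so on), and the sample texts are made up.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: term -> ids of the documents containing it.
class TermIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenizes the text naively and records every term for this document.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Any term of the indexed text can be looked up like a tag.
    Set<Integer> documentsWith(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    // Number of documents containing the term, usable as a tag-cloud weight.
    int documentFrequency(String term) {
        return documentsWith(term).size();
    }
}

class TermTagDemo {
    public static void main(String[] args) {
        TermIndex index = new TermIndex();
        index.add(1, "Applying to MIT for physics");
        index.add(2, "Physics and chemistry at Stanford");
        System.out.println(index.documentsWith("physics"));      // [1, 2]
        System.out.println(index.documentFrequency("stanford")); // 1
    }
}
```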