HibernateSearch and extracted text from file (huge String)

stipenovkmet · **Joined:** Sat Feb 04, 2012 5:12 pm **Posts:** 4

I have one field which is a @Lob, and I store my extracted text content (extracted with Tika) in it. The content is stored to the DB, but for some long texts (circa 1,000,000 chars) Hibernate Search doesn't index this field. For short documents, the content is indexed. No exception is thrown on CRUD operations. Can anyone help? My field config is:

Code:

@Field(
    name = "un_tok_searchableTextContent", 
    index = Index.UN_TOKENIZED, 
    store = Store.NO)
@Lob
public String getSearcheableTextContent() {
    return _searcheableTextContent;
}

I also set this property in my applicationContext.xml:

Code:

<prop key="hibernate.search.default.indexwriter.max_field_length">10000000</prop>
<!-- default is 10000 -->

sanne.grinovero · **Posted:** Sun Feb 05, 2012 8:53 am

Hi,
are you sure you mean to map this large string as UN_TOKENIZED?

How do you know it's not being indexed? Is it because you can't find it after indexing? Did you check the index contents with a tool like Luke?
http://code.google.com/p/luke/

stipenovkmet

1. Well, I am sure, because how else can I search for a specific phrase (full text search)? Or is there another way to do this? I am also indexing this field in other ways (tokenized, ngram-tokenized, etc.), which is successful for those large strings, as I can see in Luke.

2. Yes, I checked it through Luke, and there are for example only 7 fields, but I inserted 15 docs... All other indexes are OK (for example, I have 15 indexed titles in Luke).

Also, I tried to use a logger to see what is happening, but didn't get any error or any useful info. My setup looks like:

Code:

log4j.logger.org.hibernate.search=INFO
log4j.logger.org.hibernate.search.type=ALL
log4j.logger.org.hibernate.search=debug
log4j.logger.org.apache.lucene=INFO
log4j.logger.org.apache.lucene.analysis.standard.StandardAnalyzer=debug
log4j.logger.org.apache.lucene.index.IndexWriter=debug
log4j.logger.org.apache.lucene.type=ALL

Any idea?

sanne.grinovero · **Posted:** Tue Feb 07, 2012 11:17 am

Quote:

log4j.logger.org.apache.lucene=INFO

Lucene doesn't use Log4J so it won't log anything.

Quote:

Yes, I checked it through Luke, and there are for example only 7 fields, but I inserted 15 docs

You're not necessarily going to get a new field for each document; there is no relation between the two.

Quote:

1. Well, I am sure, because how else can I search for a specific phrase (full text search)?

Yes, you could use a PhraseQuery.

UN_TOKENIZED means that the field will match only exact queries, such as TermQuery, which is often not practical for a long string. This case is actually so unlikely that I think we don't have a test to cover it. I'll add one for the sake of completeness, but I'd suggest you look into PhraseQuery or other Query options, as what you're trying to do would be very inefficient and not very flexible.
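
To make the difference concrete, here is a minimal toy sketch in plain Java. It is not the Lucene or Hibernate Search API: the class `TokenizationSketch` and its methods are hypothetical names made up for illustration, and the "analyzer" is just a lowercase-and-split step. It only shows why a single-term lookup against an un-tokenized value cannot find a word inside the text, while the same lookup against a tokenized field can.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model only: this is NOT how Lucene stores terms internally.
class TokenizationSketch {

    // Un-tokenized indexing: the whole field value becomes one single term.
    static Set<String> indexUntokenized(String text) {
        Set<String> terms = new HashSet<>();
        terms.add(text);
        return terms;
    }

    // Tokenized indexing (assumed analyzer): lowercase, then split on
    // anything that is not a letter or a digit.
    static Set<String> indexTokenized(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // A term lookup matches only if that exact term exists in the index.
    static boolean termQuery(Set<String> terms, String term) {
        return terms.contains(term);
    }

    public static void main(String[] args) {
        String content = "Hibernate Search indexes extracted text";
        Set<String> untokenized = indexUntokenized(content);
        Set<String> tokenized = indexTokenized(content);

        System.out.println(termQuery(untokenized, "extracted")); // false
        System.out.println(termQuery(untokenized, content));     // true
        System.out.println(termQuery(tokenized, "extracted"));   // true
    }
}
```

With a text of around a million characters, the un-tokenized variant would only ever match a query that repeats the entire text verbatim, which is why it is rarely useful for document content.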

stipenovkmet

Quote:

Lucene doesn't use Log4J so it won't log anything.

What about Hibernate Search? Is this the right config:

Code:

log4j.logger.org.hibernate.search=INFO
log4j.logger.org.hibernate.search.type=ALL
log4j.logger.org.hibernate.search=debug

Quote:

You're not necessarily going to get a new field for each document, there is no relation.

Can you explain this to me? If I un_tokenize titles, I have as many titles as inserted docs (one title for every doc). How is it possible not to have the same number of un_tokenized text fields?

Quote:

Yes, you could use a PhraseQuery.

UN_TOKENIZED means that the field will match only exact queries, such as TermQuery, which is often not practical for a long string. This case is actually so unlikely that I think we don't have a test to cover it. I'll add one for the sake of completeness, but I'd suggest you look into PhraseQuery or other Query options, as what you're trying to do would be very inefficient and not very flexible.

Maybe I didn't understand it from the Hibernate Search in Action book, but I thought that I must have an UN_TOKENIZED field to search for a phrase in text content (with a slop factor). Or am I wrong?

Btw, thanks for the fast replies.

sanne.grinovero · **Posted:** Wed Feb 08, 2012 12:44 pm

Code:

log4j.logger.org.hibernate.search=debug

This should be correct.

Quote:

Can you explain this to me? If I un_tokenize titles, I have as many titles as inserted docs (one title for every doc). How is it possible not to have the same number of un_tokenized text fields?

Ah, OK, if you assume untokenized, then it's likely correct. But Luke cannot always extract all values back from the index, especially if they are not STORED as well.

Quote:

Maybe I didn't understand it from the Hibernate Search in Action book, but I thought that I must have an UN_TOKENIZED field to search for a phrase in text content (with a slop factor). Or am I wrong?

No, that's not correct. PhraseQuery requires the text to be tokenized (analyzed).
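
As a rough illustration of why phrase matching needs tokens, here is a toy sketch in plain Java. It is not Lucene's PhraseQuery implementation: the class `PhraseMatchSketch` and its methods are hypothetical, the tokenizer is an assumed lowercase-and-split step, and the slop handling is simplified to "extra words allowed between the phrase terms, in order" (the real slop is more general). The point is only that the match works on token positions, which exist only if the field was analyzed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model only: simplified, in-order phrase matching over token positions.
class PhraseMatchSketch {

    // Assumed analyzer: lowercase, then split on non-letter/non-digit characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if the phrase terms occur in order with at most `slop`
    // additional tokens between the first and the last phrase term.
    static boolean phraseMatch(String fieldText, String phrase, int slop) {
        List<String> doc = tokenize(fieldText);
        List<String> query = tokenize(phrase);
        if (query.isEmpty()) {
            return false;
        }
        for (int start = 0; start < doc.size(); start++) {
            if (!doc.get(start).equals(query.get(0))) {
                continue;
            }
            int pos = start;
            boolean found = true;
            // Greedily take the earliest later position of each following term.
            for (int k = 1; k < query.size() && found; k++) {
                int next = -1;
                for (int j = pos + 1; j < doc.size(); j++) {
                    if (doc.get(j).equals(query.get(k))) {
                        next = j;
                        break;
                    }
                }
                if (next < 0) {
                    found = false;
                } else {
                    pos = next;
                }
            }
            if (found && (pos - start) - (query.size() - 1) <= slop) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String content = "Text extracted with Tika from a huge file";
        System.out.println(phraseMatch(content, "extracted with tika", 0)); // true
        System.out.println(phraseMatch(content, "extracted tika", 0));      // false
        System.out.println(phraseMatch(content, "extracted tika", 1));      // true
    }
}
```

An un-tokenized field has a single term and no word positions, so there is nothing for a phrase match of this kind to walk over.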

stipenovkmet

Thanks for the reply. As you explained, I had misunderstood the concept. Everything works fine now, using a phrase query on an analyzed field, so there is no longer any need for me to index such a large string un_tokenized. Thank you for your help.

sanne.grinovero · **Posted:** Thu Feb 09, 2012 9:03 am

Great, thank you for letting me know.