Search: How can I exclude numbers from my index?

Jova · **Joined:** Fri May 13, 2011 1:01 pm **Posts:** 16

Hi everyone,
I'm using Hibernate Search 3.4 and I would like to exclude numbers from my index. Is there a way to do it using standard filters? I've checked the documentation (and the various filters in org.apache.solr.analysis) but found nothing, so I started developing a custom filter. Here is the code:

Code:

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NumbersRemoverFilter extends FilteringTokenFilter {
   private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
   private static Log log = LogFactory.getLog(NumbersRemoverFilter.class.getName());
   
   public NumbersRemoverFilter(boolean enablePositionIncrements, TokenStream input) {
      super(enablePositionIncrements, input);
   }

   @Override
   protected boolean accept() throws IOException {
      for (int i = 0; i < termAtt.buffer().length; i++) {
         if (Character.isLetter(termAtt.charAt(i))) {
            log.debug("accept true " + new String(termAtt.buffer()) + " length = " + termAtt.buffer().length);
            return true;
         }
      }
      
      log.debug("accept false " + new String(termAtt.buffer()) + " length = " + termAtt.buffer().length);
      return false;
   }
}

and this is the factory

Code:

import org.apache.lucene.analysis.TokenStream;

public class NumbersRemoverFilterFactory extends org.apache.solr.analysis.BaseTokenFilterFactory {
   public TokenStream create(TokenStream input) {
      return new NumbersRemoverFilter(true, input);
   }
}

I've used the filter in my analyzer in this way

Code:

...
@AnalyzerDef(name = "customanalyzer",
  tokenizer = 
     @TokenizerDef(factory = StandardTokenizerFactory.class), 
     filters = {
       @TokenFilterDef(factory = StandardFilterFactory.class),
       @TokenFilterDef(factory = NumbersRemoverFilterFactory.class),
       @TokenFilterDef(factory = LengthFilterFactory.class, params = {
          @Parameter(name="min", value="4" ),
          @Parameter(name="max", value="20" )
       }),
       @TokenFilterDef(factory = LowerCaseFilterFactory.class),
       @TokenFilterDef(factory = ISOLatin1AccentFilterFactory.class),
       @TokenFilterDef(factory = StopFilterFactory.class, params = {
            @Parameter(name="words", value= "stoplist.properties" )
        })
  })
@Analyzer(definition = "customanalyzer")
...

With my custom filter, searches no longer return any results :-(
Is there something wrong with my filter?

There's also something I don't understand: in the filter I log each token, and I expect to see every token of my document logged, but that doesn't happen and I see only some tokens logged. Am I missing something?

Thanks in advance,
Andrea

sanne.grinovero · **Posted:** Thu Oct 06, 2011 11:38 am

Hi Andrea,
the general approach you're using looks correct, so there must be a subtle mistake in the code. Did you try unit testing your implementation to check that it rejects only what you expect?

Quote:

There's also something I don't understand: in the filter I log each token, and I expect to see every token of my document logged, but that doesn't happen and I see only some tokens logged. Am I missing something?

You have to consider that there are other TokenFilters enabled. There is a whole stack of them, and only those tokens which are accepted by the other token filters are passed to your implementation. To test your custom filter properly, you should remove all the other filters so that its behaviour is easier to understand.

Code:

log.debug("accept true " + new String(termAtt.buffer()) + " length = " + termAtt.buffer().length);

That code is going to kill your performance, as it might be invoked millions of times during an indexing process.
Make sure to always check if debug is enabled before creating the log message, use if (log.isDebugEnabled()) log.debug(...);
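A minimal sketch of that guard. It assumes java.util.logging instead of commons-logging so that it runs without extra jars; with commons-logging the equivalent check is log.isDebugEnabled(). The class, the counter and the method names are made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class GuardedLoggingSketch {
    static final Logger LOG = Logger.getLogger(GuardedLoggingSketch.class.getName());

    // Counts how often the (potentially expensive) message string is really built.
    static int messagesBuilt = 0;

    // Hypothetical helper that builds the debug message for one token.
    static String describe(char[] buffer, int length) {
        messagesBuilt++;
        return "accept true " + new String(buffer, 0, length) + " length = " + length;
    }

    static void logToken(char[] buffer, int length) {
        // FINE is roughly the java.util.logging counterpart of "debug".
        // The message is only concatenated when that level is enabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(describe(buffer, length));
        }
    }

    public static void main(String[] args) {
        char[] token = "CANTIERE".toCharArray();
        LOG.setLevel(Level.INFO); // debug output disabled
        logToken(token, token.length);
        System.out.println("messages built while debug is off: " + messagesBuilt);
    }
}
```

Without the guard, the string concatenation would run for every token even when the message is thrown away.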

Jova · **Joined:** Fri May 13, 2011 1:01 pm **Posts:** 16

Hi Sanne,
thank you for your reply.

I didn't try real unit testing; I'm indexing a table with a small number of records, with some log statements (thanks for your suggestion, I will wrap them with "if (log.isDebugEnabled())").

I tried indexing with only my custom filter and the result is the same (no results for any search). Sorry, in my first post I forgot to include the log of the indexing process; I've now noticed that my NumbersRemoverFilter logs only one record for every flush. My indexing code is

Code:

final Integer BATCH_SIZE = 500;
FullTextSession fullTextSession = ...;

ScrollableResults results = fullTextSession
   .createCriteria(MyIndexedClass.class)
       .setFetchSize(BATCH_SIZE)
       .scroll(ScrollMode.FORWARD_ONLY);

int index = 0;
MyIndexedClass obj = null;

while( results.next() ) {
  ++index;
  obj = (MyIndexedClass)results.get(0);
    
  fullTextSession.index(obj);
  if ((index % BATCH_SIZE) == 0) {
    fullTextSession.flushToIndexes();
    fullTextSession.clear();
    log.info("Flush !! Indexed " + index + " records.");
  }
}

So it seems that my TokenFilter is called only once for each flush: in the log I can see each token of a single indexed record, for example

Code:

...
Oct 2011 09:22:12,492 -- [IndexBuilder] -- Flush !! Indexed 1000 records.
Oct 2011 09:22:12,499 -- [NumbersRemoverFilter] -- accept true VS             length = 14
Oct 2011 09:22:12,499 -- [NumbersRemoverFilter] -- accept true CANTIERE       length = 14
Oct 2011 09:22:12,499 -- [NumbersRemoverFilter] -- accept true METROPOLITANA  length = 14
Oct 2011 09:22:12,499 -- [NumbersRemoverFilter] -- accept true LOTTOPOLITANA  length = 14
Oct 2011 09:22:12,501 -- [IndexBuilder] -- Flush !! Indexed 1500 records.
...

I've checked the table in the database and the corresponding field contains the string

VS. CANTIERE METROPOLITANA LOTTO 2

So that raises other questions for me:

Why is my TokenFilter called only once for each flush?
termAtt.buffer() in my accept() method is dirty: if the previous token was longer, I find part of it as trailing garbage, and the buffer is always padded to 14 characters when the token is shorter than 14. Is there something wrong with the way I index?

Thanks,
Andrea

sanne.grinovero · **Posted:** Fri Oct 07, 2011 6:38 am

Quote:

termAtt.buffer() in my accept() method is dirty: if the previous token was longer, I find part of it as trailing garbage, and the buffer is always padded to 14 characters when the token is shorter than 14. Is there something wrong with the way I index?

Right, now I see the bug in your code:
you should use

Code:

termAtt.length()

NOT

Code:

termAtt.buffer().length

As you have seen already, the buffers are reused for efficiency, so the length of the buffer might be greater than the length of the current term, which is the value you need to consider.
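A plain-Java sketch of why the loop bound matters, with no Lucene dependency. ReusedTerm and containsLetter are hypothetical stand-ins: in the real filter the two arguments would be termAtt.buffer() and termAtt.length().

```java
final class TermBufferSketch {
    /** Stand-in for a reused term buffer: a char array plus the valid length. */
    static final class ReusedTerm {
        final char[] buffer = new char[14];
        int length;

        void set(String text) {
            // Only the first text.length() chars are overwritten;
            // whatever the previous token left behind stays in place.
            text.getChars(0, text.length(), buffer, 0);
            length = text.length();
        }
    }

    /** Returns true if the first 'length' chars contain at least one letter. */
    static boolean containsLetter(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            if (Character.isLetter(buffer[i])) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ReusedTerm term = new ReusedTerm();
        term.set("METROPOLITANA");
        term.set("2"); // the buffer now holds '2' followed by stale letters

        // Bounded by the valid length: only the current token is inspected.
        System.out.println(containsLetter(term.buffer, term.length));
        // Bounded by the array length: the stale letters are inspected too.
        System.out.println(containsLetter(term.buffer, term.buffer.length));
    }
}
```

The two calls differ only in the bound, which is the whole difference between the correct and the incorrect version of accept().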

Quote:

Why is my TokenFilter called only once for each flush?

Interesting. Doesn't it depend on your data? Are you sure there are more values to which the NumbersRemoverFilter is applied? It's not going to be invoked on null fields or empty strings, and if the field comes from a relation, maybe some entities don't have this relation?

Is this version 3.4.0? And which Lucene version?

Jova · **Joined:** Fri May 13, 2011 1:01 pm **Posts:** 16

Great, Sanne! THANK YOU VERY MUCH!!
I've fixed the bug and restored the original version with the other TokenFilters, and now:

my TokenFilter logs a lot more than once between two flushes (of course I didn't count but I would say it's called for every record)
searching now works

I forgot to mention that at the end of the logged buffer I see "strange" ^@ characters, e.g. if I open the log in vi I see

Code:

OFFICINA^@^@^@^@^@^@

Probably accessing those strange characters, in my bugged filter, causes an error (in my test Character.isLetter(termAtt.charAt(i))) and this breaks the indexing process: that would explain why I see only one processing of my TokenFilter and why I can't find anything in the index. Is it possible?
I'm using Hibernate Search 3.4 and Lucene 3.3

Thank you very much again,
Andrea

sanne.grinovero · **Posted:** Fri Oct 07, 2011 11:40 am

Quote:

OFFICINA^@^@^@^@^@^@

Did you fix the logging string too, using the correct length()?
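A small sketch of the difference for the log message, again in plain Java with a hypothetical termText helper. An unused char in a fresh array is '\0', which vi displays as ^@. In the filter the arguments would be termAtt.buffer() and termAtt.length().

```java
final class TermTextSketch {
    /** Builds the term text from the valid part of the buffer only. */
    static String termText(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        char[] buffer = new char[14]; // unused slots are '\0'
        String token = "OFFICINA";
        token.getChars(0, token.length(), buffer, 0);

        String whole = new String(buffer);                  // includes the trailing '\0' chars
        String valid = termText(buffer, token.length());    // just the token

        System.out.println(whole.length() + " vs " + valid.length());
    }
}
```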

Quote:

Probably accessing those strange characters, in my bugged filter, causes an error (in my test Character.isLetter(termAtt.charAt(i))) and this breaks the indexing process: that would explain why I see only one processing of my TokenFilter and why I can't find anything in the index. Is it possible?

Yes it is, but didn't you see any error reported in the logs? Indexing is performed in background threads, so exceptions won't be rethrown to your application, but they should be logged.
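To illustrate the point about background threads, here is a generic java.util.concurrent sketch. It is not Hibernate Search's actual backend code, and runInBackground is a hypothetical helper. An exception thrown inside a submitted task is stored in the Future and never reaches the submitting thread unless something asks for it.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BackgroundFailureSketch {
    /** Runs the task on a worker thread and returns what it threw, or null. */
    static Throwable runInBackground(Runnable task) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> future = executor.submit(task);
            try {
                // Without this call the failure would stay hidden in the Future.
                future.get();
                return null;
            } catch (ExecutionException e) {
                return e.getCause(); // the exception thrown on the worker thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return e;
            }
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        Throwable failure = runInBackground(() -> {
            "2".charAt(1); // out of range, like reading past the valid term length
        });
        // The calling thread keeps running; the failure is only visible here
        // because runInBackground() fetched it explicitly.
        System.out.println("worker failed with: " + failure);
    }
}
```

This is why a failing filter can leave the index empty while the application code appears to complete normally.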

Jova · **Joined:** Fri May 13, 2011 1:01 pm **Posts:** 16

s.grinovero wrote:

Quote:

OFFICINA^@^@^@^@^@^@

Did you fix the logging string too, using the correct length()?

No, I log the whole buffer with new String(termAtt.buffer()).

s.grinovero wrote:

Quote:

Probably accessing those strange characters, in my bugged filter, causes an error (in my test Character.isLetter(termAtt.charAt(i))) and this breaks the indexing process: that would explain why I see only one processing of my TokenFilter and why I can't find anything in the index. Is it possible?

Yes it is, but didn't you see any error reported in the logs? Indexing is performed in background threads, so exceptions won't be rethrown to your application, but they should be logged.

No, I didn't see any error in my log, so I surrounded my code with a try/catch block

Code:

    try {
        ...
    } catch(Throwable t) {
        log.error("Error in accept", t);
    }

and now in the log I can see

Code:

java.lang.IndexOutOfBoundsException
  at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.charAt(CharTermAttributeImpl.java:133)
  at com.gformula.find.util.NumbersRemoverFilter.accept(NumbersRemoverFilter.java:23)
  at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:49)
  at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:58)
  at org.apache.lucene.analysis.LowerCaseFilter.incrementToken(LowerCaseFilter.java:60)
  at org.apache.lucene.analysis.ISOLatin1AccentFilter.incrementToken(ISOLatin1AccentFilter.java:46)
  at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:58)
  at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:135)
  at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:278)
  at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:766)
  at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2056)
  at org.hibernate.search.backend.impl.lucene.works.AddWorkDelegate.performWork(AddWorkDelegate.java:76)
  at org.hibernate.search.backend.impl.lucene.PerDPQueueProcessor.run(PerDPQueueProcessor.java:106)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:619)

I haven't explicitly enabled Hibernate Search logging, only log4j for my own classes. Perhaps HS can't find a place to log to?
I'll try to configure Hibernate Search to log to my file.

Thanks,
Andrea