filter special chars

InteroStrubbl · **Joined:** Tue Jul 14, 2009 6:13 am **Posts:** 12

Hi, is there any way to filter special characters like "Ø" out of the fields I want to index?
So far I have tested it with StopFilter("Ø"), but it doesn't work: when I search the index with Luke for this symbol, I still get results.

Can anyone help, please?

sanne.grinovero · **Posted:** Thu Aug 13, 2009 12:33 pm

Hi,
a StopFilter only removes tokens that match a stop word completely, not partially.
So a StopFilter configured with, for example, the letter "A" will not remove words that contain the character "A" along with other characters.
Putting your symbol in the stop list will therefore only remove it when it stands alone as a token. Is that what you need?
If you want to convert accented characters (like "èé" to "ee"), there are filters available in the Lucene and Solr distributions. If you want to remove characters from inside tokens (like "WØhatØ is thiØs strØnge chØr?" to "What is this strnge chr?"), you'll need to write a custom TokenFilter.
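
To make the difference concrete, here is a minimal plain-Java sketch. It does not use the Lucene API: the class `StopVersusStripSketch` and its two helper methods are made up for illustration, and they only imitate whole-token stop matching versus removing a character from inside tokens. The character is written as the escape `\u00D8` so the example does not depend on the source file's encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; hypothetical helpers, not Lucene classes.
final class StopVersusStripSketch {

    // Imitates stop-word matching: a token is dropped only if it equals a stop word exactly.
    static List<String> dropStopTokens(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Imitates a custom character-removing filter: deletes every occurrence of the
    // unwanted character inside each token and drops tokens that end up empty.
    static List<String> stripChars(List<String> tokens, char unwanted) {
        List<String> cleaned = new ArrayList<>();
        for (String token : tokens) {
            String stripped = token.replace(String.valueOf(unwanted), "");
            if (!stripped.isEmpty()) {
                cleaned.add(stripped);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // "\u00D8" is the character Ø.
        List<String> tokens = Arrays.asList("W\u00D8hat\u00D8", "\u00D8", "is", "thi\u00D8s");
        // Only the standalone Ø token disappears; Ø inside words stays.
        System.out.println(dropStopTokens(tokens, Collections.singleton("\u00D8")));
        // Every Ø disappears, including the ones inside words.
        System.out.println(stripChars(tokens, '\u00D8'));
    }
}
```

In a real analyzer chain the second behaviour would live in a custom TokenFilter, as described above; the sketch only shows the intended effect on the token text.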

InteroStrubbl

s.grinovero wrote:

Hi,
a StopFilter only removes tokens that match a stop word completely, not partially.
So a StopFilter configured with, for example, the letter "A" will not remove words that contain the character "A" along with other characters.

At the moment I only need to solve this one problem: there are standalone characters I want to remove, like the special character I mentioned. But putting this symbol in the stop list doesn't remove it from the index, whereas putting ordinary letters in the stop list works. Perhaps I have an encoding problem? If so, where should I start looking?

s.grinovero wrote:

Putting your symbol in the stop list will therefore only remove it when it stands alone as a token. Is that what you need?
If you want to convert accented characters (like "èé" to "ee"), there are filters available in the Lucene and Solr distributions. If you want to remove characters from inside tokens (like "WØhatØ is thiØs strØnge chØr?" to "What is this strnge chr?"), you'll need to write a custom TokenFilter.

Do you have a how-to that explains how to write my own TokenFilter?

thanks :)

sanne.grinovero · **Posted:** Fri Aug 14, 2009 12:47 pm

Yes, you might have an encoding problem, or your analyzer may be converting the unwanted character into something different, so it no longer matches your stop list and is not discarded.
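
One way to check both possibilities is to print the Unicode code points of the stop-list entry and of the token as it actually reaches the index. The following is a small plain-Java sketch under that assumption; `CodePointDumpSketch` and `describeCodePoints` are hypothetical names, and the mis-decoded string is constructed here purely to show what a UTF-8/ISO-8859-1 mix-up looks like.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical diagnostic helper; not part of Lucene or Luke.
final class CodePointDumpSketch {

    // Renders each code point of the text as U+XXXX, separated by spaces.
    static String describeCodePoints(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String expected = "\u00D8"; // the character Ø
        // Simulate reading the UTF-8 bytes of Ø with the wrong charset.
        String misdecoded = new String(
                expected.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);

        System.out.println(describeCodePoints(expected));                 // prints U+00D8
        System.out.println(describeCodePoints(misdecoded));               // prints U+00C3 U+0098
        System.out.println(describeCodePoints(expected.toLowerCase()));   // prints U+00F8
    }
}
```

If the token in the index and the stop-list entry print different code points, the stop list cannot match, whatever the cause.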

To build a TokenFilter correctly, I suggest you look into the Lucene and Solr sources. They contain lots of examples, and you might find something close to what you need in the sandbox or contrib directories of these projects.

If that doesn't help, I suggest you ask in the Lucene forums, as they might know a better solution than I do :)

InteroStrubbl

The answer is:
I have to put the lower-cased variant of this symbol (ø instead of Ø) into my stop list, because the LowerCaseFilter runs right before the StopFilter :) Now it works.
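
For anyone hitting the same thing, the effect of the filter order can be reproduced with a few lines of plain Java. This is only a simulation of "lowercase first, then stop-word matching", not the actual Lucene classes; `LowercaseThenStopSketch` and `analyze` are names invented for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simulates the filter order described above; hypothetical helper, not Lucene code.
final class LowercaseThenStopSketch {

    // Lowercases each token first, then drops it if the lowercased form is a stop word.
    static List<String> analyze(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String lowered = token.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(lowered)) {
                kept.add(lowered);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "\u00D8" is Ø and "\u00F8" is ø.
        List<String> tokens = Arrays.asList("Foo", "\u00D8", "Bar");

        // Upper-case Ø in the stop list never matches, because tokens are already lowercased.
        System.out.println(analyze(tokens, Collections.singleton("\u00D8")));
        // Lower-case ø matches, so the standalone symbol is removed.
        System.out.println(analyze(tokens, Collections.singleton("\u00F8")));
    }
}
```

The same reasoning explains why ordinary letters seemed to work: if they were entered in lower case, they already matched the lowercased tokens.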