Best way to deal with UTF-8 byte array fields?

mueller · **Joined:** Mon Mar 10, 2008 6:40 pm **Posts:** 114

It seems that our Hibernate-mapped fields (or at least the getters/setters) have to be byte arrays instead of Strings in order to use supplementary international characters with MySQL and Hibernate:
https://forum.hibernate.org/viewtopic.php?f=1&t=1003567&p=2428021

So byte arrays of UTF-8 characters at least reduce our memory usage for strings: in the vast majority of cases (for us) the number of bytes used equals the number of characters, whereas Java Strings use more than twice as many bytes as they have characters. Since the data we're dealing with takes up a tremendous amount of memory, this is important.
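
A rough sketch of that arithmetic, assuming a JVM whose String is backed by a char[] at two bytes per char and ignoring object and array headers. The class and method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

final class Utf8Footprint {

    /** Payload bytes needed to hold the text as a UTF-8 byte array. */
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Payload bytes needed for the same text as UTF-16 code units (2 bytes per Java char). */
    static int utf16Bytes(String text) {
        return text.length() * 2;
    }

    public static void main(String[] args) {
        String mostlyAscii = "hibernate search";
        // For pure ASCII text the UTF-8 size is exactly half the UTF-16 size.
        System.out.println("UTF-8:  " + utf8Bytes(mostlyAscii) + " bytes");
        System.out.println("UTF-16: " + utf16Bytes(mostlyAscii) + " bytes");
    }
}
```

The saving shrinks as the share of non-ASCII text grows, since characters above U+007F take two to four bytes in UTF-8.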

Now I need to modify our Hibernate Search mappings. I thought, easy enough, just make a byte array field bridge and convert the byte arrays to Java Strings (and vice versa). But now my memory usage actually increases compared to before, because I'll still have the same Strings plus the new byte arrays. This can be a problem for us. I know the Strings will get garbage collected afterwards, but this kind of garbage collection has been unusably slow for us.
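
For reference, the conversion such a bridge would perform can be sketched with the JDK alone. This is only an illustration: the class and method names are hypothetical, and the Hibernate Search wiring is left out to keep it dependency-free. In a real bridge these two methods would presumably sit behind `objectToString` and `stringToObject` of a `TwoWayStringBridge`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8FieldConversion {

    /** Decodes the stored UTF-8 bytes into the String handed to the index. */
    static String toIndexString(byte[] utf8) {
        return utf8 == null ? null : new String(utf8, StandardCharsets.UTF_8);
    }

    /** Encodes a String read back from the index into the entity's byte[] form. */
    static byte[] toFieldValue(String indexed) {
        return indexed == null ? null : indexed.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a" followed by U+20000, a supplementary character (surrogate pair in Java)
        String original = "a\uD840\uDC00";
        byte[] stored = toFieldValue(original);
        String decoded = toIndexString(stored);
        System.out.println("round trip ok: " + original.equals(decoded));
        System.out.println("stored bytes:  " + Arrays.toString(stored));
    }
}
```

Each call allocates a fresh String or byte[], which is exactly the extra garbage described above.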

So my question is, is there a better way to deal with UTF-8 byte arrays with Hibernate Search? Can Hibernate Search deal with them more directly, maybe without a field bridge?

sanne.grinovero · **Posted:** Wed Mar 31, 2010 3:13 am

Sorry, all internal code deals with Strings, so the quick answer is no.

I did develop a large scale search application needing full internationalization; it used utf8 on MySQL, Hibernate and Hibernate Search and didn't have encoding problems. I wonder what's different. Unfortunately I'm not working for that company anymore and can't check out the sources; I don't remember which kind of columns were used, but we definitely mapped to String properties in the Hibernate mappings.

I guess you'll have to work a bit on garbage collection fine-tuning; that won't do miracles but can help a lot in some cases.

Indexing large amounts of text is definitely not a lightweight operation and we're interested in any means to speed it up. It would be interesting to benchmark a change to bytes (many changes needed), but I'll warn you that Lucene is very smart in handling Strings, so I wouldn't presume it would be more efficient before doing some serious benchmarking.
Another problem is that in this case the only well-performing solution would be to always map properties as byte[], which is not desirable in a not-so-low-level language like Java. If it's possible to solve your problem, I'd prefer a solution that performs best when tuned for String.

mueller · **Joined:** Mon Mar 10, 2008 6:40 pm **Posts:** 114

Thank you for the quick response, Sanne. The basic problem with Hibernate and MySQL is that MySQL only supports a subset of Unicode. You can specify a MySQL column as a varchar in UTF-8, but if a character requires 4 bytes in UTF-8, as the less common Chinese and Japanese characters outside the Basic Multilingual Plane do, MySQL will give you an error when trying to persist it. In your case, maybe you didn't care about those 4-byte (aka supplementary) characters and so varchar was fine.
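
To make the 3-byte versus 4-byte distinction concrete, here is a small sketch using only the JDK. The class and method names are hypothetical; it only measures UTF-8 encoding lengths and scans for supplementary code points, and says nothing about how a particular MySQL version reacts to them.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Width {

    /** Number of bytes a single Unicode code point occupies in UTF-8. */
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if the text contains a supplementary code point, i.e. one needing 4 bytes in UTF-8. */
    static boolean needsFourBytes(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return true;
            }
            // Advance by 1 char for BMP code points, 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("U+4E2D:  " + utf8Length(0x4E2D) + " bytes");   // common CJK ideograph, in the BMP
        System.out.println("U+20000: " + utf8Length(0x20000) + " bytes");  // CJK Extension B, supplementary
    }
}
```

A check like `needsFourBytes` could be run over existing data to see how often the problematic characters actually occur.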

So full internationalization with MySQL means replacing varchar fields with varbinary fields. No problem, just set the connection parameter to UTF-8 and you can still write strings to the varbinary fields and they'll be persisted properly. And you can read strings as well. The problem is with Hibernate. Hibernate, for some reason, can't read the strings right. While entities are persisted to varbinary fields properly, reading them causes the supplementary characters to be replaced with Unicode "replacement" characters. I guess Hibernate sees that the type of the field is varbinary and gets confused (we set Hibernate to read UTF-8).

The only way to make Hibernate work, that we could figure out, is to map the varbinary fields to byte arrays. Oracle told us that we can compile MySQL 6.0 from sources and that it finally does support full internationalization. Unfortunately that's not an option for us, but for Hibernate Search I guess it's not worth worrying about this. Either Hibernate Core will fix this eventually or MySQL 6.0 will eventually be final.

Now if there's a workaround or fix for Hibernate Core I'd love to hear it! We've posted on their forum, Stack Overflow and countless other places...

sanne.grinovero · **Posted:** Wed Mar 31, 2010 8:39 am

I quickly checked on JIRA and couldn't find an issue about this. This should be reported if you would like some developer to look into it. Maybe it's even trivial, but core developers are extremely busy and won't prioritize the many issues on the forums, as they're 99% user errors, especially if they're hard to reproduce.

I'm sure you understand it's very time consuming to try reproducing all issues and then find out somebody should have read the manual better, or is using years-old versions.

The best strategy to get this fixed quickly is to open an issue with an attached testcase. You've seen how quickly it happened with the previous error; Emmanuel and others do care a lot about quality, but time is limited so we have to help by making sure stuff is reported correctly.

So could you try creating a testcase? Some hints:
- The testcase shouldn't be a project but a patch compatible with trunk, so the developers can quickly apply it and possibly commit your testcase to prevent regressions.
- The test should FAIL, so that it clearly points out what you expect.

Have a look at how I refactored your patch; you might remember I had some trouble understanding what you wanted, so I had to convert it to Search's conventions and then make it fail.
Your testcase is committed now along with your minimal test model, which makes sure future versions of Search won't fail on this anymore :)

mueller · **Joined:** Mon Mar 10, 2008 6:40 pm **Posts:** 114

Excellent suggestion, Sanne. I'm not yet familiar with how best to work with open source projects to contribute enhancements or bug fixes. I'm going to open a JIRA issue.

sharktank · **Joined:** Wed May 22, 2013 11:11 am **Posts:** 1

Hope this helps others who stop by this topic searching for a solution.
1:

Code:

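import java.nio.charset.StandardCharsets;
import java.sql.ResultSet;
import java.sql.SQLException;

// Reads the column as raw bytes and decodes them as UTF-8,
// instead of relying on the default String read for the column.
public class ContentStringType extends org.hibernate.type.StringType {

    private static final long serialVersionUID = 1L;

    public Object get(ResultSet rs, String name) throws SQLException {
        byte[] bytes = rs.getBytes(name);
        // getBytes returns null for SQL NULL; avoid a NullPointerException
        if (bytes == null) {
            return null;
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

}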

2:
Entity:

Code:

@Type(type="com.company.hibernate.entity.ContentStringType")
@Column(name="content")
public String getContent() {
    return this.content;
}

3: Make sure your DB URL has useUnicode=true&characterEncoding=utf-8
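
As a minimal sketch of step 3, the URL could be assembled like this. The host, port and database name are placeholders, and the helper class is hypothetical; only the two query parameters come from the step above.

```java
final class MysqlUrl {

    /** Builds a MySQL JDBC URL that carries the two encoding parameters from step 3. */
    static String utf8Url(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?useUnicode=true&characterEncoding=utf-8";
    }

    public static void main(String[] args) {
        // Placeholder connection details, replace with your own.
        System.out.println(utf8Url("localhost", 3306, "mydb"));
    }
}
```

If the URL lives in an XML file such as hibernate.cfg.xml, the `&` has to be written as `&amp;`.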