
All times are UTC - 5 hours [ DST ]



 Post subject: Best way to deal with UTF-8 byte array fields?
PostPosted: Tue Mar 30, 2010 10:48 pm 
Regular

Joined: Mon Mar 10, 2008 6:40 pm
Posts: 114
It seems that our Hibernate-mapped fields (or at least their getters/setters) have to be byte arrays instead of Strings in order to use supplementary international characters with MySQL and Hibernate:
https://forum.hibernate.org/viewtopic.php?f=1&t=1003567&p=2428021

Byte arrays of UTF-8 characters at least reduce our memory usage: for the vast majority of our strings, the number of bytes of memory used now equals the number of characters, whereas Java Strings use more than twice as many bytes as they have characters. Since the data we're dealing with takes up a tremendous amount of memory, this is important.
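To make the size difference concrete, here is a minimal Java sketch comparing the payload of a UTF-8 byte[] with the two bytes per char of a String's UTF-16 representation. The class and method names are made up for illustration, object headers are ignored, and the 2-bytes-per-char figure assumes a JVM without compact strings (i.e. before Java 9).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hibernate or Hibernate Search.
final class Utf8SizeDemo {

    /** Payload size in bytes when the text is held as a UTF-8 byte[]. */
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Payload size in bytes when the text is held as UTF-16 chars (2 bytes each). */
    static int utf16Bytes(String text) {
        return text.length() * 2;
    }

    public static void main(String[] args) {
        String ascii = "hibernate search";
        System.out.println(utf8Bytes(ascii) + " bytes as UTF-8 vs "
                + utf16Bytes(ascii) + " bytes as UTF-16");
    }
}
```

For mostly-ASCII data the byte array is about half the size; for text dominated by 3-byte characters the advantage shrinks or reverses.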

Now I need to modify our Hibernate Search mappings. I thought it would be easy enough: just write a byte array field bridge that converts the byte arrays to Java Strings (and vice versa). But then my memory usage actually increases compared to before, because I still have the same Strings plus the new byte arrays. This can be a problem for us. I know the Strings will get garbage collected afterwards, but this kind of garbage collection has been unusably slow for us.
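For reference, the kind of bridge described here could look roughly like the sketch below. It is written as a plain class so that it runs without Hibernate Search on the classpath; in a real project it would presumably implement org.hibernate.search.bridge.TwoWayStringBridge, whose objectToString/stringToObject methods it mirrors. The class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Sketch only: mirrors the two methods of Hibernate Search's TwoWayStringBridge
// without depending on the library. The class name is an assumption.
final class Utf8BytesBridgeSketch {

    /** byte[] (UTF-8) to String for indexing; allocates a new String on every call. */
    public String objectToString(Object object) {
        if (object == null) {
            return null;
        }
        return new String((byte[]) object, StandardCharsets.UTF_8);
    }

    /** String read from the index back to a UTF-8 byte[] for the entity property. */
    public Object stringToObject(String stringValue) {
        if (stringValue == null) {
            return null;
        }
        return stringValue.getBytes(StandardCharsets.UTF_8);
    }
}
```

Every call to objectToString creates a temporary String next to the existing byte array, which is exactly the extra garbage this post is worried about.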

So my question is, is there a better way to deal with UTF-8 byte arrays with Hibernate Search? Can Hibernate Search deal with them more directly, maybe without a field bridge?


 Post subject: Re: Best way to deal with UTF-8 byte array fields?
PostPosted: Wed Mar 31, 2010 3:13 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
Sorry, all internal code deals with Strings, so the quick answer is no.

I did develop a large-scale search application needing full internationalization; it used UTF-8 on MySQL, Hibernate and Hibernate Search and didn't have encoding problems. I wonder what's different. Unfortunately I'm no longer working for that company and can't check out the sources; I don't remember which kind of columns were used, but we definitely mapped to String properties in the Hibernate mappings.

I guess you'll have to work a bit on fine-tuning garbage collection; that won't work miracles, but it can help a lot in some cases.

Indexing large amounts of text is definitely not a lightweight operation, and we're interested in any means to speed it up. It would be interesting to benchmark a change to bytes (many changes needed), but I'll warn you that Lucene is very smart in how it handles Strings, so I wouldn't presume it would be more efficient before doing some serious benchmarking.
Another problem is that in this case the only well-performing solution would be to always map properties as byte[], which is not desirable in a not-so-low-level language like Java. I'd prefer to keep the best-performing path tuned for String, if it is possible to solve your problem that way.

_________________
Sanne
http://in.relation.to/


 Post subject: Re: Best way to deal with UTF-8 byte array fields?
PostPosted: Wed Mar 31, 2010 5:46 am 
Regular

Joined: Mon Mar 10, 2008 6:40 pm
Posts: 114
Thank you for the quick response, Sanne. The basic problem with Hibernate and MySQL is that MySQL only supports a subset of Unicode. You can declare a MySQL column as a varchar in UTF-8, but if a character requires 4 bytes in UTF-8, as some Chinese and Japanese characters do, then MySQL gives you an error when you try to persist it. In your case, maybe you didn't care about those 4-byte (aka supplementary) characters, so varchar was fine.
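To illustrate which characters are affected, here is a small Java sketch that reports how many UTF-8 bytes a code point needs and whether a string contains any supplementary code point (above U+FFFF). The class and method names are hypothetical; such a check could be used to spot data that a 3-byte UTF-8 varchar column would reject.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for inspecting text before persisting it.
final class SupplementaryCharDemo {

    /** Number of bytes the given code point occupies when encoded as UTF-8. */
    static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if the text contains at least one code point above U+FFFF. */
    static boolean hasSupplementary(String text) {
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        int common = 0x4E2D; // a common CJK ideograph inside the Basic Multilingual Plane
        int rare = 0x20000;  // first code point of CJK Extension B, outside the BMP
        System.out.println(utf8Length(common) + " UTF-8 bytes, "
                + Character.charCount(common) + " Java char(s)");
        System.out.println(utf8Length(rare) + " UTF-8 bytes, "
                + Character.charCount(rare) + " Java char(s)");
    }
}
```

Note that a supplementary character also occupies two Java chars (a surrogate pair), so String.length() overcounts such characters.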

So full internationalization with MySQL means replacing varchar fields with varbinary fields. No problem: just set the connection parameter to UTF-8 and you can still write strings to the varbinary fields and they'll be persisted properly, and you can read strings back as well. The problem is with Hibernate, which for some reason can't read the strings correctly. While entities are persisted to varbinary fields properly, reading them causes the supplementary characters to be replaced with Unicode "replacement" characters. I guess Hibernate sees that the type of the field is varbinary and gets confused (we set Hibernate to read UTF-8).

The only way we could find to make Hibernate work is to map the varbinary fields to byte arrays. Oracle told us that we can compile MySQL 6.0 from source and that it finally does support full internationalization. Unfortunately that's not an option for us, but for Hibernate Search I guess it's not worth worrying about this. Either Hibernate Core will fix this eventually or MySQL 6.0 will eventually be final.

Now if there's a workaround or fix for Hibernate Core, I'd love to hear it! We've posted on their forum, Stack Overflow and countless other places...


 Post subject: Re: Best way to deal with UTF-8 byte array fields?
PostPosted: Wed Mar 31, 2010 8:39 am 
Hibernate Team

Joined: Fri Oct 05, 2007 4:47 pm
Posts: 2536
Location: Third rock from the Sun
I quickly checked JIRA and couldn't find an issue about this. It should be reported if you would like a developer to look into it. Maybe it's even trivial, but the core developers are extremely busy and won't prioritize the many issues on the forums, as 99% of them are user errors, especially when they're hard to reproduce.

I'm sure you understand it's very time-consuming to try to reproduce every issue and then find out that somebody should have read the manual better, or is using a years-old version.

The best strategy to get this fixed quickly is to open an issue with an attached testcase. You've seen how quickly it went with the previous error; Emmanuel and others do care a lot about quality, but time is limited, so we have to help by making sure things are reported correctly.

So could you try creating a testcase? Some hints:
* The testcase shouldn't be a project but a patch compatible with trunk, so the developers can quickly apply it and possibly commit your testcase to prevent regressions.
* The test should FAIL, so that it clearly points out what you expect.

Have a look at how I refactored your patch. You might remember I had some trouble understanding what you wanted; I had to convert it to Search's conventions and then make it fail.
Your testcase is now committed along with your minimal test model, which makes sure future versions of Search won't fail on this anymore :)

_________________
Sanne
http://in.relation.to/


 Post subject: Re: Best way to deal with UTF-8 byte array fields?
PostPosted: Wed Mar 31, 2010 12:34 pm 
Regular

Joined: Mon Mar 10, 2008 6:40 pm
Posts: 114
Excellent suggestion, Sanne. I'm not yet familiar with how best to work with open source projects to add enhancements or fix bugs. I'm going to open a JIRA issue.


 Post subject: Re: Best way to deal with UTF-8 byte array fields?
PostPosted: Wed May 22, 2013 11:18 am 
Newbie

Joined: Wed May 22, 2013 11:11 am
Posts: 1
Hope this helps others who stop by this topic searching for a solution.
1:
Code:
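import java.nio.charset.StandardCharsets;
import java.sql.ResultSet;
import java.sql.SQLException;

// Reads the column as raw bytes and decodes them as UTF-8 instead of relying
// on the driver's own String conversion. Assumes a Hibernate version whose
// StringType still exposes get(ResultSet, String).
public class ContentStringType extends org.hibernate.type.StringType {

   private static final long serialVersionUID = 1L;

   public Object get(ResultSet rs, String name) throws SQLException {
      byte[] raw = rs.getBytes(name);
      // getBytes returns null for SQL NULL, so guard against a NullPointerException
      return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
   }

}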


2:
Entity:
Code:
    @Type(type="com.company.hibernate.entity.ContentStringType")
    @Column(name="content")
    public String getContent() {
        return this.content;
    }


3: Make sure your DB URL has useUnicode=true&characterEncoding=utf-8
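As a rough illustration of step 3, the sketch below builds such a MySQL JDBC URL. The host, port and database name are placeholders and the helper class is hypothetical; only the two query parameters come from the step above.

```java
// Hypothetical helper; host, port and database are placeholders.
final class JdbcUrlSketch {

    /** Builds a MySQL JDBC URL that asks the driver to use UTF-8 for character data. */
    static String mysqlUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?useUnicode=true&characterEncoding=utf-8";
    }

    public static void main(String[] args) {
        System.out.println(mysqlUrl("localhost", 3306, "mydb"));
    }
}
```

If the URL goes into an XML file such as hibernate.cfg.xml, the & has to be written as &amp; there.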

