 Post subject: generic BLOB / CLOB strategy for Oracle, MySQL and Postgres
PostPosted: Sat Nov 13, 2004 5:18 am 
Newbie

Joined: Thu Jan 29, 2004 5:33 am
Posts: 9
Hi all,

I'm another Hibernate user who resents the incompatibility that the major databases show with respect to CLOBs and BLOBs (don't feel offended if your favourite database is not in the subject line :)). Because the problems are generally the same for CLOBs and BLOBs, I will simply call them blobs, but the argument holds for both types.

Basically I got this crazy idea: why not build the clob/blob out of parts that are generic across the databases. This is what I propose:

--- idea ---

Store your large binary objects as a linked list made up of fixed-size byte arrays. The fixed-size byte arrays are of type="binary" in the Hibernate mapping.

--- example ---

There are many ways you could actually implement this, so I will try to make it more visual by describing _a_ way to do it:

Define a class (say "Image.java") and a mapping file ("Image.hbm"). The mapping defines a list of byte arrays (type="binary") in a field named blobParts.

Image.java has an additional getter method getImageBytes() that appends all blobParts into a single byte[]. The setter method setImage(byte[] imageBytes) splits the image byte array into parts that fit in blobParts (say 255 bytes each).
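To make this more concrete, here is a minimal sketch of what such a class could look like (untested, just to show the idea; the 255-byte part size is only the example value from above):

Code:
import java.util.ArrayList;
import java.util.List;

public class Image {

    // example part size; tune per database (see below)
    private static final int PART_SIZE = 255;

    // mapped by Hibernate as a list of type="binary"
    private List blobParts = new ArrayList();

    protected List getBlobParts() { return blobParts; }
    protected void setBlobParts(List blobParts) { this.blobParts = blobParts; }

    // not mapped: appends all blobParts into a single byte[]
    public byte[] getImageBytes() {
        int total = 0;
        for (int i = 0; i < blobParts.size(); i++) {
            total += ((byte[]) blobParts.get(i)).length;
        }
        byte[] result = new byte[total];
        int offset = 0;
        for (int i = 0; i < blobParts.size(); i++) {
            byte[] part = (byte[]) blobParts.get(i);
            System.arraycopy(part, 0, result, offset, part.length);
            offset += part.length;
        }
        return result;
    }

    // not mapped: splits the image into fixed-size parts
    public void setImage(byte[] imageBytes) {
        blobParts.clear();
        for (int offset = 0; offset < imageBytes.length; offset += PART_SIZE) {
            int len = Math.min(PART_SIZE, imageBytes.length - offset);
            byte[] part = new byte[len];
            System.arraycopy(imageBytes, offset, part, 0, len);
            blobParts.add(part);
        }
    }
}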

--- considerations ---

My first thought was that this might work but would perform very badly. I did some preliminary tests comparing variable-size blobs to linked-list composite blobs and found that the difference in performance is not large: somewhere between negligible and a factor of 2.

It does depend on the size of the byte array that you can use, which you can sneakily influence using the <column sql-type="..."> mapping element. To be honest, I have not tried *not* supplying a sql-type, so I don't know whether that will actually work. However, for all I care you could use type="string" and still make it work.

Using a database-specific column type for the Hibernate "binary" type does bring you one advantage: you can optimize the buffer size for your database:
- For Oracle you use "blob" and stick to the 4k limit.
- For Postgres you use "bytea" and you can set the buffer size as you please.
- For MySQL you use "longblob" and find that the buffer size is limited by some limit on packet size, around 0.5 MB.

Using these sql-types I achieved very acceptable performance results.
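As an illustration (table and column names invented for the example), the blobParts list could then be mapped like this, switching the sql-type per database:

Code:
<list name="blobParts" table="IMAGE_PART" cascade="all" lazy="true">
    <key column="IMAGE_ID"/>
    <index column="PART_NO"/>
    <element type="binary">
        <!-- pick one sql-type per database: -->
        <column name="PART_DATA" sql-type="blob"/>
        <!-- sql-type="bytea"    for Postgres -->
        <!-- sql-type="longblob" for MySQL    -->
    </element>
</list>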



Now the reason I use the sql-type is that you *can* do it and still have database-independent code. However, it will take a lot more code.

--- second idea ---

When you load the Hibernate mapping files (Image.hbm), do not just pass the class to the Hibernate Configuration, but make use of the ability to load the mapping via a stream. Load the file yourself and modify the sql-type at runtime to match the database you are using.

Now this sounds like a lot of work, and indeed I am still working on my personal strategy for this. I will post the code once I get it to work fully, but what I have done is to SAX-parse the .hbm files, look for the "column" element, and add the attributes pertinent to the database you use.
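As a rough sketch of the direction (this is not my finished code, and it uses DOM instead of SAX for brevity; it also assumes the mapping uses explicit <column> elements, and a real version should only touch the blob-part columns):

Code:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class MappingRewriter {

    // reads a .hbm stream, sets sql-type on every <column> element,
    // and returns a new stream for Configuration.addInputStream()
    public static InputStream rewriteSqlType(InputStream hbm, String sqlType)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // don't fetch the Hibernate DTD over the network while rewriting
        builder.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                return new InputSource(new StringReader(""));
            }
        });
        Document doc = builder.parse(hbm);

        NodeList columns = doc.getElementsByTagName("column");
        for (int i = 0; i < columns.getLength(); i++) {
            ((Element) columns.item(i)).setAttribute("sql-type", sqlType);
        }

        // serialize back, re-emitting the doctype so Hibernate can validate
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC,
                "-//Hibernate/Hibernate Mapping DTD 2.0//EN");
        t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM,
                "http://hibernate.sourceforge.net/hibernate-mapping-2.0.dtd");
        t.transform(new DOMSource(doc), new StreamResult(out));
        return new ByteArrayInputStream(out.toByteArray());
    }
}

You would then call configuration.addInputStream(MappingRewriter.rewriteSqlType(stream, "bytea")) instead of addClass().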


--- more... next week ---

For anyone who has made it this far and is still interested in how this works out... wait until the weekend has passed. I will continue working on it come Monday and will post my files in the Hibernate wiki once they work.

However, in the meantime, I am very interested in some feedback. Do people like the idea, and can they live with a factor-of-2 performance cut? Are there ways to do it more simply or elegantly? Should there be an API to transform .hbm files at load time in the next Hibernate release?

Cheers,
Erik Visser


 Post subject:
PostPosted: Sat Nov 13, 2004 6:01 am 
Hibernate Team

Joined: Mon Aug 25, 2003 9:11 pm
Posts: 4592
Location: Switzerland
Good ideas, I'd like to do some testing with this myself. Looking forward to your Wiki page. I also don't think performance should be an issue here (as long as it stays within reasonable limits): most people store large scalar values (texts, binaries) in the database because they have to (they don't have a transactional filesystem), not because it has to be fast.

_________________
JAVA PERSISTENCE WITH HIBERNATE
http://jpwh.org
Get the book, training, and consulting for your Hibernate team.


 Post subject:
PostPosted: Mon Nov 15, 2004 7:03 am 
Newbie

Joined: Thu Jan 29, 2004 5:33 am
Posts: 9
As promised, here is the first part of the code:

http://www.hibernate.org/236.html

But I don't have enough time today to provide a real working example; I will have to do that during the week.

Also, the vendor-specific code is taking more time than I hoped; I will add that in a separate wiki page, because it occurred to me that it is really a separate trick.

Cheers,
Erik


 Post subject: works for me
PostPosted: Wed Nov 17, 2004 1:20 pm 
Newbie

Joined: Wed Nov 17, 2004 1:06 pm
Posts: 4
Hi,

I used the same approach to implement a Directory for Lucene, using Hibernate.


Code:
   
<class name="org.ibit.lucene.hibernate.LuceneFile" table="LHI_LUCFIL">
    <id name="name" unsaved-value="any" column="LUF_NAME">
        <generator class="assigned"/>
    </id>
    <property name="modified" column="LUF_MODIFI"/>
    <property name="length" column="LUF_LENGTH"/>

    <list name="buffers" cascade="all" lazy="true" table="LHI_LUCBUF">
        <key column="LUB_CODLUF"/>
        <index column="LUB_NUMBER"/>
        <element type="binary" column="LUB_DATA" length="4096"/>
    </list>
</class>


The code to manage the buffers is in Lucene subclasses.

I did some tests on PostgreSQL and MySQL, and I found that the best part size for performance is 1024-4096 bytes.

On MySQL, writing files took about twice the time it takes on the filesystem.

_________________
-- Antoni Reus


 Post subject: areus could you share your code?
PostPosted: Thu Nov 25, 2004 10:12 pm 
Newbie

Joined: Thu Nov 25, 2004 10:08 pm
Posts: 5
areus, could you share the code that deals with the splitting and joining of the byte arrays? I am pulling my hair out over the regular binary-with-Oracle mapping and it is just not working, with or without streams... I just can't get it to work.

The approach presented here sounds very viable and efficient.

Thanks,

Frank


 Post subject: Re: areus could you share your code?
PostPosted: Fri Nov 26, 2004 2:04 am 
Senior

Joined: Sat Jul 17, 2004 5:16 pm
Posts: 143
This is off-topic for the original post, but not for the follow-up... Oracle is one of the better DBs to do blobs with (it is possible to stream to it, unlike MySQL and Postgres). It works fine with Hibernate. Though when inserting a blob, get the Blob reference from the OracleConnection, not from Hibernate. I am using only streams, not byte arrays, so it might be different for those.

Chris

ps. I think that if Oracle doesn't do JDBC correctly, perhaps Hibernate could check whether it is an OracleConnection and do it correctly for that one-off case (sorry if this is not pure, and sorry if you think I am saying there is a bug in Hibernate, since I am not; please don't yell at me :) ).


 Post subject:
PostPosted: Fri Nov 26, 2004 3:38 am 
Hibernate Team

Joined: Mon Aug 25, 2003 9:11 pm
Posts: 4592
Location: Switzerland
Actually, Oracle has really bad LOB handling. Make sure you don't use the Oracle drivers; then you don't need any OracleStatements or anything else weird, just plain JDBC.


 Post subject:
PostPosted: Fri Nov 26, 2004 3:55 am 
Newbie

Joined: Thu Jan 29, 2004 5:33 am
Posts: 9
Frank,

The example code is right behind the link in my second post :o

Be sure to use a BUFFER_SIZE of 2000 for Oracle.

Cheers


 Post subject:
PostPosted: Fri Nov 26, 2004 3:59 am 
Newbie

Joined: Wed Nov 17, 2004 1:06 pm
Posts: 4
The code works the Lucene Directory way, so I don't know if it will fit your needs, but here it goes:

This is the bean, mapped to the db:

Code:
import java.io.Serializable;
import java.util.LinkedList;
import java.util.List;

public class LuceneFile implements Serializable {

    private String name;
    private long modified = System.currentTimeMillis();
    private long length = 0;
    private List buffers = new LinkedList();

    public LuceneFile() {
    }

    public LuceneFile(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public long getModified() {
        return modified;
    }

    public void setModified(long modified) {
        this.modified = modified;
    }

    public long getLength() {
        return length;
    }

    public void setLength(long length) {
        this.length = length;
    }

    protected List getBuffers() {
        return buffers;
    }

    protected void setBuffers(List buffers) {
        this.buffers = buffers;
    }

    public void addBuffer(byte[] buffer) {
        buffers.add(buffer);
    }

    public byte[] findBuffer(int index) {
        return (byte[]) buffers.get(index);
    }
}


Now, to implement a Lucene Directory I needed to implement org.apache.lucene.store.InputStream.readInternal to deal with reading the data parts:

Code:
public final void readInternal(byte[] dest, int destOffset, int len) {
    int remainder = len;
    int start = pointer;
    while (remainder != 0) {
        // locate the buffer part that holds the next byte to read
        int bufferNumber = start / bufferSize;
        int bufferOffset = start % bufferSize;
        int bytesInBuffer = bufferSize - bufferOffset;
        // copy the rest of this part, or whatever remains of the request
        int bytesToCopy = bytesInBuffer >= remainder ? remainder : bytesInBuffer;

        byte[] buffer = file.findBuffer(bufferNumber);

        System.arraycopy(buffer, bufferOffset, dest, destOffset, bytesToCopy);
        destOffset += bytesToCopy;
        start += bytesToCopy;
        remainder -= bytesToCopy;
    }
    pointer += len;
}


"pointer" is the current file pointer, "file" is the LuceneFile bean, and bufferSize is the data parts size.

The rest of the code to read a file is in org.apache.lucene.store.InputStream.


To deal with writes I needed to implement org.apache.lucene.store.OutputStream.flushBuffer:

Code:
   
protected void flushBuffer(byte[] src, int len) {
    // locate the part that the file pointer currently falls in
    int bufferNumber = pointer / bufferSize;
    int bufferOffset = pointer % bufferSize;
    int bytesInBuffer = bufferSize - bufferOffset;
    int bytesToCopy = bytesInBuffer >= len ? len : bytesInBuffer;

    // grow the file by one part if we are writing past the end
    if (bufferNumber == file.getBuffers().size()) {
        file.addBuffer(new byte[bufferSize]);
    }

    byte[] buffer = file.findBuffer(bufferNumber);
    System.arraycopy(src, 0, buffer, bufferOffset, bytesToCopy);

    if (bytesToCopy < len) {           // not all in one buffer
        // Lucene flushes at most bufferSize bytes at a time,
        // so a write spans at most two parts
        int srcOffset = bytesToCopy;
        bytesToCopy = len - bytesToCopy;        // remaining bytes
        bufferNumber++;
        if (bufferNumber == file.getBuffers().size()) {
            file.addBuffer(new byte[bufferSize]);
        }
        buffer = file.findBuffer(bufferNumber);
        System.arraycopy(src, srcOffset, buffer, 0, bytesToCopy);
    }

    pointer += len;
    if (pointer > file.getLength())
        file.setLength(pointer);

    file.setModified(System.currentTimeMillis());
}


pointer, bufferSize and file are the same as above.

The rest of the code is in org.apache.lucene.store.OutputStream.

The code is an adaptation of the one in org.apache.lucene.store.RAMDirectory.

_________________
-- Antoni Reus


 Post subject:
PostPosted: Fri Nov 26, 2004 4:01 am 
Newbie

Joined: Thu Jan 29, 2004 5:33 am
Posts: 9
Chris,

My first approach to creating a vendor-agnostic BLOB implementation included checking the actual Blob class returned by Hibernate, and performing different actions depending on whether it was an Oracle BLOB or a MySQL blob. So I don't necessarily think that it is evil to do some vendor-specific work, if that is what it takes to get there.
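Schematically, the kind of dispatch I mean looked roughly like this (illustrative only, not my production code):

Code:
// vendor-specific handling behind a generic front
void writeBlob(java.sql.Blob blob, byte[] data) throws Exception {
    if (blob instanceof oracle.sql.BLOB) {
        // Oracle path: stream into the LOB locator
        java.io.OutputStream os = ((oracle.sql.BLOB) blob).getBinaryOutputStream();
        os.write(data);
        os.close();
    } else {
        // plain JDBC path (where the driver supports setBytes)
        blob.setBytes(1, data);
    }
}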

So, how does your approach work? Do you get your OracleConnection from Hibernate, or do you work the other way around?

Cheers


 Post subject:
PostPosted: Fri Nov 26, 2004 6:08 am 
Newbie

Joined: Thu Nov 25, 2004 10:08 pm
Posts: 5
Erik, I was thinking of using 4000 as the buffer size because I thought this would be the highest value possible when I use Hibernate with batched inserts and updates. Ideally I would like to have the following:

A metadata class that contains info such as name, size, modified date etc. about the blob data. The metadata class contains a lazy list of type binary, which is mapped as a regular master-detail relationship to a blob table.

I set up hibernate.properties NOT to use streams for binary (I believe it will then use setBytes()), and I set a batch_size greater than 0, say 5 or whatever.
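In hibernate.properties terms, something like this (the batch size is just the value I am trying):

Code:
# use setBytes() instead of streams for binary types
hibernate.jdbc.use_streams_for_binary false

# enable JDBC batching
hibernate.jdbc.batch_size 5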

The metadata class is referenced via one-to-one / many-to-one etc. and is not lazily loaded... but the collection with the real blob data is always lazy this way. The metadata class has helper, non-mapped accessor methods that deal with the actual fetching and stitching together / splitting of the data.

On save / update, Hibernate will see the collection and just perform regular inserts of the blob data. My blobs are hardly ever updated, so I will delete all elements in the list when a new byte array is set... which is probably the only / fastest way anyway.

Question: will this approach work, and am I right in saying that all I need on top of this is to write / reuse / borrow the array-handling code you have?

Thanks a million for your quick response; I have been at this for almost 3 days now and it's starting to annoy me really badly.

Frank


 Post subject:
PostPosted: Fri Nov 26, 2004 6:20 am 
Newbie

Joined: Thu Jan 29, 2004 5:33 am
Posts: 9
Frank,

Sounds like it would work.

The only reason I can think of for having a separate class that contains the blob data is that you could use it from multiple classes while still keeping a minimum number of tables. Myself, I cannot do this for reasons not relevant here, but I don't see why not. An advantage is that the split/stitch code then lives in that single class.

Buffer sizes and Oracle... you would do best to just run a couple of tests to see what works best for you. My experience was 2000 with RAW (2000 is the limit for RAW); for me, RAW performs better than BLOB with a buffer size of 4000.
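For example (column name invented), the RAW variant maps along these lines:

Code:
<element type="binary">
    <column name="PART_DATA" sql-type="RAW(2000)"/>
</element>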

I have not even thought about the influence of batch_size.

Good hunting!


 Post subject:
PostPosted: Fri Nov 26, 2004 2:21 pm 
Senior

Joined: Sat Jul 17, 2004 5:16 pm
Posts: 143
ErikVisser wrote:
Chris,

So, how does your approach work? Do you get your OracleConnection from Hibernate, or do you work the other way around?



Yeah, that was a bit of a pain. I have a wrapper around Hibernate, so in that constructor I keep the connection in a ThreadLocal. That means you can have only one Hibernate connection open at a time if you are using Blobs. Also, I have the JavaBean property that I use as a File, plus a getter and setter for a similar property that is the Blob mapped with Hibernate. That getter and setter translate the Blob to a file when Hibernate gets and sets it (while the connection is open). The getter and setter are where I need the ThreadLocal...

Code:
    BLOB fastBlob = null;
   
    //if there is no connection, maybe we are coming from
    //hibernate, and there is a threadlocal connection
    if (this.connection == null) {
      this.connection = HibernateSession.staticConnection();
    }
   
    Connection theConnection = this.connection;
   
    if (this.connection instanceof P6LogConnection) {
      theConnection = ((P6LogConnection)this.connection).getJDBC();
    }
   
    //get a blob based on the oracleConnection   
    fastBlob = BLOB.createTemporary(theConnection, true,
        BLOB.DURATION_SESSION);

    try {
      is = new FileInputStream(file);
      OutputStream os = fastBlob.getBinaryOutputStream();
      FastFileUtils.sendInputStream(is, os);
    } catch (FileNotFoundException fnfe) {
      ...


Quote:
Actually, Oracle has really bad LOB handling. Make sure you don't use the Oracle drivers, then you don't need any OracleStatements or anything else weird, just plain JDBC.


Hmmm, well, it seems the Oracle 9 driver will stream to and from the database (I hope) without byte[]. In that way it seems better than MySQL and Postgres (though they have one in beta). I don't know how to do this without using the Oracle JDBC driver; I don't use an OracleStatement though. Anyway, it seems to work for now, though I am in the early stages (I haven't load-tested or put anything in production yet).

Regards,
Chris


 Post subject:
PostPosted: Mon Nov 29, 2004 12:42 pm 
Beginner

Joined: Tue Aug 26, 2003 2:46 pm
Posts: 45
Location: Saskatoon SK Canada
christian wrote:
...Make sure you don't use the Oracle drivers...


Christian, which JDBC drivers do you suggest for use with Oracle?

_________________
Maury


 Post subject:
PostPosted: Mon Nov 29, 2004 4:35 pm 
Hibernate Team

Joined: Mon Aug 25, 2003 9:11 pm
Posts: 4592
Location: Switzerland
I've benchmarked and tested the API of the DataDirect Oracle driver and various versions of Oracle drivers and databases, especially with LOBs. The DataDirect driver has a standard JDBC API and was at least as fast as any OCI driver. Note that this was done very ad hoc half a year ago, so do your own testing.

