Why BTrees?

BTrees are one of programmings well kept secrets. Sure we all learn about Binary Trees in Computing 101. And then we really forget about trees as persistent storage. BTrees, and in particular the B+Trees implemented by Virtual Machinery are very different from the Binary trees of your college days. They are the highly efficient data storage mechanisms underpinning todays most powerful commercial databases (e.g. DB2, Oracle SQL Server, mySQL). These databases interpose an SQL layer between you and the underlying BTree structure, Virtual Machinery's BTree implementation strips away this layer and gives you access to the raw power underneath.

Java developers often want a simple persistence solution without the overhead of a SQL database. Essentially what they want is a persistent HashMap - this is what you get from a B+Tree. Virtual Machinery's Btree implementation offers you a pure Java solution that can be used in J2SE, J2EE and mobile environments including J2ME, Android and Blackberry. The same storage file format is used in all environments (J2ME, J2SE, J2EE and Android) allowing the reading of data created on one platform to be used on another platform. In addition we now offer a read-only library that can read B+Trees created by the Virtual Machinery Java product in the Apple iOS environment. This means that you can use the same B+Tree files across J2ME (including Asha), Android and iPhone/iPad devices. The product includes demos that illustrate the storage and retrieval of image, text and sound data from the same B+Tree file in these environments

The design of B+Trees makes them ideal for disk-based data as fewer disk reads are required to get to a particular piece of data. The B+Tree is in two parts - one part is a hierarchical tree structure consisting purely of indices and the other is a flat data structure containing the data and built as a linked list. This combination gives an ideal compromise between short-path access to individual data and efficient index-sequential traversal of data. A good example of this would be a telephone company. It needs to provide rapid access to customers details via their phone number in, for example, a call centre application. It also has to iterate across every account to produce the monthly bill. Using a classical Binary tree for such an index-sequential traversal is very inefficient since the tree structure has to be navigated each time to find the next key. Typically commercial implementations such as Virtual Machinery's use a double-linked list meaning that the tree can be traversed in either direction.

If you use relatively small indices your tree will not be very deep. Typically you want only two or three disk accesses to read a piece of data and choose the number of indices per page to reflect that. By cacheing the full tree structure representing the indices in memory you can reduce the number of disk accesses to one. Virtual Machinery's B+Tree implementation allows you to do just that.

Insertion and deletion both rely on locating where the data is or should be so they also benefit from this arrangement. Most of the complexity in a BTree is involved in these operations as some changes may potentially cause a significant rearrangement of the indices. Again this can be mitigated by keeping the indices in memory. Insertion and deletion of data is simple as it will affect at most four pages - a page may have to be split on an insert, or two pages may have to be combined after a deletion.

Another advantage is that corruption due to computer/disk failure is much more likely to happen in the index part of the B+Tree. For this reason B+Trees are often organised as two files - one containing the index and the other the data. The tree can frequently be rebuilt from the data file - losing only data in blocks directly affected by the failure.

The B-tree was first described in 1972 by Rudolf Bayer and Edward M. McCreight. It is probable that the first commercial application of the B+ tree was IBM's VSAM - an early structured data access product that appeared in 1974 as a result of collaboration between IBM labs in Germany and the US.

You can find out more about the internal workings of a B+Tree here. And more about B+Trees in general here.

Click here to find pricing and licensing details and buy Virtual Machinerys BTree implementation online.

Portions of this page are reproduced from work created and shared by Google and used according to terms described in the Creative Commons 2.5 Attribution License.

Some of the icons in this page are generously provided by MySiteMyWay

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.