



Journal Article

Review - Data Placement In Bubba


Weikum, Gerhard
Databases and Information Systems, MPI for Informatics, Max Planck Society;


Weikum, G. (2000). Review - Data Placement In Bubba. ACM SIGMOD Digital Review, 2.

This paper, which came out of the Bubba project at MCC, was the first to address the physical database design problem for parallel database servers, with particular focus on the partitioning and allocation of (relational) data across multiple disks or processing nodes. These issues are key to good performance tuning. To this end, the paper introduced the fundamental notion of data heat as a measure of the disk access load attributed to a data unit or collection of units, and the notion of temperature, which normalizes heat by the space consumed. Based on these metrics, the paper developed an elegant framework and heuristic algorithms for choosing which data should be placed on which disk so as to balance the disk load, and which data should be cached in memory so as to minimize the overall disk load.
I had the great opportunity of spending a postdoc year in the Bubba group at MCC, where I could learn about this subject directly from the paper's authors. Later, their work was my main inspiration when I started working on dynamic data placement and migration in the early nineties. In that research, the notions of heat and temperature proved extremely useful for reasoning about load distribution and for developing algorithms that continuously adjust the allocation of data based on online statistics about access patterns, for example, to "cool down" hot disks. I have also seen fairly recent papers on the caching of query results in data warehouses benefit greatly from the Bubba tuning framework. The paper by Copeland et al. is a true landmark paper, especially when you consider that this work was done before the industrial advent of parallel database systems.
The problem of automating the physical database design for a cluster-based parallel data server, in the spirit of a zero-admin, self-tuning solution, has still not been solved in a truly comprehensive, industrial-strength manner, but this seminal paper is an excellent starting point and absolutely mandatory reading for everybody working on this highly relevant problem.
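To make the heat and temperature metrics from the review concrete, the following is a minimal Python sketch. It is not the algorithm from the Bubba paper: the greedy "hottest first, coolest disk" placement and the "highest temperature first" caching rule are simple stand-ins chosen for illustration, and the `Fragment` class, the fragment names, and all numbers are hypothetical. It also assumes that caching a fragment removes its entire heat from the disks, which ignores writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A unit of data to place; all fields are illustrative assumptions."""
    name: str
    heat: float      # disk access load, e.g. accesses per second
    size_mb: float   # space consumed

    @property
    def temperature(self) -> float:
        # Temperature normalizes heat by the space consumed.
        return self.heat / self.size_mb


def choose_cached(fragments, memory_mb):
    """Greedily cache the highest-temperature fragments that still fit."""
    cached, free = [], memory_mb
    for frag in sorted(fragments, key=lambda f: f.temperature, reverse=True):
        if frag.size_mb <= free:
            cached.append(frag)
            free -= frag.size_mb
    return cached


def place_on_disks(fragments, num_disks):
    """Assign fragments, hottest first, to the disk with the least heat so far."""
    disks = [[] for _ in range(num_disks)]
    load = [0.0] * num_disks
    for frag in sorted(fragments, key=lambda f: f.heat, reverse=True):
        target = min(range(num_disks), key=lambda d: load[d])
        disks[target].append(frag)
        load[target] += frag.heat
    return disks, load


# Hypothetical workload statistics.
fragments = [
    Fragment("orders", heat=90.0, size_mb=300.0),
    Fragment("customers", heat=60.0, size_mb=100.0),
    Fragment("lineitems", heat=50.0, size_mb=500.0),
    Fragment("nation", heat=40.0, size_mb=10.0),
    Fragment("history", heat=10.0, size_mb=400.0),
]

cached = choose_cached(fragments, memory_mb=120.0)
on_disk = [f for f in fragments if f not in cached]
disks, load = place_on_disks(on_disk, num_disks=2)

print("cached:", [f.name for f in cached])
for i, (disk, heat) in enumerate(zip(disks, load)):
    print(f"disk {i}: {[f.name for f in disk]} total heat {heat}")
```

In this sketch the small but frequently accessed fragments win the memory budget because temperature, not raw heat, drives the caching choice, while raw heat drives the balancing of the remaining fragments across disks. A dynamic variant in the spirit of "cooling down" hot disks would recompute heat from online statistics and migrate fragments away from the disk with the highest load.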