Database Design

Purpose
  • To ensure that persistent data is stored consistently and efficiently.
  • To define behavior that must be implemented in the database.

Role: Database Designer

Steps

The steps presented below assume a relational data model. The steps for an object database are similar in nature but differ in the details.

The reader is assumed to be familiar with database concepts and terminology, as covered by references such as [DAT99].

Map Persistent Design Classes to the Data Model

Purpose
  • To create, define and refine the data model to support storage and retrieval of persistent classes.
Guidelines: Data Model

Once the structure of the persistent design classes (design classes whose persistent property is set) stabilizes, map them into the data model. The guidelines referenced above describe a coarse-grained strategy for performing the mapping, but ultimately a framework and tools are needed to define the conceptual mapping and to generate the calls that perform it at run-time. Tools and a reliable run-time data access framework become essential once the work is highly iterative, because it is impractical to manage the mapping process and the related interface code generation by hand.
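As an illustration only, here is a minimal sketch of the coarse-grained class-to-table mapping, in Python against SQLite. The Customer class, its attributes, and the save helper are hypothetical; a real mapping framework would generate this kind of code.

    import sqlite3

    # Hypothetical persistent design class; attribute names are illustrative.
    class Customer:
        def __init__(self, customer_id, name, credit_limit):
            self.customer_id = customer_id
            self.name = name
            self.credit_limit = credit_limit

    conn = sqlite3.connect(":memory:")

    # Coarse-grained mapping: one table per persistent class, one column
    # per attribute, with the object identity as the primary key.
    conn.execute("""
        CREATE TABLE customer (
            customer_id  INTEGER PRIMARY KEY,
            name         TEXT NOT NULL,
            credit_limit REAL
        )""")

    def save(customer):
        # A run-time data access framework would generate calls like this.
        conn.execute(
            "INSERT INTO customer (customer_id, name, credit_limit) VALUES (?, ?, ?)",
            (customer.customer_id, customer.name, customer.credit_limit))

    save(Customer(1, "Acme Corp.", 50000.0))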

Optimize the Data Model for Performance

Purpose
  • To optimize the database data structures for performance.

Objects which are retrieved together can be stored together for improved performance.

In the case of a relational data model, the initial mapping generally yields a simple class-to-table mapping. If objects from different classes need to be retrieved at the same time, the RDBMS uses an operation called a table join to retrieve the rows related to the objects of interest. For frequently accessed data, join operations can be very computationally expensive. To eliminate the cost of the join, a standard relational technique is de-normalization.

De-normalization combines columns from two or more tables into one table, effectively pre-joining the information. It trades more expensive update operations for less expensive retrieval operations. De-normalization also slows queries that are interested in the attributes of only one of the pre-joined objects, since all attributes are retrieved on every query. But when the application normally wants all of the attributes, there can be a significant performance improvement.

De-normalizing more than two tables is rare and increases the cost of inserts and updates as well as the cost of non-join queries. Limiting de-normalization to two tables is a good policy unless there is strong and convincing evidence of the benefits.

De-normalization can be inferred from the Design Classes in cases where classes are nested. Nested classes should always be mapped to a de-normalized table.
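For illustration, the sketch below contrasts a normalized mapping with a de-normalized one, assuming hypothetical Customer and Order classes where orders are normally retrieved together with the customer name; the table and column names are invented for the example.

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Normalized mapping: retrieving an order with its customer's name
    # requires a join between the two tables.
    conn.executescript("""
        CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders   (order_id    INTEGER PRIMARY KEY,
                               customer_id INTEGER REFERENCES customer,
                               total       REAL);
    """)

    # De-normalized mapping: the customer name is pre-joined into the order
    # table, so the frequent retrieval needs no join -- at the cost of
    # keeping the duplicated column in sync whenever the customer changes.
    conn.execute("""
        CREATE TABLE orders_denorm (
            order_id      INTEGER PRIMARY KEY,
            customer_id   INTEGER,
            customer_name TEXT,
            total         REAL
        )""")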

Some object databases offer a concept similar to de-normalization, in which related objects are clustered together on disk and retrieved in a single operation. The aim is the same: to reduce object retrieval time by reducing the work the system must do to retrieve related objects from the database.

In some cases, optimizing the data model may unmask problems in the Design Model, such as performance bottlenecks, poor modeling, or incomplete designs. In this event, discuss the problems with the Designer of the class, raising change requests where appropriate.

Optimize Data Access

Purpose
  • To provide for efficient data access using indexing.

Once the table structure has been designed, determine the types of queries that will be performed against the data. Indexing is used by the database to speed access. The following indexing strategies should be considered:

  • The primary key column of the table should always be indexed. Primary key columns are used frequently as search keys and for join operations.
  • Tables smaller than about 100 rows benefit little from indexing, because they generally fit in the database cache and tend to remain in memory. (The actual threshold depends on how many rows fit in a data block, typically 512 to 2048 bytes; viewed from the physical perspective, the unit of database I/O is the block, not the row.)
  • Indexes should also be defined for frequently executed queries, and for queries that must retrieve data quickly (generally, any search done while a person is waiting). Define an index for each set of attributes used together as search criteria. For example, if the system needs to find all Orders on which a particular product appears, the Line Item table needs an index on the product number column (see the sketch after this list).
  • Indexes should generally be defined only on columns used as identifiers, not on numeric values (things like account balances) or textual information (such as order comments). Identifier column values tend to be assigned when the object is created and then remain unchanged for the life of the object.
  • Indexes should be on simple numbers (integer and number data types) rather than floating point numbers, and should rarely be on strings. Comparison operations are much simpler and faster on numbers than they are on strings, and given the large data volumes processed on a query or a large join, small savings add up quickly. Indexes on numeric columns tend to take significantly less space as well.
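
To make the Line Item example concrete, here is a minimal sketch in Python against SQLite; the table and column names are illustrative.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE line_item (
            order_id   INTEGER,
            product_no INTEGER,
            quantity   INTEGER,
            PRIMARY KEY (order_id, product_no)  -- primary key columns are indexed
        )""")

    # Index on the search column used by "find all orders for a product";
    # without it, the query must scan the entire table.
    conn.execute("CREATE INDEX idx_line_item_product ON line_item (product_no)")

    orders = conn.execute(
        "SELECT order_id FROM line_item WHERE product_no = ?", (4711,)).fetchall()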

On the downside, indexes are not free: the more indexes on a table, the longer inserts and updates take to process. The following precautions should temper the use of indexes:

  • Don't index just to speed up an infrequently executed query, unless the query occurs at a critical point and maximum speed is essential.
  • In some systems, update and insert performance is more important than query performance. A common example is a factory data acquisition application in which quality data is captured in real time. In such systems there are only occasional online queries, and most of the data is analyzed periodically by batch reporting applications that perform statistical analysis on the data. For data-acquisition systems, remove all indexes to achieve maximum throughput. If indexes are needed, they can be built just before the batch reporting/analysis applications run, then dropped when the reporting/analysis is complete.
  • Keep in mind that indexes have a hidden cost: indexes cost time to update (a tax paid on every insert, update, or delete), and occupy disk space. Be sure you get value from using them.

Many databases offer a choice of index types. The most common include:

  • B-tree indexes. The most frequently used indexes are based on the balanced B-tree data structure. B-trees work well when the index key values are randomly distributed and highly variable; they tend to perform poorly when the data being indexed is already in sequential order.
  • Hashed indexes. Less frequently, index key values are hashed. Hashing offers better performance when the range of index key values is known, relatively unchanging, and unique. Hashing relies upon using the key value to calculate the address of the data of interest. Because of the need for predictability, hash indexes tend to be useful only for medium-sized look-up tables which change very infrequently.

The choice of indexing strategy can have a large impact on performance, as can the timing of index creation. Bulk data loads should be performed without indexes (drop the index, load the data, then re-create the index). The reason is that as each row is added, the index structure is re-balanced; since subsequent rows will change the optimal index structure, the work done re-balancing the index on each insert is largely wasted. It is faster and more efficient to load the data without indexes, then re-create the index when the data load is done. Some databases provide bulk data loaders that do this automatically.
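A minimal sketch of the drop-load-recreate pattern, again in Python against SQLite; the table, index, and data are illustrative.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE line_item (order_id INTEGER, product_no INTEGER, quantity INTEGER)")
    conn.execute("CREATE INDEX idx_line_item_product ON line_item (product_no)")

    rows = [(i, i % 100, 1) for i in range(100000)]  # illustrative bulk data

    # Drop the index so each insert does not re-balance the B-tree ...
    conn.execute("DROP INDEX idx_line_item_product")
    conn.executemany("INSERT INTO line_item VALUES (?, ?, ?)", rows)
    # ... then rebuild it once, after the load is complete.
    conn.execute("CREATE INDEX idx_line_item_product ON line_item (product_no)")
    conn.commit()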

Define Storage Characteristics

Purpose
  • To define the space requirements and disk page organization of the database.

Databases perform I/O not on rows or records, or even whole tables, but on disk blocks. The reason for this is simple: block I/O operations are usually optimized in software and hardware on the system. As a result, the physical organization of the tables and indexes in the database can have a dramatic impact on the performance of the system. There are several dimensions:

  • The density of information in the disk pages
  • The location of disk pages on disk, and across disk drives
  • The amount of disk space to allocate to the table

Disk Page Density

The density of disk pages depends on the extent to which the data is expected to change over time. Without getting into the details, a less dense page can better absorb changes in values and additions of data over time, while a fuller data page provides better read performance, since more data is retrieved per block read.

To simplify the disk management, group tables by the extent to which they tend to change. Three groups is a good start: highly dynamic, somewhat dynamic, and mostly static. The highly dynamic tables should be mapped onto disk pages that have a good deal of empty space in them (perhaps 30%), the somewhat dynamic less empty space (perhaps 15%), and the mostly static very little empty space (perhaps 5%). The indexes for the tables should be similarly mapped.
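The mechanism for reserving free space in a page is vendor-specific. As one hedged illustration, the Python sketch below emits Oracle-style PCTFREE clauses for each volatility group (SQL Server exposes a similar FILLFACTOR option for indexes); the table names and group assignments are invented for the example.

    # Free space to reserve per disk page, by volatility group.
    FREE_SPACE = {"highly dynamic": 30, "somewhat dynamic": 15, "mostly static": 5}

    TABLE_GROUPS = {             # illustrative table names and assignments
        "order_entry": "highly dynamic",
        "customer":    "somewhat dynamic",
        "state_code":  "mostly static",
    }

    # Emit vendor-specific DDL; PCTFREE is Oracle's block free-space parameter.
    for table, group in TABLE_GROUPS.items():
        print("ALTER TABLE %s PCTFREE %d;" % (table, FREE_SPACE[group]))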

Disk Page Location

Next, determine where to put the disk pages. The goal here is to try to balance the workload across a number of different drives and heads to reduce or eliminate bottlenecks. Some guidelines to consider:

  • Never put data on the same disk as the operating system, its temporary files, or the swap devices. These drives will be busy enough as it is without adding additional workload to them.
  • Put data that is accessed simultaneously on different drives to balance the workload. Some systems support parallel I/O channels; if possible, put the data on different channels as well.
  • Try to put indexes on a different drive from the data they index, also to spread the workload.
  • Try to use raw I/O if possible. Raw I/O bypasses the normal file buffering done by the operating system; since the DBMS maintains its own buffers, there is no need to have the OS do it as well. Raw I/O is somewhat more complex to administer, so its use must be carefully considered.
  • For complex queries, consider striping the data. Striping involves putting data pages from the same tables (or set of objects) on different drives to reduce disk head-seek time and potentially take advantage of smart controllers supporting parallel I/O.

The primary rule is that there are never enough drives over which to spread the I/O. I/O balancing is an iterative, experimental process. Data access performance prototyping during the elaboration phase, coupled with appropriate instrumentation to monitor physical and logical I/O, will highlight performance problems early while there is still time to adjust the database design.

Disk Space Allocation

Using the characteristics of the persistence Design Mechanism, estimate the number of objects that need to be stored. The amount of disk space required to store the objects varies from DBMS to DBMS, but the size estimate is typically a function of:

size = (fixed overhead for the table) + (number of rows * (avg. row size/avg. data density))

where the avg. data density is the page density percentage. In addition, be sure to account for the size of the indexes as well. The Database Administrator (DBA) manual for the product will provide the precise size estimation formulae. When calculating disk space, make sure to account for growth due to additions of data.
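A worked example of the estimate, with invented figures: one million rows averaging 200 bytes each in a somewhat dynamic table (page density 85%, i.e. 15% free space).

    fixed_overhead = 16384     # fixed per-table overhead in bytes (illustrative)
    rows           = 1000000
    avg_row_size   = 200       # bytes
    data_density   = 0.85      # page density: 15% free space reserved

    size = fixed_overhead + rows * (avg_row_size / data_density)
    print("%.0f MB before indexes and growth" % (size / 2**20))  # ~224 MB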

Use the disk space allocation calculations along with the page location to place the data. Some databases allow tables to span physical drives; some do not. If yours does not, you may have to sub-optimize the placement of data to accommodate the data volumes. An alternative is to use RAID ('redundant arrays of independent disks') drives, which allow many independent disk drives to appear as a single large disk that can be easily expanded. RAID drives effectively stripe the data automatically, but offer less control over the physical placement of data, making fine optimization more difficult.

Define Reference Tables

Purpose
  • To define standard reference tables used across the project.
  • To define default values for data attributes.

Often there are standard look-up tables, validation tables, or reference tables used throughout the project. Since the data in these tables tends to be frequently accessed but changes infrequently, it is worth special consideration. In the design model, these tables hold things like standard product codes, state or province codes, postal or zip codes, tax tables, area code validation tables, or other frequently accessed information. In financial systems this might be lists of policy codes, insurance policy rating categories, or conversion rates. Look in the design model for classes that are primarily read and that provide validation information for a large number of clients.

To tie in the earlier steps: reference data should be located on a fast drive, in relatively packed data pages. If a table is small, don't bother indexing it, since for small tables indexing actually adds overhead. A small, frequently accessed table will also tend to remain in memory, as the least-recently-used (LRU) caching algorithms tend to keep such tables pegged in the data cache.

Make sure the database cache is large enough to keep all reference tables in memory, if possible, plus the normal "working set" space for queries and transactions. Often, the secret to increasing database performance is reducing I/O. Relatively speaking, memory is inexpensive, so if you need more, buy it.

Once the reference table structures are defined, determine a strategy for populating them. Since reference tables are accessed early in the project, determining the reference values and loading the tables often needs to occur relatively early. While the Database Designer is not responsible for obtaining the data, she is responsible for determining how and when the reference tables will be populated and refreshed.
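A minimal sketch of defining and populating one such table, assuming a hypothetical state-code table; a refresh would simply replace the contents.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE state_code (      -- small, read-mostly reference table
            code TEXT PRIMARY KEY,
            name TEXT NOT NULL
        )""")

    # Load the reference values early in the project; the refresh strategy
    # determines how and when this list is replaced.
    STATES = [("CA", "California"), ("NY", "New York"), ("TX", "Texas")]
    conn.executemany("INSERT INTO state_code VALUES (?, ?)", STATES)
    conn.commit()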

Define Data and Referential Integrity Enforcement Rules

Purpose
  • To ensure the integrity of the database.

Data integrity rules, also known as constraints, ensure that data values lie within defined ranges. Where these ranges can be identified, the database can enforce them. (This is not to say that data validation should not be done in the application, only that the database can serve as a 'validator of last resort' in the event that the application does not work correctly.) Where data validation rules exist, define database constraints to enforce them.

In addition, referential integrity rules can be enforced by the database. These ensure that for every foreign key value there is a primary key value in a referenced table. Definition of foreign key constraints is also often used by the query optimizer to accelerate query performance. Based on associations in the Design Model, create foreign key relationship constraints on the associated tables in the Data Model. In many cases, the foreign key enforcement rules will use the reference tables discussed in the previous step.
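A minimal sketch of both kinds of rules in SQLite (where foreign key enforcement must be switched on explicitly); the tables and the value range are illustrative.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

    conn.executescript("""
        CREATE TABLE state_code (code TEXT PRIMARY KEY, name TEXT NOT NULL);

        CREATE TABLE customer (
            customer_id  INTEGER PRIMARY KEY,
            credit_limit REAL CHECK (credit_limit >= 0),   -- data integrity rule
            state        TEXT REFERENCES state_code (code) -- referential integrity rule
        );
    """)

    conn.execute("INSERT INTO state_code VALUES ('CA', 'California')")
    # Each of these would raise sqlite3.IntegrityError -- the database acting
    # as the 'validator of last resort':
    #   conn.execute("INSERT INTO customer VALUES (1, -10.0, 'CA')")  # violates CHECK
    #   conn.execute("INSERT INTO customer VALUES (1, 100.0, 'ZZ')")  # violates FK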

Distribute Class Behavior to the Database

Purpose
  • To determine the behavior of the class which can be distributed to, and implemented by, the database.

Most databases support a stored procedure capability. A stored procedure is executable code that runs within the process space of the database management system and provides the ability to perform database-related actions on the server without having to transfer data across a network. Judicious use of stored procedures can improve the performance of the system.

Stored procedures usually come in two flavors: actual procedures and triggers. Procedures are executed explicitly by an application, generally have parameters, and can provide an explicit return value. Triggers are invoked implicitly when some database event occurs (inserting, updating, or deleting a row, for example), have no parameters (since they are invoked implicitly), and do not provide explicit return values.

In database systems that lack constraints, triggers are often used to enforce referential and data integrity. Otherwise, they tend to be used when an event needs to trigger (or cause) another event.
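As one hedged illustration, the sketch below defines a trigger in SQLite (which supports triggers but not stored procedures); the audit table and the event it reacts to are invented for the example.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE inventory (product_no INTEGER PRIMARY KEY, quantity INTEGER);
        CREATE TABLE inventory_audit (product_no INTEGER, old_qty INTEGER,
                                      new_qty INTEGER, changed_at TEXT);

        -- Invoked implicitly on every update; no parameters, no return value.
        CREATE TRIGGER trg_inventory_audit
        AFTER UPDATE OF quantity ON inventory
        BEGIN
            INSERT INTO inventory_audit
            VALUES (OLD.product_no, OLD.quantity, NEW.quantity, datetime('now'));
        END;
    """)

    conn.execute("INSERT INTO inventory VALUES (1, 10)")
    conn.execute("UPDATE inventory SET quantity = 7 WHERE product_no = 1")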

The Design classes should be examined to see if they have operations which should be implemented using the stored procedure or trigger facility. Candidates include:

  • Any operations which primarily deal with persistent data (creating, updating, retrieving or deleting it).
  • Any operations in which a query is involved in a computation (such as calculating the average quantity and value of a product in inventory).
  • Operations which need to access the database to validate data.

Remember that improving database performance usually means reducing I/O; as a result, if performing a computation on the DBMS server will reduce the amount of data passed over the network, the computation should probably be performed on the server.
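The contrast can be sketched as follows, with an invented inventory table: the first query ships every row to the client before averaging, while the second returns only the two aggregate values.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (product_no INTEGER, quantity INTEGER, value REAL)")
    conn.executemany("INSERT INTO inventory VALUES (?, ?, ?)",
                     [(i, i % 50, float(i)) for i in range(10000)])

    # Client-side: every row crosses the network before the average is computed.
    rows = conn.execute("SELECT quantity, value FROM inventory").fetchall()
    avg_qty = sum(q for q, v in rows) / len(rows)

    # Server-side: only the two aggregate values are returned.
    avg_qty, avg_val = conn.execute(
        "SELECT AVG(quantity), AVG(value) FROM inventory").fetchone()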

Work with the Designer of the class to discuss how the database can be used to improve performance. The Designer will update the operation method to indicate that one or more stored procedures can or should be used to implement the operation.

Review the Results

Purpose
  • To ensure the quality and integrity of the Data Model.

Continuously throughout the activity, consider the Checkpoints: Data Model to assess the completeness and quality of the effort.

Copyright © 1987-2001 Rational Software Corporation

