Java Database Best Practices-Java Database Best Practices

2.1 Relational Concepts

Before we approach the details of relational data architecture, it helps to establish a base understanding of relational concepts. If you are an experienced database programmer, you will probably want to move on to the next section on normalization. In this section, we will review the key concepts behind relational databases critical to an in-depth understanding of relational data architecture.

Databases and Database Engines

Developers new to database programming often run into problems understanding just what a database is. In some contexts, it represents a collection of data like the music library. In other contexts, however, it may refer to the software that supports that collection, a process instance of the software, or even the server machine on which the process is running.

Technically speaking, a database is really the collection of related data and the relationships supporting the data. The database software—a.k.a the database management system (DBMS)—is the software, such as Oracle, Sybase, MySQL, and DB2, that is used to store that data. A database engine, in turn, is a process instance of the software accessing your database. Finally, the database server is the computer on which the database engine is running.

In the industry, this distinction is often understood from context. I will therefore continue to use the term "database" interchangeably to refer to any of these definitions. It is important, however, to database programming to understand this breakdown.

2.1.1 The Relational Model

A database is any collection of related data. The files on your hard drive and the piles of paper on your desk all count as databases. What distinguishes a relational database from other kinds of databases is the mechanism by which the database is organized—the way the data is modeled. A relational database is a collection of data organized in accordance with the relational model to suit a specific purpose.

Relational principles are based on the mathematical concepts developed by Dr. E. F. Codd that dictate how data can be structured to define data relationships in an efficient manner. The focus of the relational model is thus the data relationships. In short, by organizing your data according to the relational model as opposed to the hierarchical principles of your filesystem or the random mess of your desktop, you can find your data at a later date much easier than you would have had you stored it some other way.

A relationship in relational parlance is a table with columns and rows.^[1] A row in the database represents an instance of the relation. Conceptually, you can picture a table as a spreadsheet. Rows in the spreadsheet are analogous to rows in a table, and the spreadsheet columns are analogous to table attributes. The job of the relational data architect is to fit the data for a specific problem domain into this relational model.

^[1] You will sometimes see a row referred to as a tuple—especially in more theoretical discussions of relational theory. Columns are often referred to as attributes or fields.

Other Data Models

The relational model is not the only data model. Prior to the widespread acceptance of the relational model, two other models ruled data storage:

The hierarchical model
The network model

Though systems still exist based on these models, they are not nearly as common as they once were. A directory service like ActiveDirectory or OpenLDAP is where you are most likely to engage in new hierarchical development.

Another model—the object model—is slowly coming into favor for limited problem domains. As its name implies, it is a data model based on object-oriented concepts. Because Java is an object-oriented programming language, it actually maps best to the object model. However, it is not as widespread as the relational model and is definitely not proven to support systems on the scale of the relational model.

2.1.2 Entities

The relational model is one of many ways of modeling data from the real world. The modeling process starts with the identification of the things in the real world that you are modeling. These real world things are called entities. If you were creating a database to catalog your music library, the entities would be things like compact disc, song, band, record label, and so on. Entities do not need to be tangible things; they can also be conceptual things like a genre or a concert.

BEST PRACTICE: Capture the "things" in your problem domain as relational entities.

An entity is described by its attributes. Back to the example of a music library, a compact disc has attributes like its title and the year in which it was made. The individual values behind each attribute are what the database engine stores. Each row describes a distinct instance of the entity. A given instance can have only a single value for each attribute.

Table 2-1 describes the attributes for a CD entity and lists instances of that entity.

Table 2-1. A list of compact discs in a music library

Artist

Title

Category

Year

The Cure

Pornography

Alternative

1983

Garbage

Garbage

Grunge

1995

Hole

Live Through This

Grunge

1994

The Mighty Lemon Drops

World Without End

Alternative

1988

Nine Inch Nails

The Downward Spiral

Industrial

1994

Public Image Limited

Compact Disc

Alternative

1986

Ramones

Mania

Punk

1988

The Sex Pistols

Never Mind the Bollocks, Here's the Sex Pistols

Punk

1977

Skinny Puppy

Last Rights

Industrial

1992

Wire

A Bell Is a Cup Until It Is Struck

Alternative

1989

You could, of course, store this entire list in a spreadsheet. If you wanted to find data based on complex criteria, however, the spreadsheet would present problems. If, for example, you were having a "Johnny Rotten Night" party featuring music from the punk rocker, how would you create this list? You would probably go through each row in the spreadsheet and highlight the compact discs from Johnny Rotten's bands.

Using the data in Table 2-1, you would have to hope that you had in mind an accurate recollection of which bands he belonged to. To avoid taxing your memory, you could create another spreadsheet listing bands and their members. Of course, you would then have to meticulously check each band in the CD spreadsheet against its member information in the spreadsheet of musicians.

2.1.3 Constraints

What constitutes identity for a compact disc? In other words, when you look at a list of compact discs, how do you know that two items in the list are actually the same compact disc? On the face of it, the disc title seems as if it might be a good candidate. Unfortunately, different bands can have albums with the same title. In fact, you probably use a combination of the artist name and disc title to distinguish among different discs.

The artist and title in our CD entity are considered identifying attributes because they identify individual CD instances. In creating the table to support the CD entity, you tell the database about the identifying attributes by placing a constraint on the database in the form of a unique index or primary key. Constraints are limitations you place on your data that are enforced by the DBMS. In the case of unique indexes (primary keys are a special kind of unique index), the DBMS will prevent the insertion of two rows with the same values for the entity's identifying attributes. The DBMS would prevent, for example, the insertion of another row with values of 'Ramones' and 'Mania' for the artist and title values in a CD table having artist and title as a unique index. It won't matter if the values for all of the other columns differ.

BEST PRACTICE: Use constraints to help enforce the data integrity of your system.

Constraints like unique indexes help the DBMS help you maintain the overall data integrity of your database. Another kind of constraint is formally known as an attribute domain. You probably know the domain as its data type. Choosing data types and indexes along with the process of normalization are the most critical design decisions in relational data architecture.

2.1.3.1 Indexes

An index is a constraint that tells the DBMS about how you wish to search for instances of an entity. The relational model provides for three main kinds of indexes:

Index: An index in the generic sense is a simple tool that tells the DBMS what kind of searches you intend to perform. With this information, the DBMS can organize information to make the searches go quickly. A very crude way to think of an index is as a Java HashMap in which the key is your index attribute and the values are arrays of matching rows.
Unique index: A unique index is an index whose values are guaranteed to be unique. In other words, instead of an array of matching rows, this index is like a HashMap that returns a single value for its key. The index created earlier for the artist and title columns in the CD table is an example of a unique index.
Primary key: A primary key is a special unique index that acts as the main identifier for the row. A table can have any number of unique indexes, but it can have only one primary key.

We can examine the impact of indexes by creating the CD entity as a table in a MySQL database and using a special SQL command called the EXPLAIN command. The SQL to create the CD table looks like this:

CREATE TABLE CD (
    artist        VARCHAR(50)   NOT NULL,
    title         VARCHAR(100)  NOT NULL,
    category      VARCHAR(20),
    year          INT
);

The EXPLAIN command tells you what the database will do when trying to run a query. In this case, we want to look at what happens when we are looking for a specific compact disc:

mysql> EXPLAIN SELECT * FROM CD
    -> WHERE artist = 'The Cure' AND title = 'Pornography';
+-------+------+---------------+------+---------+------+------+------------+
| table | type | possible_keys | key  | key_len | ref  | rows | Extra      |
+-------+------+---------------+------+---------+------+------+------------+
| CD    | ALL  | NULL          | NULL |    NULL | NULL |   10 | where used |
+-------+------+---------------+------+---------+------+------+------------+
1 row in set (0.00 sec)

The important information in this output for now is to look at the number of rows. Given the data in Table 2-1, we have 10 rows in the table. The results of this command tell us that MySQL will have to examine all 10 rows in the table to complete this query. If we add a unique index, however, things look much better:

mysql> ALTER TABLE CD ADD UNIQUE INDEX ( artist, title );
Query OK, 10 rows affected (0.20 sec)
Records: 10  Duplicates: 0  Warnings: 0
mysql> EXPLAIN SELECT * FROM CD
    -> WHERE artist = 'The Cure' AND title = 'Pornography';
+-------+-------+---------------+--------+---------+-------------+------+
| table  | type   | possible_keys   | key     | key_len   | ref           | rows |
+-------+-------+---------------+--------+---------+-------------+------+
| CD     | const  | artist          | artist  |     150   | const,const   |    1 |
+-------+-------+---------------+--------+---------+-------------+------+
1 row in set (0.00 sec)
mysql>

The same query can now be executed simply by examining a single row.

BEST PRACTICE: Make indexes for attributes you intend to search against.

Unfortunately, the artist and title probably make a poor unique index. First of all, there is no guarantee that a band will actually choose distinct names for its albums. Worse, in some circumstances, bands have chosen to have the same album carry different names. Public Image Limited's Compact Disc is an example of such an album. The cassette version of the album is called Cassette.

Even if artist and title were solid identifying attributes, they still make for a poor primary key. A primary key must meet the following requirements:

It can never be NULL.
It must be unique across all entity instances.
The primary key value must be known when the instance is created.

In addition to these requirements, good primary keys have the following characteristics:

The primary key should never change value.
The primary key attributes should have no meaning except to uniquely identify the entity instance.

It is very common for people to find attributes inherent in an entity and chose one or more of those identifying attributes as a primary key. Perhaps the best example of this practice is the use of an email address as a primary key. Email addresses, however, can and do change. A change to a primary key attribute can cause an instance to become inaccessible to anyone with old information about the instance. In plain English, it can break your application.

Another example of a common primary key with meaning is a U.S. Social Security number. It is supposed to be unique. It is never supposed to change. You, however, have no control over its uniqueness or whether it changes. As it turns out, sometimes the uniqueness of Social Security numbers is violated. In addition, they do sometimes change. Furthermore, in many cases, the law restricts your ability to share this information. It is therefore best to choose a primary key with no external meaning; you will control exactly how it is used and have the full power to enforce its uniqueness and immutability.

BEST PRACTICE: Never use meaningful attributes or attributes whose values can change as primary keys.

The solution is to create a new attribute to serve as the primary identifier for instances of an entity. For the CD table, we will call this new attribute the cdID. The SQL to create the table then looks like this:

CREATE TABLE CD (
    cdID          INT           NOT NULL,
    artist        VARCHAR(50)   NOT NULL,
    title         VARCHAR(100)  NOT NULL,
    category      VARCHAR(20),
    year          INT,
    PRIMARY KEY ( cdID ),
    INDEX ( artist, title ),
    INDEX ( category ),
    INDEX ( year )
);

You may have noted that my naming style does not redundantly name columns like title cdTitle. Yet I chose to name the primary key for the CD table cdID instead of id. This choice basically makes the use of data modeling tools a lot simpler. In short, data modeling tools look for natural joins—joins between two tables when the common columns share the same name, data type, and value. I discuss natural joins in more detail in Chapter 10.

BEST PRACTICE: Include the table name in the primary key name to assist data modeling tools.

Ideally, you always search on unique indexes. In the real world, however, you will select on attributes like the year or genre that are not unique. You can still help the database organize the underlying data storage by creating plain indexes. In general, you want any attribute you commonly search on to be indexed. An index does, however, come with some downsides:

Indexes are stored apart from the table data. Every index thus adds to the disk space requirements of the database.
Every change to the table requires every index to be updated to reflect the changes.

In other words, if you have a table on which you perform a significant number of write operations, you want to minimize your indexes to those attributes that appear frequently in queries.

Finally, as you have already seen, you can have indexes—including primary keys—that are formed out of any number of identifying columns so long as those columns together sufficiently identify a single entity instance. It is always a good idea, however, to build primary keys out of the minimal number of columns possible.

BEST PRACTICE: Use the smallest number of columns possible in your primary keys.

2.1.3.2 Domains

The proper choice of data type is another critical aspect of relational data architecture. It constrains the kind of data that can be stored for a given attribute. By creating an email attribute as a text value, you prevent people from storing numbers in the field. A time-oriented domain like a SQL DATE enables you to perform time arithmetic on date values.

The domains that exist in a relational database depend on the DBMS of choice. Those that support the SQL specification generally support a core set of data types. Just about every database engine comes with its own, proprietary data types. When modeling a system, you should use SQL-standard data types.

Primary keys deserve special consideration when you are putting domain constraints on an entity. Because they are the primary mechanism for getting access to an entity instance, it is important that the database is able to do quick matches against primary key values. In general, numeric types form the best primary keys. I recommend the use of 64-bit, sequentially generated integers for primary key columns. The only exception is for lookup tables.

A lookup table is a small table with a known, finite set of data like a table containing a list of states or, with respect to the music library example, a set of genres. In the case of lookup tables, they more often than not have codes against which you will do most lookups. For example, you will almost always retrieve the state of Maine from a State table by its abbreviation ME. It therefore makes more sense to use fixed character data types like SQL's CHAR for primary keys in lookup tables. The length of these fixed character values should be no more than a few characters.

BEST PRACTICE: Use fixed character data types like CHAR for primary keys in lookup tables.

The data types for other kinds of attributes vary with the diversity in the kinds of data you will want to store in your databases. These days, many databases even support the creation of user-defined data types. These pseudo-object data types prove particularly useful in the development of Java database applications.

2.1.4 Relationships

The creation of relationships among the entities in the database lies at its heart. These relationships enable you to easily answer the question, "On what compact discs in my library did Johnny Rotten play?" Unlike other models, the relational model does not create hard links between two entities. In the hierarchical model, a hard relationship exists between a parent entity and its child entities. The relational model, on the other hand, creates relationships by matching a primary key attribute in one entity to a foreign key attribute in another entity.

The relational model supports three kinds of entity relationships:

One-to-one
One-to-many
Many-to-many

With any of these relationships, one side of the relationship may be optional. An optional relationship allows the foreign key to contain NULL values to indicate the relationship does not exist for that row.

2.1.4.1 One-to-one relationships

The one-to-one relationship is the most rare relationship in the relational model. A one-to-one relationship says that for every instance of entity A, there is a corresponding instance of entity B. It is so rare that its appearance in a data model should be met with skepticism as it generally indicates a design flaw. You indicate a one-to-one relationship in the same way you indicate a one-to-many relationship.

BEST PRACTICE: Recheck your design whenever you encounter one-to-one relationships, as they are often indicators of problematic design choices.

2.1.4.2 One-to-many relationships

A one-to-many relationship means that for every instance of entity A, there can be multiple instances of entity B. As Figure 2-1 shows, the "many" side of the relationship houses the foreign key that points to the primary key of the "one" side of the relationship.

Figure 2-1. A One-to-Many Relationship

Table 2-2 lists data from a Song table whose rows are dependent on rows in the CD table.

Table 2-2. The Song entity with a foreign key from the CD entity

Attribute

Domain

Notes

NULL?

songID

INT

PRIMARY KEY

No

cdID

INT

FOREIGN KEY

No

title

VARCHAR(100)

No

length

INT

No

Under this design, one compact disc is associated with many songs. The placement of cdID into the Song table as a foreign key indicates the dependency on a row of the CD table. In databases that manage foreign key constraints, this dependency will prevent the insertion of songs into the Song table that do not already have a corresponding CD. Similarly, the deletion of a disc will cause the deletion of its associated songs. You should note, however, that not all database engines support foreign key constraints. Of those that do support them, you often have the option of turning them on or off.

Why would you want foreign key constraints off? Many application environments—particularly multitier distributed object systems—prefer to manage dependencies in the object layer instead of the database. It is generally a trade-off between a combination of speed with object purity and guaranteed data integrity. When foreign key constraints are not checked in the database, updates occur more quickly. Furthermore, you do not end up with a situation in which objects exist in the middle tier that have been automatically deleted by the database. On the other hand, if your middle-tier logic is not sound, your application can damage the data integrity without proper foreign key constraints.

You now have a proper relationship between compact discs and their songs. To ask which songs are on a particular compact disc, you need to ask the Song table which songs have the disc's cdID. Assuming you are looking for all songs from the disc Garbage (cdID 2), the SQL to find the songs looks like this:

SELECT songID, title FROM Song WHERE cdID = 2;

More powerfully, however, you can ask for all songs from a compact disc by the disc title:

SELECT Song.songID, Song.title
FROM Song, CD
WHERE CD.title = 'Last Rights'
AND CD.cdID = Song.cdID;

The last part of the query where the cdID was compared in both tables is called a join. A join is where the implicit relationship between two tables becomes explicit.

2.1.4.3 Many-to-many relationships

A many-to-many relationship allows an instance of entity A to be associated with multiple instances of entity B and an instance of entity B to be associated with multiple instances of entity A. These relationships require the creation of a special table to manage the relationship. You may hear these tables referred to by any number of names: composite entities, join tables, cross-reference tables, and so forth. This extra table creates the relationship by having the primary keys of each table in the relationship as foreign keys. It then uses the combination of foreign keys as its own compound primary key. If, for example, we had an Artist table in our music library, we indicate a many-to-many relationship between an Artist and a CD through an ArtistCD join table. Table 2-3 shows this special table.

Table 2-3. The ArtistCD table creates a many-to-many relationship between Artist and CD

Attribute

Domain

Notes

NULL?

cdID

INT

FOREIGN KEY, PRIMARY KEY

No

artistID

INT

FOREIGN KEY, PRIMARY KEY

No

You can now ask for all of the compact discs by Garbage:

SELECT CD.cdID, CD.title
FROM CD, ArtistCD, Artist
WHERE ArtistCD.cdID = CD.cdID
AND ArtistCD.artistID = Artist.artistID
AND Artist.name = 'Garbage';

BEST PRACTICE: Use join tables to model many-to-many relationships.

Another useful aspect of join tables is that you can use them to contain information about a relationship. If, for example, you wanted to track guest artists on albums, where would you store that information? It really is not an attribute of an artist or a compact disc. It is instead an attribute of the relationship between the two entities. To capture this information, you would therefore add a column to ArtistCD called guest. Finding which compact discs on which Sting appeared as a guest artist would then be as simple as:

SELECT CD.cdID, CD.title
FROM CD, ArtistCD, Artist
WHERE ArtistCD.cdID = CD.cdID
AND ArtistCD.artistID = Artist.artistID
AND Artist.name = 'Sting'
AND ArtistCD.guest = 'Y';

2.1.5 NULL

NULL is a special value in relational databases that indicates the absence of a value. If you have a pet store site that gathers information on your users, for example, you may track the number of pets your users have. Without the concept of NULL, you have no proper way to indicate that you do not know how many pets a user has. Applications commonly resort to nonsense values (like -1) or unlikely values (like 9999) as a substitute for NULL.

BEST PRACTICE: Use NULL to represent unknown or missing values.

Though the basic concept of NULL is pretty straightforward, beginning database programmers often have trouble figuring out how NULL works in database operations. A basic example would come about by adding a new column to our Song table that is a rating. It can be NULL since it is unlikely anyone wants to rate every single song in their library. The following SQL may not do what you think:

SELECT songID, title FROM Song WHERE rating = NULL;

No matter what data is in your database, this query will always return zero rows. Relational logic is not Boolean; it is three-value logic: true, false, and unknown. Most NULL comparisons therefore result in NULL since a NULL comparison is indeterminate under three-value logic. SQL provides special mechanisms to test for NULL in the form of IS NULL and IS NOT NULL so that it is possible to ask for the unrated songs:

SELECT songID, title FROM Song WHERE rating IS NULL;

[ Team LiB ]