
8.1 Introducing W3C XML Schema

W3C XML Schema is a standard that provides additional control over the structure of XML documents. A formalized structure allows for the following tasks:


validation

Ensuring that a document has all required elements and attributes, in the required order.


documentation

Informing users and developers what elements and attributes are required.


querying

If you know the document's structure, you can navigate it more efficiently.


data binding

When you know the document's structure, you can mirror it in other data structures and transfer data back and forth between them more efficiently.


editing

If you know the document structure, editing tools can provide guidance in creating and manipulating a document.

While W3C XML Schema was being developed, other groups that saw the need for a formalized XML document structure developed schema languages of their own. RELAX NG and Schematron are the results of some of these efforts; however, neither has the cachet of being an official W3C recommendation.

.NET supports W3C XML Schema version 1.0, Part 1 (XML Schema Part 1: Structures) and Part 2 (XML Schema Part 2: Datatypes). The XML Schema recommendation also includes Part 0, a primer. If you are interested in learning more about XML Schema than this book can provide, the Primer is a good place to start.

The official W3C recommendations for Part 0, Part 1, and Part 2 are available at http://www.w3.org/TR/xmlschema-0/, http://www.w3.org/TR/xmlschema-1/, and http://www.w3.org/TR/xmlschema-2/, respectively.


8.1.1 Using W3C XML Schema

An XML Schema document (XSD), like an XSLT stylesheet, is itself an XML document. It may contain an XML declaration, and must contain a namespace declaration for the URI http://www.w3.org/2001/XMLSchema. This namespace is traditionally mapped to the prefix xs. The document element of an XSD is xs:schema; the simplest possible XSD, therefore, is the following:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" />

Of course, this XSD defines no structure, so it is mostly useless. To be more useful, it should include at least one element, representing the document element of the XML document it describes:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Customer" />
</xs:schema>

The xs:element element is called a particle. A particle can be thought of as representing a single unit of markup, or a grouping of such units. Other particles include xs:any, xs:choice, and xs:sequence, among others. xs:all, xs:sequence, and xs:choice are also compositors: elements that define groups of particles.

A document using this schema would need to have the following content in order to be valid:

<Customer />

You may have already noticed that I've deviated from the style used in earlier parts of this book by capitalizing the first letter of the Customer element. I'll be capitalizing the first letter of every element and attribute name in this XSD. Hold that thought! I'll explain the different style in a little while.


Still not very useful, is it? Let's add a little more to the XML Schema, customer.xsd:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.0">
  <xs:element name="Customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

The xs:complexType schema element indicates that its enclosing element's content is more than just simple text; it actually has structure. This can be thought of as the real minimum requirement for using XML Schema, because a schema for a document with an empty document element is not very useful at all.

The xs:sequence element contains an ordered list of elements. Other compositors include xs:choice, which indicates that any one of the listed elements may appear, and xs:all, which indicates that the listed elements may appear in any order.

With xs:sequence, I've now defined a document structure that looks like this:

<Customer>
  <Name>Amalgamated Construction</Name>
</Customer>
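
For comparison, here is a minimal sketch of the other two compositors. The Contact and Person elements and their children are hypothetical names used only for illustration; they are not part of the customer schema developed in this chapter:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.0">
  <!-- Hypothetical: Contact holds exactly one of Phone or Email -->
  <xs:element name="Contact">
    <xs:complexType>
      <xs:choice>
        <xs:element name="Phone" />
        <xs:element name="Email" />
      </xs:choice>
    </xs:complexType>
  </xs:element>

  <!-- Hypothetical: Person holds both First and Last, in either order -->
  <xs:element name="Person">
    <xs:complexType>
      <xs:all>
        <xs:element name="First" />
        <xs:element name="Last" />
      </xs:all>
    </xs:complexType>
  </xs:element>
</xs:schema>

Under this sketch, a Contact element containing both a Phone and an Email would be invalid, while a Person element could list Last before First.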

In order to constrain the number of times an element may appear in a sequence, you can add the minOccurs and maxOccurs attributes. Once you have done that, you might as well define the type of data that appears in the Name element as well. The new schema looks like this:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.0">
  <xs:element name="Customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name" minOccurs="1" maxOccurs="1" type="xs:token" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Now you're constrained to exactly one Name element, and its content may consist of any valid XML token (a string with any whitespace collapsed). By virtue of its data type constraint, this relatively simple XSD is already more complex than anything that could have been defined with a DTD.

The values of minOccurs and maxOccurs both default to 1, so this change was not strictly necessary, and I'll omit them in the rest of the examples if they have the default values. The value of minOccurs must be a nonnegative integer, while maxOccurs may be any nonnegative integer greater than or equal to minOccurs, or the literal string "unbounded".
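
As a rough illustration of how these two attributes combine, the following sketch adds two hypothetical elements, Nickname and Phone, which do not appear in the customer schema used elsewhere in this chapter:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.0">
  <xs:element name="Customer">
    <xs:complexType>
      <xs:sequence>
        <!-- Exactly one Name: minOccurs and maxOccurs both default to 1 -->
        <xs:element name="Name" type="xs:token" />
        <!-- Hypothetical: optional, at most one Nickname -->
        <xs:element name="Nickname" minOccurs="0" type="xs:token" />
        <!-- Hypothetical: any number of Phone elements, including none -->
        <xs:element name="Phone" minOccurs="0" maxOccurs="unbounded" type="xs:token" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>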

The type attribute can take on any of quite a number of values, for predefined types. It can also hold custom types, as you'll see in a moment.
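
To give a feel for the predefined types, here is a small sketch that uses a few of them. The Order element and its children are hypothetical and serve only to show the type names, all of which are built into XML Schema:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.0">
  <!-- Hypothetical element used only to demonstrate built-in types -->
  <xs:element name="Order">
    <xs:complexType>
      <xs:sequence>
        <!-- A calendar date, such as 2003-11-01 -->
        <xs:element name="Placed" type="xs:date" />
        <!-- A whole number greater than zero -->
        <xs:element name="Quantity" type="xs:positiveInteger" />
        <!-- A decimal number, such as 19.95 -->
        <xs:element name="Total" type="xs:decimal" />
        <!-- One of true, false, 1, or 0 -->
        <xs:element name="Rush" type="xs:boolean" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>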

This schema is acceptable, but customers have more information that could appear in the XML document. Customers should also have a customer ID and an address:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.0">
  <xs:element name="Customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name" type="xs:token" />
        <xs:element name="Address" maxOccurs="unbounded" type="xs:string" />
      </xs:sequence>
      <xs:attribute name="Id" type="xs:ID" />
    </xs:complexType>
  </xs:element>
</xs:schema>

The document can now have one or more Address elements, containing data of type xs:string (that is, character data with whitespace retained) to hold freeform address information. According to the schema, the Address elements must come after the Name element, because xs:sequence constrains the order of elements. I've also added an Id attribute to the Customer element. Id's value is of type xs:ID (it must contain only letters, digits, or the punctuation marks _, -, and .; must begin with a letter or an underscore; and must be unique among all values of type xs:ID in the document).
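
A document along the following lines should satisfy this version of the schema. The second Address value is made up for illustration:

<Customer Id="customer.8873">
  <Name>Amalgamated Construction</Name>
  <!-- Freeform address text; whitespace is kept as written -->
  <Address>81 San Leandro Blvd, Suite 5D</Address>
  <!-- Hypothetical second address, allowed because maxOccurs is unbounded -->
  <Address>PO Box 1234</Address>
</Customer>

By contrast, an Id value such as 8873 would be rejected, because an xs:ID value may not begin with a digit.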

That Address element is not quite right, though. Although a freeform address may work well enough for many purposes, it really doesn't take proper advantage of XML's promise of structured data. Instead, a better document structure would look like this:

<Customer Id="customer.8873">
  <Name>Amalgamated Construction</Name>
  <Address>
    <Street>81 San Leandro Blvd</Street>
    <Street>Suite 5D</Street>
    <City>Albuquerque</City>
    <State>NM</State>
    <Zip>08765-9999</Zip>
  </Address>
</Customer>

The XSD for this document could be the following:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.0">
  <xs:element name="Customer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name" type="xs:token" />
        <xs:element name="Address" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Street" maxOccurs="3" type="xs:string" />
              <xs:element name="City" type="xs:string" />
              <xs:element name="State" type="xs:string" />
              <xs:element name="Zip" type="USZipCodeType" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="Id" type="xs:ID"/>
    </xs:complexType>
  </xs:element>

  <xs:simpleType name="USZipCodeType">
    <xs:restriction base="xs:token">
      <xs:pattern value="\d{5}(-\d{4})?" />
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

With these changes, Address becomes an element that must have from one to three Street elements, and exactly one each of the City, State, and Zip elements. I also added a new twist by defining a simple type called USZipCodeType.

The xs:simpleType element defines USZipCodeType, a type that can be used in multiple places within the XSD. In this case, the type represents a United States zip code, which must be composed of either five numerals, or five numerals followed by a hyphen and four numerals; that is, nnnnn or nnnnn-nnnn. This pattern is expressed by the regular expression \d{5}(-\d{4})?. The xs:restriction and xs:pattern elements work together to restrict the value to a token that matches the regular expression in the value attribute.
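
To illustrate the reuse, the following sketch refers to USZipCodeType from two elements. The Shipment, OriginZip, and DestinationZip names are hypothetical; only the simple type itself comes from the schema above:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" version="1.0">
  <!-- Hypothetical element that uses the same named type twice -->
  <xs:element name="Shipment">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="OriginZip" type="USZipCodeType" />
        <xs:element name="DestinationZip" type="USZipCodeType" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- Accepts values such as 87102 or 87102-0001;
       rejects values such as 8710, 87102-1, or ABCDE -->
  <xs:simpleType name="USZipCodeType">
    <xs:restriction base="xs:token">
      <xs:pattern value="\d{5}(-\d{4})?" />
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

Note that an xs:pattern expression must match the entire value, so no ^ or $ anchors are needed.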

XML Schema's regular expression syntax is based on Perl regular expressions, with some minor differences. To learn more about regular expressions, see Mastering Regular Expressions, 2nd Edition (O'Reilly).


Clearly you can keep going with this pattern of adding elements and attributes until the document is perfectly modeled. To sound a familiar refrain, XML Schema can do a lot more than this; see Eric van der Vlist's XML Schema (O'Reilly) to learn more.

8.1.2 When to Use W3C XML Schema

You should probably create an XML Schema for your XML documents if any of the following apply:

  • The order and number of nodes in your document must be constrained.

  • The data within your document's nodes must be constrained more specifically than a DTD allows.

  • You wish to generate code to read and write XML to and from .NET types or a relational database using XmlSerializer or XmlDataDocument.

Conversely, you should consider sticking with DTD validation alone only if all of the following apply:

  • Nodes may appear in your XML document in any order and number.

  • The data in your document's nodes may be free-form, and need not be constrained.

  • You will be reading and writing data to and from XML documents only using XmlReader, XmlWriter, XmlDocument, XPathDocument, and the other built-in .NET XML types.

8.1.3 Other Ways to Constrain XML Structure

Although DTDs also provide a means of constraining XML, and validation using DTDs is supported by .NET, they do not provide as much control over XML content as XML Schema does. For example, a DTD can express only coarse occurrence counts (zero or one, zero or more, one or more), and it cannot enforce the data type of element content. XML Schema was designed to make up for these limitations of DTDs, and does so quite well.

A common complaint about XML Schema, however, is that it is too complex. RELAX NG was developed concurrently with XML Schema, has been adopted as a standard by the Organization for the Advancement of Structured Information Standards (OASIS), and has been accepted as a draft international standard of the International Organization for Standardization (ISO). RELAX NG began its life as two competing validation languages, RELAX and TREX, which were merged in 2001. RELAX NG is as capable of describing XML structures as XML Schema, and arguably simpler to use. However, .NET does not support validation with RELAX NG.

There is nothing to keep some enterprising developer from building a RELAX NG validator for .NET, of course. Also, James Clark's Trang processor (http://www.thaiopensource.com/relaxng/trang.html) lets you work in RELAX NG and convert your results to W3C XML Schema.
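
For a rough sense of the difference in style, here is one way the earlier customer schema with a freeform Address might be sketched in RELAX NG's XML syntax. This is an illustrative approximation written for comparison, not a conversion produced by Trang:

<?xml version="1.0"?>
<element name="Customer"
         xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- The Id attribute is optional, as it is in the XSD -->
  <optional>
    <attribute name="Id">
      <data type="ID" />
    </attribute>
  </optional>
  <!-- Exactly one Name, then one or more Address elements -->
  <element name="Name">
    <data type="token" />
  </element>
  <oneOrMore>
    <element name="Address">
      <data type="string" />
    </element>
  </oneOrMore>
</element>

The pattern mirrors the shape of the document it describes, and the data types are borrowed from W3C XML Schema Part 2 by way of the datatypeLibrary attribute.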
