6.1 What Is XPath?

XPath is a specification that allows you to address individual parts of an XML document, originally intended for use in the XSLT transformation language and the XPointer syntax for XML fragment identifiers. However, XPath is quite useful on its own, and is available for standalone use in .NET.
XPath 1.0 became a formal recommendation of the W3C in November 1999, although XPath 2.0 is currently a working draft, still evolving as of this writing. The official XPath recommendation is located on the web at http://www.w3.org/TR/xpath.

The essence of XPath is that you can select certain nodes from within an XML document through a simple XPath expression. In addition, XPath allows you to do some simple string, numeric, and Boolean data transformation on selected nodes.

XPath expressions take the form of strings with a certain well-known syntax. This syntax is not explicitly XML itself; it is similar to filesystem pathnames and URLs, and this is where XPath gets its name. In addition to addressing nodes by name, XPath syntax enables pattern matching, so that you can select individual nodes by their attribute or content values.

In this section, I'll discuss the structure and syntax of XPath expressions, and some of the functions built in to the specification.

6.1.1 Introduction to the XPath Specification

Just like DOM, XPath operates on a tree-based view of an XML document. The XPath tree is built of the same node types used in DOM, except that CDATA sections, entity references, and document type declarations are not directly addressable. Their content is, however; the net result is that you can navigate to a text node's content, but you cannot tell whether that content contains plain text, CDATA, expanded entity references, or some combination thereof. You cannot access document type declarations at all with XPath.

For this discussion, I'll return to the inventory example from Chapter 5. That example included an inventory database that looked similar to the one in Example 6-1; here I've added some additional products.

Example 6-1. Angus Hardware inventory database

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE inventory SYSTEM "inventory.dtd">
    <inventory>
      <!-- Warehouse inventory for Angus Hardware -->
      <date year="2002" month="7" day="6" />
      <items>
        <item quantity="15" productCode="R-273" description="14.4 Volt Cordless Drill" unitCost="189.95" />
        <item quantity="23" productCode="1632S" description="12 Piece Drill Bit Set" unitCost="14.95" />
        <item quantity="10023" productCode="GN0250" description="1/4 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
        <item quantity="9887" productCode="GN0375" description="3/8 inch Galvanized Steel Nails, 1/2 pound box" unitCost="189.95" />
        <item quantity="8761" productCode="GN0500" description="1/2 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
        <item quantity="3441" productCode="GN0625" description="5/8 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
        <item quantity="9987" productCode="GN0750" description="3/4 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
        <item quantity="10002" productCode="GN0875" description="7/8 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
        <item quantity="596" productCode="GN1000" description="1 inch Galvanized Steel Nails, 1/2 pound box" unitCost="4.95" />
      </items>
    </inventory>

6.1.1.1 Parts of an XPath expression

To introduce the proper terminology, each part of the XPath expression is called a location step. Each location step is made up of an axis, a node test, and zero or more predicates. Location steps are separated by the slash character (/).

The axis specifies the tree relationship between the nodes selected by the location step and the context node. Many axes have abbreviations which, while very convenient, are not always obvious to someone new to XPath. Table 6-1 shows the axes, their abbreviations, and brief descriptions of their meanings.
The node test specifies the type and name of the nodes selected by the location step. A node test can be an element or attribute name, or one of the node type tests: text( ), which matches text nodes; comment( ), which matches comment nodes; processing-instruction( ), which matches processing instructions; and node( ), which matches nodes of any type. Which nodes are tested depends on the axis. The child axis is the default for any location step that does not have an explicit axis, so a bare node( ) selects all children of the context node.

A predicate further refines the set of nodes selected by the location step. Predicates can select a specific element by position, and can call functions like count( ). Predicates always appear in square brackets ([ ]).
I'll show you some of these terms in their proper context as we go along.

6.1.1.2 Selecting elements

If you have an XML document such as the inventory database in Example 6-1, you might wish to select certain nodes from it. For example, you might want to know the date the inventory numbers were recorded. The following XPath expression would return the date element:

    /descendant-or-self::node( )/child::date

The double colon (::) separates the axis from the node test, which here is the name of the element being selected. Since child is the default axis, and // is an abbreviation for /descendant-or-self::node( )/, this can also be expressed in the abbreviated syntax:

    //date

Every XPath expression has a context node. The context node is the node from which the search begins. In most cases, an XPath implementation allows you to select the node you wish to use as the context node. However, you can explicitly indicate that the search is to begin from the root node of the document by beginning the expression with a slash. Following the double slash, the string date indicates that the expression is to return all nodes that are descendants of the root node and have the name date. Note that a single slash would not work here: /date selects only a child of the root node named date, and in Example 6-1 the root node's only element child is inventory.
For the inventory document example, this expression would return the element <date year="2002" month="7" day="6" />. If there are other nodes elsewhere in the tree with the name date, each of them would be returned as well. You can make your search more specific by including only those nodes with the name date that are children of the top-level inventory element, using this expression:

    /child::inventory/child::date

And again, this can be expressed with the abbreviated syntax:

    /inventory/date

In much the same vein, you could navigate to the items element with any of the following expressions; for this document, they all select the same node if the context node is the root node of the document:

    //child::inventory/child::items
    //inventory/items
    /inventory/items
    inventory/items

The single leading slash (/), as explained previously, indicates that the context node is to be ignored and the search is to be done starting at the root node. The double slash (//) has a slightly different meaning: at any point within the expression, it indicates that the search is to include the context node as well as all its descendants. At the beginning of the expression, the double slash starts that search at the root node, so //inventory/items finds an items child of an inventory element at any depth. It selects the same node as /inventory/items here only because this document contains a single inventory element. The expression with no leading slash indicates that the search is relative to the context node.

// is actually just an abbreviation for /descendant-or-self::node( )/. So another equivalent to the expressions above would be:

    /descendant-or-self::node( )/child::inventory/child::items

This expansion and replacement of axes really could go on forever.

Once you have retrieved the items element, you can make it the context node for your next XPath expression. You can then return the list of item elements with this expression:

    item

You can then iterate through each of these item nodes, doing as you wish with them.

If you have an item element and wish to gather information about the inventory date, you can use the double period (..), which is an abbreviation for parent::node( ).
This step selects the parent of the context node. So, to get the date element from an item element's context, you could use this expression:

    ../../date

The double period can be used anywhere in the expression. For example, you can combine some of the previous forms to return the date element in a fairly inefficient yet entirely legal way. This sort of construct really comes into its own when you start to build XPath expressions dynamically:

    //item/../../date
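As a rough illustration of absolute paths, relative paths, and the parent step, the following sketch evaluates a few of the expressions above. It is written in Java with the standard javax.xml.xpath API rather than the .NET classes this chapter targets, which is purely an assumption made for the example. The class name ElementSelectionSketch, its helper methods, and the trimmed copy of Example 6-1 (no DOCTYPE, three items, fewer attributes) are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ElementSelectionSketch {
    // Trimmed, hypothetical copy of Example 6-1: no DOCTYPE, only three items.
    static final String XML =
        "<inventory>"
        + "<date year=\"2002\" month=\"7\" day=\"6\" />"
        + "<items>"
        + "<item quantity=\"15\" productCode=\"R-273\" />"
        + "<item quantity=\"23\" productCode=\"1632S\" />"
        + "<item quantity=\"10023\" productCode=\"GN0250\" />"
        + "</items>"
        + "</inventory>";

    // Parses the sample into a DOM tree; the Document plays the role of the root node.
    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates an expression against the given context node and returns the node set.
    static NodeList select(String expression, Node context) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, context, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse();
        // Absolute path: the search starts at the root node.
        Node items = select("/inventory/items", doc).item(0);
        // Relative path: the items element is now the context node.
        NodeList itemList = select("item", items);
        System.out.println("item elements: " + itemList.getLength());
        // From one item, climb two levels and come back down to date.
        Node date = select("../../date", itemList.item(0)).item(0);
        System.out.println("reached: " + date.getNodeName());
        // The same climb, starting from every item in the document.
        System.out.println("matches: " + select("//item/../../date", doc).getLength());
    }
}
```

Because the result of an XPath expression is a node set, the last expression yields the date element once, even though every item leads to it.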
You can also select multiple elements at once, with the pipe character (|). The following expression selects both the date and item elements from the document:

    //item|//date

6.1.1.3 Selecting attributes

XPath defines a special character to select an attribute node. The at sign (@) indicates that the node to select is an attribute; @ is an abbreviation for attribute::. Attributes can be intermingled with other nodes in the XPath expression. Thus, the following expression selects the year attribute of the date element:

    //inventory/date/@year

And again, although it is an odd and somewhat inefficient way to do it, you could select the month attribute from any element that has a year attribute with this expression:

    //@year/../@month

You can also use wildcards for element and attribute names. An asterisk (*) matches all element nodes, and @* matches all attribute nodes. This expression returns all attributes for all elements:

    //*/@*

Finally, the node( ) node test matches nodes of any type on the axis being searched.

You may find it helpful to expand the axis abbreviations into their full axes as an aid to learning. For example, //inventory/date/@year is equivalent to /descendant-or-self::node( )/child::inventory/child::date/attribute::year, which, while specific, is not exactly terse.

6.1.1.4 Selecting text, comments, and processing instructions

XPath also defines several node tests to select the other types of nodes. The first of these, text( ), selects any text node. The data returned will concatenate all text, whitespace, CDATA, and entity references into a continuous stream of characters, as long as there is no markup separating them:

    //text( )
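The union, attribute, and wildcard expressions discussed so far can be tried out with a short sketch like the one below. As an assumption for illustration only, it uses Java's standard javax.xml.xpath API instead of the .NET classes; the class AttributeSelectionSketch, its count and value helpers, and the two-item sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AttributeSelectionSketch {
    // Trimmed, hypothetical copy of Example 6-1: no DOCTYPE, two items, two attributes each.
    static final String XML =
        "<inventory>"
        + "<date year=\"2002\" month=\"7\" day=\"6\" />"
        + "<items>"
        + "<item quantity=\"15\" productCode=\"R-273\" />"
        + "<item quantity=\"23\" productCode=\"1632S\" />"
        + "</items>"
        + "</inventory>";

    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Number of nodes the expression selects, with the root node as context.
    static int count(String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, parse(), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // String value of the first node the expression selects.
    static String value(String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, parse());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Union: two item elements plus the date element.
        System.out.println(count("//item|//date"));
        // Attribute step written with the @ abbreviation.
        System.out.println(value("//inventory/date/@year"));
        // Up from the year attribute to its element, then down to month.
        System.out.println(value("//@year/../@month"));
        // Every attribute of every element in the sample.
        System.out.println(count("//*/@*"));
    }
}
```

Asking for a string rather than a node set, as value does, returns the string value of the first selected node in document order, which is convenient when an expression is expected to match a single attribute.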
The comment( ) node test selects comments. Each comment is returned as a separate node, even if there is no text or markup between them:

    //comment( )

As the name implies, the processing-instruction( ) node test selects processing instructions:

    //processing-instruction( )

With all the expressions you've seen so far, you can move up or down the node hierarchy at will, by inserting the appropriate axis. For example, you can select all the attributes of the parent nodes of any processing instructions with this expression:

    //processing-instruction( )/../@*

6.1.1.5 Selecting nodes by value

However, there are times when selecting all the elements or attributes with a particular name is not enough. You may want to find all the elements with a particular attribute value. For this purpose, XPath defines predicates. The following expression selects any item elements that have a productCode attribute whose value is equal to GN0500:

    //item[@productCode='GN0500']

You might also want to find all the items for which fewer than 10,000 units are in stock. The following XPath expression would discover that, and select their description attributes:

    //item[@quantity<10000]/@description

XPath supports the comparison operators =, !=, <, >, <=, and >=, as well as the logical operators and and or. Most values are converted automatically to an appropriate numeric or Boolean value, if the operator requires that type.
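To make the predicate examples concrete, here is a minimal sketch that runs them against a trimmed copy of Example 6-1. It assumes Java's standard javax.xml.xpath API in place of the .NET classes; the class PredicateSketch, the values helper, and the three-item sample are hypothetical, and the last expression (combining two conditions with and) is an illustration that does not appear in the text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PredicateSketch {
    // Trimmed, hypothetical copy of Example 6-1: no DOCTYPE, three items.
    static final String XML =
        "<inventory>"
        + "<!-- Warehouse inventory for Angus Hardware -->"
        + "<items>"
        + "<item quantity=\"15\" productCode=\"R-273\""
        + " description=\"14.4 Volt Cordless Drill\" unitCost=\"189.95\" />"
        + "<item quantity=\"10023\" productCode=\"GN0250\""
        + " description=\"1/4 inch Galvanized Steel Nails, 1/2 pound box\" unitCost=\"4.95\" />"
        + "<item quantity=\"8761\" productCode=\"GN0500\""
        + " description=\"1/2 inch Galvanized Steel Nails, 1/2 pound box\" unitCost=\"4.95\" />"
        + "</items>"
        + "</inventory>";

    // Returns the value of every node selected (attribute value or comment text).
    static List<String> values(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // String comparison on an attribute value.
        System.out.println(values("//item[@productCode='GN0500']/@description"));
        // Numeric comparison: the attribute value is converted to a number.
        System.out.println(values("//item[@quantity<10000]/@description"));
        // Two conditions joined with "and" (an added illustration).
        System.out.println(values("//item[@quantity<10000 and @unitCost<10]/@productCode"));
        // The comment( ) node test from the previous section.
        System.out.println(values("//comment()"));
    }
}
```

Note that when an expression like these is embedded in an XML file, such as an XSLT stylesheet, the less-than sign has to be written as &lt;; inside a Java string literal it can be written directly.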
6.1.2 When to Use XPath

You should use XPath when you have an XML node in memory and you wish to navigate directly to a particular child node. This presumes that you have either created or loaded an XmlDocument in memory.

You can also load an XML document directly into an XPathDocument from a Stream, URL, TextReader, or XmlReader. This method obviates the need to create an XmlDocument at all, and is more efficient than the DOM, since the XPathDocument is a read-only representation of the XML document.

XPath is a good substitute for XmlReader when you have already read an entire document into memory, and the document is to be processed randomly. If you have an extremely large XML document, or you wish to access it strictly sequentially, however, there can be a performance advantage to writing an XmlReader client that handles parsing events. For example, if you are only interested in a certain node within the document, there is no need to load the entire document into memory; you should write an XmlReader client to handle the specific parsing event that indicates the node in question has been read, and skip the rest.
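The streaming alternative can be sketched as follows. This is not XmlReader itself: as an assumption for illustration, it uses Java's standard StAX pull parser (javax.xml.stream), which follows a similar read-one-event-at-a-time model. The class StreamingDateSketch, the inventoryYear method, and the shortened sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StreamingDateSketch {
    // Shortened, hypothetical copy of Example 6-1 with the DOCTYPE omitted.
    static final String XML =
        "<inventory>"
        + "<date year=\"2002\" month=\"7\" day=\"6\" />"
        + "<items>"
        + "<item quantity=\"15\" productCode=\"R-273\" />"
        + "</items>"
        + "</inventory>";

    // Reads events one at a time and stops at the first date element,
    // so the remainder of the document is never parsed or held in memory.
    static String inventoryYear(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "date".equals(reader.getLocalName())) {
                        // A null namespace URI means the namespace is not checked.
                        return reader.getAttributeValue(null, "year");
                    }
                }
                return null; // no date element in the document
            } finally {
                reader.close();
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(inventoryYear(XML));
    }
}
```

The trade-off is the one described above: this approach avoids building a tree for a large document, but each new question about the document means another sequential pass, whereas an in-memory tree can be queried repeatedly with different XPath expressions.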