
27.6 Internet-Related Modules

Python is used in a wide variety of Internet-related tasks, from building web servers to crawling the Web to "screen-scraping" web sites for data. This section briefly describes the modules most often used for such tasks that ship with Python's core. For more detailed examples of their use, we recommend Lundh's Python Standard Library and Martelli and Ascher's Python Cookbook (both O'Reilly). There are also many third-party add-ons worth knowing about before embarking on a significant web- or Internet-related project.

27.6.1 The Common Gateway Interface: The cgi Module

Python programs often process forms from web pages. To make this task easy, the standard Python distribution includes a module called cgi. Chapter 28 includes an example of a Python script that uses the CGI, so we won't cover it any further here.

27.6.2 Manipulating URLs: The urllib and urlparse Modules

Uniform resource locators (URLs) are strings such as http://www.python.org that are now ubiquitous. Three modules (urllib, urllib2, and urlparse) provide tools for processing URLs.

The urllib module defines a few functions for writing programs that must be active users of the Web (robots, agents, etc.). These are listed in Table 27-9.

Table 27-9. Functions of the urllib module

Function name

Behavior

urlopen(url [, data])

Opens (for reading) a network object denoted by a URL; it can also open local files:

>>> from urllib import urlopen
>>> page = urlopen('http://www.python.org')
>>> page.readline(  )
'<HTML>\012'
>>> page.readline(  )
'<!-- THIS PAGE IS AUTOMATICALLY GENERATED.DO NOT EDIT. -->\012'

urlretrieve(url [, filename][, hook])

Copies a network object denoted by a URL to a local file (uses a cache):

>>> urllib.urlretrieve('http://www.python.org/', 'wwwpython.html')

urlcleanup( )

Cleans up the cache used by urlretrieve.

quote(string[, safe])

Replaces special characters in string using the %xx escape. The optional safe parameter specifies additional characters that shouldn't be quoted; its default value is '/':

>>> quote('this & that @ home')
'this%20%26%20that%20%40%20home'

quote_plus(string[, safe])

Like quote( ), but also replaces spaces by plus signs.

unquote(string)

Replaces %xx escapes by their single-character equivalent:

>>> unquote('this%20%26%20that%20%40%20home')
'this & that @ home'

urlencode(dict)

Converts a dictionary to a URL-encoded string, suitable to pass to urlopen( ) as the optional data argument:

>>> locals(  )
{'urllib': <module 'urllib'>, '__doc__': None, 'x':
3, '__name__': '__main__', '__builtins__': <module
'__builtin__'>}
>>> urllib.urlencode(locals(  ))
'urllib=%3cmodule+%27urllib%27%3e&__doc__=None&x=3&
__name__=__main__&__builtins__=%3cmodule+%27
__builtin__%27%3e'
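
The quoting functions in Table 27-9 can be tried without any network access. The following is a minimal sketch rather than a definitive recipe: the sample strings and variable names are our own, and the fallback import is only there because newer Python versions moved these functions into urllib.parse.

```python
try:
    from urllib import quote, quote_plus, unquote, urlencode  # Python 2 location
except ImportError:
    from urllib.parse import quote, quote_plus, unquote, urlencode  # Python 3 location

text = 'this & that @ home'

quoted = quote(text)            # spaces become %20
plus_quoted = quote_plus(text)  # spaces become +
restored = unquote(quoted)      # undoes the %xx escapes

# urlencode also accepts a list of (key, value) pairs, which keeps
# the fields in a predictable order (illustrative field names).
query = urlencode([('q', 'python books'), ('page', 2)])

print(quoted)
print(plus_quoted)
print(restored)
print(query)
```

The resulting query string is what you would pass to urlopen( ) as the data argument, or append to a URL after a question mark.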

The urllib2 module focuses on opening URLs that the simpler urllib doesn't know how to deal with, and provides an extensible framework for new kinds of URLs and protocols. Use it when you need to handle passwords, digest authentication, proxies, HTTPS URLs, and other less common cases.
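
As a rough illustration of that framework, the sketch below builds an opener that knows a password for one site and ignores any configured proxies. The host, username, and password are placeholders we made up, and no request is actually sent; the fallback import reflects that newer Python versions provide the same classes in urllib.request.

```python
try:
    import urllib2 as request_module  # Python 2 name used in this section
except ImportError:
    import urllib.request as request_module  # same classes in Python 3

# Placeholder credentials for a hypothetical protected area of a site.
PROTECTED_URL = 'http://www.example.com/private/'

password_manager = request_module.HTTPPasswordMgrWithDefaultRealm()
# A realm of None means "use these credentials for any realm at this URL".
password_manager.add_password(None, PROTECTED_URL, 'username', 'secret')

auth_handler = request_module.HTTPBasicAuthHandler(password_manager)
# An empty dictionary turns off automatically detected proxies.
proxy_handler = request_module.ProxyHandler({})

opener = request_module.build_opener(auth_handler, proxy_handler)

# opener.open(PROTECTED_URL) would now send the request and answer
# a basic-authentication challenge with the stored credentials.
```

Each handler covers one concern (authentication, proxies, and so on), so supporting a new requirement usually means passing one more handler to build_opener( ).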

The module urlparse defines a few functions that simplify taking URLs apart and putting new URLs together. These are listed in Table 27-10.

Table 27-10. Functions of the urlparse module

Function name

Behavior

urlparse(urlstring[, default_scheme[, allow_fragments]])

Parses a URL into six components, returning a six-tuple (addressing scheme, network location, path, parameters, query, fragment identifier):

>>> urlparse('http://www.python.org/FAQ.html')
('http', 'www.python.org', '/FAQ.html', '', '', '')

urlunparse(tuple)

Constructs a URL string from a tuple as returned by urlparse( ).

urljoin(base, url[, allow_fragments])

Constructs a full (absolute) URL by combining a base URL (base) with a relative URL (url):

>>> urljoin('http://www.python.org', 'doc/lib')
'http://www.python.org/doc/lib'
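
To see how the three functions in Table 27-10 fit together, here is a small sketch that takes a URL apart, rebuilds it, and resolves a relative link. The URL itself is invented for illustration, and the fallback import is only needed on newer Python versions, where these functions live in urllib.parse.

```python
try:
    from urlparse import urlparse, urlunparse, urljoin  # Python 2 location
except ImportError:
    from urllib.parse import urlparse, urlunparse, urljoin  # Python 3 location

# An illustrative URL that uses all six components.
url = 'http://www.python.org/doc/lib/index.html;v=1?section=intro#top'

parts = urlparse(url)
scheme, netloc, path, params, query, fragment = parts

# urlunparse is the inverse of urlparse for a well-formed URL.
rebuilt = urlunparse(parts)

# Reuse the scheme and host, but point at a different document.
faq_url = urlunparse((scheme, netloc, '/FAQ.html', '', '', ''))

# Resolve a relative link the way a browser would.
sibling = urljoin('http://www.python.org/doc/lib/index.html', '../FAQ.html')

print(path, params, query, fragment)
print(rebuilt)
print(faq_url)
print(sibling)
```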

27.6.3 Specific Internet Protocols

The most commonly used protocols built on top of TCP/IP are supported by modules named after them. The telnetlib module lets you act as a Telnet client. The httplib module lets you talk to web servers using HTTP. The ftplib module transfers files using FTP. The gopherlib module is for browsing Gopher servers (now fairly rare). In the domains of mail and news, you can use the poplib and imaplib modules to read mail on POP3 and IMAP servers, respectively; the smtplib module to send mail; and the nntplib module to read and post Usenet news on NNTP servers.

There are also modules for building Internet servers: a generic socket-based IP server (SocketServer), a simple web server (SimpleHTTPServer), a CGI-compliant HTTP server (CGIHTTPServer), and a module for building asynchronous socket-handling services (asyncore).
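
The generic socket server is the easiest of these to try locally. The sketch below is one possible arrangement, not a production design: the handler class and the echo_once helper are hypothetical names of ours, the server handles a single request on a free local port, and the fallback import covers newer Python versions, where the module is named socketserver.

```python
import socket
import threading

try:
    import SocketServer as socketserver  # Python 2 name used in this section
except ImportError:
    import socketserver  # the module was renamed in Python 3


class UpperCaseHandler(socketserver.StreamRequestHandler):
    """Hypothetical handler: reads one line and sends it back uppercased."""

    def handle(self):
        line = self.rfile.readline()
        self.wfile.write(line.upper())


def echo_once(message):
    """Start a throwaway local server, send it one line, return the reply."""
    # Port 0 asks the operating system for any free port.
    server = socketserver.TCPServer(('127.0.0.1', 0), UpperCaseHandler)
    server.timeout = 5  # give up if no client connects within 5 seconds
    port = server.server_address[1]

    # Serve exactly one request in a background thread.
    worker = threading.Thread(target=server.handle_request)
    worker.start()
    try:
        client = socket.create_connection(('127.0.0.1', port))
        try:
            client.sendall(message)
            reply = client.makefile('rb').readline()
        finally:
            client.close()
    finally:
        worker.join()
        server.server_close()
    return reply


print(echo_once(b'hello\n'))
```

A long-running server would call serve_forever( ) instead of handle_request( ); the single-request form is used here only so the example terminates on its own.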

Support for web services currently consists of a core library to process XML-RPC client-side calls (xmlrpclib), as well as a simple XML-RPC server implementation (SimpleXMLRPCServer). Support for SOAP is likely to be added when the SOAP standard becomes more stable.
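
You can look at what xmlrpclib puts on the wire without contacting a server, because the module exposes its marshalling functions directly. The following sketch assumes a made-up remote method named add; the fallback import is for newer Python versions, where the module is xmlrpc.client.

```python
try:
    import xmlrpclib  # Python 2 name used in this section
except ImportError:
    import xmlrpc.client as xmlrpclib  # same API in Python 3

# Marshal a call to a hypothetical remote method add(2, 3).
request_xml = xmlrpclib.dumps((2, 3), methodname='add')

# A server would decode the request like this.
params, method = xmlrpclib.loads(request_xml)

# Marshal the server's answer, then decode it as the client would.
response_xml = xmlrpclib.dumps((sum(params),), methodresponse=True)
result, _ = xmlrpclib.loads(response_xml)

print(request_xml)
print(method, params)
print(result[0])
```

In normal use you rarely call dumps( ) and loads( ) yourself; a ServerProxy object on the client side and SimpleXMLRPCServer on the server side do this for you.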

27.6.4 Processing Internet Data

Once you use an Internet protocol to obtain files from the Internet (or before you serve them to the Internet), you often must process these files. They come in many different formats. Table 27-11 lists each module in the standard library that processes a specific kind of Internet-related file format (there are others for sound and image format processing; see the library reference manual).

Table 27-11. Modules dedicated to Internet file processing

Module name    File format
sgmllib        A simple parser for SGML files.
htmllib        A parser for HTML documents.
formatter      Generic output formatter and device interface.
rfc822         Parse RFC-822 mail headers (e.g., "Subject: hi there!").
mimetools      Tools for parsing MIME-style message bodies (a.k.a. file attachments).
multifile      Support for reading files that contain distinct parts.
binhex         Encode and decode files in binhex4 format.
uu             Encode and decode files in uuencode format.
binascii       Convert between binary and various ASCII-encoded representations.
xdrlib         Encode and decode XDR data.
mailcap        Mailcap file handling.
mimetypes      Mapping of filename extensions to MIME types.
base64         Encode and decode MIME base64 encoding.
quopri         Encode and decode MIME quoted-printable encoding.
mailbox        Read various mailbox formats.
mimify         Convert mail messages to and from MIME format.
email          A package for parsing, handling, and generating email messages.
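
A few of these modules are simple enough to demonstrate in a handful of lines. The sketch below touches base64, binascii, mimetypes, and the email package; the sample data, filename, and message text are invented for illustration.

```python
import base64
import binascii
import email
import mimetypes

raw = b'Learning Python'

# base64: encode binary data as ASCII text and back again.
encoded = base64.b64encode(raw)
decoded = base64.b64decode(encoded)

# binascii: another ASCII representation, two hex digits per byte.
hex_form = binascii.hexlify(raw)

# mimetypes: guess a MIME type from a filename extension.
guessed_type, guessed_encoding = mimetypes.guess_type('report.html')

# email: parse a message held in a string and read its parts.
message = email.message_from_string(
    'Subject: hi there!\n'
    'From: someone@example.com\n'
    '\n'
    'Body text.\n'
)
subject = message['Subject']
body = message.get_payload()

print(encoded, hex_form)
print(guessed_type)
print(subject)
```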

27.6.5 XML Processing

Python comes with a rich set of XML-processing tools. These include parsers, DOM interfaces, SAX interfaces, and more, as shown in Table 27-12.

Table 27-12. Some of the XML modules in the core distribution

Module name          Description
xml.parsers.expat    An interface to the Expat nonvalidating XML parser.
xml.dom              Document Object Model (DOM) API for Python.
xml.dom.minidom      Lightweight DOM implementation.
xml.dom.pulldom      Support for building partial DOM trees from SAX events.
xml.sax              Package containing SAX2 base classes and convenience functions.
xml.sax.handler      Base classes for SAX event handlers.
xml.sax.saxutils     Convenience functions and classes for use with SAX.
xml.sax.xmlreader    Interface that SAX-compliant XML parsers must implement.
xmllib               A parser for XML documents.
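
To give a feel for the difference between the DOM and SAX styles, the sketch below extracts the same information from one small document both ways. The document and the TitleCollector class are our own inventions for illustration.

```python
import xml.sax
from xml.dom import minidom

# A tiny made-up document.
DOCUMENT = (b'<library>'
            b'<book title="Learning Python"/>'
            b'<book title="Python Cookbook"/>'
            b'</library>')

# DOM style: build the whole tree in memory, then query it.
dom = minidom.parseString(DOCUMENT)
dom_titles = [node.getAttribute('title')
              for node in dom.getElementsByTagName('book')]


# SAX style: react to events as the parser walks through the document.
class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records the title of every book element."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.titles = []

    def startElement(self, name, attrs):
        if name == 'book':
            self.titles.append(attrs.get('title'))


collector = TitleCollector()
xml.sax.parseString(DOCUMENT, collector)

print(dom_titles)
print(collector.titles)
```

The DOM version is shorter for small inputs; the SAX version never holds the whole document in memory, which matters for large files.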

See the standard library reference for details, or the Python Cookbook (O'Reilly) for example tasks easily solved using the standard XML libraries. The XML facilities are developed by the XML Special Interest Group, which publishes versions of the XML package between Python releases. See http://www.python.org/topics/xml for details and the latest version of the code. For expanded coverage, consider Python and XML, by Christopher A. Jones and Fred L. Drake, Jr. (O'Reilly).
