
8.6 Coping with Disaster

When disaster strikes, it really helps to know what to do. Knowing to duck under a sturdy table or desk during an earthquake can save you from being pinned under a toppling monitor. Knowing how to turn off your gas can save your house from conflagration.

Likewise, knowing what to do in a network disaster (or even just a minor mishap) can help you keep your network running. Living out in California, as we do, we have some experience and some suggestions.

8.6.1 Short Outages (Hours)

If your network is cut off from the outside world (whether "the outside world" is the rest of the Internet or the rest of your company), your name servers may start to have trouble resolving names. For example, if your domain, corp.acme.com, is cut off from the rest of the Acme Internet, you may not have access to your parent (acme.com) name servers or to the root name servers.

You'd think this wouldn't affect communication between hosts in your local domain, but it can. For example, if you type:

% telnet selma.corp.acme.com

on a host running an older version of the resolver, the first domain name the resolver looks up is selma.corp.acme.com.corp.acme.com (assuming your host is using the default search list—remember this from Chapter 6?). The local name server, if it's authoritative for corp.acme.com, can tell that's not a kosher domain name. The next lookup, however, is for selma.corp.acme.com.acme.com. This prospective domain name is no longer in the corp.acme.com zone, so the query is sent to the acme.com name servers. Or rather, your local name server tries to send the query there and keeps retransmitting until it times out.

You can avoid this problem by making sure the first domain name the resolver looks up is the right one. Instead of typing:

% telnet selma.corp.acme.com

it's better to type:

% telnet selma

or:

% telnet selma.corp.acme.com.

(Note the trailing dot.) These result in a lookup of selma.corp.acme.com first.

BIND 4.9 and later resolvers don't have this problem, at least not by default. 4.9 and newer resolvers check the domain name as-is first, as long as the name has at least one dot in it. So, if you type:

% telnet selma.corp.acme.com

even without the trailing dot, the first name looked up is selma.corp.acme.com.

If you are stuck running a 4.8.3 BIND or older resolver, you can avoid querying off-site name servers by taking advantage of the configurable search list. You can use the search directive to define a search list that doesn't include your parent zone's domain name. For example, to work around the problem corp.acme.com is having, you could temporarily set your hosts' search lists to just:

search corp.acme.com

Now, when a user types:

% telnet selma.corp.acme.com

the resolver looks up selma.corp.acme.com.corp.acme.com first (which the local name server can answer), then selma.corp.acme.com, the correct domain name. And this works fine, too:

% telnet selma
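
To make the lookup order concrete, here is a minimal Python sketch of the behavior described in this section. It is a model for illustration, not the resolver's actual code: the function name candidates and the as_is_first flag are our own inventions, and the handling of names without any dots is a simplification the text above doesn't cover.

```python
def candidates(name, search_list, as_is_first):
    """Return the domain names a resolver would try, in order.

    as_is_first=False models the older (4.8.3-style) behavior described
    above: search list entries are appended first, the name as typed last.
    as_is_first=True models the 4.9+ default for names containing a dot.
    """
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified:
        # the search list is not applied at all.
        return [name.rstrip(".")]

    searched = [f"{name}.{domain}" for domain in search_list]

    if as_is_first and "." in name:
        return [name] + searched

    # Simplification: a name tried after the search list is listed last,
    # including dotless names, which the text above does not discuss.
    return searched + [name]


if __name__ == "__main__":
    default_list = ["corp.acme.com", "acme.com"]  # default search list
    workaround = ["corp.acme.com"]                # temporary search list

    for label, args in [
        ("old resolver, default list", ("selma.corp.acme.com", default_list, False)),
        ("old resolver, trailing dot", ("selma.corp.acme.com.", default_list, False)),
        ("old resolver, workaround", ("selma.corp.acme.com", workaround, False)),
        ("4.9+ resolver, default list", ("selma.corp.acme.com", default_list, True)),
    ]:
        print(label, "->", candidates(*args))
```

With the default search list, the second name the old resolver tries falls in the acme.com zone, which is exactly the query that stalls during the outage. With the shortened search list, every name tried stays inside corp.acme.com.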

8.6.2 Longer Outages (Days)

If you lose network connectivity for a long time, your name servers may have other problems. If they lose connectivity to the root name servers for an extended period, they'll stop resolving queries outside their authoritative zone data. If the slaves can't reach their master, sooner or later they'll expire the zone.

In case your name service really goes haywire because of the connectivity loss, it's a good idea to keep a site-wide or workgroup /etc/hosts around. In times of dire need, you can move resolv.conf to resolv.bak, kill the local name server (if there is one), and just use /etc/hosts. It's not flashy, but it'll get you by.

As for slaves, you can reconfigure a slave that can't reach its master to temporarily run as a primary master. Just edit named.conf and change the type substatement in the zone statement from slave to master, then delete the masters substatement. If more than one slave for the same zone is cut off, you can configure one as a primary master temporarily and reconfigure the others to load from the temporary primary.
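
As a rough illustration of that edit, the Python sketch below applies it to a sample zone statement as a text transformation. Everything in the sample is an assumption made for the example: the master's address (192.0.2.1), the backup filename, and the helper name promote_to_master are hypothetical, and the regular expressions only cope with a simple, flat masters list. On a real name server you'd probably just make the change in an editor.

```python
import re

# Hypothetical slave zone statement; the address and filename are made up.
SLAVE_ZONE = '''zone "corp.acme.com" {
    type slave;
    masters { 192.0.2.1; };
    file "bak.corp.acme.com";
};
'''


def promote_to_master(zone_statement):
    """Rewrite a slave zone statement so the server acts as a master."""
    # Change the type substatement from slave to master.
    promoted = re.sub(r"\btype\s+slave\s*;", "type master;", zone_statement)
    # Delete the masters substatement (assumes no nested braces).
    promoted = re.sub(r"[ \t]*\bmasters\s*\{[^{}]*\}\s*;[ \t]*\n?", "", promoted)
    return promoted


if __name__ == "__main__":
    print(promote_to_master(SLAVE_ZONE))
```

The file substatement is left alone, so the temporary primary loads the backup copy of the zone it already has on disk.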

Alternatively, you can just increase the expire time in all of your slaves' backup zone data files, and then signal the slaves to reload the files.
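
To judge how much time you have, it helps to work out when a slave will actually expire its zone. The Python sketch below does the arithmetic, assuming you know when the slave last reached its master. The helper names to_seconds and expiry_deadline are ours, and the parser only understands a single number with an optional s, m, h, d, or w suffix, not combined values.

```python
from datetime import datetime, timedelta

# Seconds per unit for the suffixes used in zone data files.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def to_seconds(value):
    """Convert a value such as '1w', '3h', or '3600' to seconds."""
    value = value.strip().lower()
    if value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def expiry_deadline(last_contact, expire):
    """Return the time at which a slave would expire the zone."""
    return last_contact + timedelta(seconds=to_seconds(expire))


if __name__ == "__main__":
    # Hypothetical time of the slave's last successful contact with its master.
    last_contact = datetime(2001, 1, 1, 9, 0)
    for expire in ("1w", "4w"):
        print(expire, "->", expiry_deadline(last_contact, expire))
```

In this example, raising the expire value from one week to four buys the slave three more weeks before it stops answering for the zone.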

8.6.3 Really Long Outages (Weeks)

If an extended outage cuts you off from the Internet—say for a week or more—you may need to restore connectivity to root name servers artificially to get things working again. Every name server needs to talk to a root name server occasionally. It's a bit like therapy: the name server needs to contact a root to regain its perspective on the world.

To provide root name service during a long outage, you can set up your own root name servers, but only temporarily. Once you're reconnected to the Internet, you must shut off your temporary root servers. The most obnoxious vermin on the Internet are name servers that believe they're root name servers but don't know anything about most top-level domains. A close second is the Internet name server configured to query—and report—a false set of root name servers.

That said, and our alibis in place, here's what you have to do to configure your own root name server. First, you need to create db.root, the root zone data file. The db.root file will delegate to the highest-level zones in your isolated network. For example, if movie.edu were to be isolated from the Internet, we might create a db.root file for terminator that looked like this:

$TTL 1d
. IN SOA terminator.movie.edu. al.robocop.movie.edu. (
                 1        ; Serial
                 3h       ; Refresh
                 1h       ; Retry
                 1w       ; Expire
                 1h )     ; Negative TTL

  IN NS terminator.movie.edu. ; terminator is the temp. root

; Our root only knows about movie.edu and our two
; in-addr.arpa domains

movie.edu. IN NS terminator.movie.edu.
           IN NS wormhole.movie.edu.

249.249.192.in-addr.arpa. IN NS terminator.movie.edu.
                          IN NS wormhole.movie.edu.

253.253.192.in-addr.arpa. IN NS terminator.movie.edu.
                          IN NS wormhole.movie.edu.

terminator.movie.edu. IN A 192.249.249.3
wormhole.movie.edu.   IN A 192.249.249.1
                      IN A 192.253.253.1

Then we need to add the appropriate line to terminator's named.conf file:

// Comment out hints zone
// zone "." {
//              type hint;
//              file "db.cache";
// };

zone "." {
                type master;
                file "db.root";
};

Or, for BIND 4's named.boot file:

; cache    .   db.cache  (comment out the cache directive)
primary  .   db.root

We then update all of our name servers (except the new, temporary root) with a db.cache file that includes just the temporary root name server (it's best to move the old root hints file aside—we'll need it later, once connectivity is restored).

Here are the contents of the file db.cache:

.  99999999  IN  NS  terminator.movie.edu.

terminator.movie.edu.  99999999   IN  A  192.249.249.3

That will keep movie.edu name resolution going during the outage. Then, once Internet connectivity is restored, we can delete the new master zone statement from named.conf, uncomment the hint zone statement on terminator, and restore the original root hints files on all our other name servers.
