Re: Life after source quench


Charles Hedrick (hedrick@athos.rutgers.edu)
Tue, 17 Nov 87 22:38:02 EST


Over the weekend we found a bug in bind 4.7 that would lead to
uncontrolled streams of requests to name servers. Indeed the same bug
exists as far back as 4.4, but may not have quite the same drastic
effect in earlier releases. The nature of the bug is that when bind
sends a reply, it does not always have the bit turned on that declares
it to be a response. To see why this has that effect, consider the
two-level server configuration supported by 4.7. There are 3 primary
name servers at Rutgers. All other name servers forward requests to
one or more of them. Suppose a random server tries to do a lookup of
a name that for some reason can't be resolved. It will forward the
request to one or more of the primary servers. They will send
requests to the appropriate servers, one of which is presumably yours.
Once this times out (presumably because all of the servers for the
domain are dead or something), the primary server will send a message
back to the original one saying there is no answer. However, if the
response bit is off, this will look like a request. The original name
server will now send off requests to each of the 3 primary servers,
and the cycle will restart. Note that one initial request has now led
to 3 new queries, one triggered by the answer from each of the primary
servers. Unfortunately, bind does not detect when it is being asked
identical questions. (It does detect duplicates, in the sense of a
query with the same query ID being issued several times. But in this
case each new query will have a different ID.) Thus we have an
explosion, which ultimately will be limited only by our name servers'
CPU and the transmission line. The servers involved are a Sun 3, a
Sun 4, and a Pyramid, and we feed into NSFnet with a T1 line. I
believe we are capable of saturating the NSFnet backbone. I think
this explains the observed results.
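
To make the failure mode concrete, here is a minimal Python sketch. It
is not bind's code. It assumes the standard DNS header layout (as in
RFC 1035), where the response flag (QR) is the top bit of the 16-bit
flags word that follows the query ID. All function names are
hypothetical, and the storm model is deliberately crude: every
mis-flagged reply counts as one new query that is forwarded to all of
the primaries.

```python
import struct

QR_RESPONSE = 0x8000  # QR flag: top bit of the 16-bit flags word


def build_header(msg_id, is_response, rcode=0):
    """Pack a 12-byte DNS header with one question and no records."""
    flags = (QR_RESPONSE if is_response else 0) | (rcode & 0x000F)
    # Fields: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT
    return struct.pack("!HHHHHH", msg_id, flags, 1, 0, 0, 0)


def looks_like_response(header):
    """Return True if a receiver would treat this message as a reply."""
    _msg_id, flags = struct.unpack("!HH", header[:4])
    return bool(flags & QR_RESPONSE)


def simulate_storm(rounds, primaries=3):
    """Count queries per round when every reply is mistaken for a query."""
    pending = 1  # one unresolvable lookup at the forwarding server
    history = []
    for _ in range(rounds):
        # Each pending query is forwarded to every primary; each primary's
        # reply comes back with QR off and is handled as a fresh query.
        pending *= primaries
        history.append(pending)
    return history


correct_reply = build_header(0x1234, is_response=True, rcode=2)
buggy_reply = build_header(0x1234, is_response=False, rcode=2)
storm = simulate_storm(5)
```

Here looks_like_response(buggy_reply) is False, so the forwarding
server handles the reply as a request. Under this model the query
count triples each round: five rounds turn one lookup into 243
queries.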

We have fixed the bug in bind, and have not seen any storms such as
this since.

There was one additional problem. The reason we were attacking your
server in the first place was because we had the wrong list of root
servers. Unlike some earlier releases, bind 4.7 keeps track of
root servers dynamically. It writes out the current state of its
cache every hour. When the system is rebooted, it starts from the
most recent cache state, not from a cold start. So if it once gets
a bad root name server, and if that server lists itself as a root
name server, this server will continue being used as a root
forever. Furthermore, if the bad server is on NSFnet, it will have
better response than the real root servers, and so will be used
preferentially. So all it takes is for your server to get listed
once as a root server. That can happen very easily. Bind's basic
algorithm is the following:
  send a request to an appropriate name server
  if there is an answer return it to the user and terminate
  put all data from the response into the cache, whether it is an answer,
        in the authority section, or additional data
  recompute the appropriate set of name servers to use
Because it puts all data from the authority and additional data sections
into the cache, all it needs is for one server to list you as a root
server in the authority section of a response.
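
The caching step above can be sketched in Python as follows. This is
an illustration only, not bind's implementation: the dictionary-based
cache, the record tuples, the function names, and the host names
(which use reserved example names and addresses) are all assumptions
made for the sketch.

```python
ROOT = "."


def cache_all_sections(cache, response):
    """Cache every record in the response, whatever section it came from."""
    for section in ("answer", "authority", "additional"):
        for name, rtype, value in response.get(section, []):
            cache.setdefault((name, rtype), set()).add(value)


def root_servers(cache):
    """Return the names currently cached as root name servers."""
    return sorted(cache.get((ROOT, "NS"), set()))


# Cache seeded with one legitimate (hypothetical) root server.
cache = {(ROOT, "NS"): {"real-root.example."}}

# A reply to some unrelated query: no answer, but the authority section
# wrongly lists the responding host as a root server.
response = {
    "answer": [],
    "authority": [(ROOT, "NS", "bogus.example.")],
    "additional": [("bogus.example.", "A", "192.0.2.1")],
}

cache_all_sections(cache, response)
roots_after = root_servers(cache)
```

After this single response, roots_after contains bogus.example.
alongside the real root. Once the cache is written out, the bad entry
would also survive a reboot.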

We regard this as a bug, and have fixed it. We no longer accept root
name server information unless it is in the answer section of the
response. Unless there is a serious bug in somebody's code, root NS
records will get into an answer section only if we ask explicitly who
the root NS records are. We will only ask that question of an
existing root server. Since we put that patch in, we have stopped
seeing random hosts showing up as root name servers.
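
A rough Python sketch of the rule just described: root NS records are
cached only when they arrive in the answer section. It is not our
actual patch. The cache layout, function name, and host names are
hypothetical.

```python
ROOT = "."


def cache_with_root_filter(cache, response):
    """Cache records, but take root NS data only from the answer section."""
    for section in ("answer", "authority", "additional"):
        for name, rtype, value in response.get(section, []):
            if (name, rtype) == (ROOT, "NS") and section != "answer":
                continue  # unsolicited root NS data: ignore it
            cache.setdefault((name, rtype), set()).add(value)


cache = {(ROOT, "NS"): {"real-root.example."}}

# A stray authority-section claim to be a root server is now dropped.
cache_with_root_filter(cache, {
    "authority": [(ROOT, "NS", "bogus.example.")],
})
roots_after_stray = sorted(cache[(ROOT, "NS")])

# An explicit "who are the root servers?" query, sent to an existing
# root, returns root NS records in the answer section; these are kept.
cache_with_root_filter(cache, {
    "answer": [(ROOT, "NS", "other-root.example.")],
})
roots_after_answer = sorted(cache[(ROOT, "NS")])
```

The filter only governs which section root data may come from. Sending
the explicit root query to a known root server is a separate step that
this sketch does not model.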

While we have fixed these bugs, there are presumably many other sites
out there that still have them. Both problems are present in all
versions of bind that we know of. I have posted these problems to the
mailing list of 4.7 beta-testers. I have an annotated listing of all
of our patches for anyone that wants to put them into their own copy
of 4.7, or indeed an earlier release.

Please tell me if you see any other misbehaviors caused by our name
servers.


