Xerox experience with broken mailer behaviour


JLarson.pa@Xerox.COM
Sun, 14 Feb 88 16:18:38 PST


We at Xerox were badly burned recently by sites which:

1) Do not use the domain name system to look up IP addresses

2) Pick the "best" address (usually net 10) from the host table,
    and never try any other address in the list.

Our story should explain why I now consider 2) to be severely broken mailer
behaviour. We installed an IP gateway at our PSN where the Xerox.Com mail
gateway used to be, so the address for Xerox.Com changed from a net 10 address
to a net 13 subnet address. The Xerox.Com domain name server

Received: from SUMEX-AIM.Stanford.EDU by SRI-NIC.ARPA with TCP; Sun 14 Feb 88 18:52:07-PST
Received: from PANDA.PANDA.COM by SUMEX-AIM.Stanford.EDU with Cafard; Sun, 14 Feb 88 18:47:49 PST
Date: Sun, 14 Feb 88 18:18:09 PST
From: Mark Crispin <MRC@PANDA.PANDA.COM>
Subject: trying multiple addresses
To: Header-People@MC.LCS.MIT.EDU
cc: TCP-IP@SRI-NIC.ARPA
Postal-Address: 1802 Hackett Ave.; Mountain View, CA 94043-4431
Phone: +1 (415) 968-1052
Message-ID: <12374794814.7.MRC@PANDA.PANDA.COM>

     I am sure the advocates of trying multiple mail addresses would
feel quite differently if they had to pay per-packet charges for network
access. Historically, only a small percentage of network connection
failures -- typically less than 1% -- have been due to a dysfunctional
IP address. The remaining (= overwhelming majority of) failures have
been due to dysfunctional networks, dysfunctional hosts, or dysfunctional
servers.

     It is possible that trying a different IP address may help in the
dysfunctional network case, although typically the "non-best" IP addresses
all involve the dysfunctional network in some way (look at some network
topology maps some time). This is a relatively rare case anyway.

     Many times, the "non-best" IP address is substantially inferior to
the point where it should not be used under ANY circumstance. No site
outside of Stanford should *ever* use SAIL's, Score's, or SUMEX-AIM's
net 36 IP address; the gateway between net 10 and net 36 (as well as the
net 36 subnet from that gateway) is seriously overloaded.

     If I understand JLarson.pa correctly, he's saying that Xerox.COM
will use SUMEX-AIM's net 36 address just because they couldn't connect
to the net 10 address the last time. If this is common behavior it's
no wonder those of us who must use the net 10/36 gateway find it so
unusable. Will I have to instruct the servers on multi-homed net 10/36
hosts to refuse connections on net 36 from non-net 36 hosts to get them
to stop?

     What about those guys multi-homed on a "free" and a pay-per-packet
X.25 net? Do they appreciate this behavior?

     The *correct* solution to this problem is NOT kludgy algorithms in
the mailer. The correct solution is multi-part, and involves:
1) complete the migration from the host table to the domain system. The
   NIC simply cannot keep up with the changes in network topology (as the
   Xerox experience showed), and, frankly, it's unreasonable for us to
   expect them to.
2) domain database managers need to keep their name servers updated with
   changes to network topology. TTL's should not be allowed to be so long
   that topology changes go unnoticed by resolvers for excessive periods
   of time.
3) better support needs to exist in the domain infrastructure for "best"
   IP address selection.

     This last point is important. Presently, it is up to the local host
to decide upon a "best" IP address, based on quite incomplete information.
Many hosts (all Unix hosts?) simply pick the first IP address listed in
the NIC host table (or returned as A RR's from the domain system). TOPS-20
selects in priority order: (1) first IP address from a directly connected
net that is "preferred" (e.g. a fast LAN), (2) first IP address from a
directly connected net that is "default" (e.g. a core net such as ARPANET),
(3) first IP address from any other directly connected net, (4) first IP
address. "First IP address" means first from the address list from the
host table (or a set of A RR's from the domain system). Note that this
has nothing whatsoever to do with "net 10".
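
     The four-step priority order described above can be sketched roughly
as follows. This is only an illustration in Python, not the TOPS-20 code;
the function name, its parameters, and the sample addresses in the usage
note are hypothetical, and the classful nets are written as /8 prefixes for
convenience.

```python
from ipaddress import ip_address, ip_network


def pick_address(addresses, preferred_nets, default_nets, other_direct_nets):
    """Pick one address from `addresses` (kept in host table / A RR order).

    The three *_nets arguments are hypothetical local configuration: the
    directly connected nets marked "preferred", the ones marked "default",
    and any remaining directly connected nets.
    """
    addrs = [ip_address(a) for a in addresses]
    # Steps (1) to (3): walk the classes of directly connected nets in
    # priority order and take the first listed address that falls in one.
    for nets in (preferred_nets, default_nets, other_direct_nets):
        networks = [ip_network(n) for n in nets]
        for addr in addrs:
            if any(addr in net for net in networks):
                return str(addr)
    # Step (4): nothing is directly reachable, so use the first address.
    return str(addrs[0]) if addrs else None
```

     For example, a host whose preferred net is 36 and whose default net
is 10 would pick the net 36 address of a (made-up) target listed as
["10.0.0.5", "36.44.0.6"], while a host connected only to net 10 would
pick 10.0.0.5.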

     Almost 100% of the time, this makes the best possible choice of an
IP address. It's only in those very few cases (which come up perhaps
2 or 3 times a YEAR!!!) where an otherwise highly desirable path breaks
for a long period of time that a problem comes up. I consider it highly
objectionable to cycle through every other IP address (waiting a minute
or more for an IP retransmission timeout unless the network is courteous
enough to tell me the other guy ain't there) every time I attempt to
connect to a dead host.

     JLarson's suggestion is less objectionable, but it involves one
piece of software (the mailer) telling a completely different piece of
software (host table or domain resolver) that the IP address given it
was sick. Nobody wants to do the work to the host table software to
add such a feature. It might be doable with the domain resolver (SRA
can comment on this); it certainly wouldn't be hard for the mailer to
pass on the word to the domain resolver.

     The problem is, what does "this IP address was sick" really mean?
How does "retransmission timeout" differ from "host dead" (a type 7
1822 message) differ from "host sent a reset" (refused the connection)
differ from any of the other ways a connection failed? In which one(s)
of these do you say try another IP address, and in which one(s) do you
assume the host is really down, or really doesn't want to talk now?

     Again, what do you do about those cases when we really shouldn't
be using a particular IP address because of charging, or other
administrative issues?

     The domain system may be able to help; it was always my belief (I
remember suggesting this at the meeting when the domain system concept
was first invented) that nameservers should be allowed to tailor their
responses based on who was asking the question. A domain query should
be something like: "I am on net 128.43 seeking an SMTP server for
FOO.BAR.COM, which is the best address for me to use?" and later on "I
am on net 128.43 seeking an SMTP server for FOO.BAR.COM and I already
tried 69.105.8.3, is there any other I should try?"

     The point is that a perfectly valid answer may be "if 69.105.8.3
ain't answering, he ain't up; try again later."

     This also gives the remote organization (which presumably knows
the status of their hosts) control over the IP address selection criteria,
based upon their knowledge instead of the local host's educated guesswork.
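
     As a rough sketch of the idea above, a nameserver that tailors its
answer to the asker might look something like this. It is purely
illustrative: the domain protocol has no such query, and the policy table,
the function name, and the second address (192.0.2.25) are all made up
for the example.

```python
from ipaddress import ip_address, ip_network

# Hypothetical policy maintained by the organization that runs FOO.BAR.COM.
# Each entry maps a requester network to the addresses that requester may
# use, in order of preference.
POLICY = {
    ("FOO.BAR.COM", "SMTP"): [
        (ip_network("128.43.0.0/16"), ["69.105.8.3"]),
        (ip_network("0.0.0.0/0"), ["69.105.8.3", "192.0.2.25"]),
    ],
}


def answer(requester, name, service, already_tried=()):
    """Return the next address the requester should try, or None.

    None stands for "if that one ain't answering, he ain't up; try
    again later."
    """
    source = ip_address(requester)
    for net, addrs in POLICY.get((name.upper(), service.upper()), []):
        if source in net:
            # The first matching requester network decides the answer.
            for addr in addrs:
                if addr not in already_tried:
                    return addr
            return None
    return None
```

     With this table, a requester on net 128.43 is given 69.105.8.3 and,
once it reports having tried that address, is told to try again later;
a requester elsewhere is offered the second address instead.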

     Please, no flames. If you're going to babble on and on about how I
should break my mailer to conform to your fantasy of how the world should
work, send it to *NUL: or /dev/null or whatever you call it. Furthermore,
I'm not interested in any comments about a host table based means of IP
address selection. The systems I support do not use host tables (and, for
the record, are currently the only TOPS-20's supporting MX mailing). I
can't help but feel that if the problem of a sick "best" IP address happens
to a domain-based mailer, that the fault is that of the management of the
nameserver for that organization and not that of the mailer.

     If you have constructive observations, then let's talk. Remember
that this is not about porting arguably "better" (or "worse") ideas from a


