From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: Re: Cray to commercialize Red Storm
Date: Thu, 30 Oct 2003 21:04:49 +0100
Message-ID: <bnrqt2$4s2$1@osl016lin.hda.hydro.com>

Stephen Fuld wrote:

> "Robert Myers" <rmyers@rustuck.com> wrote in message
> news:vb21qvkr7gfsa731tcm870h2vkvabp6kt0@4ax.com...
>
> snip
>
>
>>That is one hot chip.  Even allowing for all the SERDES to be done by
>>custom circuitry, there is barely the bandwidth in the 500MHz PPC to
>>touch the bits even once (naive calculation: 0.5 Gbit/sec*64 bit width
>>processing 6*3.2Gbit/sec bit streams), so whatever "message
>>preparation" it does must not include even calculating a checksum, nor
>>even in all likelihood even touching the body of the message.  Maybe
>>that's just garden variety router design.
>
>
> The idea that a processor should be used to calculate a checksum is one born
> out of the peculiar blindness of some people with a network background who
> can't get beyond thinking that TCP/IP is "the" protocol.  In most cases, a
> reasonably designed protocol will allow the checksum to be
> calculated/checked in a modest piece of dedicated hardware as the data flows
> by on its way into or out of the chip.

I used to be totally in this camp, but there is one very good reason for
having a cpu-to-cpu checksum on any kind of transfer:

It will catch those _very_ rare but also _very_ destructive cases where
things that "cannot happen" still do, e.g. an Ethernet card that
silently drops bytes while still confirming the CRC checksum, though
only when at least one transmitter is breaking the Ethernet spec.

I've seen this happen with at least two different "server-level" network
cards, working inside boxes with some form of ECC on all buses.

I.e. all I/O was protected at all levels, but we still got silent data
corruption about once a day or so when replicating GBs of data to
multiple servers.

Today I would have used rsync on top of ssh instead.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"



From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: Re: Cray to commercialize Red Storm
Date: Fri, 31 Oct 2003 09:00:30 +0100
Message-ID: <bnt4qu$s1l$1@osl016lin.hda.hydro.com>

George William Herbert wrote:

> Terje Mathisen  <terje.mathisen@hda.hydro.com> wrote:
>
>>I used to be totally in this camp, but there is one very good reason for
>>having a cpu-to-cpu checksum on any kind of transfer:
>>
>>It will catch those _very_ rare but also _very_ destructive cases where
>>things that "cannot happen" still do, i.e. an Ethernet card that
>>silently drops bytes, while confirming the CRC checksum, but only if at
>>least one transmitter is breaking the Ethernet spec.
>
> The problem with this...
>
> You have to trust something, unless you're running something
> like a Stratus or Tandem box with lockstepped CPUs.  Normally
> that is 'I trust the CPU'.  In a SMP box, that is 'I trust the CPUs'.

[long nice description of how and why the system has to be trusted at
some point]

As I wrote in the first paragraph, I used to agree totally.

However, having been bitten twice, I still don't feel that this is
enough to require sw checksums on everything, just that the capability
of turning them on is _very_ nice to have.

In my last paragraph (which you snipped) I noted that today I do similar
transfers using rsync over ssh, which means that I instead trust the
integrity checking of the compressed+encrypted ssh channel.

BTW, we very recently got into a _very_ strange situation where a couple
of network links would crash (ending up with zero open window space in
both directions) during offsite backup operations, and we eventually
figured out that this was repeatable using any of about 4 different
files (on a given link).

Those same files would crash the link using either SMB file sharing, NFS
mounting or FTP file transfer.

From this description I immediately pointed the finger at the involved
network links, guessing that we might have located a data
pattern-related firmware bug in the I/O cards, or in the
compression/decompression modules. It turned out that compression was
disabled.

After nearly a month of finger-pointing, the link vendor finally
admitted that they had forgotten to enable clocking on the link ("we
assumed you wanted to do that on your router i/o cards"), so any
sufficiently long stretch of nulls (probably; it might have been some
other pattern) was enough to cause it to lose sync and then reset.

BTW, if the required pattern was long enough (say larger than 8 bytes),
any sufficiently good compression and/or encryption module would have
been enough to effectively mask the error "forever".

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"



From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: Cray to commercialize Red Storm
Message-ID: <boh16v$sbk$1@build.pdx.osdl.net>
Date: Fri, 07 Nov 2003 21:10:02 GMT

Bill Todd wrote:

> You may believe that that makes him as 'blind' to the obvious (to you)
> need for end-to-end validation as he seems to think you and those you
> defend are. But it doesn't make him a liar - it just makes you incompetent
> for calling him one (since while the design choices in TCP are debatable,
> the question of his honesty in this case does not appear to be).

What drugs are you people on?

Why do you continue to ignore the _undeniable_fact_ that the people who
distrusted hardware checksums were 100% _right_?

Bill, there's no "(to you)" part here to be obvious about. Anybody who
claims that TCP was silly to have an end-to-end checksum in addition to all
the low-level checksums is not only blind, he's also deaf and can't listen
to people who told him that he was wrong.

Don't get me wrong - checksums at a transport level are good (whether it be
ethernet or anything else), but they have clearly historically been
insufficient. Anybody who dismisses that really is not only blind, but is
in total denial about his blindness.

End-to-end checksums are a ReallyGoodIdea(tm). That is doubly true when they
are almost free to compute.

And the end-to-end checksums really should be computed by software at as
high a level as practically possible. In the case of normal networking,
that is the TCP layer, since requiring _all_ networked applications to do
it is not practical. But if you want to do your own checksums on your own
even higher level, go for it - without the fear of being branded an idiot.

Because quite frankly, the only _idiot_ here is the person who complains
about how other people made the right choice in the name of data integrity.
Arguing against data integrity checking is stupid in the extreme.

And yes, you want the integrity checks at multiple levels. Exactly because
different levels catch different kinds of errors:

 - transport checksums can be (and are) designed for the kinds of problems
   you see at a transport level.

   At this level, you tend to want a simple CRC or something that is cheap
   to do in hardware, and catches the issues that come up on the actual
   _wire_.

 - software checksums are good for catching the case (not at all hard to
   imagine or even uncommon) where hardware (or the drivers that drive it)
   does something wrong - DMA engine trouble, cache coherency and
   serialization bugs, and just plain hardware _errors_.

   Again, a CRC is fine, but even just a silly ones-complement checksum
   (a minimal sketch of one follows this list) is not necessarily a bad
   choice, since this level tends to catch totally different kinds of bugs
   (dropped or stale data, rather than any "wire" problems). So this
   checksum is designed to be IN ADDITION TO the transport level trying to
   do a "best effort" kind of thing.

 - high-level application checksums can be very useful to catch OS level
   corruption (hey, it happens), and malicious tampering (that also happens).

   At this level, try to go for a cryptographically secure hash, since here
   tampering or other "high-level" issues are the most likely problem. Again,
   this checksum does not make the lower-level checksums unnecessary.
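
For concreteness, here is a minimal C sketch of such a ones-complement
checksum, in the style of the Internet checksum (RFC 1071). It is an
illustration only, not any particular stack's code, but it shows how
little work the software check actually is:

#include <stddef.h>
#include <stdint.h>

/* Illustrative 16-bit ones'-complement checksum (RFC 1071 style). */
static uint16_t inet_checksum(const void *data, size_t len)
{
        const uint8_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {               /* sum 16-bit words */
                sum += (uint32_t)p[0] << 8 | p[1];
                p += 2;
                len -= 2;
        }
        if (len)                        /* odd trailing byte, zero-padded */
                sum += (uint32_t)p[0] << 8;
        while (sum >> 16)               /* fold the carries back in */
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
}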

In short: checksums are good. Arguing against them is STUPID. And it is just
_incredibly_ stupid to do so when other people in the thread point to
studies that have shown them effective in real life!

In short: don't trust the hardware. Don't trust the software. And don't even
trust the user. If your data matters to you, you not only do backups, you
have ways to _verify_ the data and their backups. At every level, not just
the lowest one.
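
At that highest level, verification can be as simple as keeping a
cryptographic digest alongside the data and its backups. A minimal
sketch, assuming OpenSSL's EVP interface is available; the function and
file handling here are illustrative only, not anything from this thread:

#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>

/* Sketch: hash a file with SHA-256 and compare against a digest that
 * was recorded when the backup was made (32 bytes, caller-supplied). */
static int verify_file(const char *path, const unsigned char *expected)
{
        unsigned char md[EVP_MAX_MD_SIZE];
        unsigned int md_len = 0;
        unsigned char buf[1 << 16];
        size_t n;
        FILE *f = fopen(path, "rb");
        EVP_MD_CTX *ctx;

        if (!f)
                return -1;
        ctx = EVP_MD_CTX_new();
        EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);
        while ((n = fread(buf, 1, sizeof buf, f)) > 0)
                EVP_DigestUpdate(ctx, buf, n);
        EVP_DigestFinal_ex(ctx, md, &md_len);
        EVP_MD_CTX_free(ctx);
        fclose(f);
        return memcmp(md, expected, md_len) == 0 ? 0 : -1;
}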

And if the data doesn't matter that much to you, just be happy that CPU's
are fast enough that the time wasted on being careful is usually not the
main problem.

                Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: Cray to commercialize Red Storm
Message-ID: <boh9jr$1md$1@build.pdx.osdl.net>
Date: Fri, 07 Nov 2003 23:30:01 GMT

Rick Jones wrote:
> Linus Torvalds <torvalds@osdl.org> wrote:
>> End-to-end checksums are a ReallyGoodIdea(tm). That is doubly true
>> when they are almost free to compute.
>
>> And the end-to-end checksums really should be computed by software
>> at as high a level as practically possible. In the case of normal
>> networking, that is the TCP layer, since requiring _all_ networked
>> applications to do it is not practical.
>
> Not being certain if you are using should in the IETF sense of should
> vs must, but do you then also think that folks doing things related to
> TCP should stop trying for zero-copy solutions?

Far be it from me to discourage people from trying to push the technology
they are interested in. Go wild.

But as a consumer of that technology, you should think twice about what it
actually means. It means trusting your data to a hardware and software
combination that you have no real reason to trust.

Quite frankly, most hardware is buggy - ask anybody who has ever written a
driver for pretty much _anything_. There are always workarounds etc.

Software tends to have bugs too. Software that drives hardware (drivers) in
my experience has _more_ bugs than normal software. It's fundamentally "one
off", and it is fundamentally limited to just one set of testing
parameters: the hardware.

Put another way: most zero-copy special TCP stuff is a hell of a lot less
tested and has gotten a lot less attention than your average "real" TCP
stack. And that's totally ignoring the fact that if you're doing your
checksums in this special stack on the other side of the system bus, your
checksums simply WILL NOT check for bus problems.

In other words, you have _fundamentally_ less reliable software, on
_fundamentally_ less reliable (more complex) hardware, with _fundamentally_
weaker checking of the results. Get the picture?

So you make your own judgements. Me, personally, I suspect that the only
place where it really makes sense is where you have either incredibly good
engineering (made-up statistic of the day: "87% of all engineers think they
are better than average"), or you have other ways to make sure it's ok (ie
higher-level verification). Or you _really_ don't care, and you know that
you don't care.

In that sense, it's a bit like ECC. A _lot_ of computers are sold without
ECC, and hey, that's a perfectly fine engineering decision. In a lot of
situations you might like ECC, but you're so eager for lower price or
higher performance that you just decide that ECC isn't a deal breaker for
you. And that's fine. Engineering is all about making trade-offs, not about
"absolute truths".

Think about the new G5 supercomputer without ECC - is that really
acceptable? Sure it is, at least in some cases - because ECC is totally
superfluous if you are willing and able to verify your results some other
way.

But does that mean that not having ECC is a good idea in general? Hell no.
It should be seen as a _weakness_, but one you may decide is acceptable.

So if you have really extreme situations, and you UNDERSTAND what it means
to not do end-to-end checksums, and you fundamentally trust your hardware
(hah!) and your software (double hah!) _or_ you are willing to take your
chances, then sure - go for it.

                        Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: Cray to commercialize Red Storm
Message-ID: <bohpbm$8gg$1@build.pdx.osdl.net>
Date: Sat, 08 Nov 2003 04:00:01 GMT

Robert Myers wrote:

> The idea that someone would take a hardware risk that is acceptably
> small at the PC level and scale it up to a Top 500 "Supercomputer"
> says more than anyone should want to know about the state of computer
> simulation today.

Well, yeah, I have got to admit that I don't know anything but the sound
bites about what they do to try to "correct" for it, and my initial
reaction was that they were crazy.

But that said, there probably _are_ problems where you want the result fast,
and timeliness really is more important than correctness.  I realize that
that makes some people very uncomfortable, but on the other hand I also
realize that a number of research people at universities are a lot more
interested in the question "how can we calculate this quickly" than in the
actual result.

If that means that they can prototype some simulation on hardware that they
can afford, and then have to actually _verify_ the results somewhere else,
then so be it. It is potentially still a very useful machine for that.

> And, if you had sat through enough scatter plot presentations of
> simulations badly correlated with experiments, you wouldn't be too
> thrilled with the idea of verifying things "some other way."

That isn't my argument. Quite frankly, I'm uncomfortable even with regular
single-user desktops that don't have ECC, because it's so damn painful to
try to debug problems that end up being due to unreliable memory.

But I do acknowledge that ECC makes for a more expensive system (and often
slows things down a bit too - the modules tend to be slower and your access
patterns are slightly more constrained), and memory errors in the absence
of ECC are unlikely enough that it _is_ usually an acceptable situation to
forego it on the desktop.

Similarly, I bet that people can do interesting things even on the G5
cluster. Is it "good practices"? No. You'd be crazy to build something that
required a high degree of confidence in the correctness and reliability of
the results that way. But if you know that, it's quite possibly an
acceptable trade-off.

To get back on track, I think TCP offload is an acceptable trade-off in
theory. I'm actually not very interested in it personally, since _I_ don't
think the trade-offs are worth it. To me, the downsides of TCP offload are
_huge_: not only do you lose end-to-end protection and are at the mercy of
the reliability of the PCI bus rather than ECC memory (and the actual TCP
offload engine itself, which is imho likely to be less reliable still), but
you also end up losing a lot of flexibility.

So I'm definitely not a fan of TCP offload. I think it sucks. But hey,
that's just my opinion, and others will have other priorities. And that's
all I'm saying - if you know what the downsides are, you can make your own
decision if it is acceptable to _you_.

                Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: Cray to commercialize Red Storm
Message-ID: <bp2roc$qct$1@build.pdx.osdl.net>
Date: Fri, 14 Nov 2003 15:20:02 GMT

Jan C. Vorbrüggen wrote:
>
>> Why do you continue to ignore the _undeniable_fact_ that the people who
>> distrusted hardware checksums were 100% _right_?
>
> I interpret this to mean "there is not one correct hardware checksum
> implementation". I betcha Rob Warnock and SGI would strongly disagree,
> as would likely others.

Note that in my explanations, I went into considerable detail about the lack
of _software_ testing too.

The real reason to trust the software checksum is not that "hardware is
crap". No. The real reason to trust the software checksum is that that one
has gotten a huge amount of testing, over literally _tens_ of generations
of hardware, and over a much wider variety of circumstances than the
alternatives.

It is also done on the hardware in the system that is by far the best
tested: the CPU and memory.

In contrast, hardware-assist opens up _both_ software and hardware to
fundamentally less robust paths. Historically, even the high end has gotten
it totally and undeniably wrong.

Just as an example: with the regular end-to-end checksum, the obvious and
natural way to do it is to just calculate it as the OS copies the buffer.
The code tends to be 20-100 lines of highly optimized assembly, and it
really only ends up having a few cases to take care of (all basically
having to do with alignment and taking care of the first/last bytes). And
this thing gets tested A LOT.
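
The idea, very roughly, in plain C rather than the hand-tuned assembly
described above (a sketch only, not any real kernel's routine):

#include <stddef.h>
#include <stdint.h>

/* Accumulate a ones'-complement sum while the data is being copied
 * anyway, so the check is nearly free. */
static uint16_t copy_and_csum(void *dst, const void *src, size_t len)
{
        uint8_t *d = dst;
        const uint8_t *s = src;
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i + 1 < len; i += 2) {
                d[i]     = s[i];
                d[i + 1] = s[i + 1];
                sum += (uint32_t)s[i] << 8 | s[i + 1];
        }
        if (i < len) {                  /* odd trailing byte */
                d[i] = s[i];
                sum += (uint32_t)s[i] << 8;
        }
        while (sum >> 16)               /* fold the carries back in */
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
}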

Yes, you then have the nastier parts (ie actually connecting up the
fragments for when you build up the packet from the parts), but they all
get tested over years and years of development, and there is a huge amount
of sharing in the software world - there aren't that many common TCP
stacks.

Now, compare that to the hardware case: even if the hardware was perfect
(and trust me, it won't be, not at SGI, not at Sun, not at Intel, certainly
not at Broadcom. And that's not even talking about the low-end crap!), you
have a couple of software paths that are error-prone as _hell_ when you try
to avoid copying and offload everything to the network card.

To take just one example, instead of having one highly optimized copy-
checksum routine, you end up having some _fundamentally_ complicated code to
do user-virtual-to-physical address lookups, making sure they have no races
with other CPU's doing page eviction, setting up the IO-DMA boundaries,
pinning the pages (and making sure they are read-only on ALL CPU's so that
nobody writes to the data) for the VM until the transfer is done, and
taking care of all the alignment problems you have etc etc.

See? That's just one obvious _software_ problem. And the fact is, doing
checksum offload JUST DOES NOT MAKE ANY SENSE if you end up copying the
data anyway. If you copy it, you checksum it. If you don't copy it, you end
up with MAJOR COMPLICATIONS.

As a result, you get strange bugs where the NIC sends out the wrong data
because it simply got the wrong page (and they could be rare - maybe they
only happen when the system is under load and the page stealer ends up
evicting a page that wasn't properly pinned down).

And this is COMPLETELY IGNORING the fact that even the high end hardware
tends to have some really flaky DMA. Issues with DMA engines having trouble
crossing a 4GB boundary because the address counters are just 32 bits.
Issues with cache coherency. Small "details" like that, which a _lot_ of
people have gotten wrong over time.
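
As an illustration of that particular pitfall, the kind of sanity check
a careful driver ends up making looks roughly like this (a hypothetical
helper, not taken from any real driver):

#include <stdbool.h>
#include <stdint.h>

/* A DMA engine whose address counter is only 32 bits wide cannot carry
 * across a 4 GiB boundary, so a transfer that straddles one has to be
 * split (or bounced through a low buffer). */
static bool dma_crosses_4gib(uint64_t bus_addr, uint64_t len)
{
        if (len == 0)
                return false;
        return (bus_addr >> 32) != ((bus_addr + len - 1) >> 32);
}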

So don't take this personally as a hardware designer. Some day, when
_everybody_ does hardware checksums, and they use standard hardware that is
not considered high-end and hasn't been considered high end for the last
twenty years (and hasn't been totally redesigned a single time in that
time), THEN hardware checksums will likely be as reliable as the software
ones. Until then, you should balance the risks with the benefits.

                Linus


From: jonathan@Pescadero.DSG.Stanford.EDU (Jonathan Stone)
Newsgroups: comp.arch
Subject: Re: Cray to commercialize Red Storm
Date: 13 Nov 2003 20:46:09 -0800
Message-ID: <bp1mmh$s5r$1@Pescadero.DSG.Stanford.EDU>

In article <SvYsb.1249$1_2.7976@eagle.america.net>,
del cecchi <dcecchi@msn.com> wrote:


>If bits were getting munged getting from memory over the PCI to the NIC,
>shouldn't other devices be equally affected?  This is very puzzling.


>> The data I have can't directly pinpoint whether those errors are
>> hardware or software. The paper details one case I believe is software
>> (the last byte of an odd-length line being something other than the
>> ASCII LF the protocol requires).  Others -- like occasionally dropping
>> 16 bits out of the middle of an IP address (so bits from the TCP
>> header magically show up in the IP header) look awfully like a
>> hardware problem, like a glitch in a FIFO dropping a 16-bit word.
>
>I presume any FIFO is in the NIC, right?  Busses don't have FIFOs, and
>main memory isn't a fifo.  Although I guess the DMA address counter
>could screw up.

Yes, it could, but PCI devices usually do 32-bit accesses.

>> But to answer your previous question: yes, the data overall says that
>> the software TCP checksum is catching errors at much higher rates than
>> anyone had previously anticipated: tens per million.  (That's even
>> after excluding specific, individual hosts which generate bad packets
>> at a far higher rate; and a specific software bug).  And the
>> tail-network coax Ethernet trace shows that a number of those errors
>> are indeed happening inside the end-hosts, between the main-memory
>> buffers and the NIC CRC engine.  Right where all those
>> emphatic terms were used, a day or so back.
>
>OK.  Now recall I haven't used any such terms.

I was recalling words like this paragraph from an earlier post, which
seem pretty emphatic (not a euphemism; emphatic) to me:

>>Now, if you agree with the above crude summary, would you please tell me
>>what it is about TCP/IP/whatever that makes it so damn tricky that it is
>>believed impossible and evil to believe that a competent, not a genius
>>but competent, team of hardware designers are not capable of receiving
>>the data, detecting the line errors, performing retries as appropriate,
>>and transferring the data to main memory.


>I have a long background
>working for a company that takes data integrity very seriously.  And I
>have worked in the past on finding pretty subtle parasitic interactions
>on IC chips that were causing occasional errors.  So I am interested in
>what is going on here.

The data I have is four different traces, taken at different points
by different researchers, all showing tens or hundreds of packets-per-million
where the CRC looks good but the IP or TCP checksum indicates damage.
You're taking that as fact, and asking, why?

Over the years, several people who've seen this data for the first
time have commented that perhaps NIC designers don't need to be as
completely correct as (say) SCSI or FC HBA designers, because they
know a higher-level checksum is very likely to catch the occasional
gross error. Whatever you think of that, my point is that once that
higher-level checksum moves into the NIC hardware, this becomes a
specious argument.


>You can repeat them?  Did I miss the first time, or do I need to go dig
>through papers.  It sounds pretty straightforward.  TCP/IP over Ethernet
>seems to cause the memory to I/O adapter interface to make undetected
>(by the I/O Subsystem) errors at a rate thousands of times higher than
>seen in other I/O adapters in the same I/O subsystem.  Is that a fair
>statement?

I should mention a couple more things before taking that as a blanket
description.  The first is the distribution of those errors across
both local and remote machines (i.e., how many of those errors occur
in intermediate switches or routers, or in the end-hosts?) The second
is the distribution of errors across different IP addresses (do the
errors point to a small set of hosts, or routers, that're just plain
broken; or is there a "background" signal)?

The data says "some of each" (more details are in the paper). But even
after you exclude certain software bugs, and certain remote IP
addresses as just plain broken, there are still a lot of errors left.

[...]

>Surely someone must have looked at it.

Nope, nobody had really looked at this before.  The nearest thing I
found in a literature search dates back to modem networks in the early
60s: experiments with feeding in large volumes of known data patterns
during idle times, and comparing what came out to those known
patterns.  (That data fed into the USAF Rome Labs study which selected
the CRC now used in Ethernet).

Vern Paxson had some inbound-only whole-packet traces from Lawrence
Berkeley Labs, where he informally reported an unexpectedly high rate
of packets with bad checksums. That's the reason I started looking
into it.

>Are there that many broken NICs around?

Broken routers *or* broken NICs: as far as we can tell, yes, there
were, three or four years ago when the traces were taken. I'd
certainly call a substantial fraction of the NICs in the dorm trace
"broken": 15% to 20% [the number varies as students moved out at the
end of the quarter].  Judging by the OIDs, those NICs were at the low
end of the price range.

Or, the NIC hardware is OK but the drivers are broken.  But if the
driver is buggy and feeds the wrong data into outboard TCP checksum
hardware, that won't be caught by TCP, whereas a software TCP checksum
likely would catch it.

It's that low end where I see the most cause for concern about data
integrity.  As I said a while back, I bought a batch of so-called
64-bit/66MHz PCI gigabit NIC cards that actually ground the PCI bus
down to 33MHz. If that happened to you, would you trust that vendor to
get TCP checksum offload right, or to put in parity so they can give
some error indication if their outboard buffers incur an SEU?


From: jonathan@Pescadero.DSG.Stanford.EDU (Jonathan Stone)
Newsgroups: comp.arch
Subject: Re: Cray to commercialize Red Storm
Date: 13 Nov 2003 18:55:58 -0800
Message-ID: <bp1g7u$r09$1@Pescadero.DSG.Stanford.EDU>

In article <bou4cb$rtu$1@news.rchland.ibm.com>,
Del  Cecchi <cecchi@us.ibm.com> wrote:


>It is not from the RJ-45 because there is some sort of CRC that checks that,
>right?  Or there could be, without being stabbed by graybeards with plastic
>forks at least.  Get a packet, check the crc.  Apparently the issue is one
>involving a bunch of packets having to be reassembled in some buffer in
>memory as a frame or a message or something.  There must be something hard
>about getting the packets from the output of the serdes, where they are
>correct (or at least 1 - 10**-12 percent of the bits are) to main memory
>where the error rate is orders of magnitude higher.

Or errors in getting the bits from a main-memory buffer, to the NIC
device and its CRC engine, which is fundamentally what the trace in
the coax-Ethernet segment is seeing. (Well, that, plus damage done
by remote hosts or in remote routers).

The data I have can't directly pinpoint whether those errors are
hardware or software. The paper details one case I believe is software
(the last byte of an odd-length line being something other than the
ASCII LF the protocol requires).  Others -- like occasionally dropping
16 bits out of the middle of an IP address (so bits from the TCP
header magically show up in the IP header) look awfully like a
hardware problem, like a glitch in a FIFO dropping a 16-bit word.

But to answer your previous question: yes, the data overall says that
the software TCP checksum is catching errors at much higher rates than
anyone had previously anticipated: tens per million.  (That's even
after excluding specific, individual hosts which generate bad packets
at a far higher rate; and a specific software bug).  And the
tail-network coax Ethernet trace shows that a number of those errors
are indeed happening inside the end-hosts, between the main-memory
buffers and the NIC CRC engine.  Right where all those
emphatic terms were used, a day or so back.

Asking for explanations *why* this happens puts me in a bind.  I can
repeat the inferences I've drawn, which the networking community
(which includes people with pretty solid hardware backgrounds) was
prepared to buy. But I don't want to be accused of handwaving and
bullshit again.

