Index Home About Blog
From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: better leve triggered IRQ management needed
Date: Mon, 24 Apr 2006 19:03:22 UTC
Message-ID: <fa.YZQD1wqc8EAkcaIZfnPxACWIOIo@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604241156340.3701@g5.osdl.org>

On Mon, 24 Apr 2006, Stephen Hemminger wrote:
>
> We should fail request_irq() if the SA_SHIRQ but the irq is edge-triggered.

That would be HORRIBLE.

Edge-triggered works perfectly fine for SA_SHIRQ, as long as there is just
one user and the driver is properly written. Making request_irq() fail
would break existing and working setups.

If you have a driver that requires level-triggered interrupts, then your
driver is arguably buggy. NAPI or no NAPI, doesn't matter. Edge-triggered
interrupts is a fact of life, and deciding that you don't like them is not
an excuse for saying "they should not work".

You can get an edge by having your driver make sure that it clears the
interrupt source at some point where it requires an edge.

And yes, that may mean that when you're ready to start taking interrupts
again, you are required to first read all pending packets, instead of just
assuming that a level-triggered interrupt will "just happen", but that's
the harsh reality for writing a driver that actually WORKS.

For a driver writer, there is one rule above _all_ other rules:

	"Reality sucks, deal with it"

That rule is inviolate, and no amount of "I wish", and "it _should_ work
this way" or "..but the documentation says" matters at all.

If you can't take that rule, don't write drivers, and don't design
infrastructure for them.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: better leve triggered IRQ management needed
Date: Mon, 24 Apr 2006 19:08:59 UTC
Message-ID: <fa.bA1L2QxVD4OIoF52xl2J6ofVhMs@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604241203130.3701@g5.osdl.org>

On Mon, 24 Apr 2006, Linus Torvalds wrote:
>
> You can get an edge by having your driver make sure that it clears the
> interrupt source at some point where it requires an edge.

Btw, this is why we do end up saying that having _two_ devices share
an edge-triggered setup really is something we cannot necessarily
fix. That said, it is better to limp along and work as well as you can
than to just throw up your hands.

So even then, we should at least give the user the _chance_ of being able
to log in and give a bug-report, rather than "oops, the harddisk won't
work, because the BIOS sets it up to share an edge-triggered interrupt
with the network".

However, I'm all for printing out a honking huge warning if we have two
devices sharing the same edge-triggered interrupt. But a single device
should work, or the driver should be considered broken.

[ Btw, the "disable_irq()/enable_irq()" subsystem has been written so that
  when you disable an edge-triggered interrupt, and the edge happens while
  the interrupt is disabled, we will re-play the interrupt at enable time.
  Exactly so that drivers can have an easier time and don't have to
  normally worry about whether something is edge or level-triggered.

  However, if you're within an interrupt, that doesn't mean that you can
  just disable the irq and hope that it acts as if it was level-triggered
  when you enable it again. ]

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: better leve triggered IRQ management needed
Date: Mon, 24 Apr 2006 21:07:41 UTC
Message-ID: <fa.ZuiarLm+2wei48KTnEtQSASlHNk@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604241354200.3701@g5.osdl.org>

On Mon, 24 Apr 2006, Arjan van de Ven wrote:
>
> well... the corner case (as rmk described) is full starvation; even
> polling once per second is better than not polling at tall there.,..

I can confirm that even just polling once a second - or even less often -
is a huge improvement.

A long time ago, I had a machine with a 3c509 card that would sometimes
miss receive interrupts for some reason (it may actually have been a bug
in the disable_irq/enable_irq thing on SMP, I forget - this is at least
five years ago).

The 3c509 driver had some five-second timeout thing, which meant that it
would end up polling itself every five seconds regardless, and it made the
difference between a machine that was totally undebuggable over a network,
and one that actually worked surprisingly well (ie you could ssh into it,
and things would work, but have these long pauses every once in a while,
with burst of data).

Of course, the machine would have been totally useless for real work, but
it made it _much_ easier to see what was going on when things went south.

So "limping along" when things don't work can be a huge time-saver from a
debugging standpoint. So even if it's just that every registered SA_SHIRQ
would get a heartbeat at least once every five seconds (and we'd limit it
to SA_SHIRQ exactly because a driver that doesn't have that set may get
confused if it gets extra interrupts), that might sound totally useless,
but it might actually help somebody who otherwise might just make a pretty
useless "the machine hung" bug-report.

The fake interrupt could even print out a warning if somebody returns
SA_HANDLED (since normally there _shouldn't_ have been any work to handle
for it), and if that means that for somebody, things go from "the machine
hung" to "the machine got very slow, and printed out 'fake interrupt for
ide0 returned SA_HANDLED!'", that would potentially be a big debug aid.

We've had our ass saved quite a few times now by the irq storm detector
("irq X: nobody cared" and friends), which has helped debug irqs that
haven't been set up properly, that I'm convinced things like this might
well make a huge deal.

Of course, "things like this" does not necessarily cover the above
scenario. Maybe that is totally useless ;)

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: better leve triggered IRQ management needed
Date: Mon, 24 Apr 2006 21:09:58 UTC
Message-ID: <fa.XJyXlefS49jQ9O3ZvokD9Ls5IWw@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604241407130.3701@g5.osdl.org>

On Mon, 24 Apr 2006, linux-os (Dick Johnson) wrote:
> On Mon, 24 Apr 2006, Linus Torvalds wrote:
> >
> > SA_SHIRQ does NOT mean that the irq is shared.
> >
> > It means that it's not exclusive, and that the driver is _ok_ with it
> > being shared if that makes sense.
>
> Yeah. You have been talking to too many lawyers! You are getting a
> forked tongue!

No, it's just legacy from some _really_ really old code. As in 1991.

The very original Linux irq system didn't share interrupts at all (hey,
PCI was newfangled, and ISA interrupts ruled), so when interrupt sharing
was added, the default was to not do it.

These days, that doesn't make any sense, and if somebody did the flags
today, you'd do it the other way around (default to shared, and if
somebody wants a really exclusive interrupt, they should say so with
SA_EXCLUSIVE or something).

But Linux grew from humble and stupid roots.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: better leve triggered IRQ management needed
Date: Mon, 24 Apr 2006 22:27:06 UTC
Message-ID: <fa.vrpIbNNgezR09iyLbMGCqn9jOsc@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604241521210.3701@g5.osdl.org>

On Mon, 24 Apr 2006, Alan Cox wrote:
>
> > The fake interrupt could even print out a warning if somebody returns
> > SA_HANDLED (since normally there _shouldn't_ have been any work to handle
> > for it), and if that means that for somebody, things go from "the machine
> > hung" to "the machine got very slow, and printed out 'fake interrupt for
> > ide0 returned SA_HANDLED!'", that would potentially be a big debug aid.
>
> There are high rate IRQ sources that would trigger that erratically due to
> races but it could be useful in some kind of "linux irqdebug" mode

I was thinking that an interrupt actually happening on that irq would set
the "already done" flag, so that if it's a high-rate irq, then we'd not
inject any new fake ones.

So the algorithm would be something like "clear the 'already done' flag
every five seconds, and sending the fake one if it was already cleared",
with normal interrupts always setting the flag.

And yes, you could still hit it in a blue moon (nothing happened for five
seconds, and then it happens _just_ as we send a fake event), but if the
only thing we do is do a printk() on it, no big deal if you get a false
positive every five years.

(The bigger problem is that some drivers just return IRQ_HANDLED whether
they had work to do or not, because people - including me - were lazy in
some of the conversion of the irq_handler_t stuff).

I dunno. It _sounds_ simple enough..

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC 1/2] irq: record edge-level setting
Date: Mon, 24 Apr 2006 22:34:50 UTC
Message-ID: <fa.6EFZadlvFIiegTcGZVNcmgZJ0MY@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604241527060.3701@g5.osdl.org>

On Mon, 24 Apr 2006, Stephen Hemminger wrote:
>
> Maybe that's why it never was done in the past, too much work and historical
> baggage.

It's messy. That whole ELCR register was mis-designed: you can change the
edge/level detection with it, but since it _also_ changes the polarity of
the signal, you can't actually do so from a sw angle, and it has to match
the hardware. So you can't say "I want to treat this interrupt as level
triggered", and just set the bit ;^/

To make matters worse, I wouldn't be in the least surprised if the ELCR
register is totally ignored by many south-bridges for the internally
generated interrupts (ie devices that are embedded in the SB), since the
register really doesn't matter for them.

And it doesn't help that Intel mis-designed the edge-detection logic on
the IO-APIC. On the old i8259, if you masked an interrupt and unmasked it,
an active interrupt would always be seen as an edge, because the
edge-detection was done _after_ masking. On the IO-APIC crap, the masking
is done after edge-detection, so if you mask the APIC hardware level, and
an edge happens, you'll never ever learn of it ever again.

I'm sure other system architectures have similar problems, but it's
irritating.

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: better leve triggered IRQ management needed
Date: Sun, 30 Apr 2006 05:20:50 UTC
Message-ID: <fa.z3ZVVRxN05ZWGEgbepCc+PtC3gk@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604292204270.4616@g5.osdl.org>

On Sun, 30 Apr 2006, Neil Brown wrote:
>
> So what do you propose should be done to better handle such poorly
> built machines?

Well, the thing is, there's not a lot we _can_ do.

We can try to report it. We can also try to handle it as gracefully as we
can.

> As a concrete example I have a notebook which definitely assigns
> shared interrupts to IRQ-10 (See /proc/interrupts below) yet the ELCR
> only flags IRQ-11 as being level triggered and the rest are edge
> triggered.

Also, do you have the option to enable the IO-APIC? Maybe it's already
enabled, and your BIOS has just disabled it, but your /proc/interrupts
implies that you may have compiled your kernel without UP_APIC support.

With the APIC, we might be able to do better. Worth trying out.

> And with this configuration I definitely lose interrupts to the
> wireless ethernet (ra0).
>
> How do I make this work reliably?
> I could:
>
> 1/ modify handle_IRQ_event so that it is more resilient to the
>   possibility that shared interrupts are edge triggered.  This can be
>   done be iterating over all action->handlers until they all return
>   IRQ_NONE.

Well, yes. It's worth trying, but as mentioned, we have some drivers that
return IRQ_HANDLED just because the driver conversion has been lazy. So
limit it to a few things.

And we really should have some flag that says whether the interrupt
descriptor ends up being edge, so that we could do this for edge-triggered
interrupts _only_.

Anyway, I also do wonder if your irq lossage is due to something else.

On the XT-PIC, disabling the irq will cause an edge when it's re-enabled,
so you can get the "level" behaviour by disabling the irq over the irq
handler.

And that's exactly what we do, if I recall correctly. It's been years
since I worked with that code, but looking at it quickly, it seems to
match my recollection.

> 2/ Arrange that the ELCR bit is set for any IRQ for which a shared
>   interrupt is registered (on the basis that the code for handling
>   shared interrupts is not resilient against them being edge triggered).

NO.

How many times do I have to say this?

Yes, ELCR sets edge vs level.

BUT IT ALSO SETS THE POLARITY.  If you switch the bit around, it will also
switch the polarity, and IT WILL NOT WORK. Because you'll end up with a
level-triggered interrupt that is level-triggered for the wrong polarity,
and will trigger whenever there is _not_ an interrupt pending.

Now, I will almost guarantee you that there is an exception to this rule
(hey, it's PC hardware, there's _always_ an exception to any rule ;), and
on some situations, the ELCR thing will truly only affect edge vs level.

But the point is, we can't just switch to level triggered. There simply is
no such hardware in general for the old PC interrupts.

(Now, _if_ you use the APIC, you can actually switch polarity and trigger
mode independently. Which is one reason why I'd like to hear whether you
perhaps have just disabled the APIC by mistake, rather than have a nasty
BIOS that disables it for you).

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: better leve triggered IRQ management needed
Date: Sun, 30 Apr 2006 06:59:49 UTC
Message-ID: <fa.WbophM0vf2KyIjBUKF76WBG4pNI@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604292343020.3690@g5.osdl.org>

On Sun, 30 Apr 2006, Neil Brown wrote:
>
> The thing is: This is exactly what I am currently doing to solve the
> problem.

I'm not entirely surprised. As mentioned, the ELCR register _originally_
selected between EISA (level) and ISA (edge) interrupts.

But EISA (and later PCI) interrupts were not just level, they were level
active _low_. While old ISA interrupts are edge-triggered, active _high_.

Which explains why it not only changes the trigger, but also the polarity.

Now, fast-forward a decade or two, and imagine that the world is 99% PCI,
and nobody really has any devices that are _electrically_ ISA any more,
but there are some legacy stuff that _looks_ like ISA. What would you do
to simplify your life from a hw perspective?

I suspect that the thing to do is to internally just say that all
interrupts are active low. There's no reason to _really_ have active high,
because there are no real devices left that drive the irq line that way.

Now, the _sane_ thing to do would be to also make all interrupts be
level-triggered, and make the whole ELCR register be a total dummy
register. But you can't really do that without being worried about
breaking compatibility (for example, the timer interrupt is a 50%
duty-cycle on/off thing, so it really _does_ end up being edge-triggered).

So you leave the ELCR register mattering for a edge/level thing, but the
polarity issue is just gone.

But then on _other_ southbridges, you'll have the old behaviour, and there
simply is no way for the OS to know. Yeah, we could look at the
nortbridge and southbridge combination, and perhaps know that some of them
always have a "active low" polarity regardless of ELCR. But nobody even
_documents_ these things, exactly because it's not supposed to matter.

So we're kind of screwed. We have to _act_ as if we still lived in the
middle ages, and people still used edge-triggered active-high interrupts.
Even when it's not necessarily the case any more..

Gaah.

That said, I'm surprised that the kernel doesn't set ELCR for you. If it
sees a PCI device, it really should know that it's a PCI interrupt. I
wonder if we should do the following.. (Does this automatically make it do
the right thing on your machine?)

			Linus

---
diff --git a/arch/i386/pci/irq.c b/arch/i386/pci/irq.c
index 7323544..6e3eaef 100644
--- a/arch/i386/pci/irq.c
+++ b/arch/i386/pci/irq.c
@@ -881,6 +881,7 @@ static int pcibios_lookup_irq(struct pci
 	((!(pci_probe & PCI_USE_PIRQ_MASK)) || ((1 << irq) & mask)) ) {
 		DBG(" -> got IRQ %d\n", irq);
 		msg = "Found";
+		eisa_set_level_irq(newirq);
 	} else if (newirq && r->set && (dev->class >> 8) != PCI_CLASS_DISPLAY_VGA) {
 		DBG(" -> assigning IRQ %d", newirq);
 		if (r->set(pirq_router_dev, dev, pirq, newirq)) {


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: better leve triggered IRQ management needed
Date: Tue, 02 May 2006 15:06:09 UTC
Message-ID: <fa.EYdeRkDusHDVqHD0ftG85wm5J08@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605020804450.4086@g5.osdl.org>

On Tue, 2 May 2006, Neil Brown wrote:
>
> Maybe the eisa_set_level_irq should be passed 'irq' rather than 'newirq'
> ??

Yeah, stupid cut-and-paste error (the eisa_set_level_irq() call _is_
already there in the PCI irq setting, for the case where we actually have
to set up routing that didn't exist before).

That's also why I'm a bit nervous even about my stupid one-liner patch: if
the irq routing is already set up, and we just use the irq we're told to
use, I'm not sure we should touch ELCR even if it "looks wrong". It
obviously works on your machine, but I wonder what could break on others..

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: pcmcia oops on 2.6.17-rc[12]
Date: Mon, 15 May 2006 22:03:27 UTC
Message-ID: <fa.q1L3cfGLWFUVoGsga6VpINn+BsE@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605151459140.3866@g5.osdl.org>

On Mon, 15 May 2006, Alan Cox wrote:

> On Llu, 2006-05-08 at 17:34 +0100, Russell King wrote:
> > > So 8250 is requesting an IRQ for non-sharing mode and it's actually
> > > failing, because something else is already using that IRQ.  The difference
> > > is that the kernel now generates a warning when this happens.
> >
> > Maybe someone is clearing the UPF_SHARE_IRQ flag?  Which port is this?
>
> Its a bug in the PCMCIA code. Its the one I hit with the IDE code.
> Asking for a private IRQ is not always honoured.

Note that some PCMCIA architectures simply _will_not_ give you a private
IRQ. Ever. They may not have any ISA interrupts to give, even to old
16-bit cards. So the choice may be "shared irq or nothing".

So I would strongly argue that any driver that depends on getting an
exclusive IRQ is buggy, not the PCMCIA layer itself, and that it would be
a lot more productive to try to fix those drivers.

Especially since exclusive irq's are clearly a dying breed, and have been
for the last decade or two. Why try to keep that braindamage alive?

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: pcmcia oops on 2.6.17-rc[12]
Date: Mon, 15 May 2006 23:38:48 UTC
Message-ID: <fa.xfg++gRp0xcs2chxuU1ryT72gEI@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605151629350.3866@g5.osdl.org>

On Tue, 16 May 2006, Alan Cox wrote:
>
> It would certainly be a lot cleaner than this sort of code in the pcmcia
> core right now. Want me to send a patch which only allows for SA_SHIRQ
> and WARN_ON()'s for any driver not asking for shared IRQ ?

I think it's too late for that in the current series, but yes, we could do
it for 2.6.18.

There are a few valid reasons for not using SA_SHIRQ, but they tend to be
really special. Ie you'd better _know_ that you're either some system
device, or you're physically in an ISA slot, not just based on some old
ISA design. And PCMCIA is no longer an excuse, exactly because of some
systems that will route even the 16-bit interrupts through a PCI irq.

So anything in arch/ is likely ok (for example, the vm86 interrupt
handling on x86 had _better_ be an exclusive interrupt ;)

Doing a quick grep shows a surprising amount of drivers that pass in zero,
but I guess that they truly _are_ old ISA sound/networking drivers.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: pcmcia oops on 2.6.17-rc[12]
Date: Mon, 15 May 2006 23:53:12 UTC
Message-ID: <fa.7TZtLyFLsom/IkHCdkFzBFD5iZE@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605151644090.3866@g5.osdl.org>

On Mon, 15 May 2006, Linus Torvalds wrote:
>
> On Tue, 16 May 2006, Alan Cox wrote:
> >
> > It would certainly be a lot cleaner than this sort of code in the
> > pcmcia core right now. Want me to send a patch which only allows for
> > SA_SHIRQ and WARN_ON()'s for any driver not asking for shared IRQ ?
>
> I think it's too late for that in the current series, but yes, we could do
> it for 2.6.18.

Actually, thinking about it some more, what we _should_ do is to just make
any code that simply doesn't _work_ with shared interrupts have to use
SA_EXCLUSIVE.

And then, for a while, warn if we see a "request_irq()" call that has
neither SA_SHIRQ nor SA_EXCLUSIVE set. But this would be a pretty mild
warning, along the lines of

 "%s getting non-SHIRQ irq. It _may_ or may not be shared in the future"

and eventually just stop caring, and making SA_SHIRQ have no meaning. It's
kind of pointless to ask all drivers to use SA_SHIRQ, when it really
should be the default.

Even the ISA drivers that currently do not have SA_SHIRQ won't generally
_break_ when/if they were to get a shared interrupt. The reason they don't
have SA_SHIRQ isn't generally that they really really want an exclusive
interrupt, but simply because they never had a reason to say SA_SHIRQ.

So a lack of SA_SHIRQ doesn't generally mean "I _need_ an exclusive
interrupt". It could equally well mean "I never bothered to even think
about whether I could share interrupts or not".

Making it an error to pass in 0 would be silly, because for old ISA
devices, most of the time they really just don't care.

What think you?

The SA_EXCLUSIVE flag is still needed, but it would be used by things that
literally cannot handle anything but an exclusive interrupt (eg vm86 mode
irq access), exactly because they do things like disable that particular
irq for long times, or because they don't have a status register to read
(eg, on x86, the timer interrupt really _is_ exclusive for that reason).

Most ISA drivers probably wouldn't care at all one way or the other, I
suspect. And it would be a waste of time to add SA_SHIRQ to them for no
actual real gain.

Comments?

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFT] major libata update
Date: Wed, 17 May 2006 23:50:41 UTC
Message-ID: <fa.85AOIRqH1a8HegAt5HOgt3ksk6w@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605171643510.10823@g5.osdl.org>

On Wed, 17 May 2006, Linus Torvalds wrote:
>
> I think Neil already reported that it fixed a lost interrupt problem for
> him, but I'm worried that it might result in interrupt storms for others.

Of course, we could just decide to try to switch from level to
edge-triggered if the irq storm thing ever triggers. Right now we disable
the irq entirely, which means that _if_ it was just due to a polarity
error, we're screwed even if it should have been easy to fix by just
turning it into edge-high.

The code to do that should be trivial: make __report_bad_irq() try to
switch to edge mode if it's not edge already. Hmm?

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: pcmcia oops on 2.6.17-rc[12]
Date: Mon, 22 May 2006 15:07:42 UTC
Message-ID: <fa.5ZlztUW5ZHp65oN1eI+7kdtrSlk@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605220751380.3697@g5.osdl.org>

On Mon, 22 May 2006, Rogier Wolff wrote:
>
> The question I'm stuck with is: When is it valid to ask for a non-shared
> IRQ, and get back a shared one.
>
> Drivers that know that they don't work well if they are called by the
> "other" interrupt?

No.

For example, on certain 16-bit PCMCIA setups, the PCMCIA controller may
have just one interrupt. It may even have that interrupt exclusively, but
the point is, it has _one_. One interrupt shared for both doing not just
card interrupts, but also for PCMCIA CSC interrupts.

In that situation, once the card has been inserted (and powerup etc has
happened), the only interrupts you'll get is actually the interrupts for
the card. So everything is fine.

BUT A PCMCIA DRIVER STILL MUST NOT ASK FOR A NON-SHARED IRQ.

Because the irq will still be registered by the PCMCIA layer, and the
PCMCIA layer will check whether the interrupt was due to a CSC when the
card was removed, for example.

So there's basically never any valid reason to ask for a nonshared irq.

> I happen to know (ISA) hardware that CANNOT share an interrupt

Not necessarily true, since ISA cards have been known to be able to share
an interrupt with the proper pull-down resistors. It was even common for
serial cards.

Perhaps more importantly, not relevant for PCMCIA.  There is no PCMCIA
hardware that cannot share an interrupt, for reasons outlines above.

There might be really bad hardware that doesn't even have an interrupt
status register so you can't tell if an interrupt happened from that card
or not, making it hard to write a driver that can handle "spurious"
interrupts (in the case of real sharing), but that sounds pretty damn
unlikely.

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: Re: Re: 2.6.19-rc5: known regressions :SMP kernel can not generate
Date: Mon, 13 Nov 2006 16:07:44 UTC
Message-ID: <fa.yOnBnxuPQr3qWGVAwBPmigQC/uE@ifi.uio.no>

On Fri, 10 Nov 2006, Komuro wrote:
>
> I tried the 2.6.19-rc5,  the problem still happens.

Ok, that's good data, and especially:

> But,
> I remove the disable_irq_nosync() , enable_irq()
> from the linux/drivers/net/pcmcia/axnet_cs.c
> the interrupt is generated properly.

All RIGHT. That's a very good clue. The major difference between PCI and
ISA irq's is that they have different trigger types (they also have
different polarity, but that tends to be just a small detail). In
particular, ISA IRQ's are edge-triggered, and PCI IRQ's are level-
triggered.

Now, edge-triggered interrupts are a _lot_ harder to mask, because the
Intel APIC is an unbelievable piece of sh*t, and has the edge-detect logic
_before_ the mask logic, so if a edge happens _while_ the device is
masked, you'll never ever see the edge ever again (unmasking will not
cause a new edge, so you simply lost the interrupt).

So when you "mask" an edge-triggered IRQ, you can't really mask it at all,
because if you did that, you'd lose it forever if the IRQ comes in while
you masked it. Instead, we're supposed to leave it active, and set a flag,
and IF the IRQ comes in, we just remember it, and mask it at that point
instead, and then on unmasking, we have to replay it by sending a
self-IPI.

Maybe that part got broken by some of the IRQ changes by Eric.

Eric, can you please double-check this all? I suspect you disable
edge-triggered interrupts when moving them, or something, and maybe you
didn't realize that if you disable them on the IO-APIC level, they can be
gone forever.

[ Note: this is true EVEN IF we are in the interrupt handler right then -
  if we get another edge while in the interrupt handler, the interrupt
  will normally be _delayed_ until we've ACK'ed it, but if we have
  _masked_ it, it will simply be lost entirely. So a simple "mask"
  operation is always incorrect for edge-triggered interrupts.

  One option might be to do a simple mask, and on unmask, turn the edge
  trigger into a level trigger at the same time. Then, the first time you
  get the interrupt, you turn it back into an edge trigger _before_ you
  call the interrupt handlers. That might actually be simpler than doing
  the "irq replay" dance with self-IPI, because we can't actually just
  fake the IRQ handling - when enable_irq() is called, irq's are normally
  disabled on the CPU, so we can't just call the irq handler at that
  point: we really do need to "replay" the dang thing.

  Did I mention that the Intel APIC's are a piece of cr*p already? ]

> So I think enable_irq does not enable the irq.

It probably does enable it (that's the easy part), but see above: if any
of the support structure for the APIC crapola is subtly broken, we'll have
lost the IRQ anyway.

(Many other IRQ controllers get this right: the "old and broken" Intel
i8259 interrupt controller was a much better IRQ controller than the APIC
in this regard, because it simply had the edge-detect logic after the
masking logic, so if you unmasked an active interrupt that had been
masked, you would always see it as an edge, and the i8259 controller needs
none of the subtle code at _all_. It just works.)

Anyway, if you _can_ bisect the exact point where this started happening,
that would be good. But I would not be surprised in the least if this is
all introduced by Eric Biedermans dynamic IRQ handling.

Eric?

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] irq: do not mask interrupts by default
Date: Tue, 14 Nov 2006 16:14:41 UTC
Message-ID: <fa.XoLlQ3LeOHarLlAhCQIyDAp+lJ8@ifi.uio.no>

On Tue, 14 Nov 2006, Ingo Molnar wrote:
>
> 1) disable_irq() is relatively rare (used in about 10% of drivers, but
> there it's overwhelmingly used in some slowpath) so it's performance
> uncritical.

Well, the thing is, the _replay_ if it does happen, is going to be really
really slow compared to the masking. So at that point, it may well be a
net performance downside if the masking is going to almost always have an
interrupt happen while the thing is masked. I dunno.

There's another thing too:

For level-triggered interrupts, I _really_ don't think we should do this.
The code inside the masked region is sometimes "setup code", which will do
things that _will_ raise an interrupt, but may read the status register or
whatever to then unraise it. So in that case, your patch will generate
different behaviour, something that I really don't want to introduce at
this point in the 2.6.19 series.

> 2) missing an IRQ while the line is masked is often a lethal regression
> to the user. An IRQ could be missed even if we think that the IRQ line
> is 'level-triggered'.

If it's level-triggered, it's going to be missed only if it's de-asserted
by code inside the masked region, and that is what we have always done on
purpose, so "missing" it is the right thing to do. It's what we have
tested all level-triggered interrupts with for the last 15+ years, and
it's been part of the semantics for masking.

So I absolutely do _not_ think your change is improved semantics. It's new
semantics, and illogical. If the driver masked the irq line, did some
testing that raises and clears it again ("let's check if this version of
the chip raises the interrupt when we do XYZZY"), then the logical thing
to do would be to not cause the interrupt to happen.

Of course, for edge-triggered APIC interrupts, we _have_ to replay the irq
(since we don't have any way of even *knowing* whether we might get it
again), but for level-triggered and for the old legacy i8259 controller
that gets it right for edges anwyay, we should _not_ send the spurious
interrupt that is no longer active.

And a lot of code has been tested with either just the i8259 (old machines
without any APIC) or with PCI-only devices (which are always level-
triggered), so the fact that edge-triggered things have always seen the
potential for spurious interrupts is not a reason to say "well, they have
to handle it anyway". True PCI drivers generally do _not_ have to handle
the crazy case, and generally have never seen it.

> so my patch changes the default irq-disable logic of /all/ controllers
> to "delayed disable". (IRQ chips can still override this by providing a
> different chip->disable method that just clones their ->mask method, if
> it is absolutely sure that no IRQs can be lost while masked)

I really think we should do this just for APIC edge triggered interrupts,
ie keep the old behaviour.

Also, I worry a bit about the patch:

> @@ -272,8 +268,11 @@ handle_simple_irq(unsigned int irq, stru
>  	kstat_cpu(cpu).irqs[irq]++;
>
>  	action = desc->action;
> -	if (unlikely(!action || (desc->status & IRQ_DISABLED)))
> +	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
> +		if (desc->chip->mask)
> +			desc->chip->mask(irq);
>  		goto out_unlock;
> +	}

The simple-irq case too? That's not even going to replay the thing? So now
you just mask (without replaying) simple irqs, but then the other irqs you
mask and replay.. See above on why I don't think this is necessarily a bug
(since masking is almost always the right thing _anyway_), but now it will
*STILL* depend on some internal implementation decision on whether the
replay happens at all. I'd much rather have the replay decision be based
on hard physical data: we replay _only_ for edge-triggered interrupts, and
_only_ for controllers that need it.

In other words, I think we should just make APIC-edge have the "please
delay masking and replay" bit, and nobody else.

Can you send that patch (for both x86 and x86-64), and we can ask Komuro
to test it. That would be the "same behaviour as we've always had" thing,
which I think is also the _right_ behaviour.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Use delayed disable mode of ioapic edge triggered
Date: Wed, 15 Nov 2006 01:22:09 UTC
Message-ID: <fa.z1RGOF5Ofaq4dWyNf4gQCbxcCMc@ifi.uio.no>

On Tue, 14 Nov 2006, Eric W. Biederman wrote:
>
> Hopefully this is the trivial patch that solves the problem.

Ok, having looked more at this, I have to say that the whole
"IRQ_DELAYED_DISABLE" thing seems very fragile indeed.

It looks like we should do it not only for APIC edge-triggered interrupts,
but for HT and MSI interrupts too, as far as I can tell (at least they
also use the "handle_edge_irq" routine)

So I'm wondering how many other cases there are that are missing this.

In that sense, Ingo's patch was a lot safer, although I still dislike it
for all the other reasons I mentioned - it's simply wrong to re-send a
level-triggered irq.

I don't know MSI and HT interrupts well enough to tell whether they will
re-trigger on their own when we unmask them, but the point is, this
_looks_ like it might be incomplete.

I think part of the problem is a bad interface. We should simply never set
the IRQ handler on its own. It should be a field in the "irq_chip"
structure, and we should use _different_ irq chip structures for level and
edge-triggered. Then we should also add the "flags" thing there, and you
could do something like

	static struct irq_chip level_ioapic_chip = {
		..

instead of making the insane decision to use the "same" chip for all
ioapic things.

Ingo? Eric? Comments?

		Linus



Index Home About Blog