From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: better leve triggered IRQ management needed Date: Mon, 24 Apr 2006 19:03:22 UTC Message-ID: <fa.YZQD1wqc8EAkcaIZfnPxACWIOIo@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0604241156340.3701@g5.osdl.org> On Mon, 24 Apr 2006, Stephen Hemminger wrote: > > We should fail request_irq() if the SA_SHIRQ but the irq is edge-triggered. That would be HORRIBLE. Edge-triggered works perfectly fine for SA_SHIRQ, as long as there is just one user and the driver is properly written. Making request_irq() fail would break existing and working setups. If you have a driver that requires level-triggered interrupts, then your driver is arguably buggy. NAPI or no NAPI, doesn't matter. Edge-triggered interrupts is a fact of life, and deciding that you don't like them is not an excuse for saying "they should not work". You can get an edge by having your driver make sure that it clears the interrupt source at some point where it requires an edge. And yes, that may mean that when you're ready to start taking interrupts again, you are required to first read all pending packets, instead of just assuming that a level-triggered interrupt will "just happen", but that's the harsh reality for writing a driver that actually WORKS. For a driver writer, there is one rule above _all_ other rules: "Reality sucks, deal with it" That rule is inviolate, and no amount of "I wish", and "it _should_ work this way" or "..but the documentation says" matters at all. If you can't take that rule, don't write drivers, and don't design infrastructure for them. Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: better leve triggered IRQ management needed Date: Mon, 24 Apr 2006 19:08:59 UTC Message-ID: <fa.bA1L2QxVD4OIoF52xl2J6ofVhMs@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0604241203130.3701@g5.osdl.org> On Mon, 24 Apr 2006, Linus Torvalds wrote: > > You can get an edge by having your driver make sure that it clears the > interrupt source at some point where it requires an edge. Btw, this is why we do end up saying that having _two_ devices share an edge-triggered setup really is something we cannot necessarily fix. That said, it is better to limp along and work as well as you can than to just throw up your hands. So even then, we should at least give the user the _chance_ of being able to log in and give a bug-report, rather than "oops, the harddisk won't work, because the BIOS sets it up to share an edge-triggered interrupt with the network". However, I'm all for printing out a honking huge warning if we have two devices sharing the same edge-triggered interrupt. But a single device should work, or the driver should be considered broken. [ Btw, the "disable_irq()/enable_irq()" subsystem has been written so that when you disable an edge-triggered interrupt, and the edge happens while the interrupt is disabled, we will re-play the interrupt at enable time. Exactly so that drivers can have an easier time and don't have to normally worry about whether something is edge or level-triggered. However, if you're within an interrupt, that doesn't mean that you can just disable the irq and hope that it acts as if it was level-triggered when you enable it again. ] Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: better leve triggered IRQ management needed Date: Mon, 24 Apr 2006 21:07:41 UTC Message-ID: <fa.ZuiarLm+2wei48KTnEtQSASlHNk@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0604241354200.3701@g5.osdl.org> On Mon, 24 Apr 2006, Arjan van de Ven wrote: > > well... the corner case (as rmk described) is full starvation; even > polling once per second is better than not polling at tall there.,.. I can confirm that even just polling once a second - or even less often - is a huge improvement. A long time ago, I had a machine with a 3c509 card that would sometimes miss receive interrupts for some reason (it may actually have been a bug in the disable_irq/enable_irq thing on SMP, I forget - this is at least five years ago). The 3c509 driver had some five-second timeout thing, which meant that it would end up polling itself every five seconds regardless, and it made the difference between a machine that was totally undebuggable over a network, and one that actually worked surprisingly well (ie you could ssh into it, and things would work, but have these long pauses every once in a while, with burst of data). Of course, the machine would have been totally useless for real work, but it made it _much_ easier to see what was going on when things went south. So "limping along" when things don't work can be a huge time-saver from a debugging standpoint. So even if it's just that every registered SA_SHIRQ would get a heartbeat at least once every five seconds (and we'd limit it to SA_SHIRQ exactly because a driver that doesn't have that set may get confused if it gets extra interrupts), that might sound totally useless, but it might actually help somebody who otherwise might just make a pretty useless "the machine hung" bug-report. The fake interrupt could even print out a warning if somebody returns SA_HANDLED (since normally there _shouldn't_ have been any work to handle for it), and if that means that for somebody, things go from "the machine hung" to "the machine got very slow, and printed out 'fake interrupt for ide0 returned SA_HANDLED!'", that would potentially be a big debug aid. We've had our ass saved quite a few times now by the irq storm detector ("irq X: nobody cared" and friends), which has helped debug irqs that haven't been set up properly, that I'm convinced things like this might well make a huge deal. Of course, "things like this" does not necessarily cover the above scenario. Maybe that is totally useless ;) Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: better leve triggered IRQ management needed Date: Mon, 24 Apr 2006 21:09:58 UTC Message-ID: <fa.XJyXlefS49jQ9O3ZvokD9Ls5IWw@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0604241407130.3701@g5.osdl.org> On Mon, 24 Apr 2006, linux-os (Dick Johnson) wrote: > On Mon, 24 Apr 2006, Linus Torvalds wrote: > > > > SA_SHIRQ does NOT mean that the irq is shared. > > > > It means that it's not exclusive, and that the driver is _ok_ with it > > being shared if that makes sense. > > Yeah. You have been talking to too many lawyers! You are getting a > forked tongue! No, it's just legacy from some _really_ really old code. As in 1991. The very original Linux irq system didn't share interrupts at all (hey, PCI was newfangled, and ISA interrupts ruled), so when interrupt sharing was added, the default was to not do it. These days, that doesn't make any sense, and if somebody did the flags today, you'd do it the other way around (default to shared, and if somebody wants a really exclusive interrupt, they should say so with SA_EXCLUSIVE or something). But Linux grew from humble and stupid roots. Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: better leve triggered IRQ management needed Date: Mon, 24 Apr 2006 22:27:06 UTC Message-ID: <fa.vrpIbNNgezR09iyLbMGCqn9jOsc@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0604241521210.3701@g5.osdl.org> On Mon, 24 Apr 2006, Alan Cox wrote: > > > The fake interrupt could even print out a warning if somebody returns > > SA_HANDLED (since normally there _shouldn't_ have been any work to handle > > for it), and if that means that for somebody, things go from "the machine > > hung" to "the machine got very slow, and printed out 'fake interrupt for > > ide0 returned SA_HANDLED!'", that would potentially be a big debug aid. > > There are high rate IRQ sources that would trigger that erratically due to > races but it could be useful in some kind of "linux irqdebug" mode I was thinking that an interrupt actually happening on that irq would set the "already done" flag, so that if it's a high-rate irq, then we'd not inject any new fake ones. So the algorithm would be something like "clear the 'already done' flag every five seconds, and sending the fake one if it was already cleared", with normal interrupts always setting the flag. And yes, you could still hit it in a blue moon (nothing happened for five seconds, and then it happens _just_ as we send a fake event), but if the only thing we do is do a printk() on it, no big deal if you get a false positive every five years. (The bigger problem is that some drivers just return IRQ_HANDLED whether they had work to do or not, because people - including me - were lazy in some of the conversion of the irq_handler_t stuff). I dunno. It _sounds_ simple enough.. Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: [RFC 1/2] irq: record edge-level setting Date: Mon, 24 Apr 2006 22:34:50 UTC Message-ID: <fa.6EFZadlvFIiegTcGZVNcmgZJ0MY@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0604241527060.3701@g5.osdl.org> On Mon, 24 Apr 2006, Stephen Hemminger wrote: > > Maybe that's why it never was done in the past, too much work and historical > baggage. It's messy. That whole ELCR register was mis-designed: you can change the edge/level detection with it, but since it _also_ changes the polarity of the signal, you can't actually do so from a sw angle, and it has to match the hardware. So you can't say "I want to treat this interrupt as level triggered", and just set the bit ;^/ To make matters worse, I wouldn't be in the least surprised if the ELCR register is totally ignored by many south-bridges for the internally generated interrupts (ie devices that are embedded in the SB), since the register really doesn't matter for them. And it doesn't help that Intel mis-designed the edge-detection logic on the IO-APIC. On the old i8259, if you masked an interrupt and unmasked it, an active interrupt would always be seen as an edge, because the edge-detection was done _after_ masking. On the IO-APIC crap, the masking is done after edge-detection, so if you mask the APIC hardware level, and an edge happens, you'll never ever learn of it ever again. I'm sure other system architectures have similar problems, but it's irritating. Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: better leve triggered IRQ management needed Date: Sun, 30 Apr 2006 05:20:50 UTC Message-ID: <fa.z3ZVVRxN05ZWGEgbepCc+PtC3gk@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0604292204270.4616@g5.osdl.org> On Sun, 30 Apr 2006, Neil Brown wrote: > > So what do you propose should be done to better handle such poorly > built machines? Well, the thing is, there's not a lot we _can_ do. We can try to report it. We can also try to handle it as gracefully as we can. > As a concrete example I have a notebook which definitely assigns > shared interrupts to IRQ-10 (See /proc/interrupts below) yet the ELCR > only flags IRQ-11 as being level triggered and the rest are edge > triggered. Also, do you have the option to enable the IO-APIC? Maybe it's already enabled, and your BIOS has just disabled it, but your /proc/interrupts implies that you may have compiled your kernel without UP_APIC support. With the APIC, we might be able to do better. Worth trying out. > And with this configuration I definitely lose interrupts to the > wireless ethernet (ra0). > > How do I make this work reliably? > I could: > > 1/ modify handle_IRQ_event so that it is more resilient to the > possibility that shared interrupts are edge triggered. This can be > done be iterating over all action->handlers until they all return > IRQ_NONE. Well, yes. It's worth trying, but as mentioned, we have some drivers that return IRQ_HANDLED just because the driver conversion has been lazy. So limit it to a few things. And we really should have some flag that says whether the interrupt descriptor ends up being edge, so that we could do this for edge-triggered interrupts _only_. Anyway, I also do wonder if your irq lossage is due to something else. On the XT-PIC, disabling the irq will cause an edge when it's re-enabled, so you can get the "level" behaviour by disabling the irq over the irq handler. And that's exactly what we do, if I recall correctly. It's been years since I worked with that code, but looking at it quickly, it seems to match my recollection. > 2/ Arrange that the ELCR bit is set for any IRQ for which a shared > interrupt is registered (on the basis that the code for handling > shared interrupts is not resilient against them being edge triggered). NO. How many times do I have to say this? Yes, ELCR sets edge vs level. BUT IT ALSO SETS THE POLARITY. If you switch the bit around, it will also switch the polarity, and IT WILL NOT WORK. Because you'll end up with a level-triggered interrupt that is level-triggered for the wrong polarity, and will trigger whenever there is _not_ an interrupt pending. Now, I will almost guarantee you that there is an exception to this rule (hey, it's PC hardware, there's _always_ an exception to any rule ;), and on some situations, the ELCR thing will truly only affect edge vs level. But the point is, we can't just switch to level triggered. There simply is no such hardware in general for the old PC interrupts. (Now, _if_ you use the APIC, you can actually switch polarity and trigger mode independently. Which is one reason why I'd like to hear whether you perhaps have just disabled the APIC by mistake, rather than have a nasty BIOS that disables it for you). Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: better leve triggered IRQ management needed Date: Sun, 30 Apr 2006 06:59:49 UTC Message-ID: <fa.WbophM0vf2KyIjBUKF76WBG4pNI@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0604292343020.3690@g5.osdl.org> On Sun, 30 Apr 2006, Neil Brown wrote: > > The thing is: This is exactly what I am currently doing to solve the > problem. I'm not entirely surprised. As mentioned, the ELCR register _originally_ selected between EISA (level) and ISA (edge) interrupts. But EISA (and later PCI) interrupts were not just level, they were level active _low_. While old ISA interrupts are edge-triggered, active _high_. Which explains why it not only changes the trigger, but also the polarity. Now, fast-forward a decade or two, and imagine that the world is 99% PCI, and nobody really has any devices that are _electrically_ ISA any more, but there are some legacy stuff that _looks_ like ISA. What would you do to simplify your life from a hw perspective? I suspect that the thing to do is to internally just say that all interrupts are active low. There's no reason to _really_ have active high, because there are no real devices left that drive the irq line that way. Now, the _sane_ thing to do would be to also make all interrupts be level-triggered, and make the whole ELCR register be a total dummy register. But you can't really do that without being worried about breaking compatibility (for example, the timer interrupt is a 50% duty-cycle on/off thing, so it really _does_ end up being edge-triggered). So you leave the ELCR register mattering for a edge/level thing, but the polarity issue is just gone. But then on _other_ southbridges, you'll have the old behaviour, and there simply is no way for the OS to know. Yeah, we could look at the nortbridge and southbridge combination, and perhaps know that some of them always have a "active low" polarity regardless of ELCR. But nobody even _documents_ these things, exactly because it's not supposed to matter. So we're kind of screwed. We have to _act_ as if we still lived in the middle ages, and people still used edge-triggered active-high interrupts. Even when it's not necessarily the case any more.. Gaah. That said, I'm surprised that the kernel doesn't set ELCR for you. If it sees a PCI device, it really should know that it's a PCI interrupt. I wonder if we should do the following.. (Does this automatically make it do the right thing on your machine?) Linus --- diff --git a/arch/i386/pci/irq.c b/arch/i386/pci/irq.c index 7323544..6e3eaef 100644 --- a/arch/i386/pci/irq.c +++ b/arch/i386/pci/irq.c @@ -881,6 +881,7 @@ static int pcibios_lookup_irq(struct pci ((!(pci_probe & PCI_USE_PIRQ_MASK)) || ((1 << irq) & mask)) ) { DBG(" -> got IRQ %d\n", irq); msg = "Found"; + eisa_set_level_irq(newirq); } else if (newirq && r->set && (dev->class >> 8) != PCI_CLASS_DISPLAY_VGA) { DBG(" -> assigning IRQ %d", newirq); if (r->set(pirq_router_dev, dev, pirq, newirq)) { From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: better leve triggered IRQ management needed Date: Tue, 02 May 2006 15:06:09 UTC Message-ID: <fa.EYdeRkDusHDVqHD0ftG85wm5J08@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0605020804450.4086@g5.osdl.org> On Tue, 2 May 2006, Neil Brown wrote: > > Maybe the eisa_set_level_irq should be passed 'irq' rather than 'newirq' > ?? Yeah, stupid cut-and-paste error (the eisa_set_level_irq() call _is_ already there in the PCI irq setting, for the case where we actually have to set up routing that didn't exist before). That's also why I'm a bit nervous even about my stupid one-liner patch: if the irq routing is already set up, and we just use the irq we're told to use, I'm not sure we should touch ELCR even if it "looks wrong". It obviously works on your machine, but I wonder what could break on others.. Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: pcmcia oops on 2.6.17-rc[12] Date: Mon, 15 May 2006 22:03:27 UTC Message-ID: <fa.q1L3cfGLWFUVoGsga6VpINn+BsE@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0605151459140.3866@g5.osdl.org> On Mon, 15 May 2006, Alan Cox wrote: > On Llu, 2006-05-08 at 17:34 +0100, Russell King wrote: > > > So 8250 is requesting an IRQ for non-sharing mode and it's actually > > > failing, because something else is already using that IRQ. The difference > > > is that the kernel now generates a warning when this happens. > > > > Maybe someone is clearing the UPF_SHARE_IRQ flag? Which port is this? > > Its a bug in the PCMCIA code. Its the one I hit with the IDE code. > Asking for a private IRQ is not always honoured. Note that some PCMCIA architectures simply _will_not_ give you a private IRQ. Ever. They may not have any ISA interrupts to give, even to old 16-bit cards. So the choice may be "shared irq or nothing". So I would strongly argue that any driver that depends on getting an exclusive IRQ is buggy, not the PCMCIA layer itself, and that it would be a lot more productive to try to fix those drivers. Especially since exclusive irq's are clearly a dying breed, and have been for the last decade or two. Why try to keep that braindamage alive? Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: pcmcia oops on 2.6.17-rc[12] Date: Mon, 15 May 2006 23:38:48 UTC Message-ID: <fa.xfg++gRp0xcs2chxuU1ryT72gEI@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0605151629350.3866@g5.osdl.org> On Tue, 16 May 2006, Alan Cox wrote: > > It would certainly be a lot cleaner than this sort of code in the pcmcia > core right now. Want me to send a patch which only allows for SA_SHIRQ > and WARN_ON()'s for any driver not asking for shared IRQ ? I think it's too late for that in the current series, but yes, we could do it for 2.6.18. There are a few valid reasons for not using SA_SHIRQ, but they tend to be really special. Ie you'd better _know_ that you're either some system device, or you're physically in an ISA slot, not just based on some old ISA design. And PCMCIA is no longer an excuse, exactly because of some systems that will route even the 16-bit interrupts through a PCI irq. So anything in arch/ is likely ok (for example, the vm86 interrupt handling on x86 had _better_ be an exclusive interrupt ;) Doing a quick grep shows a surprising amount of drivers that pass in zero, but I guess that they truly _are_ old ISA sound/networking drivers. Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: pcmcia oops on 2.6.17-rc[12] Date: Mon, 15 May 2006 23:53:12 UTC Message-ID: <fa.7TZtLyFLsom/IkHCdkFzBFD5iZE@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0605151644090.3866@g5.osdl.org> On Mon, 15 May 2006, Linus Torvalds wrote: > > On Tue, 16 May 2006, Alan Cox wrote: > > > > It would certainly be a lot cleaner than this sort of code in the > > pcmcia core right now. Want me to send a patch which only allows for > > SA_SHIRQ and WARN_ON()'s for any driver not asking for shared IRQ ? > > I think it's too late for that in the current series, but yes, we could do > it for 2.6.18. Actually, thinking about it some more, what we _should_ do is to just make any code that simply doesn't _work_ with shared interrupts have to use SA_EXCLUSIVE. And then, for a while, warn if we see a "request_irq()" call that has neither SA_SHIRQ nor SA_EXCLUSIVE set. But this would be a pretty mild warning, along the lines of "%s getting non-SHIRQ irq. It _may_ or may not be shared in the future" and eventually just stop caring, and making SA_SHIRQ have no meaning. It's kind of pointless to ask all drivers to use SA_SHIRQ, when it really should be the default. Even the ISA drivers that currently do not have SA_SHIRQ won't generally _break_ when/if they were to get a shared interrupt. The reason they don't have SA_SHIRQ isn't generally that they really really want an exclusive interrupt, but simply because they never had a reason to say SA_SHIRQ. So a lack of SA_SHIRQ doesn't generally mean "I _need_ an exclusive interrupt". It could equally well mean "I never bothered to even think about whether I could share interrupts or not". Making it an error to pass in 0 would be silly, because for old ISA devices, most of the time they really just don't care. What think you? The SA_EXCLUSIVE flag is still needed, but it would be used by things that literally cannot handle anything but an exclusive interrupt (eg vm86 mode irq access), exactly because they do things like disable that particular irq for long times, or because they don't have a status register to read (eg, on x86, the timer interrupt really _is_ exclusive for that reason). Most ISA drivers probably wouldn't care at all one way or the other, I suspect. And it would be a waste of time to add SA_SHIRQ to them for no actual real gain. Comments? Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: [RFT] major libata update Date: Wed, 17 May 2006 23:50:41 UTC Message-ID: <fa.85AOIRqH1a8HegAt5HOgt3ksk6w@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0605171643510.10823@g5.osdl.org> On Wed, 17 May 2006, Linus Torvalds wrote: > > I think Neil already reported that it fixed a lost interrupt problem for > him, but I'm worried that it might result in interrupt storms for others. Of course, we could just decide to try to switch from level to edge-triggered if the irq storm thing ever triggers. Right now we disable the irq entirely, which means that _if_ it was just due to a polarity error, we're screwed even if it should have been easy to fix by just turning it into edge-high. The code to do that should be trivial: make __report_bad_irq() try to switch to edge mode if it's not edge already. Hmm? Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: pcmcia oops on 2.6.17-rc[12] Date: Mon, 22 May 2006 15:07:42 UTC Message-ID: <fa.5ZlztUW5ZHp65oN1eI+7kdtrSlk@ifi.uio.no> Original-Message-ID: <Pine.LNX.4.64.0605220751380.3697@g5.osdl.org> On Mon, 22 May 2006, Rogier Wolff wrote: > > The question I'm stuck with is: When is it valid to ask for a non-shared > IRQ, and get back a shared one. > > Drivers that know that they don't work well if they are called by the > "other" interrupt? No. For example, on certain 16-bit PCMCIA setups, the PCMCIA controller may have just one interrupt. It may even have that interrupt exclusively, but the point is, it has _one_. One interrupt shared for both doing not just card interrupts, but also for PCMCIA CSC interrupts. In that situation, once the card has been inserted (and powerup etc has happened), the only interrupts you'll get is actually the interrupts for the card. So everything is fine. BUT A PCMCIA DRIVER STILL MUST NOT ASK FOR A NON-SHARED IRQ. Because the irq will still be registered by the PCMCIA layer, and the PCMCIA layer will check whether the interrupt was due to a CSC when the card was removed, for example. So there's basically never any valid reason to ask for a nonshared irq. > I happen to know (ISA) hardware that CANNOT share an interrupt Not necessarily true, since ISA cards have been known to be able to share an interrupt with the proper pull-down resistors. It was even common for serial cards. Perhaps more importantly, not relevant for PCMCIA. There is no PCMCIA hardware that cannot share an interrupt, for reasons outlines above. There might be really bad hardware that doesn't even have an interrupt status register so you can't tell if an interrupt happened from that card or not, making it hard to write a driver that can handle "spurious" interrupts (in the case of real sharing), but that sounds pretty damn unlikely. Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: Re: Re: 2.6.19-rc5: known regressions :SMP kernel can not generate Date: Mon, 13 Nov 2006 16:07:44 UTC Message-ID: <fa.yOnBnxuPQr3qWGVAwBPmigQC/uE@ifi.uio.no> On Fri, 10 Nov 2006, Komuro wrote: > > I tried the 2.6.19-rc5, the problem still happens. Ok, that's good data, and especially: > But, > I remove the disable_irq_nosync() , enable_irq() > from the linux/drivers/net/pcmcia/axnet_cs.c > the interrupt is generated properly. All RIGHT. That's a very good clue. The major difference between PCI and ISA irq's is that they have different trigger types (they also have different polarity, but that tends to be just a small detail). In particular, ISA IRQ's are edge-triggered, and PCI IRQ's are level- triggered. Now, edge-triggered interrupts are a _lot_ harder to mask, because the Intel APIC is an unbelievable piece of sh*t, and has the edge-detect logic _before_ the mask logic, so if a edge happens _while_ the device is masked, you'll never ever see the edge ever again (unmasking will not cause a new edge, so you simply lost the interrupt). So when you "mask" an edge-triggered IRQ, you can't really mask it at all, because if you did that, you'd lose it forever if the IRQ comes in while you masked it. Instead, we're supposed to leave it active, and set a flag, and IF the IRQ comes in, we just remember it, and mask it at that point instead, and then on unmasking, we have to replay it by sending a self-IPI. Maybe that part got broken by some of the IRQ changes by Eric. Eric, can you please double-check this all? I suspect you disable edge-triggered interrupts when moving them, or something, and maybe you didn't realize that if you disable them on the IO-APIC level, they can be gone forever. [ Note: this is true EVEN IF we are in the interrupt handler right then - if we get another edge while in the interrupt handler, the interrupt will normally be _delayed_ until we've ACK'ed it, but if we have _masked_ it, it will simply be lost entirely. So a simple "mask" operation is always incorrect for edge-triggered interrupts. One option might be to do a simple mask, and on unmask, turn the edge trigger into a level trigger at the same time. Then, the first time you get the interrupt, you turn it back into an edge trigger _before_ you call the interrupt handlers. That might actually be simpler than doing the "irq replay" dance with self-IPI, because we can't actually just fake the IRQ handling - when enable_irq() is called, irq's are normally disabled on the CPU, so we can't just call the irq handler at that point: we really do need to "replay" the dang thing. Did I mention that the Intel APIC's are a piece of cr*p already? ] > So I think enable_irq does not enable the irq. It probably does enable it (that's the easy part), but see above: if any of the support structure for the APIC crapola is subtly broken, we'll have lost the IRQ anyway. (Many other IRQ controllers get this right: the "old and broken" Intel i8259 interrupt controller was a much better IRQ controller than the APIC in this regard, because it simply had the edge-detect logic after the masking logic, so if you unmasked an active interrupt that had been masked, you would always see it as an edge, and the i8259 controller needs none of the subtle code at _all_. It just works.) Anyway, if you _can_ bisect the exact point where this started happening, that would be good. But I would not be surprised in the least if this is all introduced by Eric Biedermans dynamic IRQ handling. Eric? Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: [patch] irq: do not mask interrupts by default Date: Tue, 14 Nov 2006 16:14:41 UTC Message-ID: <fa.XoLlQ3LeOHarLlAhCQIyDAp+lJ8@ifi.uio.no> On Tue, 14 Nov 2006, Ingo Molnar wrote: > > 1) disable_irq() is relatively rare (used in about 10% of drivers, but > there it's overwhelmingly used in some slowpath) so it's performance > uncritical. Well, the thing is, the _replay_ if it does happen, is going to be really really slow compared to the masking. So at that point, it may well be a net performance downside if the masking is going to almost always have an interrupt happen while the thing is masked. I dunno. There's another thing too: For level-triggered interrupts, I _really_ don't think we should do this. The code inside the masked region is sometimes "setup code", which will do things that _will_ raise an interrupt, but may read the status register or whatever to then unraise it. So in that case, your patch will generate different behaviour, something that I really don't want to introduce at this point in the 2.6.19 series. > 2) missing an IRQ while the line is masked is often a lethal regression > to the user. An IRQ could be missed even if we think that the IRQ line > is 'level-triggered'. If it's level-triggered, it's going to be missed only if it's de-asserted by code inside the masked region, and that is what we have always done on purpose, so "missing" it is the right thing to do. It's what we have tested all level-triggered interrupts with for the last 15+ years, and it's been part of the semantics for masking. So I absolutely do _not_ think your change is improved semantics. It's new semantics, and illogical. If the driver masked the irq line, did some testing that raises and clears it again ("let's check if this version of the chip raises the interrupt when we do XYZZY"), then the logical thing to do would be to not cause the interrupt to happen. Of course, for edge-triggered APIC interrupts, we _have_ to replay the irq (since we don't have any way of even *knowing* whether we might get it again), but for level-triggered and for the old legacy i8259 controller that gets it right for edges anwyay, we should _not_ send the spurious interrupt that is no longer active. And a lot of code has been tested with either just the i8259 (old machines without any APIC) or with PCI-only devices (which are always level- triggered), so the fact that edge-triggered things have always seen the potential for spurious interrupts is not a reason to say "well, they have to handle it anyway". True PCI drivers generally do _not_ have to handle the crazy case, and generally have never seen it. > so my patch changes the default irq-disable logic of /all/ controllers > to "delayed disable". (IRQ chips can still override this by providing a > different chip->disable method that just clones their ->mask method, if > it is absolutely sure that no IRQs can be lost while masked) I really think we should do this just for APIC edge triggered interrupts, ie keep the old behaviour. Also, I worry a bit about the patch: > @@ -272,8 +268,11 @@ handle_simple_irq(unsigned int irq, stru > kstat_cpu(cpu).irqs[irq]++; > > action = desc->action; > - if (unlikely(!action || (desc->status & IRQ_DISABLED))) > + if (unlikely(!action || (desc->status & IRQ_DISABLED))) { > + if (desc->chip->mask) > + desc->chip->mask(irq); > goto out_unlock; > + } The simple-irq case too? That's not even going to replay the thing? So now you just mask (without replaying) simple irqs, but then the other irqs you mask and replay.. See above on why I don't think this is necessarily a bug (since masking is almost always the right thing _anyway_), but now it will *STILL* depend on some internal implementation decision on whether the replay happens at all. I'd much rather have the replay decision be based on hard physical data: we replay _only_ for edge-triggered interrupts, and _only_ for controllers that need it. In other words, I think we should just make APIC-edge have the "please delay masking and replay" bit, and nobody else. Can you send that patch (for both x86 and x86-64), and we can ask Komuro to test it. That would be the "same behaviour as we've always had" thing, which I think is also the _right_ behaviour. Linus From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: [PATCH] Use delayed disable mode of ioapic edge triggered Date: Wed, 15 Nov 2006 01:22:09 UTC Message-ID: <fa.z1RGOF5Ofaq4dWyNf4gQCbxcCMc@ifi.uio.no> On Tue, 14 Nov 2006, Eric W. Biederman wrote: > > Hopefully this is the trivial patch that solves the problem. Ok, having looked more at this, I have to say that the whole "IRQ_DELAYED_DISABLE" thing seems very fragile indeed. It looks like we should do it not only for APIC edge-triggered interrupts, but for HT and MSI interrupts too, as far as I can tell (at least they also use the "handle_edge_irq" routine) So I'm wondering how many other cases there are that are missing this. In that sense, Ingo's patch was a lot safer, although I still dislike it for all the other reasons I mentioned - it's simply wrong to re-send a level-triggered irq. I don't know MSI and HT interrupts well enough to tell whether they will re-trigger on their own when we unmask them, but the point is, this _looks_ like it might be incomplete. I think part of the problem is a bad interface. We should simply never set the IRQ handler on its own. It should be a field in the "irq_chip" structure, and we should use _different_ irq chip structures for level and edge-triggered. Then we should also add the "flags" thing there, and you could do something like static struct irq_chip level_ioapic_chip = { .. instead of making the insane decision to use the "same" chip for all ioapic things. Ingo? Eric? Comments? Linus |
|