[linux-dvb] CX18 Oops

Andy Walls awalls at radix.net
Sun Aug 17 21:12:50 CEST 2008

On Sun, 2008-08-17 at 11:41 +0200, Hans Verkuil wrote:
> On Sunday 17 August 2008 04:13:24 Andy Walls wrote:
> > On Mon, 2008-08-11 at 17:33 -0400, Brandon Jenkins wrote:
> > > On Fri, Aug 8, 2008 at 10:18 AM, Andy Walls <awalls at radix.net> 
> wrote:

> > I have updated my repo at
> >
> > http://linuxtv.org/hg/~awalls/v4l-dvb
> >
> > with 3 changes:
> >
> > 1. Back out the original band aid fix
> > 2. Simplify the queue flush routines (you will not see that oops
> > again)
> Nice! Cleans it up considerably.


> > 3. Fix the interrupt handler to obtain a queue lock (prevents 
> > queue corruption)
> No, that is not the bug.

Yeah, it is not *the* bug I was after, but pending further discussion
below I'll maintain it is a bug.

I really want to do a full audit of all the queue manipulations in the
driver.  I'll try to find some time when I'm offline this week.

(Thanks for all the comments BTW!)

> I'm pretty sure that the real bug is that the 
> old cx18_queue_move() function didn't use a spin_lock_irqsave(). I 
> think that it was possible for an interrupt to arrive when the CPU was 
> in the middle of a cx18_queue_move().

On the surface it did look OK since all the interrupts for that stream
*should* have been stopped.  But after some hypothetical thought about
how the encoder might not stop right away and looking at Jeff's logs
where a new capture may have been started before the queue flush was
done, it's safer just to have the queue flush routine to acquire the
lock.  Since the queue flush routine in question is called very
infrequently, there's not much of a time penalty.

>  A spinlock in an interrupt 
> handler is usually bogus (and that includes the one in the ivtv irq 
> handler, I've just realized).

Chapter 5 of _Understanding_the_Linux_Kernel_ (2nd edition) on page 185
& 188 mentions that to protect data structures, being accessed by both
exceptions (like open/close/read/write/poll induced INT 0x80 exceptions
on Intel) and interrupt handlers on a MP system, a spinlock needs to be
used.  The book also mentions that a semaphore (now a mutex) is
sometimes preferable in this case, where the interrupt routine polls the
semaphore in a tight loop like a spinlock, but the system calls use the
semaphore normally and are allowed to sleep.

> What I am uneasy about, though, is why an interrupt could arrive while 
> in the cx18_queue_move() function. In principle this function should 
> only be called when the capture has stopped. I think it might be a good 
> idea to debug this: is it possible for interrupts to arrive after the 
> capture was stopped? Or is it possible for cx18_queue_move() to be 
> called when a capture is still in progress?

Right.  I should really test to see if this actually happens.  Since
spinlocks are supposedly optimized for the case of the lock being
available and the cx18_queue_move() is called infrequently, leaving it
in for now, should be OK.  

I also have difficulty reproducing the original oops, so my test results
could be misleading.  I have no good criteria for terminating the
experiment/testing and declaring "interrupts can't happen when we decide
to flush queues".

> I do think it is a good idea to rename cx18_queue_find_buf to 
> cx18_queue_get_buf_irq to denote that it 1) not just finds a buffer but 
> also removes it from the queue, and 2) it can only be called safely 
> from interrupt context.

Agree.  I'll make that part of the final change.

> > >From most of the output you provided, it was pretty obvious that
> > > q_full
> >
> > was always claiming to have more buffers that it actually did.  I
> > hypothesized this could come about at the end of a capture when the
> > encoder hadn't really stopped transferring buffers yet (after we told
> > it to stop) and then we try to clear q_full while the interrupt
> > handler is still trying to add buffers.  This could happen because
> > the interrupt handler never (ever) properly obtained a lock for
> > manipulating the queues.  This could have been causing the queue
> > corruption.
> >
> > Please test.  I need feedback that I haven't introduced a deadlock.
> >
> > It also appears that the last change requiring the interrupt handler
> > to obtain a lock, completely mitigates me having to use the "-cache
> > 8192" option to mplayer for digital captures, and greatly reduces the
> > amount of cache I need to have mplayer use for analog captures.
> I suspect that it is the change before that one: adding a spinlock to 
> cx18_queue_move().

I have a recollection that in my incremental testing that this was not
the case.  It was actually the spinlock in the irq handler that made
things better.  But it was late and I was tired.  I'll retest and run my
blocking read timing test to see if I can see a difference in the
numbers for a few cases.

>  The spinlock in the interrupt handler doesn't do 
> anything. It would only be useful if you could have two independent 
> interrupt handlers that both needed access to that resource. But that 
> is not the case here.

I agree that 2+ interrupt handlers could not access that resource
concurrently.  But, AFAICT, a system call on one processor and the
interrupt handler on the other processor can access the queues

It's my understanding that spin_lock_irqsave() and spin_lock_irq() only
disable local interrupts for the particular CPU and not globally.  So
here's the case I think needs a spinlock lock in the irq handler:

1. A capture is in progress

2. Application on CPU #0 issues a read(). cx18_dequeue() is ultimately
called on q_full.  cx18_dequeue() calls spin_lock_irqsave(), disabling
preemption, disabling local interrupts, and acquiring the lock, and
begins manipulating q_full

3. At an inopportune time, an interrupt arrives from the encoder and is
sent to CPU #1 for servicing.   In the cx18 driver, in an interrupt
context, epu_dma_done() is eventually invoked.  Without obtaining a
lock, epu_dma_done() calls cx18_queue_find_buf() and manipulates q_full.
This manipulation of q_full happens while the system call on CPU #0
holds the lock and thinks things are safe.

Given the above situation I think the interrupt handler does need to
acquire the spinlock.  So, here's where people get to hit me with the
clue-stick: in the above case, what do I have wrong?

> > Hans,
> > (or anyone else with expertise in using spinlocks withing an
> > interrupt handler),
> >
> > Could you please provide comments on if I'm doing something wrong
> > with the way I obtain the spinlock in the interrupt handler?
> See above :-)
> >
> > http://linuxtv.org/hg/~awalls/v4l-dvb/rev/f3ada35200c0
> >
> > >From reading Bovet and Cesati's _Understanding_the_Linux_Kernel_ and
> >
> > Corbet, Rubini, and Kroah-Hartman's _Linux_Device_Drivers_ I think
> > I've got it right.
> >
> > When the stream queues (q_full, q_io, and q_free) are accessed from
> > the system call exception handler, I need to do a spin_lock_irqsave()
> > to disable local CPU interrupts and protect access to the queues by
> > kernel control paths on other CPU's.  When they stream queues are
> > accessed by the interrupt handler on any CPU, the interrupt handler
> > is serialized with respect to itself and need not disable any
> > interrupts and simply obtain the lock via spin_lock() to protect
> > against access from system call exceptions.
> System call exceptions? Not sure what you mean.

As I understand it, on Intel platforms, when a user land application
invokes read(), write(), or some other system call, said system call
eventually invokes an INT 0x80 software exception to make the transition
to kernel code and data space with the proper privileges.

> AFAIK the interrupt 
> handler doesn't have to protect against anything.

See my concern above.  In brief, AFAICT, a system call on one processor
concurrent with interrupt service on another processor requires the irq
handler to obtain the proper lock before mucking with the shared data


> Regards,
> 	Hans
> >
> >
> > Regards,
> > Andy

More information about the linux-dvb mailing list