hverkuil: Heippa!
sailus: hi
How do you do?
very busy
but feel free to ask quick questions :-)
It's one of those days where things only get worse, not better :-(
made me wish I'd stayed in bed.
hverkuil: I acked the final patch in the set.
thanks
Otherwise it's non-trivial and I don't have time to properly review it but I believe others have done that already.
So I guess we're good.
pinchartl: why the MC sometimes uses a mutex and sometimes use a spin_lock?
that doesn't sound right to me
mchehab: as a general rule a spinlock is used for short section that don't need to sleep, with code that needs to be callable from interrupt context, and a mutex for larger code sections that are always called in non-sleepable context. if you tell me exactly what you're referring to I can probably provide more information
well, media_device_unregister* uses the spinlock...
but the methods to remove stuff uses the mutex
that sounds wrong
the spinlock protects the list of entities, to make entity lookup in non-sleepable context possible
but we're protecting all the rest with the mutex
the graph mutex protects more complex operations, in particular graph walk
this will need to be revisited to implement dynamic changes to the graph
well, we're using it also for media_create_intf_link() / media_devnode_remove()
I expect that most of the code wil need to be rewritten
s/wil/will/
(as before: creating/removing things were using mutexes)
s/the code/that code/
I suspect that this is causing some very hard to track bugs, like the ones that Shuah complains, when she uses the Kernel instrumentation for memory/lock troubles
I wonder whether we'll need rcu to handle dynamic updates
yes, RCU is one alternative
not easy to make it work
I haven't analyzed the problem yet so I don't know what the best solution will be
but it will certainly require very careful design
and lots of prototyping and testing
yes
mutex hold during device removal is definitely a problem
yes, and we do that
at V4L2 core
we'll need reference counting
references to entities don't bother me too much
One thing I considered is making media_device_unregister execute only once
pinchartl: well, I removed refcount per your request ;)
what bothers much is drivers using those references to travel the graph through links
I was using it on the first version of the MC patchset
shuah: so you can remove a single entity only, and then it will always return an error ? :-)
mchehab: it wasn't needed without dynamic updates :-)
there's one problem with refcount: we'll likely need to dynamically allocate entity, instead of using media_entity embedded
what I mean is keep state that we started unregister process upon first call and then return
the reason I asked for it to be removed was that there was, in my opinion, no way to implement correct reference counting without a full design for dynamic updates
we would have ended up with some kind of reference counting
that wouldn't have been wrong per se
pinchartl: the problem I'm facing is not related to dynamic update.... it is related to device unbind
but that would most likely not have fulfilled the dynamic updates use case
requiring a rework of the code
can we make graph rcu list? We probably have more readers than writers
shuah: rcu is an option, but I have no way to tell whether it's a good one without studying it in detail first
right - also it could be more complex because rcu constructs need to be followed for the solution to be correct
it looks like it would be very complex, I agree
What I mean is we will have to restructure the code and provide API so drivers don't hold mutex directly - it has to all go through API so rcu constructs don't have to spread out in driver code
I think the base issue will be to implement an efficient and easy to use API for graph traversal without any race condition with entity addition/removal
right
rcu in driver code is likely the last thing we want :-)
exactly
if we were to expose rcu to drivers we would have absolute certainty that none of them would get it right :-)
agreed
well, for now I'll be balancing graph object creation/removal using the spinlock
keeping the mutex *only* for graph traversal
ok one thing we can do is reduce the critical sections that need mutex hold
(although this still doesn't sound right)
this way, I'll remove the mutex_lock() from the non-atomic context
so read only use mutex and write cases use spinlock
that it is currently called during device removal
readers could still contend on the mutex though - won't graph walks take longer
shuah: on a dynamic change, this is still problematic...
what happens if some graph elements are inserted or removed while graph traversal is running?
the code is known to be racy currently
I would try to avoid making it worse
but I'm not sure it makes sense fixing it completely before implementing dynamic updates
pinchartl: I suspect that the current code is broken for device driver removal
probably unnoticed, because this is something that people don't do on embedded environment
device driver removal is broken by design
it's an unsupported feature of the kernel
pinchartl: well, but USB device removal is supported
unbinding a device from a driver, on the other hand, is supposed to be supported in a race-free way
and this is where the problem happens
when a device is removed, the V4L core calls the *remove* functions in non-atomic context
no, it's not supported. it works most of the time and is available to users, but it's known to be broken
device removal must be supported of course
two are protected by the spin_lock
unplugging a device should not cause issues
the rest are "protected" by mutex
the usb disconnect callback is called directly from an interrupt handler ? I thought there was a work queue nowadays
but anyway
except that mutex doesn't protect on non-atomic context and causes KASAN to hang
when a usb device is disconnected
kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] SMP KASAN
there's no need to remove the v4l2 devices right away
look at what uvcvideo does, it only removes the v4l2 devices when the last reference goes away
pinchartl: the problem is acually when the au8522 subdev gots unbind
due to the device removal
what does it do then ?
http://pastebin.com/Yuvyjg9c
[   68.005776] BUG: sleeping function called from invalid context at include/linux/sched.h:2776
looks like me might need to design dynamic updates sooner than later
but not at nearly midnight I'm afraid