hverkuil: Heippa! sailus: hi How do you do? very busy but feel free to ask quick questions :-) It's one of those days where things only get worse, not better :-( made me wish I'd stayed in bed. hverkuil: I acked the final patch in the set. thanks Otherwise it's non-trivial and I don't have time to properly review it but I believe others have done that already. So I guess we're good. pinchartl: why the MC sometimes uses a mutex and sometimes use a spin_lock? that doesn't sound right to me mchehab: as a general rule a spinlock is used for short section that don't need to sleep, with code that needs to be callable from interrupt context, and a mutex for larger code sections that are always called in non-sleepable context. if you tell me exactly what you're referring to I can probably provide more information well, media_device_unregister* uses the spinlock... but the methods to remove stuff uses the mutex that sounds wrong the spinlock protects the list of entities, to make entity lookup in non-sleepable context possible but we're protecting all the rest with the mutex the graph mutex protects more complex operations, in particular graph walk this will need to be revisited to implement dynamic changes to the graph well, we're using it also for media_create_intf_link() / media_devnode_remove() I expect that most of the code wil need to be rewritten s/wil/will/ (as before: creating/removing things were using mutexes) s/the code/that code/ I suspect that this is causing some very hard to track bugs, like the ones that Shuah complains, when she uses the Kernel instrumentation for memory/lock troubles I wonder whether we'll need rcu to handle dynamic updates yes, RCU is one alternative not easy to make it work I haven't analyzed the problem yet so I don't know what the best solution will be but it will certainly require very careful design and lots of prototyping and testing yes mutex hold during device removal is definitely a problem yes, and we do that at V4L2 core we'll need reference counting references to entities don't bother me too much One thing I considered is making media_device_unregister execute only once pinchartl: well, I removed refcount per your request ;) what bothers much is drivers using those references to travel the graph through links I was using it on the first version of the MC patchset shuah: so you can remove a single entity only, and then it will always return an error ? :-) mchehab: it wasn't needed without dynamic updates :-) there's one problem with refcount: we'll likely need to dynamically allocate entity, instead of using media_entity embedded what I mean is keep state that we started unregister process upon first call and then return the reason I asked for it to be removed was that there was, in my opinion, no way to implement correct reference counting without a full design for dynamic updates we would have ended up with some kind of reference counting that wouldn't have been wrong per se pinchartl: the problem I'm facing is not related to dynamic update.... it is related to device unbind but that would most likely not have fulfilled the dynamic updates use case requiring a rework of the code can we make graph rcu list? We probably have more readers than writers shuah: rcu is an option, but I have no way to tell whether it's a good one without studying it in detail first right - also it could be more complex because rcu constructs need to be followed for the solution to be correct it looks like it would be very complex, I agree What I mean is we will have to restructure the code and provide API so drivers don't hold mutex directly - it has to all go through API so rcu constructs don't have to spread out in driver code I think the base issue will be to implement an efficient and easy to use API for graph traversal without any race condition with entity addition/removal right rcu in driver code is likely the last thing we want :-) exactly if we were to expose rcu to drivers we would have absolute certainty that none of them would get it right :-) agreed well, for now I'll be balancing graph object creation/removal using the spinlock keeping the mutex *only* for graph traversal ok one thing we can do is reduce the critical sections that need mutex hold (although this still doesn't sound right) this way, I'll remove the mutex_lock() from the non-atomic context so read only use mutex and write cases use spinlock that it is currently called during device removal readers could still contend on the mutex though - won't graph walks take longer shuah: on a dynamic change, this is still problematic... what happens if some graph elements are inserted or removed while graph traversal is running? the code is known to be racy currently I would try to avoid making it worse but I'm not sure it makes sense fixing it completely before implementing dynamic updates pinchartl: I suspect that the current code is broken for device driver removal probably unnoticed, because this is something that people don't do on embedded environment device driver removal is broken by design it's an unsupported feature of the kernel pinchartl: well, but USB device removal is supported unbinding a device from a driver, on the other hand, is supposed to be supported in a race-free way and this is where the problem happens when a device is removed, the V4L core calls the *remove* functions in non-atomic context no, it's not supported. it works most of the time and is available to users, but it's known to be broken device removal must be supported of course two are protected by the spin_lock unplugging a device should not cause issues the rest are "protected" by mutex the usb disconnect callback is called directly from an interrupt handler ? I thought there was a work queue nowadays but anyway except that mutex doesn't protect on non-atomic context and causes KASAN to hang when a usb device is disconnected kasan: GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] SMP KASAN there's no need to remove the v4l2 devices right away look at what uvcvideo does, it only removes the v4l2 devices when the last reference goes away pinchartl: the problem is acually when the au8522 subdev gots unbind due to the device removal what does it do then ? http://pastebin.com/Yuvyjg9c [ Â 68.005776] BUG: sleeping function called from invalid context at include/linux/sched.h:2776 looks like me might need to design dynamic updates sooner than later but not at nearly midnight I'm afraid