↑back Search ←Prev date Next date→ Show only urls | (Click on time to select a line by its url) |
Who | What | When |
---|---|---|
svarbanov | mort: can you paste few lines before Oops?
mort: is there Mem abort info? | [08:21] |
..................... (idle for 1h41mn) | ||
*** | benjiG has left | [10:03] |
.......................... (idle for 2h6mn) | ||
mort | svarbanov: here's another oops with a few lines before and after: https://pastebin.com/raw/9ZYYiDt6 - afaik mem abort info should be at the end of the oops, there doesn't seem to be any
I didn't notice the "Unable to handle kernel NULL pointer dereference at virtual address 00000008" when copying the last oops, so I left it out, sorry - it's weird that it's doing something with address 00000008 | [12:09] |
...... (idle for 28mn) | ||
https://pastebin.com/raw/gpL0sB2V is also curious, it doesn't produce an oops, but does mean no valid video stream is produced before the process is restarted (which means just restarting with VIDIOC_STREAMOFF, closing the FD, re-opening the file, and VIDIOC_STREAMON doesn't fix it)
and it's interesting how there's a Windows path, and I'm unable to find a file called "vbuffer.c" anywhere either in my kernel tree or through Google | [12:39] | |
svarbanov | mort: can you enable CONFIG_DEBUG_LIST in kernel? | [12:52] |
mort | sure
https://pastebin.com/raw/5d539fid here's another oops, from the same function but not a null pointer dereference CONFIG_DEBUG_LIST is a compile option and not a kernel cli option, right? | [12:55] |
svarbanov | mort: yes, somewhere in kernel hacking I guess
mort: any idea about the sequence of v4l2 calls? it happens on streamoff as I can see but some background will help much more | [12:57] |
mort | https://gist.github.com/mortie/4d2961e00b946546fa95b1ecd419b780 here's the source code for my v4l device abstraction, the release() function is where it first calls streamoff and then munmaps all buffers
there's two instances of that class for the encoder, one for the output and one for the capture, and it runs poll on the file descriptor in a loop there are multiple threads involved, so the functions may be called from any thread, but all the operations on a device is protected by a mutex, so they aren't called at the same time when calling release(), any number of calls to dequeue_buf and queue_bug might have happened | [13:02] |
svarbanov | mort: could you enable : echo 0xf > /sys/class/video4linux/video0/dev_debug , where the video0 is the encoder video node | [13:05] |
mort | the output part of the encoder uses V4L2_MEMORY_DMABUF
sure, sec also, the oops just happens occasionally, not every time and it's worth noting that calling release() and then init() happens pretty frequently, it's how I create a keyframe when webrtc asks me to create a keyframe, and it's how I change bitrate; restarting the encoder for those purposes involves calling release() on both the instances related to the encoder, then closing the FD, then opening the file an d calling init() on both instances | [13:05] |
svarbanov: https://pastebin.com/raw/E8QDGT6G | [13:18] | |
I see you've made some changes to session_unregister_bufs since 4.9.39 though | [13:29] | |
svarbanov | mort: hm, I can't see STREAMOFF
just before the Oops | [13:32] |
mort | the oops is in vb2_core_streamoff, I assume it happens to not log streamoff before that function returns | [13:33] |
svarbanov | mort: yes, you are right | [13:35] |
mort | actually, all those changes to session_unregister_bufs seem to be backported to the version we're using | [13:37] |
svarbanov | mort: I hope you can rebuild the kernel with some debug messages | [13:38] |
mort | yeah, it actually just got done recompiling the kernel and building a new image | [13:39] |
svarbanov | mort: can you add a check list_empty just before list_for_each_entry_safe and exit if it is empty? | [13:44] |
mort | sure
is there some function I should call to make sure a message gets logged to the dmesg log? | [13:47] |
svarbanov: with CONFIG_DEBUG_LIST but without the list_empty check: log without oops: https://pastebin.com/raw/ypg8ML4k, log where an oops happens: https://pastebin.com/raw/mSeB3G71 | [14:01] | |
currently installing the image with the list_empty check, I added `if (list_empty(&inst->registeredbufs)) { printk(KERN_ERR, "\n\nregisteredbufs is empty!\n\n\n"); return 0; }` right after the `is_reg_unreg_needed(inst)` check in session_unregister_bufs
will send logs from that version when they're ready | [14:07] | |
svarbanov: with the change, it still oops'd; here's a log where it does that: https://pastebin.com/raw/L6BanXGR, "list_del corruption. next->prev should be ffff80002ddfe208, but was ffff8000177e9e08" is very interesting | [14:19] | |
svarbanov | mort: yes, the corruption is in list_del_init
mort: who is exporting the dma-bufs? | [14:33] |
mort | svarbanov: the camera does that, it's a third v4l2device instance with conf_.export_bufs = true (https://gist.github.com/mortie/4d2961e00b946546fa95b1ecd419b780#file-v4l2-device-cc-L127)
running the qcom-camss driver maybe it has to do with the order I'm releasing that yeah, I'm releasing the camera first and then the encoder, that's probably very bad when using dma buffers | [14:39] |
svarbanov | mort: I wonder when the dma-buf buffers are imported to venus encoder does vb2::buf_init is called
because it seems that buf->list_reg is invalid (I guess) and the buf->list_reg is add in the registeredbufs list only in vb2:buf_init ops mort: to your question, dma-bufs are ref counted so that shouldn't be a problem, but it is worth to check it | [14:42] |
mort | is it helper_vb2_buf_init you want to check if gets called? I can add a printk there too | [14:46] |
svarbanov | yes, check does list_add_tail is called
and print the pointer to the list_head | [14:47] |
mort | just to make sure, and avoid an unnecessary rebuild; printk gets printed to the dmesg log unconditionally, right? | [14:48] |
svarbanov | use pr_err for sure :) | [14:49] |
javier__ | mort: it depends on your /proc/sys/kernel/printk configuration
mort: I believe the default log level is KERN_WARNING | [14:52] |
mort | pr_err it is then :p | [14:52] |
svarbanov | mort: hm, but CAPTURE buffers for the encoder are not dma-buf, right? | [14:55] |
mort | no, those are memory bapped
they get enqueued again when they're not used anymore, in the destructor of a videoframe class | [14:56] |
svarbanov: helper_vbf_buf_init is indeed called, https://pastebin.com/raw/pZHaVfmT
(I added a print in both branches of the `if (vb->type == V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE)`) | [15:07] | |
svarbanov | can you compare the list_head address with the one from list debug?
with this "list_del corruption. next->prev should be ffff80002ddfe208, but was ffff8000177e9e08" | [15:08] |
mort | sure, I'll start and stop the video call a few times to try and reproduce it | [15:09] |
now it's complaining about list_add... https://pastebin.com/raw/2Ltvq31D
one "list_add corruption. prev->next should be next (ffff80002e459850), but was (null). (prev=ffff800018048e08)." and one "list_add corruption. prev->next should be next (ffff80002e459850), but was ffff800018048e08. (prev=ffff800018048e08)." this looks different from anything I've seen before, with its three traces. If adding those prints made that happen, this smells like a race condition | [15:17] | |
*** | benjiG has left | [15:20] |
javier__ | mort: indeed, it's quite possible that a linked list corruption may be caused by a synchronization bug | [15:22] |
svarbanov | mort: yes, looks like serious problem, and I don't think we tested the encoder with dma-buf on OUTPUT queue | [15:23] |
javier__ | so I would check that locks protect that linked list and if there some code path that's not taking it | [15:23] |
svarbanov | mort: does it work with mmap? | [15:23] |
mort | svarbanov: I haven't encountered anything like this when using mmap, no | [15:24] |
svarbanov | mort: probably because now (with dma-buf) everything is faster?
:0 | [15:24] |
mort | that's actually possible, yeah
I have a small test suite which, among other things, tests streaming from the camera to the encoder by itself though, and have never experienced any issues when running that | [15:25] |
svarbanov | mort: if you want you can try to protect buf_init with inst->lock, maybe there is a race between qbuf and streamoff | [15:27] |
mort | maybe | [15:28] |
svarbanov | mort: especially if the calls are from different threads | [15:30] |
mort | which they are, though I have my own mutex which makes sure no two threads are running ioctls against the same device at the same time
(though in my model, the encoder's OUTPUT and the encoder's CAPTURE will be two different "devices", so they will have separate mutexes) svarbanov: what else grabs inst->lock? Would it be enough to grab it just in start_streaming and stop_streaming, or should I grab it in buf_queue and maybe other functions too? | [15:31] |
svarbanov | mort: just a second, I'm trying to understand this: SFR message from FW: QC_IMAGE_VERSION_STRING=VIDEO.VE.1.8-00109, Err_Fatal - Z:\b\venus\utils\src\vbuffer.c:1319: | [15:37] |
mort | it doesn't seem to really correspond with an oops, and the only symptom when I get that message is that the other end doesn't receive any video (I haven't checked whether that's because the camera doesn't give me any frames or if the encoder doesn't work or if the frames go through the encoder, but just can't be decoded by the receiver) | [15:39] |
svarbanov | mort: I think it is consequence of the list_head issue | [15:40] |
mort | I find the error message really strange, it's a windows-style path which you don't expect in a linux driver, and I can't find anything online about a "vbuffer.c"
it might be part of a closed-source portion of the camera driver, I don't know | [15:41] |
svarbanov | mort: yes, that is assert() in firmware side | [15:41] |
mort | though the qcom camera isn't called venus, so it wouldn't be that
oh so the firmware logs a fatal error with no message, that's nice | [15:42] |
svarbanov | mort: could you add prints in session_register_bufs and session_unregister_bufs and print venus_buffer parameters like dma_addr and size?
and print them just before hfi_session_xxx functions mort: the SFR message could be because we are trying to unregister not-registered buffer | [15:44] |
mort | should I print that for each instance in the list_for_each_entry? | [15:48] |
svarbanov | mort: yes, for every hfi_session_set_buffers and hfi_session_unset_buffers | [15:49] |
mort | compiling and creating a new image now | [15:54] |
....... (idle for 31mn) | ||
svarbanov: sorry it took a while, https://gist.github.com/mortie/aedf2ea262542e213f0866fe41ad69a4
(TIL there's a limit of 10 pastes on pastebin per day without an account) sorry it took a while | [16:25] | |
svarbanov | mort: I have to leave now, will take a look later | [16:28] |
mort | Alright
I have to leave too, and won't have access to the hardware for a couple of weeks because christmas holiday I'd send you the binary necessary to reproduce it, but the only program which reproduces it depends on a couple other services, including some running on our servers, so that'd be impractical | [16:28] |
↑back Search ←Prev date Next date→ Show only urls | (Click on time to select a line by its url) |