#v4l 2016-08-15, Mon


<neg> hverkuil: ping [08:52]
<hverkuil> neg: pong [08:53]
<neg> hverkuil: about the DT parsing bug in rcar-vin, there is a fix for it in the Gen3 enablement series, but I suspect that won't be accepted for v4.9; do you want me to break that out into a separate patch? [08:55]
<hverkuil> which patch in that series is it?
I was planning to review that series this week.
[08:57]
<neg> [PATCHv2 09/16] [media] rcar-vin: rework how subdevice is found and bound [08:58]
<hverkuil> Is that a patch independent of the others?
OK, what I would prefer is that you make a patch series fixing the remaining rcar-vin issues so I can remove the old driver.
And then a v3 patch series on top of that for the rcar gen 3.
[08:59]
<neg> unfortunately not; it depends on a few cleanup patches before it. Maybe the best solution is if I split the series into a cleanup/prepare-for-Gen3 series, which we hopefully can get accepted for 4.9, and an add-Gen3-support series we can handle separately? [09:01]
<hverkuil> That will work for me.
Removing the old rcar-vin driver is high prio.
[09:01]
<neg> OK, I will post the cleanup/prepare part of the series later today and rebase the V4L2_FIELD_ALTERNATE patches on top of the cleanups [09:05]
<hverkuil> neg: after you post the new version, can you also check patchwork, marking old patch series as 'Superseded'?
It would help me a lot. There are a bunch of adv7180 patches there, and I am slightly confused which are still relevant.
Which is why I've been postponing reviewing them ;-)
[09:07]
<neg> ohh, I did not know I could do that; will do so for the current patches and in the future [09:12]
<hverkuil> neg: you can only do that for your own patches. [09:13]
...... (idle for 25mn)
<tiffanylin> Is it acceptable that we open a tunnel from kernel to user space... so a user space function could be called by the kernel? [09:38]
<nohous> i am pretty curious how you want to do that [09:39]
<larsc> no it is not [09:44]
<tiffanylin> For some algorithm computations we need the HW output as the computation input. The computation is implemented in user space
ok...I got it. it is not allowed.
[09:49]
<larsc> it's a security issue
it allows userspace to run arbitrary code in kernel mode
[09:53]
<tiffanylin> Even if the tunnel is just for sending a command/data to the user space process and returning immediately, with user space completing the computation and sending the result back? [10:00]
<nohous> again, how are you going to invoke the function?
but maybe this might be interesting for you http://www.ibm.com/developerworks/linux/library/l-user-space-apps/index.html?ca=dgr-lnxw97Kernel-APIs-P1dth-LX
[10:01]
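
The mechanism the linked article covers is the kernel's usermode-helper API. A minimal sketch, assuming a hypothetical helper binary path:

    /* Kernel side: spawn a user-space helper process. The helper path
     * is hypothetical; call_usermodehelper() is the standard API the
     * linked article describes. */
    #include <linux/kmod.h>

    static int run_codec_helper(void)
    {
            char *argv[] = { "/usr/local/bin/codec-helper", NULL };
            char *envp[] = { "HOME=/",
                             "PATH=/sbin:/bin:/usr/sbin:/usr/bin", NULL };

            /* UMH_WAIT_PROC: block until the helper process exits */
            return call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
    }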
<hverkuil> is this codec related? [10:04]
<nohous> hverkuil: yeah, i have the feeling someone wants to do floating point here... [10:05]
<tiffanylin> the kernel sends a command to libfuse, and libfuse will inform the user space application to read the data. The function is not directly invoked by a kernel process
nohous: this looks like what I need. I will check it ...
hverkuil: yes this is codec related
[10:07]
when HW and SW need to cooperate tightly to achieve one job, how do we implement this kind of driver in the kernel, or should this kind of driver just be put in user space... [10:17]
<larsc> depends on the scope of the work that needs to be done [10:18]
<awalls> (single-threaded) workqueue for in-kernel deferred tasks. Single-threaded if ordering needs to be preserved. [10:21]
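
A minimal sketch of the single-threaded workqueue pattern awalls suggests; all names here are illustrative:

    #include <linux/workqueue.h>

    static struct workqueue_struct *frame_wq;
    static struct work_struct frame_work;

    static void frame_work_fn(struct work_struct *work)
    {
            /* deferred per-frame computation runs here, in process context */
    }

    static int frame_wq_setup(void)
    {
            /* single-threaded, so queued work items execute in order */
            frame_wq = create_singlethread_workqueue("frame_wq");
            if (!frame_wq)
                    return -ENOMEM;
            INIT_WORK(&frame_work, frame_work_fn);
            return 0;
    }

    /* from the interrupt handler: queue_work(frame_wq, &frame_work); */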
<tiffanylin> larsc: how do we evaluate the scope? [10:24]
<hverkuil> tiffanylin: how often does this computation have to be performed? Once per frame? More? [10:25]
<tiffanylin> awalls: But the SW needs the HW output as input to the selection algorithm, or to update the probability table for the next HW trigger... user space cannot know it in advance [10:25]
<hverkuil> How much data does the computation need as input, and how much data does it produce? [10:26]
<tiffanylin> hverkuil, once per frame... [10:26]
<awalls> tiffanylin: Right. But you cannot do the whole computation in the interrupt handler, so you must defer the computation to one or more threads. Whether or not those threads are in user-space or kernel space is, as larsc mentioned, a matter of scope. [10:28]
<tiffanylin> Just an example: after decode, it may need to recalculate its probability tables using the CPU [10:28]
<awalls> If you need floating point arithmetic, you'll likely need to perform the computation in user-space [10:30]
<hverkuil> one option is that the driver sends an event when the input data is available, and that triggers the userspace routine. [10:30]
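
In user space, hverkuil's event suggestion would look roughly like this; the driver-private event type is a hypothetical placeholder:

    #include <string.h>
    #include <errno.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    static int wait_for_input_ready(int fd)
    {
            struct v4l2_event_subscription sub;
            struct v4l2_event ev;

            memset(&sub, 0, sizeof(sub));
            sub.type = V4L2_EVENT_PRIVATE_START; /* hypothetical driver event */
            if (ioctl(fd, VIDIOC_SUBSCRIBE_EVENT, &sub) < 0)
                    return -errno;

            /* blocks until the driver queues the event; alternatively
             * poll() for POLLPRI and then dequeue */
            if (ioctl(fd, VIDIOC_DQEVENT, &ev) < 0)
                    return -errno;
            return 0;
    }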
<awalls> hverkuil: would a mem2mem driver structure be something useful here?
* awalls is not that familiar with mem2mem
[10:31]
<hverkuil> perhaps. That's why I asked about the amount of data involved.
if it is fairly small, then I am thinking about controls. If it is a lot, then m2m might be an option.
One small problem: m2m is memory->HW->memory, this is HW->memory->HW.
[10:31]
<tiffanylin> hverkuil: it's a lot... about "one option is that the driver sends an event when the input data is available, and that triggers the userspace routine": the userspace routine then needs to set the data back before the next HW trigger [10:34]
<hverkuil> What is a lot? 1 kB, 10 kB, 100 kB, 1 MB? [10:35]
<awalls> What is the time interval until the next HW trigger? 1/50th of a second? Going out to a user-space thread that is not a real-time priority thread will make satisfying that latency requirement challenging. [10:38]
<tiffanylin> There are some issues we encounter when writing a kernel driver: 1. computation is not allowed; 2. if we have a proprietary feature that we don't want to open, how do we hide that part and only write the standard part in the kernel?
hverkuil: I think in some cases it may be up to 100 kB
[10:39]
<hverkuil> libv4l2 allows proprietary plugins. [10:40]
<awalls> There is no legal way to keep proprietary code secret if its object code is part of the kernel. [10:40]
<hverkuil> It was meant for such things (esp. auto gain/white balance etc. algorithms)
Is that input, output or both combined?
[10:40]
<tiffanylin> awalls: 1/60 s (60 fps). "Going out to a user-space thread that is not a real-time priority thread will make satisfying that latency requirement challenging." -> yes, we met this issue, and are trying to improve it
awalls: the object code would be part of the kernel, but it may expose some information about the HW
[10:41]
<awalls> The ivtv driver uses a kthread worker thread at realtime priority, and it does all of its low-latency work in the kernel.
If the object code is part of the kernel, then the GPL v2 requires you to make the source code available to whomever the object code is distributed to.
NVidia gets around this by requiring the end user to perform the link of the object code to the kernel.
* awalls tends to avoid NVidia hardware for this reason. Their drivers tend to make package management a hassle.
[10:45]
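
The ivtv pattern awalls mentions is roughly the following; helper names shifted a bit across kernel versions, so treat this as a sketch:

    #include <linux/kthread.h>
    #include <linux/sched.h>

    static struct kthread_worker worker;
    static struct task_struct *worker_task;

    static int start_rt_worker(void)
    {
            struct sched_param param = { .sched_priority = 99 };

            init_kthread_worker(&worker);
            worker_task = kthread_run(kthread_worker_fn, &worker, "vpu-worker");
            if (IS_ERR(worker_task))
                    return PTR_ERR(worker_task);
            /* SCHED_FIFO so the low-latency work preempts normal tasks */
            sched_setscheduler(worker_task, SCHED_FIFO, &param);
            return 0;
    }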
<tiffanylin> hverkuil: both combined. I have not surveyed proprietary plugins yet. But I did study the request API these days; it's not easy to make a request apply at the right time. In this case, we may need to provide a glue layer on top of the v4l2 API. In that case, do we still need to put the driver code in the kernel?
Since we want to use v4l2 as driver interface.
[10:50]
<hverkuil> the problem with doing a driver in userspace is that you can't use the V4L2 API, so you lose compatibility: any application that wants to use this has to link in proprietary libraries, so everything has to be custom.
the realtime issues remain, whether you put the whole driver in userspace or just the computation part, so moving everything to userspace won't help with that.
To come back to the amount of data: is it equally spread over input and output, or mostly one or the other?
[10:52]
<tiffanylin> awalls: "NVidia gets around this by requiring the end user to perform the link of the object code to the kernel." -> you mean we only upstream part of the driver code and provide object code for the end user to link if they want to use this driver?
hverkuil: mostly output
[10:56]
<hverkuil> tiffanylin: is the data transfer for the output DMA-based or a memcpy? [10:58]
<awalls> tiffanylin: The GPL v2 source code requirement is triggered by distribution of linked object files. NVidia's kernel driver module has a thin layer of GPL v2 source code around a binary blob. By requiring the end user to build and link the driver, NVidia doesn't need to distribute source code for the binary blob. [10:59]
<tiffanylin> It's a DMA buffer output by the HW... [10:59]
<hverkuil> (basically, would you use videobuf2-vmalloc or videobuf2-dma-*) [10:59]
<hverkuil> What about input, is there a DMA engine involved as well? [11:00]
<tiffanylin> hverkuil: sorry, I need to clarify: there are input/output/working buffers. The working buffer is a DMA buffer too, and it's a computation input...
hverkuil: we use videobuf2-dma-*
awalls: can this thin layer be upstreamed?
[11:03]
<hverkuil> tiffanylin: frankly, this sounds like an m2m device (except in the other direction).
So in userspace you would have a process waiting for input from the video capture device, it processes it and sends the result to the video output device.
This is probably metadata, not video data we're talking about?
[11:06]
The real-time aspects are the same, whether it is a kernel or userspace driver. Normally 1/60 s should be enough time, except in extreme circumstances. [11:13]
<tiffanylin> hverkuil: no, we are trying to put the core decode process in user space, but use v4l2 as the driver interface... v4l2 will send decode commands to user space.
Could we use the method that awalls mentioned?
[11:13]
<hverkuil> which method? the one nvidia uses?
Can you even do the SW decoding fast enough?
[11:14]
<tiffanylin> yes... [11:15]
<hverkuil> Well, I would 1) ask your legal department first, and 2) there is one huge disadvantage with that method: NVIDIA has to update their code for each new kernel release. It is basically an out-of-tree driver, so it will be your responsibility to keep it up to date. NVIDIA thinks that it is worth doing that, but I would say it is a last resort option. [11:17]
<awalls> From a marketing and sales standpoint, you can only do what NVidia does if you have hardware that people think is good enough to be worth the trouble. [11:17]
<tiffanylin> hverkuil: why is it related to SW decoding? Is it OK if we put driver control code in a binary blob? [11:17]
<hverkuil> The whole point of upstreaming is to get rid of that. [11:17]
<awalls> If you're offloading computation to the CPU for video compression/decompression, I'm guessing your hardware won't be worth the trouble for most users. [11:18]
<hverkuil> I think I still don't understand what is in kernelspace and what it is you need to do in userspace. [11:19]
<tiffanylin> SW decode up to 4K is not fast enough [11:19]
<hverkuil> Kernelspace drivers deal with hardware, so what does the hardware do? [11:20]
<tiffanylin> hverkuil: you could think of it as us running the whole VPU firmware in user space instead of on a dedicated low-end processor...
the IPI messages do not talk to the VPU firmware but to a user space process
By using this method, we could use the faster CPU for the SW computation part, with no need to expose the HW control. But I am not sure if it could be upstreamed.
[11:20]
<hverkuil> So, thinking of it as a VPU: you need to be able to send IPI messages to userspace (I'm still thinking events), and the VPU probably needs access to the video buffer(s), probably memory mapped? [11:25]
<awalls> 4K UHD is a resolution of 3840 pixels × 2160 lines (8.3 megapixels, aspect ratio 16:9) [11:25]
<nohous> hverkuil: off topic: any idea why VIDIOC_STREAMON would return EFAULT on a vivid device? [11:26]
<hverkuil> nohous: streamon has an int pointer as argument. Are you perhaps passing an int instead of an int pointer? [11:27]
<tiffanylin> hverkuil: yes. the input/output buffer is a dmabuf; it cannot be exported to another process and mapped into its process space [11:27]
<awalls> I wouldn't want to copy 8.3 Mpixels back and forth between kernel and user-space; it's a waste of CPU. It would have to be memory mapped. [11:27]
<nohous> hverkuil: ofc, i pass null, sorry :-) [11:27]
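
For reference, VIDIOC_STREAMON takes a pointer to the buffer type, so the correct call is:

    int type = V4L2_BUF_TYPE_VIDEO_CAPTURE;

    if (ioctl(fd, VIDIOC_STREAMON, &type) < 0)  /* &type, not type or NULL */
            perror("VIDIOC_STREAMON");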
<hverkuil> tiffanylin: it is very hard to come up with a solution without knowing the details of what that userspace process needs to do.
The tools you have at your disposal with V4L2 (as it stands today) that seem related to this are events, controls and vb2 (mem2mem, although you need it in the other direction: HW -> userspace -> HW).
[11:30]
<awalls> HW -> DMA -> CPU transforms frame -> DMA -> HW [11:32]
<hverkuil> Also custom ioctls are possible, but how 'upstreamable' they are will depend on the details. [11:33]
<awalls> Seems like an ARM processor on the video card with local memory could beat that timeline. [11:33]
<hverkuil> The whole point of using a separate ARM for a VPU is that it can be realtime. So I am worried about putting all this on the CPU.
It seems a crazy design to me.
[11:35]
<awalls> I'm guessing it's a cost thing.
It worked so well for WinModems....
[11:35]
<tiffanylin> hverkuil: I don't get it. The ARM is faster than a low-end processor; running on the CPU can improve performance... [11:39]
<hverkuil> The CPU is running lots of things, so ensuring that a realtime deadline is met in time is much harder. [11:39]
<awalls> tiffanylin: Our question, and you don't have to answer, is why doesn't the card have its own VPU? [11:40]
<hverkuil> Linux is not a hard realtime OS. [11:40]
<hverkuil> It's not about the performance, it is about realtime behavior.
There are things you can do to improve it and actually make it work, but it is always a big hassle.
[11:41]
<awalls> hverkuil is correct. Your users are going to have varying user experiences unless they know how to tune the system for (near) realtime performance. (Most users won't know how). [11:42]
<hverkuil> Isolating the ARM core so it only handles the 'VPU' processing would be one approach.
If one ARM core is enough, then that's probably the easiest to configure.
HW designers never really think about that :-(
That said, if you don't have all that much to do and you have to do it within 1/60 s, then that usually works fine. But if there is a sizable amount of work to be done, then it gets really tricky.
if it needs to do SW decoding, then I don't think you can do that with a regular linux setup. You'll need to configure the OS very carefully. Not something just anyone can do.
We actually did something like that, and we had to consult with linux real-time experts to get it working.
[11:42]
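
One common recipe for the core isolation hverkuil describes: boot with e.g. isolcpus=3, then pin the 'VPU' process to that core. A sketch (the core number is hypothetical):

    #define _GNU_SOURCE
    #include <sched.h>

    /* pin the calling process to an isolated core (core 3 here,
     * matching a hypothetical isolcpus=3 boot parameter) */
    static int pin_to_cpu(int cpu)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            return sched_setaffinity(0, sizeof(set), &set); /* 0 = this process */
    }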
* awalls imagines a V4L2 driver that moves data over to a GPU. [11:49]
<tiffanylin> When I saw "https://linuxtv.org/news.php?entry=2015-11-12.mchehab", I saw "CPUs are getting faster in the ARM world. The trend is to implement lower-level hardware codecs that require stream parsing on the CPU". I thought you agreed with offloading processing to the CPU [11:49]
<hverkuil> awalls: that already exists: capture straight into a texture buffer with dmabuf.
That refers to 'packetizing' the data. E.g. the hardware still does the video encoding/decoding, the SW just adds the headers and creates a proper compressed stream.
It does NOT do actual video encoding or decoding.
it may set up codec parameters that the HW codecs use.
My understanding is that the userspace code does actual video decoding. Or am I wrong?
[11:49]
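
The dmabuf texture-capture path hverkuil mentions comes down to exporting a capture buffer as a dmabuf fd with VIDIOC_EXPBUF, roughly:

    #include <string.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    /* export capture buffer 'index' as a dmabuf fd; the fd can then be
     * imported as a GPU texture (e.g. via EGL's dmabuf import extension) */
    static int export_buffer(int fd, unsigned int index)
    {
            struct v4l2_exportbuffer expbuf;

            memset(&expbuf, 0, sizeof(expbuf));
            expbuf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
            expbuf.index = index;
            expbuf.flags = O_CLOEXEC;
            if (ioctl(fd, VIDIOC_EXPBUF, &expbuf) < 0)
                    return -1;
            return expbuf.fd;
    }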
<awalls> hverkuil: WRT GPU: neat! In this case it would work. Round trip to the GPU and back wouldn't kill the timeline.
[11:52]
<tiffanylin> hverkuil: No, it's HW decode, but the SW needs to do some computation [11:53]
<hverkuil> OK, that's better. I misunderstood.
So SW needs an event to trigger it, it needs a video output device to send back the data, what else?
Possibly controls for the input data (it wasn't much, you said, so that should be OK).
[11:54]
<tiffanylin> nohous: Is there any driver using this method? http://www.ibm.com/developerworks/linux/library/l-user-space-apps/index.html?ca=dgr-lnxw97Kernel-APIs-P1dth-LX [11:56]
<hverkuil> A pointer to a memory mapped video frame? [11:56]
<tiffanylin> hverkuil: sorry, i don't understand what you mean. [11:59]
<awalls> I have an out-of-tree driver that does that. The userspace application just sets up OMAP-3 I/O pin muxing though. [11:59]
<nohous> tiffanylin: i doubt that
tiffanylin: why don't you invert the control? Meaning the user space app would set up the V4L2 streaming, capture the data via mmapped buffers, do the processing, and return the buffers back to the driver?
as for the GPU driver: we currently allocate the buffers via cudaHostAlloc(), pass this to our driver via USERPTR, and then it's directly accessible for the GPU
this works for HSA GPU/CPU SoC though, for discrete graphics card you'd need the card that supports gpu direct (nvidia case) or its amd equivalent
[11:59]
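
The inverted-control loop nohous describes maps onto the standard V4L2 MEMORY_MMAP streaming sequence. A minimal sketch, error handling omitted:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/videodev2.h>

    /* minimal capture loop: REQBUFS, mmap, pre-queue, STREAMON, then
     * DQBUF/process/QBUF per frame */
    static void capture_loop(int fd)
    {
            struct v4l2_requestbuffers req;
            struct v4l2_buffer buf;
            void *mem[4];
            unsigned int i;
            int type = V4L2_BUF_TYPE_VIDEO_CAPTURE;

            memset(&req, 0, sizeof(req));
            req.count = 4;
            req.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
            req.memory = V4L2_MEMORY_MMAP;
            ioctl(fd, VIDIOC_REQBUFS, &req);

            for (i = 0; i < req.count; i++) {
                    memset(&buf, 0, sizeof(buf));
                    buf.type = req.type;
                    buf.memory = req.memory;
                    buf.index = i;
                    ioctl(fd, VIDIOC_QUERYBUF, &buf);
                    mem[i] = mmap(NULL, buf.length, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, buf.m.offset);
                    ioctl(fd, VIDIOC_QBUF, &buf);   /* pre-queue */
            }
            ioctl(fd, VIDIOC_STREAMON, &type);

            for (;;) {
                    memset(&buf, 0, sizeof(buf));
                    buf.type = req.type;
                    buf.memory = req.memory;
                    ioctl(fd, VIDIOC_DQBUF, &buf);  /* blocks until data ready */
                    /* process mem[buf.index], buf.bytesused bytes */
                    ioctl(fd, VIDIOC_QBUF, &buf);   /* return buffer to driver */
            }
    }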
<neg> question about patchwork: if I want to remove a series, since I will split and repost it as two series, do I mark the patches as Obsolete or do I use another state? [12:04]
<hverkuil> neg: I usually use Superseded. [12:05]
<neg> thanks [12:08]
<tiffanylin> nohous: using ioctls means user space needs to know when to issue an ioctl to get/set data to the HW. This will complicate the user space application. In this case, we may need to provide a glue layer; then the driver interface is the glue layer, not the v4l2 layer.
Our current codec uses the same method: allocate a DMA buffer and pass it to our driver via dmabuf, and it is directly accessible to the HW.
nohous: in the nvidia case, it's a card, not a SoC?
[12:09]
<nohous> tiffanylin: well, if you think hacking the kernel in this way is actually easier... what is the problem with creating a thread with prequeued buffers, which will only get dequeued once there is data ready?
(user space thread)
[12:11]
<hverkuil> good question. [12:12]
<nohous> tiffanylin: GPUDirect makes sense for PCIe cards with discrete video memory rather than for SoCs where the GPU typically shares memory with the CPU
if that's what you're asking
[12:13]
<tiffanylin> nohous: a device run will inform user space there is BS ready to decode (and all the info it needs to trigger the decode); after the user space thread completes the decode, it sends a command back to the kernel and triggers buffer done. [12:18]
<nohous> tiffanylin: what's BS? if you really don't want to use v4l2 (which gives you some nice frameworks like videobuf2), then create a char device with custom ioctls and a poll implementation [12:20]
<tiffanylin> bitstream data [12:20]
<nohous> the app can then select() on the file descriptor, and once there is bitstream data ready, you use some mechanism similar to v4l2's MEMORY_MMAP / MEMORY_USERPTR to get the data out of kernel space (no copy required!) [12:21]
<tiffanylin> I want to use v4l2; that's why I came up with this strange solution [12:21]
<hverkuil> So again, isn't this just an m2m device? Except running in the opposite direction? [12:22]
<tiffanylin> gstreamer/chrome use v4l2. It's easier to integrate with user space applications... [12:22]
<nohous> so what's the trouble? it maps quite well to v4l2; you can always define a custom pixel format for your bitstream
hverkuil: do you mean a pipeline like capture->mem->process->playback?
[12:22]
<tiffanylin> It's an m2m device? It's not running in the opposite direction; it just needs the decode to output some meta for the next decode... [12:23]
<hverkuil> well, you get data from the kernel, you process it and you give it back. (with a normal m2m device you give data to the kernel, which processes it and gives it back). [12:25]
<tiffanylin> The problem is 1. we need SW intervention, and 2. we don't want to expose some HW info, such as how to trigger the HW.
the output/capture fds will be sent from user space to the v4l2 kernel driver, and then on to the user space decode process, which triggers the decode.
application (player) -> v4l2 driver -> user space decode process -> v4l2 driver -> application (player)
[12:25]
<hverkuil> So the application sees a regular V4L2 mem2mem decoder device. [12:27]
<tiffanylin> yes... that's what we want... [12:28]
<hverkuil> and the userspace decode process uses another V4L2 mem2mem device, but transferring data in opposite directions.
So there are two video nodes, one for applications, one for the decode process.
[12:28]
<nohous> hverkuil: ...and the two devices are looped in the driver [12:29]
<hverkuil> right.
(the 'inner' loop doesn't have to be a m2m device, it can also be two separate video nodes, one for capture, one for output. It depends on the details which is best)
[12:29]
<tiffanylin> the user space decode process will not use another v4l2 mem2mem device. it just parses the BS, triggers the decode (with the previous decode output info), gets the decode output metadata (keeps it) and the capture, and returns the capture buffer to the v4l2 kernel driver... [12:31]
<hverkuil> But where does it get the BS from? It doesn't come out of nowhere. [12:32]
<tiffanylin> the BS is in the OUTPUT buffer... [12:32]
<nohous> tiffanylin: implementing the user space decode process as a v4l2 client seems the easiest solution to me as well [12:32]
<hverkuil> Or is this the BS that the application provides? [12:32]
<tiffanylin> the decoded frame is in the capture buffer...
yes... the application provides it
[12:33]
<hverkuil> OK, so it is:
scratch that.
I think this is a v4l2 plugin.
When the application passes the BS to the kernel it is intercepted by the plugin, it parses the BS, sends the data on to the kernel, and triggers the HW etc.
Basically if you create a v4l2 plugin the plugin sits between the application and the kernel and intercepts all v4l2 ioctls.
It's seamless for the application, and I think it is exactly what you want.
[12:33]
<nohous> hverkuil: the plugin is in user space? something like libv4l? [12:37]
<tiffanylin> nohous: In this case, we need to check whether a send-command from kernel to user space can be upstreamed... [12:37]
<hverkuil> v4l-utils/lib/libv4l-mplane is a simple plugin example. [12:38]
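
A libv4l2 plugin is a shared object exporting a libv4l_dev_ops table; going from memory of the libv4l-mplane example, the skeleton is roughly:

    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include "libv4l-plugin.h"

    /* called when libv4l2 opens a device; return a private handle, or
     * NULL if this plugin does not want to handle the device */
    static void *plugin_init(int fd)
    {
            return calloc(1, 1);    /* placeholder private state */
    }

    static void plugin_close(void *priv)
    {
            free(priv);
    }

    /* every ioctl the application issues passes through here, so the
     * plugin can intercept e.g. the queued bitstream buffers, parse
     * them and drive the HW, as described above */
    static int plugin_ioctl(void *priv, int fd, unsigned long cmd, void *arg)
    {
            return ioctl(fd, cmd, arg);     /* pass through by default */
    }

    PLUGIN_PUBLIC const struct libv4l_dev_ops libv4l2_plugin = {
            .init = &plugin_init,
            .close = &plugin_close,
            .ioctl = &plugin_ioctl,
    };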
<tiffanylin> hverkuil: how about the capture side? after decode, we need to do some computation for the next decode [12:38]
<nohous> hverkuil: that means that the user application needs to use libv4l
[12:39]
<hverkuil> nohous: yes, but that's OK. Many applications do that already, although not gstreamer.
tiffanylin: as long as such a send command is well documented it is OK.
[12:39]
<nohous> hverkuil: could in theory vivid's loopback feature be used for something like: (non-libv4l app) -> (vivid sink) -> (vivid source) -> (the plugin) -> (libv4l wrapper capture and playback app) -> (real hw out)? [12:41]
<hverkuil> No, because vivid doesn't handle compressed formats. [12:42]
<nohous> hverkuil: i tried the v4l2loopback out-of-tree driver on kernel 4.7 yesterday, but without much luck [12:43]
<tiffanylin> hverkuil: So the original proposal, application -> v4l2 driver -> user space decode process -> v4l2 driver -> application, is not allowed? [12:44]
<hverkuil> v4l2loopback is not supported by us. Mauro doesn't like it, and you can do most of what v4l2loopback does with vivid (although it is a bit more cumbersome)
tiffanylin: it is allowed, but I think using a plugin is a better fit to what you want to do.
(Ideally we'd be discussing this face-to-face and with a big whiteboard!)
It depends so much on the precise details what the best approach is.
[12:45]
<tiffanylin> yes. it's hard to explain clearly...
There is also the effort to consider: we could directly reuse stable code without rewriting for a new architecture...
[12:48]
.... (idle for 16mn)
I just heard that GPU drivers have the same issue. They cannot upstream because there is no open source user space application using their driver... [13:05]
<hverkuil> I know drm/kms is really strict.
Depending on the details you may run into problems with v4l2 as well. Basically the driver should be usable by only using open source code.
It doesn't have to be high quality, but it should be possible to test it.
So there should be a 'poor-man's' implementation available.
[13:08]
<tiffanylin> I see. Thanks hverkuil, nohous, awalls... [13:19]
..................... (idle for 1h44mn)
***hverkuil has left [15:03]
.......... (idle for 47mn)
awalls has left [15:50]
