<!-- Some styling for better description lists --><style type='text/css'>dt { font-weight: bold;float: left;display:inline;margin-right: 1em} dd { display:block; margin-left: 2em}</style>

   bbrezillon: <u>Kwiboo</u>: great!
   Kwiboo: <u>bbrezillon</u>: there are still some issues on some of the h264 compliance reference samples, but the main "real content" samples seem to work, I also need to sort out what other changes besides scaling list are really needed
   bbrezillon: <u>Kwiboo</u>: I'm curious, how did you come up with the new zig_zag matrix
    <br> (the changes you made here: https://github.com/Kwiboo/linux-rockchip/commit/c1e6d92dd3857e521fbc4b6dd2ee1a0a4459ae8b#diff-7af29b54708399c486888c6eb96c3d59L193)
   Kwiboo: I looked at the mpp and hantro g1 ref code, the hw needs the information in inverse msb/lsb order for every 32-bit word, e.g. [3] &lt;&lt; 24 | [2] &lt;&lt; 16 | [1] &lt;&lt; 8 | [0]
   <br> but we still need to come to a conclusion on what order userspace should send scaling lists in, I will include details in a mail later, need to focus on my day job now :-)
   <br> and of course I meant: [0] &lt;&lt; 24 | [1] &lt;&lt; 16 | [2] &lt;&lt; 8 | [3]
   bbrezillon: looks like ChromeOS is relying on this specific zig_zag table
   <br> <u>Kwiboo</u>: ^
   <br> so yes, we'd have to define the order of the scaling list I guess
   ***: jmleo has quit IRC (Ping timeout: 246 seconds)
   ndufresne: Kwiboo, hantro g1 ref code ? I don't think we have that one ;-P
   <br> bbrezillon, ^
   bbrezillon: no, we don't
   <br> <u>ndufresne</u>: I found a gstreamer plugin for g1 on the atmel (linux4sam) repo
   <br> but doesn't look like reference code from hantro to me
   pH5: <u>bbrezillon</u>: https://github.com/linux4sam/g1_decoder
   bbrezillon: also, I guess Atmel/Microchip will be interested in this V4L hantro driver too :)
   pH5: I think SAMA5D4 has variants that contain a G1.
   <br> And the equivalent i.MX8M code is at http://www.nxp.com/lgfiles/NMG/MAD/YOCTO/imx-vpu-hantro-1.11.0.bin
   ndufresne: pH5, thanks, more ref code is good
   bbrezillon: <u>pH5</u>: thx
   ndufresne: pH5, we are trying to figure out how to trigger the decoding of the second and following slices without having to copy / extend the current buffer
   marcodiego: does getting the hantro driver to work with rockchip SoC's means it will also work with NXP IMX8 family of devices?
   ndufresne: pH5, right now, if we replace the buffer with another, we get an ASO, but if we append the data, then it continues
   <br> marcodiego, there might be small adaptations needed, the version on RK3288 is 3.X while the version of the chip on the i.MX8 is 5.X
   <br> we also need a proper dtb interface to configure per SoC, so there's still a bit of work, pH5 will still have to work a little ;-P
   marcodiego: ndufresne, Thanks! Was eager to know that. Thanks!
   pH5: note that while the NXP sources are BSD licensed, both contain somewhat contrasting release notes by verisilicon
   marcodiego: I'm a complete outsider but I'm really happy to see that in the future some allwinner, rockchip and nxp SoC's will have both GPU and VPU opensource drivers without needing proprietary firmware. Thank all for that!
   ndufresne: pH5, is it really BSD? there is a note that "binary must come with the above licence"
   <br> and it starts with "This software is confidential and proprietary and may be used only as"
   <br> anyway, it's unlikely at this stage that we will come close to considering using any code from there
   <br> pH5, bbrezillon: but this ref code seems to exercise the per-macroblock error concealment stuff we noticed recently
   <br> First impression is that this ref code runs the IP in RLC mode, while we run in VLC mode
   bbrezillon: <u>ndufresne</u>: https://github.com/linux4sam/g1_decoder/blob/master/software/source/h264high/h264hwd_asic.c#L810
   ndufresne: wow, they also implement the multi-core thing
   <br> bbrezillon, oh, so you don't trigger the decode, you simply clear the IRQ_BUFFER_EMPTY
   bbrezillon: yes
   <br> that's what I suspected
   ndufresne: ah, that kind of make sense ;-P
   pH5: <u>ndufresne</u>: I'm not that far yet, I have v4l2-request-test running for mpeg-2, but vaapimpeg2dec fails with -EBUSY on S_FMT(V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE)
   bbrezillon: though I remember trying that
   <br> and it didn't work
   pH5: <u>bbrezillon</u>: git://git.pengutronix.de/pza/linux.git hantro/imx8m-wip
   ndufresne: pH5, yes, I started a rewrite of the VA driver in my spare time, the VA driver cannot work
   pH5: I have split the module into hantro-common, rockchip-vpu and imx8m-vpu, but I'm not too happy about all the exports.
   <br> I wonder if it would be better to just have a single module hantro-vpu that contains two platform drivers.
   hverkuil: I just want to thank all of you working on this! It's fantastic to see all the ongoing codec driver work!
   bbrezillon: <u>pH5</u>: +1 for the single module
   <br> you can then embed per-SoC code in the module based on what's enabled in .config
   <br> but the end result should be a single module supporting the SoCs you've selected
    <br> <u>pH5</u>: hm, maybe I misunderstood what you've done
   <br> can't we have a single platform_driver
   <br> with per-SoC probe/remove hooks?
   <br> anyway, will try to have a look at the code
   pH5: <u>bbrezillon</u>: If we are ok with a bit of if-then-else in the probe function, that would be possible as well. But the pm ops are different: we have to manually reset and set the fuse registers on runtime resume on i.MX8M because apparently the power domain controller doesn't propagate resets into the cores. And it might even make sense to keep the m2m ops separate as well, as on i.MX8M we have per-core external clock gates.
   bbrezillon: I was more thinking about abstracting away SoC integration details using a per-SoC ops struct
   <br> and then attaching this struct to a DT string
   <br> in the of_device_id table
   <br> but maybe it doesn't make sense here
   pH5: <u>bbrezillon</u>: would have to weigh the added indirections against duplicated probe/remove, it might make sense. I'll give that a try.
   <br> <u>bbrezillon</u>: why is there only a single m2m_dev btw, doesn't that mean G1 and H1 can't run in parallel?
   ndufresne: pH5, on RK3288 we believe they can't
   <br> but it's a bit obfuscated
   <br> pH5, also, RK3288 is single core, how many cores do you have on the G1/G2/H1 ?
   bbrezillon: <u>pH5</u>: that's what I've been told
   <br> tfiga should know more about this limitation
    <br> <u>pH5</u>: regarding the indirection, for anything that's not in the hot path it shouldn't matter (like probe/remove/PM)
    <br> might have an impact for the -&gt;run() function, but we already have so many indirections that I'm not sure it will make a huge difference
   pH5: Hm, at least on i.MX8MM (which I don't have) G1 and H1 seem to be completely separate instances (they have separate kernel drivers). I'm not sure if G1 and G2 can run at the same time or not.
   bbrezillon: <u>pH5</u>: IIRC, it was specific to RK integration
   ndufresne: on RK, they call the VPU a combo, and it contains two IP glued together
   <br> what the doc says is that some SRAM is shared
   mchehab: b-rad: ok, I'll skip the pending patches from you in order to let you send me a PR next week
   pH5: <u>ndufresne</u>: the G1 synth config 1 register (SWREG(54)) here reads 0xe4dc8cc0, with sw_dec_core_am at bits 9:7, so that should be a dual core decoder according to the reference manual.
   <br> on the other hand the i.MX8MQ NXP kernel code seems to have some "core request" bit in the control block that is switched to the active core. this code is #if 0-ed in the i.MX8MM NXP kernel code...
   ***: benjiG has left
   Kwiboo: <u>ndufresne</u>: yep with the hantro g1 ref code I meant the code pH5 linked to, especially the imx-vpu-hantro package as it is newer and contains more codecs
   ndufresne: bbrezillon, just a note about the slice thing, I think it's great we figured this out, since it's nicer from userspace to pass each slice in a separate buffer, but it does not mean we can expose low-latency decoding
   <br> bbrezillon, what I mean is that if we implement CODEC multiplexing in the driver, we need to fairly schedule work over each "session", but if we allow starting a decode of a slice before the full slice is in, we could be blocking all other sessions
   <br> so unless we find a way to fully save and restore the state, the decoding trigger should still happen when we have all slices in, so effectively with 1 slice of latency, or based on the flag
    <br> pH5, this work may interest you too, https://gitlab.freedesktop.org/gstreamer/gst-plugins-good/merge_requests/209
    <br> pH5, we're trying to negotiate and apply vertical padding between two V4L2 drivers
    <br> so basically our capture driver is now capable of capturing into larger frames; the S_SELECTION is used to configure the real video capture size
   pH5: <u>ndufresne</u>: yes, thanks! that would make https://gitlab.freedesktop.org/gstreamer/gst-plugins-good/merge_requests/126 obsolete?
   ***: flok has left
   pH5: <u>ndufresne</u>: the coda fix for !132, commit be9dd094e8f3 ("media: coda: remove mask from decoder h.264 level control") is in v5.2-rc1.
   bbrezillon: <u>ndufresne</u>: okay, got per-slice decoding working
   ndufresne: nice!
   bbrezillon: but there's no way you can save/restore the state
   ndufresne: so scheduling-wise, it means we can only multiplex per frame, right ?
   bbrezillon: that means nothing can go in when the engine is waiting for a new slice of the same frame
   ndufresne: yep, hence my comment
   bbrezillon: yes, I was reacting to your comment
   ndufresne: ok, well, if that's how the HW works, that's how we will implement it ;-P
   <br> userspace does not need to be aware
   <br> except for one thing
    <br> If you send a slice and don't set the flag, the driver will return the buffer, but there will be no way to continue decoding
    <br> so basically, the ffmpeg code, which requeues the output on every slice, cannot work
   bbrezillon: I really think we should consider per-frame decoding
   <br> when the HW is not suitable for per-slice decoding
   ndufresne: make sense, but I think being able to pass each slice in separate buffers should still be possible
   bbrezillon: yes
   ndufresne: so in the end, the driver will accumulate until the flag is dropped or the timestamp changes, to start decoding
   bbrezillon: as long as I'm passed all the slices at once
   <br> or if we go for the flag approach
   ndufresne: that will guarantee we can do multiplexing over 1 core too ;-P
   <br> but it also means we need to document that not setting the flag at all is invalid
   bbrezillon: exactly
   ndufresne: and that's the bit I hate about this reverse logic
   bbrezillon: I don't think we should relax the rule
   ndufresne: while if we had a timestamp and an end-of-frame marker, just setting the timestamp would work
   bbrezillon: so, you want to let the driver accumulate/queue slice data
   ndufresne: yes, slice could be queued as soon as they are available, and the decoder would start when a frame is complete
   bbrezillon: that means using a bounce buf in our case, which I'm fine with, just want to be sure we're on the same page
   ndufresne: which is a) when the flag is dropped or b) when the timestamp changes (in which case a frame is all slices that were queued before)
   bbrezillon: <u>ndufresne</u>: that also means that not passing the flag will lead to undefined behavior
   <br> it might work on some HW but not on others
   <br> (I mean, not passing the flag when it should actually be set, like on the first slice of multi-slice frames)
   ndufresne: hverkuil, ^ one side effect of a flag that keeps the buffer queued
   bbrezillon: <u>ndufresne</u>: at least we solve the start_code issue with this approach =&gt; since we'll anyway have to copy the buffer, it's easy to add start codes driver-side
   <br> though it's probably better if the uAPI is clear about this aspect too
   Kwiboo: bbrezillon, ndufresne: using a bounce buffer and slice-only decoding sounds like a bad idea; for the ffmpeg use case I would prefer to use frame decoding if possible. I would suggest adding 2 cap flags: frame decoding supported/required; in ffmpeg I would try to use frame decoding if the driver reports it supported
   ndufresne: bbrezillon, I'm not sure why you say we need a copy with this approach, the principle of the v4l2 queues is that you can queue data, and the driver can use multiple inputs to produce one output
   <br> unlike DRM, there is no 1:1 relationship in processing
    <br> how it should work is that userspace should keep queueing buffers, signalling the end of a frame, or the start of a frame, to the driver
    <br> the driver should be left in the streamon state pretty much all the time
   bbrezillon: <u>ndufresne</u>: that's clearly not how the FFmpeg v4l2 backend works
   <br> and if we go for this approach we'll have one interrupt per slice with no real benefit
   ndufresne: bbrezillon, you mean how it's currently implemented or you mean about the accel framework ?
   bbrezillon: how it's currently implemented
   ndufresne: well, some frameworks do slice the buffer that way, so we have a real benefit
   <br> think of VA drivers
    <br> the irq cost seems to be overly considered in our discussions imho
   bbrezillon: no, I mean, I understand the low-latency benefit when decoding happens as soon as we get the slice data
   <br> but here that's not the case
   <br> you have to wait for all slices before starting the decoding process
   ndufresne: indeed, but userspace still needs to split the slice NALs, which can have other NALs in between
   <br> bbrezillon, my point is that the driver should make the decision, and userspace should be written in a way that it does not expect anything to happen on intermediate slices
   <br> now, in our case, unless we give exclusive access to the HW, low latency is just a no-go; the driver should never start a decode job unless it has all the slices in the queue
   bbrezillon: yes, I was only talking about driver implem here
   ndufresne: but saying that userspace must pack all slices in one buffer, seems like another approach
   bbrezillon: no, I say someone has to do it
   <br> well, has to is probably too strong
   ndufresne: what I understood here is that you can handle having multiple slice buffers with no start-code with this IP
    <br> that's exactly what I would need for optimal VA-API in the driver
   bbrezillon: didn't test that yet
   <br> right now I have start code added
   ndufresne: but you tested multiple buffers with start code
   bbrezillon: yes
   ndufresne: ok, well, it seems quite evident that this is the right way to handle slices with no start-code
   bbrezillon: I'm about to test multiple buffers without start code
   ndufresne: so indeed, if we make a format with start-code, and one without start-code, it would make a lot of sense to allow packing all slices in the former, but not in the latter
   bbrezillon: if we pack for the former, we can pretty much do the same for the latter since we can add start codes as part of the "copy to bounce buf" step
   ndufresne: ezequielg, is there a KMS driver for virtme/virtio ? I tried the --graphics, but don't have an fb0 or dri out of it, I found vkms, but that does not do what I want really
   ezequielg: it's virtio_gpu
   <br> this is a command from my bash history, just as an example `virtme-run --cwd . --script-dir test/ --kdir ~/builds/virtme-x86_64 --kopt vivid.node_types=0x1 --kopt vivid.n_devs=1 --qemu-opts  -vga virtio  -display gtk,gl=off -m 2000`
   <br> and your kernel needs DRM_VIRTIO_GPU=y
   ndufresne: ok, I had that, didn't know the qemu-opts bits
   ezequielg: which would give you /dev/dri/cardX with KMS caps
   <br> right
   <br> it's all about chanting the proper thing
   ndufresne: so I should probably remove vkms
   <br> so I guess it's qemu --help for more details ;-P
   bbrezillon: <u>ndufresne</u>: back to the FFmpeg implem limitations you mentioned earlier
   <br> even if it's not efficient, I think it's valid to queue output bufs and assume they can be dequeued when the decode operation is "done"
   <br> since "done" in our case would be "queued but not yet decoded"
   <br> not sure what happens
   ndufresne: ezequielg, was virtio_gpu usable for you using kmssink or other kms app ?
    <br> the dumb buffers cannot be mapped ....
   ezequielg: it is.
   <br> and for dmabuf, iirc I added the missing pieces to virtio driver.
   <br> some time ago, should be mainline now
   <br> iirc, my goal was vivid-&gt;kmssink.
   ndufresne: bbrezillon, valid, not sure, that seems pretty odd, why would you mark the capture buffer done, if you don't consume the related input buffer ?
   ezequielg: is it not working?
   ndufresne: I'm on v5.1
   <br> is that too old ?
   bbrezillon: <u>ndufresne</u>: the capture buf is not done, but the output buf would be
   <br> at least that was my understanding
   ndufresne: but if the capture buf is not marked done, then the ffmpeg code as of now will freeze on the second slice
   <br> bbrezillon, the output buf can only be done if you use a bounce buffer, once you're done copying it
   bbrezillon: yes, but right now it's dequeuing the output buf and reusing it
   ndufresne: it's also dequeuing the capture, and queueing it back
   bbrezillon: no, not with the hold-buf patch
   ndufresne: because cedrus marks the capture buffer done on each slice being decoded
    <br> ah, I meant it used to
   <br> bbrezillon, I don't want userspace to rely on driver making a copy
   bbrezillon: I'm concerned about the output buf
   ndufresne: ffmpeg should only dequeue the slice when the decoding is done
   bbrezillon: so output buf should be retained too
   ndufresne: driver should retain buffers only if needed, if you have a bounce buffer, it should let it go as soon as the internal work queue is done copying it
   bbrezillon: I mean all output bufs that are attached to frame that's not been decoded yet
   ndufresne: It should not copy it on the userspace QBUF call, QBUF is not supposed to do that
   <br> as I said, if you zero-copy from these buf, they should be retained
   bbrezillon: so userspace should be prepared to have enough output bufs to decode a frame with X slices
   ndufresne: ezequielg, found this in the logs "drmWaitVBlank failed: Operation not permitted (-1)"
   ezequielg: what's the pipeline?
   ndufresne: gst-launch-1.0 videotestsrc ! kmssink driver-name=virtio_gpu
   <br> if I force-modesetting=1, then I only get black frames ...
   <br> interesting, if I use vivid, I get images, but it spams "failed to map scatterlist"
   ezequielg: hm ok
   <br> let me try
   ndufresne: so basically, writing to virtio's own dmabuf fails, but external dmabuf kind of works
   <br> and the WaitVBlank legacy API is missing, but I guess that would maybe go away with atomic APIs
   <br> weird, v4l2src io-mode=dmabuf-import  works without the spam
   <br> I guess some more work is needed ;-P
   bbrezillon: <u>ndufresne</u>: per-slice decoding without start code works as expected
   ndufresne: bbrezillon, out of curiosity, if you pack all slices without start-code, does it fail as I expect ?
   bbrezillon: it fails yes
   <br> I had already tried that
   ndufresne: thanks, so what the manual does not document is that if you do start_code_e = 0, you also need to split things up per slice
    <br> I think we really have two formats with two semantics here
    <br> this is starting to make sense
    <br> bbrezillon, the annoying bit is that the normal _H264 format has the same semantics as what is ideal for the start_code_e=1 case
   ezequielg: <u>ndufresne</u>: if it works with some path, then it works \o/
   ndufresne: bbrezillon, oh, except for one thing, we don't know what happens if you have a full frame with the extra NALs in there
   <br> ezequielg, lol
   bbrezillon: <u>ndufresne</u>: you mean, as if the bitstream was passed as-is by userspace?
   ndufresne: yes, as-is in the sense that it's just a full frame with the auxiliary NALs
   <br> I wonder if our IP won't just simply skip it
   bbrezillon: possibly
   <br> it actually parses the bitstream to some extent
    <br> so I wouldn't be surprised if it was able to skip non-slice NALs
   ndufresne: I guess my question is whether the scanning is NAL-type aware, since that's all you need
   <br> I must admit, I didn't think about this possibility
    <br> ezequielg, ok, next step was "gst-launch-1.0 v4l2src io-mode=dmabuf-import ! v4l2fwhtenc output-io-mode=dmabuf ! v4l2fwhtdec ! kmssink driver-name=virtio_gpu force-modesetting="
   <br> not quite working yet
   <br> there's a nasty assert on buf size overflowing, and then pipeline failing
    <br> works better if I do export ! import instead
   bbrezillon: <u>ndufresne</u>: ok, so tomorrow I'll try to implement both solutions and patch the ffmpeg v4l2 backend to not tie output bufs to capture ones
   ndufresne: sounds good
    <br> bbrezillon, btw, I'm travelling to Oslo (gst hackfest) overnight
   bbrezillon: (by both solutions I mean copy+decode-frame vs queue+decode-slice*X)
   ndufresne: but should be around in AM for you I guess
   bbrezillon: <u>ndufresne</u>: don't worry, I'm no longer blocked
   ndufresne: hehe
   <br> but I like keeping up, I learn so much too
    <br> ezequielg, yeah, anything that touches the dumb buffer in virtio_gpu fails
    <br> ezequielg, ok, so what's missing is the following: you can create dumb buffers, map/write/unmap/display, but you cannot do dumb/prime/map(dmabuf)/unmap/display
    <br> and this is an arbitrary limitation, since the kernel can map the primed dumb buffer and write to it in software (vivid)