Who | What | When |
---|---|---|
*** | rellla has quit IRC (Quit: see you later) | [06:30] |
................................................................ (idle for 5h16mn) | ||
lyakh has quit IRC (Remote host closed the connection) | [11:46] | |
............... (idle for 1h13mn) | ||
ndufresne has quit IRC (Ping timeout: 250 seconds) | [12:59] | |
.................. (idle for 1h26mn) | ||
ndufresne | paulk-leonov, I must admit I'm a bit confused about what you are trying to propose to solve the pipelining issue in the current H264 slice API | [14:25] |
paulk-leonov | ndufresne, ah?
is it unclear in the way I present it or does it seem flawed/unadapted? | [14:26] |
ndufresne | no no, I'm just wondering if I'm missing some of your points
also I don't understand why you consider a legacy and a non-legacy thing, as it's all new anyway
Let's start with the request | [14:26] |
paulk-leonov | ah yes, that's separate | [14:27] |
ndufresne | so you think we attach multiple v4l2 buffers (1 slice in each) to a single request, is that right ? | [14:27] |
paulk-leonov | I see two distinct issues: one is pixfmt and one is proper slices pipelining | [14:27] |
ndufresne | ok, I'm letting you explain now | [14:27] |
paulk-leonov | ndufresne, no, I'm thinking of one request per buffer per slice | [14:28] |
ndufresne | ah ok, so that does not change, I mean that's how it's used by Kwiboo so far | [14:28] |
paulk-leonov | but we could allow reusing the same source (output) buffer across slices
yes I think it's the proper way too
ndufresne, what I'm suggesting is a request batch entity that takes multiple requests | [14:28] |
ndufresne | ok, now I don't follow what reusing the same source (output) buffer gives us | [14:29] |
paulk-leonov | ndufresne, there is no benefit to it, it's just simpler if userspace gets all its slices in one go | [14:29] |
ndufresne | but my point is that userspace doesn't always get all slices in one go | [14:30] |
paulk-leonov | also allows allocating "reasonably-sized" output buffers
ndufresne, yes but you don't know how many you're going to get, right? | [14:30] |
ndufresne | In video streaming, WebRTC/RTP, you will get slices separately | [14:30] |
paulk-leonov | in v4l2, you need to allocate buffers before using them
I mean, I don't think you can allocate buffers during decoding | [14:30] |
ndufresne | that's generic, maybe you mean something else ? if you don't allocate before using a buffer you end up using random memory
well, yes you can if the driver implements CREATE_BUFS, but let's ignore that one | [14:31] |
paulk-leonov | mostly, I mean that you can't allocate the output buffers you need at the time you know how many slices you're going to get per frame
so you have to work with the fact that you don't know how many slices you get and yet you need a fixed number of buffers for that | [14:32] |
ndufresne | sure, but that's not new | [14:32] |
paulk-leonov | I guess | [14:32] |
ndufresne | again, if that's too complicated to guess, we will just start adding CREATE_BUFS on the decoder output side | [14:33] |
paulk-leonov | not sure I follow the last part
you can't call CREATE_BUFS after STREAMON AFAIK | [14:33] |
ndufresne | I mean there exists an ioctl to allow userspace to allocate more if there are not enough
sure you can call create_bufs after streamon, see UVC ;-D | [14:34] |
paulk-leonov | ah
good then :) | [14:34] |
ndufresne | but it's not ideal, ideally you should not allocate during streaming | [14:34] |
paulk-leonov | that's a random latency peak | [14:34] |
ndufresne | exactly
but I don't think it's a very important issue here | [14:35] |
paulk-leonov | so one option to cover that is to allocate large buffers at the beginning | [14:35] |
ndufresne | yes, on stateful encoders, we look at parameters like profile/level/resolution and make a guess
but right now, what I'm trying to solve is the contention on the capture side | [14:36] |
paulk-leonov | right | [14:37] |
ndufresne | Since the performance blocker is that you must wait, dequeue, requeue the capture buffer if you have multiple slices in one frame | [14:37] |
paulk-leonov | so I'm suggesting a "request batch" entity
which would contain multiple requests and you'd schedule the whole batch, not just one request | [14:38] |
ndufresne | batching implies that you have all the information before you start decoding, no ? | [14:38] |
paulk-leonov | at least for one request
then we could allow adding more requests to the batch | [14:38] |
ndufresne | to me, batching should be dynamic, e.g. we would batch if requests have accumulated in our queues | [14:38] |
paulk-leonov | even when it's running
but requests aren't queued AFAIK
the batch idea would be a request queue, more or less
maybe the naming is not optimal either | [14:38] |
ndufresne | no indeed, it's the slices that are queued, but the driver could accumulate these internally if e.g. it only works on full frames | [14:39] |
paulk-leonov | ndufresne, I think the driver should only see m2m jobs and never have to accumulate them
the driver logic is complex enough as it is IMO | [14:40] |
ndufresne | what I'm suggesting is to simply use a flag to indicate when all slices of a frame (regardless of the order) have been requested | [14:41] |
paulk-leonov | well, I don't see why the driver would need that | [14:41] |
ndufresne | ok, so by using such a flag, the driver would hold off setting the DONE state on the capture buffer (the frame) | [14:42] |
paulk-leonov | it can definitely be good to mark that on the batch/request queue, so that it knows when to consider the batch as done
that's an option, yes
seems better than the suggestion I had earlier about a non-buffer-backed sync object | [14:42] |
ndufresne | this way, you have a single meaning for the state of the frame (capture buffer), which is DONE == completely decoded | [14:43] |
paulk-leonov | yes that would make good sense | [14:43] |
ndufresne | what I like about this solution is that it can be used per format, so you can avoid it if it makes no sense for that format | [14:44] |
paulk-leonov | ok I'm sold on the idea too | [14:44] |
ndufresne | now, implementation wise, if the HW is single core
the driver code would be relatively simple, since it would keep decoding slices, and when it's done decoding a slice that has the flag, it marks the buffer done | [14:44] |
paulk-leonov | again, I think the core should do that
but yes, the idea would be that | [14:45] |
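A rough sketch of the driver-side logic being discussed, assuming a hypothetical per-buffer flag (called V4L2_BUF_FLAG_LAST_SLICE below, not an existing flag) that userspace sets on the output buffer carrying the last slice of a frame; the m2m helpers are the regular v4l2-mem2mem API, the rest is illustrative:

```c
/* Sketch only: V4L2_BUF_FLAG_LAST_SLICE is hypothetical, and whether this
 * logic lives in the driver or the m2m core is exactly what is being debated. */
static void slice_decode_done(struct cedrus_ctx *ctx)
{
	struct vb2_v4l2_buffer *src = v4l2_m2m_src_buf_remove(ctx->fh.m2m_ctx);
	struct vb2_v4l2_buffer *dst;

	/* The output (slice) buffer is done as soon as the engine is idle. */
	v4l2_m2m_buf_done(src, VB2_BUF_STATE_DONE);

	if (src->flags & V4L2_BUF_FLAG_LAST_SLICE) {
		/* Last slice of the frame: the capture buffer now holds a
		 * fully decoded picture, so it can be marked DONE. */
		dst = v4l2_m2m_dst_buf_remove(ctx->fh.m2m_ctx);
		v4l2_m2m_buf_done(dst, VB2_BUF_STATE_DONE);
	}
	/* Otherwise keep the capture buffer queued so the next slice of the
	 * same frame decodes into it. */

	v4l2_m2m_job_finish(ctx->dev->m2m_dev, ctx->fh.m2m_ctx);
}
```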
ndufresne | sure, I'm not familiar with the additions to the core that would help with that
but for multi-core HW, we'll probably need to implement scheduling once in one driver before extending the core
it seems a tad more complex | [14:46] |
paulk-leonov | this is a feature that can be useful for media requests in general, not only for the decoding case I guess
ndufresne, is there such a thing as "multi-core" stateless decoding hardware? | [14:46] |
ndufresne | right, and it would be nicer if the flag is the same for both H264/H265, as an example | [14:47] |
paulk-leonov | definitely | [14:47] |
ndufresne | for multi-core, the ZynqMP from Xilinx is a quad
currently everything is being handled by a firmware running on a MicroBlaze (a soft processor implemented in the Xilinx FPGA) | [14:47] |
paulk-leonov | ndufresne, what does that mean? It has 4 times the decoding pipeline or something?
ah so it's not fixed-function | [14:48] |
ndufresne | but it could also be done stateless with a smaller footprint firmware | [14:48] |
paulk-leonov | mhh
yeah I guess | [14:48] |
ndufresne | It's probably fixed function, that bit is not open, but it's multiplied by 4 if I understood well
as it's stateless, you can run any job on any core | [14:48] |
paulk-leonov | currently m2m is purely synchronous as far as I know | [14:49] |
ndufresne | no, m2m is asynchronous
everything v4l2 is asynchronous | [14:49] |
paulk-leonov | ah sorry, I didn't mean asynchronous | [14:49] |
ndufresne | in GStreamer for each m2m driver, we have a dedicated thread for the capture queue | [14:49] |
paulk-leonov | it'll wait for the previous m2m job to complete before scheduling a new one
is what I mean | [14:49] |
ndufresne | ah right, that's specific to the m2m framework, which I believe we all agreed is quite minimal and limiting | [14:50] |
paulk-leonov | so for multi-core we probably only need m2m to be able to schedule in parallel
and the driver can just keep track of which slots are busy/free and potentially fail to schedule a job | [14:51] |
ndufresne | yes, but that will happen if we have such a HW (and specification to implement it) | [14:51] |
paulk-leonov | which m2m will retry as soon as another job completes
ndufresne, yes, that does not feel urgent at all for now and it shouldn't mess with our requests queue design | [14:51] |
ndufresne | yes, I mean you could also just use the multi-core for multiple streams, but keep each stream on a single core | [14:52] |
paulk-leonov | (as far as I can see) | [14:52] |
ndufresne | in general it should be fast enough since the best quality is achieved with a single slice per frame
(did I mention that the more slices per frame the lower the quality ?) | [14:52] |
paulk-leonov | I didn't know, and it's definitely good to know :)
so it seems that we have an overall draft on the general requests queue/batch idea
marking one of the requests as the last one to complete the frame and returning the buffer as done only when that one was processed | [14:53] |
ndufresne | of course, I think there are lots of things we can improve on batching, e.g. how to let userspace do explicit batching
Could be nice for non-live cases | [14:54] |
paulk-leonov | ah, that's why you mentioned multi-core: we can't be sure the last one queued will be the last one decoded
so the core will need to make sure we have decoded the last one *and* the previous ones before marking as done :) and that should do the trick | [14:54] |
ndufresne | yep, so for multi-core, there is special care, as completion will be out of order | [14:55] |
paulk-leonov | we can make it the general case too | [14:55] |
ndufresne | the partitioning of macroblocks is pretty naive in general, so it's not rare that the last slice only has a handful of macroblocks, so it's fast to decode | [14:55] |
paulk-leonov | makes sense | [14:55] |
ndufresne | there will be tons of details as usual
right now, driver to userspace, EOS is communicated with the generic _LAST flag
so shall we use the same _LAST flag, or shall we add a dedicated one ? I'm a fan of the second, but is it consistent ;-P
in stateless there should be no need for the CMD_STOP/EOS flow, so that's not a worry (meaning both ways work) | [14:56] |
paulk-leonov | no I think we need something in media for that in v4l2
so a dedicated separate flag, it's not really the same kind of indication
need to go, be back in 30 mins, I'll rollback | [14:58] |
ndufresne | thanks | [15:08] |
........ (idle for 39mn) | ||
*** | benjiG1 has left | [15:47] |
Kwiboo | paulk-leonov, ndufresne, the ffmpeg hwaccel uses "dynamic" allocation of buffers using CREATE_BUFS during decoding, jernej had to make a small change to the cedrus driver to not depend on the number of allocated buffers but we have not seen issues with allocating buffers as needed
we start with around 10 buffers for h264 and will allocate more when the ffmpeg buffer pool is out of free buffers | [16:00] |
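For reference, the on-demand allocation mentioned here boils down to a VIDIOC_CREATE_BUFS call with the current OUTPUT format, which is allowed while streaming when the driver supports it; a minimal userspace sketch (error handling trimmed):

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Create `count` extra MMAP buffers on the OUTPUT (bitstream) queue.
 * Returns the index of the first new buffer, or -1 on failure. */
static int add_output_buffers(int fd, unsigned int count)
{
	struct v4l2_format fmt = { .type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE };
	struct v4l2_create_buffers create;

	/* CREATE_BUFS needs a valid format; reuse the currently set one. */
	if (ioctl(fd, VIDIOC_G_FMT, &fmt) < 0)
		return -1;

	memset(&create, 0, sizeof(create));
	create.count = count;
	create.memory = V4L2_MEMORY_MMAP;
	create.format = fmt;

	if (ioctl(fd, VIDIOC_CREATE_BUFS, &create) < 0)
		return -1;

	return create.index;	/* index of the first newly created buffer */
}
```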
ndufresne | make sense | [16:02] |
paulk-leonov | Kwiboo, can't we figure out the required number of buffers? It seems that vlc + ffmpeg + vaapi does that
at least in vaapi the number of buffers we allocate initially seems to change depending on the video being played | [16:05] |
Kwiboo | ffmpeg vaapi will always allocate 20 buffers for h264 if I recall correctly, and this causes issues during seeking because two different hwaccels/buffer pools will be allocated if sps/pps changes between seek points, meaning 40 buffers are allocated for a short period
we solved this memory issue by dynamic allocate and free buffers | [16:09] |
jernej | actually, buffers are not freed dynamically
just at the end or during seek IIRC | [16:10] |
Kwiboo | true :-), once they have been assigned to a ffmpeg buffer pool they will stay there until entire pool is released | [16:12] |
paulk-leonov, it is probably possible to calculate the number of buffers needed for decoding, but when you want to display frames on screen the number of required buffers may change in case the application wants to have an internal queue of decoded frames before they are displayed | [16:17] | |
paulk-leonov | Kwiboo, yes definitely
it's a minimum that can be figured out but we can't always rely on having lots of buffers around e.g. I have a device with 128 MiB RAM that I want to do hw-accelerated decoding with (with cedrus) | [16:19] |
jernej | paulk-leonov: Isn't it better then to allocate buffers on demand, as it is already done in ffmpeg? | [16:22] |
Kwiboo | yep, that is why we allocate buffers as they are needed, it lowers the need for pre-allocated memory, and we also try to flush the ffmpeg buffer pool to release as many buffers as possible when we may have an overlap and two different buffer pools | [16:22] |
paulk-leonov | jernej, yes definitely, but allocating each slice buffer feels like too much overhead still
especially since CMA-backed buffers are generously aligned | [16:22] |
jernej | ah, you would start reusing buffers pretty soon | [16:23] |
paulk-leonov | efficient reuse also means having adapted sizes from the get go
which could fit different sizes of slices
so what I'm saying is that if we have an output buffer with space for up to two slices, it would be good to be able to use that | [16:24] |
jernej | I put together an H264 video which changes resolution and many other parameters on the fly
and it works just as well, since ffmpeg flushes the pipeline when that happens | [16:25] |
paulk-leonov | right, that makes sense | [16:25] |
jernej | the only issue is how to determine the optimal slice buffer size
currently we use 2 MiB, since some 4K videos need it
I mean, they don't fit in 1 MiB | [16:27] |
paulk-leonov | mhh that could be loosely correlated with the codec and dimensions
I think we went for 1 MiB in our vaapi | [16:33] |
jernej | yes and we put that in ffmpeg at first | [16:33] |
paulk-leonov | I see
well, we could start at 1 MiB and double the size if needed | [16:34] |
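A trivial sketch of that "start small, grow if needed" policy on the userspace side; the 1 MiB starting point is the one mentioned above and the helper is purely illustrative (new, larger buffers would then be created with VIDIOC_CREATE_BUFS as discussed earlier):

```c
#include <stddef.h>

/* Grow the slice buffer size by doubling whenever an incoming slice no
 * longer fits; buffers created after that point use the larger size. */
static size_t slice_buf_size = 1024 * 1024;	/* 1 MiB starting point */

static size_t size_for_slice(size_t slice_len)
{
	while (slice_buf_size < slice_len)
		slice_buf_size *= 2;

	return slice_buf_size;
}
```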
........ (idle for 36mn) | ||
ndufresne | Kwiboo, when the buffers are backed by vmalloc, I'd be inclined to say that it's better to allocate as late as possible ;-D
jernej, the venus and coda drivers have some formula to guess the frame size (not the slice size though)
these have worked quite well so far
of course you can always decide to generate a 940Mb/s HEVC stream which will blow it ;-D | [17:11] |
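As an illustration of the kind of heuristic being referred to (not the actual venus/coda formula), a compressed-frame buffer size can be guessed from the raw frame size with a floor, e.g.:

```c
/* Illustrative heuristic only: size the compressed-frame buffer as a
 * fraction of the raw 4:2:0 frame size, never below 1 MiB. */
static unsigned int guess_bitstream_buf_size(unsigned int width, unsigned int height)
{
	unsigned int raw = width * height * 3 / 2;	/* NV12/I420 frame size */
	unsigned int guess = raw / 2;			/* assume at least 2:1 compression */
	unsigned int min_size = 1024 * 1024;

	return guess < min_size ? min_size : guess;
}
```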
............. (idle for 1h1mn) | ||
*** | rubdos has quit IRC (Ping timeout: 240 seconds) | [18:14] |
........ (idle for 37mn) | ||
ndufresne | Kwiboo, the ffmpeg code, I see it extracts the refs list from the DPB, but I don't see where the reordering takes place ? Do you pass the refs in DPB order ? | [18:51] |
Kwiboo | ndufresne, yes we pass refs in DPB order, I think the DPB will be in oldest-ref-frame-first order, the h264 and hevc parts were only tested on cedrus, where ordering does not seem to matter much as long as the same buffer is using the same hw dpb position for its lifetime (position variable in cedrus) | [19:00] |
ndufresne | that seems Allwinner specific then
something to give a close look at | [19:03] |
Kwiboo | I do not fully grasp the requirement regarding reference order limit on rk decoder, is that for rkvdec only or does it also apply to vpu1/vpu2 ? | [19:04] |
ndufresne | I don't know what "reference order limit" refers to
I'm only working on reference order right now
Kwiboo, for VAAPI you need to implement in userspace 8.2.4.3 "Modification process for reference picture lists"
I'm surprised you didn't need to implement this, but maybe it's done elsewhere ? | [19:05] |
Kwiboo | ffmpeg should take care of such steps, all we do in hwaccel is: controls->slice_params.ref_pic_list0[i] = get_dpb_index(&controls->decode_params, &sl->ref_list[0][i]) | [19:20] |
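A rough sketch of what a helper like get_dpb_index() could look like (the actual ffmpeg hwaccel patch may differ; field and flag names such as reference_ts and V4L2_H264_DPB_ENTRY_FLAG_ACTIVE depend on the uAPI revision in use, and frame_to_tag() is a hypothetical helper returning the tag/timestamp the frame was decoded with):

```c
/* Sketch: find the position of an ffmpeg reference picture in the v4l2
 * DPB array by matching the tag (timestamp) used when it was decoded. */
static uint8_t get_dpb_index(const struct v4l2_ctrl_h264_decode_param *decode,
                             const H264Picture *ref)
{
	uint64_t ref_ts = frame_to_tag(ref->f);	/* hypothetical helper */
	unsigned int i;

	for (i = 0; i < FF_ARRAY_ELEMS(decode->dpb); i++) {
		const struct v4l2_h264_dpb_entry *entry = &decode->dpb[i];

		if ((entry->flags & V4L2_H264_DPB_ENTRY_FLAG_ACTIVE) &&
		    entry->reference_ts == ref_ts)
			return i;
	}

	return 0;	/* not found: fall back to slot 0 */
}
```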
ndufresne | indeed, it does before calling decode_slice
I was confused by the fact that the DPB is really maintained in two places | [19:20] |
ok ok, so that makes sense | [19:27] | |
Kwiboo | yep, I do not really understand the need for ref_pic_list0/1 and also the ref_pic_list_p0/b0/b1, from what I understand p0/b0/b1 is some rk-specific thing before the modification process or similar | [19:28] |
ndufresne | yeah, I'm still trying to figure out why they got more lists on the RK side, both do single slice decoding
a slice cannot be P and B at the same time
Kwiboo, I'm surprised, I thought the ref list on the v4l2 side was going to be a list of timestamps, but instead they look like a list of indices in the dpb | [19:29] |
Kwiboo | yep, it could feel a little bit strange, I think it was designed before the tag/timestamp concept was added, to get field-coded videos working we added the reference type (top/bottom) in the top 2 bits of the ref_list: i | ((ref->reference & 3) << 6), where i = index in dpb
a small hack that probably should be solved in a better way :-) https://github.com/LibreELEC/LibreELEC.tv/blob/master/projects/Allwinner/patches/linux/0007-H264-improvements.patch#L66-L90 <<-- use of reference type in cedrus | [19:34] |
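The hack above packs the reference type into the top two bits of each ref_list entry; a minimal sketch of the two sides of that encoding (bit layout taken from the snippet, helper names hypothetical):

```c
#include <stdint.h>

/* Userspace side: pack the DPB index (low 6 bits) and the reference type
 * (top/bottom/frame, 2 bits) into one byte, as in i | ((ref->reference & 3) << 6). */
static uint8_t pack_ref(uint8_t dpb_index, uint8_t reference)
{
	return dpb_index | ((reference & 3) << 6);
}

/* Driver side: recover both fields from the packed value. */
static void unpack_ref(uint8_t packed, uint8_t *dpb_index, uint8_t *reference)
{
	*dpb_index = packed & 0x3f;
	*reference = (packed >> 6) & 3;
}
```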
ndufresne | hmm, I see, that's a hack
ezequielg, were you aware of ^
ha, so the proposed implementation was hard coding VE_H264_SHS_FIRST_SLICE_IN_PIC on every slice | [19:38] |
Kwiboo | yep, getting slice decoding was rather easy and logical, submit new slice data, same frame buffer, set start position and fix first slice in pic flag | [19:47] |
..... (idle for 21mn) | ||
ndufresne | Kwiboo, for the bottom field hack, did you notice that the driver code never sets the field on the destination buffer ?
there is a comment in the structure that says, fields are in the buffer, but you need to copy field from source to dst, as you will refer to the capture dst buffer later | [20:08] |
Kwiboo | they are two different things, buffer may be a bottom field, but the reference can also be a top/bottom/frame reference | [20:09] |
ndufresne | right, it's a state of the decoding of the interlaced frame
it's probably a bad idea to try and merge this with the field semantics
did you raise that problem on the ML already ? | [20:11] |
Kwiboo | I think jernej raised something regarding the bad/wrong use of field, I focused on the ffmpeg parts and jernej on cedrus when we put in this hack and the other parts that were missing in the cedrus driver :-) | [20:16] |
jernej | we did, but interlaced videos were never supported with initial version
of the driver | [20:16] |
ndufresne | I should start writing all these down as todos then
jernej, but overall, it seems that if we introduce the equivalent of "reference" into the DPB entry that should do, or am I missing something ? | [20:18] |
jernej | imo that should do and I think VAAPI already has something similar
but I need to look again | [20:22] |
ndufresne | there is a flags field in the DPB entry, but I don't know what it's used for
VAPictureH264 and v4l2_h264_dpb_entry seem to match
but the buf_index in that structure is odd, or I don't understand | [20:22] |
Kwiboo | I do not think that works and I believe that was something we tested, the reference should have a flag. I think you can have a field reference to the same frame you also have a bottom/top reference to | [20:25] |
jernej | ah, sorry, I mixed up things a bit | [20:27] |
Kwiboo | my memory can of course have failed me, it was months ago we fiddled with those parts ;) | [20:27] |
jernej | let me check again
ok, so there are actually 3 VAPictureH264 arrays | [20:27] |
Kwiboo | I think we tested saving the frame reference type into flags in the dpb, in the end I think it did not always match the reference type in all cases, using the reference type fixed the visible issue (I do not remember how it showed) | [20:30] |
jernej | ReferenceFrames represents the DPB
RefPicList0 and RefPicList1 represent references
so yeah, the RefPicList arrays have information on whether it is a top or bottom reference independently of the DPB | [20:30] |
ndufresne | maybe it has something to do with the third list in RK
I'll have to a) make up my mind on this, and b) keep that in mind
then ok, let me recap
so if I get it, ReferenceFrames from VAPictureParameterBufferH264 is equivalent to our v4l2_h264_dpb_entry list, and flags is for the purpose you added the hack for
but in VAAPI, you have different flags depending on which list you have placed the picture in
but on our side we can't, the lists refer to the DPB | [20:33] |
jernej | yes, that's my understanding | [20:37] |
ndufresne | ok, so the solution then, make the refs list a list of v4l2_h264_dpb_entry ?
or add a new structure with the DPB index + flags ? | [20:37] |
jernej | I would rather see if we just somehow add flags, no need to duplicate other info | [20:39] |
Kwiboo | I think a new struct with dpb index (or ref_ts) + flags would be easier
:-) | [20:39] |
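A sketch of the kind of per-reference structure being proposed here: a DPB slot index (or a ref_ts) plus flags describing how that entry is referenced. The struct and field names are illustrative, not a settled uAPI:

```c
/* Illustrative only: one entry of ref_pic_list0/1, pointing into the
 * DPB array and carrying the reference type separately from the DPB. */
struct v4l2_h264_ref_entry {
	__u8 flags;	/* e.g. top-field / bottom-field reference bits */
	__u8 index;	/* index into the decode params DPB array */
};
```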
jernej | but please note that I'm not an expert on H264, so I may have missed something | [20:40] |
ndufresne | what's the story about the TS btw, I've been reading discussions that we should be using TS all over, has that simply never been ported ?
jernej, that's why things go back and forth so much, none of us are really H264 experts | [20:40] |
jernej | by TS you mean timestamps? they are used for references | [20:42] |
Kwiboo | my understanding is that hverkuil wanted to add a "tag" to be able to reference buffers instead of using buffer index, in the end the "timestamp" field was reused but should still be seen as sort of a "tag"
in ffmpeg hwaccel the buffer index is used as the "tag" ;) | [20:42] |
jernej | Kwiboo: I once tested rolling tags, but there was certainly no benefit, only slightly more complicated code, so I dropped the idea | [20:44] |
Kwiboo | ahh, I think as long as we have a 1:1 mapping between capture and output buffers using the buffer index will be easiest, only in the future, for optimization, might that have to change | [20:46] |
ndufresne | But the API I see here in cedrus repo is still using dpb index + buffer index | [20:49] |
jernej | check this one: https://git.linuxtv.org/hverkuil/media_tree.git/tree/drivers/staging/media/sunxi/cedrus/cedrus_h264.c?h=cedrus-h264#n115
it uses vb2_find_timestamp() | [20:52] |
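For context, the timestamp-as-tag lookup on the driver side is essentially this (a sketch based on vb2_find_timestamp(); the debug print is the kind of message discussed just below, and translating the index into a DMA address is left out):

```c
/* Sketch: translate the reference timestamp/tag stored in a DPB entry
 * back into a capture buffer index. vb2_find_timestamp() returns the
 * matching buffer index, or a negative value if none carries that tag. */
static int ref_buf_index(struct vb2_queue *cap_q, u64 reference_ts)
{
	int index = vb2_find_timestamp(cap_q, reference_ts, 0);

	if (index < 0)
		pr_debug("reference frame with ts %llu not found\n",
			 (unsigned long long)reference_ts);

	return index;
}
```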
ndufresne | ah, I totally forgot that it was Hans who ported it
I don't like that we just continue, with no debug log if it's not found, this seems like bad practice to me | [20:54] |
jernej | well, ts stuff for H264 was done by mripard | [20:55] |
ndufresne | I still need to get used to how things get spread out in kernel development ;-D
jernej, thanks btw for the pointers | [20:56] |
jernej | probably a debug message would be nice if a reference frame was not found, but I think a few corrupted frames are better than no frames
no problem | [20:59] |
ndufresne | there is such a dispute between devs all the time, between freezing or displaying the corrupted buffers, it's a bit crazy | [21:00] |
jernej | really? I never noticed that, but then again, I don't follow much development of other players | [21:01] |
ndufresne | on facetime as an example, they opted for freezing the last good image, on hangouts they display the corrupted image
most TVs will display the broken frames, but it never lasts long, since there are many keyframes, while in video conferencing you generally need to request a keyframe | [21:04] |
...... (idle for 26mn) | ||
jernej | ndufresne: maybe we should support both cases. Decode the frame despite the issues in references (or elsewhere) but later mark the frame as corrupted
so both principles can be supported, freezing or rendering of corrupted frames | [21:31] |
ndufresne | jernej, yes, that's what stateful decoders do
in gst I also have this flag, and you can add an element to drop frames with a specific flag to choose what you want | [21:31] |
jernej, do you think we can deduce if the references are P or B slices from the exposed DPB ?
I'll probably give this a couple more reads, but in RK it seems that the P references are removed from list0/1 which only contain B references
The control API we have makes no sense, since the lists are in v4l2_ctrl_h264_decode_param, that should be per slice
I'm wondering if I can deduce that info and remove ref_pic_list_p0, ref_pic_list_b0, ref_pic_list_b1
to be continued | [21:39] | |
.... (idle for 15mn) | ||
jernej | ndufresne: I don't think you can deduce frame type from DPB
in DPB you should have decoded pictures where P, B or I type doesn't matter anymore | [21:56] |