jernej: ndufresne: I just tested HEVC on Cedrus and the bitstream offset really needs to point to the padding after the header and not the actual slice data
ndufresne: silly hardware, but we'll need to find a solution that does not make this visible to users, or add an extra field that indicates the size of the padding bits
ndufresne: sorry, alignment_bit_size perhaps
ndufresne: otherwise we have a weirdo field that no one directly supports
ndufresne: it's just an accident that it works with ffmpeg really
ezequielg: tfiga: you mention the size and start_byte_offset fields,
ezequielg: but why are you bringing that topic up?
ezequielg: we are not really discussing removing those fields, afaik.
tfiga: ezequielg: those make no sense if there is only 1 slice in the buffer
ezequielg: ah!
ezequielg: ok, now I get your point.
tfiga: they are there to support the multiple-slices-per-buffer model
ezequielg: right.
ezequielg: I don't think we are ready for a multiple-slices-per-buffer model.
tfiga: neither are we for 1 slice per buffer :)
ezequielg: btw, are you aware of any other hardware (besides cedrus) which would leverage this model?
ezequielg: we are already doing 1 slice per buffer. suboptimal or not, that's what we have now.
tfiga: Chrome does multiple slices per buffer
tfiga: I suppose our hardware ignores the parsed slice structs
tfiga: but still, it is also implementable
ndufresne: yes, so in the latest API, whole-frame decoders no longer need to implement the slice_params control
ndufresne: so the "all slices per buffer" case, that's covered
tfiga: only for hardware which parses the slices on its own
ndufresne: for cedrus, it was raised that what we do currently might not scale well when there are a lot of slices, I guess some HEVC streaming case?
ndufresne: tfiga: yes, we can basically use the control presence for that
ezequielg: so, currently the only hardware we know of that would be interested in this is cedrus.
ndufresne: but then if you have HW (it's hypothetical) that is a frame decoder without parsing, you are screwed, because we can only give up to 16 slices' params in the control
tfiga: gnurou would have to check what the stateless mtk-vcodec does, as I don't remember
ndufresne: iirc, it was using the slice_params at index 0
tfiga: we can give 16 slices in 1 buffer
tfiga: and then go with the next buffer for the remaining ones
ndufresne: yes, but a frame-based decoder might not like having multiple buffers
ezequielg: afaik, mtk is frame-based.
ezequielg: by frame-based I mean it gets all the slices in a buffer, and then parses the slice headers in the hw.
pinchartl: ribalda: ndufresne: we noticed with hantro/rkvdec that you really need 1 buffer with all the data, so you'd have to accumulate the slice params until you have them all
pinchartl: seems horribly complex
pinchartl: oops
ndufresne: (I mean for a driver to do)
ezequielg: pinchartl: now you have to take part in the discussion as well.
ndufresne: tfiga: it's of course hard to guess what strange thing a HW vendor will ever do
tfiga: ndufresne: for parsing hardware, one indeed likely needs all the slices in one buffer
tfiga: although it sounds like an artificial limitation
tfiga: so I'm okay with 1 slice per buffer IF we solve the problems with it
tfiga: or at least have a clear plan for solving them
ndufresne: that's an interesting direction, I had it on my todo to test and, if needed, debug slice batching with cedrus ...
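ndufresne's "use the control presence" remark refers to probing whether the driver exposes a per-slice parameters control at all, and falling back to the whole-frame model when it does not. Below is a minimal, hypothetical sketch of how userspace could do that with the standard VIDIOC_QUERY_EXT_CTRL ioctl; the helper name and the fact that the caller passes in the control ID are my own assumptions, since the exact slice_params control name was still in flux at the time of this discussion.

```c
/* Sketch: detect whether the driver wants per-slice parameters by probing
 * for a slice_params control.  driver_wants_slice_params() is a hypothetical
 * helper; ctrl_id would be the codec's slice_params control ID. */
#include <stdbool.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

static bool driver_wants_slice_params(int video_fd, unsigned int ctrl_id)
{
        struct v4l2_query_ext_ctrl qctrl;

        memset(&qctrl, 0, sizeof(qctrl));
        qctrl.id = ctrl_id;

        if (ioctl(video_fd, VIDIOC_QUERY_EXT_CTRL, &qctrl) < 0)
                return false; /* control absent: whole-frame decoder that parses slices itself */

        return true;          /* control present: submit per-slice parameters */
}
```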
ndufresne: do we have a thread that lists these issues you refer to?
ezequielg: I think, given cedrus is _already_ doing 1 slice per buffer, and we even introduced a special flag and API mode (capture hold) for it to work, it's sane to assume it's a mode we want.
ezequielg: given all the work that was done towards it.
ndufresne: I've been using cedrus a lot, and haven't hit anything that does not work there but works elsewhere
tfiga: ndufresne: yes, see my reply to ezequielg from today on the list
ezequielg: my proposal is to change the current slice control to 1 slice, not an array.
ezequielg: and to think of an array as an extension
ndufresne: ah, you just mean fix the useless fields if we go 1 slice per buffer
ndufresne: which is just offset and n_slices iirc, right?
ndufresne: what I like about this proposal, ezequielg, is that it's also the best way to figure out if batching complexity is really needed
ndufresne: as we all said before, while slices exist, a majority of streams don't use them, and when they do, they have fewer than 8 slices
tfiga: yes, I get all the points
tfiga: and the approach seems fine to me
tfiga: but it has problems, which need to be addressed
tfiga: not necessarily right now
ndufresne: tfiga: so from the email, 1) seems to be from an unrelated discussion, something I believe we discarded; it was about having N buffers per request as a batching model, instead of 1 buffer + using slice_params to index their memory offsets
ndufresne: 2) is of course relevant, but would have to be measured to justify more complexity
ndufresne: 3) well, we don't have such hardware, but batching through IRQ is truly HW accelerated, and that was our idea with cedrus (though we had 1 buffer, N slices in mind)
ezequielg: the key is "not necessarily right now"
ndufresne: 4) is true: with cedrus, as we don't know the number of slices, and slices don't have to be even in amount of bits, we end up with a higher "unused" amount of memory when there are multiple slices; that problem goes away if you implement batching, which is to append as many slices (up to 16) as fit into the v4l2_buffer
ezequielg: we don't have to fix the world at once on this proposal.
ezequielg: we just need a plan forward.
ezequielg: which we have.
tfiga: ezequielg: yes, we need a plan
tfiga: and I don't see it
ezequielg: the plan is a new control or set of controls.
ezequielg: which we know is totally possible.
ezequielg: so I don't see why we'd block the current work because of that
tfiga: would you mind replying to the problems I listed in my reply?
ndufresne: I think 2, 3 and 4 were covered with the 16-slice params array and their offsets, and 1 is an issue for something we don't even plan to ever support
ezequielg: sure, I will reply on the ML. I mostly wanted to clarify my first question about those fields.
ndufresne: now the problem is more techy: the folks (us) working on finishing this API have no direct motivation to work on cedrus, I'd like to be honest here
tfiga: ndufresne: I don't understand why 1) is unrelated
tfiga: if you have many slices, you would end up stalling the pipeline quite fast
ndufresne: it was related to another thread, where we suggested doing 1 buffer per slice, and N buffers per request
ndufresne: and I believe we discarded that idea
tfiga: no, that would still be 1 buffer per request
tfiga: but N slices per buffer
tfiga: or we could increase NUM_BUFFERS
ndufresne: if it's 1 buffer per request, why would 32 buffers be a problem?
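For context, the "1 buffer per request" flow being debated maps onto the media request API roughly as follows. This is only a sketch: submit_slice() is a hypothetical helper, the slice_params control ID and payload are placeholders for whichever codec-specific control applies, the OUTPUT buffer is assumed to be an MMAP multi-planar buffer already filled with one slice of bitstream, and most error handling is omitted.

```c
/* Sketch of "one request + one OUTPUT buffer per slice" submission. */
#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>

static int submit_slice(int media_fd, int video_fd, unsigned int out_index,
                        unsigned int slice_size, unsigned int ctrl_id,
                        void *slice_params, unsigned int params_size)
{
        struct v4l2_ext_control ctrl = { 0 };
        struct v4l2_ext_controls ctrls = { 0 };
        struct v4l2_plane plane = { 0 };
        struct v4l2_buffer buf = { 0 };
        int req_fd;

        /* One request per slice. */
        if (ioctl(media_fd, MEDIA_IOC_REQUEST_ALLOC, &req_fd) < 0)
                return -1;

        /* Attach the per-slice controls to the request. */
        ctrl.id = ctrl_id;          /* placeholder: codec's slice_params control */
        ctrl.size = params_size;
        ctrl.ptr = slice_params;
        ctrls.which = V4L2_CTRL_WHICH_REQUEST_VAL;
        ctrls.request_fd = req_fd;
        ctrls.count = 1;
        ctrls.controls = &ctrl;
        if (ioctl(video_fd, VIDIOC_S_EXT_CTRLS, &ctrls) < 0)
                return -1;

        /* Queue the bitstream buffer holding this single slice. */
        plane.bytesused = slice_size;
        buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE;
        buf.memory = V4L2_MEMORY_MMAP;
        buf.index = out_index;
        buf.length = 1;
        buf.m.planes = &plane;
        buf.flags = V4L2_BUF_FLAG_REQUEST_FD;
        buf.request_fd = req_fd;
        if (ioctl(video_fd, VIDIOC_QBUF, &buf) < 0)
                return -1;

        /* Hand the whole request over to the driver. */
        if (ioctl(req_fd, MEDIA_REQUEST_IOC_QUEUE) < 0)
                return -1;

        return req_fd;  /* caller can poll() this fd for completion */
}
```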
tfiga: because NUM_BUFFERS is 32
ndufresne: if you have 1 buffer per request, you just have to wait for some of your requests to finish
tfiga: yes, and you end up doing so for every slice
tfiga: after you fill the pipeline
ndufresne: a 32-slice batch is more than sufficient, in fact we had 16 suggested in the current API
tfiga: so N times per frame of waiting for dqbuf
ndufresne: no, you only wait in re-ordered order, at least in gst that's what I do
ndufresne: in gst I don't even wait for every frame
ndufresne: if you have b-frames
ndufresne: and if it was implemented as designed, it would be N/16
ndufresne: for slices
ndufresne: anyway, all very techy, but to be fair, I think this "design" optimization isn't proven with any H.264 (cause yes, this is just 1 codec, we can do things differently given the reality of other codecs)
tfiga: I'm talking about OUTPUT (bitstream) buffers
tfiga: let's say you have 32 slices and 32 buffers
ndufresne: same here
tfiga: how do you queue the bitstream to the driver?
ndufresne: that's an odd question, QBUF?
tfiga: POLL() for a free buffer, DQBUF, fill in the bitstream, QBUF, repeat
ndufresne: well, you can optimize away a few of the POLLs if you are not pressured by memory
ndufresne: I implemented some of it
ndufresne: it's not just the driver that can optimize things
tfiga: no, if you already queued 32 slices
tfiga: and that would be a common case
tfiga: if you have enough buffers to queue 2 frames, you could just keep queuing
ndufresne: you can, as an example, optimize in userspace by picking the middle point in the list of pending requests
tfiga: and then synchronize on a full frame decode
tfiga: and that would be the place to dqbuf all the free buffers
ndufresne: and wait for that one, then you have N/2 bitstream buffers you can immediately dequeue with a single poll
ndufresne: you can do that because requests are executed in order
ndufresne: in fact, you can do that because requests behave exactly like fences
tfiga: okay, I forgot one can poll on a request fd
ndufresne: in fact, you have to, since it's racy if you poll the queue
tfiga: okay, I suppose that solves some of the problems
ndufresne: in terms of the amount of poll() calls, the requests are doing a great job here
tfiga: ndufresne: would you mind replying to my email as well?
ndufresne: but I must admit, it's quite painful to have to DQBUF
tfiga: ezequielg: thanks in advance too
ndufresne: sure
tfiga: thanks!
ndufresne: ok, hopefully I managed a useful answer
cocus: hello! quick question, slightly off topic, which community maintains the aux-display drivers? is it part of the media-tree as well?
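Going back to the poll() exchange above (before the off-topic question): the media request API signals request completion with POLLPRI on the request fd, and requests complete in queue order, which is what makes ndufresne's "wait on the middle pending request, then dequeue N/2 buffers with a single call" trick work. Below is a rough sketch under those assumptions; wait_for_request() and drain_output_buffers() are hypothetical helpers, request fds are assumed to come from something like the earlier submission sketch, and the video fd is assumed to be opened with O_NONBLOCK so the drain loop stops on EAGAIN.

```c
/* Sketch: wait for one specific queued request, then drain every OUTPUT
 * (bitstream) buffer the driver has already finished with. */
#include <poll.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/media.h>
#include <linux/videodev2.h>

/* Requests complete in queue order, so once request N completes, every
 * earlier request (and its bitstream buffer) is done as well. */
static int wait_for_request(int req_fd)
{
        struct pollfd pfd = {
                .fd = req_fd,
                .events = POLLPRI,  /* request completion is an exceptional condition */
        };

        return poll(&pfd, 1, -1) > 0 ? 0 : -1;
}

/* Dequeue all completed OUTPUT buffers; assumes a non-blocking video fd. */
static void drain_output_buffers(int video_fd)
{
        struct v4l2_plane plane;
        struct v4l2_buffer buf;

        for (;;) {
                memset(&buf, 0, sizeof(buf));
                memset(&plane, 0, sizeof(plane));
                buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE;
                buf.memory = V4L2_MEMORY_MMAP;
                buf.length = 1;
                buf.m.planes = &plane;

                /* Stops with EAGAIN once no more buffers are ready. */
                if (ioctl(video_fd, VIDIOC_DQBUF, &buf) < 0)
                        break;
                /* buf.index can now be refilled with the next slice. */
        }
}
```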