<!-- Some styling for better description lists --><style type='text/css'>dt { font-weight: bold;float: left;display:inline;margin-right: 1em} dd { display:block; margin-left: 2em}</style>

   jernej: <u>ndufresne</u>: I just tested HEVC on Cedrus and bitstream offset really needs to point to padding after header and not actual slice data
   ndufresne: silly hardware, but we'll need to find a solution that does not make this visible to users, or add an extra field that indicates the size of the padding bits
   <br> sorry, alignment_bit_size perhaps
   <br> otherwise we have a weirdo field that no-one directly supports
   <br> it's just an accident that it works with ffmpeg really
   ezequielg: tfiga you mention size and start_byte_offset fields,
   <br> but why are you bringing that topic up?
   <br> we are not really discussing removing those fields, afaik.
   tfiga: <u>ezequielg</u>: those make no sense if there is only 1 slice in the buffer
   ezequielg: ah!
   <br> ok, now i get your point.
   tfiga: they are there to support the multiple slices per buffer model
   ezequielg: right.
   <br> i don't think we are ready for a multiple slice per buffer model.
   tfiga: neither are we for 1 slice per buffer :)
   ezequielg: btw, are you aware of any other hardware (besides cedrus) which would leverage this model?
   <br> we are already doing 1 slice per buffer. suboptimal or not, that's what we have now.
   tfiga: Chrome does multiple slices per buffer
   <br> I suppose our hardware ignores the parsed slice structs
   <br> but still, it is also implementable
   ndufresne: yes, so in the latest API, whole-frame decoders no longer need to implement the slice_params control
   <br> so that "all slices per buffer" case is covered
   tfiga: only for hardware which parses the slices on its own
   ndufresne: for cedrus, it was raised that what we do currently might not scale well when there are a lot of slices, I guess for some HEVC streaming case ?
   <br> <u>tfiga</u>: yes, we can basically use the control presence for that
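   <i>A minimal sketch, in C, of what "using the control presence" could look like from userspace: query whether the driver exposes a per-slice parameters control and fall back to frame-based submission if it doesn't. The helper name and the idea of passing the control ID as a parameter are assumptions here, since the stateless control names were still in flux at the time.</i>
   <pre>
#include &lt;string.h&gt;
#include &lt;sys/ioctl.h&gt;
#include &lt;linux/videodev2.h&gt;

/*
 * Returns non-zero if the decoder exposes the given control, e.g. the
 * H.264 slice_params control: if it is absent, the hardware parses slice
 * headers on its own and only whole-frame bitstream buffers are needed.
 */
static int has_control(int video_fd, unsigned int ctrl_id)
{
    struct v4l2_query_ext_ctrl qctrl;

    memset(&qctrl, 0, sizeof(qctrl));
    qctrl.id = ctrl_id;

    return ioctl(video_fd, VIDIOC_QUERY_EXT_CTRL, &qctrl) == 0;
}
   </pre>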
   ezequielg: so, currently the only hardware we know of that would be interested in this is cedrus.
   ndufresne: but then if you have HW (it's hypothetical) that is a frame decoder
   <br> without parsing, you are screwed, cause we can only give up to 16 slices' params in the control
   tfiga: gnurou would have to check what the stateless mtk-vcodec does, as I don't remember
   ndufresne: iirc, it was using the slice_params at index 0
   tfiga: we can give 16 slices in 1 buffer
   <br> and then go with next buffer for the remaining ones
   ndufresne: yes, but frame-based decoders might not like having multiple buffers
   ezequielg: afaik, mtk is frame-based.
   <br> by frame-based i mean, it gets all the slices in a buffer, and then parses the slice headers in the hw.
   pinchartl: ribalda:
   ndufresne: we noticed with hantro/rkvdec that, for these, you really need 1 buffer with all the data, so you'd have to accumulate the slice params until you have them all
   <br> seems horribly complex
   pinchartl: oops
   ndufresne: (I mean for a driver to do)
   ezequielg: <u>pinchartl</u>: now you have to take part in the discussion as well.
   ndufresne: <u>tfiga</u>: it's of course hard to guess what strange things HW vendors will ever do
   tfiga: <u>ndufresne</u>: for parsing hardware, one indeed likely needs all the slices in one buffer
   <br> although it sounds like an artificial limitation
   <br> so I'm okay with 1 slice per buffer IF we solve the problems with it
   <br> or at least have a clear plan for solving them
   ndufresne: that's an interesting direction, I had it on my todo list to test and, if needed, debug slice batching with cedrus ...
   <br> do we have a thread that lists these issues you refer to ?
   ezequielg: i think given cedrus is _already_ doing 1 slice per buffer, and we even introduced a special flag and API mode (capture hold) for it to work, it's sane to assume it's a mode we want.
   <br> given all the work that was done towards it.
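   <i>A minimal sketch of the capture-hold mode ezequielg refers to, using the mainline V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF flag: every OUTPUT buffer carries one slice, and all slices except the last one of the frame are queued with the hold flag so the driver keeps decoding into the same CAPTURE (frame) buffer. Single-planar MMAP queue assumed; request setup and error handling omitted.</i>
   <pre>
#include &lt;string.h&gt;
#include &lt;sys/ioctl.h&gt;
#include &lt;linux/videodev2.h&gt;

static int queue_slice(int video_fd, int request_fd, unsigned int index,
                       unsigned int slice_size, int last_slice_of_frame)
{
    struct v4l2_buffer buf;

    memset(&buf, 0, sizeof(buf));
    buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
    buf.memory = V4L2_MEMORY_MMAP;
    buf.index = index;
    buf.bytesused = slice_size;
    buf.request_fd = request_fd;
    buf.flags = V4L2_BUF_FLAG_REQUEST_FD;

    /* Hold the capture buffer until the frame's last slice is queued. */
    if (!last_slice_of_frame)
        buf.flags |= V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF;

    return ioctl(video_fd, VIDIOC_QBUF, &buf);
}
   </pre>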
   ndufresne: I've been using cedrus a lot, and haven't hit anything that does not work there but works on others
   tfiga: <u>ndufresne</u>: yes, see my reply to ezequielg from today on the list
   ezequielg: my proposal is to change the current slice control to 1 slice, not an array.
   <br> and to think of an array as an extension
   ndufresne: ah, you just mean fix the useless fields if we go 1 slice per buffer
   <br> which are just offset and n_slices iirc, right?
   <br> What I like about this proposal, ezequielg, is that it's also the best way to figure out whether batching complexity is really needed
   <br> as we all said before, while slices exist, a majority of streams don't use them, and when they do, they have fewer than 8 slices
   tfiga: yes, I get all the points
   <br> and the approach seems fine to me
   <br> but it has problems, which need to be addressed
   <br> not necessarily right now
   ndufresne: <u>tfiga</u>: so from the email, 1) seems to be from an unrelated discussion, something I believe we discarded: it was about having N buffers per request as a batching model, instead of 1 buffer plus using slice_params to index into their memory offsets
   <br> 2) is of course relevant, but would have to be measured to justify more complexity
   <br> 3) well, we don't have such hardware, but batching through IRQ is truly HW accelerated, and that was our idea with cedrus (though we had 1 buffer with N slices in mind)
   ezequielg: the key is "not necessarily right now"
   ndufresne: 4) is true with cedrus: as we don't know the number of slices, and slices don't have to be even in their amount of bits, we end up with a higher "unused" amount of memory when there are multiple slices, and that problem goes away if you implement batching, which is to append as many slices (up to 16) as fit into the v4l2_buffer
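   <i>A minimal sketch of that batching idea: append up to 16 slices back to back into one OUTPUT buffer and record each slice's position in a per-slice params array. The struct below is only a stand-in for the kernel's per-slice control, and the field names (start_byte_offset, size) are an assumption based on the pre-merge H.264 stateless controls.</i>
   <pre>
#include &lt;string.h&gt;

#define MAX_SLICES_PER_BUFFER 16

/* Stand-in for the kernel's per-slice params struct (assumed fields). */
struct slice_params {
    unsigned int start_byte_offset;
    unsigned int size;
};

struct slice {
    const void *data;
    unsigned int size;
};

/*
 * Pack up to 16 slices into one bitstream buffer.  Returns the number of
 * slices packed; *bytesused receives the total payload.  Slices that do
 * not fit are left for the next buffer/request.
 */
static unsigned int pack_slices(void *dst, unsigned int capacity,
                                const struct slice *slices, unsigned int count,
                                struct slice_params *params,
                                unsigned int *bytesused)
{
    unsigned int i, offset = 0;

    for (i = 0; i < count && i < MAX_SLICES_PER_BUFFER; i++) {
        if (offset + slices[i].size > capacity)
            break;

        memcpy((unsigned char *)dst + offset, slices[i].data, slices[i].size);
        params[i].start_byte_offset = offset;
        params[i].size = slices[i].size;
        offset += slices[i].size;
    }

    *bytesused = offset;
    return i;
}
   </pre>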
   ezequielg: we don't have to fix the world at once on this proposal.
   <br> we just need a plan forward.
   <br> which we have.
   tfiga: <u>ezequielg</u>: yes, we need a plan
   <br> and I don't see it
   ezequielg: the plan is a new control or set of controls.
   <br> which we know is totally possible.
   <br> so i don't see why we'd block the current work because of that
   tfiga: would you mind replying to the problems I listed in my reply?
   ndufresne: I think 2, 3 and 4 were covered by the 16-slice array and its offsets, and 1 is an issue for something we don't even plan to ever support
   ezequielg: sure, i will reply on the ML. I mostly wanted to clarify on my first question about those fields.
   ndufresne: now the problem is more techy, the folks (us) working on finishing this API have no direct motivation to work on cedrus, I'd like to be honest here
   tfiga: <u>ndufresne</u>: I don't understand why 1) is unrelated
   <br> if you have many slices, you would end up stalling the pipeline quite fast
   ndufresne: it was related to another thread, where we suggested to do 1 buffer per slice, and N buffers per request
   <br> and I believe we discarded that idea
   tfiga: no, that would still be 1 buffer per request
   <br> but N slices per buffer
   <br> or we could increase NUM_BUFFERS
   ndufresne: if it's 1 buffer per request, why would 32 buffers be a problem ?
   tfiga: because NUM_BUFFERS is 32
   ndufresne: if you have 1 buffer per request, you just have to wait for some of your requests to finish
   tfiga: yes, and you end up doing so every slice
   <br> after you fill the pipeline
   ndufresne: a 32-slice batch is more than sufficient, in fact we had 16 suggested in the current API
   tfiga: so N times per frame of waiting for dqbuf
   ndufresne: no, you only wait in re-ordered order, at least in gst that's what I do
   <br> in gst I don't even wait for every frame
   <br> if you have b-frames
   <br> and if it was implemented as designed, it would be N/16
   <br> for slices
   <br> anyway, all very techy, but to be fair, I think this "design" optimization isn't proven with any H.264 (cause yes, this is just 1 codec, we can do things differently with the reality of other codecs)
   tfiga: I'm talking about OUTPUT (bitstream) buffers
   <br> let's say you have 32 slices and 32 buffers
   ndufresne: same here
   tfiga: how do you queue the bitstream to the driver?
   ndufresne: that's an odd question, QBUF ?
   tfiga: POLL() for a free buffer, DQBUF, fill in the bitstream, QBUF, repeat
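   <i>A minimal sketch of that per-slice submission loop with 1 slice per buffer: once every OUTPUT buffer has been queued, each further slice first has to wait for the driver to hand a bitstream buffer back. Single-planar MMAP queue assumed; fill_slice() is a hypothetical helper and request setup and error handling are omitted.</i>
   <pre>
#include &lt;poll.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/ioctl.h&gt;
#include &lt;linux/videodev2.h&gt;

/* Hypothetical helper that copies the slice into the mmap()ed buffer. */
void fill_slice(unsigned int buf_index, const void *slice, unsigned int size);

static int submit_slice(int video_fd, const void *slice, unsigned int size)
{
    struct pollfd pfd = { .fd = video_fd, .events = POLLOUT };
    struct v4l2_buffer buf;

    /* Wait for a free OUTPUT (bitstream) buffer. */
    poll(&pfd, 1, -1);

    memset(&buf, 0, sizeof(buf));
    buf.type = V4L2_BUF_TYPE_VIDEO_OUTPUT;
    buf.memory = V4L2_MEMORY_MMAP;
    if (ioctl(video_fd, VIDIOC_DQBUF, &buf))
        return -1;

    fill_slice(buf.index, slice, size);
    buf.bytesused = size;

    return ioctl(video_fd, VIDIOC_QBUF, &buf);
}
   </pre>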
   ndufresne: well, you can optimize away a few of the POLLs if you are not pressured by memory
   <br> I implemented some of it
   <br> it's not just driver that can optimize things
   tfiga: no, if you already queued 32 slices
   <br> and that would be a common case
   <br> if you have enough buffers to queue 2 frames, you could just keep queuing
   ndufresne: you can, as an example, optimize in userspace by picking the middle point in the list of pending requests
   tfiga: and then synchronize on a full frame decode
   <br> and that would be the place to dqbuf all the free buffers
   ndufresne: and wait for that one, then you have N/2 bitstream buffers you can immediately dequeue with a single poll
   <br> you can do that because requests are executed in order
   <br> in fact, you can do that because requests behave exactly like fences
   tfiga: okay, I forgot one can poll on a request fd
   ndufresne: in fact, you have to, since it's racy if you poll the queue
   tfiga: okay, I suppose that solves some of the problems
   ndufresne: in terms of the number of poll() calls, the requests are doing a great job here
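   <i>A minimal sketch of the "requests behave like fences" point: instead of polling the video node per slice, wait on one request fd picked from the pending list (e.g. the middle one); since requests on a stateless decoder finish in submission order, all earlier requests are then complete and their bitstream buffers can be dequeued without further polling.</i>
   <pre>
#include &lt;poll.h&gt;

static int wait_for_request(int request_fd)
{
    struct pollfd pfd = { .fd = request_fd, .events = POLLPRI };

    /* A queued media request fd signals POLLPRI once it has completed. */
    return poll(&pfd, 1, -1) == 1 ? 0 : -1;
}
   </pre>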
   tfiga: <u>ndufresne</u>: would you mind replying to my email as well?
   ndufresne: but I must admit, it's quite painful to have to DQBUF
   tfiga: <u>ezequielg</u>: thanks in advance too
   ndufresne: sure
   tfiga: thanks!
   ndufresne: ok, hopefully I managed a useful answer
   cocus: hello! quick question, slightly off topic, which community maintains the aux-display drivers? is it part of the media-tree as well?