[19:16] <ndufresne> tfiga: not sure if Alex is on IRC or not, but thanks for posting, I walked through the code and only slice_params[0] is actually used by the driver, and only through the ref list generator, so that will completely go away with ezequielg's changes
[19:17] <ndufresne> tfiga: and this driver is closer to DXVA2, it does not seem to use the two bit_sizes, basically only the ref lists p/b0/b1 are the "special" needs
[19:17] <tfiga> ndufresne: gnurou is Alex :)
[19:18] <ndufresne> ah great, didn't know your nickname gnurou
[19:18] <tfiga> Although he's probably asleep right now
[19:18] <ndufresne> So basically the suggestion to make the slice parameters non-array kind of makes sense, frame-based decoders don't need that control at all
[19:19] <ndufresne> and for slice decoders, you have to request decoding of each slice separately anyway
[19:20] <ndufresne> Anyway, tfiga gnurou we'll try and get an RFC sorted within the "making public" API patchset
[19:20] <ndufresne> otherwise we are kind of forced to solve an unsolvable problem, which is support for variable-size controls
[19:21] <tfiga> I think we still need to solve a hard API problem regardless of that
[19:22] <tfiga> either approach is currently inefficient
[19:22] <tfiga> passing slices separately multiplies the number of system calls significantly
[19:22] <tfiga> if you have many slices of course
[19:24] <tfiga> and with a big number of slices it could be impossible to decode even a single frame without doing the dequeue/queue exchange
[19:24] <ndufresne> tfiga: apple trailers have a dynamic number of slices, but sometimes it goes over 16 frames
[19:24] <ndufresne> sorry, slices
[19:24] <ndufresne> that being said, this is a fairly old copy of apple trailers I had around, might not represent the current way to encode stuff
[19:25] <tfiga> so it is only better than the slice array approach in the fact that it can theoretically handle any unbounded number of slices
[19:26] <ndufresne> but yeah, the per-frame ioctl count is much bigger if you can't combine, but in the absence of a kernel implementation of an API I'm very uncomfortable pushing forward
[19:26] <tfiga> what do you mean by absence of kernel implementation?
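tfiga's point about system-call multiplication can be roughed out numerically. The following is only a back-of-the-envelope sketch: the four-ioctls-per-request figure is an assumption for illustration, not something either API mandates.

```c
/*
 * Rough ioctl accounting for submitting one frame of n_slices slices.
 * IOCTLS_PER_REQUEST is a hypothetical figure (e.g. S_EXT_CTRLS on the
 * request, QBUF with the request fd, queueing the request, and one
 * dequeue/reinit step once it completes).
 */
#define IOCTLS_PER_REQUEST 4

/* One request per slice. */
static int ioctls_slice_per_request(int n_slices)
{
    return n_slices * IOCTLS_PER_REQUEST;
}

/* One request per batch of up to 16 slices (the Chromium-style limit). */
static int ioctls_batched(int n_slices)
{
    int n_batches = (n_slices + 15) / 16;
    return n_batches * IOCTLS_PER_REQUEST;
}
```

Under these assumptions a 32-slice frame costs 128 ioctls slice-by-slice versus 8 batched, which is the inefficiency being weighed against the 16-slice cap.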
[19:26] <ndufresne> the only possible user is the Cedrus driver, and they already had good performance back when they had to requeue the capture buffer between every slice
[19:26] <ndufresne> tfiga: there is no driver that ever uses slice parameters past index 0
[19:27] <tfiga> I think it's only because cedrus initially *misimplemented* our Chromium API
[19:27] <tfiga> to actually know if they get good performance or not, someone would have to implement the other and compare
[19:27] <tfiga> and present some numbers
[19:28] <tfiga> unfortunately we don't have any cedrus hardware
[19:28] <ndufresne> well, the chromium API is flawed by an arbitrary 16-slice limit, what should they have done about it? and we need a low-latency method
[19:28] <tfiga> no, it's not
[19:28] <tfiga> it's 16 slices per buffer
[19:28] <tfiga> but you can still have more buffers
[19:28] <ndufresne> tfiga: if you need one, I got a libre computer tritium, cheap and easy to use
[19:29] <tfiga> I wouldn't have time to do this kind of testing myself sadly
[19:30] <ndufresne> I think it's the first time I see the intention of this API being written down
[19:32] <tfiga> what do you mean?
[19:32] <ndufresne> It's the first time someone has actually explicitly stated that it was just a batching API, that one could pass from 1 to 16 slices per buffer
[19:33] <tfiga> I believe I mentioned that several times in the past when the API was being discussed
[19:33] <tfiga> didn't call that "batching" probably
[19:33] <tfiga> also in the end we didn't implement that anywhere in Chromium, because it turned out that Rockchip deals with full frames
[19:33] <ndufresne> well, folks have been complaining over and over that 16 was not enough, not sure why those complaints were just ignored, but anyway, without the buffer HOLDING mechanism, not sure how that would have worked
[19:34] <tfiga> yeah, that was the missing bit
[19:34] <ndufresne> ok, so we need a userspace implementation of that mechanism, even if it's just a test and a cedrus implementation imho
[19:34] <ndufresne> we need to prove our API before turning it public
[19:34] <tfiga> yes
[19:35] <tfiga> and to be honest, I'd agree that 1 slice per buffer is cleaner
[19:35] <tfiga> but it's just crippled by the API limitations today
[19:35] <ndufresne> I think I can demo this, tbh, I've tested a lot of streams now on Cedrus, and the CPU overhead difference with Hantro is so small
[19:35] <ndufresne> just the startcode scanning costs more than this ;-P
[19:36] <tfiga> well, I guess neither is an x86 system
[19:37] <tfiga> so possibly there isn't too much speculative exploit mitigation in the system calls
[19:38] <tfiga> but I would still check things like idle residency
[19:38] <tfiga> and not only the raw CPU usage
[19:38] <ndufresne> I didn't do serious enough testing, that's for sure
[19:39] <ndufresne> batching with GPU commands is pretty common, but it's done naturally from the pushback of the previous command being sent
[19:39] <tfiga> also there is the problem of buffer size
[19:39] <tfiga> I suppose some slices could be quite big
[19:39] <tfiga> does it mean that we need to allocate NUM_SLICES * MAX_ASSUMED_SLICE_SIZE buffers?
[19:39] <ndufresne> we mostly all do the same, we calculate the raw size of the frame and use that as the bitstream buffer size
[19:40] <ndufresne> it's a huge waste of course
[19:40] <ndufresne> but if we don't have 1 buffer per frame (or per slice) we end up having to bounce in the driver
[19:40] <tfiga> because cedrus has some alignment restrictions on the bitstream?
[19:41] <ndufresne> no, I meant if you split your bitstream over multiple buffers, because a slice overflows the buffer size, you get randomly aligned buffers
[19:42] <ndufresne> while the HW wants 1 memory pointer with all the data
[19:42] <ndufresne> that was a suggestion at some point from hverkuil, to allow sending 1 frame over multiple bitstream buffers
[19:43] <ndufresne> (similar to how OMX or DXVA2 works, except that we are in the kernel, those APIs are in userspace)
[19:45] <ndufresne> tfiga: anyway, I think it's the wrong direction, we use CREATE_BUFS when an overflow is detected, and introduce DELETE_BUFS to be able to get rid of the extra unneeded buffers
[19:45] <ndufresne> * we should use
[19:45] <ndufresne> this way the driver does not need to do anything fancy
[19:46] <tfiga> you mean, if 1 slice overflows
[19:46] <tfiga> but that shouldn't be split between buffers then
[19:46] <ndufresne> yes, if 1 slice overflows your prediction
[19:46] <tfiga> the approach with CREATE_BUFS doesn't solve the problem entirely
[19:47] <ndufresne> how so?
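The sizing heuristic ndufresne describes ("calculate the raw size of the frame and use that as bitstream buffer size") amounts to the following for 8-bit 4:2:0 content. This is a sketch of that common userspace convention, not a requirement imposed by any driver.

```c
#include <stdint.h>

/*
 * Worst-case OUTPUT (bitstream) buffer size heuristic: assume the
 * compressed frame never exceeds the raw frame. For 8-bit 4:2:0 the
 * raw size is width * height * 3 / 2 (luma plane plus two quarter-size
 * chroma planes). As noted in the discussion, this wastes a lot of
 * memory but avoids bouncing in the driver when a slice is large.
 */
static uint32_t bitstream_buf_size_420(uint32_t width, uint32_t height)
{
    return width * height * 3 / 2;
}
```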
[19:47] <tfiga> because we end up reallocating bigger buffers and keeping the higher memory footprint just for the one slice that overflowed
[19:47] <ndufresne> it's userspace's decision to allocate the rest bigger, or to just later discard the exception
[19:48] <ndufresne> * it could be userspace's decision
[19:48] <tfiga> if we could put multiple slices in one buffer, we could fill the buffers as much as possible
[19:49] <ndufresne> well, in theory, in the current API we could
[19:49] <ndufresne> since the slice params include an offset relative to the bitstream buffer
[19:49] <tfiga> even without that, one can just requeue the same buffer multiple times with a different data_offset
[19:50] <tfiga> but not in MMAP mode
[19:50] <ndufresne> I believe we concluded that data_offset was only written by the driver
[19:50] <tfiga> that's for CAPTURE buffers
[19:50] <ndufresne> now I'm confused
[19:50] <tfiga> and for OUTPUT buffers it's controlled by the application
[19:51] <ndufresne> hmm, so importing a dmabuf on the output side with data_offset is fine then !?!
[19:51] <tfiga> yes, but it's not mandatory for a driver to support it
[19:52] <ndufresne> ah, well, if it's not mandatory to support it, it's not fine at all
[19:52] <tfiga> I suppose we could make it mandatory for stateless codecs
[19:52] <ndufresne> well, it should be mandatory for encoders, stateless or full
[19:53] <ndufresne> if you are lucky you can use the non-mplane format and a single buffer to import it
[19:53] <tfiga> not sure about stateful decoders, but for encoders it should be doable
[19:53] <ndufresne> but consider MFC
[19:53] <ndufresne> that being said, having a fully symmetric API would be less confusing
[19:54] <ndufresne> I believe MFC always uses two buffers (one for each plane)
[19:54] <tfiga> it doesn't have to in newer revisions
[19:55] <tfiga> Exynos5 for sure
[19:55] <tfiga> at least for the encoder
[19:55] <ndufresne> good to know, I haven't used exynos HW for over 5 years now
[19:55] <tfiga> the decoder is a bit more quirky
[19:55] <tfiga> well, we're not using it either
[19:56] <tfiga> AFAIR the last exynos platforms on Chrome OS became EOL last year
[19:57] <ndufresne> ok, to go back to stateless, do you think it's worth documenting and testing the ability to submit a request for N slices (with a maximum of 16 being well documented this time)?
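The pack-multiple-slices-per-OUTPUT-buffer idea discussed above (requeue the same buffer with a different data_offset per slice) could look like this on the userspace side. `pack_slices()` is a hypothetical helper invented for this sketch, and the 256-byte alignment is a made-up placeholder, since, as noted, real hardware alignment constraints are not reported to userspace today.

```c
#include <stdint.h>
#include <string.h>

#define SLICE_ALIGN 256u  /* placeholder; not a real reported constraint */

/* Round x up to the next multiple of SLICE_ALIGN. */
static uint32_t align_up(uint32_t x)
{
    return (x + SLICE_ALIGN - 1) & ~(SLICE_ALIGN - 1);
}

/*
 * Copy each slice into buf at an aligned offset and record the offsets
 * the caller would later place in v4l2_buffer.m.planes[0].data_offset
 * when requeueing the same buffer for each slice.
 * Returns total bytes used, or 0 if the buffer is too small.
 */
static uint32_t pack_slices(uint8_t *buf, uint32_t buf_size,
                            const uint8_t **slices, const uint32_t *sizes,
                            int n, uint32_t *offsets)
{
    uint32_t pos = 0;

    for (int i = 0; i < n; i++) {
        pos = align_up(pos);
        if (pos + sizes[i] > buf_size)
            return 0;  /* caller falls back to CREATE_BUFS or splitting */
        memcpy(buf + pos, slices[i], sizes[i]);
        offsets[i] = pos;
        pos += sizes[i];
    }
    return pos;
}
```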
[19:57] <tfiga> I'd at least go for some testing
[19:57] <tfiga> even with some really hacky code
[19:57] <ndufresne> ok
[19:58] <ndufresne> I have the code in gst to submit multiple slice parameters, so that should do, I'll take a stream with less than 16 slices, so I don't have to worry about multiple submissions per frame
[19:58] <ndufresne> by submission I mean having to queue the request and create another one
[19:59] <ndufresne> (it's called submit in both gst and ffmpeg, that's why I use this term)
[19:59] <tfiga> okay, that should make it easier then
[19:59] <tfiga> and we also need some changes for cedrus
[20:00] <ndufresne> tfiga: and I think I'll document that FRAME_BASED + NO_START_CODE should simply be considered a bad parameter
[20:00] <tfiga> I suppose the simplest approach could be to maintain a slice counter
[20:00] <ndufresne> tfiga: yes, I'll check, perhaps there is already some of that implemented
[20:00] <tfiga> and then not remove the OUTPUT buffer from the m2m queue until the last slice is done
[20:01] <tfiga> then the m2m framework would take care of repeating the buffer for each slice
[20:02] <ndufresne> yeah, that's the bit that seems rather complicated, since in terms of the kernel this is 1 request, but in terms of driving the HW this is N
[20:02] <ndufresne> might be why it hurts for drivers, and long term we'd likely need the framework to make it look like N requests, who knows
[20:02] <tfiga> I think it would make it quite simple if we do what I suggested above
[20:03] <tfiga> I suppose cedrus uses the vb2 request helpers
[20:03] <ndufresne> it does
[20:05] <tfiga> yeah, so I guess as long as we don't call m2m_vb2_buf_done(), the request would stay
[20:06] <tfiga> although I wonder if it couldn't end up running some request operations multiple times
[20:06] <tfiga> which would actually introduce overhead...
[20:07] <ndufresne> sorry, what do you refer to by "request operation"?
[20:08] <ndufresne> I thought we had access to the controls as pointers (e.g. copied only once)
[20:08] <ndufresne> so I don't know what internals may cause overhead
[20:08] <tfiga> okay, maybe I was wrong
[20:08] <tfiga> yeah, I think there is just a control handler in each request
[20:11] <tfiga> oops, nope
[20:11] <tfiga> https://elixir.bootlin.com/linux/latest/source/drivers/staging/media/sunxi/cedrus/cedrus_dec.c#L25
[20:11] <ndufresne> that's odd, I was checking gnurou's code, and he seems to get a pointer there
[20:12] <tfiga> if we make the m2m helpers repeat the same OUTPUT buffer for each slice, cedrus_device_run() would end up calling v4l2_ctrl_request_setup() for each slice
[20:12] <tfiga> and v4l2_ctrl_request_setup() basically copies from the request control handler to the driver control handler
[20:13] <ndufresne> tfiga: well, you have to let other processes use the HW based on scheduling
[20:14] <ndufresne> so there is no choice but to fully re-program the HW, but if there is a copy of some frame-invariant per-slice controls, that should be optimized in the driver I suppose
[20:14] <tfiga> that's a per-context control handler
[20:14] <tfiga> I suspect the call to v4l2_ctrl_request_setup() isn't even necessary
[20:14] <tfiga> would need to check gnurou's code
[20:15] <tfiga> but I suspect it's indeed possible to just get the control pointers directly from the request
[20:15] <ndufresne> he's got a nice little get_ctrl_ptr() helper, which simply gets the pointer
[20:15] <ndufresne> but then he copies to a HW-specific structure, but that HW seems more like a GPU, you speak to a firmware
[20:18] <tfiga> hmm, that's still not the case
[20:19] <tfiga> the code there still calls v4l2_ctrl_request_setup() and then get_ctrl_ptr() gets the pointer to the control handler for the context
[20:19] <tfiga> so there is a copy of all the controls from the request to the context control handler
[20:21] <tfiga> so for testing purposes, I guess one would have to call v4l2_ctrl_request_setup() for the first slice
[20:21] <tfiga> and then v4l2_ctrl_request_complete() after the last slice
[20:22] <ndufresne> why do we make this copy?
[20:22] <tfiga> I think it's just how the requests are supposed to work
[20:22] <tfiga> any control that is in the request needs to be applied to the context
[20:22] <tfiga> and persist after the request is complete
[20:22] <ndufresne> but the request itself isn't sufficient as a context?
[20:22] <tfiga> but it doesn't make much sense for stateless decoders, where one needs to set the controls every slice/frame
[20:23] <tfiga> no, in the general case it isn't, because it can include only an arbitrary subset of controls
[20:23] <tfiga> although this is a bit of an implementation detail
[20:23] <ndufresne> basically, what's the point of the request if we can't freeze it (after it's queued) and read directly from it
[20:24] <tfiga> it allows attaching controls to buffers
[20:24] <tfiga> and right now that's it
[20:24] <ndufresne> hmm, right, another thing we didn't test yet
[20:24] <ndufresne> in ffmpeg/gst, we set all the controls per slice submission
[20:25] <tfiga> it's basically a way to defer the S_EXT_CTRLS operation to the time the buffer processing begins
[20:26] <ndufresne> so basically it assumes that userspace will set a subset of the required controls in the request, and when the processing starts, it copies the controls from the request, and completes them with previously submitted values?
[20:27] <ndufresne> I have a feeling there are too many ways to do the same thing in that API
[20:28] <ndufresne> ah, I think I understand better, it has a copy of the last job's controls, and copies that "delta" from the request into it
[20:29] <ndufresne> tfiga: so would a slice countdown work? or would it be too hacky
[20:31] <tfiga> yes, basically that's how it works :)
[20:31] <tfiga> I think it would work, with some changes to cedrus_device_run() to avoid repeating v4l2_ctrl_request_setup()
[20:32] <ndufresne> ok, I would need to read more code, we also need _run() to be called again, that is likely not auto-magic
[20:47] <ndufresne> tfiga: ok, looking at this, I think we likely want to process all the slices in one entry from the job_queue, otherwise it seems to clash with the design
[21:47] <tfiga> ndufresne: possibly. I guess paulk-leonov and others would know better than me :P
[21:51] <ndufresne> tfiga: yep, I've followed up on #cedrus, jernej believes that it's more efficient to trigger again from the IRQ
[21:51] <ndufresne> and reading the code, seems the most sensible thing to do too
[21:52] <ndufresne> so basically handle all N slices in one context
[21:52] <tfiga> hmm, I see
[21:52] <tfiga> in hantro, device_run would be executed from the IRQ context
[21:52] <tfiga> (AFAIR...)
[21:52] <ndufresne> as this way we can also reduce the HW programming overhead (that's cedrus specific though)
[21:53] <ndufresne> in the case of cedrus, there would be one device_run() but the completion would trigger after N interrupts
[21:53] <tfiga> indeed, there would be no need to repeat the common settings
[21:53] <ndufresne> the setup() calls would need to be split between frame-invariant and per-slice setup
[21:54] <tfiga> okay, thanks a lot for looking into this
[22:00] <ndufresne> well, you are the one who unlocked this situation, I was missing some understanding that will be key in finishing and proof-testing this API
[22:01] <ndufresne> tfiga: btw, I just learned that cedrus can be fed slices in parts, as long as you can signal the start/end
[22:04] <jernej> ndufresne: there seems to be some alignment consideration in such a case
[22:12] <ndufresne> hmm, that would be the only blocker, as we don't have any means currently to signal that alignment back to userspace
[22:14] <ndufresne> but we didn't plan to implement this for now anyway
[22:53] <ezequielg> tfiga: ndufresne: .device_run doesn't ever run in atomic context, due to v4l2_ctrl_request_setup and how we handle controls.
[22:54] <tfiga> ouch
[22:54] <ezequielg> yup.
[22:54] <ezequielg> we can/could fix that, if we introduce some clean semantics to signal that v4l2_ctrl_request_setup won't be called.
[23:09] <tfiga> I suppose it's primarily an issue with the locking design of the control framework?
[23:10] <tfiga> although the way the requests are implemented there is also quite inefficient
[23:10] <tfiga> see also the discussion above about copying the controls
[23:28] <ezequielg> exactly.
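The slice-countdown scheme discussed above (apply the request controls once for the first slice, retrigger the engine per slice from the IRQ, and only complete the OUTPUT buffer and request after the last slice) can be modelled with stubs. Everything below is a userspace simulation: fake_device_run(), fake_irq(), fake_request_setup() and fake_buf_done() are invented stand-ins for cedrus_device_run(), the decode-complete interrupt handler, v4l2_ctrl_request_setup() and m2m_vb2_buf_done(), written only to show the control flow being proposed.

```c
/* Per-frame decode job with a slice countdown, as sketched by tfiga. */
struct slice_job {
    int slices_left;   /* decremented on each decode-complete IRQ */
    int buf_done;      /* set once, when the last slice finishes */
    int setup_calls;   /* how many times request controls were applied */
};

/* Stand-in for v4l2_ctrl_request_setup(): copy controls to the context. */
static void fake_request_setup(struct slice_job *job)
{
    job->setup_calls++;
}

/* Stand-in for m2m_vb2_buf_done() / v4l2_ctrl_request_complete(). */
static void fake_buf_done(struct slice_job *job)
{
    job->buf_done = 1;
}

/* device_run for a frame of n_slices: apply controls once, up front. */
static void fake_device_run(struct slice_job *job, int n_slices)
{
    job->slices_left = n_slices;
    job->buf_done = 0;
    job->setup_calls = 0;
    fake_request_setup(job);   /* once per frame, not once per slice */
}

/*
 * IRQ handler: in the real driver this would program the next slice and
 * retrigger the engine; the request completes only on the last slice.
 */
static void fake_irq(struct slice_job *job)
{
    if (--job->slices_left == 0)
        fake_buf_done(job);
}
```

The point of the model is that setup_calls stays at 1 however many slices the frame has, which is exactly the repeated v4l2_ctrl_request_setup() overhead the change to cedrus_device_run() would avoid.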