Hi folks,
I've found that with some DVB tuner drivers poll() returns when there are only small (1-3) number of TS packets available to read(). This results in high CPU usage of the TS buffer thread which is busy reading small chunks of data in cTSBuffer::Action(). 2 out of 3 tuners I tested were affected with this issue. Even with a "good" tuner TS buffer thread CPU usage is up to 10% per HD stream on ARM Cortex-A7. With the proposed patch it is below 1-2% on all ARM and x86_64 platforms I've tested. The delay value of 10 ms can be considered safe:
media/dvb-core/dmxdev.h:109 #define DVR_BUFFER_SIZE (10*188*1024)
It will take a tuner to receive (10*188*1024)*8*(1000/10) / 1000000 = 1540 Mbit/s to overflow the device buffer within 10 ms interval. A smaller delay is not enough for ARM. cDvbDevice has a ring buffer of 5MB which is larger.
This patch was made against VDR 2.3.1, but it can be applied to VDR 2.2.0 as well.
Please review. glenvt18 --- Index: b/device.c =================================================================== --- a/device.c 2015-09-18 01:04:12.000000000 +0300 +++ b/device.c 2016-03-10 03:38:50.078400715 +0300 @@ -1768,6 +1768,8 @@ break; } } + else + cCondWait::SleepMs(10); } } }
On 10.03.2016 02:54, glenvt18 wrote:
Hi folks,
I've found that with some DVB tuner drivers poll() returns when there are only small (1-3) number of TS packets available to read(). This results in high CPU usage of the TS buffer thread which is busy reading small chunks of data in cTSBuffer::Action(). 2 out of 3 tuners I tested were affected with this issue. Even with a "good" tuner TS buffer thread CPU usage is up to 10% per HD stream on ARM Cortex-A7. With the proposed patch it is below 1-2% on all ARM and x86_64 platforms I've tested. The delay value of 10 ms can be considered safe:
media/dvb-core/dmxdev.h:109 #define DVR_BUFFER_SIZE (10*188*1024)
It will take a tuner to receive (10*188*1024)*8*(1000/10) / 1000000 = 1540 Mbit/s to overflow the device buffer within 10 ms interval. A smaller delay is not enough for ARM. cDvbDevice has a ring buffer of 5MB which is larger.
This patch was made against VDR 2.3.1, but it can be applied to VDR 2.2.0 as well.
Please review. glenvt18
Index: b/device.c
--- a/device.c 2015-09-18 01:04:12.000000000 +0300 +++ b/device.c 2016-03-10 03:38:50.078400715 +0300 @@ -1768,6 +1768,8 @@ break; } }
else
cCondWait::SleepMs(10); } } }
I'm not too fond of the idea of introducing an additional, unconditional wait here. The actual problem should be fixed within the drivers. However, maybe waiting in case there is only a small number of TS packets available is acceptable:
--- device.c 2015/09/05 11:42:17 4.2 +++ device.c 2016/03/10 09:34:11 @@ -1768,6 +1768,8 @@ break; } } + else if (r < MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE) + cCondWait::SleepMs(10); } } }
The number MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE is just a random pick to avoid using a concrete literal number here. Can you please test if this still solves your problem?
Klaus
Hi Klaus,
I've tested your patch on x86_64 and ARM (I picked the weakest HW I have). I've gathered some statistics:
Notes. A "good" tuner there means the one that returns bigger chunks on average and generally consumes less CPU cycles with a vanilla (not patched) VDR. CPU usage means the CPU usage of the TS buffer thread reported by top -d2. Bit rates are average and coarse. I didn't include unconditional delay dumps. They are "smooth" like those 8 and 15 Mbit/s ARM figures. CPU usage on x86_64 is typically 1.5-2 times less.
Some observations: 1. Small reads are really expensive and must be avoided. 2. "Good" tuners are not that "good" at higher bit rates. 3. An "non-delayed" read() (the last read() in the sequence poll()->read()->sleep()->poll()->read()->poll()->read()) is very likely to return a small chunk.
With bit rates higher that 30 Mbit/s I got device buffer overflows on ARM - ring buffer processing couldn't catch up.
So, generally, it solves the issue, but is not as efficient as with an unconditional delay. Compared to the unconditional patch: 1. CPU usage is higher and not as steady. 2. The read() sizes are not smooth too.
Those CPU usage percents (without the patch) may look small, but it's 20-35% of the whole VDR CPU usage and are several times less with the patch, conditional or not.
I understand, an unconditional delay here looks a bit scaring. How about increasing the threshold value to, say, 100000, or 500 * TS_SIZE or even higher? In other words, treat delayed read() as a normal operation, and not delayed one as an emergency case. I can test it on x86_64 with 50-60 Mbit/s. What do you think of it?
Best, glenvt18.
2016-03-10 12:41 GMT+03:00 Klaus Schmidinger Klaus.Schmidinger@tvdr.de:
On 10.03.2016 02:54, glenvt18 wrote:
Hi folks,
I've found that with some DVB tuner drivers poll() returns when there are only small (1-3) number of TS packets available to read(). This results in high CPU usage of the TS buffer thread which is busy reading small chunks of data in cTSBuffer::Action(). 2 out of 3 tuners I tested were affected with this issue. Even with a "good" tuner TS buffer thread CPU usage is up to 10% per HD stream on ARM Cortex-A7. With the proposed patch it is below 1-2% on all ARM and x86_64 platforms I've tested. The delay value of 10 ms can be considered safe:
media/dvb-core/dmxdev.h:109 #define DVR_BUFFER_SIZE (10*188*1024)
It will take a tuner to receive (10*188*1024)*8*(1000/10) / 1000000 = 1540 Mbit/s to overflow the device buffer within 10 ms interval. A smaller delay is not enough for ARM. cDvbDevice has a ring buffer of 5MB which is larger.
This patch was made against VDR 2.3.1, but it can be applied to VDR 2.2.0 as well.
Please review. glenvt18
Index: b/device.c
--- a/device.c 2015-09-18 01:04:12.000000000 +0300 +++ b/device.c 2016-03-10 03:38:50.078400715 +0300 @@ -1768,6 +1768,8 @@ break; } }
else
cCondWait::SleepMs(10); } } }
I'm not too fond of the idea of introducing an additional, unconditional wait here. The actual problem should be fixed within the drivers. However, maybe waiting in case there is only a small number of TS packets available is acceptable:
--- device.c 2015/09/05 11:42:17 4.2 +++ device.c 2016/03/10 09:34:11 @@ -1768,6 +1768,8 @@ break; } }
else if (r < MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE)
cCondWait::SleepMs(10); } } }
The number MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE is just a random pick to avoid using a concrete literal number here. Can you please test if this still solves your problem?
Klaus
vdr mailing list vdr@linuxtv.org http://www.linuxtv.org/cgi-bin/mailman/listinfo/vdr
UPDATE.
Here are 60 Mbit/s (entire transponder) test results:
- the same behaviour as for 30- Mbit/s, no huge chunks on the second read. Threshold value of 70000 was chosen for demonstration only. It's just a bit smaller than the mean which is about 75000. Notice that the max value is 1.6x higher in the case of vanilla VDR. So, the threshold value of about 100000 is OK.
Best, glenvt18.
2016-03-14 19:05 GMT+03:00 glenvt18 glenvt18@gmail.com:
Hi Klaus,
I've tested your patch on x86_64 and ARM (I picked the weakest HW I have). I've gathered some statistics:
Notes. A "good" tuner there means the one that returns bigger chunks on average and generally consumes less CPU cycles with a vanilla (not patched) VDR. CPU usage means the CPU usage of the TS buffer thread reported by top -d2. Bit rates are average and coarse. I didn't include unconditional delay dumps. They are "smooth" like those 8 and 15 Mbit/s ARM figures. CPU usage on x86_64 is typically 1.5-2 times less.
Some observations:
- Small reads are really expensive and must be avoided.
- "Good" tuners are not that "good" at higher bit rates.
- An "non-delayed" read() (the last read() in the sequence
poll()->read()->sleep()->poll()->read()->poll()->read()) is very likely to return a small chunk.
With bit rates higher that 30 Mbit/s I got device buffer overflows on ARM - ring buffer processing couldn't catch up.
So, generally, it solves the issue, but is not as efficient as with an unconditional delay. Compared to the unconditional patch:
- CPU usage is higher and not as steady.
- The read() sizes are not smooth too.
Those CPU usage percents (without the patch) may look small, but it's 20-35% of the whole VDR CPU usage and are several times less with the patch, conditional or not.
I understand, an unconditional delay here looks a bit scaring. How about increasing the threshold value to, say, 100000, or 500 * TS_SIZE or even higher? In other words, treat delayed read() as a normal operation, and not delayed one as an emergency case. I can test it on x86_64 with 50-60 Mbit/s. What do you think of it?
Best, glenvt18.
2016-03-10 12:41 GMT+03:00 Klaus Schmidinger Klaus.Schmidinger@tvdr.de:
On 10.03.2016 02:54, glenvt18 wrote:
Hi folks,
I've found that with some DVB tuner drivers poll() returns when there are only small (1-3) number of TS packets available to read(). This results in high CPU usage of the TS buffer thread which is busy reading small chunks of data in cTSBuffer::Action(). 2 out of 3 tuners I tested were affected with this issue. Even with a "good" tuner TS buffer thread CPU usage is up to 10% per HD stream on ARM Cortex-A7. With the proposed patch it is below 1-2% on all ARM and x86_64 platforms I've tested. The delay value of 10 ms can be considered safe:
media/dvb-core/dmxdev.h:109 #define DVR_BUFFER_SIZE (10*188*1024)
It will take a tuner to receive (10*188*1024)*8*(1000/10) / 1000000 = 1540 Mbit/s to overflow the device buffer within 10 ms interval. A smaller delay is not enough for ARM. cDvbDevice has a ring buffer of 5MB which is larger.
This patch was made against VDR 2.3.1, but it can be applied to VDR 2.2.0 as well.
Please review. glenvt18
Index: b/device.c
--- a/device.c 2015-09-18 01:04:12.000000000 +0300 +++ b/device.c 2016-03-10 03:38:50.078400715 +0300 @@ -1768,6 +1768,8 @@ break; } }
else
cCondWait::SleepMs(10); } } }
I'm not too fond of the idea of introducing an additional, unconditional wait here. The actual problem should be fixed within the drivers. However, maybe waiting in case there is only a small number of TS packets available is acceptable:
--- device.c 2015/09/05 11:42:17 4.2 +++ device.c 2016/03/10 09:34:11 @@ -1768,6 +1768,8 @@ break; } }
else if (r < MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE)
cCondWait::SleepMs(10); } } }
The number MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE is just a random pick to avoid using a concrete literal number here. Can you please test if this still solves your problem?
Klaus
vdr mailing list vdr@linuxtv.org http://www.linuxtv.org/cgi-bin/mailman/listinfo/vdr
On 18.03.2016 22:14, glenvt18 wrote:
UPDATE.
Here are 60 Mbit/s (entire transponder) test results:
- the same behaviour as for 30- Mbit/s, no huge chunks on the second
read. Threshold value of 70000 was chosen for demonstration only. It's just a bit smaller than the mean which is about 75000. Notice that the max value is 1.6x higher in the case of vanilla VDR. So, the threshold value of about 100000 is OK.
Well, I guess then we can just as well make this unconditional after all. Because any concrete threshold is just guesswork and may only apply to a particular environment.
I have added your original patch to my VDR now and will observe how it behaves with my devices. If all goes well, I'll add it to the official source.
Please send me your full name for the CONTRIBUTORS file (somebody has to take the blame if this ever causes trouble ;-).
Klaus
2016-03-14 19:05 GMT+03:00 glenvt18 glenvt18@gmail.com:
Hi Klaus,
I've tested your patch on x86_64 and ARM (I picked the weakest HW I have). I've gathered some statistics:
Notes. A "good" tuner there means the one that returns bigger chunks on average and generally consumes less CPU cycles with a vanilla (not patched) VDR. CPU usage means the CPU usage of the TS buffer thread reported by top -d2. Bit rates are average and coarse. I didn't include unconditional delay dumps. They are "smooth" like those 8 and 15 Mbit/s ARM figures. CPU usage on x86_64 is typically 1.5-2 times less.
Some observations:
- Small reads are really expensive and must be avoided.
- "Good" tuners are not that "good" at higher bit rates.
- An "non-delayed" read() (the last read() in the sequence
poll()->read()->sleep()->poll()->read()->poll()->read()) is very likely to return a small chunk.
With bit rates higher that 30 Mbit/s I got device buffer overflows on ARM - ring buffer processing couldn't catch up.
So, generally, it solves the issue, but is not as efficient as with an unconditional delay. Compared to the unconditional patch:
- CPU usage is higher and not as steady.
- The read() sizes are not smooth too.
Those CPU usage percents (without the patch) may look small, but it's 20-35% of the whole VDR CPU usage and are several times less with the patch, conditional or not.
I understand, an unconditional delay here looks a bit scaring. How about increasing the threshold value to, say, 100000, or 500 * TS_SIZE or even higher? In other words, treat delayed read() as a normal operation, and not delayed one as an emergency case. I can test it on x86_64 with 50-60 Mbit/s. What do you think of it?
Best, glenvt18.
2016-03-10 12:41 GMT+03:00 Klaus Schmidinger Klaus.Schmidinger@tvdr.de:
On 10.03.2016 02:54, glenvt18 wrote:
Hi folks,
I've found that with some DVB tuner drivers poll() returns when there are only small (1-3) number of TS packets available to read(). This results in high CPU usage of the TS buffer thread which is busy reading small chunks of data in cTSBuffer::Action(). 2 out of 3 tuners I tested were affected with this issue. Even with a "good" tuner TS buffer thread CPU usage is up to 10% per HD stream on ARM Cortex-A7. With the proposed patch it is below 1-2% on all ARM and x86_64 platforms I've tested. The delay value of 10 ms can be considered safe:
media/dvb-core/dmxdev.h:109 #define DVR_BUFFER_SIZE (10*188*1024)
It will take a tuner to receive (10*188*1024)*8*(1000/10) / 1000000 = 1540 Mbit/s to overflow the device buffer within 10 ms interval. A smaller delay is not enough for ARM. cDvbDevice has a ring buffer of 5MB which is larger.
This patch was made against VDR 2.3.1, but it can be applied to VDR 2.2.0 as well.
Please review. glenvt18
Index: b/device.c
--- a/device.c 2015-09-18 01:04:12.000000000 +0300 +++ b/device.c 2016-03-10 03:38:50.078400715 +0300 @@ -1768,6 +1768,8 @@ break; } }
else
cCondWait::SleepMs(10); } } }
I'm not too fond of the idea of introducing an additional, unconditional wait here. The actual problem should be fixed within the drivers. However, maybe waiting in case there is only a small number of TS packets available is acceptable:
--- device.c 2015/09/05 11:42:17 4.2 +++ device.c 2016/03/10 09:34:11 @@ -1768,6 +1768,8 @@ break; } }
else if (r < MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE)
cCondWait::SleepMs(10); } } }
The number MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE is just a random pick to avoid using a concrete literal number here. Can you please test if this still solves your problem?
Klaus
Yes, I too don't see how it can be harmful. In case you want to collect statistics, I did it with this patch (for a single device):
Best, Sergey Chernyavskiy.
2016-03-20 16:07 GMT+03:00 Klaus Schmidinger Klaus.Schmidinger@tvdr.de:
On 18.03.2016 22:14, glenvt18 wrote:
UPDATE.
Here are 60 Mbit/s (entire transponder) test results:
- the same behaviour as for 30- Mbit/s, no huge chunks on the second
read. Threshold value of 70000 was chosen for demonstration only. It's just a bit smaller than the mean which is about 75000. Notice that the max value is 1.6x higher in the case of vanilla VDR. So, the threshold value of about 100000 is OK.
Well, I guess then we can just as well make this unconditional after all. Because any concrete threshold is just guesswork and may only apply to a particular environment.
I have added your original patch to my VDR now and will observe how it behaves with my devices. If all goes well, I'll add it to the official source.
Please send me your full name for the CONTRIBUTORS file (somebody has to take the blame if this ever causes trouble ;-).
Klaus
2016-03-14 19:05 GMT+03:00 glenvt18 glenvt18@gmail.com:
Hi Klaus,
I've tested your patch on x86_64 and ARM (I picked the weakest HW I have). I've gathered some statistics:
Notes. A "good" tuner there means the one that returns bigger chunks on average and generally consumes less CPU cycles with a vanilla (not patched) VDR. CPU usage means the CPU usage of the TS buffer thread reported by top -d2. Bit rates are average and coarse. I didn't include unconditional delay dumps. They are "smooth" like those 8 and 15 Mbit/s ARM figures. CPU usage on x86_64 is typically 1.5-2 times less.
Some observations:
- Small reads are really expensive and must be avoided.
- "Good" tuners are not that "good" at higher bit rates.
- An "non-delayed" read() (the last read() in the sequence
poll()->read()->sleep()->poll()->read()->poll()->read()) is very likely to return a small chunk.
With bit rates higher that 30 Mbit/s I got device buffer overflows on ARM - ring buffer processing couldn't catch up.
So, generally, it solves the issue, but is not as efficient as with an unconditional delay. Compared to the unconditional patch:
- CPU usage is higher and not as steady.
- The read() sizes are not smooth too.
Those CPU usage percents (without the patch) may look small, but it's 20-35% of the whole VDR CPU usage and are several times less with the patch, conditional or not.
I understand, an unconditional delay here looks a bit scaring. How about increasing the threshold value to, say, 100000, or 500 * TS_SIZE or even higher? In other words, treat delayed read() as a normal operation, and not delayed one as an emergency case. I can test it on x86_64 with 50-60 Mbit/s. What do you think of it?
Best, glenvt18.
2016-03-10 12:41 GMT+03:00 Klaus Schmidinger Klaus.Schmidinger@tvdr.de:
On 10.03.2016 02:54, glenvt18 wrote:
Hi folks,
I've found that with some DVB tuner drivers poll() returns when there are only small (1-3) number of TS packets available to read(). This results in high CPU usage of the TS buffer thread which is busy reading small chunks of data in cTSBuffer::Action(). 2 out of 3 tuners I tested were affected with this issue. Even with a "good" tuner TS buffer thread CPU usage is up to 10% per HD stream on ARM Cortex-A7. With the proposed patch it is below 1-2% on all ARM and x86_64 platforms I've tested. The delay value of 10 ms can be considered safe:
media/dvb-core/dmxdev.h:109 #define DVR_BUFFER_SIZE (10*188*1024)
It will take a tuner to receive (10*188*1024)*8*(1000/10) / 1000000 = 1540 Mbit/s to overflow the device buffer within 10 ms interval. A smaller delay is not enough for ARM. cDvbDevice has a ring buffer of 5MB which is larger.
This patch was made against VDR 2.3.1, but it can be applied to VDR 2.2.0 as well.
Please review. glenvt18
Index: b/device.c
--- a/device.c 2015-09-18 01:04:12.000000000 +0300 +++ b/device.c 2016-03-10 03:38:50.078400715 +0300 @@ -1768,6 +1768,8 @@ break; } }
else
cCondWait::SleepMs(10); } } }
I'm not too fond of the idea of introducing an additional, unconditional wait here. The actual problem should be fixed within the drivers. However, maybe waiting in case there is only a small number of TS packets available is acceptable:
--- device.c 2015/09/05 11:42:17 4.2 +++ device.c 2016/03/10 09:34:11 @@ -1768,6 +1768,8 @@ break; } }
else if (r < MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE)
cCondWait::SleepMs(10); } } }
The number MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE is just a random pick to avoid using a concrete literal number here. Can you please test if this still solves your problem?
Klaus
vdr mailing list vdr@linuxtv.org http://www.linuxtv.org/cgi-bin/mailman/listinfo/vdr