Fix TS buffer thread high CPU usage

Message ID 56E14161.3050307@tvdr.de
State New
Headers

Commit Message

Klaus Schmidinger March 10, 2016, 9:41 a.m. UTC
  On 10.03.2016 02:54, glenvt18 wrote:
> Hi folks,
>
> I've found that with some DVB tuner drivers poll() returns when there
> are only small (1-3) number of TS packets available to read(). This
> results in high CPU usage of the TS buffer thread which is busy
> reading small chunks of data in cTSBuffer::Action(). 2 out of 3 tuners
> I tested were affected with this issue. Even with a "good" tuner TS
> buffer thread CPU usage is up to 10% per HD stream on ARM Cortex-A7.
> With the proposed patch it is below 1-2% on all ARM and x86_64
> platforms I've tested. The delay value of 10 ms can be considered
> safe:
>
> media/dvb-core/dmxdev.h:109
> #define DVR_BUFFER_SIZE (10*188*1024)
>
> It will take a tuner to receive (10*188*1024)*8*(1000/10) / 1000000 =
> 1540 Mbit/s to overflow the device buffer within 10 ms interval. A
> smaller delay is not enough for ARM. cDvbDevice has a ring buffer of
> 5MB which is larger.
>
> This patch was made against VDR 2.3.1, but it can be applied to VDR
> 2.2.0 as well.
>
> Please review.
> glenvt18
> ---
> Index: b/device.c
> ===================================================================
> --- a/device.c    2015-09-18 01:04:12.000000000 +0300
> +++ b/device.c    2016-03-10 03:38:50.078400715 +0300
> @@ -1768,6 +1768,8 @@
>                       break;
>                       }
>                    }
> +              else
> +                 cCondWait::SleepMs(10);
>                 }
>              }
>        }

I'm not too fond of the idea of introducing an additional, unconditional
wait here. The actual problem should be fixed within the drivers.
However, maybe waiting in case there is only a small number of TS packets
available is acceptable:


The number MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE is just a random pick
to avoid using a concrete literal number here.
Can you please test if this still solves your problem?

Klaus
  

Comments

glenvt18 March 14, 2016, 4:05 p.m. UTC | #1
Hi Klaus,

I've tested your patch on x86_64 and ARM (I picked the weakest HW I
have). I've gathered some statistics:

http://pastebin.com/kXEP9Bp7

Notes.
A "good" tuner there means the one that returns bigger chunks on
average and generally consumes less CPU cycles with a vanilla (not
patched) VDR.
CPU usage means the CPU usage of the TS buffer thread reported by top -d2.
Bit rates are average and coarse.
I didn't include unconditional delay dumps. They are "smooth" like
those 8 and 15 Mbit/s ARM figures. CPU usage on x86_64 is typically
1.5-2 times less.

Some observations:
1. Small reads are really expensive and must be avoided.
2. "Good" tuners are not that "good" at higher bit rates.
3. An "non-delayed" read() (the last read() in the sequence
poll()->read()->sleep()->poll()->read()->poll()->read()) is very
likely to return a small chunk.

With bit rates higher that 30 Mbit/s I got device buffer overflows on
ARM - ring buffer processing couldn't catch up.

So, generally, it solves the issue, but is not as efficient as with an
unconditional delay. Compared to the unconditional patch:
1. CPU usage is higher and not as steady.
2. The read() sizes are not smooth too.

Those CPU usage percents (without the patch) may look small, but it's
20-35% of the whole VDR CPU usage and are several times less with the
patch, conditional or not.

I understand, an unconditional delay here looks a bit scaring. How
about increasing the threshold value to, say, 100000, or 500 * TS_SIZE
or even higher? In other words, treat delayed read() as a normal
operation, and not delayed one as an emergency case. I can test it on
x86_64 with 50-60 Mbit/s. What do you think of it?

Best,
glenvt18.

2016-03-10 12:41 GMT+03:00 Klaus Schmidinger <Klaus.Schmidinger@tvdr.de>:
> On 10.03.2016 02:54, glenvt18 wrote:
>>
>> Hi folks,
>>
>> I've found that with some DVB tuner drivers poll() returns when there
>> are only small (1-3) number of TS packets available to read(). This
>> results in high CPU usage of the TS buffer thread which is busy
>> reading small chunks of data in cTSBuffer::Action(). 2 out of 3 tuners
>> I tested were affected with this issue. Even with a "good" tuner TS
>> buffer thread CPU usage is up to 10% per HD stream on ARM Cortex-A7.
>> With the proposed patch it is below 1-2% on all ARM and x86_64
>> platforms I've tested. The delay value of 10 ms can be considered
>> safe:
>>
>> media/dvb-core/dmxdev.h:109
>> #define DVR_BUFFER_SIZE (10*188*1024)
>>
>> It will take a tuner to receive (10*188*1024)*8*(1000/10) / 1000000 =
>> 1540 Mbit/s to overflow the device buffer within 10 ms interval. A
>> smaller delay is not enough for ARM. cDvbDevice has a ring buffer of
>> 5MB which is larger.
>>
>> This patch was made against VDR 2.3.1, but it can be applied to VDR
>> 2.2.0 as well.
>>
>> Please review.
>> glenvt18
>> ---
>> Index: b/device.c
>> ===================================================================
>> --- a/device.c    2015-09-18 01:04:12.000000000 +0300
>> +++ b/device.c    2016-03-10 03:38:50.078400715 +0300
>> @@ -1768,6 +1768,8 @@
>>                       break;
>>                       }
>>                    }
>> +              else
>> +                 cCondWait::SleepMs(10);
>>                 }
>>              }
>>        }
>
>
> I'm not too fond of the idea of introducing an additional, unconditional
> wait here. The actual problem should be fixed within the drivers.
> However, maybe waiting in case there is only a small number of TS packets
> available is acceptable:
>
> --- device.c    2015/09/05 11:42:17     4.2
> +++ device.c    2016/03/10 09:34:11
> @@ -1768,6 +1768,8 @@
>                      break;
>                      }
>                   }
> +              else if (r < MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE)
> +                 cCondWait::SleepMs(10);
>                }
>             }
>       }
>
> The number MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE is just a random pick
> to avoid using a concrete literal number here.
> Can you please test if this still solves your problem?
>
> Klaus
>
> _______________________________________________
> vdr mailing list
> vdr@linuxtv.org
> http://www.linuxtv.org/cgi-bin/mailman/listinfo/vdr
  
glenvt18 March 18, 2016, 9:14 p.m. UTC | #2
UPDATE.

Here are 60 Mbit/s (entire transponder) test results:

http://pastebin.com/nHDN8nsW

- the same behaviour as for 30- Mbit/s, no huge chunks on the second
read. Threshold value of 70000 was chosen for demonstration only. It's
just a bit smaller than the mean which is about 75000. Notice that the
max value is 1.6x higher in the case of vanilla VDR. So, the threshold
value of about 100000 is OK.

Best,
glenvt18.


2016-03-14 19:05 GMT+03:00 glenvt18 <glenvt18@gmail.com>:
> Hi Klaus,
>
> I've tested your patch on x86_64 and ARM (I picked the weakest HW I
> have). I've gathered some statistics:
>
> http://pastebin.com/kXEP9Bp7
>
> Notes.
> A "good" tuner there means the one that returns bigger chunks on
> average and generally consumes less CPU cycles with a vanilla (not
> patched) VDR.
> CPU usage means the CPU usage of the TS buffer thread reported by top -d2.
> Bit rates are average and coarse.
> I didn't include unconditional delay dumps. They are "smooth" like
> those 8 and 15 Mbit/s ARM figures. CPU usage on x86_64 is typically
> 1.5-2 times less.
>
> Some observations:
> 1. Small reads are really expensive and must be avoided.
> 2. "Good" tuners are not that "good" at higher bit rates.
> 3. An "non-delayed" read() (the last read() in the sequence
> poll()->read()->sleep()->poll()->read()->poll()->read()) is very
> likely to return a small chunk.
>
> With bit rates higher that 30 Mbit/s I got device buffer overflows on
> ARM - ring buffer processing couldn't catch up.
>
> So, generally, it solves the issue, but is not as efficient as with an
> unconditional delay. Compared to the unconditional patch:
> 1. CPU usage is higher and not as steady.
> 2. The read() sizes are not smooth too.
>
> Those CPU usage percents (without the patch) may look small, but it's
> 20-35% of the whole VDR CPU usage and are several times less with the
> patch, conditional or not.
>
> I understand, an unconditional delay here looks a bit scaring. How
> about increasing the threshold value to, say, 100000, or 500 * TS_SIZE
> or even higher? In other words, treat delayed read() as a normal
> operation, and not delayed one as an emergency case. I can test it on
> x86_64 with 50-60 Mbit/s. What do you think of it?
>
> Best,
> glenvt18.
>
> 2016-03-10 12:41 GMT+03:00 Klaus Schmidinger <Klaus.Schmidinger@tvdr.de>:
>> On 10.03.2016 02:54, glenvt18 wrote:
>>>
>>> Hi folks,
>>>
>>> I've found that with some DVB tuner drivers poll() returns when there
>>> are only small (1-3) number of TS packets available to read(). This
>>> results in high CPU usage of the TS buffer thread which is busy
>>> reading small chunks of data in cTSBuffer::Action(). 2 out of 3 tuners
>>> I tested were affected with this issue. Even with a "good" tuner TS
>>> buffer thread CPU usage is up to 10% per HD stream on ARM Cortex-A7.
>>> With the proposed patch it is below 1-2% on all ARM and x86_64
>>> platforms I've tested. The delay value of 10 ms can be considered
>>> safe:
>>>
>>> media/dvb-core/dmxdev.h:109
>>> #define DVR_BUFFER_SIZE (10*188*1024)
>>>
>>> It will take a tuner to receive (10*188*1024)*8*(1000/10) / 1000000 =
>>> 1540 Mbit/s to overflow the device buffer within 10 ms interval. A
>>> smaller delay is not enough for ARM. cDvbDevice has a ring buffer of
>>> 5MB which is larger.
>>>
>>> This patch was made against VDR 2.3.1, but it can be applied to VDR
>>> 2.2.0 as well.
>>>
>>> Please review.
>>> glenvt18
>>> ---
>>> Index: b/device.c
>>> ===================================================================
>>> --- a/device.c    2015-09-18 01:04:12.000000000 +0300
>>> +++ b/device.c    2016-03-10 03:38:50.078400715 +0300
>>> @@ -1768,6 +1768,8 @@
>>>                       break;
>>>                       }
>>>                    }
>>> +              else
>>> +                 cCondWait::SleepMs(10);
>>>                 }
>>>              }
>>>        }
>>
>>
>> I'm not too fond of the idea of introducing an additional, unconditional
>> wait here. The actual problem should be fixed within the drivers.
>> However, maybe waiting in case there is only a small number of TS packets
>> available is acceptable:
>>
>> --- device.c    2015/09/05 11:42:17     4.2
>> +++ device.c    2016/03/10 09:34:11
>> @@ -1768,6 +1768,8 @@
>>                      break;
>>                      }
>>                   }
>> +              else if (r < MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE)
>> +                 cCondWait::SleepMs(10);
>>                }
>>             }
>>       }
>>
>> The number MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE is just a random pick
>> to avoid using a concrete literal number here.
>> Can you please test if this still solves your problem?
>>
>> Klaus
>>
>> _______________________________________________
>> vdr mailing list
>> vdr@linuxtv.org
>> http://www.linuxtv.org/cgi-bin/mailman/listinfo/vdr
  
Klaus Schmidinger March 20, 2016, 1:07 p.m. UTC | #3
On 18.03.2016 22:14, glenvt18 wrote:
> UPDATE.
>
> Here are 60 Mbit/s (entire transponder) test results:
>
> http://pastebin.com/nHDN8nsW
>
> - the same behaviour as for 30- Mbit/s, no huge chunks on the second
> read. Threshold value of 70000 was chosen for demonstration only. It's
> just a bit smaller than the mean which is about 75000. Notice that the
> max value is 1.6x higher in the case of vanilla VDR. So, the threshold
> value of about 100000 is OK.

Well, I guess then we can just as well make this unconditional after all.
Because any concrete threshold is just guesswork and may only apply to
a particular environment.

I have added your original patch to my VDR now and will observe how it
behaves with my devices. If all goes well, I'll add it to the official
source.

Please send me your full name for the CONTRIBUTORS file (somebody has to
take the blame if this ever causes trouble ;-).

Klaus


> 2016-03-14 19:05 GMT+03:00 glenvt18 <glenvt18@gmail.com>:
>> Hi Klaus,
>>
>> I've tested your patch on x86_64 and ARM (I picked the weakest HW I
>> have). I've gathered some statistics:
>>
>> http://pastebin.com/kXEP9Bp7
>>
>> Notes.
>> A "good" tuner there means the one that returns bigger chunks on
>> average and generally consumes less CPU cycles with a vanilla (not
>> patched) VDR.
>> CPU usage means the CPU usage of the TS buffer thread reported by top -d2.
>> Bit rates are average and coarse.
>> I didn't include unconditional delay dumps. They are "smooth" like
>> those 8 and 15 Mbit/s ARM figures. CPU usage on x86_64 is typically
>> 1.5-2 times less.
>>
>> Some observations:
>> 1. Small reads are really expensive and must be avoided.
>> 2. "Good" tuners are not that "good" at higher bit rates.
>> 3. An "non-delayed" read() (the last read() in the sequence
>> poll()->read()->sleep()->poll()->read()->poll()->read()) is very
>> likely to return a small chunk.
>>
>> With bit rates higher that 30 Mbit/s I got device buffer overflows on
>> ARM - ring buffer processing couldn't catch up.
>>
>> So, generally, it solves the issue, but is not as efficient as with an
>> unconditional delay. Compared to the unconditional patch:
>> 1. CPU usage is higher and not as steady.
>> 2. The read() sizes are not smooth too.
>>
>> Those CPU usage percents (without the patch) may look small, but it's
>> 20-35% of the whole VDR CPU usage and are several times less with the
>> patch, conditional or not.
>>
>> I understand, an unconditional delay here looks a bit scaring. How
>> about increasing the threshold value to, say, 100000, or 500 * TS_SIZE
>> or even higher? In other words, treat delayed read() as a normal
>> operation, and not delayed one as an emergency case. I can test it on
>> x86_64 with 50-60 Mbit/s. What do you think of it?
>>
>> Best,
>> glenvt18.
>>
>> 2016-03-10 12:41 GMT+03:00 Klaus Schmidinger <Klaus.Schmidinger@tvdr.de>:
>>> On 10.03.2016 02:54, glenvt18 wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> I've found that with some DVB tuner drivers poll() returns when there
>>>> are only small (1-3) number of TS packets available to read(). This
>>>> results in high CPU usage of the TS buffer thread which is busy
>>>> reading small chunks of data in cTSBuffer::Action(). 2 out of 3 tuners
>>>> I tested were affected with this issue. Even with a "good" tuner TS
>>>> buffer thread CPU usage is up to 10% per HD stream on ARM Cortex-A7.
>>>> With the proposed patch it is below 1-2% on all ARM and x86_64
>>>> platforms I've tested. The delay value of 10 ms can be considered
>>>> safe:
>>>>
>>>> media/dvb-core/dmxdev.h:109
>>>> #define DVR_BUFFER_SIZE (10*188*1024)
>>>>
>>>> It will take a tuner to receive (10*188*1024)*8*(1000/10) / 1000000 =
>>>> 1540 Mbit/s to overflow the device buffer within 10 ms interval. A
>>>> smaller delay is not enough for ARM. cDvbDevice has a ring buffer of
>>>> 5MB which is larger.
>>>>
>>>> This patch was made against VDR 2.3.1, but it can be applied to VDR
>>>> 2.2.0 as well.
>>>>
>>>> Please review.
>>>> glenvt18
>>>> ---
>>>> Index: b/device.c
>>>> ===================================================================
>>>> --- a/device.c    2015-09-18 01:04:12.000000000 +0300
>>>> +++ b/device.c    2016-03-10 03:38:50.078400715 +0300
>>>> @@ -1768,6 +1768,8 @@
>>>>                        break;
>>>>                        }
>>>>                     }
>>>> +              else
>>>> +                 cCondWait::SleepMs(10);
>>>>                  }
>>>>               }
>>>>         }
>>>
>>>
>>> I'm not too fond of the idea of introducing an additional, unconditional
>>> wait here. The actual problem should be fixed within the drivers.
>>> However, maybe waiting in case there is only a small number of TS packets
>>> available is acceptable:
>>>
>>> --- device.c    2015/09/05 11:42:17     4.2
>>> +++ device.c    2016/03/10 09:34:11
>>> @@ -1768,6 +1768,8 @@
>>>                       break;
>>>                       }
>>>                    }
>>> +              else if (r < MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE)
>>> +                 cCondWait::SleepMs(10);
>>>                 }
>>>              }
>>>        }
>>>
>>> The number MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE is just a random pick
>>> to avoid using a concrete literal number here.
>>> Can you please test if this still solves your problem?
>>>
>>> Klaus
  
glenvt18 March 20, 2016, 4:31 p.m. UTC | #4
Yes, I too don't see how it can be harmful. In case you want to
collect statistics, I did it with this patch (for a single device):

http://pastebin.com/6NAVNMda

Best,
Sergey Chernyavskiy.


2016-03-20 16:07 GMT+03:00 Klaus Schmidinger <Klaus.Schmidinger@tvdr.de>:
> On 18.03.2016 22:14, glenvt18 wrote:
>>
>> UPDATE.
>>
>> Here are 60 Mbit/s (entire transponder) test results:
>>
>> http://pastebin.com/nHDN8nsW
>>
>> - the same behaviour as for 30- Mbit/s, no huge chunks on the second
>> read. Threshold value of 70000 was chosen for demonstration only. It's
>> just a bit smaller than the mean which is about 75000. Notice that the
>> max value is 1.6x higher in the case of vanilla VDR. So, the threshold
>> value of about 100000 is OK.
>
>
> Well, I guess then we can just as well make this unconditional after all.
> Because any concrete threshold is just guesswork and may only apply to
> a particular environment.
>
> I have added your original patch to my VDR now and will observe how it
> behaves with my devices. If all goes well, I'll add it to the official
> source.
>
> Please send me your full name for the CONTRIBUTORS file (somebody has to
> take the blame if this ever causes trouble ;-).
>
> Klaus
>
>
>
>> 2016-03-14 19:05 GMT+03:00 glenvt18 <glenvt18@gmail.com>:
>>>
>>> Hi Klaus,
>>>
>>> I've tested your patch on x86_64 and ARM (I picked the weakest HW I
>>> have). I've gathered some statistics:
>>>
>>> http://pastebin.com/kXEP9Bp7
>>>
>>> Notes.
>>> A "good" tuner there means the one that returns bigger chunks on
>>> average and generally consumes less CPU cycles with a vanilla (not
>>> patched) VDR.
>>> CPU usage means the CPU usage of the TS buffer thread reported by top
>>> -d2.
>>> Bit rates are average and coarse.
>>> I didn't include unconditional delay dumps. They are "smooth" like
>>> those 8 and 15 Mbit/s ARM figures. CPU usage on x86_64 is typically
>>> 1.5-2 times less.
>>>
>>> Some observations:
>>> 1. Small reads are really expensive and must be avoided.
>>> 2. "Good" tuners are not that "good" at higher bit rates.
>>> 3. An "non-delayed" read() (the last read() in the sequence
>>> poll()->read()->sleep()->poll()->read()->poll()->read()) is very
>>> likely to return a small chunk.
>>>
>>> With bit rates higher that 30 Mbit/s I got device buffer overflows on
>>> ARM - ring buffer processing couldn't catch up.
>>>
>>> So, generally, it solves the issue, but is not as efficient as with an
>>> unconditional delay. Compared to the unconditional patch:
>>> 1. CPU usage is higher and not as steady.
>>> 2. The read() sizes are not smooth too.
>>>
>>> Those CPU usage percents (without the patch) may look small, but it's
>>> 20-35% of the whole VDR CPU usage and are several times less with the
>>> patch, conditional or not.
>>>
>>> I understand, an unconditional delay here looks a bit scaring. How
>>> about increasing the threshold value to, say, 100000, or 500 * TS_SIZE
>>> or even higher? In other words, treat delayed read() as a normal
>>> operation, and not delayed one as an emergency case. I can test it on
>>> x86_64 with 50-60 Mbit/s. What do you think of it?
>>>
>>> Best,
>>> glenvt18.
>>>
>>> 2016-03-10 12:41 GMT+03:00 Klaus Schmidinger <Klaus.Schmidinger@tvdr.de>:
>>>>
>>>> On 10.03.2016 02:54, glenvt18 wrote:
>>>>>
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> I've found that with some DVB tuner drivers poll() returns when there
>>>>> are only small (1-3) number of TS packets available to read(). This
>>>>> results in high CPU usage of the TS buffer thread which is busy
>>>>> reading small chunks of data in cTSBuffer::Action(). 2 out of 3 tuners
>>>>> I tested were affected with this issue. Even with a "good" tuner TS
>>>>> buffer thread CPU usage is up to 10% per HD stream on ARM Cortex-A7.
>>>>> With the proposed patch it is below 1-2% on all ARM and x86_64
>>>>> platforms I've tested. The delay value of 10 ms can be considered
>>>>> safe:
>>>>>
>>>>> media/dvb-core/dmxdev.h:109
>>>>> #define DVR_BUFFER_SIZE (10*188*1024)
>>>>>
>>>>> It will take a tuner to receive (10*188*1024)*8*(1000/10) / 1000000 =
>>>>> 1540 Mbit/s to overflow the device buffer within 10 ms interval. A
>>>>> smaller delay is not enough for ARM. cDvbDevice has a ring buffer of
>>>>> 5MB which is larger.
>>>>>
>>>>> This patch was made against VDR 2.3.1, but it can be applied to VDR
>>>>> 2.2.0 as well.
>>>>>
>>>>> Please review.
>>>>> glenvt18
>>>>> ---
>>>>> Index: b/device.c
>>>>> ===================================================================
>>>>> --- a/device.c    2015-09-18 01:04:12.000000000 +0300
>>>>> +++ b/device.c    2016-03-10 03:38:50.078400715 +0300
>>>>> @@ -1768,6 +1768,8 @@
>>>>>                        break;
>>>>>                        }
>>>>>                     }
>>>>> +              else
>>>>> +                 cCondWait::SleepMs(10);
>>>>>                  }
>>>>>               }
>>>>>         }
>>>>
>>>>
>>>>
>>>> I'm not too fond of the idea of introducing an additional, unconditional
>>>> wait here. The actual problem should be fixed within the drivers.
>>>> However, maybe waiting in case there is only a small number of TS
>>>> packets
>>>> available is acceptable:
>>>>
>>>> --- device.c    2015/09/05 11:42:17     4.2
>>>> +++ device.c    2016/03/10 09:34:11
>>>> @@ -1768,6 +1768,8 @@
>>>>                       break;
>>>>                       }
>>>>                    }
>>>> +              else if (r < MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE)
>>>> +                 cCondWait::SleepMs(10);
>>>>                 }
>>>>              }
>>>>        }
>>>>
>>>> The number MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE is just a random
>>>> pick
>>>> to avoid using a concrete literal number here.
>>>> Can you please test if this still solves your problem?
>>>>
>>>> Klaus
>
>
> _______________________________________________
> vdr mailing list
> vdr@linuxtv.org
> http://www.linuxtv.org/cgi-bin/mailman/listinfo/vdr
  

Patch

--- device.c    2015/09/05 11:42:17     4.2
+++ device.c    2016/03/10 09:34:11
@@ -1768,6 +1768,8 @@ 
                      break;
                      }
                   }
+              else if (r < MIN_TS_PACKETS_FOR_FRAME_DETECTOR * TS_SIZE)
+                 cCondWait::SleepMs(10);
                }
             }
       }