VDR-1.3.41: speedup for cVideoRepacker
Commit Message
Hi,
as you may have noticed: ZDF increased it's bitrate to about 7.5 Mbit/s.
As a result, less powerful VDR systems "suffered" when taking a
recording or when running in transfer mode.
The attached patch speeds up cVideoRepacker by choosing a known art
search algorithm for finding start codes in the video stream.
With this patch and a similar one for xine-lib I was finally able to
watch and record ZDF at the same time on my 600 MHz "budget only" EPIA
VDR system.
Bye.
Comments
Reinhard Nissl wrote:
> +bool cVideoRepacker::ScanDataForStartCodeFast(const uchar *&Data, const uchar *Limit)
> +{
> + Limit--;
> +
> + while (Data < Limit) {
> + if (*Data > 0x01)
> + Data += 3;
> + else if (*Data == 0x00)
> + Data++;
> + else if (Data[-2] != 0x00 || Data[-1] != 0x00)
> + Data += 3;
> + else {
> + scanner = 0x00000100 | *++Data;
> + return true;
> + }
> + }
Did you consider using memchr()? e.g. something like
...
while (Data < Limit) {
Data = memchr(Data, 0x01, Limit - Data);
if (Data == NULL)
break;
if (Data[-2] != 0x00 || Data[-1] != 0x00)
Data += 3;
...
It makes no noticeable difference on my AMD64 machine (<1%), but maybe
it is worth trying on your EPIA?
Jon
Hi,
Jon Burgess wrote:
>> +bool cVideoRepacker::ScanDataForStartCodeFast(const uchar *&Data,
>> const uchar *Limit)
>> +{
>> + Limit--;
>> +
>> + while (Data < Limit) {
>> + if (*Data > 0x01)
>> + Data += 3;
>> + else if (*Data == 0x00)
>> + Data++;
>> + else if (Data[-2] != 0x00 || Data[-1] != 0x00)
>> + Data += 3;
>> + else {
>> + scanner = 0x00000100 | *++Data;
>> + return true;
>> + }
>> + }
>
> Did you consider using memchr()? e.g. something like
> ...
> while (Data < Limit) {
> Data = memchr(Data, 0x01, Limit - Data);
> if (Data == NULL)
> break;
> if (Data[-2] != 0x00 || Data[-1] != 0x00)
> Data += 3;
> ...
>
> It makes no noticeable difference on my AMD64 machine (<1%), but maybe
> it is worth trying on your EPIA?
I don't think that it is worth a try as it tests every byte while the
above code tests most of the time only every third byte.
Consider the following data:
hh ii jj kk [ 00 00 01 01 ] mm nn oo 01 01 pp 01
^^^^^^^^^^^^^^^^^^^ ++++++++++^^^ ^ ^^
6 4 0 1
which will give you the above distance between consecutive 01 which are
not part of a startcode (i. e. first slice start code [ 00 00 01 01 ])
or immediately following a startcode (the area marked with +++++). ++ or
^^ indicate bytes which have to be considered for the distance.
For the above case, a statistical analysis will give this numbers:
distance: 0, count: 1
distance: 1, count: 1
distance: 2, count: 0
distance: 3, count: 0
distance: 4, count: 1
distance: 5, count: 0
distance: 6, count: 1
distance: >, count: 0
Now consider real data, like 001.vdr of "Wetten, dass ..?" which was
broadcast last Saturday on ZDF:
distance: 0, count: 75248
distance: 1, count: 519949
distance: 2, count: 316632
distance: 3, count: 331874
distance: 4, count: 381855
distance: 5, count: 367280
distance: 6, count: 370620
distance: 7, count: 369649
distance: 8, count: 405861
distance: 9, count: 360555
distance: 10, count: 332593
distance: 11, count: 366192
distance: 12, count: 319692
distance: 13, count: 313795
distance: 14, count: 309567
distance: 15, count: 305986
distance: 16, count: 297510
distance: 17, count: 291607
distance: 18, count: 286045
distance: >, count: 18794252
As you can easily see, it's a waste of time to test every byte (=
distance 0) for 01 as it is very unlikely to find one.
Bye.
Reinhard Nissl wrote:
> I don't think that it is worth a try as it tests every byte while the
> above code tests most of the time only every third byte.
I agree that your algorithm is clever and does greatly cut down the
number of comparisons as compared to the old code.
The glibc memchr() implementation does the comparisons 4 bytes at a time
using a clever algorithm. It also has assembler optimised variants for
some CPU's. I don't think that only doing a comparison of every 3rd byte
wins you anything over memchr().
I believe the bulk of the time taken by the routine is transferring all
the data from memory into the CPU. Every byte of the data will have to
be read into the CPU caches due to cacheline effects. I believe that the
asm optimisations will take into account the possibilities of
speculative readahead etc. I've not looked into the assembler to see
whether it actually exploits this.
I've atached the quickly hacked up test program that I wrote. The output
is the time taken for many iterations of the 2 different algorithms.
For me the difference is within the measurement noise. It certainly
isn't any slower. I'd be interested to know whether it makes any
difference on your EPIA, both in the test program and in VDR.
$ ./search /video0/%Click_Online/2005-04-10.04\:28.99.99.rec/001.vdr
Found 10585344 matches in 12.5873 seconds
Found 10585344 matches in 12.6235 seconds
Jon
On Wednesday 01 February 2006 00:33, Jon Burgess wrote:
> Reinhard Nissl wrote:
> > I don't think that it is worth a try as it tests every byte while the
> > above code tests most of the time only every third byte.
>
> I agree that your algorithm is clever and does greatly cut down the
> number of comparisons as compared to the old code.
>
> The glibc memchr() implementation does the comparisons 4 bytes at a time
> using a clever algorithm. It also has assembler optimised variants for
> some CPU's. I don't think that only doing a comparison of every 3rd byte
> wins you anything over memchr().
>
> I believe the bulk of the time taken by the routine is transferring all
> the data from memory into the CPU. Every byte of the data will have to
> be read into the CPU caches due to cacheline effects. I believe that the
> asm optimisations will take into account the possibilities of
> speculative readahead etc. I've not looked into the assembler to see
> whether it actually exploits this.
>
> I've atached the quickly hacked up test program that I wrote. The output
> is the time taken for many iterations of the 2 different algorithms.
> For me the difference is within the measurement noise. It certainly
> isn't any slower. I'd be interested to know whether it makes any
> difference on your EPIA, both in the test program and in VDR.
>
> $ ./search /video0/%Click_Online/2005-04-10.04\:28.99.99.rec/001.vdr
> Found 10585344 matches in 12.5873 seconds
> Found 10585344 matches in 12.6235 seconds
Just done a quick test on my Epia MII-12000 system (whilst under heavy load
running softdevice, etc. so the numbers are probably crap!):
laz@vdr-tng 2006-02-01.18.28.50.50.del $ ~/stmp 001.vdr
Found 12721024 matches in 276.0596 seconds
Found 12721024 matches in 201.9245 seconds
I'll try to test it again later when it has finished recording so it is more
idle! With the above numbers, it looks like the memchr() is quite a bit
faster, at least under these conditions!
Cheers,
Laz
Cheers,
Laz
On Wednesday 01 February 2006 21:54, Laz wrote:
> Just done a quick test on my Epia MII-12000 system (whilst under heavy load
> running softdevice, etc. so the numbers are probably crap!):
>
> laz@vdr-tng 2006-02-01.18.28.50.50.del $ ~/stmp 001.vdr
> Found 12721024 matches in 276.0596 seconds
> Found 12721024 matches in 201.9245 seconds
>
> I'll try to test it again later when it has finished recording so it is
> more idle! With the above numbers, it looks like the memchr() is quite a
> bit faster, at least under these conditions!
And not under load:
laz@vdr-tng 2006-02-01.18.28.50.50.del $ ~/stmp 001.vdr
Found 12721024 matches in 47.0021 seconds
Found 12721024 matches in 32.8197 seconds
memchr() definitely looks better for my Epia system.
I'll try Reinhard's new patch next...
Cheers,
Laz
@@ -248,6 +313,14 @@ private:
scanPicture
};
int state;
+ void HandleStartCode(const uchar *const Data, cRingBufferLinear *const ResultBuffer, const uchar *&Payload, const uchar StreamID, const ePesHeader MpegLevel);
+ inline bool ScanDataForStartCodeSlow(const uchar *const Data);
+ inline bool ScanDataForStartCodeFast(const uchar *&Data, const uchar *Limit);
+ inline bool ScanDataForStartCode(const uchar *&Data, int &Done, int &Todo);
+ inline void AdjustCounters(const int Delta, int &Done, int &Todo);
+ inline bool ScanForEndOfPictureSlow(const uchar *&Data);
+ inline bool ScanForEndOfPictureFast(const uchar *&Data, const uchar *Limit);
+ inline bool ScanForEndOfPicture(const uchar *&Data, const uchar *Limit);
public:
cVideoRepacker(void);
virtual void Reset(void);
@@ -267,6 +340,166 @@ void cVideoRepacker::Reset(void)
state = syncing;
}
+void cVideoRepacker::HandleStartCode(const uchar *const Data, cRingBufferLinear *const ResultBuffer, const uchar *&Payload, const uchar StreamID, const ePesHeader MpegLevel)
+{
+ // synchronisation is detected some bytes after frame start.
+ const int SkippedBytesLimit = 4;
+
+ // which kind of start code have we got?
+ switch (*Data) {
+ case 0xB9 ... 0xFF: // system start codes
+ LOG("cVideoRepacker: found system start code: stream seems to be scrambled or not demultiplexed");
+ break;
+ case 0xB0 ... 0xB1: // reserved start codes
+ case 0xB6:
+ LOG("cVideoRepacker: found reserved start code: stream seems to be scrambled");
+ break;
+ case 0xB4: // sequence error code
+ LOG("cVideoRepacker: found sequence error code: stream seems to be damaged");
+ case 0xB2: // user data start code
+ case 0xB5: // extension start code
+ break;
+ case 0xB7: // sequence end code
+ case 0xB3: // sequence header code
+ case 0xB8: // group start code
+ case 0x00: // picture start code
+ if (state == scanPicture) {
+ // the above start codes indicate that the current picture is done. So
+ // push out the packet to start a new packet for the next picuture. If
+ // the byte count get's negative then the current buffer ends in a
+ // partitial start code that must be stripped off, as it shall be put
+ // in the next packet.
+ PushOutPacket(ResultBuffer, Payload, Data - 3 - Payload);
+ // go on with syncing to the next picture
+ state = syncing;
+ }
+ if (state == syncing) {
+ if (initiallySyncing) // omit report for the typical initial case
+ initiallySyncing = false;
+ else if (skippedBytes > SkippedBytesLimit) // report that syncing dropped some bytes
+ LOG("cVideoRepacker: skipped %d bytes to sync on next picture", skippedBytes - SkippedBytesLimit);
+ skippedBytes = 0;
+ // if there is a PES header available, then use it ...
+ if (pesHeaderBackupLen > 0) {
+ // ISO 13818-1 says:
+ // In the case of video, if a PTS is present in a PES packet header
+ // it shall refer to the access unit containing the first picture start
+ // code that commences in this PES packet. A picture start code commences
+ // in PES packet if the first byte of the picture start code is present
+ // in the PES packet.
+ memcpy(pesHeader, pesHeaderBackup, pesHeaderBackupLen);
+ pesHeaderLen = pesHeaderBackupLen;
+ pesHeaderBackupLen = 0;
+ }
+ else {
+ // ... otherwise create a continuation PES header
+ pesHeaderLen = 0;
+ pesHeader[pesHeaderLen++] = 0x00;
+ pesHeader[pesHeaderLen++] = 0x00;
+ pesHeader[pesHeaderLen++] = 0x01;
+ pesHeader[pesHeaderLen++] = StreamID; // video stream ID
+ pesHeader[pesHeaderLen++] = 0x00; // length still unknown
+ pesHeader[pesHeaderLen++] = 0x00; // length still unknown
+
+ if (MpegLevel == phMPEG2) {
+ pesHeader[pesHeaderLen++] = 0x80;
+ pesHeader[pesHeaderLen++] = 0x00;
+ pesHeader[pesHeaderLen++] = 0x00;
+ }
+ else
+ pesHeader[pesHeaderLen++] = 0x0F;
+ }
+ // append the first three bytes of the start code
+ pesHeader[pesHeaderLen++] = 0x00;
+ pesHeader[pesHeaderLen++] = 0x00;
+ pesHeader[pesHeaderLen++] = 0x01;
+ // the next packet's payload will begin with the fourth byte of
+ // the start code (= the actual code)
+ Payload = Data;
+ // as there is no length information available, assume the
+ // maximum we can hold in one PES packet
+ packetTodo = maxPacketSize - pesHeaderLen;
+ // go on with finding the picture data
+ state++;
+ }
+ break;
+ case 0x01 ... 0xAF: // slice start codes
+ if (state == findPicture) {
+ // go on with scanning the picture data
+ state++;
+ }
+ break;
+ }
+}
+
+bool cVideoRepacker::ScanDataForStartCodeSlow(const uchar *const Data)
+{
+ scanner <<= 8;
+ bool FoundStartCode = (scanner == 0x00000100);
+ scanner |= *Data;
+ return FoundStartCode;
+}
+
+bool cVideoRepacker::ScanDataForStartCodeFast(const uchar *&Data, const uchar *Limit)
+{
+ Limit--;
+
+ while (Data < Limit) {
+ if (*Data > 0x01)
+ Data += 3;
+ else if (*Data == 0x00)
+ Data++;
+ else if (Data[-2] != 0x00 || Data[-1] != 0x00)
+ Data += 3;
+ else {
+ scanner = 0x00000100 | *++Data;
+ return true;
+ }
+ }
+
+ Data = Limit;
+ unsigned long *Scanner = (unsigned long *)(Data - 3);
+ scanner = ntohl(*Scanner);
+ return false;
+}
+
+bool cVideoRepacker::ScanDataForStartCode(const uchar *&Data, int &Done, int &Todo)
+{
+ const uchar *const DataOrig = Data;
+ const int MinDataSize = 4;
+
+ if (Todo < MinDataSize || (state != syncing && packetTodo < MinDataSize))
+ return ScanDataForStartCodeSlow(Data);
+
+ int Limit = Todo;
+ if (state != syncing && Limit > packetTodo)
+ Limit = packetTodo;
+
+ if (ScanDataForStartCodeSlow(Data))
+ return true;
+
+ if (ScanDataForStartCodeSlow(++Data)) {
+ AdjustCounters(1, Done, Todo);
+ return true;
+ }
+ ++Data;
+
+ bool FoundStartCode = ScanDataForStartCodeFast(Data, DataOrig + Limit);
+ AdjustCounters(Data - DataOrig, Done, Todo);
+ return FoundStartCode;
+}
+
+void cVideoRepacker::AdjustCounters(const int Delta, int &Done, int &Todo)
+{
+ Done += Delta;
+ Todo -= Delta;
+
+ if (state <= syncing)
+ skippedBytes += Delta;
+ else
+ packetTodo -= Delta;
+}
+
void cVideoRepacker::Repack(cRingBufferLinear *ResultBuffer, const uchar *Data, int Count)
{
// synchronisation is detected some bytes after frame start.
@@ -300,98 +533,9 @@ void cVideoRepacker::Repack(cRingBufferL
if (state <= syncing)
skippedBytes++;
// did we reach a start code?
- scanner <<= 8;
- if (scanner != 0x00000100)
- scanner |= *data;
- else {
- scanner |= *data;
-
- // which kind of start code have we got?
- switch (*data) {
- case 0xB9 ... 0xFF: // system start codes
- LOG("cVideoRepacker: found system start code: stream seems to be scrambled or not demultiplexed");
- break;
- case 0xB0 ... 0xB1: // reserved start codes
- case 0xB6:
- LOG("cVideoRepacker: found reserved start code: stream seems to be scrambled");
- break;
- case 0xB4: // sequence error code
- LOG("cVideoRepacker: found sequence error code: stream seems to be damaged");
- case 0xB2: // user data start code
- case 0xB5: // extension start code
- break;
- case 0xB7: // sequence end code
- case 0xB3: // sequence header code
- case 0xB8: // group start code
- case 0x00: // picture start code
- if (state == scanPicture) {
- // the above start codes indicate that the current picture is done. So
- // push out the packet to start a new packet for the next picuture. If
- // the byte count get's negative then the current buffer ends in a
- // partitial start code that must be stripped off, as it shall be put
- // in the next packet.
- PushOutPacket(ResultBuffer, payload, data - 3 - payload);
- // go on with syncing to the next picture
- state = syncing;
- }
- if (state == syncing) {
- if (initiallySyncing) // omit report for the typical initial case
- initiallySyncing = false;
- else if (skippedBytes > SkippedBytesLimit) // report that syncing dropped some bytes
- LOG("cVideoRepacker: skipped %d bytes to sync on next picture", skippedBytes - SkippedBytesLimit);
- skippedBytes = 0;
- // if there is a PES header available, then use it ...
- if (pesHeaderBackupLen > 0) {
- // ISO 13818-1 says:
- // In the case of video, if a PTS is present in a PES packet header
- // it shall refer to the access unit containing the first picture start
- // code that commences in this PES packet. A picture start code commences
- // in PES packet if the first byte of the picture start code is present
- // in the PES packet.
- memcpy(pesHeader, pesHeaderBackup, pesHeaderBackupLen);
- pesHeaderLen = pesHeaderBackupLen;
- pesHeaderBackupLen = 0;
- }
- else {
- // ... otherwise create a continuation PES header
- pesHeaderLen = 0;
- pesHeader[pesHeaderLen++] = 0x00;
- pesHeader[pesHeaderLen++] = 0x00;
- pesHeader[pesHeaderLen++] = 0x01;
- pesHeader[pesHeaderLen++] = Data[3]; // video stream ID
- pesHeader[pesHeaderLen++] = 0x00; // length still unknown
- pesHeader[pesHeaderLen++] = 0x00; // length still unknown
-
- if (mpegLevel == phMPEG2) {
- pesHeader[pesHeaderLen++] = 0x80;
- pesHeader[pesHeaderLen++] = 0x00;
- pesHeader[pesHeaderLen++] = 0x00;
- }
- else
- pesHeader[pesHeaderLen++] = 0x0F;
- }
- // append the first three bytes of the start code
- pesHeader[pesHeaderLen++] = 0x00;
- pesHeader[pesHeaderLen++] = 0x00;
- pesHeader[pesHeaderLen++] = 0x01;
- // the next packet's payload will begin with the fourth byte of
- // the start code (= the actual code)
- payload = data;
- // as there is no length information available, assume the
- // maximum we can hold in one PES packet
- packetTodo = maxPacketSize - pesHeaderLen;
- // go on with finding the picture data
- state++;
- }
- break;
- case 0x01 ... 0xAF: // slice start codes
- if (state == findPicture) {
- // go on with scanning the picture data
- state++;
- }
- break;
- }
- }
+ if (ScanDataForStartCode(data, done, todo))
+ HandleStartCode(data, ResultBuffer, payload, Data[3], mpegLevel);
+ // move on
data++;
done++;
todo--;
@@ -501,6 +645,83 @@ void cVideoRepacker::Repack(cRingBufferL
}
}
+bool cVideoRepacker::ScanForEndOfPictureSlow(const uchar *&Data)
+{
+ localScanner <<= 8;
+ localScanner |= *Data++;
+ // check start codes which follow picture data
+ switch (localScanner) {
+ case 0x00000100: // picture start code
+ case 0x000001B8: // group start code
+ case 0x000001B3: // sequence header code
+ case 0x000001B7: // sequence end code
+ return true;
+ }
+ return false;
+}
+
+bool cVideoRepacker::ScanForEndOfPictureFast(const uchar *&Data, const uchar *Limit)
+{
+ Limit--;
+
+ while (Data < Limit) {
+ if (*Data > 0x01)
+ Data += 3;
+ else if (*Data == 0x00)
+ Data++;
+ else if (Data[-2] != 0x00 || Data[-1] != 0x00)
+ Data += 3;
+ else {
+ localScanner = 0x00000100 | *++Data;
+ // check start codes which follow picture data
+ switch (localScanner) {
+ case 0x00000100: // picture start code
+ case 0x000001B8: // group start code
+ case 0x000001B3: // sequence header code
+ case 0x000001B7: // sequence end code
+ Data++;
+ return true;
+ default:
+ Data += 3;
+ }
+ }
+ }
+
+ Data = Limit + 1;
+ unsigned long *LocalScanner = (unsigned long *)(Data - 4);
+ localScanner = ntohl(*LocalScanner);
+ return false;
+}
+
+bool cVideoRepacker::ScanForEndOfPicture(const uchar *&Data, const uchar *Limit)
+{
+ const uchar *const DataOrig = Data;
+ const int MinDataSize = 4;
+ bool FoundEndOfPicture;
+
+ if (Limit - Data <= MinDataSize) {
+ FoundEndOfPicture = false;
+ while (Data < Limit) {
+ if (ScanForEndOfPictureSlow(Data)) {
+ FoundEndOfPicture = true;
+ break;
+ }
+ }
+ }
+ else {
+ FoundEndOfPicture = true;
+ if (!ScanForEndOfPictureSlow(Data)) {
+ if (!ScanForEndOfPictureSlow(Data)) {
+ if (!ScanForEndOfPictureFast(Data, Limit))
+ FoundEndOfPicture = false;
+ }
+ }
+ }
+
+ localStart += (Data - DataOrig);
+ return FoundEndOfPicture;
+}
+
int cVideoRepacker::BreakAt(const uchar *Data, int Count)
{
if (initiallySyncing)
@@ -522,19 +743,8 @@ int cVideoRepacker::BreakAt(const uchar
const uchar *data = Data + PesPayloadOffset + localStart;
const uchar *limit = Data + Count;
// scan data
- while (data < limit) {
- localStart++;
- localScanner <<= 8;
- localScanner |= *data++;
- // check start codes which follow picture data
- switch (localScanner) {
- case 0x00000100: // picture start code
- case 0x000001B8: // group start code
- case 0x000001B3: // sequence header code
- case 0x000001B7: // sequence end code
- return data - Data;
- }
- }
+ if (ScanForEndOfPicture(data, limit))
+ return data - Data;
}
// just fill up packet and append next start code
return PesPayloadOffset + packetTodo + 4;