Fix EPG for UPC direct

Message ID 200612082105.20841.rollercoaster@reel-multimedia.com
State New

Commit Message

rollercoaster@reel-multimedia.com Dec. 8, 2006, 8:05 p.m. UTC
  UPC is a provider for Central European countries (Czechia, Hungary and Poland).
They use iso6937-2 to encode their EPG data, so the text looks quite strange in
vdr.
The attached patch remaps it to iso8859-2 so that characters are
displayed correctly. (Currently only tested with Czech and Hungarian, but it
should also work for Polish.)

While testing this with the help of a Hungarian user, I also found out that
the codepage for Hungary must be 8859-2, not -1.

The patch is the work of Helmut Auer.

cheers,
Tim
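
The remapping described above pairs a non-spacing prefix byte with the base letter that follows it and replaces both with one iso8859-2 byte. A minimal table-driven sketch of that idea follows. The names remap_table and remap_6937 are hypothetical, and the table holds only a handful of pairs copied from the patch below, so it illustrates the technique rather than replacing the patch.

```c
#include <assert.h>
#include <stddef.h>

/* One entry: non-spacing mark byte, base letter, resulting iso8859-2 byte. */
struct remap { unsigned char mark, base, out; };

/* Illustrative subset only; the values are taken from the patch. */
static const struct remap remap_table[] = {
    { 0xC2, 'a', 0xE1 },  /* acute + a */
    { 0xC2, 'e', 0xE9 },  /* acute + e */
    { 0xCA, 'u', 0xF9 },  /* ring  + u */
    { 0xCF, 'c', 0xE8 },  /* caron + c */
    { 0xCF, 's', 0xB9 },  /* caron + s */
};

/* Rewrite str in place and return the new length.
 * Pairs that are not in the table are copied through unchanged. */
size_t remap_6937(char *str)
{
    assert(str != NULL);
    unsigned char *src = (unsigned char *)str;
    unsigned char *dst = src;

    while (*src != '\0') {
        unsigned char out = *src;
        size_t used = 1;
        for (size_t i = 0; i < sizeof remap_table / sizeof remap_table[0]; i++) {
            /* src[1] is at worst the terminating NUL, which matches no base */
            if (src[0] == remap_table[i].mark && src[1] == remap_table[i].base) {
                out = remap_table[i].out;
                used = 2;
                break;
            }
        }
        *dst++ = out;
        src += used;
    }
    *dst = '\0';
    return (size_t)(dst - (unsigned char *)str);
}
```

Working on unsigned char throughout avoids comparing a possibly negative char against constants such as 0xCF.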
  

Comments

Klaus Schmidinger Dec. 10, 2006, 10:51 a.m. UTC | #1
Thiemo Gehrke wrote:
> UPC is a provider for middle european countries (Czechia, Hungary and Poland). 
> They use iso6937-2 for encoding their EPG data so this looks quite strange in 
> the vdr.
> The applied patch does a "remapping" to iso8859-2 so that characters are 
> displayed correct. (Currently only tested with Czech and Hungarian, but 
> should also work for Polish)
> 
> While testing this with the help of an hungarian user, i also found out that 
> the the codepage for Hungary must be 8859-2, not -1.
> 
> The patch is work by Helmut Auer.
> 
> cheers,
> Tim
> 
> 
> ------------------------------------------------------------------------
> 
> --- vdr-1.4.4-vanilla/epg.c	2006-10-28 11:12:42.000000000 +0200
> +++ vdr-1.4/epg.c	2006-11-28 12:39:33.000000000 +0100
> @@ -18,6 +18,165 @@
> 
>  #define RUNNINGSTATUSTIMEOUT 30 // seconds before the running status is considered unknown
> 
> +// UPC Direct / HBO strange two-character encoding. 0xC2 means acute, 0xCF caron.
> +// many thanks to the czechs who helped me while solving this.
> ...

How is their encoding coded in the first byte of the texts?
I can't seem to find an encoding for iso6937-2 in ETSI EN 300 468, section A.2.

Also, what happens if you run such a string through iconv() to convert it
from iso6937-2 to iso8859-2 or UTF-8?

I'm asking because this is how VDR will handle character sets in the next
version.

Klaus
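
For anyone who wants to try the iconv() experiment asked about above, here is a minimal sketch. The helper name convert_text is hypothetical and not part of VDR. Whether an ISO 6937 converter is installed, and under which name, depends on the iconv implementation (check the output of `iconv -l`), so the helper takes the charset names as parameters and reports failure instead of assuming a converter exists.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert the NUL-terminated string `in` from charset `from` to charset `to`.
 * Writes a NUL-terminated result into `out` (capacity `out_size`).
 * Returns the number of bytes written, or -1 if the charset pair is not
 * available or the conversion fails (bad input, output buffer too small). */
long convert_text(const char *from, const char *to,
                  const char *in, char *out, size_t out_size)
{
    assert(from != NULL && to != NULL && in != NULL && out != NULL);
    assert(out_size > 0);

    iconv_t cd = iconv_open(to, from);   /* argument order is (to, from) */
    if (cd == (iconv_t)-1)
        return -1;

    char *src = (char *)in;              /* iconv() only reads the input */
    size_t src_left = strlen(in);
    char *dst = out;
    size_t dst_left = out_size - 1;      /* keep one byte for the NUL */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);  /* flush pending state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;
    *dst = '\0';
    return (long)(dst - out);
}
```

A call such as convert_text("ISO_6937", "UTF-8", text, buf, sizeof buf) would then show whether the installed converter combines the two-byte sequences; the charset name "ISO_6937" is an assumption to verify locally.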
  
Darren Salt Dec. 10, 2006, 5:50 p.m. UTC | #2
I demand that Thiemo Gehrke may or may not have written...

> UPC is a provider for middle european countries (Czechia, Hungary and
> Poland). They use iso6937-2 for encoding their EPG data so this looks quite
> strange in the vdr.

> The applied patch does a "remapping" to iso8859-2 so that characters are
> displayed correct. (Currently only tested with Czech and Hungarian, but
> should also work for Polish)

Hmm. I recall seeing similar encoding being used by the BBC, though I don't
see any examples of it ATM.

Obviously, the output encoding should be ISO8859-1, not -2...
  
Adrian C. Dec. 31, 2006, 3:40 p.m. UTC | #3
On Fri, 8 Dec 2006, Thiemo Gehrke wrote:

> UPC is a provider for middle european countries (Czechia, Hungary and Poland).
> They use iso6937-2 for encoding their EPG data so this looks quite strange in
> the vdr.

Hello, the patch made no difference on my system;
you can see the screenshot here:
http://sysphere.org/~anrxc/upc.png
  
m.kapoun@kapik.net Jan. 2, 2007, 10:47 p.m. UTC | #4
Hi,
I converted iso6937 to iso8859-1 with a small patch.
It was the simplest way, because I only deleted the non-spacing characters
(diacritical marks).
It was a quick fix for me (strange letters in the EPG and bad recording file
names), but it is not a good one, because the EPG then no longer contains the
correct Czech (and other) letters.


How it works in the Czech Republic:
All DVB-T channels use 'table 00 - Latin alphabet'. This table is a superset
of ISO/IEC 6937. I talked with people from České radiokomunikace
(the broadcaster) and CzechTV, and tried to explain to them that iso8859-2 is
the better choice :-(. Their opinion is that "table 00" is more comprehensive:
they can display non-Czech characters without problems.

I am afraid that most European DVB broadcasters use 'table 00 - Latin
alphabet', but they don't use the characters 0xC0 to 0xCF. (These are
non-spacing characters: each one is printed together with the next character,
like on a mechanical typewriter. It is crazy.)
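
The quick fix described above (deleting the non-spacing bytes so that only the base letters remain) can be sketched as follows. The function name strip_nonspacing is hypothetical; this is not the small patch mentioned in the mail, only an illustration of the idea under the assumption that every byte in 0xC0 to 0xCF is a diacritical mark.

```c
#include <assert.h>
#include <stddef.h>

/* Remove every byte in the range 0xC0..0xCF from str, in place.
 * Returns the new length. The base letters stay, the diacritics are lost. */
size_t strip_nonspacing(char *str)
{
    assert(str != NULL);
    const unsigned char *src = (const unsigned char *)str;
    char *dst = str;

    for (; *src != '\0'; src++) {
        if (*src >= 0xC0 && *src <= 0xCF)
            continue;                 /* skip the non-spacing mark */
        *dst++ = (char)*src;          /* dst never runs ahead of src */
    }
    *dst = '\0';
    return (size_t)(dst - str);
}
```

The result is plain ASCII wherever the only special characters were accented letters, which is why file names become usable but the text is no longer correct Czech.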

I have no idea how to solve this correctly, or how to select the default
character set for VDR.


A few lines from ETSI EN 300 468 V1.7.1 (2005-12)

Annex A.2
If the first byte of the text field has a value in the range "0x20" to
"0xFF" then this and all subsequent bytes in the text
item are coded using the default character coding table (table 00 - Latin
alphabet) of figure A.1.


Notes for figure A.1

Figure A.1: Character code table 00 - Latin alphabet
NOTE 1: The SPACE character is located in position 20h of the code table.
NOTE 2: NBSP = no-break space.
NOTE 3: SHY = soft hyphen.
NOTE 4: This table is a superset of ISO/IEC 6937 [24] with addition of the
Euro symbol.
NOTE 5: All characters in column C are non-spacing characters (diacritical
marks).




Milos

  
Thomas Günther Feb. 3, 2007, 11:49 p.m. UTC | #5
Thiemo Gehrke wrote:
> UPC is a provider for middle european countries (Czechia, Hungary and
> Poland).  They use iso6937-2 for encoding their EPG data so this looks
> quite strange in  the vdr.
> The applied patch does a "remapping" to iso8859-2 so that characters
> are  displayed correct. (Currently only tested with Czech and
> Hungarian, but  should also work for Polish)

Here is an iconv version of the patch:
http://toms-cafe.de/vdr/download/vdr-epg-conv-iso6937-1.4.5.diff

Tom
  

Patch

--- vdr-1.4.4-vanilla/epg.c	2006-10-28 11:12:42.000000000 +0200
+++ vdr-1.4/epg.c	2006-11-28 12:39:33.000000000 +0100
@@ -18,6 +18,165 @@ 

 #define RUNNINGSTATUSTIMEOUT 30 // seconds before the running status is considered unknown

+// UPC Direct / HBO strange two-character encoding. 0xC2 means acute, 0xCF caron.
+// many thanks to the czechs who helped me while solving this.
+void checkUPC( char *str )
+{
+   char *s1 = str;
+   char *s2 = str;
+   char nc;
+
+   if (!str)
+      return;
+
+   while (*s1 != '\0') {
+      nc = *s1;
+      switch ((unsigned char)*s1) { // compare as unsigned: a signed char never equals 0xC2..0xCF
+         case 0xC2: // acute: á é í ó ú ý
+            s1++;
+            switch (*s1) {
+               case 'A': nc = (char)0xC1;
+                  break;
+               case 'a': nc = (char)0xE1;
+                  break;
+               case 'E': nc = (char)0xC9;
+                  break;
+               case 'e': nc = (char)0xE9;
+                  break;
+               case 'I': nc = (char)0xCD;
+                  break;
+               case 'i': nc = (char)0xED;
+                  break;
+               case 'O': nc = (char)0xD3;
+                  break;
+               case 'o': nc = (char)0xF3;
+                  break;
+               case 'U': nc = (char)0xDA;
+                  break;
+               case 'u': nc = (char)0xFA;
+                  break;
+               case 'Y': nc = (char)0xDD;
+                  break;
+               case 'y': nc = (char)0xFD;
+                  break;
+               default:
+                  s1--;
+                  break;
+            }
+            break;
+         case 0xC6:
+            s1++;
+            switch (*s1) {
+               case 'S': nc = (char)0xA9;
+                  break;
+               case 's': nc = (char)0xB9;
+                  break;
+               default:
+                  s1--;
+                  break;
+            }
+            break;
+         case 0xC8:
+            s1++;
+            switch (*s1) {
+               case 'A': nc = (char)0xC4;
+                  break;
+               case 'a': nc = (char)0xE4;
+                  break;
+               case 'O': nc = (char)0xD6;
+                  break;
+               case 'o': nc = (char)0xF6;
+                  break;
+               case 'U': nc = (char)0xDC;
+                  break;
+               case 'u': nc = (char)0xFC;
+                  break;
+               default:
+                  s1--;
+                  break;
+            }
+            break;
+         case 0xCA: // krouzek http://de.wikipedia.org/wiki/Krouzek
+            s1++;
+            switch (*s1) {
+               case 'U': nc = (char)0xD9;
+                  break;
+               case 'u': nc = (char)0xF9;
+                  break;
+               default:
+                  s1--;
+                  break;
+            }
+            break;
+         case 0xCD:
+            s1++;
+            switch (*s1) {
+               case 'O': nc = (char)0xD5;
+                  break;
+               case 'o': nc = (char)0xF5;
+                  break;
+               case 'U': nc = (char)0xDB;
+                  break;
+               case 'u': nc = (char)0xFB;
+                  break;
+               default:
+                  s1--;
+                  break;
+            }
+            break;
+         case 0xCF: // caron
+            s1++;
+            switch (*s1) {
+               case 'C': nc =  (char)0xC8;
+                  break;
+               case 'c': nc =  (char)0xE8;
+                  break;
+               case 'D': nc =  (char)0xCF;
+                  break;
+               case 'd': nc =  (char)0xEF;
+                  break;
+               case 'E': nc =  (char)0xCC;
+                  break;
+               case 'e': nc =  (char)0xEC;
+                  break;
+               case 'L': nc =  (char)0xC5;        // not sure if they really exist.
+                  break;
+               case 'l': nc =  (char)0xE5;
+                  break;
+               case 'N': nc =  (char)0xD2;
+                  break;
+               case 'n': nc =  (char)0xF2;
+                  break;
+               case 'R': nc =  (char)0xD8;
+                  break;
+               case 'r': nc =  (char)0xF8;
+                  break;
+               case 'S': nc =  (char)0xA9;
+                  break;
+               case 's': nc =  (char)0xB9;
+                  break;
+               case 'T': nc =  (char)0xAB;
+                  break;
+               case 't': nc =  (char)0xBB;
+                  break;
+               case 'Z': nc =  (char)0xAE;
+                  break;
+               case 'z': nc =  (char)0xBE;
+                  break;
+               default:
+                  s1--;
+                  break;
+            }
+            break;
+         default:
+            break;
+      }
+      s1++;
+      *s2 = nc;
+      s2++;
+   }
+   *s2 = '\0';
+}
 // --- tComponent ------------------------------------------------------------

 cString tComponent::ToString(void)
@@ -641,6 +800,11 @@ 
   strreplace(shortText, '\x87', ' ');
   strreplace(description, '\x86', ' ');
   strreplace(description, '\x87', ' ');
+
+  // Check for some strange czech characters :)
+  checkUPC( title );
+  checkUPC( shortText );
+  checkUPC( description );
 }

 // --- cSchedule -------------------------------------------------------------
--- vdr-1.4.4-vanilla/i18n.c	2006-10-14 11:26:41.000000000 +0200
+++ vdr-1.4/i18n.c	2006-12-08 19:31:00.000000000 +0100
@@ -119,7 +119,7 @@ 
     "iso8859-7",
     "iso8859-1",
     "iso8859-2",
-    "iso8859-1",
+    "iso8859-2",
     "iso8859-1",
     "iso8859-5",
     "iso8859-2",