[SOLVED] Character set problem in python plug-in

Grafx · **Joined:** Mar 15, 2013 **Posts:** 36

I am retrieving strings from a MySQL database and many of the characters are not English, but Latin-Croatian. Here is a sample string:

Code: Select all

žćšćščĐđćžćŽ

I am having problems putting these strings into text boxes. I am getting various errors depending on my attempts. My last attempt was something like,

Code: Select all

import codecs

....

    my_string = codecs.encode(odd_char_string) + normal_string

And I get error,

Code: Select all

UnicodeDecodeError: 'utf8' codec can't decode byte 0x9e in position 5: invalid start byte

I've been through this mess before in both MySQL and Perl and it took me weeks to get it straightened out. The on-line documentation is no help.

Does anyone have a clue what to do? As a demonstration of the problems I am having, try posting that sample Latin Croatian text outside of CODE markup and the post will not preview.

ofnuts · **Joined:** Oct 25, 2010 **Posts:** 4739

The $10K question is what encoding you get from MySQL. What type are the strings, str or Unicode? How do they display with a "print" instruction? And remember that the Gimp API expects Unicode, not str.

Grafx · **Joined:** Mar 15, 2013 **Posts:** 36

ofnuts wrote:

The $10K question is what encoding you get from MySQL. What type are the strings, str or Unicode? How do they display with a "print" instruction? And remember that the Gimp API expects Unicode, not str.

I tried using pdb.gimp_message(my_string) and got,

Code: Select all

Calling error for procedure 'mysql_field':
Procedure 'gimp-message' has been called with value 'xxxxxži?' for argument 'message' (#1, type gchararray). This value is out of range.

xxxxx..i? represents the string with the Latin Croatian characters in it. The Latin Croatian "z" made it, but the question mark represents a character that didn't.

Is there another "print" function I should try to expose further what is going on?

saulgoode · **Posted:** Sat Apr 06, 2013 1:54 pm **(#4)**

GIMP handles printing the string as copied from your first post just fine. (Also, in my browser the text appears fine even if outside of CODE blocks.)

Python:
pdb.gimp_message("žćšćščĐđćžćŽ")

Script-fu:
(gimp-message "žćšćščĐđćžćŽ")

This suggests (as Ofnuts opined) that your problem lies with how MySQL is encoding the string.

ofnuts · **Joined:** Oct 25, 2010 **Posts:** 4739

Grafx wrote:

ofnuts wrote:

The $10K question is what encoding you get from MySQL. What type are the strings, str or Unicode? How do they display with a "print" instruction? And remember that the Gimp API expects Unicode, not str.

I tried using pdb.gimp_message(my_string) and got,

Code: Select all

Calling error for procedure 'mysql_field':
Procedure 'gimp-message' has been called with value 'xxxxxži?' for argument 'message' (#1, type gchararray). This value is out of range.

xxxxx..i? represents the string with the Latin Croatian characters in it. The Latin Croatian "z" made it, but the question mark represents a character that didn't.

Is there another "print" function I should try to expose further what is going on?

Assuming your string variable is "s", what is the output of:

Code: Select all

pdb.gimp_message(type(s))

and

Code: Select all

pdb.gimp_message(','.join([str(ord(c)) for c in s]))

Grafx · **Joined:** Mar 15, 2013 **Posts:** 36

Code: Select all

pdb.gimp_message(type(s))

For that message, I get,

Code: Select all

<type 'str'>

Code: Select all

pdb.gimp_message(','.join([str(ord(c)) for c in s]))

For that message, I get,

Code: Select all

75,97,114,97,100,158,105,63

I'm pretty sure that this string represents the same string I posted previously. It most certainly contains Latin Croatian characters. No doubt about it.
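
For reference, the ordinals above can be turned back into raw bytes to see exactly which byte falls outside ASCII. This is a minimal sketch written for Python 3 (the thread itself uses Python 2, where a `str` is already a byte string); the variable names are illustrative only.

```python
# Ordinals reported by the ','.join(...) diagnostic above.
ordinals = [75, 97, 114, 97, 100, 158, 105, 63]
raw = bytes(ordinals)

# Position and hex value of every byte outside the 7-bit ASCII range.
non_ascii = [(i, hex(b)) for i, b in enumerate(raw) if b > 127]

# Attempt a strict UTF-8 decode and record where it fails, if it does.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = (exc.start, exc.reason)
```

Here `non_ascii` holds a single entry, byte 0x9E at position 5, which is the same byte and position named in the original UnicodeDecodeError. The last byte, 63, is the plain ASCII question mark.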

Grafx · **Joined:** Mar 15, 2013 **Posts:** 36

saulgoode wrote:

GIMP handles printing the string as copied from your first post just fine. (Also, in my browser the text appears fine even if outside of CODE blocks.)
Python:

Code: Select all

pdb.gimp_message("žćšćščĐđćžćŽ")

Script-fu:

Code: Select all

(gimp-message "žćšćščĐđćžćŽ")

This suggests (as Ofnuts opined) that your problem lies with how MySQL is encoding the string.

I can confirm your success here. I also found that if you reproduce the problem in a simple non-plug-in script with the text in the script, you can't even save the script unless the file is in UTF-8. Furthermore, the Python compiler will reject the characters unless you add a magic comment at the top,

Code: Select all

# coding=utf-8

as explained in a link provided from a message from the compiler,

http://www.python.org/dev/peps/pep-0263/

The only control I seem to have with MySQL is by way of "collation" and the collation I have applied to this table is Latin Croatian. I have tried,

Code: Select all

SELECT ... CONVERT(My_Field USING utf8) ...

but get the same results. Is there another character set I might try for this?

paynekj · **Posted:** Tue Apr 09, 2013 8:24 am **(#9)**

Have you tried using the Python encode/decode functions to change the encoding?

Here's a discussion on something similar on Stackoverflow:
http://stackoverflow.com/questions/4299 ... to-latin-1

And you refer to your data as being "Latin Croatian". Is that actually the latin-2 encoding?
http://en.wikipedia.org/wiki/ISO/IEC_8859-2

Kevin

ofnuts · **Joined:** Oct 25, 2010 **Posts:** 4739

Grafx wrote:

Code: Select all

pdb.gimp_message(type(s))

For that message, I get,

Code: Select all

<type 'str'>

Code: Select all

pdb.gimp_message(','.join([str(ord(c)) for c in s]))

For that message, I get,

Code: Select all

75,97,114,97,100,158,105,63

I'm pretty sure that this string represents the same string I posted previous. It most certainly contains Latin Croatian characters. No doubt about it.

Definitely not, it contains... whatever the encoding decides, but this looks like some legacy encoding (since it's a <str> it is not Unicode). These values are, in ASCII: 'K','a','r','a','d',###,'i','?' and 158 isn't a valid character in the various variants of ISO-8859 (-1 and -15 for Western European and -2 for Central European (which supports Croatian)). And it's not valid UTF-8 either. It may be valid Croatian but the actual encoding should be determined. Once you know it you can use:

Code: Select all

unicodeString=unicode(string, encoding={your encoding})

where {your encoding} is hopefully among Python's known encodings. You can then use unicodeString in the Gimp API.
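
While the encoding is still unknown, one way to narrow it down is to decode the same bytes with several candidate encodings and compare the results with the expected spelling. This is a sketch in Python 3, where `bytes.decode(name)` plays the role of Python 2's `unicode(string, encoding=name)`; the helper `try_decodings` and the candidate list are hypothetical, not something from the Gimp API.

```python
# The bytes behind the ordinals reported earlier in the thread.
raw = bytes([75, 97, 114, 97, 100, 158, 105, 63])

# Candidate encoding names to try (an illustrative selection only).
candidates = ["cp1250", "cp852", "iso8859_2", "utf-8"]


def try_decodings(data, encodings):
    """Return {encoding: decoded text or error message} for each candidate."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except (UnicodeDecodeError, LookupError) as exc:
            # Either the bytes are invalid in this encoding,
            # or Python has no codec registered under this name.
            results[name] = "error: %s" % exc
    return results


decoded = try_decodings(raw, candidates)
```

Inspecting `decoded` with `repr()` rather than `print` makes invisible control characters show up as escapes, which helps when a decode "succeeds" but the result is still wrong.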

Grafx · **Joined:** Mar 15, 2013 **Posts:** 36

ofnuts wrote:

....

Definitely not, it contains... whatever the encoding decides, but this looks like some legacy encoding (since it's a <str> it is not Unicode). These values are, in ASCII: 'K','a','r','a','d',###,'i','?' and 158 isn't a valid character in the various variants of ISO-8859 (-1 and -15 for Western European and -2 for Central European (which supports Croatian)). And it's not valid UTF-8 either. It may be valid Croatian but the actual encoding should be determined. Once you know it you can use:

Code: Select all

unicodeString=unicode(string, encoding={your encoding})

where {your encoding} is hopefully among Python's known encodings. You can then use unicodeString in the Gimp API.

OK. MySQL has a function to retrieve the character set of the string being retrieved, SELECT CHARSET(strng) ... I applied this and got "latin2" returned. I googled "latin2 character set" and found a page, Code Page 852. I looked "CP852" up at the link you provided for Python's standard encodings and found it there. I tried the names listed for that entry and found that only the alias "852" didn't give a "not defined" error.

But now I am getting other errors from

Code: Select all

unicode(my_string, encoding={852})

Code: Select all

TypeError: unicode() argument 2 must be string, not set

I've tried entering a string "my_string" directly and get the same error.

ofnuts · **Joined:** Oct 25, 2010 **Posts:** 4739

Grafx wrote:

ofnuts wrote:

....

Definitely not, it contains... whatever the encoding decides, but this looks like some legacy encoding (since it's a <str> it is not Unicode). These values are, in ASCII: 'K','a','r','a','d',###,'i','?' and 158 isn't a valid character in the various variants of ISO-8859 (-1 and -15 for Western European and -2 for Central European (which supports Croatian)). And it's not valid UTF-8 either. It may be valid Croatian but the actual encoding should be determined. Once you know it you can use:

Code: Select all

unicodeString=unicode(string, encoding={your encoding})

where {your encoding} is hopefully among Python's known encodings. You can then use unicodeString in the Gimp API.

OK. MySQL has a function to retrieve the character set of the string being retrieved, SELECT CHARSET(strng) ... I applied this and got "latin2" returned. I googled "latin2 character set" and found a page, Code Page 852. I looked "CP852" up at the link you provided for Python's standard encodings and found it there. I tried the names listed for that entry and found that only the alias "852" didn't give a "not defined" error.

But now I am getting other errors from

Code: Select all

unicode(my_string, encoding={852})

Code: Select all

TypeError: unicode() argument 2 must be string, not set

I've tried entering a string "my_string" directly and get the same error.

An encoding is a name and is always a string:

Code: Select all

unicodeString=unicode('abcd',encoding='CP852')

is accepted... and the string above gives 'Karad×i?'

Now, why anyone is still using CP852 (which dates back to the original PC, in 1982 in Europe) in 2013 is beyond my understanding.
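
To make the point about encoding names concrete: the name is an ordinary quoted string, and Python's `codecs.lookup` can be used to check whether a name or alias is registered before trying to decode with it. A small Python 3 sketch follows; `is_known_encoding` is a hypothetical helper name.

```python
import codecs


def is_known_encoding(name):
    """Return True if Python has a codec registered under this name or alias."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


# The encoding argument is a plain string naming a codec, not {852} or a bare name.
text = b"abcd".decode("cp852")

# Aliases such as "852" resolve to the same codec as "cp852".
same_codec = codecs.lookup("852").name == codecs.lookup("cp852").name
```

Writing `{852}` builds a Python set, and writing `CP852` without quotes refers to an undefined variable, which would explain both the "must be string, not set" and the "not defined" errors.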

paynekj · **Posted:** Tue Apr 09, 2013 3:41 pm **(#13)**

I would argue that it should be

Code: Select all

unicodeString=unicode('abcd',encoding='latin2')

as that's what MySQL is identifying it as.

Grafx · **Joined:** Mar 15, 2013 **Posts:** 36

It is still not converting right,

Code: Select all

for row in results:
    trial_nam = row[0]
    print trial_nam
    print unicode(trial_nam, encoding = 'CP852')

Gives me

Code: Select all

Karadži?
Karad×i?
Staniši? & Simatovi?
StaniÜi? & Simatovi?
Boškoski & Tar?ulovski
BoÜkoski & Tar?ulovski
?or?evi?
?or?evi?
Ražnatovi?, Željko - "Arkan"
Ra×natovi?, Äeljko - "Arkan"

This should look something like,

Code: Select all

Karadžić
Stanišić & Simatović
Boškoski & Tarčulovski
Đorđević
Ražnatović, Željko - "Arkan"

And I noticed that character 158 gives the multiplication sign on the CP852 web page. Maybe there is some other designation for latin2. Where might I look?

Grafx · **Joined:** Mar 15, 2013 **Posts:** 36

paynekj wrote:

I would argue that it should be

Code: Select all

unicodeString=unicode('abcd',encoding='latin2')

as that's what MySQL is identifying it as.

I just tried that and got,

Code: Select all

UnicodeEncodeError: 'ascii' codec can't encode character u'\x9e' in position 5: ordinal not in range(128)
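
The error above is consistent with how ISO 8859-2 ("latin2") treats byte 0x9E: the decode itself succeeds, but the byte maps to a C1 control character rather than a letter, and the failure presumably comes later when the result is encoded to ASCII for output. A Python 3 sketch of the decoding half, with illustrative variable names:

```python
import unicodedata

# The bytes behind the ordinals reported earlier in the thread.
raw = bytes([75, 97, 114, 97, 100, 158, 105, 63])

# Decoding as ISO 8859-2 does not raise; every byte has a mapping.
as_latin2 = raw.decode("iso8859_2")

# Look at what byte 158 (0x9E) turned into.
ch = as_latin2[5]
code_point = hex(ord(ch))
category = unicodedata.category(ch)  # "Cc" means a control character
```

Since U+009E is a control character and not 'ž', latin2 cannot be the encoding actually used for these bytes, whatever MySQL reports for the column.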

paynekj · **Posted:** Tue Apr 09, 2013 4:04 pm **(#16)**

The list of known Python encodings that Ofnuts pointed you to earlier http://docs.python.org/2/library/codecs ... -encodings has a mac_latin2 encoding that you could try

And Googling for one of the problem characters: Đ

got me here:
http://en.wikipedia.org/wiki/D_with_str ... r_encoding

which also mentions Latin-4 and Latin-10 and the Wiki page for Latin-10 mentions Croatian : http://en.wikipedia.org/wiki/ISO/IEC_8859-16 but the character with ordinal value 158 is still invalid.

Grafx · **Joined:** Mar 15, 2013 **Posts:** 36

I've tried the following and they either mangle the character or return errors,

Code: Select all

mac_latin2
mac_cyrillic
latin2
latin4
latin10
cp855
cp775
CP1250
cp1251
iso8859_2
iso8859_5
iso8859_16

ofnuts · **Joined:** Oct 25, 2010 **Posts:** 4739

Grafx wrote:

It is still not converting right,

Code: Select all

for row in results:
    trial_nam = row[0]
    print trial_nam
    print unicode(trial_nam, encoding = 'CP852')

Gives me

Code: Select all

Karadži?
Karad×i?
Staniši? & Simatovi?
StaniÜi? & Simatovi?
Boškoski & Tar?ulovski
BoÜkoski & Tar?ulovski
?or?evi?
?or?evi?
Ražnatovi?, Željko - "Arkan"
Ra×natovi?, Äeljko - "Arkan"

This is exactly ('×,Ä,Ü', for 'ž,Ž,š') what you get when you use CP852 to display something which is intended for CP1250. So it looks like your data is in CP1250. What kind of errors do you get when you use CP1250?
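
The mismatch can be reproduced directly by encoding the intended letters as CP1250 and then reading the resulting bytes back as CP852. A short Python 3 sketch, with illustrative variable names:

```python
# The three letters that came out wrong in the CP852 attempt.
intended = "žŽš"

# What the bytes look like if the text was stored as CP1250.
stored = intended.encode("cp1250")

# What you see if those same bytes are interpreted as CP852.
misread = stored.decode("cp852")

# Reading them back with the matching encoding restores the text.
restored = stored.decode("cp1250")
```

`stored` is the byte sequence 0x9E 0x8E 0x9A, and `misread` is '×ÄÜ', matching the garbled output in the post above.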

The final question mark is really the ASCII code for a question mark (63₁₀), so the data may have been corrupted on input, maybe because the 'ć' was not considered a valid character. But then an automatic change of 'i?' to 'ić' at the end of strings may be a possible way to fix this.
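
A rough sketch of that kind of automatic repair, in Python 3, could look like the following. It is only a heuristic: it assumes every 'i?' at the end of a word was originally 'ić', which may be wrong for other data, and it does nothing for the other lost letters. The function name `guess_ic_suffix` is made up for this example.

```python
import re

# "i?" that is not followed by another word character, i.e. at the end of a word.
_TRAILING_IC = re.compile(r"i\?(?!\w)")


def guess_ic_suffix(name):
    """Replace a word-final 'i?' with 'ić' (a guess, not a real decoding step)."""
    return _TRAILING_IC.sub("ić", name)


repaired = [guess_ic_suffix(n) for n in ["Karadži?", "Staniši? & Simatovi?"]]
```

Question marks elsewhere in a name, such as in 'Tar?ulovski' or '?or?evi?', are left alone because nothing in the string says which letter was lost there.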

Grafx · **Joined:** Mar 15, 2013 **Posts:** 36

ofnuts wrote:

....
This is exactly ('×,Ä,Ü', for 'ž,Ž,š') what you get when you use CP852 to display something which is intended for CP1250. So it looks like your data is in CP1250. What kind of errors do you get when you use CP1250?

The final question mark is really the ASCII code for a question mark (63₁₀), so the data may have been corrupted on input, maybe because the 'ć' was not considered a valid character. But then an automatic change of 'i?' to 'ić' at the end of strings may be a possible way to fix this.

Here is what I get when I apply CP1250 to that bit of code,

Code: Select all

for row in results:
    trial_nam = row[0]
    print trial_nam
    print unicode(trial_nam, encoding = 'CP1250')

Code: Select all

Karadži?
Karadži?
Staniši? & Simatovi?
Staniši? & Simatovi?
Boškoski & Tar?ulovski
Boškoski & Tar?ulovski
?or?evi?
?or?evi?
Ražnatovi?, Željko - "Arkan"
Ražnatovi?, Željko - "Arkan"

So the results are exactly the same whether I apply the unicode function or not. The only thing missing at this point is the odd 'c' and 'd'.

Again, it should be something like,

Code: Select all

Karadžić
Stanišić & Simatović
Boškoski & Tarčulovski
Đorđević
Ražnatović, Željko - "Arkan"

I might add, the names are Serbian. The native script for Serbian is Cyrillic, but printing Serbian words in English requires other characters used by Croatians, who use a basic Latin script with the addition of these other characters in order to accommodate the Serbian language, which the Croatians also speak.

ofnuts · **Joined:** Oct 25, 2010 **Posts:** 4739

Grafx wrote:

ofnuts wrote:

....
This is exactly ('×,Ä,Ü', for 'ž,Ž,š') what you get when you use CP852 to display something which is intended for CP1250. So it looks like your data is in CP1250. What kind of errors do you get when you use CP1250?

The final question mark is really the ASCII code for a question mark (63₁₀), so the data may have been corrupted on input, maybe because the 'ć' was not considered a valid character. But then an automatic change of 'i?' to 'ić' at the end of strings may be a possible way to fix this.

Here is what I get when I apply CP1250 to that bit of code,

Code: Select all

for row in results:
    trial_nam = row[0]
    print trial_nam
    print unicode(trial_nam, encoding = 'CP1250')

Code: Select all

Karadži?
Karadži?
Staniši? & Simatovi?
Staniši? & Simatovi?
Boškoski & Tar?ulovski
Boškoski & Tar?ulovski
?or?evi?
?or?evi?
Ražnatovi?, Željko - "Arkan"
Ražnatovi?, Željko - "Arkan"

So the results are exactly the same whether I apply the unicode function or not. The only thing missing at this point is the odd 'c' and 'd'.

Again, it should be something like,

Code: Select all

Karadžić
Stanišić & Simatović
Boškoski & Tarčulovski
Đorđević
Ražnatović, Željko - "Arkan"

I might add, the names are Serbian. The native script for Serbian is Cyrillic, but printing Serbian words in English requires other characters used by Croatians, who use a basic Latin script with the addition of these other characters in order to accommodate the Serbian language, which the Croatians also speak.

1) I'm afraid that the '?' are just question marks as returned by MySQL, so there is nothing to do about it in the Python code. Either the data in MySQL is still good(*) and it may be a MySQL configuration or request setup problem, or it is already corrupted and it's hopeless.

2) the big difference is that Gimp will accept the <unicode> but not the <str>.

(*) Not a MySQL (or even SQL) expert but if

Code: Select all

select NAME from TABLE where NAME like '%?%'

doesn't return anything, then there is some hope the data is good, since it doesn't really contain question marks. But if you get all the names above, then the question marks are there and your data is corrupted.
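
One hypothesis that fits the observed bytes, though nothing in the thread confirms it, is that somewhere between the table and the script the text is converted to a Windows-1252 style character set (which, as far as I know, is what MySQL calls "latin1"). That character set has 'ž', 'Ž' and 'š' at the same byte values as CP1250 but lacks 'ć', 'č', 'Đ' and 'đ', so those would be replaced by '?'. The Python 3 sketch below only simulates such a lossy conversion; `simulate_lossy_conversion` is a hypothetical helper and no database is involved.

```python
# Names as they should appear, taken from the expected output in the thread.
expected = ["Karadžić", "Đorđević", "Boškoski & Tarčulovski"]


def simulate_lossy_conversion(text, charset="cp1252"):
    """Encode text to charset, substituting '?' for unsupported characters."""
    return text.encode(charset, errors="replace")


lossy = [simulate_lossy_conversion(name) for name in expected]

# Compare with the ordinals reported earlier: 75,97,114,97,100,158,105,63
karadzic_ordinals = list(lossy[0])

# Decoding the lossy bytes as CP1250 gives the output seen in the CP1250 test.
seen = [b.decode("cp1250") for b in lossy]
```

If this is what is happening, the stored data may still be intact and the loss occurs on the way out. In that case it would be worth checking the connection character set (for example with `SHOW VARIABLES LIKE 'character_set%'`) and requesting UTF-8 from the server (for example with `SET NAMES utf8`) before running the query.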
