It is currently Fri May 03, 2024 1:53 am


All times are UTC - 5 hours [ DST ]



Post new topic Reply to topic  [ 24 posts ]  Go to page 1, 2  Next
 Post subject: [SOLVED] Character set problem in python plug-in
PostPosted: Fri Apr 05, 2013 5:42 pm  (#1) 
GimpChat Member

Joined: Mar 15, 2013
Posts: 36
I am retrieving strings from a MySQL database and many of the characters are not English, but Latin-Croatian. Here is a sample string:
žćšćščĐđćžćŽ


I am having problems putting these strings into text boxes. I am getting various errors depending on my attempts. My last attempt was something like,

import codecs

....

    my_string = codecs.encode(odd_char_string) + normal_string


And I get error,

UnicodeDecodeError: 'utf8' codec can't decode byte 0x9e in position 5: invalid start byte


I've been through this mess before in both MySQL and Perl and it took me weeks to get it straightened out. The on-line documentation is no help.

Does anyone have a clue what to do? As a demonstration of the problems I am having, try posting that sample Latin Croatian text outside of CODE markup and the post will not preview.


Last edited by Grafx on Thu Apr 11, 2013 12:30 pm, edited 2 times in total.

 Post subject: Re: Character set problem in python plug-in
PostPosted: Fri Apr 05, 2013 6:32 pm  (#2) 
Script Coder

Joined: Oct 25, 2010
Posts: 4739
The $10K question is: what encoding do you get from MySQL? What type are the strings, str or unicode? How do they display with a "print" statement? And remember that the GIMP API expects unicode, not str.
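A minimal sketch of the str-versus-unicode distinction raised here. It is written for Python 3, where Python 2's unicode type is called str and Python 2's str corresponds to bytes; the variable names are made up for illustration. It shows that one character has different byte forms under different encodings, and that decoding with the wrong codec fails much like the error in the first post.

```python
# One character as text (Python 3 str; Python 2 would call this unicode).
text = "ž"

# The same character as bytes under two different encodings.
as_utf8 = text.encode("utf-8")      # two bytes
as_cp1250 = text.encode("cp1250")   # one byte, value 0x9e

# Decoding bytes with the wrong codec fails.
try:
    as_cp1250.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc.reason
```

Here error ends up holding the same "invalid start byte" reason that appears in the original traceback.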

 Post subject: Re: Character set problem in python plug-in
PostPosted: Sat Apr 06, 2013 12:48 pm  (#3) 
GimpChat Member

Joined: Mar 15, 2013
Posts: 36
ofnuts wrote:
The $10K question is: what encoding do you get from MySQL? What type are the strings, str or unicode? How do they display with a "print" statement? And remember that the GIMP API expects unicode, not str.


I tried using pdb.gimp_message(my_string) and got,

Calling error for procedure 'mysql_field':
Procedure 'gimp-message' has been called with value 'xxxxxži?' for argument 'message' (#1, type gchararray). This value is out of range.


xxxxx..i? represents the string with the Latin Croatian characters in it. The Latin Croatian "z" made it, but the question mark represents a character that didn't.

Is there another "print" function I should try to expose further what is going on?


 Post subject: Re: Character set problem in python plug-in
PostPosted: Sat Apr 06, 2013 1:54 pm  (#4) 
Script Coder

Joined: Apr 23, 2010
Posts: 1553
Location: not from Guildford after all
GIMP handles printing the string as copied from your first post just fine. (Also, in my browser the text appears fine even if outside of CODE blocks.)

Python:
pdb.gimp_message("žćšćščĐđćžćŽ")

Script-fu:
(gimp-message "žćšćščĐđćžćŽ")

This suggests (as Ofnuts opined) that your problem lies with how MySQL is encoding the string.

_________________
Any sufficiently primitive technology is indistinguishable from a rock.


 Post subject: Re: Character set problem in python plug-in
PostPosted: Sat Apr 06, 2013 3:23 pm  (#5) 
Script Coder

Joined: Oct 25, 2010
Posts: 4739
Grafx wrote:
ofnuts wrote:
The $10K question is: what encoding do you get from MySQL? What type are the strings, str or unicode? How do they display with a "print" statement? And remember that the GIMP API expects unicode, not str.


I tried using pdb.gimp_message(my_string) and got,

Calling error for procedure 'mysql_field':
Procedure 'gimp-message' has been called with value 'xxxxxži?' for argument 'message' (#1, type gchararray). This value is out of range.


xxxxx..i? represents the string with the Latin Croatian characters in it. The Latin Croatian "z" made it, but the question mark represents a character that didn't.

Is there another "print" function I should try to expose further what is going on?

Assuming your string variable is "s", what is the output of:
pdb.gimp_message(type(s))

and
pdb.gimp_message(','.join([str(ord(c)) for c in s]))
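The two diagnostics above can be wrapped in one small helper. This is only a sketch: describe is a hypothetical name, the sample byte string is chosen for illustration, and bytearray is used so the loop yields integers on both Python 2 and Python 3 without calling ord().

```python
def describe(value):
    """Return the type name and the comma-separated byte values of a byte string."""
    # bytearray() yields integers on both Python 2 and Python 3.
    codes = ",".join(str(b) for b in bytearray(value))
    return type(value).__name__, codes


# Sample input for illustration only.
kind, codes = describe(b"Karad\x9ei?")
```

Inside a plug-in, the two returned strings could then be passed to pdb.gimp_message one at a time.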

 Post subject: Re: Character set problem in python plug-in
PostPosted: Sun Apr 07, 2013 10:53 am  (#6) 
GimpChat Member

Joined: Mar 15, 2013
Posts: 36
pdb.gimp_message(type(s))

For that message, I get,
<type 'str'>

pdb.gimp_message(','.join([str(ord(c)) for c in s]))

For that message, I get,
75,97,114,97,100,158,105,63

I'm pretty sure that this string represents the same string I posted previously. It most certainly contains Latin Croatian characters. No doubt about it.


 Post subject: No joy yet.
PostPosted: Mon Apr 08, 2013 4:05 pm  (#8) 
GimpChat Member

Joined: Mar 15, 2013
Posts: 36
saulgoode wrote:
GIMP handles printing the string as copied from your first post just fine. (Also, in my browser the text appears fine even if outside of CODE blocks.)
Python:
pdb.gimp_message("žćšćščĐđćžćŽ")

Script-fu:
(gimp-message "žćšćščĐđćžćŽ")

This suggests (as Ofnuts opined) that your problem lies with how MySQL is encoding the string.


I can confirm your success here. I also found that if you reproduce the problem in a simple non-plug-in script with the text in the script itself, you can't even save the script for use unless the file is in UTF-8. Furthermore, the Python compiler will reject the characters unless you add a magic comment at the top,
# coding=utf-8
as explained at the link given in the compiler's error message,

http://www.python.org/dev/peps/pep-0263/
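As a rough illustration of what the magic comment does, the standard library can report which encoding a source file declares. This sketch uses Python 3's tokenize.detect_encoding; declared_encoding is a hypothetical helper name, and the cp1250 declaration is used only because a UTF-8 declaration would be indistinguishable from Python 3's default. (Python 2 defaults to ASCII for source files, which is why the comment was required there.)

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding a Python source file declares, or the default."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines = tokenize.detect_encoding(readline)
    return encoding


with_cookie = declared_encoding(b"# coding=cp1250\nname = 'x'\n")
without_cookie = declared_encoding(b"name = 'x'\n")
```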

The only control I seem to have with MySQL is by way of "collation" and the collation I have applied to this table is Latin Croatian. I have tried,

SELECT ... CONVERT(My_Field USING utf8) ...


but get the same results. Is there another character set I might try for this?


 Post subject: Re: Character set problem in python plug-in
PostPosted: Tue Apr 09, 2013 8:24 am  (#9) 
Script Coder

Joined: Jun 22, 2010
Posts: 1171
Location: Here and there
Have you tried using the Python encode/decode functions to change the encoding?

Here's a discussion on something similar on Stackoverflow:
http://stackoverflow.com/questions/4299 ... to-latin-1

You're referring to your data as being "Latin Croatian"; is that actually the latin-2 encoding?
http://en.wikipedia.org/wiki/ISO/IEC_8859-2

Kevin


 Post subject: Re: Character set problem in python plug-in
PostPosted: Tue Apr 09, 2013 8:58 am  (#10) 
Script Coder

Joined: Oct 25, 2010
Posts: 4739
Grafx wrote:
pdb.gimp_message(type(s))

For that message, I get,
<type 'str'>

pdb.gimp_message(','.join([str(ord(c)) for c in s]))

For that message, I get,
75,97,114,97,100,158,105,63

I'm pretty sure that this string represents the same string I posted previous. It most certainly contains Latin Croatian characters. No doubt about it.

Definitely not. It contains... whatever the encoding decides, but this looks like some legacy encoding (since it's a <str>, it is not Unicode). These values are, in ASCII: 'K','a','r','a','d',###,'i','?', and 158 isn't a printable character in the various variants of ISO-8859 (-1 and -15 for Western European and -2 for Central European, which supports Croatian). And it's not valid UTF-8 either. It may be valid Croatian, but the actual encoding should be determined. Once you know it you can use:
unicodeString=unicode(string, encoding={your encoding})

where {your encoding} is hopefully among Python's known encodings. You can then use unicodeString in the Gimp API.
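A sketch of the same conversion in a form that runs on Python 3, where unicode(string, encoding=...) is spelled bytes.decode(encoding). The helper name to_text and the sample bytes are assumptions for illustration; note that the encoding is passed as a quoted name.

```python
def to_text(value, encoding):
    """Decode a byte string to text; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Illustrative input: the byte 0xe9 is "é" in Latin-1.
decoded = to_text(b"caf\xe9", "latin-1")
already_text = to_text("plain", "latin-1")
```

The pass-through branch keeps the helper safe to call on values that are already text.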

 Post subject: Re: Character set problem in python plug-in
PostPosted: Tue Apr 09, 2013 3:10 pm  (#11) 
GimpChat Member

Joined: Mar 15, 2013
Posts: 36
ofnuts wrote:
....

Definitely not. It contains... whatever the encoding decides, but this looks like some legacy encoding (since it's a <str>, it is not Unicode). These values are, in ASCII: 'K','a','r','a','d',###,'i','?', and 158 isn't a printable character in the various variants of ISO-8859 (-1 and -15 for Western European and -2 for Central European, which supports Croatian). And it's not valid UTF-8 either. It may be valid Croatian, but the actual encoding should be determined. Once you know it you can use:
unicodeString=unicode(string, encoding={your encoding})

where {your encoding} is hopefully among Python's known encodings. You can then use unicodeString in the Gimp API.


OK. MySQL has a function to retrieve the character set of a string, SELECT CHARSET(strng) ... I applied it and got "latin2" back. I googled "latin2 character set" and found a page on Code Page 852. I looked up "CP852" at the link you provided for Python's standard encodings and found it there. Of the names I tried, only the alias "852" didn't give a "not defined" error.

But now I am getting other errors from
unicode(my_string, encoding={852})
TypeError: unicode() argument 2 must be string, not set
I've tried entering a string "my_string" directly and get the same error.


 Post subject: Re: Character set problem in python plug-in
PostPosted: Tue Apr 09, 2013 3:30 pm  (#12) 
Script Coder

Joined: Oct 25, 2010
Posts: 4739
Grafx wrote:
ofnuts wrote:
....

Definitely not. It contains... whatever the encoding decides, but this looks like some legacy encoding (since it's a <str>, it is not Unicode). These values are, in ASCII: 'K','a','r','a','d',###,'i','?', and 158 isn't a printable character in the various variants of ISO-8859 (-1 and -15 for Western European and -2 for Central European, which supports Croatian). And it's not valid UTF-8 either. It may be valid Croatian, but the actual encoding should be determined. Once you know it you can use:
unicodeString=unicode(string, encoding={your encoding})

where {your encoding} is hopefully among Python's known encodings. You can then use unicodeString in the Gimp API.


OK. MySQL has a function to retrieve the character set of a string, SELECT CHARSET(strng) ... I applied it and got "latin2" back. I googled "latin2 character set" and found a page on Code Page 852. I looked up "CP852" at the link you provided for Python's standard encodings and found it there. Of the names I tried, only the alias "852" didn't give a "not defined" error.

But now I am getting other errors from
unicode(my_string, encoding={852})
TypeError: unicode() argument 2 must be string, not set
I've tried entering a string "my_string" directly and get the same error.


An encoding is a name and is always a string:
unicodeString=unicode('abcd',encoding='CP852')

is accepted... and the string above gives 'Karad×i?'
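To check whether a name is one Python knows before decoding with it, codecs.lookup can be used. The sketch below (Python 3; is_known_encoding is a hypothetical helper) also shows that a set such as {852} is rejected with a TypeError, which is the kind of error reported above.

```python
import codecs


def is_known_encoding(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


# A set is not an encoding name at all, so decode() refuses it.
try:
    b"abcd".decode({852})
    rejected = False
except TypeError:
    rejected = True
```

Both "cp852" and the bare alias "852" resolve, as long as they are written as quoted strings.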

Now, why one is still using CP852 (which dates back to the original PC, in 1982 in Europe) in 2013 is beyond my understanding :)



Last edited by ofnuts on Tue Apr 09, 2013 3:42 pm, edited 1 time in total.

 Post subject: Re: Character set problem in python plug-in
PostPosted: Tue Apr 09, 2013 3:41 pm  (#13) 
Script Coder

Joined: Jun 22, 2010
Posts: 1171
Location: Here and there
I would argue that it should be
unicodeString=unicode('abcd',encoding='latin2')
as that's what MySQL is identifying it as.


 Post subject: Re: Character set problem in python plug-in
PostPosted: Tue Apr 09, 2013 3:55 pm  (#14) 
GimpChat Member

Joined: Mar 15, 2013
Posts: 36
It is still not converting right,

for row in results:
    trial_nam = row[0]
    print trial_nam
    print unicode(trial_nam, encoding = 'CP852')


Gives me

Karadži?
Karad×i?
Staniši? & Simatovi?
StaniÜi? & Simatovi?
Boškoski & Tar?ulovski
BoÜkoski & Tar?ulovski
?or?evi?
?or?evi?
Ražnatovi?, Željko - "Arkan"
Ra×natovi?, Äeljko - "Arkan"


This should look something like,

Karadžić
Stanišić & Simatović
Boškoski & Tarčulovski
Đorđević
Ražnatović, Željko - "Arkan"


And I noticed that character 158 gives the multiplication sign on the CP852 web page. Maybe there is some other designation for latin2. Where might I look?


 Post subject: No go.
PostPosted: Tue Apr 09, 2013 3:57 pm  (#15) 
GimpChat Member

Joined: Mar 15, 2013
Posts: 36
paynekj wrote:
I would argue that it should be
unicodeString=unicode('abcd',encoding='latin2')
as that's what MySQL is identifying it as.


I just tried that and got,

UnicodeEncodeError: 'ascii' codec can't encode character u'\x9e' in position 5: ordinal not in range(128)
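Note that this is an encode error, not a decode error. A likely reading, sketched below for Python 3, is that decoding as latin2 succeeded but turned byte 158 into the control character U+009E, and a later implicit conversion to ASCII (for example when printing) then failed. The variable names are illustrative.

```python
import unicodedata

# Decoding succeeds: ISO-8859-2 maps byte 0x9e to a C1 control character.
decoded = b"Karad\x9ei?".decode("latin2")
suspect = decoded[5]
category = unicodedata.category(suspect)  # "Cc" means control character

# Converting the result to ASCII is the step that fails.
try:
    decoded.encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc.reason
```

So latin2 does not raise while decoding, but it does not produce a ž either.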


 Post subject: Re: Character set problem in python plug-in
PostPosted: Tue Apr 09, 2013 4:04 pm  (#16) 
Script Coder

Joined: Jun 22, 2010
Posts: 1171
Location: Here and there
The list of known Python encodings that Ofnuts pointed you to earlier http://docs.python.org/2/library/codecs ... -encodings has a mac_latin2 encoding that you could try

And Googling for one of the problem characters: Đ

got me here:
http://en.wikipedia.org/wiki/D_with_str ... r_encoding

which also mentions Latin-4 and Latin-10 and the Wiki page for Latin-10 mentions Croatian : http://en.wikipedia.org/wiki/ISO/IEC_8859-16 but the character with ordinal value 158 is still invalid.


 Post subject: Still no joy
PostPosted: Tue Apr 09, 2013 4:54 pm  (#17) 
GimpChat Member

Joined: Mar 15, 2013
Posts: 36
I've tried the following and they either mangle the character or return errors,

mac_latin2
mac_cyrillic
latin2
latin4
latin10
cp855
cp775
CP1250
cp1251
iso8859_2
iso8859_5
iso8859_16
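Trying a list like this by hand is error-prone, so a loop that records what each codec makes of the problem byte may help. This is a sketch for Python 3; try_encodings is a hypothetical helper, and failures are recorded as a placeholder string instead of stopping the loop.

```python
CANDIDATES = [
    "mac_latin2", "mac_cyrillic", "latin2", "latin4", "latin10",
    "cp855", "cp775", "cp1250", "cp1251",
    "iso8859_2", "iso8859_5", "iso8859_16",
]


def try_encodings(raw, names):
    """Decode raw bytes with each named codec and collect the results."""
    results = {}
    for name in names:
        try:
            results[name] = raw.decode(name)
        except (LookupError, UnicodeDecodeError) as exc:
            # Keep going; note which kind of failure occurred.
            results[name] = "<%s>" % type(exc).__name__
    return results


# Byte 158 (0x9e) is the one that should come out as "ž".
results = try_encodings(b"\x9e", CANDIDATES)
```

Inspecting the values with repr() makes invisible control characters show up as escapes such as '\x9e'.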


 Post subject: Re: Character set problem in python plug-in
PostPosted: Wed Apr 10, 2013 2:35 am  (#18) 
Script Coder

Joined: Oct 25, 2010
Posts: 4739
Grafx wrote:
It is still not converting right,

for row in results:
    trial_nam = row[0]
    print trial_nam
    print unicode(trial_nam, encoding = 'CP852')


Gives me

Karadži?
Karad×i?
Staniši? & Simatovi?
StaniÜi? & Simatovi?
Boškoski & Tar?ulovski
BoÜkoski & Tar?ulovski
?or?evi?
?or?evi?
Ražnatovi?, Željko - "Arkan"
Ra×natovi?, Äeljko - "Arkan"


This is exactly ('×,Ä,Ü', for 'ž,Ž,š') what you get when you use CP852 to display something which is intended for CP1250. So it looks like your data is in CP1250. What kind of errors do you get when you use CP1250?
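The substitution described here can be reproduced directly. A short Python 3 sketch, with illustrative variable names: encode the intended characters as CP1250, then misread the resulting bytes as CP852.

```python
intended = "žŽš"

# The bytes these characters occupy in CP1250.
raw = intended.encode("cp1250")

# The same bytes read back with the wrong code page.
misread = raw.decode("cp852")
```

The misread string comes out as '×ÄÜ', matching the mangled output in the listing.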

The final question mark is really the ASCII code for a question mark (63 in decimal), so the data may have been corrupted on input, maybe because the 'ć' was not considered a valid character. But then an automatic change of 'i?' to 'ić' at the end of strings may be possible to fix this.
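The suggested repair could look roughly like this. It is a heuristic sketch only (restore_trailing_c is a hypothetical name): it assumes every name ending in 'i?' originally ended in 'ić', it expects already-decoded text, and it does nothing for question marks elsewhere in a name.

```python
import re


def restore_trailing_c(name):
    """Replace a trailing 'i?' with 'ić'; a guess, not a real recovery."""
    return re.sub(r"i\?$", "ić", name)


fixed = restore_trailing_c("Karadži?")
untouched = restore_trailing_c("Boškoski & Tar?ulovski")
```

A name that genuinely ends in 'i?' would be changed too, so results need a manual check.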

 Post subject: Re: Character set problem in python plug-in
PostPosted: Wed Apr 10, 2013 11:20 am  (#19) 
GimpChat Member

Joined: Mar 15, 2013
Posts: 36
ofnuts wrote:
....
This is exactly ('×,Ä,Ü', for 'ž,Ž,š') what you get when you use CP852 to display something which is intended for CP1250. So it looks like your data is in CP1250. What kind of errors do you get when you use CP1250?

The final question mark is really the ASCII code for a question mark (63 in decimal), so the data may have been corrupted on input, maybe because the 'ć' was not considered a valid character. But then an automatic change of 'i?' to 'ić' at the end of strings may be possible to fix this.


Here is what I get when I apply CP1250 to that bit of code,

for row in results:
    trial_nam = row[0]
    print trial_nam
    print unicode(trial_nam, encoding = 'CP1250')


Karadži?
Karadži?
Staniši? & Simatovi?
Staniši? & Simatovi?
Boškoski & Tar?ulovski
Boškoski & Tar?ulovski
?or?evi?
?or?evi?
Ražnatovi?, Željko - "Arkan"
Ražnatovi?, Željko - "Arkan"


So the results are exactly the same whether I apply the unicode function or not. The only thing missing at this point is the odd 'c' and 'd'.

Again, it should be something like,

Karadžić
Stanišić & Simatović
Boškoski & Tarčulovski
Đorđević
Ražnatović, Željko - "Arkan"


I might add, the names are Serbian. The native script for Serbian is Cyrillic, but printing Serbian words in English requires other characters used by Croatians, who use a basic Latin script with the addition of these other characters in order to accommodate the Serbian language, which the Croatians also speak.


 Post subject: Re: Character set problem in python plug-in
PostPosted: Wed Apr 10, 2013 3:42 pm  (#20) 
Script Coder

Joined: Oct 25, 2010
Posts: 4739
Grafx wrote:
ofnuts wrote:
....
This is exactly ('×,Ä,Ü', for 'ž,Ž,š') what you get when you use CP852 to display something which is intended for CP1250. So it looks like your data is in CP1250. What kind of errors do you get when you use CP1250?

The final question mark is really the ASCII code for a question mark (63 in decimal), so the data may have been corrupted on input, maybe because the 'ć' was not considered a valid character. But then an automatic change of 'i?' to 'ić' at the end of strings may be possible to fix this.


Here is what I get when I apply CP1250 to that bit of code,

for row in results:
    trial_nam = row[0]
    print trial_nam
    print unicode(trial_nam, encoding = 'CP1250')


Karadži?
Karadži?
Staniši? & Simatovi?
Staniši? & Simatovi?
Boškoski & Tar?ulovski
Boškoski & Tar?ulovski
?or?evi?
?or?evi?
Ražnatovi?, Željko - "Arkan"
Ražnatovi?, Željko - "Arkan"


So the results are exactly the same whether I apply the unicode function or not. The only thing missing at this point is the odd 'c' and 'd'.

Again, it should be something like,

Karadžić
Stanišić & Simatović
Boškoski & Tarčulovski
Đorđević
Ražnatović, Željko - "Arkan"


I might add, the names are Serbian. The native script for Serbian is Cyrillic, but printing Serbian words in English requires other characters used by Croatians, who use a basic Latin script with the addition of these other characters in order to accommodate the Serbian language, which the Croatians also speak.


1) I'm afraid that the '?' are just question marks as returned by MySQL, so there is nothing to do about it in the Python code. Either the data in MySQL is still good(*) and it may be a MySQL configuration or request setup problem, or it is already corrupted and it's hopeless.

2) the big difference is that Gimp will accept the <unicode> but not the <str>.

(*) Not a MySQL (or even SQL) expert but if
select NAME from TABLE where NAME like '%?%'

doesn't return anything, then there is some hope the data is good, since it doesn't really contain question marks. But if you get all the names above, then the question marks are there and your data is corrupted.
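One hypothesis that fits the byte values reported earlier, offered as an assumption and not something the thread confirms: somewhere between storage and the plug-in, the text is converted to a Windows-1252-style character set (which MySQL calls latin1). That set contains ž and š but not ć, č, đ or Đ, so those would be replaced by question marks. A Python 3 sketch of the effect, with illustrative names:

```python
names = ["Karadžić", "Đorđević"]

# Lossy conversion: characters missing from cp1252 become "?".
lossy = [name.encode("cp1252", errors="replace") for name in names]

# Byte values of the first result, in the format used earlier in the thread.
codes = ",".join(str(b) for b in bytearray(lossy[0]))
```

The resulting codes are 75,97,114,97,100,158,105,63, the same values reported for the problem string, which would be consistent with the stored data being intact and the loss happening in transit.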
