Daniel Lemire's blog


Is Python going bad? or The curse of unicode….

I’ve wasted a considerable amount of time in the last two days upgrading my RSS aggregator so that it will have better support for Atom feeds. I use the feedparser library.

One thing that gets to me is how unintuitive unicode is under Python. For example, the following is a string…

t="éee"

Just copy this into your Python interpreter, and it will work nicely. For example,


t='éee'
print t
éee

However, for some reason, if I just type “t”, then it can’t print it properly…

t
'\xe9ee'

See how it is already confusing? (And we haven’t used unicode yet!)
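There is a mechanical explanation, even if nothing in the language surfaces it: the interactive prompt echoes repr(t), escape codes and all, while print writes the raw bytes and lets the terminal render them. Here is a minimal sketch of the difference, assuming a latin-1 terminal:

# -*- coding: latin-1 -*-
t = '\xe9ee'    # the byte string behind 'éee' in latin-1
print repr(t)   # what a bare `t` at the prompt shows: '\xe9ee'
print t         # raw bytes; a latin-1 terminal renders éee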

Next, we can map this string to unicode…

r=unicode(t)

which has the following result…

r=unicode(t)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

Ah… so it tries to interpret t as ascii… fair enough, but we know it is “latin-1” (a.k.a. “iso-8859-1”). It is already quite strange that “print” knows what to do with my string while nothing else in Python seems to… so we do


r=unicode(t,'latin-1')
r
u'\xe9ee'
print r
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

because, see, you can’t print unicode to the screen directly… but you can do the following…


print r.encode('latin-1')
éee
print r.encode('iso-8859-1')
éee

but also


r.encode('latin-1')
'\xe9ee'
r.encode('iso-8859-1')
'\xe9ee'

What is my beef?

  • If ‘print’ assumes ‘latin-1’, then shouldn’t everything else? Why is this not consistent? If it is unsafe to assume ‘latin-1’, then why does print do it? (The first sketch after this list shows where the two defaults actually live.)
  • The encode/decode thing is a mess. We had a perfectly good construct for converting things to strings, and that’s ‘str’. Now we have a new one called ‘encode’, so that, given some unicode, I can do either t.encode(‘ascii’) or str(t) for the same result. Bad. Now I’m stuck forever in a world where I have to figure out whether to encode or decode a string, and which is which. This is hard. This is confusing. (The second sketch below spells out the redundancy.)
  • A string object should know its encoding so I don’t have to. What happens if I receive a string from some library and need to convert it to unicode? How am I supposed to know what its encoding is? There is no sensible way to communicate this right now, which makes debugging a pain. The only excuse I see is that sometimes it is impossible for Python to know the encoding… well, then it should just fail and require the programmer to specify one. There are far too many things that can go wrong when you expect the programmer to keep track of his strings and which is encoded how… (The third sketch below shows the workaround this forces on me.)
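To make the first complaint concrete, here is a minimal sketch, assuming a stock Python 2 install: print pushes byte strings to the terminal as-is, so the terminal’s encoding decides what you see, while implicit conversions like unicode(t) go through the process-wide default codec, which is plain ascii. The two defaults live in different places:

import sys
# what the terminal will use to render the bytes that `print` writes:
print sys.stdout.encoding        # terminal dependent, e.g. 'ISO-8859-1'
# what unicode(t) silently assumes when no encoding is given:
print sys.getdefaultencoding()   # 'ascii' on a stock install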
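And the second complaint in code, a minimal sketch of the redundant spellings and the mnemonic you are forced to memorize:

u = u'abc'
# two spellings of the same unicode-to-bytes conversion:
assert u.encode('ascii') == str(u) == 'abc'
s = '\xe9ee'
# and two spellings of the reverse, bytes-to-unicode:
assert s.decode('latin-1') == unicode(s, 'latin-1')
# the rule you must simply keep in your head:
#   unicode --encode--> bytes, and bytes --decode--> unicode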
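Finally, since the str object will not carry its own encoding, the workaround is to carry it yourself. A hypothetical sketch (the TaggedString class is my own illustration, not anything in the standard library):

class TaggedString(object):
    """A byte string bundled with the encoding it arrived in."""
    def __init__(self, raw, encoding):
        self.raw = raw            # byte string, e.g. from a feed
        self.encoding = encoding  # recorded where the bytes came from
    def to_unicode(self):
        # fails loudly if the recorded encoding is wrong, instead of
        # letting some ascii default blow up far downstream
        return self.raw.decode(self.encoding)

s = TaggedString('\xe9ee', 'latin-1')
print s.to_unicode().encode('latin-1')   # prints the accented string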