Opened at 2014-01-31T17:47:02Z
Closed at 2014-03-02T10:45:40Z
#1121 closed defect (fixed)
Unicode astral characters displayed incorrectly
Reported by: | Owned by: | ||
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | Version: | 3.2 | |
Keywords: | unicode astral character twitter | Cc: | |
IRC client+version: | Client-independent | Operating System: | Linux |
OS version/distro: |
Description
I follow the FakeUnicode Twitter account, which often uses astral-plane characters such as 💩 and 👽. These are displayed incorrectly when I receive them; please find attached two clippings showing the erroneous output in bitlbee and the correct output on the Twitter web interface.
I have not yet ascertained whether this occurs elsewhere in bitlbee.
Attachments (2)
Change History (15)
Changed at 2014-01-31T17:47:35Z by
comment:1 Changed at 2014-02-02T12:02:11Z by
That's most likely the fonts in your IRC client not being able to display those characters correctly.
comment:2 Changed at 2014-02-02T12:06:49Z by
There *is* a possibility that the JSON decoder is getting these wrong, actually. Could you try connecting to BitlBee using an X11 client instead of a terminal one to see if the characters look better?
Alternatively, maybe you could even just copy-paste the stuff from your terminal to something with more fonts and see if things improve.
comment:3 Changed at 2014-02-02T15:59:50Z by
That IS an X11 client - specifically, Quassel. It's definitely not a font issue - I can happily paste and send my own astral characters and receive them from others without issue. It's only bitlbee, and (so far as I know) only IRC which has this problem.
comment:4 Changed at 2014-02-04T04:32:24Z by
Added FakeUnicode to my bitlbee twitter, can confirm.
This tweet in particular, which was one of the most recent ones when I tested, shows correctly using the command-line client t:
$ t status 430377627916836864
ID           430377627916836864
Text         RT @Mistermatt007: 𝓥𝓲𝓿𝓮 𝓤𝓷𝓲𝓬𝓸𝓭𝓮 !!
Screen name  @FakeUnicode
Posted at    Feb 3 13:29 (12 hours ago)
Retweets     1
Favorites    0
Source       web
Excluding spaces and other ASCII, there are 11 Unicode characters, which encode to 44 bytes in UTF-8.
It displays incorrectly with bitlbee:
13:29 < FakeUnicode> RT @Mistermatt007: ������������������������ ������������������������������������������ !!
Those are 66 U+FFFD characters, again not counting the spaces. I don't really understand what happened here. Should do some proper debugging later.
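One explanation consistent with both counts (and with the gdb output further down in this ticket) is that each astral character was split into its two UTF-16 surrogates and each surrogate was written out as its own 3-byte sequence, which is not valid UTF-8. A minimal Python 3 sketch of that arithmetic follows; `split_surrogates` and `broken_encode` are hypothetical helper names for illustration, not bitlbee functions, and the replacement-character count assumes CPython's `errors="replace"` behaviour.

```python
# Illustrative only: reproduces the 44-byte vs. 66-byte arithmetic.
text = ("\U0001d4e5\U0001d4f2\U0001d4ff\U0001d4ee "
        "\U0001d4e4\U0001d4f7\U0001d4f2\U0001d4ec\U0001d4f8\U0001d4ed\U0001d4ee")
astral = "".join(c for c in text if ord(c) > 0xFFFF)


def split_surrogates(ch):
    """Return the UTF-16 high/low surrogates for one astral character."""
    v = ord(ch) - 0x10000
    return chr(0xD800 + (v >> 10)), chr(0xDC00 + (v & 0x3FF))


def broken_encode(s):
    """Encode each surrogate separately as 3 bytes (hypothetical model of the bug)."""
    out = bytearray()
    for ch in s:
        if ord(ch) > 0xFFFF:
            for half in split_surrogates(ch):
                # "surrogatepass" lets Python emit the (invalid) surrogate bytes.
                out += half.encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


correct = astral.encode("utf-8")   # 4 bytes per character
broken = broken_encode(astral)     # 2 surrogates x 3 bytes per character
shown = broken.decode("utf-8", "replace")

print(len(astral), len(correct), len(broken), shown.count("\ufffd"))
```

Under these assumptions, none of the 66 bytes starts a valid UTF-8 sequence, so a lenient decoder substitutes one U+FFFD per byte.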
comment:5 Changed at 2014-02-04T08:58:14Z by
I know there have been bugs like this before in the JSON library I'm using. It appears that there are some more.
(IIRC JSON has ASCII-escaped Unicode sequences; converting them to UTF-8 takes some effort/magic.)
comment:6 Changed at 2014-02-08T16:19:39Z by
I think the problem is that these are 32-bit Unicode sequences, which from how I read it briefly, json.c doesn't currently support.
>>> a=u'𝓥𝓲𝓿𝓮 𝓤𝓷𝓲𝓬𝓸𝓭𝓮'
>>> a
u'\U0001d4e5\U0001d4f2\U0001d4ff\U0001d4ee \U0001d4e4\U0001d4f7\U0001d4f2\U0001d4ec\U0001d4f8\U0001d4ed\U0001d4ee'
json.c expects just two bytes (four nibbles), and it only handles \u (lower case) escape sequences. Ah fun, this is what Twitter sends:
"text":"\ud835\udce5\ud835\udcf2\ud835\udcff\ud835\udcee \ud835\udce4\ud835\udcf7\ud835\udcf2\ud835\udcec\ud835\udcf8\ud835\udced\ud835\udcee bla bla bla"
So those are in fact two \u's per character! We'll need to find a spec on that..
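For reference, the JSON specification represents characters outside the Basic Multilingual Plane as a UTF-16 surrogate pair, i.e. two consecutive \u escapes. A minimal Python 3 sketch of how such a pair maps back to a single code point is below; `combine_surrogates` is a hypothetical name used only for illustration.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    # Each surrogate carries 10 bits of the code point's offset above 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# First pair from the Twitter payload above.
cp = combine_surrogates(0xD835, 0xDCE5)
print(hex(cp))                   # 0x1d4e5
print(chr(cp).encode("utf-8"))   # b'\xf0\x9d\x93\xa5'
```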
comment:7 Changed at 2014-02-08T16:21:36Z by
In fact that looks a lot like UTF-16:
wilmer@peer:/tmp$ iconv -f utf-8 -t utf-16be bla.utf8 | hd
00000000  d8 35 dc e5 d8 35 dc f2  d8 35 dc ff d8 35 dc ee  |.5...5...5...5..|
00000010  00 20 d8 35 dc e4 d8 35  dc f7 d8 35 dc f2 d8 35  |. .5...5...5...5|
00000020  dc ec d8 35 dc f8 d8 35  dc ed d8 35 dc ee        |...5...5...5..|
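The same correspondence can be checked without iconv. The Python 3 sketch below encodes the text as big-endian UTF-16, turns every 16-bit code unit into a \uXXXX escape, and compares the result with what Python's own `json.dumps` emits. The ASCII pass-through is a simplification that only holds for this sample, which contains no quotes, backslashes, or control characters.

```python
import json

text = ("\U0001d4e5\U0001d4f2\U0001d4ff\U0001d4ee "
        "\U0001d4e4\U0001d4f7\U0001d4f2\U0001d4ec\U0001d4f8\U0001d4ed\U0001d4ee")

utf16 = text.encode("utf-16-be")

# Split the byte string into 16-bit big-endian code units.
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

# ASCII stays literal (simplification), everything else becomes \uXXXX.
escaped = "".join(chr(u) if u < 0x80 else "\\u%04x" % u for u in units)

print(utf16.hex()[:16])                    # first four code units as hex
print(escaped == json.dumps(text)[1:-1])   # compare with json.dumps, quotes stripped
```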
That's awful.
comment:8 Changed at 2014-02-08T16:35:54Z by
13:26 < dx> >>> print json.dumps(u'𝓥𝓲𝓿𝓮 𝓤𝓷𝓲𝓬𝓸𝓭𝓮')
13:26 < dx> "\ud835\udce5\ud835\udcf2\ud835\udcff\ud835\udcee \ud835\udce4\ud835\udcf7\ud835\udcf2\ud835\udcec\ud835\udcf8\ud835\udced\ud835\udcee"
13:26 < wilmer> \o/
13:27 < wilmer> So it's "standard"
13:27 < dx> yup.
comment:9 Changed at 2014-02-08T23:34:40Z by
Some random tests with both Python (>>> lines) and bitlbee's json.c ((gdb) lines):
>>> print json.dumps(u"𝓥")
"\ud835\udce5"
>>> u"\ud835\udce5"
u'\ud835\udce5'

>>> print u"\ud835\udce5"[0]
�
>>> print u"\ud835\udce5"[1]
�
>>> print u"\ud835\udce5"
𝓥
>>> u"\ud835\udce5".encode("utf-8")
'\xf0\x9d\x93\xa5'
>>> '\xf0\x9d\x93\xa5'.decode('utf-8')
u'\U0001d4e5'

>>> print u"\ud835\udce5".encode("utf-8")
𝓥
>>> u"\ud835\udce5".encode("utf-8")
'\xf0\x9d\x93\xa5'
>>> '\xf0\x9d\x93\xa5'.decode('utf-8')
u'\U0001d4e5'

(gdb) p json_parse("\"\\ud835\"").u.string.ptr
$7 = 0x604510 "\355\240\265"
(gdb) p json_parse("\"\\ud835\\udce5\"").u.string.ptr
$8 = 0x604580 "\355\240\265\355\263\245"

>>> "\355\240\265"
'\xed\xa0\xb5'
>>> u'\ud835'.encode('utf-8')
'\xed\xa0\xb5'
>>> u'\ud835\udce5'.encode('utf-8')
'\xf0\x9d\x93\xa5'

>>> json.loads("\"\\ud835\\udce5\"")
u'\U0001d4e5'
>>> json.loads("\"\\ud835\"")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 381, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Unpaired high surrogate: line 1 column 3 (char 2)
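The tests above suggest what a decoder has to do: when a \u escape yields a high surrogate, read the next escape as well, join the two into one code point, and only then emit UTF-8. The Python 3 sketch below illustrates that logic under simplifying assumptions. It is not the json.c fix, `decode_u_escapes` is a hypothetical name, and it ignores other JSON escapes such as an escaped backslash. Rejecting unpaired surrogates mirrors the Python 2.7 traceback shown above; a real parser might choose a different policy.

```python
import re

_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(s):
    """Decode \\uXXXX escapes, joining UTF-16 surrogate pairs (illustrative only)."""
    out = []
    pos = 0
    while pos < len(s):
        m = _ESCAPE.match(s, pos)
        if not m:
            # Not an escape: copy the character through unchanged.
            out.append(s[pos])
            pos += 1
            continue
        unit = int(m.group(1), 16)
        pos = m.end()
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next escape must be a low surrogate.
            m2 = _ESCAPE.match(s, pos)
            low = int(m2.group(1), 16) if m2 else None
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            pos = m2.end()
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        out.append(chr(unit))
    return "".join(out)


decoded = decode_u_escapes(r"\ud835\udce5 bla")
print(decoded.encode("utf-8"))
```

With the pair joined first, the UTF-8 output for the first character is the 4-byte sequence seen in the Python tests rather than the two 3-byte sequences produced by json.c.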
Unicode is fun.
comment:10 Changed at 2014-02-13T08:53:38Z by
I've updated json.c to the latest upstream version. It still has this bug, but the update should make it easier to fix and to contribute the fix back upstream.
Will maybe give that a shot later today/this week.
comment:11 Changed at 2014-03-01T00:32:31Z by
That's more like it:
00:26:07 <@root> You: [13] 𝓥𝓲𝓿𝓮 𝓤𝓷𝓲𝓬𝓸𝓭𝓮 nog een keer nope nope nope
Will commit that tomorrow and see whether upstream will take it back.
comment:12 Changed at 2014-03-02T00:40:00Z by
changeset:devel,1014. Success/failure reports are welcome.
comment:13 Changed at 2014-03-02T10:45:40Z by
Resolution: | → fixed |
---|---|
Status: | new → closed |
I'll mark this as resolved, feel free to reopen if you find some sequences that it still doesn't support properly.
Appearance in IRC client