
#1121 closed defect (fixed)

Unicode astral characters displayed incorrectly

Reported by: twwinwood@…
Owned by:
Priority: normal
Milestone:
Component: Twitter
Version: 3.2
Keywords: unicode astral character twitter
Cc:
IRC client+version: Client-independent
Operating System: Linux
OS version/distro:

Description

I follow the FakeUnicode Twitter account, which often uses astral-plane characters such as 💩 and 👽. These are displayed incorrectly when I receive them; please find attached two clippings showing the erroneous output in bitlbee and the correct output on the Twitter web interface.

I have not yet ascertained whether this occurs elsewhere in bitlbee.

Attachments (2)

36.png (5.7 KB) - added by twwinwood@… at 2014-01-31T17:47:35Z.
Appearance in IRC client
45.png (20.4 KB) - added by twwinwood@… at 2014-01-31T17:47:59Z.
Appearance on Twitter web interface


Change History (15)

Changed at 2014-01-31T17:47:35Z by twwinwood@…

Attachment: 36.png added

Appearance in IRC client

Changed at 2014-01-31T17:47:59Z by twwinwood@…

Attachment: 45.png added

Appearance on Twitter web interface

comment:1 Changed at 2014-02-02T12:02:11Z by anonymous

that's most likely fonts in your irc client not being able to display those characters correctly

comment:2 Changed at 2014-02-02T12:06:49Z by wilmer

There *is* a possibility that the JSON decoder is getting these wrong, actually. Could you try connecting to BitlBee using an X11 client instead of a terminal one to see if the characters look better?

Alternatively, maybe you could even just copy-paste the stuff from your terminal to something with more fonts and see if things improve.

comment:3 Changed at 2014-02-02T15:59:50Z by twwinwood@…

That IS an X11 client - specifically, Quassel. It's definitely not a font issue - I can happily paste and send my own astral characters and receive them from others without issue. It's only bitlbee, and (so far as I know) only IRC which has this problem.

comment:4 Changed at 2014-02-04T04:32:24Z by dx

Added FakeUnicode to my bitlbee twitter, can confirm.

This tweet in particular, which was one of the most recent ones when I tested, displays correctly in the command-line client t:

$ t status 430377627916836864 
ID           430377627916836864
Text         RT @Mistermatt007: 𝓥𝓲𝓿𝓮  𝓤𝓷𝓲𝓬𝓸𝓭𝓮  !!
Screen name  @FakeUnicode
Posted at    Feb  3 13:29 (12 hours ago)
Retweets     1
Favorites    0
Source       web

Excluding spaces and other ASCII, there are 11 Unicode characters, which encode to 44 bytes in UTF-8.

It displays incorrectly in BitlBee:

13:29 < FakeUnicode> RT @Mistermatt007: ������������������������  ������������������������������������������  !!

Those are 66 U+FFFD characters, again not counting the spaces. I don't really understand what happened here. Should do some proper debugging later.
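The byte counts above can be checked with a short Python 3 sketch. It assumes (this is not confirmed at this point in the thread) that the decoder turns each UTF-16 surrogate half into its own 3-byte sequence instead of pairing the halves; `to_surrogates` is a hypothetical helper written for this illustration.

```python
# Python 3 sketch. The code points are the ones in the tweet quoted above.
text = ("\U0001d4e5\U0001d4f2\U0001d4ff\U0001d4ee  "
        "\U0001d4e4\U0001d4f7\U0001d4f2\U0001d4ec\U0001d4f8\U0001d4ed\U0001d4ee")

# Keep only the astral-plane characters (code points above U+FFFF).
astral = [c for c in text if ord(c) > 0xFFFF]

def to_surrogates(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

# Correct UTF-8: four bytes per astral character.
correct = "".join(astral).encode("utf-8")

# Assumed broken behaviour: each surrogate half is encoded on its own,
# giving two invalid 3-byte sequences per character.
broken = b"".join(
    chr(half).encode("utf-8", "surrogatepass")
    for c in astral
    for half in to_surrogates(ord(c))
)

print(len(astral), len(correct), len(broken))  # 11 44 66
```

Under that assumption the broken output is 66 bytes long, which would match the 66 replacement characters if the client shows one U+FFFD per invalid byte.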

comment:5 Changed at 2014-02-04T08:58:14Z by wilmer

I know there have been bugs like this before in the JSON library I'm using. It appears that there are some more.

(IIRC JSON has ASCII-escaped Unicode sequences, converting them to UTF-8 takes some effort/magic.)

comment:6 Changed at 2014-02-08T16:19:39Z by wilmer

I think the problem is that these are code points above U+FFFF, which need more than 16 bits and which, from how I read it briefly, json.c doesn't currently support.

>>> a=u'𝓥𝓲𝓿𝓮  𝓤𝓷𝓲𝓬𝓸𝓭𝓮'
>>> a
u'\U0001d4e5\U0001d4f2\U0001d4ff\U0001d4ee  \U0001d4e4\U0001d4f7\U0001d4f2\U0001d4ec\U0001d4f8\U0001d4ed\U0001d4ee'

json.c expects just two bytes/four nibbles, and it only handles \u (lower case) escape sequences. Ah fun, this is what Twitter sends:

"text":"\ud835\udce5\ud835\udcf2\ud835\udcff\ud835\udcee \ud835\udce4\ud835\udcf7\ud835\udcf2\ud835\udcec\ud835\udcf8\ud835\udced\ud835\udcee bla bla bla"

So those are in fact two \u's per character! We'll need to find a spec on that..

comment:7 Changed at 2014-02-08T16:21:36Z by wilmer

In fact that looks a lot like utf-16:

wilmer@peer:/tmp$ iconv -f utf-8 -t utf-16be bla.utf8 | hd
00000000  d8 35 dc e5 d8 35 dc f2  d8 35 dc ff d8 35 dc ee  |.5...5...5...5..|
00000010  00 20 d8 35 dc e4 d8 35  dc f7 d8 35 dc f2 d8 35  |. .5...5...5...5|
00000020  dc ec d8 35 dc f8 d8 35  dc ed d8 35 dc ee        |...5...5...5..|

That's awful.
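For reference, the standard UTF-16 rule for turning a surrogate pair back into a single code point is small. The following is a minimal Python 3 sketch of that arithmetic; `combine_surrogates` is a name made up for this example and is not taken from json.c.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    # Each half carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# First pair from the Twitter payload above: \ud835\udce5
cp = combine_surrogates(0xD835, 0xDCE5)
print(hex(cp))                   # 0x1d4e5
print(chr(cp).encode("utf-8"))   # b'\xf0\x9d\x93\xa5'
```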

comment:8 Changed at 2014-02-08T16:35:54Z by dx

13:26 < dx> >>> print json.dumps(u'𝓥𝓲𝓿𝓮  𝓤𝓷𝓲𝓬𝓸𝓭𝓮')
13:26 < dx> "\ud835\udce5\ud835\udcf2\ud835\udcff\ud835\udcee  \ud835\udce4\ud835\udcf7\ud835\udcf2\ud835\udcec\ud835\udcf8\ud835\udced\ud835\udcee"
13:26 < wilmer> \o/
13:27 < wilmer> So it's "standard"
13:27 < dx> yup.

comment:9 Changed at 2014-02-08T23:34:40Z by dx

Some random tests with both python (>>> lines) and bitlbee's json.c ((gdb) lines)

>>> print json.dumps(u"𝓥")
"\ud835\udce5"
>>> u"\ud835\udce5"
u'\ud835\udce5'
>>> print u"\ud835\udce5"[0]
�
>>> print u"\ud835\udce5"[1]
�
>>> print u"\ud835\udce5"
𝓥
>>> u"\ud835\udce5".encode("utf-8")
'\xf0\x9d\x93\xa5'
>>> '\xf0\x9d\x93\xa5'.decode('utf-8')
u'\U0001d4e5'
>>> print u"\ud835\udce5".encode("utf-8")
𝓥
>>> u"\ud835\udce5".encode("utf-8")
'\xf0\x9d\x93\xa5'
>>> '\xf0\x9d\x93\xa5'.decode('utf-8')
u'\U0001d4e5'
(gdb) p json_parse("\"\\ud835\"").u.string.ptr
$7 = 0x604510 "\355\240\265"
(gdb) p json_parse("\"\\ud835\\udce5\"").u.string.ptr
$8 = 0x604580 "\355\240\265\355\263\245"
>>> "\355\240\265"
'\xed\xa0\xb5'
>>> u'\ud835'.encode('utf-8')
'\xed\xa0\xb5'
>>> u'\ud835\udce5'.encode('utf-8')
'\xf0\x9d\x93\xa5'
>>> json.loads("\"\\ud835\\udce5\"")
u'\U0001d4e5'
>>> json.loads("\"\\ud835\"")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 381, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Unpaired high surrogate: line 1 column 3 (char 2)

Unicode is fun.
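The behaviour the tests above point to (pair the surrogates, reject unpaired ones) can be sketched in Python 3 as follows. This is only an illustration of the idea, not the code in json.c or the eventual fix; `decode_u_escapes` is a hypothetical helper, and it deliberately ignores every JSON escape other than \uXXXX.

```python
import re

_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_u_escapes(s):
    """Decode \\uXXXX escapes in s, pairing UTF-16 surrogates.

    Simplification: other JSON escapes (such as \\\\ or \\n) are not handled.
    """
    out = []
    pos = 0
    while pos < len(s):
        m = _U_ESCAPE.match(s, pos)
        if m is None:
            out.append(s[pos])  # ordinary character, copy through
            pos += 1
            continue
        unit = int(m.group(1), 16)
        pos = m.end()
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next escape must be a low surrogate.
            m2 = _U_ESCAPE.match(s, pos)
            low = int(m2.group(1), 16) if m2 else None
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            pos = m2.end()
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        out.append(chr(unit))
    return "".join(out)

print(decode_u_escapes(r"\ud835\udce5").encode("utf-8"))  # b'\xf0\x9d\x93\xa5'
```

With this approach the paired input yields the same four UTF-8 bytes that Python produced above, while a lone `\ud835` raises an error instead of emitting the invalid bytes `\xed\xa0\xb5`.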

comment:10 Changed at 2014-02-13T08:53:38Z by wilmer

I've updated json.c to the latest upstream version. It still has this bug, but the update will at least make it easier to fix and to contribute the fix back upstream.

Will maybe give that a shot later today/this week.

comment:11 Changed at 2014-03-01T00:32:31Z by wilmer

That's more like it:

00:26:07 <@root> You: [13] 𝓥𝓲𝓿𝓮 𝓤𝓷𝓲𝓬𝓸𝓭𝓮 nog een keer nope nope nope

Will commit that tomorrow and see whether upstream will take it back.

comment:12 Changed at 2014-03-02T00:40:00Z by wilmer

changeset:devel,1014. Success/failure reports are welcome.

comment:13 Changed at 2014-03-02T10:45:40Z by wilmer

Resolution: fixed
Status: new → closed

I'll mark this as resolved, feel free to reopen if you find some sequences that it still doesn't support properly.
