Unicode for CAcert
First we have to research:
- the possibilities for UTF-8 in all the standards we have (OpenPGP, X.509, PKIX, ...?)
- the interoperability of UTF-8 with existing software (how much breaks when we deploy it?)
- How do we have to configure OpenSSL to do it?
- OpenSSL reads user data from a file; if this file is encoded in UTF-8, it is supposed to work on Unix.
- Then we have to examine how PHP and our email systems handle UTF-8 properly (encoding Subject: and other headers is quite fun with UTF-8)
- Then we have to examine how MySQL does it properly
- UTF-8 support for our PDF generator
- Then we have to work out how we can migrate the existing MySQL database contents we have to UTF-8
- Then we have to work on the security aspects of UTF-8: UTF-8 exploits (for example, a stray \x00 inside a multi-byte UTF-8 sequence)
- Then we have to work on the homograph security of UTF-8
- We should implement a security mechanism similar to the one in Konqueror, which prints all UTF-8 characters in bold.
- Then we have to examine the security aspects of Punycode
- If all those things work out well, we can plan the migration
- Then we can do the migration.
- And then we can hope that it worked.
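The homograph and Punycode items above can be illustrated with a small sketch. It uses only the Python standard library; the helper names (scripts_in, looks_mixed, to_ascii) are hypothetical, and taking the first word of the Unicode character name as the "script" is a rough heuristic assumed here for illustration, not a complete confusable check.

```python
import unicodedata


def scripts_in(label):
    """Return a rough set of script names for the letters in `label`.

    Assumption: the first word of the Unicode character name
    ("LATIN", "CYRILLIC", "GREEK", ...) is a good enough script hint.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed(label):
    """True if the letters of `label` come from more than one script."""
    return len(scripts_in(label)) > 1


def to_ascii(domain):
    """Convert a domain name to its ASCII (Punycode, "xn--") form.

    Uses Python's built-in "idna" codec, which implements IDNA 2003.
    """
    return domain.encode("idna").decode("ascii")


if __name__ == "__main__":
    spoof = "p\u0430ypal.com"  # second letter is CYRILLIC SMALL LETTER A
    print(scripts_in(spoof.split(".")[0]))
    print(looks_mixed(spoof.split(".")[0]))
    print(to_ascii(spoof))
```

Displaying the ASCII form next to the Unicode form makes a spoofed name visibly different from the genuine one, which is the same idea as highlighting the unusual characters.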
OpenPGP
OpenPGP is rather good in that area, since the OpenPGP standard defines UTF-8 as the only possible encoding. (A few applications likely don't do that properly yet, but at least the standard is clear.)
X.509
For X.509, I think there is a UTF8String string type which could be used, but I don't know much about how compatible the applications are. I heard that a few standards demand string types other than UTF8String for specific fields, so the standards have to be examined.
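To make the string-type question concrete, here is a minimal sketch of how one name value would be DER-encoded as either PrintableString (tag 0x13) or UTF8String (tag 0x0C). The function names are hypothetical, and the policy "PrintableString if possible, otherwise UTF8String" is only an assumption for illustration; in practice the certificates are produced by OpenSSL and the applicable standards decide which type a field may use.

```python
# Characters permitted in an ASN.1 PrintableString.
PRINTABLE = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789 '()+,-./:=?"
)

TAG_UTF8STRING = 0x0C
TAG_PRINTABLESTRING = 0x13


def der_length(n):
    """Encode a length in DER: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def encode_directory_string(value):
    """Encode `value` as a DER tag-length-value triple.

    Assumed policy: PrintableString when every character fits,
    UTF8String for everything else.
    """
    if all(ch in PRINTABLE for ch in value):
        tag, payload = TAG_PRINTABLESTRING, value.encode("ascii")
    else:
        tag, payload = TAG_UTF8STRING, value.encode("utf-8")
    return bytes([tag]) + der_length(len(payload)) + payload


if __name__ == "__main__":
    print(encode_directory_string("CAcert").hex())
    print(encode_directory_string("Z\u00fcrich").hex())
```

A compatibility test could feed certificates containing both kinds of values to the applications we care about and see which ones display or reject the UTF8String fields.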
PHP
utf8_decode
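PHP's utf8_decode converts a UTF-8 string to ISO-8859-1 and replaces every character that ISO-8859-1 cannot represent with a question mark, so it loses data for most non-Western-European names. The sketch below is a rough Python analogue of that behaviour, not PHP's actual implementation; the name utf8_decode_like is hypothetical.

```python
def utf8_decode_like(raw):
    """Rough analogue of PHP's utf8_decode for a bytes input.

    Decodes UTF-8 leniently, then re-encodes as ISO-8859-1 (latin-1).
    Anything that cannot be represented becomes b"?".
    """
    text = raw.decode("utf-8", errors="replace")
    return text.encode("latin-1", errors="replace")


if __name__ == "__main__":
    print(utf8_decode_like("Z\u00fcrich".encode("utf-8")))   # u-umlaut survives
    print(utf8_decode_like("\u010desky".encode("utf-8")))    # c-caron is lost
```

Any code path that still passes user data through such a conversion would silently damage names once the rest of the system stores UTF-8.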
Unicode exploits
We have to search for Unicode exploits that have affected other software, and verify the Unicode handling routines in the software we are using to see whether they can be exploited. One potential problem is a byte that starts a multi-byte Unicode character followed by a 0x00 byte. Another common problem is non-Unicode bytes inside a string that is supposed to be Unicode, which KDE, for example, likes to crash on.
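As a sketch of the kind of input check this implies, the following Python function decodes input strictly and additionally rejects embedded NUL characters. The names validate_utf8 and is_safe are hypothetical, and rejecting NUL outright is an assumed policy rather than something decided above. Python's strict UTF-8 decoder already refuses a lead byte followed by 0x00, overlong forms such as 0xC0 0x80, and stray non-UTF-8 bytes.

```python
def validate_utf8(raw):
    """Decode `raw` as strict UTF-8 and return the text.

    Raises UnicodeDecodeError for malformed input and ValueError
    for an embedded NUL character (assumed policy).
    """
    text = raw.decode("utf-8", errors="strict")
    if "\x00" in text:
        raise ValueError("embedded NUL character")
    return text


def is_safe(raw):
    """True if `raw` passes validate_utf8.

    UnicodeDecodeError is a subclass of ValueError, so one except
    clause covers both failure modes.
    """
    try:
        validate_utf8(raw)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        "Z\u00fcrich".encode("utf-8"),  # well-formed
        b"\xc3\x00",                    # lead byte followed by 0x00
        b"\xc0\x80",                    # overlong encoding of NUL
        b"abc\xffdef",                  # byte that never occurs in UTF-8
        b"abc\x00def",                  # well-formed but contains NUL
    ]
    for raw in samples:
        print(raw, is_safe(raw))
```

The same set of malformed samples could be replayed against PHP, MySQL and the PDF generator to see how each component reacts.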
TODO:
- perldoc perlunicode
- perldoc charnames
- perldoc utf8
- HTML::Entities
- Encode::Byte
- URI::Escape
Help Needed
If you want to help us with the Unicode Taskforce, please contact us!