[MTOS-dev] UTF8 Encoding Issues

Reed A. Cartwright reed at scit.us
Mon Dec 17 22:24:34 PST 2007


I would like to propose that, since UTF-8 has clearly established itself 
as the web standard, MTOS focus on becoming a clean UTF-8 system, 
where all internal data is stored and manipulated in UTF-8.  If 
necessary, there could be proper input and output filters to do encoding 
translation.  Right now, MT seems to be rather hackish in handling its 
text encoding, often making assumptions about whether strings have or 
haven't already been decoded.

I bring this up because I had a lot of issues using UTF-8 strings and 
PostgreSQL with MT4 a while back.  (There are still issues, but I solved 
the major ones.)  Basically, MT did not understand that UTF-8 strings 
were UTF-8.  This included text retrieved from the database as well as 
form-submitted text.  I ended up solving the errors by doing two 
things.

#1 Updated my MT-Dispatcher to decode all CGI params except 'file'. 
This resulted in all form data being properly recognized as UTF-8.

#2 Added '$dbh->{pg_enable_utf8} = 1' in 
MT::ObjectDriver::Driver::DBD::Pg.  This resulted in all database data 
being properly recognized as UTF-8.
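
To illustrate step #1, here is a minimal sketch of decoding form values 
while leaving the 'file' upload alone.  It is not the code from the patch: 
decode_params is a hypothetical helper, the parameters are passed as a 
plain hash rather than read from MT's dispatcher, and the sample values 
are made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of MT): turn UTF-8 octets into Perl
# character strings for every form field except the 'file' upload,
# which is binary data and must stay untouched.
sub decode_params {
    my (%raw) = @_;
    my %decoded;
    for my $name (keys %raw) {
        $decoded{$name} = $name eq 'file'
            ? $raw{$name}
            : decode('UTF-8', $raw{$name});
    }
    return %decoded;
}

my %params = decode_params(
    title => "caf\xC3\xA9",     # UTF-8 octets, 5 bytes
    file  => "\xFF\xD8\xFF",    # made-up binary upload data
);
```

After this, $params{title} is a four-character string, while 
$params{file} is still the original three bytes.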

After I did this, several bugs emerged where MT was trying to decode 
data that was already decoded.  Also, I found out that MT was writing the 
static pages in 'text' encoding and not 'utf8' encoding, which was 
causing more issues with serving the pages in the proper encoding.
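
Both problems can be shown in a small sketch.  This is an illustration 
under assumptions, not what MT or the patch does: decode_once is a 
hypothetical helper, and it uses utf8::is_utf8 as a rough heuristic for 
"already decoded" (a pure-ASCII string may not be flagged, but decoding 
it again is harmless).  The output file is a temporary file opened with 
an explicit UTF-8 layer.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Hypothetical helper: decode only if the string still looks like raw
# octets.  utf8::is_utf8 is a heuristic, not a guarantee.
sub decode_once {
    my ($text) = @_;
    return utf8::is_utf8($text) ? $text : decode('UTF-8', $text);
}

my $octets = "caf\xC3\xA9";          # UTF-8 octets
my $chars  = decode_once($octets);   # decoded to characters
my $again  = decode_once($chars);    # second call leaves it alone

# Write the character string through an explicit UTF-8 output layer.
my ($tmp_fh, $path) = tempfile();
close $tmp_fh;
open my $out, '>:encoding(UTF-8)', $path or die "open: $!";
print {$out} $chars;
close $out;

# Read the file back as raw bytes to see what was actually written.
open my $in, '<:raw', $path or die "open: $!";
my $bytes = do { local $/; <$in> };
close $in;
unlink $path;
```

Here $again equals $chars, and $bytes matches the original UTF-8 
octets, so the page on disk is valid UTF-8.  A cleaner design would 
decode once at the input boundary and encode once at the output 
boundary, so no guard like decode_once is needed in between.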

The patch against 4.01 that I ended up with is at [1].  It fixes what I 
wanted fixed, but not all the bugs that I've seen.  To fix all of them 
would take a change in philosophy, I think.  (And yes, I did file a bug 
report a couple months ago, but these issues seem to be bigger than a 
simple bug.)

[1] http://scit.us/~reed/utf8.patch

-- 
Reed A. Cartwright

