gh-ost aborts on trying to UPDATE, errors on invalid character

v1.0.21

gh-ost with --test-on-replica failed like this:

FATAL Error 1366: Incorrect string value: '\xE1...' for column 'column_name' at row 1; query=

column_name is a latin1 VARCHAR, and the value arriving in the UPDATE statement contained this problematic character:

á

So this may be related to the safe SQL mode settings.
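
For context, error 1366 is the server refusing bytes that are not valid for the target character set. Under a permissive sql_mode it is demoted to a warning (the offending character is mangled, e.g. to ?), but under strict settings it becomes a hard error, which is presumably the "safe" mode in play here. A minimal sketch of the same class of failure, using a hypothetical throwaway table (the issue's actual failure is the reverse direction: raw latin1 bytes that are not valid utf8mb4 on the wire):

mysql> SET SESSION sql_mode = 'STRICT_ALL_TABLES';
mysql> SET NAMES utf8mb4;
mysql> CREATE TABLE t (c VARCHAR(32)) DEFAULT CHARSET=latin1;
mysql> INSERT INTO t VALUES ('ส'); -- Thai character, no latin1 representation
ERROR 1366 (HY000): Incorrect string value: '\xE0\xB8\xAA' for column 'c' at row 1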

The binlogs showed this:

  • first, a row with the relevant id is INSERTed into the ghost table. At that point, the data contains no \xE1 character.
  • eventually, an UPDATE for the original table arrives that introduces the \xE1 character; gh-ost refuses to handle it and aborts (the row events can be decoded as sketched below)
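
To see this in the binlog yourself, the row events can be decoded to pseudo-SQL with mysqlbinlog; a sketch, with the binlog file name, ids, and values all hypothetical:

$ mysqlbinlog --verbose --base64-output=DECODE-ROWS mysql-bin.000123
...
### UPDATE `mydb`.`mytable`
### WHERE
###   @1=42
###   @2='...'
### SET
###   @1=42
###   @2='\xe1...'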

About this issue

  • State: open
  • Created 8 years ago
  • Comments: 44 (19 by maintainers)

Most upvoted comments

no luck …

To give a little more context, the database is entirely latin1 for historical reasons that no longer apply. The application connects to the DB over a latin1 connection but receives UTF8 data as user input. The app has no idea the data is in a different character set and just sees it as a sequence of bytes, so it writes those bytes out to the DB as “latin1”, because MySQL doesn’t check whether the characters are valid when the charset is latin1. So the HEX() value of this data in latin1 is identical to what it would be if it had been written over a UTF8 connection to a UTF8-labeled column.

mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select mycol from mytable where id=1\G
*************************** 1. row ***************************
mycol: สำหรับ "บางคน" เกิดมาเพื่อให้เรารัก
1 row in set (0.00 sec)

The main problem is that if you connect over a UTF8 connection, MySQL attempts to convert the “latin1” data. Instead of returning the correct string of bytes, it sends a series of garbage characters that are the result of running the latin1 bytes through the latin1-to-UTF8 conversion.

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> select mycol from mytable where id=1\G
*************************** 1. row ***************************
mycol: สำหรับ "บางคน" เà¸à¸´à¸”มาเพื่อให้เรารัà¸
1 row in set (0.00 sec)

You can see the severity of the problem by looking at the HEX() values before and after running it through CONVERT(): each original byte has itself been re-encoded as a two- or three-byte UTF8 sequence (E0 becomes C3A0, B8 becomes C2B8, and so on), i.e. the data gets double-encoded:

mysql> select hex(mycol), hex(convert(mycol using utf8)) from mytable where id=1\G
*************************** 1. row ***************************
                    hex(mycol): E0B8AAE0B8B3E0B8ABE0B8A3E0B8B1E0B89A2022E0B89AE0B8B2E0B887E0B884E0B8992220E0B980E0B881E0B8B4E0B894E0B8A1E0B8B2E0B980E0B89EE0B8B7E0B988E0B8ADE0B983E0B8ABE0B989E0B980E0B8A3E0B8B2E0B8A3E0B8B1E0B881
hex(convert(mycol using utf8)): C3A0C2B8C2AAC3A0C2B8C2B3C3A0C2B8C2ABC3A0C2B8C2A3C3A0C2B8C2B1C3A0C2B8C5A12022C3A0C2B8C5A1C3A0C2B8C2B2C3A0C2B8E280A1C3A0C2B8E2809EC3A0C2B8E284A22220C3A0C2B9E282ACC3A0C2B8C281C3A0C2B8C2B4C3A0C2B8E2809DC3A0C2B8C2A1C3A0C2B8C2B2C3A0C2B9E282ACC3A0C2B8C5BEC3A0C2B8C2B7C3A0C2B9CB86C3A0C2B8C2ADC3A0C2B9C692C3A0C2B8C2ABC3A0C2B9E280B0C3A0C2B9E282ACC3A0C2B8C2A3C3A0C2B8C2B2C3A0C2B8C2A3C3A0C2B8C2B1C3A0C2B8C281
1 row in set (0.00 sec)

Worse, if you directly run an ALTER changing the charset of the column, MySQL automatically runs that CONVERT() step, permanently corrupting the data. The solution is to force MySQL to forget its charset label by temporarily changing the VARCHAR into a VARBINARY. Once that’s done, the app can start connecting via a UTF8 connection and the data will still be in the correct encoding. Then we can do a second migration to change the VARBINARY back into a VARCHAR, this time with the right label, and we’re good moving forward. (A sketch of the two ALTERs follows.)
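
A sketch of that two-step dance, reusing the hypothetical mytable/mycol names from the session above (the lengths are made up; match them to the real schema):

-- step 1: drop the charset label; the ALTER copies the bytes verbatim,
-- since conversion to or from binary never runs CONVERT() on the contents
ALTER TABLE mytable MODIFY mycol VARBINARY(255);

-- the application can now switch to a UTF8 connection safely

-- step 2: re-label the same bytes with the charset they actually contain
ALTER TABLE mytable MODIFY mycol VARCHAR(255) CHARACTER SET utf8mb4;

As a read-only check beforehand, SELECT CONVERT(mycol USING binary) (or simply HEX(mycol)) returns the raw stored bytes regardless of the connection charset, so you can verify the data really is valid UTF8 before re-labeling it.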

PR for this at openark/gh-ost#27

  • we do SET NAMES utf8mb4 upon connection
  • the table structures are identical; the tables have DEFAULT CHARSET=latin1, and the column is defined as simply ‘text’ (shape sketched below)
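
For reference, the table shape being described looks roughly like this (a minimal sketch with hypothetical names, not the actual schema):

mysql> SHOW CREATE TABLE mytable\G
*************************** 1. row ***************************
       Table: mytable
Create Table: CREATE TABLE `mytable` (
  `id` int(11) NOT NULL,
  `mycol` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)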

Reproduces with 1.0.28; will try to find time to create a simple test case.