gh-ost aborts on UPDATE: error on invalid character
v1.0.21
gh-ost with `--test-on-replica` failed like this:

`FATAL Error 1366: Incorrect string value: '\xE1...' for column 'column_name' at row 1; query=`

`column_name` is a latin1 `VARCHAR`, and the value arriving in the UPDATE statement contained the problematic character `á` (`0xE1` in latin1).
So this may be related to the safe SQL mode settings.
The binlogs showed this:

- First, a row with the relevant `id` is INSERTed into the ghost table. At that moment, the data contains no `\xE1` character.
- Eventually, an incoming UPDATE for the original table introduces the `\xE1` character. gh-ost refuses to handle it and aborts.
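As background (a sketch, not part of the original report): the byte `0xE1` is a perfectly valid latin1 character, but in UTF-8 it is only legal as the lead byte of a three-byte sequence, so a connection that expects UTF-8 rejects it. A minimal Python illustration:

```python
# 0xE1 is 'á' in latin1, but in UTF-8 it is the lead byte of a
# three-byte sequence and is invalid on its own.
raw = b"\xe1"

print(raw.decode("latin-1"))  # á

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)
```

This mirrors MySQL's Error 1366: the server, told to expect UTF-8, sees a byte sequence that cannot be decoded and refuses the statement.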
About this issue
- Original URL
- State: open
- Created 8 years ago
- Comments: 44 (19 by maintainers)
Commits related to this issue
- add failing test for #290, invalid utf8 char this test assumes a latin1-encoded table with content containing bytes in the \x80-\xFF, which are invalid single-byte characters in utf8 and cannot be in... — committed to jbielick/gh-ost by jbielick 3 years ago
- copy and update text using convert when charset changes addresses #290 Note: there is currently no issue backfilling the ghost table when the characterset changes, likely because it's a insert-into-... — committed to jbielick/gh-ost by jbielick 3 years ago
no luck …
To give a little more context, the database is entirely latin1 for historical reasons that no longer apply. The application connects to the DB using a latin1 connection, but receives UTF8 data as user input. The app has no idea the data is in a different character set, and just sees it as a sequence of bytes, so it writes those bytes out to the DB in “latin1”, because MySQL doesn’t check to see if the characters are valid when in latin1. So the HEX() value of this data in latin1 would be identical to if it were written over a UTF8 connection to a UTF8-labeled column.
The main problem is that if you connect over a UTF8 connection, MySQL attempts to convert the “latin1” data. Instead of returning the correct string of bytes, it sends a series of garbage characters that result from running those bytes through the latin1-to-UTF8 conversion.
You can see the severity of the problem by comparing the HEX() values before and after running the data through CONVERT().
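That HEX() comparison can be simulated in Python (a sketch: `CONVERT(col USING utf8)` on mislabeled data is modeled as a latin1 decode followed by a UTF-8 encode):

```python
# The column really holds UTF-8 bytes, but MySQL believes they are latin1.
stored = "á".encode("utf-8")  # C3A1 on disk

# CONVERT() reinterprets each byte as a latin1 character ('Ã', '¡') and
# re-encodes those characters as UTF-8 -- a classic double encoding.
converted = stored.decode("latin-1").encode("utf-8")

print(stored.hex().upper())     # C3A1
print(converted.hex().upper())  # C383C2A1 -- the data is now mojibake
```

A two-byte value has become four bytes of mojibake, and the original string is no longer what a UTF-8 reader sees.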
Worse, if you try to directly run an ALTER changing the charset of the column, MySQL will automatically run that CONVERT() step, permanently corrupting the data. The solution is to force MySQL to forget its charset label by temporarily changing the VARCHAR into a VARBINARY. Once that’s done, the app can start connecting via a UTF8 connection and the data will still be in the correct encoding. Then a second migration changes the VARBINARY back into a VARCHAR, this time with the right label, and we’re good moving forward.
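The two-step migration described above might look like this (a sketch; the table and column names are placeholders, and the VARBINARY/VARCHAR lengths must be chosen to fit the existing data):

```sql
-- Step 1: drop the charset label; VARCHAR -> VARBINARY does not touch bytes.
ALTER TABLE my_table MODIFY my_column VARBINARY(255);

-- (switch the application over to a utf8mb4 connection here)

-- Step 2: reattach the correct label; again, no byte conversion occurs.
ALTER TABLE my_table MODIFY my_column VARCHAR(255) CHARACTER SET utf8mb4;
```

Because conversions to and from binary types copy bytes verbatim, this round trip relabels the column without ever invoking the lossy CONVERT() step.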
PR for this at openark/gh-ost#27
reproduces with 1.0.28, will try to find time to create a simple testcase