tidb: analyze table failed for table with charset latin1

Bug Report


1. Minimal reproduce step (Required)

mysql> create table t (v1 varchar(30)) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin;
Query OK, 0 rows affected (0.09 sec)
$ python2.7
>>> f = open("1.sql", "w")
>>> f.write('INSERT INTO `t` VALUES ("\xe4NKNO\xe6");\n')
>>> f.flush()
$ mysql -h 172.16.4.18 -uroot -P4000 -D t < 1.sql
mysql> select * from t;
+--------+
| v1     |
+--------+
| �NKNO�   |
+--------+
1 row in set (0.00 sec)
mysql> analyze table t;
ERROR 1105 (HY000): other error: encoding failed
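
The steps above depend on Python 2, where a plain string literal is a byte string, so `\xe4` and `\xe6` reach the file as single raw bytes. A rough Python 3 equivalent has to write in binary mode. The following is only a sketch, and the helper name `write_repro_sql` is made up for illustration:

```python
import os
import tempfile

# Raw latin1 bytes for "äNKNOæ"; deliberately NOT valid UTF-8.
PAYLOAD = b"\xe4NKNO\xe6"


def write_repro_sql(path):
    """Write the INSERT statement with the raw bytes embedded (hypothetical helper)."""
    statement = b'INSERT INTO `t` VALUES ("' + PAYLOAD + b'");\n'
    # Binary mode keeps Python 3 from re-encoding the payload as UTF-8.
    with open(path, "wb") as f:
        f.write(statement)
    return statement


if __name__ == "__main__":
    path = os.path.join(tempfile.gettempdir(), "1.sql")
    stmt = write_repro_sql(path)
    print(len(stmt), "bytes written to", path)
```

The resulting file can be loaded with the same `mysql ... < 1.sql` command as in the transcript.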

If the table is created without the charset and collation options, the failure happens earlier, at the insert phase:

mysql> create table t (v1 varchar(30));
Query OK, 0 rows affected (0.09 sec)

[root@172.16.4.92 ontime2]# mysql -h 172.16.4.18 -uroot -P4000 -D t < 1.sql
ERROR 1366 (HY000) at line 1: incorrect utf8 value e44e4b4e4fe6(�NKNO�) for column v1

This issue was originally found by @nullnotnil when running tidb-lightning; the related issue is https://github.com/pingcap/tidb-lightning/issues/351. When I tried to reproduce that issue, I found that it is actually a TiDB issue.

2. What did you expect to see? (Required)

analyze table t should succeed

3. What did you see instead (Required)

ERROR 1105 (HY000): other error: encoding failed

4. Affected version (Required)

$ ./tikv-server -V
TiKV 
Release Version:   4.1.0-alpha
Edition:           Community
Git Commit Hash:   8b1fc4fc67f6d74a46a86d731eb5c152cbf0dfa8
Git Commit Branch: master
UTC Build Time:    2020-07-14 01:06:28
Rust Version:      rustc 1.46.0-nightly (16957bd4d 2020-06-30)
Enable Features:   jemalloc portable sse protobuf-codec
Profile:           dist_release

mysql> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v4.0.0-beta.2-771-gca41972fb
Edition: Community
Git Commit Hash: ca41972fbac068c8a5de107d9075f09ac68842ac
Git Branch: master
UTC Build Time: 2020-07-14 02:41:21
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)

5. Root Cause Analysis

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (16 by maintainers)

Most upvoted comments

I added some tracing to the Analyze code and found that TiDB pushes down the wrong collation for the string column. It should be a latin1 collation, but TiKV receives Utf8Mb4BinNoPadding:

[src/coprocessor/statistics/analyze.rs:302] columns_slice[i].encode(*logical_row, &columns_info[i],
                        &mut EvalContext::default(), &mut val) = Ok(
    (),
)
[src/coprocessor/statistics/analyze.rs:310] columns_info[i].as_accessor().collation() = Ok(
    Utf8Mb4BinNoPadding,
)
[src/coprocessor/statistics/analyze.rs:313] table::decode_col_value(&mut mut_val, &mut EvalContext::default(),
                        &columns_info[i]) = Ok(
    Bytes("\344NKNO\346"),
)
[src/coprocessor/statistics/analyze.rs:319] CollatorUtf8Mb4BinNoPadding::sort_key(&decoded_val.as_string()?.unwrap().into_owned()) = Err(
    Encoding(
        Utf8Error {
            valid_up_to: 0,
            error_len: Some(
                1,
            ),
        },
    ),
)
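
The last trace entry is what a strict UTF-8 decoder reports for these bytes. The sketch below is not TiKV code: `utf8_bin_sort_key` and `latin1_bin_sort_key` are hypothetical stand-ins that only illustrate why a UTF-8 collator rejects the value while a byte-wise latin1 collator would accept it.

```python
VALUE = b"\xe4NKNO\xe6"  # the bytes the trace prints as "\344NKNO\346"


def utf8_bin_sort_key(raw):
    """Hypothetical stand-in for a UTF-8 binary collator: input must be valid UTF-8."""
    raw.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    return raw


def latin1_bin_sort_key(raw):
    """Hypothetical stand-in for a latin1 binary collator: every byte is a character."""
    return raw


def try_utf8(raw):
    """Return (valid_up_to, error_len) for invalid input, or None if it is valid UTF-8."""
    try:
        utf8_bin_sort_key(raw)
    except UnicodeDecodeError as e:
        return e.start, e.end - e.start
    return None


if __name__ == "__main__":
    print(try_utf8(VALUE))             # position and length of the first bad sequence
    print(latin1_bin_sort_key(VALUE))  # the byte-wise key is always available
```

The first byte, 0xE4, announces a three-byte UTF-8 sequence, but the next byte is plain ASCII `N`, so decoding fails at offset 0. This is consistent with `valid_up_to: 0` and `error_len: Some(1)` in the trace.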

@wjhuang2016 PTAL. Use python2 to generate the SQL file; the issue reproduces on the master branch.

Can we add integration tests to verify the fix (and prevent future mistakes)?

I'm not sure it is necessary. In fact, we shouldn’t allow writing non-UTF8 bytes to TiKV.

This seems to be an issue on the TiKV side.

I deployed a v4.0.2 cluster with ansible, and the issue cannot be reproduced there. The TiKV version is:

Release Version:   4.0.2
Edition:           Community
Git Commit Hash:   98ee08c587ab47d9573628aba6da741433d8855c
Git Commit Branch: heads/refs/tags/v4.0.2
UTC Build Time:    2020-07-01 09:34:18
Rust Version:      rustc 1.42.0-nightly (0de96d37f 2019-12-19)
Enable Features:   jemalloc portable sse protobuf-codec
Profile:           dist_release

After I manually updated TiKV to the latest commit on master, the issue appears. TiKV version:

Release Version:   4.1.0-alpha
Edition:           Community
Git Commit Hash:   d1c0be1e7bae51735e6de4683a156374dfb917ee
Git Commit Branch: master
UTC Build Time:    2020-07-20 04:42:06
Rust Version:      rustc 1.46.0-nightly (16957bd4d 2020-06-30)
Enable Features:   jemalloc portable sse protobuf-codec
Profile:           dist_release

BTW, running INSERT INTO t VALUES (UNHEX('C3A44E4B4E4FC3A6')); directly does not reproduce it either. So far, the only way to reproduce this issue is to generate a SQL file and load it with mysql < xxx.sql.
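
One likely reason the `UNHEX` statement does not reproduce the problem is that `C3A44E4B4E4FC3A6` is the UTF-8 encoding of the string, not the latin1 bytes that the Python 2 script writes. The check below is a small sketch that only compares the two byte sequences. It makes no claim about how TiDB or TiKV handles either of them, and `is_valid_utf8` is an illustrative helper.

```python
UNHEX_ARG = "C3A44E4B4E4FC3A6"  # argument used in the UNHEX attempt
FILE_BYTES = b"\xe4NKNO\xe6"    # bytes written by the Python 2 script


def is_valid_utf8(raw):
    """Return True if raw decodes as strict UTF-8 (illustrative helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


unhex_bytes = bytes.fromhex(UNHEX_ARG)
text = unhex_bytes.decode("utf-8")  # the UNHEX payload is well-formed UTF-8

if __name__ == "__main__":
    print(is_valid_utf8(unhex_bytes), is_valid_utf8(FILE_BYTES))
    # Re-encoding the same text as latin1 gives the bytes from the SQL file.
    print(text.encode("latin-1").hex().upper())
```

Under that reading, the `UNHEX` argument that matches the SQL file would be `E44E4B4E4FE6`. Whether that variant reproduces the error was not tested in this thread.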

I can reproduce it on a fresh server with a shell script, but there seems to be some flakiness to it. If I mysqldump and restore the table and then analyze it, the error does not reproduce, and this script then also stops reproducing it.

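#!/bin/bash

# Recreate the latin1 table from scratch.
mysql test -e "DROP TABLE IF EXISTS t1;"
mysql test -e "CREATE TABLE t1 (c varchar(30)) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin;"

# Insert one row per byte value 0..255, including bytes that are not valid UTF-8.
for i in $(seq 0 255); do mysql test -e "INSERT INTO t1 VALUES (unhex(hex($i)))"; done

# ANALYZE is the statement that fails with "encoding failed".
mysql test -e "ANALYZE TABLE t1"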