gpdb: 5X_STABLE: sort order for VARCHAR is incorrect on some datasets
Summary
In Greenplum 5X_STABLE, VARCHAR sort order (e.g. produced by ORDER BY) is incorrect on certain datasets.
The error does not reproduce on master branch.
Description
This issue is closely related to https://github.com/greenplum-db/gpdb/issues/6585. There, a particular dataset causes Mergejoin node to fail with ERROR: Mergejoin: compareResult > 0, bad plan ? (nodeMergejoin.c:1172). Working on that bug, I found out its (probable) cause: varstr_cmp() comparison function.
This function is used by both Sort and MergeJoin nodes; despite this, the sort order for tuples to be merged seems to be incorrect and an assertion in nodeMergejoin.c fails.
The following queries illustrate the problem:
DROP TABLE IF EXISTS kazybek;
DROP TABLE
CREATE TABLE kazybek(t VARCHAR) DISTRIBUTED RANDOMLY;
CREATE TABLE
INSERT INTO kazybek(t) VALUES ('им'), ('Им. Казыбек'), ('им Казыбек би'), ('Им Казыбекби');
INSERT 0 4
SELECT t FROM kazybek ORDER BY t ASC;
t
---------------
им
Им. Казыбек
им Казыбек би
Им Казыбекби
(4 rows)
SELECT 'им Казыбек би' <= 'Им Казыбекби';
?column?
----------
f
(1 row)
Note that correct lexicographical sort order should put strings starting with и before ones starting with И or vice versa, but it must not mix them.
Sample from the dataset
The actual dataset consists of 5000 tuples, each with one of 261 town names and numbers associated with them (not listed). Multiple naming variants are allowed (thus, there may be multiple highly similar strings present in one column). The names are:
им
Им. Казыбек
им Казыбек би
Им Казыбекби
Им. Казыбек-би
Им Казыбек-Би
Им. Казыбек Би
Им. Казыбек-Би
Им.Казыбек-Би
Им. Кзыбек-Би
When I changed the comparison function, the sort order changed, too, but it was possible to find incorrectly sorted pair in the given dataset. The bug with median() was not fixed, too.
Proposed solution
I tried to fix the comparison function using different approaches. These changed the sort order in some cases, but did not fix the error; it was still possible to find a combination of strings (from the same dataset) on which the sort order was incorrect. The three key ideas were:
- Cutting string at the point where two different strings are of equal byte length, but taking into consideration UTF-8 variable length encoding (motivation: strings of equal length are always compared correctly);
- Use byte-by-byte strings comparison;
- Convert all strings to wide character representation (
pg_wchar) and compare them usingmb/pg_wchar.h(also try symbol-by-symbol comparison).
Instance installation information
- GPDB:
5.8.0+dev.614.g7ab79ea build dev - OS:
CentOS Linux release 7.6.1810 (Core),Linux mdw 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux - autoconf:
'--disable-orca' '--prefix=/usr/lib/gpdb' - pg_config:
BINDIR = /usr/lib/gpdb/bin
DOCDIR = /usr/lib/gpdb/doc/postgresql
HTMLDIR = /usr/lib/gpdb/doc/postgresql
INCLUDEDIR = /usr/lib/gpdb/include
PKGINCLUDEDIR = /usr/lib/gpdb/include/postgresql
INCLUDEDIR-SERVER = /usr/lib/gpdb/include/postgresql/server
LIBDIR = /usr/lib/gpdb/lib
PKGLIBDIR = /usr/lib/gpdb/lib/postgresql
LOCALEDIR = /usr/lib/gpdb/share/locale
MANDIR = /usr/lib/gpdb/share/man
SHAREDIR = /usr/lib/gpdb/share/postgresql
SYSCONFDIR = /usr/lib/gpdb/etc/postgresql
PGXS = /usr/lib/gpdb/lib/postgresql/pgxs/src/makefiles/pgxs.mk
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/lib/gpdb/include
CFLAGS = -O3 -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -fno-aggressive-loop-optimizations -Wno-unused-but-set-variable -Wno-address -I/usr/lib/gpdb/include
CFLAGS_SL = -fpic
LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/usr/lib/gpdb/lib',--enable-new-dtags -L/usr/lib/gpdb/lib
LDFLAGS_SL =
LIBS = -lpgport -lrt -lcrypt -ldl -lm -L/usr/lib/gpdb/lib
VERSION = PostgreSQL 8.3.23
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 17 (10 by maintainers)
Commits related to this issue
- Stop using strxfrm() for string sorting. strxfrm() is buggy on some platforms and locales. We found this out the hard way in the PostgreSQL community, when we tried to use strxfrm() for string sortin... — committed to hlinnaka/gpdb by hlinnaka 5 years ago
- Stop using strxfrm() for string sorting. strxfrm() is buggy on some platforms and locales. We found this out the hard way in the PostgreSQL community, when we tried to use strxfrm() for string sortin... — committed to arenadata/gpdb by hlinnaka 5 years ago
- Stop using strxfrm() for string sorting. strxfrm() is buggy on some platforms and locales. We found this out the hard way in the PostgreSQL community, when we tried to use strxfrm() for string sortin... — committed to arenadata/gpdb by hlinnaka 5 years ago
- Stop using strxfrm() for string sorting. strxfrm() is buggy on some platforms and locales. We found this out the hard way in the PostgreSQL community, when we tried to use strxfrm() for string sortin... — committed to arenadata/gpdb by hlinnaka 4 years ago
- Stop using strxfrm() for string sorting. strxfrm() is buggy on some platforms and locales. We found this out the hard way in the PostgreSQL community, when we tried to use strxfrm() for string sortin... — committed to arenadata/gpdb by hlinnaka 5 years ago
Oh, scratch that, that bug affected the gp_enable_motion_mk_sort=off case, but doesn’t explain the mismatch between <= and ORDER BY ASC, when gp_enable_motion_mk_sort=on.