efcore: Incorrect string matching behavior on SQL Server with whitespace patterns

I’m getting wrong results from database when trying to filter string (Contains, StartsWith, EndsWith, etc) with spaces.

Steps to reproduce

I’m build query like:

var test = "  ";
var venues = await context.Venues.Where(x => x.Name.Contains(test)).ToListAsync(cancellationToken);

Which gets translated to:

SELECT [v].[Id], [v].[Name]
FROM [Venues] AS [v]
WHERE (@__test_0 = N'') OR CHARINDEX(@__test_0, [v].[Name]) > 0

This query returns all rows from database when @__test_0 contains spaces.

Further technical details

EF Core version: 3.1.0 Database provider: Microsoft.EntityFrameworkCore.SqlServer Target framework: .NET Core 3.1

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 24 (20 by maintainers)

Commits related to this issue

Most upvoted comments

@zachrybaker my message above may have been a bit inaccurate - in some cases, the translations of Contains, StartsWith and EndsWith will work correctly with regards to trailing whitespace (i.e. when patterns are constant we will generate LIKE, which doesn’t ignore trailing whitespace). Once we implement client-side parameter transformations (which we do intend to do), the same technique can be used for parameter patterns as well.

Re other workaround with len… First, your stack overflow link seems broken. Second, that elaborate workaround still returns a false positive when the value is shorter than the pattern:

SELECT CASE WHEN LEFT('h', LEN(CAST('h ' AS nvarchar(MAX)) + 'x') - 1) = 'h ' THEN 1 ELSE 0 END; -- 1

It’s possible to add a condition for that too, but at some point that kind of complexity does not seem appropriate for translating a very common .NET method like StartsWith (and properly has some perf overhead, with all the casting, concatenation, etc. etc). In any case, as you say, it’s always possible for people to express this themselves with EF.Functions.Like.

The larger picture here is that it’s impossible to handle this well for much more basic cases without a significant perf impact - I’m thinking about simple equality. When users compare two VARCHAR columns, trailing SQL will be truncated. Any attempt to compensate for that with elaborate logic such as the above would prevent index usage - which is the last thing you want to do for the default text equality operator. Since it’s not possible to fix this for basic equality, I don’t think it makes sense to go so far for StartsWith either.

Since a = b gives incorrect results when either of them have trailing whitespace, we should do it for any case we can improve. That means, if column comparison is also hitting bug then we convert that too.

Equality can be generated outside of SqlTranslator.VisitBinary too. So SqlTranslator wouldn’t work. Option would be either SqlExpressionFactory.Equal generates different for string or post-processor would take care of rewriting. PostProcessor would be slightly more preferable as it avoids .Equal not returning SqlBinary.

Summary

On SQL Server, LEN(col) = 0 suffers from the same issue as col = '': it identifies spaces-only strings a having length 0 (incidentally it also does not utilize indexes).

However, col LIKE '' may be a good solution: it only identifies truly empty strings and uses an index. At least on PostgreSQL (where this whole trailing spaces madness doesn’t happen), LIKE also uses an index.

We should ideally verify also on Sqlite and MySQL, though I’d have to spend more time to understand how to analyze queries there. We can postpone dealing with this to see whether we end up doing #17598 for 5.0 (which I’d really like to).


SQL Server

Behavior around trailing spaces

CREATE TABLE text_tests (id INT IDENTITY(1,1) PRIMARY KEY, name NVARCHAR(10));
INSERT INTO text_tests (name) VALUES ('');
INSERT INTO text_tests (name) VALUES ('  ');
SELECT COUNT(*) FROM text_tests WHERE name=''; -- Returns both
SELECT LEN(name) FROM text_tests; -- Returns 0 for both
SELECT COUNT(*) FROM text_tests WHERE name LIKE '';  -- Returns only the 1st
SELECT '|' + name + '|' FROM text_tests;  -- Returns || for the first, |  | for the second

Index usage

-- Create schema and insert data
CREATE TABLE data (id INT IDENTITY(1,1) PRIMARY KEY, name NVARCHAR(10));
CREATE INDEX IX_name ON data(name);

BEGIN TRANSACTION;

DECLARE @i INT = 0;
WHILE @i < 50000
BEGIN
    INSERT INTO data (name) VALUES (CAST(@i AS NVARCHAR(MAX)));
    SET @i = @i + 1;
END;

INSERT INTO data (name) VALUES ('');
INSERT INTO data (name) VALUES ('  ');

DECLARE @j INT = @i;
WHILE @j < 100000
BEGIN
    INSERT INTO data (name) VALUES (CAST(@j AS NVARCHAR(MAX)));
    SET @j = @j + 1;
END;

COMMIT;

-- Queries
SET SHOWPLAN_ALL ON;
SELECT id FROM data WHERE name=''; -- Does an index seek with IX_name, TotalSubtreeCost=0.0032842
SELECT id FROM data WHERE LEN(name) = 0; -- Scans the entire table, TotalSubtreeCost=0.48661754
SELECT id FROM data WHERE name LIKE ''; -- Does an index scan with IX_name, TotalSubtreeCost=0.0032842
SET SHOWPLAN_ALL OFF;

PostgreSQL

Behavior around trailing spaces

DROP TABLE IF EXISTS text_types;
CREATE TABLE text_types (
    id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    text TEXT,
    varchar VARCHAR(10),
    char VARCHAR(10));
INSERT INTO text_types (text, varchar, char) VALUES ('', '', '');
INSERT INTO text_types (text, varchar, char) VALUES ('  ', '  ', '  ');

SELECT * FROM text_types WHERE text = ''; -- Returns 1st row only
SELECT * FROM text_types WHERE varchar = ''; -- Returns 1st row only
SELECT * FROM text_types WHERE char = ''; -- Returns 1st row only

SELECT LENGTH(text), LENGTH(varchar), LENGTH(char) FROM text_types; -- 1st row all zeros, 2nd row all 2s 
SELECT '|' || text || '|' || varchar || '|' || char || '|' FROM text_types; -- 1st row no spaces, 2nd row 2 spaces everywhere

Index usage

-- Create schema and insert data
DROP TABLE IF EXISTS data;
CREATE TABLE data (id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY , name TEXT);
CREATE INDEX IX_name ON data(name);

DO $$BEGIN
  FOR i IN 1..50000 LOOP
    INSERT INTO data (name) VALUES (i::TEXT);     
  END LOOP;
  INSERT INTO data (name) VALUES ('');
  INSERT INTO data (name) VALUES ('  ');
  FOR i IN 50001..100000 LOOP
    INSERT INTO data (name) VALUES (i::TEXT);
  END LOOP;
END$$;

-- Queries - same index usage results as SQL Server
EXPLAIN SELECT * FROM data WHERE name='';
EXPLAIN SELECT * FROM data WHERE LENGTH(name) = 0;
EXPLAIN SELECT * FROM data WHERE name LIKE '';

Sqlite

Behavior around trailing spaces

CREATE TABLE text_tests (id INT, name TEXT);
INSERT INTO text_tests (id, name) VALUES (1, '');
INSERT INTO text_tests (id, name) VALUES (2, '  ');
SELECT id FROM text_tests WHERE name=''; -- Returns only id=1
SELECT LENGTH(name) FROM text_tests;  -- Returns 0 and 2
SELECT id FROM text_tests WHERE name LIKE ''; -- Returns only id=1
SELECT '|' || name || '|' FROM text_tests;  -- Returns || for the first, |  | for the second

Index usage

CREATE TABLE data (id INTEGER PRIMARY KEY, name TEXT);
CREATE INDEX IX_name ON data(name);
-- Insert 1..50000
INSERT INTO data
SELECT x, x FROM (
  WITH RECURSIVE
      cnt(x) AS (
          SELECT 1
          UNION ALL
          SELECT x+1 FROM cnt
          LIMIT 50000
      )
  SELECT x FROM cnt
);
INSERT INTO data (id, name) VALUES (200000, '');
INSERT INTO data (id, name) VALUES (200001, '  ');
-- Insert 50001..100000
INSERT INTO data
SELECT x, x FROM (
                     WITH RECURSIVE
                         cnt(x) AS (
                             SELECT 50001
                             UNION ALL
                             SELECT x+1 FROM cnt
                             LIMIT 50000
                         )
                     SELECT x FROM cnt
                 );

SELECT COUNT(*) FROM data;    
    
EXPLAIN QUERY PLAN SELECT * FROM data WHERE name=''; -- SEARCH TABLE data USING COVERING INDEX IX_name (name=?)
EXPLAIN QUERY PLAN SELECT * FROM data WHERE LENGTH(name) = 0; -- SCAN TABLE data
EXPLAIN QUERY PLAN SELECT * FROM data WHERE name LIKE ''; -- SCAN TABLE data

Thanks for explaining, that makes sense. Right, I guess only parameter expressions can feasibly be escaped.

The current LEN/LEFT based translation is not SARGable. This means that an index cannot be used to satisfy a StartsWith predicate. I am currently facing this as a performance issue in a LINQ to SQL migration. L2S translated StartsWith to LIKE with escaping for wildcards. This has been proposed above as well.

SQL Server is capable of seeking on an index with a parameterized LIKE pattern. The pattern does not need to be constant.

This is implemented by SQL Server inserting an internal function into the query plan. This function takes the (dynamic) LIKE pattern and computes at runtime upper and lower seek bounds. This turns out to be efficient for commonly used patterns.

I’d be curious if LIKE or CHARINDEX are about equal in performance for Contains searches. If yes, then LIKE could be used as a common workhorse for all string searching operations.

I believe that LIKE has better behavior with respect to whitespace as well.

For a simple equality comparison (entity.Column == parameter) this should be translated to a normal = on the SQL side.

Design decisions:

  • Switch Contains to translate to LIKE instead of CHARINDEX on constants, like we do with StartsWith/EndsWith (with client-side escaping). This will fix both the empty string behavior and allow it to be used with XML columns.
  • For parameter patterns with Contains/StartsWith/EndsWith, at some point we’ll implement client-side parameter transformations (#17598), which would allow us to switch to LIKE for those as well (with client-side escaping). However, as this is a non-trivial query pipeline improvement, it will not be done immediately.
  • However, we’ll fix Contains empty string compensation checks to use LIKE (relevant for both parameter and column patterns).
  • We will not attempt to compensate for the general SQL Server whitespace truncation logic. In other words, if the user compares two strings, trailing whitespace differences will be disregarded by SQL Server. We don’t think there’s anything reasonable we can do about that without compromising performance significantly.

Set up a quick EF6 test project for Contains (with .NET Core on Linux!), results are below.

tl;dr EF6 seems to be doing the right thing, always using LIKE for Contains, escaping wildcards for both constants and parameters. For columns it falls back to CHARINDEX, but does not compensate for empty string (so it has a false negative bug).

For EF Core, here’s what I think we should do:

  • Switch to LIKE for constant patterns, escaping client-side (like we already do for Sqlite/Npgsql StartsWith/EndsWith)
  • Keep CHARINDEX for parameter/column patterns, but change empty-string compensation to use LIKE
  • Consider implementing client-side parameter transformation like EF6, at which point parameter patterns can also switch to LIKE

Constant pattern:

_ = ctx.Blogs.Where(b => b.Name.Contains("foo")).ToList(); //  WHERE [Extent1].[Name] LIKE N'%foo%'
_ = ctx.Blogs.Where(b => b.Name.Contains("fo%o")).ToList(); // WHERE [Extent1].[Name] LIKE N'%fo~%o%' ESCAPE N'~'

Parameter pattern:

var pattern1 = "foo";
_ = ctx.Blogs.Where(b => b.Name.Contains(pattern1)).ToList();

Translates to:

WHERE [Extent1].[Name] LIKE @p__linq__0 ESCAPE N'~'
-- p__linq__0: '%foo%' (Type = String, Size = 4000)

Parameterized pattern with wildcards:

var pattern = "fo%o";
_ = ctx.Blogs.Where(b => b.Name.Contains(pattern)).ToList();

Translates to:

WHERE [Extent1].[Name] LIKE @p__linq__0 ESCAPE N'~'
-- p__linq__0: '%fo~%o%' (Type = String, Size = 4000)

Column pattern:

_ = ctx.Blogs.Where(b => "foo".Contains(b.Name)).ToList();

Translates to:

WHERE ( CAST(CHARINDEX([Extent1].[Name], N'foo') AS int)) > 0

Post-processor indeed sounds better to me (but should be before relational command cache etc.). Will prepare this.

Note for team: I was able to reproduce this with 3.1. More minimal repro and output:

public class Venue
{
    public Guid Id { get; set; }
    public string Name { get; set; }
}

public class BloggingContext : DbContext
{
    private readonly ILoggerFactory Logger 
        = LoggerFactory.Create(c => c.AddConsole());//.SetMinimumLevel(LogLevel.Debug));

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        optionsBuilder
            .UseLoggerFactory(Logger)
            .EnableSensitiveDataLogging()
            .UseSqlServer(@"Server=(localdb)\mssqllocaldb;Database=Test;ConnectRetryCount=0");
    }
    
    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder.Entity<Venue>();    
    }
}

public class Program
{
    public static async Task Main()
    {
        using (var context = new BloggingContext())
        {
            context.Database.EnsureDeleted();
            context.Database.EnsureCreated();
            
            context.Add(new Venue { Name = "Test" });
            
            await context.SaveChangesAsync();
        }

        using (var context = new BloggingContext())
        {
            var test = "  ";
            var venues = await context.Set<Venue>().Where(x => x.Name.Contains(test)).ToListAsync();
            Console.WriteLine($"Matched {venues.Count}");
        }
    }
}
info: Microsoft.EntityFrameworkCore.Database.Command[20101]
      Executed DbCommand (4ms) [Parameters=[@__test_0='  ' (Size = 4000)], CommandType='Text', CommandTimeout='30']
      SELECT [v].[Id], [v].[Name]
      FROM [Venue] AS [v]
      WHERE (@__test_0 = N'') OR (CHARINDEX(@__test_0, [v].[Name]) > 0)
Matched 1