RepoDB: Bug: Identity Values not set correctly with Bulk Insert from Sql Server due to Incorrect Sorting of data.

Bug Description

There is a critical issue with the existing implementation of Bulk Insert for Sql Server whereby the Identity values are not correctly set on the models passed into RepoDb. In the existing implementation the models are passed in and RepoDb mutates the models to set their Identity values that are generated by Sql Server and returned via the Output clause of the Merge queries. This has been specifically observed in SqlAzure.

However, the processing order of SqlBulkCopy API and/or the Output clause is not guaranteed and the order of items returned may be different than the order of the original entities specified, which results in mismatched Identity column values being set on the entities specified.

And, mutating the list to result in a different sort order than that which is specified would be really bad for consuming client code and pose many other problems. So the intuitive process and result would be for the original entity enumerable set to remain in the exact order specified (as it does now), but with 100% guarantee that the Idenity values will be correctly set on all entities.

This can be due to multiple possible issues such as:

  • SqlBulkCopy.WriteToServer() method may affect the order where the order of data being written is not guaranteed, and there is still no support for the ORDER() hint. In my testing this has resulted in the data being written to the temp Sql Server table in inverted order.
    • This appears to often invert the order, but ultimately cannot be relied upon, and data needs to be sortable by a known factor that matches the entity ordinal sort order as specified.
  • Sql Merge query Output clause may not write the results in the exact same order of the source data due to Sql Server Merge query optimizations and indexes on the source/target tables.
    • This can be quite complex to determine if it will impact due to many underlying reasons, but ultimately cannot be relied upon without an explicit Order By clause.

This is most easily observed when there are various Indexes set on the target tables for which Bulk Insert is writing data into.

Note: Even though a clustered is being added to the temp table after data is inserted, but before the merge query is executed, this clustered index provides a sorted index based on qualifiers, and may not match the order enumerable list of entities specified to the original BulkMerge() method.

Recommended Solution: Regardless of the potential cause of different ordering, the solution to guarantee the order of data from Sql Server is to always specify and Order By clause of the results being retrieved. An example of retrieving the output data in the correct order so that is always 100% of the time matches the original enumerable entity list order is here in the SqlBulkHelpers library: https://github.com/cajuncoding/SqlBulkHelpers/blob/7770338ad94aa933b57637662be5c2949b898c5a/NetStandard.SqlBulkHelpers/SqlBulkHelpers/QueryProcessing/SqlBulkHelpersMergeQueryBuilder.cs#L91

The easiest way to guarantee that Identity values are ordered correctly is to provide a unique key in the Temp Table that is the ordinal position (INT) of each enumerable entity added into the temp table record at execution time. So each item in the enumerable list would be added into the the temp table with an additional column that denotes it’s ordinal position from the enumerable list.

This is done here in a library that’s dedicated to Sql Server bulk inserting for identities – and adding the ordinal position as the data is converted into a DataTable (slightly less efficient, but provides expected behavior) – also sample from the SqlBulkHelpers library:

https://github.com/cajuncoding/SqlBulkHelpers/blob/7770338ad94aa933b57637662be5c2949b898c5a/NetStandard.SqlBulkHelpers/SqlBulkHelpers/QueryProcessing/SqlBulkHelpersObjectMapper.cs#L67

Schema and Model:

Please share to us the schema of the table (not actual) that could help us replicate the issue if necessary.

/****** Object:  Table [dbo].[SqlBulkHelpersTestElements]    Script Date: 1/3/2021 2:28:17 PM ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[SqlBulkTestEntities](
	[Id] [int] IDENTITY(1,1) NOT NULL,
	[Key] [nvarchar](max) NULL,
	[Value] [nvarchar](max) NULL,
        CONSTRAINT [PK_SqlBulkTestEntities] PRIMARY KEY CLUSTERED ( [Id] ASC ) 
                WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX IDX_Id_Includes_Value
	ON [dbo].[SqlBulkTestEntities] ([Id])
	INCLUDE (Value);
GO

And also the model that corresponds the schema.

    [Map("SqlBulkTestEntities")]
    public class SqlBulkTestEntity
    {
        public int Id { get; set; }
        public string Key { get; set; }
        public string Value { get; set; }

        public override string ToString()
        {
            return $"Id=[{Id}], Key=[{Key}]";
        }
    }

Library Version: RepoDb v1.12.4 and RepoDb.SqlServer v1.1.1

But likely affects the later RepoDb v1.12.5 also based on code review…

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 15 (10 by maintainers)

Commits related to this issue

Most upvoted comments

@mikependon thanks, but no need to get us anything early, we’re working on other priorities at the moment and will be retesting this and migrating to it later once it’s released. Thanks!

Thanks! Still have to fix the one on Batch operation, both the InsertAl and MergeAll.

@mikependon yeah, at quick glance this looks great 🙌. . . should enable my team to use RepoDb as the sole library for this work going forward 😃

@cajuncoding - thanks for your response, then, I will do add a logic to get the index position of the entity based on the __RepoDb_OrderColumn column.

Also, just FYI, the one who generates the value of the __RepoDb_OrderColumn is the client itself, not the database. You can see it here, the GetValue(int i) is using the GetValueForEntities(int i) and returns the position if that is an index of the __RepoDb_OrderColumn. This position is equal to the index of the entity from the List<T>/IEnumerable<T>.

I again pre-assume that this additional logic is not needed anymore, as when we do the INSERT INTO, we also do the OUTPUT.<EntityId> and we do the ORDER BY on the __RepoDb_OrderColumn.

However, I will not spend time investigating that if you already had proven that historically. It will save me time fixing this problem instead of sleuthing around. I will just implement this additional logic.

@mikependon yes, I had similarly made the same presumption in my SqlBulkHelpers library which was written specifically for Bulk Insert with elegant support for returning Identity values from Sql Server – before RepoDb was available 😃.

And my team pretty quickly identified the ordering problem in testing it’s use in real world projects… so I had to enhance the library to then support ordinal ordering and enforcing ‘Order By’ when retrieving the results of the Merge.

This was all before RepoDb was published 😃, and we have since started using RepoDb on recent new development – instead of Dapper – with great hopes to have a unified library that also supported Bulk Insert/Updates, in addition to other great functionality such as supporting dynamic qualifiers to be specified, expression parsing, built in Model Mapping, etc. – which I had not implemented.

Unfortunately my team encountered similar (not identical) ordering issues with RepoDb, and I’m just now getting around to reporting the issue. . . therefore to keep the team going I implemented support for qualifiers to be specified as well as a couple other enhancements to SqlBulkHelpers so our project can move forward.

But, it will be great if we can get RepoDb reliability fixed for this issue . . . I’m also concerned for anyone else that may not realize they have a risk of corrupted Identity value data if they aren’t aware of this. . . . So some public notification on the Website, chat, etc. may be useful.

Thank you for reporting this. Historically we had analyzed such possibilities (exactly same possibilities) to both the batch and bulk operations, but we had not prove it happens. We had pre-assumed that it will never happened, but unfortunately, it does on your end (for the first time).

We will issue a prioritize fix for this and also to the Batch operations (InsertAll and MergeAll).

Just FYI, Entity Framework Core does the same batch operation, but I personally investigated it already and they are already doing the suggestions you mentioned above in relation to Ordering.