pyodbc: Inserting decimal data types using fast_executemany=True causes memory issues

Environment

  • Python: 3.6.9
  • pyodbc: 4.0.32
  • OS: Ubuntu Linux
  • DB: Azure SQL DB
  • driver: DRIVER={ODBC Driver 17 for SQL Server}

I am attempting to load a very wide table of 99 columns, including 44 decimal(28,9) columns, 34 uniqueidentifiers, 12 ints, 4 bits, and 5 dates, among others. See the schema below.

My Python application runs on a VM with 8 GB of memory and no load other than the application itself. It reads from a very large Parquet file in a streaming fashion and sends batches to Azure SQL via pyodbc’s executemany method.

I am using fast_executemany = True with autocommit = False, and I am currently testing with batches of 20,000 rows.
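
For reference, that configuration maps onto the pyodbc API roughly as follows (a sketch; the connection string values are placeholders, not from the original report):

    import pyodbc

    cnxn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=...;DATABASE=...;UID=...;PWD=...",
        autocommit=False,  # commit explicitly after each batch
    )
    cursor = cnxn.cursor()
    cursor.fast_executemany = True  # opt in to the parameter-array fast path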

I expect to be able to call cursor.executemany in a loop, passing a list of 20,000 tuples each time, and load nearly 10 million rows without my application’s memory consumption growing linearly. What I observe instead is that if the rows I pass to executemany() include decimals, memory apparently is never released by executemany(), so with each iteration I can watch memory usage grow steadily.

...
for record_batch in reader.iter_batches(batch_size=20_000):
    cursor.fast_executemany = True  # idempotent; could also be set once outside the loop
    # transpose the Arrow batch from columns into a list of row tuples
    values = list(zip(*map(lambda col: col.to_pylist(), record_batch.columns)))
    cursor.executemany(insert_stmt, values)
    cursor.commit()
...

Right out of the gate, the VM’s free memory dropped quickly and linearly until all 8 GB were consumed. So I began investigating, including removing columns from the list until I could observe a pattern. Here is what I noticed about memory usage:

  • If I take all of the decimals out of the list, leaving the other 55 columns, then the entire batch runs successfully with memory staying below 0.5 GB!

  • If I add 4 decimals to the list of 55, then by the 192nd iteration of the loop memory consumption is already hitting 3 GB. I stopped it there rather than let it continue to the 500th loop, because it didn’t seem worth it.

  • If I load JUST Id (a uniqueidentifier) and a single decimal column, then memory grows steadily until, by the time all 10 million rows have been inserted (about 12 minutes), 1.7 GB has been consumed. Given that I am loading batches in a for loop, the values should be released as they go out of scope, so why does memory usage grow consistently the entire time?

    • As a final decimal test, I ran a batch of Id and 9 decimals. By the 223rd loop I had to kill the process because it had already consumed 5.9 GB!
  • If I remove just the cursor.executemany call from the loop, memory is not an issue, so I am positive the problem is isolated to the cursor.

  • If I turn off fast_executemany, then I don’t have any memory issues, but I do have speed issues, so I definitely need this to work with fast_executemany.

  • If I pass the zip generator object to cursor.executemany rather than the materialized values list, then the memory issue goes away, but it runs slowly, as if it were no longer on the fast-execute path, which is what I suspect is happening (illustrated in the sketch below).
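
For illustration, the two calls differ only in the parameter container (this is a sketch of the observation above, not a statement about pyodbc internals):

    # Both calls would insert the same rows; only the container differs.
    cursor.executemany(insert_stmt, values)        # list: fast path, leaks with decimals
    cursor.executemany(insert_stmt, iter(values))  # generator: no leak observed, but slow,
                                                   # as if the fast path were not engaged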

It seems that decimals are the key to the memory issue I am seeing. Why does including a decimal column consume so much memory without it being released?

The simplest way to reproduce this is the third test above: a list of 20,000 tuples of (string, decimal), where the string is a GUID. Observe how memory grows linearly.
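
A minimal standalone repro along those lines might look like the following sketch. The connection string, the table name repro_table and its two columns, and the use of the stdlib resource module for memory readings are my own placeholders and assumptions, not details from the original report:

    import random
    import resource  # Unix-only; ru_maxrss is the peak RSS of the process
    import uuid
    from decimal import Decimal

    import pyodbc

    cnxn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=...;DATABASE=...;UID=...;PWD=...",
        autocommit=False,
    )
    cursor = cnxn.cursor()
    cursor.fast_executemany = True

    # 20,000 (GUID string, decimal) tuples, mirroring the Id + single
    # decimal(28,9) column test described above.
    values = [
        (str(uuid.uuid4()), Decimal(random.randrange(10**18)) / Decimal(10**9))
        for _ in range(20_000)
    ]

    # "repro_table" and its column names are hypothetical stand-ins.
    insert_stmt = "INSERT INTO repro_table (Id, CostDC) VALUES (?, ?)"

    for i in range(500):
        cursor.executemany(insert_stmt, values)
        cnxn.commit()
        # ru_maxrss is reported in KiB on Linux; it is a high-water mark,
        # which is fine here since monotonic growth is the symptom.
        peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
        print(f"iteration {i + 1}: peak RSS {peak_mib:.1f} MiB")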

I appreciate any assistance you can provide! Thank you for all you do!

Here is my target table:

Target Table:
[Id] [uniqueidentifier] NOT NULL,
[SourceCompany] [uniqueidentifier] NULL,
[Company] [uniqueidentifier] NULL,
[Account] [uniqueidentifier] NULL,
[Organization] [uniqueidentifier] NULL,
[BudgetCategory] [uniqueidentifier] NULL,
[LaborExpense] [int] NULL,
[Project] [uniqueidentifier] NULL,
[Task] [uniqueidentifier] NULL,
[PostingDate] [uniqueidentifier] NULL,
[PostingDateDate] [date] NULL,
[PostingPeriod] [uniqueidentifier] NULL,
[Year] [int] NULL,
[Period] [int] NULL,
[YearPeriod] [int] NULL,
[InputDocumentType] [int] NULL,
[InputDocumentSource] [int] NULL,
[InputDocumentSourceId] [uniqueidentifier] NULL,
[InputDocumentSuffix] [uniqueidentifier] NULL,
[DocumentNbr] nvarchar NULL,
[DocumentDate] [date] NULL,
[TransactionDate] [date] NULL,
[DocumentId] [uniqueidentifier] NULL,
[LineItemType] [int] NULL,
[LineItemId] [uniqueidentifier] NULL,
[ProcessTime] datetime2 NULL,
[EvcType] [int] NULL,
[EmployeeVendorClient] [uniqueidentifier] NULL,
[Employee] [uniqueidentifier] NULL,
[Vendor] [uniqueidentifier] NULL,
[Client] [uniqueidentifier] NULL,
[Equipment] [uniqueidentifier] NULL,
[Unit] [uniqueidentifier] NULL,
[ExpenseReportCategory] [uniqueidentifier] NULL,
[DocumentCurrency] [uniqueidentifier] NULL,
[TransactionCurrency] [uniqueidentifier] NULL,
[CostDC] [decimal](28, 9) NULL,
[CostTC] [decimal](28, 9) NULL,
[CostCC] [decimal](28, 9) NULL,
[CostPC] [decimal](28, 9) NULL,
[CostBC] [decimal](28, 9) NULL,
[EffectiveBillCostPC] [decimal](28, 9) NULL,
[EffectiveBillCostCC] [decimal](28, 9) NULL,
[EffectiveBillCostBC] [decimal](28, 9) NULL,
[EffortPC] [decimal](28, 9) NULL,
[EffortCC] [decimal](28, 9) NULL,
[EffortBC] [decimal](28, 9) NULL,
[WriteOffCostDC] [decimal](28, 9) NULL,
[WriteOffCostTC] [decimal](28, 9) NULL,
[WriteOffCostCC] [decimal](28, 9) NULL,
[WriteOffCostPC] [decimal](28, 9) NULL,
[WriteOffCostBC] [decimal](28, 9) NULL,
[WriteOffEffortCC] [decimal](28, 9) NULL,
[WriteOffEffortPC] [decimal](28, 9) NULL,
[WriteOffEffortBC] [decimal](28, 9) NULL,
[WriteOffQty] [decimal](28, 9) NULL,
[CostBasis] [int] NULL,
[EffortBasis] [int] NULL,
[Quantity] [decimal](28, 9) NULL,
[CostRateOverride] [bit] NULL,
[EffectiveCostRateDC] [decimal](28, 9) NULL,
[EffectiveCostRateTC] [decimal](28, 9) NULL,
[EffectiveCostRatePC] [decimal](28, 9) NULL,
[EffectiveCostRateCC] [decimal](28, 9) NULL,
[EffectiveCostRateBC] [decimal](28, 9) NULL,
[EffectiveBillCostRatePC] [decimal](28, 9) NULL,
[EffectiveBillCostRateCC] [decimal](28, 9) NULL,
[EffectiveBillCostRateBC] [decimal](28, 9) NULL,
[RegularCostRateDC] [decimal](28, 9) NULL,
[RegularCostRateTC] [decimal](28, 9) NULL,
[RegularCostRatePC] [decimal](28, 9) NULL,
[RegularCostRateCC] [decimal](28, 9) NULL,
[RegularCostRateBC] [decimal](28, 9) NULL,
[RegularBillCostRatePC] [decimal](28, 9) NULL,
[RegularBillCostRateCC] [decimal](28, 9) NULL,
[RegularBillCostRateBC] [decimal](28, 9) NULL,
[EffortRatePC] [decimal](28, 9) NULL,
[EffortRateCC] [decimal](28, 9) NULL,
[EffortRateBC] [decimal](28, 9) NULL,
[EffortMultiplier] [decimal](28, 9) NULL,
[PremiumEffortMultiplier] [decimal](28, 9) NULL,
[FeeType] [uniqueidentifier] NULL,
[RevenueRule] [int] NULL,
[ActivityType] [uniqueidentifier] NULL,
[LaborBillClass] [uniqueidentifier] NULL,
[HoursType] [uniqueidentifier] NULL,
[ProjectDescription] [uniqueidentifier] NULL,
[ExternalReferenceNbr] nvarchar NULL,
[ExternalReferenceDate] [date] NULL,
[Overtime] [bit] NULL,
[HomeCompany] [uniqueidentifier] NULL,
[HomeOrg] [uniqueidentifier] NULL,
[Billable] [bit] NULL,
[BillItemId] [uniqueidentifier] NULL,
[OriginalProjectDetailId] [uniqueidentifier] NULL,
[Location] [uniqueidentifier] NULL,
[ItemId] [uniqueidentifier] NULL,
[ExcludeFromStaffUtilization] [bit] NULL,
[BatchId] [int] NULL

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 21 (2 by maintainers)

Most upvoted comments

@v-chojas - My apologies. PyCharm didn’t pick up the change until I ran

pip install -e /home/gord/git/pyodbc/

Now (on Linux) I get

before loop: 39.8 MiB
after 1 iteration: 40.2 MiB
after 2 iterations: 46.8 MiB
after 3 iterations: 46.8 MiB
after 4 iterations: 46.8 MiB
after 5 iterations: 46.8 MiB
after 6 iterations: 46.8 MiB
after 7 iterations: 46.8 MiB
after 8 iterations: 46.8 MiB
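
(Readings like these can be produced with something along the lines of the following sketch. psutil is my assumption here, not necessarily what was used; cursor, cnxn, insert_stmt, and values are as set up earlier.)

    import os

    import psutil  # assumption: any RSS-reporting tool would do

    proc = psutil.Process(os.getpid())

    def rss_mib():
        # current resident set size of this process, in MiB
        return proc.memory_info().rss / (1024 * 1024)

    print(f"before loop: {rss_mib():.1f} MiB")
    for i in range(8):
        cursor.executemany(insert_stmt, values)
        cnxn.commit()
        print(f"after {i + 1} iterations: {rss_mib():.1f} MiB")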

You can check out a specific commit ID instead of only the latest master.
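
For example (with <commit-sha> standing in for the commit of interest):

git -C /home/gord/git/pyodbc checkout <commit-sha>
pip install -e /home/gord/git/pyodbc/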

This is working as expected now! Thank you so much for the incredibly quick turnaround!
I assume I should close the issue now?

Good analysis @gordthompson , and nice find @v-chojas .

Additional information: the above code was run using pyodbc 4.0.32 (the current release). The issue also reproduces on the current master branch, as well as with the modifications from PR #703 by @Mizaro, so it looks like it really is a new issue.