frictionless-py: Writing resource data to sqlite DB raises "sqlite3.OperationalError: too many SQL variables"
Hi,
writing sufficiently “wide” (i.e. many fields/columns) and “long” (i.e. many rows) table data easily runs into “sqlite3.OperationalError: too many SQL variables” for SQLite.
The reason seems to be how the Insert operation is done here:
For a given sample program

```python
# frictionless_sqlite_params_error.py
from frictionless import Resource
from frictionless.plugins.sql import SqlDialect

# sufficiently 'wide' and 'long' table data to provoke SQLite param
# restrictions
number_of_fields = 90
number_of_rows = 100

data = '\n'.join([
    ','.join(f'header{i}' for i in range(number_of_fields)),
    '\n'.join(
        ','.join(f'row{r}_col{c}' for c in range(number_of_fields))
        for r in range(number_of_rows)
    )
]).encode('ascii')

with Resource(data, format='csv') as resource:
    resource.write('sqlite:///app.db', dialect=SqlDialect(table='datatable'))
```
…this creates a huge “bulk insert” SQL statement
[SQL: INSERT INTO datatable (header0, header1, header2, header3, header4, header5, header6, header7, header8, header9, header10, header11, header12, header13, header14, header15, header16, header17, header18, header19, header20, header21, header22, header23, header24, header25, header26, header27, header28, header29, header30, header31, header32, header33, header34, header35, header36, header37, header38, header39, header40, header41, header42, header43, header44, header45, header46, header47, header48, header49, header50, header51, header52, header53, header54, header55, header56, header57, header58, header59, header60, header61, header62, header63, header64, header65, header66, header67, header68, header69, header70, header71, header72, header73, header74, header75, header76, header77, header78, header79, header80, header81, header82, header83, header84, header85, header86, header87, header88, header89) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?), (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?), ...
I.e. this uses separate parameters for each field, for each row.
According to https://www.sqlite.org/limits.html the max. number of params in an SQL statement is “[…] SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0. […]”.
I’m running on 3.22.0, so I immediately stumbled into the limit. While the limit has obviously been lifted somewhat in newer versions, I still think this is not a viable approach, as you’ll just run into the problem again for bigger (more rows) tables.
It does look like there’s some code that addresses this to a degree (the `buffer` and `buffer_size` parts) but IMHO this is not sufficient (think tables with 50 columns and 800 rows = 40000 params > 32766, again).
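(For illustration only, a hypothetical column-aware chunk size that would stay under the limit; this is not what the current code does, and `safe_buffer_size` is a made-up helper:)

```python
# Hypothetical illustration: the chunk size would have to depend on the number
# of columns, because the bulk insert uses one bound parameter per field per row.
def safe_buffer_size(number_of_columns, max_params=999):
    # largest number of rows per INSERT ... VALUES statement that stays
    # within the engine's host parameter limit
    return max(1, max_params // number_of_columns)

print(safe_buffer_size(50))         # 19 rows per statement under the old 999 limit
print(safe_buffer_size(50, 32766))  # 655 rows per statement since SQLite 3.32.0
```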
I’ve taken a look at SQLAlchemy possibilities and it seems to me one should rather use
```python
self.__connection.execute(sql_table.insert(), buffer)
```

instead of

```python
self.__connection.execute(sql_table.insert().values(buffer))
```

(see https://docs.sqlalchemy.org/en/14/tutorial/dbapi_transactions.html#tutorial-multiple-parameters) and make SQLAlchemy apply `cursor.executemany`.
In contrast to the bulk insert above, the former creates a prepared, parameterized SQL statement with placeholders for a single row only:
INSERT INTO datatable (header0, header1, header2, header3, header4, header5, header6, header7, header8, header9, header10, header11, header12, header13, header14, header15, header16, header17, header18, header19, header20, header21, header22, header23, header24, header25, header26, header27, header28, header29, header30, header31, header32, header33, header34, header35, header36, header37, header38, header39, header40, header41, header42, header43, header44, header45, header46, header47, header48, header49, header50, header51, header52, header53, header54, header55, header56, header57, header58, header59, header60, header61, header62, header63, header64, header65, header66, header67, header68, header69, header70, header71, header72, header73, header74, header75, header76, header77, header78, header79, header80, header81, header82, header83, header84, header85, header86, header87, header88, header89) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
I’m aware this might have performance implications, depending on whether the DB engine implements `executemany` efficiently. I haven’t measured.

I’m not convinced the “params for each field of each row” approach is faster in the first place; but if it is, one could also use the `executemany` approach as a fallback, in case of running into the exception(?).
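A minimal sketch of such a fallback (a hypothetical helper, not actual frictionless code; it assumes the SQLAlchemy connection/table/buffer objects used in `storage.py`):

```python
from sqlalchemy.exc import OperationalError

def insert_buffer(connection, sql_table, buffer):
    """Hypothetical helper: try the bulk-values insert first, fall back to the
    executemany-style call if the backend rejects the number of parameters."""
    try:
        connection.execute(sql_table.insert().values(buffer))
    except OperationalError:
        # e.g. sqlite3.OperationalError: too many SQL variables
        connection.execute(sql_table.insert(), buffer)
```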
So a naive fix to not run into the error would be:

```diff
$ diff frictionless/plugins/sql//storage.py.BAK frictionless/plugins/sql//storage.py
333c333
< self.__connection.execute(sql_table.insert().values(buffer))
---
> self.__connection.execute(sql_table.insert(), buffer)
336c336
< self.__connection.execute(sql_table.insert().values(buffer))
---
> self.__connection.execute(sql_table.insert(), buffer)
```
(no changes to the `buffer` and `buffer_size` parts here since I didn’t really understand their full intention 😉).
If helpful I could come up with a PR.
Again, awesome library, best regards, Holger
About this issue
- State: closed
- Created 2 years ago
- Comments: 17 (17 by maintainers)
Hi @roll, @shashigharti, please reopen:
Sadly, somehow the (merged) PR #1255 never made it to framework v5:
https://github.com/frictionlessdata/framework/blob/01689f923560cf6d5aa6b6b0fd44229118fec75f/frictionless/formats/sql/adapter.py#L124-L126
(Accompanying tests, i.e. the `test_sqlite.py` and `conftests.py` additions, are also missing; see e.g.: https://github.com/frictionlessdata/framework/tree/main/tests/formats/sql)

So I’m running into this issue again on frictionless 5.3.0. Since this was merged in Oct. 2022, after the last v4 release, it looks like there’s now no released version that contains the fix. 😦
Let me know if I can help…
Best, Holger
Hi there @shashigharti @roll, PR #1255 is there (finally) - sorry for the long silence.
That’s the minimal code change discussed in previous ticket comments plus accompanying test.
As a little extra and compensation for your patience I’ve benchmarked the principal `executemany` approach against the one previously used in frictionless-framework’s `storage.py`. You can find it all here: https://github.com/hjoukl/bench-sqlalchemy-executemany 😃

The gist of it is that `executemany` is superior to the previous approach performance-wise, sometimes vastly (especially with SQLite). I’ve run the sample benchmark for the DB engines tested in your CI (SQLite, PostgreSQL, MySQL).

Btw I’ve stumbled over some rough edges in the v5 docs (e.g. invalid examples using no-longer-existing `plugins` imports) and some missing bits that might be useful for the development/contribution docs (like “copy .env.example to .env before running make test”). Would you like some suggestions wrt this in the form of tickets or otherwise?

Greetings, Holger
Hi @roll and @shashigharti,
great to see you considering this issue! Just a note wrt the classification change from bug to enhancement:
Since stock RHEL8 seems to have the below sqlite3 defaults (sqlite version + compile options), the usability of `frictionless-py` is severely limited on this platform, which I think is in very widespread use in “enterprisey” Linux environments (probably predominant). Same goes for the older RHEL7. I.e. you’d basically need to be able to install a newer sqlite version, or compile sqlite and use this instead of the system sqlite, which might mean you’d also need to recompile Python (at least its sqlite3 extension)…

This is something that most users won’t be able/allowed to do in a corporate environment, from my experience.
You run into those limits with a table with 20 fields and 50 rows, i.e. not only tables with an unrealistically high number of columns:
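A quick back-of-the-envelope check of the parameter arithmetic:

```python
number_of_fields, number_of_rows = 20, 50
# one bound parameter per field per row in a single bulk INSERT statement
print(number_of_fields * number_of_rows)  # 1000 > 999 (pre-3.32.0 default limit)
```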
So I basically think a change like the one proposed is both a bugfix (the 1000-lines chunking is not enough to avoid exceptions in general) and an enhancement (it looks like it’s dramatically faster for SQLite, but of course other DB backends might behave completely differently - needs proper benchmarking).
Hopefully not getting on anybody’s nerves here - of course it’s entirely your call how you handle this. I just wanted to bring it to your attention that `frictionless-py` might not be properly usable with an SQLite DB backend on these mentioned platforms, at all. Which would be a shame IMHO since it’s so cool. 😃

Best regards, Holger
Thanks for the amazing input @hjoukl!
@shashigharti and I will be discussing it soon with the team
@shashigharti thanks for investigating further!
One quick remark:
That’s what I also figured out in the meantime, thanks for confirming. 😃 I applied the same approach in the benchmark of my last comment.
From `sqlite3 app.db 'pragma compile_options;'`, do you get `MAX_VARIABLE_NUMBER` output? Then that’s what’s relevant, rather than the sqlite version number. E.g. my Kubuntu (Debian) sqlite package (< 3.32) has drastically raised limits, i.e. you don’t run into the 999 limit there, even though that’s “vanilla” sqlite’s default.

So you really need to check the actual compiled limits of the sqlite under test to form proper expectations of when the test will fail or run through without breaching the limits.

If I interpret your configuration output correctly that’s also the case for you, so the results look plausible to me (e.g. no error for 250,000 variables, error for 300,000 variables without chunking)?
I got curious, so I hacked together a rather naive benchmark (timeit-based, measurements without any precautions to keep the machine idle apart from the benchmark process).

In my initial tests this seems to clearly hint - for the sqlite3 engine at least - that applying `executemany` is (vastly) superior to “named/numbered parameters for all row values”:

(unless I’ve made some really stupid mistake in the benchmarking code, obviously 😃)
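For reference, a minimal sketch of this kind of comparison (not the actual benchmark code linked above; it assumes SQLAlchemy 1.x and an in-memory SQLite database, with dimensions kept below even the old 999-parameter limit):

```python
import timeit
from sqlalchemy import Column, MetaData, Table, Text, create_engine

engine = create_engine('sqlite://')  # in-memory SQLite
metadata = MetaData()
table = Table(
    'datatable', metadata,
    *(Column(f'header{i}', Text) for i in range(9))
)
metadata.create_all(engine)

# 9 columns x 100 rows = 900 bound parameters, below even the old 999 limit
rows = [{f'header{i}': f'row{r}_col{i}' for i in range(9)} for r in range(100)]

def insert_values():
    # one statement with a parameter for every field of every row
    with engine.begin() as connection:
        connection.execute(table.insert().values(rows))

def insert_executemany():
    # prepared statement for a single row, executed for all rows
    with engine.begin() as connection:
        connection.execute(table.insert(), rows)

print('values():   ', timeit.timeit(insert_values, number=20))
print('executemany:', timeit.timeit(insert_executemany, number=20))
```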
So please take with a grain of salt - but I think this indicates that some proper benchmarking should be done, preferably with other DB backends/engines, too.
Best regards, Holger
Hi @shashigharti ,
Have checked now 😃:
With a slightly modified test program that lets me vary the number of columns/rows:
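A sketch of such a variant (not necessarily the exact program used), taking the dimensions from the command line:

```python
# frictionless_sqlite_params_error.py, parameterized variant (sketch)
import sys

from frictionless import Resource
from frictionless.plugins.sql import SqlDialect

# number of columns and rows from the command line
number_of_fields, number_of_rows = int(sys.argv[1]), int(sys.argv[2])

data = '\n'.join([
    ','.join(f'header{i}' for i in range(number_of_fields)),
    '\n'.join(
        ','.join(f'row{r}_col{c}' for c in range(number_of_fields))
        for r in range(number_of_rows)
    )
]).encode('ascii')

with Resource(data, format='csv') as resource:
    resource.write('sqlite:///app.db', dialect=SqlDialect(table='datatable'))
```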
To not run into the very restricted limits of the sqlite versions on RHEL7 and RHEL8 I tested on my Kubuntu 20.04 machine. FWIW this also still features an sqlite version < 3.32.0, namely 3.31.1:
However, (K)Ubuntu / Debian seem to drastically raise the (compiled) limits for their sqlite package:
Still, one can easily provoke the error:
I wouldn’t say that 250 columns makes a lot of sense but I don’t think it should crash, either.
Also, as I’ve had to learn, you need to check the actual compiled restrictions for your sqlite installation, since these could well not be the defaults. One can check with `sqlite3 app.db 'pragma compile_options;'`.
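The same pragma can also be issued from Python’s stdlib `sqlite3` module (a sketch; as noted below this is brittle, since `MAX_VARIABLE_NUMBER` only shows up in the output if it was set explicitly at compile time):

```python
import sqlite3

with sqlite3.connect('app.db') as connection:
    options = [row[0] for row in connection.execute('PRAGMA compile_options')]

max_variable_number = next(
    (int(option.split('=', 1)[1])
     for option in options if option.startswith('MAX_VARIABLE_NUMBER=')),
    None,  # None: compiled with the version-dependent default (999 or 32766)
)
print(max_variable_number)
```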
What about combining the 2 variants for the prepared statements? I.e. the `self.__connection.execute(sql_table.insert().values(buffer))` variant as the default and `self.__connection.execute(sql_table.insert(), buffer)` in case of the exception?

Unfortunately it doesn’t look like there’s a programmatic sqlite3 Python interface to get at `MAX_VARIABLE_NUMBER`; if there was, `frictionless-py` could dynamically optimize `buffer_size`. But reading this information through `sqlite3 app.db 'pragma compile_options;'` seems brittle.

I’m also not convinced that the “named/numbered parameters for all row values” approach is actually faster than running through `executemany`, but that probably depends heavily on the database and the database library/engine used. I.e. this needs benchmarking…

(For completeness’ sake: it looks like https://github.com/rogerbinns/apsw (an alternative sqlite Python wrapper) does seem to feature getting at that information through its `connection.limit` method.)
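A sketch of what that could look like (untested, based on the apsw docs; `SQLITE_LIMIT_VARIABLE_NUMBER` is the relevant limit id):

```python
import apsw

connection = apsw.Connection('app.db')
# passing -1 as the new value only queries the current limit without changing it
max_variable_number = connection.limit(apsw.SQLITE_LIMIT_VARIABLE_NUMBER, -1)
print(max_variable_number)
```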
Hi @shashigharti,
thanks for getting back!
I kindly disagree 😉
According to https://www.sqlite.org/limits.html: “A host parameter is a place-holder in an SQL statement that is filled in using one of the sqlite3_bind_XXXX() interfaces.”
So as you yourself show, the number of parameters applied in a statement is indeed rows x columns in the `self.__connection.execute(sql_table.insert().values(buffer))` case:

So IMHO this doesn’t scale, by definition - unless you have other mechanisms in place to avoid creating a prepared statement with an immense number of named/numbered parameters.
As I mentioned, I didn’t quite grasp the `buffer` and `buffer_size` parts of the code; maybe they keep the number of params under the 32766 limit? Will take a deeper look.

Just from looking at it I’d say you still run into the restriction with e.g. 33 columns and 1001 rows, since 33 * 1001 > 32766. Have to check…

I have yet to test with your sample spreadsheet. But looking at it, are you sure it actually applies 191 columns, since the column name “field12” is duplicated in columns L-Gi?
Unfortunately several “enterprise” systems ship with sqlite < 3.32.0 and thus have a drastically lower `SQLITE_MAX_VARIABLE_NUMBER` of 999:

So this may be a problem in an enterprise context where you might not easily be able to install newer versions. So `frictionless-py` might not be practically usable there unless you’re able to upgrade sqlite.

Both use prepared statements, but the currently implemented variant prepares statements where the number of used named/numbered parameters depends on both the number of columns and the number of rows.
I wonder how other databases handle a huge number of named/numbered parameters. Personally, I find it a bit surprising of SQLAlchemy to “create a dedicated variable for every single input value”. It feels a bit like having a function `f(x, y)` and, instead of invoking it with several `(x, y)` input tuples `[(1, 'a'), (2, 'b'), (3, 'c')]` like `f(1, 'a')`, `f(2, 'b')`, `f(3, 'c')`, rewriting it to a function `f_mod(x1, y1, x2, y2, x3, y3)` and calling `f_mod(1, 'a', 2, 'b', 3, 'c')`.

But I take it there might be valid reasons for doing that with `.insert().values(...)` (e.g. `executemany` is not performant).

Thanks @hjoukl!
@shashigharti can you please take a look?