frictionless-py: Writing resource data to sqlite DB raises "sqlite3.OperationalError: too many SQL variables"
Hi,
writing sufficiently “wide” (i.e. many fields/columns) and “long” (i.e. many rows) table data easily runs into “sqlite3.OperationalError: too many SQL variables” for SQLite.
The reason seems to be how the Insert operation is done here:
For a given sample program

```python
# frictionless_sqlite_params_error.py
from frictionless import Resource
from frictionless.plugins.sql import SqlDialect

# sufficiently 'wide' and 'long' table data to provoke SQLite param
# restrictions
number_of_fields = 90
number_of_rows = 100

data = '\n'.join([
    ','.join(f'header{i}' for i in range(number_of_fields)),
    '\n'.join(
        ','.join(f'row{r}_col{c}' for c in range(number_of_fields))
        for r in range(number_of_rows)
    )
]).encode('ascii')

with Resource(data, format='csv') as resource:
    resource.write('sqlite:///app.db', dialect=SqlDialect(table='datatable'))
```
…this creates a huge “bulk insert” SQL statement
[SQL: INSERT INTO datatable (header0, header1, header2, header3, header4, header5, header6, header7, header8, header9, header10, header11, header12, header13, header14, header15, header16, header17, header18, header19, header20, header21, header22, header23, header24, header25, header26, header27, header28, header29, header30, header31, header32, header33, header34, header35, header36, header37, header38, header39, header40, header41, header42, header43, header44, header45, header46, header47, header48, header49, header50, header51, header52, header53, header54, header55, header56, header57, header58, header59, header60, header61, header62, header63, header64, header65, header66, header67, header68, header69, header70, header71, header72, header73, header74, header75, header76, header77, header78, header79, header80, header81, header82, header83, header84, header85, header86, header87, header88, header89) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?), (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?), ...
I.e. this uses separate parameters for each field, for each row.
According to https://www.sqlite.org/limits.html the max. number of params in an SQL statement is “[…] SQLITE_MAX_VARIABLE_NUMBER, which defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0. […]”.
I’m running on 3.22.0, so I immediately stumbled into the limit. While the limit has obviously been lifted somewhat in newer versions, I still think this is not a viable approach, as you’ll just run into the problem again for bigger (more rows) tables.
It does look like there’s some code that addresses this to a degree (the `buffer` and `buffer_size` parts) but IMHO this is not sufficient (think tables with 50 columns and 800 rows = 40000 params > 32766, again).
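(For illustration only, a hypothetical column-aware chunk size that would stay under the limit; this is not what the current code does, and `safe_buffer_size` is a made-up helper:)

```python
# Hypothetical illustration: the chunk size would have to depend on the number
# of columns, because the bulk insert uses one bound parameter per field per row.
def safe_buffer_size(number_of_columns, max_params=999):
    # largest number of rows per INSERT ... VALUES statement that stays
    # within the engine's host parameter limit
    return max(1, max_params // number_of_columns)

print(safe_buffer_size(50))         # 19 rows per statement under the old 999 limit
print(safe_buffer_size(50, 32766))  # 655 rows per statement since SQLite 3.32.0
```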
I’ve taken a look at SQLAlchemy possibilities and it seems to me one should rather use
```python
self.__connection.execute(sql_table.insert(), buffer)
```

instead of

```python
self.__connection.execute(sql_table.insert().values(buffer))
```

(see https://docs.sqlalchemy.org/en/14/tutorial/dbapi_transactions.html#tutorial-multiple-parameters) and make SQLAlchemy apply `cursor.executemany`.
In contrast to the bulk insert above, the former creates a prepared, parameterized SQL statement with placeholders for a single row only:
INSERT INTO datatable (header0, header1, header2, header3, header4, header5, header6, header7, header8, header9, header10, header11, header12, header13, header14, header15, header16, header17, header18, header19, header20, header21, header22, header23, header24, header25, header26, header27, header28, header29, header30, header31, header32, header33, header34, header35, header36, header37, header38, header39, header40, header41, header42, header43, header44, header45, header46, header47, header48, header49, header50, header51, header52, header53, header54, header55, header56, header57, header58, header59, header60, header61, header62, header63, header64, header65, header66, header67, header68, header69, header70, header71, header72, header73, header74, header75, header76, header77, header78, header79, header80, header81, header82, header83, header84, header85, header86, header87, header88, header89) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
I’m aware this might have performance implications, depending on whether the DB engine implements `executemany` efficiently. I haven’t measured.

I’m not convinced the “params for each field of each row” approach is faster in the first place; but if it is, one could also use the `executemany` approach as a fallback, in case of running into the exception(?).
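A minimal sketch of such a fallback (a hypothetical helper, not actual frictionless code; it assumes the SQLAlchemy connection/table/buffer objects used in `storage.py`):

```python
from sqlalchemy.exc import OperationalError

def insert_buffer(connection, sql_table, buffer):
    """Hypothetical helper: try the bulk-values insert first, fall back to the
    executemany-style call if the backend rejects the number of parameters."""
    try:
        connection.execute(sql_table.insert().values(buffer))
    except OperationalError:
        # e.g. sqlite3.OperationalError: too many SQL variables
        connection.execute(sql_table.insert(), buffer)
```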
So a naive fix to not run into the error would be:

```diff
$ diff frictionless/plugins/sql//storage.py.BAK frictionless/plugins/sql//storage.py
333c333
< self.__connection.execute(sql_table.insert().values(buffer))
---
> self.__connection.execute(sql_table.insert(), buffer)
336c336
< self.__connection.execute(sql_table.insert().values(buffer))
---
> self.__connection.execute(sql_table.insert(), buffer)
```
(no changes to the `buffer` and `buffer_size` parts here since I didn’t really understand their full intention 😉).
If helpful I could come up with a PR.
Again, awesome library, best regards, Holger
About this issue
- State: closed
- Created 2 years ago
- Comments: 17 (17 by maintainers)
Hi @roll, @shashigharti, please reopen:
Sadly, somehow the (merged) PR #1255 never made it to framework v5:
https://github.com/frictionlessdata/framework/blob/01689f923560cf6d5aa6b6b0fd44229118fec75f/frictionless/formats/sql/adapter.py#L124-L126
(Accompanying tests, i.e. the `test_sqlite.py` and `conftests.py` additions, are also missing; see e.g.: https://github.com/frictionlessdata/framework/tree/main/tests/formats/sql)

So I’m running into this issue again on frictionless 5.3.0. Since this was merged in Oct. 2022, after the last v4 release, it looks like there’s now no released version that contains the fix. 😦
Let me know if I can help…
Best, Holger
Hi there @shashigharti @roll, PR #1255 is there (finally) - sorry for the long silence.
That’s the minimal code change discussed in previous ticket comments plus accompanying test.
As a little extra and compensation for your patience I’ve benchmarked the principal `executemany` approach against the one previously used in frictionless-framework’s `storage.py`. You can find it all here: https://github.com/hjoukl/bench-sqlalchemy-executemany 😃

The gist of it is that `executemany` is superior to the previous approach performance-wise, sometimes vastly (especially with SQLite). I’ve run the sample benchmark for the DB engines tested in your CI (SQLite, PostgreSQL, MySQL).

Btw I’ve stumbled over some rough edges in the v5 docs (e.g. invalid examples using no-longer-existing `plugins` imports) and some missing bits that might be useful for the development/contribution docs (like “copy .env.example to .env before running make test”). Would you like some suggestions wrt this in the form of tickets or otherwise?

Greetings, Holger
Hi @roll and @shashigharti,
great to see you considering this issue! Just a note wrt the classification change from bug to enhancement:
Since stock RHEL8 seems to have the below sqlite3 defaults (sqlite version + compile options), the usability of `frictionless-py` is severely limited on this platform, which I think is in very widespread use in “enterprisey” Linux environments (probably predominant). Same goes for the older RHEL7. I.e. you’d basically need to be able to install a newer sqlite version, or compile sqlite and use this instead of the system sqlite, which might mean you’d also need to recompile Python (at least its sqlite3 extension)…

This is something that most users won’t be able/allowed to do in a corporate environment, from my experience.
You run into those limits with a table with 20 fields and 50 rows, i.e. not only tables with an unrealistically high number of columns:
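A quick back-of-the-envelope check of the parameter arithmetic:

```python
number_of_fields, number_of_rows = 20, 50
# one bound parameter per field per row in a single bulk INSERT statement
print(number_of_fields * number_of_rows)  # 1000 > 999 (pre-3.32.0 default limit)
```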
So I basically think a change like the one proposed is both a bugfix (the 1000-lines chunking is not enough to avoid exceptions in general) and an enhancement (it looks like it’s dramatically faster for SQLite, but of course other DB backends might behave completely differently - needs proper benchmarking).
Hopefully not getting on anybody’s nerves here - of course it’s entirely your call how you handle this. I just wanted to bring it to your attention that `frictionless-py` might not be properly usable with an SQLite DB backend on these mentioned platforms, at all. Which would be a shame IMHO since it’s so cool. 😃

Best regards, Holger
Thanks for the amazing input @hjoukl!
@shashigharti and I will be discussing it soon with the team
@shashigharti thanks for investigating further!
One quick remark:
That’s what I also figured out in the meantime, thanks for confirming. 😃 I applied the same approach in the benchmark of my last comment.
From `sqlite3 app.db 'pragma compile_options;'`, do you get `MAX_VARIABLE_NUMBER` output? Then that’s what’s relevant, rather than the sqlite version number. E.g. my Kubuntu (Debian) sqlite package (< 3.32) has drastically raised limits, i.e. you don’t run into the 999 limit there, even though that’s “vanilla” sqlite’s default.

So you really need to check the actual compiled limits of the sqlite under test to form proper expectations of when the test will fail or run through without breaching the limits.

If I interpret your configuration output correctly that’s also the case for you, so the results look plausible to me (e.g. no error for 250,000 variables, error for 300,000 variables without chunking)?
I got curious, so I hacked together a rather naive benchmark (timeit-based, measurements without any precautions to keep the machine idle apart from the benchmark process).

In my initial tests this seems to clearly hint - for the sqlite3 engine at least - that applying `executemany` is (vastly) superior to “named/numbered parameters for all row values”:

(unless I’ve made some really stupid mistake in the benchmarking code, obviously 😃)
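For reference, a minimal sketch of this kind of comparison (not the actual benchmark code linked above; it assumes SQLAlchemy 1.x and an in-memory SQLite database, with dimensions kept below even the old 999-parameter limit):

```python
import timeit
from sqlalchemy import Column, MetaData, Table, Text, create_engine

engine = create_engine('sqlite://')  # in-memory SQLite
metadata = MetaData()
table = Table(
    'datatable', metadata,
    *(Column(f'header{i}', Text) for i in range(9))
)
metadata.create_all(engine)

# 9 columns x 100 rows = 900 bound parameters, below even the old 999 limit
rows = [{f'header{i}': f'row{r}_col{i}' for i in range(9)} for r in range(100)]

def insert_values():
    # one statement with a parameter for every field of every row
    with engine.begin() as connection:
        connection.execute(table.insert().values(rows))

def insert_executemany():
    # prepared statement for a single row, executed for all rows
    with engine.begin() as connection:
        connection.execute(table.insert(), rows)

print('values():   ', timeit.timeit(insert_values, number=20))
print('executemany:', timeit.timeit(insert_executemany, number=20))
```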
So please take with a grain of salt - but I think this indicates that some proper benchmarking should be done, preferably with other DB backends/engines, too.
Best regards, Holger
Hi @shashigharti ,
Have checked now 😃:
With a slightly modified test program that lets me vary the number of columns/rows:
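A sketch of such a variant (not necessarily the exact program used), taking the dimensions from the command line:

```python
# frictionless_sqlite_params_error.py, parameterized variant (sketch)
import sys

from frictionless import Resource
from frictionless.plugins.sql import SqlDialect

# number of columns and rows from the command line
number_of_fields, number_of_rows = int(sys.argv[1]), int(sys.argv[2])

data = '\n'.join([
    ','.join(f'header{i}' for i in range(number_of_fields)),
    '\n'.join(
        ','.join(f'row{r}_col{c}' for c in range(number_of_fields))
        for r in range(number_of_rows)
    )
]).encode('ascii')

with Resource(data, format='csv') as resource:
    resource.write('sqlite:///app.db', dialect=SqlDialect(table='datatable'))
```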
To not run into the very restricted limits of the sqlite versions on RHEL7 and RHEL8 I tested on my Kubuntu 20.04 machine. FWIW this also still features an sqlite version < 3.32.0, namely 3.31.1:
However, (K)Ubuntu / Debian seem to drastically raise the (compiled) limits for their sqlite package:
Still, one can easily provoke the error:
I wouldn’t say that 250 columns makes a lot of sense but I don’t think it should crash, either.
Also, as I’ve had to learn, you need to check the actual compiled restrictions for your sqlite installation, since these could well not be the defaults. One can check with `sqlite3 app.db 'pragma compile_options;'`.
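The same pragma can also be issued from Python’s stdlib `sqlite3` module (a sketch; as noted below this is brittle, since `MAX_VARIABLE_NUMBER` only shows up in the output if it was set explicitly at compile time):

```python
import sqlite3

with sqlite3.connect('app.db') as connection:
    options = [row[0] for row in connection.execute('PRAGMA compile_options')]

max_variable_number = next(
    (int(option.split('=', 1)[1])
     for option in options if option.startswith('MAX_VARIABLE_NUMBER=')),
    None,  # None: compiled with the version-dependent default (999 or 32766)
)
print(max_variable_number)
```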
What about combining the 2 variants for the prepared statements? I.e. the `self.__connection.execute(sql_table.insert().values(buffer))` variant as the default and `self.__connection.execute(sql_table.insert(), buffer)` in case of the exception?

Unfortunately it doesn’t look like there’s a programmatic sqlite3 Python interface to get at `MAX_VARIABLE_NUMBER`; if there was, `frictionless-py` could dynamically optimize `buffer_size`. But reading this information through `sqlite3 app.db 'pragma compile_options;'` seems brittle.

I’m also not convinced that the “named/numbered parameters for all row values” approach is actually faster than running through `executemany`, but that probably depends heavily on the database and the database library/engine used. I.e. this needs benchmarking…

(For completeness’ sake: it looks like https://github.com/rogerbinns/apsw (an alternative sqlite Python wrapper) does seem to feature getting at that information through its `connection.limit` method.)
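A sketch of what that could look like (untested, based on the apsw docs; `SQLITE_LIMIT_VARIABLE_NUMBER` is the relevant limit id):

```python
import apsw

connection = apsw.Connection('app.db')
# passing -1 as the new value only queries the current limit without changing it
max_variable_number = connection.limit(apsw.SQLITE_LIMIT_VARIABLE_NUMBER, -1)
print(max_variable_number)
```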
Hi @shashigharti,
thanks for getting back!
I kindly disagree 😉
According to https://www.sqlite.org/limits.html: “A host parameter is a place-holder in an SQL statement that is filled in using one of the sqlite3_bind_XXXX() interfaces.”
So as you yourself show, the number of parameters applied in a statement is indeed rows x columns in the `self.__connection.execute(sql_table.insert().values(buffer))` case:

So IMHO this doesn’t scale, by definition - unless you have other mechanisms in place to avoid creating a prepared statement with an immense number of named/numbered parameters.
As I mentioned, I didn’t quite grasp the `buffer` and `buffer_size` parts of the code; maybe they keep the number of params under the 32766 limit? Will take a deeper look.

Just from looking at it I’d say you still run into the restriction with e.g. 33 columns and 1001 rows, since 33 * 1001 > 32766. Have to check…

I have yet to test with your sample spreadsheet. But looking at it, are you sure it actually applies 191 columns, since the column name “field12” is duplicated in columns L-Gi?
Unfortunately several “enterprise” systems ship with sqlite < 3.32.0 and thus have a drastically lower `SQLITE_MAX_VARIABLE_NUMBER` of 999:

So this may be a problem in an enterprise context where you might not easily be able to install newer versions. So `frictionless-py` might not be practically usable there unless you’re able to upgrade sqlite.

Both use prepared statements, but the currently implemented variant prepares statements where the number of used named/numbered parameters depends on both the number of columns and the number of rows.
I wonder how other databases handle a huge number of named/numbered parameters. Personally, I find it a bit surprising of SQLAlchemy to “create a dedicated variable for every single input value”. It feels a bit like having a function `f(x, y)` and, instead of invoking it with several `(x, y)` input tuples `[(1, 'a'), (2, 'b'), (3, 'c')]` like `f(1, 'a')`, `f(2, 'b')`, `f(3, 'c')`, rewriting it to a function `f_mod(x1, y1, x2, y2, x3, y3)` and calling `f_mod(1, 'a', 2, 'b', 3, 'c')`.

But I take it there might be valid reasons for doing that with `.insert().values(...)` (e.g. `executemany` is not performant).

Thanks @hjoukl!
@shashigharti can you please take a look?