odbc-api: `panicked at 'range end index ... out of range for slice of length ...'`

thread '<unnamed>' panicked at 'range end index 10084237 out of range for slice of length 10084230', ...\odbc-api-0.40.1\src\buffers\text_column.rs:230:13

Was also getting this at a lower value with a smaller insert when I was initializing TextColumn with 0, 0.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 27 (8 by maintainers)

Most upvoted comments

New way to insert bulk: https://docs.rs/odbc-api/latest/odbc_api/guide/index.html#columnar-bulk-inserts
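
For orientation, here is a condensed sketch along the lines of that guide chapter. The table and column names are made up, and the identifiers (BufferDesc, into_column_inserter, as_text_view, set_cell) reflect my reading of a newer release of the docs, so they may differ in 0.42; treat this as an approximation rather than the exact API of your version:

use odbc_api::{buffers::BufferDesc, Connection, Error};

// Sketch only: check names and signatures against the guide for the release you use.
fn insert_names(conn: &Connection<'_>, names: &[&str]) -> Result<(), Error> {
    let prepared = conn.prepare("INSERT INTO MyTable (name) VALUES (?)")?;

    // Describe one text column with an upper bound for the value length in bytes.
    let descs = [BufferDesc::Text { max_str_len: 255 }];
    // The capacity must be able to hold the largest batch; here we insert everything at once.
    let capacity = names.len();
    // Allocate the columnar parameter buffers and bind them to the statement.
    let mut inserter = prepared.into_column_inserter(capacity, descs)?;
    inserter.set_num_rows(capacity);

    // Fill the buffer column by column.
    let mut col = inserter
        .column_mut(0)
        .as_text_view()
        .expect("column 0 was described as text");
    for (index, name) in names.iter().enumerate() {
        col.set_cell(index, Some(name.as_bytes()));
    }

    // Send the whole batch in one round trip.
    inserter.execute()?;
    Ok(())
}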

Probably only somewhat easier. 0.42.0 is a breaking release, but it closes a safety hole and is faster. Cheers, Markus

I will leave the issues open until I feel I have improved the accessibility of that subject somewhat.

Since a lot of it is driver dependent, it is hard to make definitive statements, but in general these hold true.

  • There is significant IO overhead per batch; larger batches mean less overhead per row.
  • This means that larger batch sizes yield diminishing returns: going from 100 to 1000 might bring only the same speedup as going from 10 to 100, for ten times the memory.
  • Do not go beyond 65536. See this issue with SAP Anywhere: https://github.com/pacman82/odbc-api/issues/93
  • Benchmark before you pick a value. If you cannot, because you are writing a library, do not take this option away from your users; it might make a big difference for them (see the sketch after this list).
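
As a minimal illustration of that last point, here is a hypothetical plain-Rust helper (not part of odbc-api) that leaves the batch size up to the caller instead of hard-coding it:

/// Hypothetical helper: let the caller choose (and benchmark) the batch size.
/// `insert_batch` stands in for whatever sends one chunk of rows to the database.
fn insert_in_batches<R, E>(
    rows: &[R],
    batch_size: usize, // caller-chosen; per the advice above, keep it at or below 65536
    mut insert_batch: impl FnMut(&[R]) -> Result<(), E>,
) -> Result<(), E> {
    for chunk in rows.chunks(batch_size) {
        insert_batch(chunk)?;
    }
    Ok(())
}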

Thanks for the feedback, by the way. It is helpful.

More questions than I can answer right now 😅 .

The most important thing to point out: these are bindings to the ODBC API. At some point they have to talk to a driver manager over a C interface. This crate tries to adhere to the zero overhead principle, so you should be able to write the best and fastest code you can. More specifically, it doesn’t perform unnecessary copies of the payload data. This implies laying the data out in memory exactly as the C interface commands it. This is why isize shows up for the indicator values in this crate.
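
To make the indicator point concrete, this is the contract the ODBC C interface imposes (plain illustration, not odbc-api code): next to each data buffer there is one signed length/indicator value per row, where -1 (SQL_NULL_DATA) marks NULL.

// Illustration of the ODBC indicator convention, not crate internals:
// one signed integer per row, holding the value length in bytes or SQL_NULL_DATA.
const SQL_NULL_DATA: isize = -1;

fn indicator_for(value: Option<&str>) -> isize {
    match value {
        Some(text) => text.len() as isize, // length of the payload in bytes
        None => SQL_NULL_DATA,             // NULL is signalled through the indicator
    }
}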

Wait… I’ve seen lots of unsafe code… Are you expecting unsafety to be used for this API? Is that why I’m struggling so much?

You are not required to write unsafe code, but there are escape hatches for power users.

Columnar is not a type. It means the data is laid out in columns, rather than being organized row by row. ODBC supports both, so this is something to clarify.
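
A plain-Rust picture of the difference (hypothetical types, just to show the two layouts):

// Row-by-row: one record per row, fields of different columns interleaved in memory.
struct RowWise {
    rows: Vec<(String, i32)>, // (name, year) pairs
}

// Columnar: one contiguous buffer per column; index i across all buffers forms row i.
struct Columnar {
    names: Vec<String>,
    years: Vec<i32>,
}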

I don’t know why but for some reason, some parts of this API just feel off…

Maybe it is due to it wrapping the ODBC C API? It is also lower level than you might be used to. You could read up on ODBC here: https://docs.microsoft.com/en-us/sql/odbc/reference/develop-app/developing-applications?view=sql-server-ver15

The issue is that I’m doing mass inserts with highly generic data. All I know is each value’s type; from which I can derive column types; from which I can create buffers of columns. I don’t know intimate details of each column and I do not seek to mutate.

buffer_from_description or ColumnarAnyBuffer::buffer_from_description fits this use case. I would suggest deriving the buffer description from the value’s type. https://docs.rs/odbc-api/latest/odbc_api/buffers/struct.BufferDescription.html

IMHO a common misconception is to model the input buffers after the DB schema, but in most use cases I would model them after the application input data. An example for illustration: if your input data does not contain any NULLs, the buffer does not need to support nullability, independent of the fact that the DB schema might.
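
A sketch of what deriving the description from the application's values could look like, under the BufferDescription/BufferKind API linked above. The Value enum, the max_str_len choice and the exact field and variant names are my assumptions; check them against the version you use:

use odbc_api::buffers::{BufferDescription, BufferKind};

// Hypothetical application value type: all we know is each value's type.
enum Value {
    Int(i64),
    Text(String),
}

fn description_for(sample: &Value, max_str_len: usize, input_may_be_null: bool) -> BufferDescription {
    let kind = match sample {
        Value::Int(_) => BufferKind::I64,
        Value::Text(_) => BufferKind::Text { max_str_len },
    };
    BufferDescription {
        // Model nullability after the input data, not after the DB schema: if the input
        // never contains NULLs, the buffer does not need to support them.
        nullable: input_may_be_null,
        kind,
    }
}

The resulting descriptions, together with a row capacity, would then go to buffer_from_description to allocate the columnar buffer.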

I sought out TextColumn after much frustration with my use case; I couldn’t find anything that seemed somewhat accurate/efficient/not requiring notable workarounds or extra work.

As I mentioned before, it clearly seems I need more documentation. Actually it is the other use case (knowing the table schema at compile time) which is not supported with specific features by this crate. So far none of the downstream artefacts odbc2parquet, odbcsv or arrow-odbc know anything about the shape of the data in advance. The first two can also take generic user data (provided in the form of csv or parquet files) and insert it into the database (with schema information unknown at compile time).

odbc2parquet will be a tough read due to some optimizations and peculiarities of parquet. The odbcsv sample linked above does very much what you want, though with text input only. It might be a good first iteration.

Personally I would suggest you start like this:

  1. Get a working implementation utilizing TextRowSet (which is just a ColumnarBuffer of TextColumns anyway).
  2. Generalize your implementation to ColumnarAnyBuffer, but still map every type to Text.
  3. Introduce new columnar buffer types one after the other, depending on your use case.

Speaking of your use case: where does your input data come from? If it is all text, stopping after 1. might be the best thing to do.
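
For step 1, a rough sketch of the whole flow could look like the following. The connection string, table and column names are placeholders, and the TextRowSet constructor name (from_max_str_lens) and its fallibility are as I read the 0.4x docs, so double-check them against your version:

use odbc_api::{buffers::TextRowSet, Environment};

fn insert_all_text(rows: &[Vec<Option<String>>]) -> Result<(), Box<dyn std::error::Error>> {
    let env = Environment::new()?;
    let conn = env.connect_with_connection_string("Driver={...};Server=...;")?;
    // One placeholder per column in a single prepared INSERT.
    let mut statement = conn.prepare("INSERT INTO MyTable (a, b) VALUES (?, ?)")?;

    let batch_size = 1000; // rows per round trip; benchmark this value
    // Row capacity plus an upper bound (in bytes) for the text length of each column.
    let mut buffer = TextRowSet::from_max_str_lens(batch_size, [255usize, 255].iter().copied())?;

    for row in rows {
        if buffer.num_rows() == batch_size {
            statement.execute(&buffer)?; // batch is full: send it and start over
            buffer.clear();
        }
        buffer.append(row.iter().map(|field| field.as_deref().map(str::as_bytes)));
    }
    if buffer.num_rows() > 0 {
        statement.execute(&buffer)?; // remainder that did not fill a whole batch
    }
    Ok(())
}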

Unless you are providing your own buffer implementation, your code should have no need to interact with TextColumn directly. A TextColumn is a very low level class which does not even know how many elements it holds. Just use the higher level ColumnarAnyBuffer or TextRowSet buffers. They will instantiate and manage these TextColumns for you.

Happy to preallocate and chunk, but I tried doing this and received the original error of this issue; not sure why.

You can test whether my suggestions would work by just running the odbcsv tool against your database. If it doesn’t work and panics, please open a bug. My guess, though, is that the original error is caused by a misunderstanding of what TextColumn is and how it works.

Chunking currently works in this library like this:

  1. Fill the ColumnarBuffer (not TextColumn) with a chunk of data.
  2. Execute the statement with the buffer.
  3. Clear the buffer and repeat for the next chunk.

Here is how row wise chunking is performed in the odbcsv tool (a chunk is called ‘batch’ here):

for try_record in reader.into_byte_records() {
    if buffer.num_rows() == *batch_size as usize {
        num_batch += 1;
        // Batch is full. We need to send it to the data base and clear it, before we insert
        // more rows into it.
        statement.execute(&buffer)?;
        info!(
            "Insert batch {} with {} rows into DB.",
            num_batch, batch_size
        );
        buffer.clear();
    }

    let record = try_record?;
    buffer.append(
        record
            .iter()
            .map(|field| if field.is_empty() { None } else { Some(field) }),
    );
}

There is actually an inefficiency of this crate hidden here: it currently forces users to rebind the buffer in safe code. I am working on it, but this is likely fast enough for now.

Hope this helps.

Cheers, Markus

FYI: The first argument in the constructor of TextColumn refers to the number of rows you can put into the buffer at once. Yet there should not even be a need to instantiate it manually just for inserting values.
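
For illustration only (again, the higher level buffers normally create these columns for you), here is a hypothetical construction under that reading of the constructor; the exact signature is my assumption, so check it against your version. It also hints at where the original panic could come from: a column created with a capacity of 0 has no room for any value, so writing into it has to index past the end of its backing buffer.

use odbc_api::buffers::TextColumn;

fn main() {
    // Hypothetical: first argument = rows per batch, second = maximum value length in bytes.
    let _names: TextColumn<u8> = TextColumn::new(1000, 255);
    // By contrast, TextColumn::new(0, 0) would allocate room for nothing, so any later
    // write would have to index past the end of the backing buffer.
}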