superset: Snowflake: Inconsistent column name case

Superset handles the case of non-case-sensitive column names inconsistently for Snowflake connections:

  • Native tables/views, i.e., physical datasets, have lowercase column names, as it is internally handled by SQLAlchemy.
  • SQLLab’s virtual datasets have UPPERCASE column names.

This is troublesome for all dashboards where filters are present which act on charts with related, mixed physical/virtual datasets.
Superset’s filter scoping is case sensitive. Thus, filters on a certain column name will either be applied to the related physical or virtual datasets.

How to reproduce the bug

  1. Register any Snowflake table with non-case-sensitive column names (i.e., UPPERCASE) as a physical dataset in Superset.
  2. Notice how its columns are recognized and stored in lowercase.
  3. Edit the dataset again and convert it to a virtual dataset. Use select * from … as statement text. Save the changes.
  4. Edit the dataset again and synchronize the column names.
  5. Notice how the column names now changed to UPPERCASE.

Expected results

The case should not depend on whether the dataset is physical or virtual.
All non-case-sensitive (i.e., UPPERCASE in Snowflake) should be converted to lowercase consistently, also for virtual datasets created in SQLLab.

Lowercase is the internal representation of SQLALchemy.

Actual results

Virtual dataset’s column names are treated as UPPERCASE.

Screenshots

The easies way to experience this effect is to simply explore a Snowflake table in SQLLab. As you can see in the following screenshot, the column names that are extracted from the schema and displayed in the schema browser on the left are represented as lowercase, while the query results of the preview table have UPPERCASE column names:

image

Rejected workarounds

An option would of course be to use double-quoted lowercase column aliases in the SELECT statement of the virtual dataset. This would make these column names case-sensitive and thus be treated as lowercase. However, with a broad userbase working with Snowflake/Superset this is very unhandy.
Furthermore, this apporach would not allow any select * statements here.

I think it simply is a bug and it needs to be fixed so the way the column names are treated is always consistent.

Environment

Tested versions:

  • 1.0.1
  • 1.4 rc4

Checklist

Make sure to follow these steps before submitting your issue - thank you!

  • I have checked the superset logs for python stacktraces and included it here as text if there are any.
  • I have reproduced the issue with at least the latest released version of superset.
  • I have checked the issue tracker for the same issue and I haven’t found one similar.

Additional context

This SQLALchemy issue is closely related: https://github.com/snowflakedb/snowflake-sqlalchemy/issues/157#issuecomment-807922786

However, my feeling is that we should probably try to stick with SQLALchemy’s way of treating the column names as lowercase instead of uppercase. However, then we need to do so consistently and correct this for the virtual datasets.

tai pointed me in Slack to this piece of code: https://github.com/snowflakedb/snowflake-sqlalchemy/blob/9118cf8f18a0039f9cb5d3892ff2b1e5c82a05e0/snowdialect.py#L217

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 41 (31 by maintainers)

Most upvoted comments

@villebro thanks a lot for giving this issue some love 😃.

The suggested approach seems like a simple yet effective solution to me.

Although using the cursor to get the column names is a likely fix I do feel strongly that the database driver should not be applying any case transformations on the payload returned from the db.

Since @villebro did propose that as a solution to the issue, I agree, it would be good to his thoughts on the matter

Hi all, this change has broken all of our charts and filters due to the change from lowercase naming to uppercase. This impacted our existing old datasets when I clicked “Sync Columns From Source.”

Is there any guidance on how to make this backward compatible with existing assets? @betodealmeida @Vitor-Avila

I second @rumbin very much here, this is a simple and effective solution. Thinking about our Superset instance, this will definitely address the problems we have with dashboard filtering.

Thank you a million @villebro @rusackas, very much appreciate this.

We’ve been throwing around ideas about this with @rusackas for the last few weeks, and my proposal is to change the behavior going forward as follows:

  • virtual dataset logic remains unchanged
  • physical tables call denormalize_name on the column name. See example below:
>>> from snowflake.sqlalchemy.snowdialect import SnowflakeDialect
>>> dialect = SnowflakeDialect()
>>> dialect.requires_name_normalize
True
>>> dialect.denormalize_name('foo')
'FOO'
>>> dialect.denormalize_name('Foo')
'Foo'
>>> dialect.denormalize_name('FOO')
'FOO'

This means that going forward, caseless physical column names will be stored in UPPERCASE, rather than lowercase. We’ll also do the same for Oracle and the others that need to call the standard denormalize_name method:

>>> from sqlalchemy.dialects.oracle.base import OracleDialect
>>> dialect = OracleDialect()
>>> dialect.requires_name_normalize
True
>>> dialect.denormalize_name('foo')
'FOO'
>>> dialect.denormalize_name("Foo")
'Foo'
>>> dialect.denormalize_name("FOO")
'FOO'

This is in contrast to MSSQL and other dialects that don’t require name normalization:

>>> from sqlalchemy.dialects.mssql.base import MSDialect
>>> dialect = MSDialect()
>>> ms.requires_name_normalize
False
>>> ms.denormalize_name('foo')
'FOO'
>>> # this ☝️ surprised me!

So calling dialect.denormalize_name(col_name) if dialect.requires_name_normalize is True will ensure that physical and virtual datasets have the same case. Furthermore, I propose not doing any migrations to existing datasets, as the migration would be extremely risky, avoiding breaking any existing dashboards or charts. This means, that this change would only affect physical datasets created after the change. Therefore, anyone who wants to migrate old charts/dashboards to the new behavior will need to edit the dataset and click “Sync columns from source”:

image

After this the lowercase column names will become UPPERCASE:

image

Keep in mind, however, that this might break existing charts, as column references in both chart and dashboard native filter metadata might be pointing to the old normalized column name.

Thoughts @agusfigueroa-htg @rumbin @nytai ?

I can sadly confirm that the issue remains 😦

This issue should not be closed by the stale bot, in my eyes.