dbt-core: [CT-260] [Bug] `generate_database_name` does not apply to sources

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When a generate_database_name override macro is defined, it is applied to models and seeds but not to sources.

Expected Behavior

Sources should take the overridden database name, just as models do.

This is either a big oversight, or it’s intentional for some reason. It’s quite a blocker for me, though. I’ve tried to work around it by adding some Jinja in my sources YAML block, but no luck so far. I’ve even tried calling macros from there, which I don’t think is possible.

I suppose leaving database explicitly unset might be an acceptable workaround, as sources would then take the database of the session, which would usually be correct.

Steps To Reproduce

  1. Set up a dbt project
  2. Configure a source and a model that consumes the source.
  3. Add a generate_database_name override macro. Here’s an example:
{% macro generate_database_name(custom_database_name=none, node=none) -%}

  {%- if custom_database_name is not none and node.resource_type in ['seed', 'model'] -%}
    {%- set error_message -%}
      {{ node.resource_type | capitalize }} '{{ node.unique_id }}' has a database configured. This is not allowed because it will be derived from the directory structure and target name.
    {%- endset -%}
    {{ exceptions.raise_compiler_error(error_message) }}
  {%- endif -%}

  {% set path = node.path %}
  {% set database = path.split('/')[0] %}
  {% if target.name in ['prod_us', 'prod_eu'] %}
    {{ database }}
  {% else %}
    {{ database + "_" + target.name }}
  {% endif %}

{%- endmacro %}

  4. Run dbt compile
  5. The compiled models will use whatever database is set in your profile, instead of the overridden one from the macro
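For what it’s worth, the branching inside the example macro from step 3 can be sketched in plain Python (a hypothetical stand-in for illustration, not dbt’s implementation): the first directory segment of node.path becomes the database, and the target name is suffixed on non-prod targets.

```python
# Hypothetical sketch of the example macro's logic, outside of Jinja.
def generate_database_name(node_path: str, target_name: str) -> str:
    # The first directory in the node's path is treated as the database name.
    database = node_path.split("/")[0]
    if target_name in ("prod_us", "prod_eu"):
        return database
    # Non-prod targets get the target name as a suffix.
    return database + "_" + target_name
```

So a model at reporting/orders.sql would resolve to reporting on prod_us and reporting_dev on a dev target.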

Relevant log output

No response

Environment

- OS: macOS
- Python: 3.7.9
- dbt: 

❯ dbt --version
installed version: 1.0.2
   latest version: 1.0.2

Up to date!

Plugins:
  - snowflake: 1.0.0 - Up to date!


What database are you using dbt with?

snowflake

Additional Context

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 19 (15 by maintainers)

Most upvoted comments

Jumping in with three quick thoughts, since there’s already some great conversation happening here:

  1. @nathan-protempo has the right sense for why dbt doesn’t support generate_database_name for sources by default. dbt assumes that (a) sources are preexisting tables with predetermined locations, not objects it is tasked with putting in place; and (b) those locations don’t vary across environments. The goal behind the default custom schema behavior is to avoid two developers putting the same model in the same namespace location and stepping on each other’s toes. The same logic explains why snapshots use target_database + target_schema configs instead of generate_X_name macros: the location should be the same for all users. Obviously, in many complex deployments, these objects can vary across environments; sources in dev/staging are sampled, cleansed of PII, perhaps even faked entirely.
  2. Because snapshots are written in dbt-SQL, users can still set target_database = generate_database_name('db') if they so choose. The default is justifiable, and the alternative is still possible. Unfortunately, dbt doesn’t support macro calls within .yml configuration today. (Exception: “late-rendered” config, such as hooks.) This is a capability we’ve continually worked toward, and we’ve laid a number of the foundational pieces. The reason we haven’t just done it: adding macros to the rendering contexts for configurations would be a significant performance hit at parse time. Could we add support for just the generate_X_name macros, since those already have special behavior during parsing? Maybe. That would at least make this possible:
# not possible today
sources:
  - name: blah
    database: "{{ generate_database_name('reporting') }}"
    schema: schema_name
    tables: [ ... ]
  3. We will soon be working to reconcile configurations for sources, so that they are at parity with other resource types (#3662). This is work we began ahead of v1.0 but de-prioritized relative to other changes. It will at least make the following behavior possible:
# dbt_project.yml

sources:
  my_project:
    all_my_sources:
      # logic is still duplicated, but compact - not scattered across different files
      reporting_sources:
        +database: "reporting{{ '_' + target.name if not target.name.startswith('prod') }}"
      other_org_sources:
        +database: "other_org{{ '_' + target.name if not target.name.startswith('prod') }}"
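One detail worth noting about the inline Jinja above: a conditional expression without an else renders as an empty string when the condition is false, so prod targets get the bare database name. The same selection logic, sketched as plain Python for illustration (the function name is made up):

```python
# Mirrors: "other_org{{ '_' + target.name if not target.name.startswith('prod') }}"
def source_database(prefix: str, target_name: str) -> str:
    # Jinja renders a false `if`-expression with no `else` as an empty string.
    suffix = "" if target_name.startswith("prod") else "_" + target_name
    return prefix + suffix
```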

I’ve just come across this behaviour too and was struggling to understand why sources weren’t changing their database value even though I configured the macro to do so.

In my case I want any source that uses a PROD_ database to use the DEV_ database instead, when the non-live profile is used.

-- generate_database_name.sql
-- Does not work for sources

{% macro generate_database_name(custom_database_name, node) -%}

    {%- set default_database = target.database -%}
    {%- if custom_database_name is none -%}
        {{ default_database }}
    {%- elif target.name == "non-live" and custom_database_name[:5] | upper == "PROD_" -%}
        {{ "DEV_" ~ custom_database_name[5:] }}
    {%- else -%}
        {{ custom_database_name | trim }}
    {%- endif -%}

{%- endmacro %}
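Restating the intent of that macro as plain Python (a hypothetical sketch, not dbt internals), to make the PROD_-to-DEV_ rewrite explicit:

```python
# Hypothetical sketch of the macro above: swap a PROD_ prefix for DEV_
# when running against the "non-live" target.
def generate_database_name(custom_database_name, target_name, target_database):
    if custom_database_name is None:
        # Fall back to the database from the active target/profile.
        return target_database
    if target_name == "non-live" and custom_database_name[:5].upper() == "PROD_":
        return "DEV_" + custom_database_name[5:]
    return custom_database_name.strip()
```

Of course, inside dbt this only fires for models and seeds, which is exactly the behavior reported in this issue.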

For now, the Jinja logic below works; it’s just not very user-friendly for developers when it comes to adding a new source. Would love for this to be a flag or setting in a future version.

# sources.yml
version: 2

sources:
  - name: source_name
    database: "{{ 'dev_database' if target.name == 'non-live' else 'prod_database' }}"

Similarly, you did say that dbt “doesn’t support generate_x_name for sources by default”, which implies it can be enabled. Where can I change that setting? (Perhaps there is no way to, and I’m just reading too much into the wording.)

Sorry, yes, no way to do this today. I meant option 2 to be an illustration of how you could someday reenable this, for a particular source, in the same way users have done with snapshots.

So option 4 might be an optional configuration that “enrolls” sources into the generate_X_name macros (or a dedicated source-specific version). Behind the scenes, this would need to look like running update_parsed_node_name over sources when they’re parsed.

For the short term, I still like the third option most. We’re planning to start work on that soon, with the aim of including it in v1.1. After that capability is unlocked, we can see if there remains a significant unmet need for more dynamic, macro-based configuration.

Good thoughts, thanks @jtcohen6. Of those three I am most a fan of the third option.

It’s not uncommon for people to use the same physical instance of a cloud database with separate logical (_qa kind of thing) databases, to save on $$. Also, it’s not uncommon for folks to do what I’m doing where separate sub-orgs within a company go in separate databases. I understand the assumption dbt is currently making that sources are expected to be in a known location, my point here is to call into question the validity of that assumption.

My main question for you is: why isn’t there a 4th option, a generate_X_name_sources (where X is database or schema)? The same exact idea as what we do for models, just applied to sources.

Similarly, you did say that dbt “doesn’t support generate_x_name for sources by default”, which implies it can be enabled. Where can I change that setting? (Perhaps there is no way to, and I’m just reading too much into the wording.)