dowhy: `distance_matching` estimator raises an exception when binary treatment encoded using `int`s

Describe the bug When using distance_matching estimator with binary treatment encoded as int type (1s and 0s rather than True and False), DoWhy throws the following exception:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Input In [418], in <cell line: 2>()
      1 # Get estimate (Linear Regression)
----> 2 estimate = model.estimate_effect(
      3     identified_estimand=estimand,
      4     method_name="backdoor.distance_matching",
      5     target_units="ate",
      6     method_params={'distance_metric': "minkowski", 'p':2})

File ~\anaconda3\envs\causal_book_py38\lib\site-packages\dowhy\causal_model.py:297, in CausalModel.estimate_effect(self, identified_estimand, method_name, control_value, treatment_value, test_significance, evaluate_effect_strength, confidence_intervals, target_units, effect_modifiers, fit_estimator, method_params)
    295     if method_params is None:
    296         method_params = {}
--> 297     self.causal_estimator = causal_estimator_class(
    298         self._data,
    299         identified_estimand,
    300         self._treatment, self._outcome, #names of treatment and outcome
    301         control_value = control_value,
    302         treatment_value = treatment_value,
    303         test_significance=test_significance,
    304         evaluate_effect_strength=evaluate_effect_strength,
    305         confidence_intervals = confidence_intervals,
    306         target_units = target_units,
    307         effect_modifiers = effect_modifiers,
    308         **method_params,
    309         **extra_args)
    310 else:
    311     # Estimator had been computed in a previous call
    312     assert self.causal_estimator is not None

File ~\anaconda3\envs\causal_book_py38\lib\site-packages\dowhy\causal_estimators\distance_matching_estimator.py:45, in DistanceMatchingEstimator.__init__(self, num_matches_per_unit, distance_metric, exact_match_cols, *args, **kwargs)
     43     error_msg = "Distance Matching method is applicable only for binary treatments"
     44     self.logger.error(error_msg)
---> 45     raise Exception(error_msg)
     47 self.num_matches_per_unit = num_matches_per_unit
     48 self.distance_metric = distance_metric

Exception: Distance Matching method is applicable only for binary treatments

Note that when using the same dataset with a binary treatment encoded as bool no error occurs.

Steps to reproduce the behavior

form scipy import stats
import numpy as np
import pandas as pd
import dowhy
from dowhy import CausalModel

# Generate the data
SAMPLE_SIZE = 10_000
MAX_AGE = 50

age = stats.halfnorm.rvs(loc=19, scale=10, size=SAMPLE_SIZE).astype(int)
age = np.where(age > MAX_AGE, np.random.choice(np.arange(20, MAX_AGE)), age)

took_a_course = stats.bernoulli(p=10/age).rvs()#.astype(bool)

earnings = 75000 + took_a_course * 10000 + age * 1000 + age**2 * 50 + np.random.randn(SAMPLE_SIZE) * 2000
earnings = earnings.round()


# Construct the graph (the graph is constant for all iterations)
nodes = ['took_a_course', 'earnings', 'age']
edges = [
    ('took_a_course', 'earnings'),
    ('age', 'took_a_course'),
    ('age', 'earnings')
]

# Generate the GML graph
gml_string = 'graph [directed 1\n'

for node in nodes:
    gml_string += f'\tnode [id "{node}" label "{node}"]\n'

for edge in edges:
    gml_string += f'\tedge [source "{edge[0]}" target "{edge[1]}"]\n'
    
gml_string += ']'

data = pd.DataFrame(dict(
    age=age,
    took_a_course=took_a_course,
    earnings=earnings
))


# Instantiate the CausalModel 
model = CausalModel(
    data=data,
    treatment='took_a_course',
    outcome='earnings',
    graph=gml_string
)


# Get the estimand
estimand = model.identify_effect()

# Get estimate (Linear Regression)
estimate = model.estimate_effect(
    identified_estimand=estimand,
    method_name="backdoor.distance_matching",
    target_units="ate",
    method_params={'distance_metric': "minkowski", 'p':2})

Expected behavior It should not matter what encoding we use for binary treatment or it should be explicitly specified that only bool-typed values are accepted and error message should reflect that.

Version information:

  • DoWhy version 0.8

Additional context

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

Hey @AlxndrMlk that’s okay. for a lot of the earlier code, the auto linters were not activated.

So we’ve taken a policy of updating them as needed in relevant PRs. No worries on the additional files added to the PR–since they are just formatting changes they should be easy to review.

This guide has some pointers to lint/code formatting checkers in PR checklist: https://github.com/py-why/dowhy/blob/main/docs/source/contributing/contributing-code.rst Other than that, I would suggest simply running all the tests. Since this change affects binary versus float treatment, you may want to add a test that includes a float treatment and pass it to one of the propensity methods

yes, currently distance-matching and the three propensity based estimators check for binary treatment. Would make sense to modify the behavior for all such methods that do a binary treatment check.

Thanks for clarifying. My first read was that this was a broader data preprocessing / validation problem. This targeted change sounds good to me.

@AlxndrMlk thanks for the thoughtful reply. Agree that we should be consistent with EconML as far as possible.

My proposal is that we add explicit checks to make sure that if the treatment is int / float it contains only 0 and 1 (possibly we could also raise a warning when there are only 0s or only 1s). I believe that this gives the user more consistency and more flexibility, translating into a smoother user experience.

Sounds good. So if a user provides bool, then that’s great. But if they provide int/float, then we check the values as you write above. If they provide any other values apart from 0/1 (e.g., -1/1), then we raise an error. I think this is reasonable.

@emrekiciman this will be a change to dowhy’s builtin estimators (e.g., propensity based) that raise an error if the treatment is not binary type. Currently, we have an explicit raise Error that precludes someone from passing in an int, and @AlxndrMlk is suggesting to remove this super-strict check.

PR in early Jan sounds good!

Great, adding to the calendar and will keep you posted.

We can consider adding a simple check, but the concern was computational efficiency.

Oh, I see, that’s a valid concern. If you don’t mind, I am happy to look into it and discuss this with you further if we find a sensible solution.