dowhy: `distance_matching` estimator raises an exception when binary treatment is encoded using `int`s
Describe the bug
When using the `distance_matching` estimator with a binary treatment encoded as `int` (1s and 0s rather than `True` and `False`), DoWhy raises the following exception:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Input In [418], in <cell line: 2>()
1 # Get estimate (Linear Regression)
----> 2 estimate = model.estimate_effect(
3 identified_estimand=estimand,
4 method_name="backdoor.distance_matching",
5 target_units="ate",
6 method_params={'distance_metric': "minkowski", 'p':2})
File ~\anaconda3\envs\causal_book_py38\lib\site-packages\dowhy\causal_model.py:297, in CausalModel.estimate_effect(self, identified_estimand, method_name, control_value, treatment_value, test_significance, evaluate_effect_strength, confidence_intervals, target_units, effect_modifiers, fit_estimator, method_params)
295 if method_params is None:
296 method_params = {}
--> 297 self.causal_estimator = causal_estimator_class(
298 self._data,
299 identified_estimand,
300 self._treatment, self._outcome, #names of treatment and outcome
301 control_value = control_value,
302 treatment_value = treatment_value,
303 test_significance=test_significance,
304 evaluate_effect_strength=evaluate_effect_strength,
305 confidence_intervals = confidence_intervals,
306 target_units = target_units,
307 effect_modifiers = effect_modifiers,
308 **method_params,
309 **extra_args)
310 else:
311 # Estimator had been computed in a previous call
312 assert self.causal_estimator is not None
File ~\anaconda3\envs\causal_book_py38\lib\site-packages\dowhy\causal_estimators\distance_matching_estimator.py:45, in DistanceMatchingEstimator.__init__(self, num_matches_per_unit, distance_metric, exact_match_cols, *args, **kwargs)
43 error_msg = "Distance Matching method is applicable only for binary treatments"
44 self.logger.error(error_msg)
---> 45 raise Exception(error_msg)
47 self.num_matches_per_unit = num_matches_per_unit
48 self.distance_metric = distance_metric
Exception: Distance Matching method is applicable only for binary treatments
Note that when using the same dataset with a binary treatment encoded as bool no error occurs.
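The difference between the two encodings shows up in the column dtype alone. A minimal sketch, assuming (the traceback above does not confirm this) that the estimator's check looks at the dtype rather than at the values; the series names are made up for illustration:

```python
import pandas as pd
from pandas.api import types as ptypes

# The same binary treatment, once as 0/1 integers and once as booleans
treatment_int = pd.Series([1, 0, 1, 0])
treatment_bool = treatment_int.astype(bool)

# A dtype-based check tells the two apart...
int_is_bool = ptypes.is_bool_dtype(treatment_int)
bool_is_bool = ptypes.is_bool_dtype(treatment_bool)

# ...even though they compare equal element by element
same_values = bool((treatment_int == treatment_bool).all())

print(int_is_bool, bool_is_bool, same_values)  # False True True
```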
Steps to reproduce the behavior
from scipy import stats
import numpy as np
import pandas as pd
import dowhy
from dowhy import CausalModel
# Generate the data
SAMPLE_SIZE = 10_000
MAX_AGE = 50
age = stats.halfnorm.rvs(loc=19, scale=10, size=SAMPLE_SIZE).astype(int)
age = np.where(age > MAX_AGE, np.random.choice(np.arange(20, MAX_AGE)), age)
took_a_course = stats.bernoulli(p=10/age).rvs()#.astype(bool)
earnings = 75000 + took_a_course * 10000 + age * 1000 + age**2 * 50 + np.random.randn(SAMPLE_SIZE) * 2000
earnings = earnings.round()
# Construct the graph (the graph is constant for all iterations)
nodes = ['took_a_course', 'earnings', 'age']
edges = [
    ('took_a_course', 'earnings'),
    ('age', 'took_a_course'),
    ('age', 'earnings')
]
# Generate the GML graph
gml_string = 'graph [directed 1\n'
for node in nodes:
    gml_string += f'\tnode [id "{node}" label "{node}"]\n'
for edge in edges:
    gml_string += f'\tedge [source "{edge[0]}" target "{edge[1]}"]\n'
gml_string += ']'
data = pd.DataFrame(dict(
    age=age,
    took_a_course=took_a_course,
    earnings=earnings
))
# Instantiate the CausalModel
model = CausalModel(
    data=data,
    treatment='took_a_course',
    outcome='earnings',
    graph=gml_string
)
# Get the estimand
estimand = model.identify_effect()
# Get estimate (distance matching)
estimate = model.estimate_effect(
    identified_estimand=estimand,
    method_name="backdoor.distance_matching",
    target_units="ate",
    method_params={'distance_metric': "minkowski", 'p': 2})
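Until the check is relaxed, a possible workaround is to cast the treatment column to bool before constructing the `CausalModel`, which is what the commented-out `.astype(bool)` in the script above would do. A minimal pandas-only sketch with a tiny stand-in DataFrame (the values are invented for illustration; only the column names follow the script):

```python
import pandas as pd

# Stand-in for the generated dataset; the numbers are illustrative only
data = pd.DataFrame({
    "age": [25, 31, 40, 22],
    "took_a_course": [1, 0, 0, 1],
    "earnings": [131250.0, 154050.0, 195000.0, 121200.0],
})

# Re-encode the 0/1 treatment as bool before handing the frame to DoWhy
data["took_a_course"] = data["took_a_course"].astype(bool)

print(data["took_a_course"].dtype)  # bool
```

The re-encoded frame can then be passed to `CausalModel` in place of the original one.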
Expected behavior
The encoding used for a binary treatment should not matter. Alternatively, it should be stated explicitly that only bool-typed values are accepted, and the error message should reflect that.
Version information:
- DoWhy version 0.8
Additional context
About this issue
- State: open
- Created 2 years ago
- Comments: 21 (21 by maintainers)
Hey @AlxndrMlk, that’s okay. For a lot of the earlier code, the auto linters were not activated, so we’ve adopted a policy of updating files as needed in relevant PRs. No worries about the additional files added to the PR; since they are just formatting changes, they should be easy to review.
This guide has some pointers to the lint/code-formatting checkers in the PR checklist: https://github.com/py-why/dowhy/blob/main/docs/source/contributing/contributing-code.rst
Other than that, I would suggest simply running all the tests. Since this change affects binary versus float treatment, you may want to add a test that includes a float treatment and passes it to one of the propensity methods.
Yes, currently distance matching and the three propensity-based estimators check for binary treatment. It would make sense to modify the behavior for all methods that do a binary treatment check.
Thanks for clarifying. My first read was that this was a broader data preprocessing / validation problem. This targeted change sounds good to me.
@AlxndrMlk thanks for the thoughtful reply. Agree that we should be consistent with EconML as far as possible.
Sounds good. So if a user provides bool, then that’s great. But if they provide int/float, then we check the values as you write above. If they provide any other values apart from 0/1 (e.g., -1/1), then we raise an error. I think this is reasonable.
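The rule agreed on above could be sketched as follows. This is an illustration only, not DoWhy's actual implementation: `coerce_binary_treatment` is a hypothetical helper name, and converting valid 0/1 columns to bool (rather than leaving them untouched) is an assumption.

```python
import pandas as pd
from pandas.api import types as ptypes


def coerce_binary_treatment(treatment: pd.Series) -> pd.Series:
    """Return the treatment as a bool Series, or raise if it is not binary.

    Hypothetical helper: bool passes through, int/float is accepted only
    when every value is 0 or 1, and anything else raises a ValueError.
    """
    if ptypes.is_bool_dtype(treatment):
        return treatment
    if ptypes.is_numeric_dtype(treatment):
        # NaN is not in {0, 1}, so columns with missing values are rejected too
        if set(treatment.unique()) <= {0, 1}:
            return treatment.astype(bool)
    raise ValueError(
        "Binary treatment must be bool, or int/float containing only 0 and 1; "
        f"got dtype {treatment.dtype}"
    )


print(coerce_binary_treatment(pd.Series([1, 0, 1])).tolist())   # [True, False, True]
print(coerce_binary_treatment(pd.Series([1.0, 0.0])).tolist())  # [True, False]

try:
    coerce_binary_treatment(pd.Series([-1, 1]))
except ValueError as err:
    print("rejected:", err)
```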
@emrekiciman this will be a change to DoWhy’s built-in estimators (e.g., the propensity-based ones) that raise an error if the treatment is not of binary type. Currently, we have an explicit raise that prevents someone from passing in an int, and @AlxndrMlk is suggesting that we remove this super-strict check.
Great, adding to the calendar and will keep you posted.
Oh, I see, that’s a valid concern. If you don’t mind, I am happy to look into it and discuss this with you further if we find a sensible solution.