scikit-learn: LinearRegression with zero sample_weights is not the same as excluding those rows
Describe the bug
Fitting LinearRegression with sample_weight == 0 on some rows does not give the same results as excluding those rows.
Steps/Code to Reproduce
from collections import Counter

import numpy as np
from sklearn.linear_model import LinearRegression

results = []
for i in range(100):
    rng = np.random.RandomState(i)
    n_samples, n_features = 10, 5
    X = rng.rand(n_samples, n_features)
    y = rng.rand(n_samples)
    reg = LinearRegression()
    sample_weight = rng.uniform(low=0.01, high=2, size=X.shape[0])
    sample_weight_0 = sample_weight.copy()
    sample_weight_0[-5:] = 0
    y[-5:] *= 1000  # to make excluding those samples important
    # Fit on all rows, with the last 5 rows given zero weight
    reg.fit(X, y, sample_weight=sample_weight_0)
    coef_0, intercept_0 = reg.coef_.copy(), reg.intercept_
    # Fit again with the last 5 rows removed
    reg.fit(X[:-5], y[:-5], sample_weight=sample_weight[:-5])
    print(f"{coef_0=}")
    print(f"{reg.coef_=}")
    results.append(np.allclose(reg.coef_, coef_0, rtol=1e-6))
print(Counter(results))
Expected Results
Always True.
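The expectation follows from the weighted least-squares objective, sum_i w_i * (y_i - b - x_i @ coef)**2, in which a row with w_i = 0 contributes nothing. A minimal NumPy sketch of that equivalence is given here. It is not scikit-learn's solver: weighted_lstsq is a hypothetical helper written for illustration, and n_samples = 20 is an assumption chosen so that 15 rows remain for 6 unknowns (5 coefficients plus the intercept) and the solution is unique. The report itself keeps only 5 rows for the same 6 unknowns, so this sketch illustrates the expected behavior without reproducing or explaining the reported mismatch.

```python
import numpy as np


def weighted_lstsq(X, y, sample_weight):
    """Hypothetical helper: weighted least squares with an intercept.

    Minimizes sum_i w_i * (y_i - b - x_i @ coef)**2 by scaling each row
    with sqrt(w_i) and solving with np.linalg.lstsq.
    """
    sqrt_w = np.sqrt(sample_weight)
    # Prepend a column of ones so the intercept is fitted explicitly.
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    params, *_ = np.linalg.lstsq(A * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return params[1:], params[0]  # coef, intercept


rng = np.random.RandomState(0)
n_samples, n_features = 20, 5  # assumption: more rows than in the report
X = rng.rand(n_samples, n_features)
y = rng.rand(n_samples)
sample_weight = rng.uniform(low=0.01, high=2, size=n_samples)
sample_weight_0 = sample_weight.copy()
sample_weight_0[-5:] = 0
y[-5:] *= 1000  # same perturbation as in the report

# Zero-weighted rows versus the same rows removed.
coef_zero, intercept_zero = weighted_lstsq(X, y, sample_weight_0)
coef_drop, intercept_drop = weighted_lstsq(X[:-5], y[:-5], sample_weight[:-5])

coefs_match = np.allclose(coef_zero, coef_drop, rtol=1e-6)
intercepts_match = np.isclose(intercept_zero, intercept_drop, rtol=1e-6)
print(coefs_match, intercepts_match)
```

The same comparison against LinearRegression could be made by swapping weighted_lstsq for two reg.fit calls, as in the reproduction script above.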
Actual Results
Counter({True: 79, False: 21}) # it fails about 20% of the time (21 of 100 seeds)
The print statement gives:
coef_0 = array([ 1.43516166, -1.78826443, 0.15365526, 1.82233166, -1.6 ])
reg.coef_ = array([ 2.24022351, -1.04917851, 0.45341088, 0.80315086, -0.25726798])
Versions
System:
python: 3.9.15 (main, Nov 15 2022, 05:24:15) [Clang 14.0.0 (clang-1400.0.29.202)]
machine: macOS-13.3.1-x86_64-i386-64bit
Python dependencies:
sklearn: 1.2.2
numpy: 1.23.5
scipy: 1.10.1
Cython: 0.29.33
user_api: blas
internal_api: openblas
prefix: libopenblas
About this issue
- State: open
- Created a year ago
- Comments: 16 (16 by maintainers)
I edited the script in the summary to show that it fails about 20% of the time across 100 seeds.