scikit-learn: LinearRegression with zero sample_weights is not the same as excluding those rows

Describe the bug

Fitting LinearRegression with sample_weight set to 0 for some rows does not give the same coefficients as fitting with those rows removed.

Steps/Code to Reproduce

from collections import Counter
import numpy as np
from sklearn.linear_model import LinearRegression

results = []
for i in range(100):
    rng = np.random.RandomState(i)
    n_samples, n_features = 10, 5
    X = rng.rand(n_samples, n_features)
    y = rng.rand(n_samples)
    reg = LinearRegression()
    sample_weight = rng.uniform(low=0.01, high=2, size=X.shape[0])
    sample_weight_0 = sample_weight.copy()
    sample_weight_0[-5:] = 0
    y[-5:] *= 1000  # to make excluding those samples important
    reg.fit(X, y, sample_weight=sample_weight_0)
    coef_0, intercept_0 = reg.coef_.copy(), reg.intercept_

    reg.fit(X[:-5], y[:-5], sample_weight=sample_weight[:-5])
    print(f"{coef_0=}")
    print(f"{reg.coef_=}")
    results.append(np.allclose(reg.coef_, coef_0, rtol=1e-6))
print(Counter(results))

Expected Results

Always True.
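
The expectation follows from the weighted least-squares objective: a row with weight 0 contributes nothing to the weighted sum of squared residuals, so it should behave as if it were absent. The sketch below illustrates this with plain NumPy only. It is not scikit-learn's implementation; `wls_fit` is a hypothetical helper written for this illustration, and it uses 20 samples and 3 features (an assumption that differs from the script above) so that the system stays over-determined after the rows are dropped. In the script above, only 5 rows remain for 5 coefficients plus an intercept; whether that plays a role in the reported mismatch is not established here.

```python
import numpy as np


def wls_fit(X, y, sample_weight):
    """Hypothetical helper: weighted least squares with an intercept.

    Rescales each row by sqrt(weight) and solves the resulting ordinary
    least-squares problem with np.linalg.lstsq.
    """
    sw = np.sqrt(sample_weight)
    # Prepend a column of ones so the intercept is estimated jointly.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta[1:], beta[0]


rng = np.random.RandomState(0)
n_samples, n_features = 20, 3  # over-determined even after dropping 5 rows
X = rng.rand(n_samples, n_features)
y = rng.rand(n_samples)
y[-5:] *= 1000  # make the last rows matter if they are not ignored

sample_weight = rng.uniform(low=0.01, high=2, size=n_samples)
sample_weight_0 = sample_weight.copy()
sample_weight_0[-5:] = 0

# Fit once with zero weights on the last rows, once with those rows removed.
coef_zero, intercept_zero = wls_fit(X, y, sample_weight_0)
coef_drop, intercept_drop = wls_fit(X[:-5], y[:-5], sample_weight[:-5])

print(np.allclose(coef_zero, coef_drop))
print(np.isclose(intercept_zero, intercept_drop))
```

In this over-determined setting both fits solve the same normal equations, so the two solutions should agree up to floating-point error.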

Actual Results

Counter({True: 79, False: 21})  # it fails for 21 of the 100 seeds (about 20%)

For a seed where the comparison fails, the print statements give:

coef_0 =    array([ 1.43516166, -1.78826443,  0.15365526,  1.82233166, -1.6       ])
reg.coef_ = array([ 2.24022351, -1.04917851,  0.45341088,  0.80315086, -0.25726798])

Versions

System:
    python: 3.9.15 (main, Nov 15 2022, 05:24:15)  [Clang 14.0.0 (clang-1400.0.29.202)]
   machine: macOS-13.3.1-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.2.2
        numpy: 1.23.5
        scipy: 1.10.1
       Cython: 0.29.33

       user_api: blas
   internal_api: openblas
         prefix: libopenblas

About this issue

State: open
Created a year ago
Comments: 16 (16 by maintainers)

Most upvoted comments

I edited the script in the summary to show that it fails for about 20% of 100 seeds.

glemaitre on Apr 14, 2023