SALib: saltelli.sample returns the exact same samples several times

I recently upgraded to SALib 1.4.0.2 and noticed behaviour that looks incorrect to me. When using saltelli.sample, most of the returned samples are identical, which would mean that the model is evaluated several times with exactly the same input variables. Is this really how it should be?

Code from the SALib example:

from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    'num_vars': 3,
    'names': ['x1', 'x2', 'x3'],
    'bounds': [[-3.14159265359, 3.14159265359],
               [-3.14159265359, 3.14159265359],
               [-3.14159265359, 3.14159265359]]
}

x = saltelli.sample(problem, 2)

Output:

x = array([[-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [-3.14159265, -3.14159265, -3.14159265],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ]])

About this issue

State: closed
Created 3 years ago
Comments: 18 (16 by maintainers)

Most upvoted comments

You mean you’re unsure of what “skipping values” actually means?

The brief explanation I can offer is that:

The Sobol’ sequence is used to generate points in parameter space to sample from. It is a deterministic sequence.
A common practice was to skip the first few points to obtain a more uniform sampling.
This was shown to have an adverse effect (in the document I linked to above), actually increasing the number of samples needed to achieve convergence.
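
To make "deterministic" and "skipping" concrete, here is a minimal sketch. It uses the base-2 van der Corput sequence as a one-dimensional stand-in for a Sobol’ generator, so it is not SALib's implementation, and van_der_corput and take_points are hypothetical helper names introduced only for this illustration.

```python
def van_der_corput(index: int) -> float:
    """Return the base-2 radical inverse of a non-negative integer index."""
    value = 0.0
    denominator = 1.0
    while index > 0:
        denominator *= 2.0
        index, bit = divmod(index, 2)
        value += bit / denominator
    return value


def take_points(count: int, skip: int = 0) -> list:
    """Return `count` consecutive points, starting after the first `skip` points."""
    return [van_der_corput(i) for i in range(skip, skip + count)]


# The sequence is deterministic: the same call always yields the same points.
print(take_points(4))          # [0.0, 0.5, 0.25, 0.75]
# Skipping simply shifts the window along the same fixed sequence.
print(take_points(4, skip=1))  # [0.5, 0.25, 0.75, 0.125]
```

Skipping does not generate new points; it only changes which stretch of the fixed sequence is used.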

To avoid the duplicate samples, you can skip a given number of points in the Sobol’ sequence (using the skip_values argument). Avoiding the first point should avoid the duplicate samples.

The caveat is simply that, ideally:

Both the number of samples you want and the number of points to skip (given with skip_values) are powers of 2.
Failing the above, skip_values can be set to (2^n)-1, for an n of your choosing such that (2^n)-1 is <= the desired number of samples.

As I understand it, not following the above will still produce usable results (for some value of “usable”) but may take more samples than necessary (mentioned above).
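
The two guidelines above could be checked before sampling with something like the following sketch. It encodes only my reading of the caveat as stated here, and is_power_of_two and follows_guideline are hypothetical helpers, not part of SALib.

```python
def is_power_of_two(value: int) -> bool:
    """True for 1, 2, 4, 8, ...; False for zero and negative numbers."""
    return value > 0 and (value & (value - 1)) == 0


def follows_guideline(n_samples: int, skip_values: int) -> bool:
    """Check a (n_samples, skip_values) pair against the two guidelines."""
    # Preferred case: both values are powers of 2.
    if is_power_of_two(n_samples) and is_power_of_two(skip_values):
        return True
    # Fallback: skip_values has the form (2^n)-1 and does not exceed n_samples.
    # skip_values must be at least 1, otherwise no point is skipped at all.
    return (
        skip_values >= 1
        and is_power_of_two(skip_values + 1)
        and skip_values <= n_samples
    )


print(follows_guideline(1024, 1024))  # True: both are powers of 2
print(follows_guideline(1000, 511))   # True: 511 = (2^9)-1 and 511 <= 1000
print(follows_guideline(1000, 1000))  # False: neither guideline is met
```

A pair that fails the check is not necessarily unusable; as noted, it may just need more samples than necessary.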

As for removing samples, this is not recommended. As noted above, the Sobol’ sequence is deterministic so changing the samples destroys its structure, and I think the subsequent analysis likely won’t be usable. I recommend skipping values to avoid the duplicates rather than filtering.

ConnectedSystems on Jul 8, 2021

I think I understand you, but I also understand the confusion.

The entries of Table 2 in [1] are not samples; they are points in the Sobol’ sequence.

The first row shows that the coordinates of the first point are identical (all set to 0.5) for a 10-dimensional problem, and in fact for any number of dimensions.

As Campolongo et al. describe (in [1]): “As in the first points of the Sobol’ sequence the values of the coordinates tend to repeat (i.e. for the first point they are all equal to 0.5, for the second they are alternates couples of 0.25 and 0.75 and so on”

This repetition is what causes the initial samples to be identical:

“… in order to achieve different coordinates’ values for the points a and b, we need to generate a quasi-random matrix of Sobol’ numbers of size (R, 2k), with R > r, and discard the first few points for the auxiliary points …”
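
A simplified sketch of why repeated coordinates turn into repeated samples is given below. It assumes the usual Saltelli cross-sampling layout (a point a, the points with one coordinate swapped between a and b, then b) and is not SALib's actual code; saltelli_block is a hypothetical name, and the bounds are taken from the example in the question.

```python
def saltelli_block(base_point, bounds, second_order=True):
    """Expand one base point of length 2*D into a block of model inputs.

    The first D coordinates form point a, the last D form point b.
    """
    dims = len(bounds)
    if len(base_point) != 2 * dims:
        raise ValueError("base_point must have 2 * len(bounds) coordinates")
    a = list(base_point[:dims])
    b = list(base_point[dims:])

    rows = [a]
    # a with its i-th coordinate taken from b
    for i in range(dims):
        row = list(a)
        row[i] = b[i]
        rows.append(row)
    if second_order:
        # b with its i-th coordinate taken from a
        for i in range(dims):
            row = list(b)
            row[i] = a[i]
            rows.append(row)
    rows.append(b)

    # Scale each coordinate from [0, 1] to its parameter bounds.
    return [
        [low + u * (high - low) for u, (low, high) in zip(row, bounds)]
        for row in rows
    ]


bounds = [(-3.14159265359, 3.14159265359)] * 3

# A base point whose coordinates are all 0.5: a and b coincide.
repeated = saltelli_block([0.5] * 6, bounds)
print(len(repeated), len({tuple(row) for row in repeated}))  # 8 1

# A base point with distinct coordinates gives distinct rows.
distinct = saltelli_block([0.5, 0.25, 0.75, 0.125, 0.625, 0.375], bounds)
print(len(distinct), len({tuple(row) for row in distinct}))  # 8 8
```

When every coordinate of the base point is equal, a and b are the same point, so swapping coordinates between them changes nothing and the whole block of 2D+2 rows collapses to a single repeated sample, which matches the blocks of 8 identical rows in the output above.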

ConnectedSystems on Jul 14, 2021

Hmm, I think to avoid overloading the docs and confusing users I will simplify to outlining just one of the recommendations: that both skip_values and N be a power of 2, and that skip_values be >= N.

https://github.com/scipy/scipy/pull/10844#issuecomment-673029539

ConnectedSystems on Jul 13, 2021

Sure, I am happy to give feedback in the future as well. Thanks for the quick responses!

Maybe for a bit of context: I integrated SALib into our CLIMADA package to perform uncertainty and sensitivity analysis. CLIMADA can be used to model the impact and risk of natural catastrophes today and in the future.

chahank on Jul 12, 2021