esp-idf: ESP32-S3 floating point incorrectly computes `nan` or `inf` when pinned to a core (IDFGH-9930)
Answers checklist.
- I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
- I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
- I have searched the issue tracker for a similar issue and not found a similar issue.
IDF version.
v5.1-dev-4692-g7bd0fd4abd
Operating System used.
Windows
How did you build your project?
Command line with Make
If you are using Windows, please specify command line type.
None
Development Kit.
ESP32-S3-WROOM-1-N16R2
Power Supply used.
USB
What is the expected behavior?
Computing a valid floating-point expression on finite inputs should produce a finite result.
What is the actual behavior?
I have a circular buffer containing data from an ADC, and I process the buffered data in a separate thread.
However, the processing thread sometimes computes an invalid nan or inf result from normal inputs.
For example:
for (int i = 0; i < n; ++i) {
    const auto& el = ads1247.data_buffer[i];
    auto v = (float)el.value / (float)(1UL << el.gain);
    data[i] = v;
    auto scaled = scaler(v) / (float)led_level;
    if (!std::isfinite(scaled)) {
        float recalculate = (float)el.value / (float)(1UL << el.gain);
        LE("Invalid data @{}: scaled={}, v={}, value = {}, gain={}, recalculate={}", i, scaled, v, el.value, el.gain, recalculate);
        LE("data_buffer contents:");
        for (int j = 0; j < n; ++j) {
            const auto& d = ads1247.data_buffer[j];
            LE("Item [{}]: gain={}\tvalue={}\traw={}", j, d.gain, d.value, *reinterpret_cast<const int*>(&d.value));
        }
        LE("(float)el.value={}", (float)el.value);
        LE("(float)(1UL << el.gain)={}", (float)(1UL << el.gain));
        LE("v={}", (float)el.value / (float)(1UL << el.gain));
        LE("scaler(v)={}", scaler(v));
        LE("scaler(v) / (float)led_level={}", scaler(v) / (float)led_level);
        LE("scaler.multiplier={}", scaler.multiplier);
        LE("scaler.output_max_value={}", scaler.output_max_value);
        LE("scaler.output_min_value={}", scaler.output_min_value);
        LE("scaler.input_min_value={}", scaler.input_min_value);
        abort();
    } else if (i == n - 1) {
        LI("Good data: scaled={}, v={}, value = {}, gain={}, recalculate={}", scaled, v, el.value, el.gain, (float)el.value / (float)(1UL << el.gain));
    }
}
It triggers the erroneous branch randomly, giving the following output:
E (14:55:28.576) turbidity: Invalid data @77: scaled=inf, v=inf, value = 5086020, gain=6, recalculate=inf
E (14:55:28.577) turbidity: data_buffer contents:
****Item output removed for simplicity*****
E (14:55:28.607) turbidity: (float)el.value=5086020
E (14:55:28.607) turbidity: (float)(1UL << el.gain)=64
E (14:55:28.607) turbidity: v=79469.06
E (14:55:28.607) turbidity: scaler(v)=inf
E (14:55:28.608) turbidity: scaler(v) / (float)led_level=inf
E (14:55:28.608) turbidity: scaler.multiplier=0.10319918
E (14:55:28.608) turbidity: scaler.output_max_value=100
E (14:55:28.609) turbidity: scaler.output_min_value=0
E (14:55:28.609) turbidity: scaler.input_min_value=500
Note: the scaler is a linear mapping function adapted from etl. It computes the output as float((value - input_min_value) * multiplier) + output_min_value;
The floating-point arithmetic does not involve any invalid data or invalid operation, yet it somehow computes inf. Normally, the result should be
v = 79469.06
scaler(v) = ((79469.06 - 500) * 0.10319918) + 0 = 8149.54223737
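For clarity, a minimal sketch of what such a linear scaler might look like, using the field names from the log output (multiplier, input_min_value, output_min_value, output_max_value); this is a hypothetical reconstruction, not the actual etl implementation:
struct LinearScaler {
    // Hypothetical fields, named after the values printed in the log above.
    float input_min_value;
    float output_min_value;
    float output_max_value;
    float multiplier;  // e.g. (output span) / (input span)

    float operator()(float value) const {
        // Plain finite arithmetic: with the logged values,
        // (79469.06 - 500) * 0.10319918 + 0 ≈ 8149.54, which should never be inf.
        return (value - input_min_value) * multiplier + output_min_value;
    }
};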
I made many attempts to find the root cause of this random error, and it turns out to depend on how the processing thread is created. If I pin the thread to core 0 with esp_pthread_cfg_t thread_cfg{4096, 1, true, "process", 0}; the problem occurs. If I pin the thread to core 1 with esp_pthread_cfg_t thread_cfg{4096, 1, true, "process", 1}; the problem is gone. Why would the core affinity affect the floating-point calculation result? I know that the FPU context is lazily switched, which could potentially cause inconsistencies between threads. Are there any precautions I need to take to make sure the floating-point results are valid and consistent?
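For reference, a minimal sketch of how the thread configuration above is applied, assuming the standard esp_pthread API (esp_pthread_set_cfg) and the aggregate field order stack_size, prio, inherit_cfg, thread_name, pin_to_core; the function name and the lambda body are placeholders, not the actual project code:
#include <thread>
#include "esp_pthread.h"

void start_processing_thread() {
    // Aggregate order: stack_size, prio, inherit_cfg, thread_name, pin_to_core.
    // Pinning to core 0 reproduces the inf/nan results; pinning to core 1 does not.
    esp_pthread_cfg_t thread_cfg{4096, 1, true, "process", 1};  // workaround: core 1
    esp_pthread_set_cfg(&thread_cfg);  // applies to threads created after this call

    std::thread([] {
        // ... process the ADC circular buffer here ...
    }).detach();
}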
Steps to reproduce.
- The problem is a bit too random to reproduce
Debug Logs.
No response
More Information.
No response
About this issue
- State: closed
- Created a year ago
- Comments: 23 (12 by maintainers)
This is really good news, because a regression is by nature easier to fix than a very old problem that was never seen before. Nevertheless, it can still be hard to fix!
For now, you have a workaround to run your code.