esp-idf: ESP32-S3 floating point incorrectly computes `nan` or `inf` when pinned to a core (IDFGH-9930)
Answers checklist.
- I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
- I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
- I have searched the issue tracker for a similar issue and not found a similar issue.
IDF version.
v5.1-dev-4692-g7bd0fd4abd
Operating System used.
Windows
How did you build your project?
Command line with Make
If you are using Windows, please specify command line type.
None
Development Kit.
ESP32-S3-WROOM-1-N16R2
Power Supply used.
USB
What is the expected behavior?
Computing a valid floating-point expression on finite inputs should produce a finite result.
What is the actual behavior?
I have a circular buffer containing data from an ADC, and I process the buffered data in a separate thread.
However, the processing thread sometimes computes an invalid nan or inf result from normal inputs.
For example:
for (int i = 0; i < n; ++i) {
    const auto& el = ads1247.data_buffer[i];
    auto v = (float)el.value / (float)(1UL << el.gain);
    data[i] = v;
    auto scaled = scaler(v) / (float)led_level;
    if (!std::isfinite(scaled)) {
        float recalculate = (float)el.value / (float)(1UL << el.gain);
        LE("Invalid data @{}: scaled={}, v={}, value = {}, gain={}, recalculate={}", i, scaled, v, el.value, el.gain, recalculate);
        LE("data_buffer contents:");
        for (int j = 0; j < n; ++j) {
            const auto& d = ads1247.data_buffer[j];
            LE("Item [{}]: gain={}\tvalue={}\traw={}", j, d.gain, d.value, *reinterpret_cast<const int*>(&d.value));
        }
        LE("(float)el.value={}", (float)el.value);
        LE("(float)(1UL << el.gain)={}", (float)(1UL << el.gain));
        LE("v={}", (float)el.value / (float)(1UL << el.gain));
        LE("scaler(v)={}", scaler(v));
        LE("scaler(v) / (float)led_level={}", scaler(v) / (float)led_level);
        LE("scaler.multiplier={}", scaler.multiplier);
        LE("scaler.output_max_value={}", scaler.output_max_value);
        LE("scaler.output_min_value={}", scaler.output_min_value);
        LE("scaler.input_min_value={}", scaler.input_min_value);
        abort();
    } else if (i == n - 1) {
        LI("Good data: scaled={}, v={}, value = {}, gain={}, recalculate={}", scaled, v, el.value, el.gain, (float)el.value / (float)(1UL << el.gain));
    }
}
It triggers the erroneous branch randomly, giving the following output:
E (14:55:28.576) turbidity: Invalid data @77: scaled=inf, v=inf, value = 5086020, gain=6, recalculate=inf
E (14:55:28.577) turbidity: data_buffer contents:
****Item output removed for simplicity*****
E (14:55:28.607) turbidity: (float)el.value=5086020
E (14:55:28.607) turbidity: (float)(1UL << el.gain)=64
E (14:55:28.607) turbidity: v=79469.06
E (14:55:28.607) turbidity: scaler(v)=inf
E (14:55:28.608) turbidity: scaler(v) / (float)led_level=inf
E (14:55:28.608) turbidity: scaler.multiplier=0.10319918
E (14:55:28.608) turbidity: scaler.output_max_value=100
E (14:55:28.609) turbidity: scaler.output_min_value=0
E (14:55:28.609) turbidity: scaler.input_min_value=500
Note: the scaler is a linear mapping function adapted from etl. It computes the output as float((value - input_min_value) * multiplier) + output_min_value;
The floating-point arithmetic does not involve any invalid data or invalid operation, yet it somehow computes inf. Normally, the result should be
v = 79469.06
scaler(v) = ((79469.06 - 500) * 0.10319918) + 0 = 8149.54223737
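For clarity, a minimal sketch of what such a linear scaler might look like, using the field names from the log output (multiplier, input_min_value, output_min_value, output_max_value); this is a hypothetical reconstruction, not the actual etl implementation:
struct LinearScaler {
    // Hypothetical fields, named after the values printed in the log above.
    float input_min_value;
    float output_min_value;
    float output_max_value;
    float multiplier;  // e.g. (output span) / (input span)

    float operator()(float value) const {
        // Plain finite arithmetic: with the logged values,
        // (79469.06 - 500) * 0.10319918 + 0 ≈ 8149.54, which should never be inf.
        return (value - input_min_value) * multiplier + output_min_value;
    }
};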
I made many attempts to find the root cause of this random error, and it turns out to depend on how the processing thread is created. If I pin the thread to core 0 with esp_pthread_cfg_t thread_cfg{4096, 1, true, "process", 0}; the problem occurs. If I pin the thread to core 1 with esp_pthread_cfg_t thread_cfg{4096, 1, true, "process", 1}; the problem is gone. Why would the core affinity affect the floating-point calculation result? I know that the FPU context is lazily switched, which could potentially cause inconsistencies between threads. Are there any precautions I need to take to make sure the floating-point results are valid and consistent?
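For reference, a minimal sketch of how the thread configuration above is applied, assuming the standard esp_pthread API (esp_pthread_set_cfg) and the aggregate field order stack_size, prio, inherit_cfg, thread_name, pin_to_core; the function name and the lambda body are placeholders, not the actual project code:
#include <thread>
#include "esp_pthread.h"

void start_processing_thread() {
    // Aggregate order: stack_size, prio, inherit_cfg, thread_name, pin_to_core.
    // Pinning to core 0 reproduces the inf/nan results; pinning to core 1 does not.
    esp_pthread_cfg_t thread_cfg{4096, 1, true, "process", 1};  // workaround: core 1
    esp_pthread_set_cfg(&thread_cfg);  // applies to threads created after this call

    std::thread([] {
        // ... process the ADC circular buffer here ...
    }).detach();
}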
Steps to reproduce.
- The problem is a bit too random to reproduce
Debug Logs.
No response
More Information.
No response
About this issue
- State: closed
- Created a year ago
- Comments: 23 (12 by maintainers)
This is really good news, because a regression is by nature easier to fix than a very old problem that was never seen before. Nevertheless, it can still be hard to fix!
For now, you have a workaround to run your code.