homie-esp8266: [develop-v3] OTA failing

Hi people,

I’ve been hitting my head against the wall trying to OTA update an ESP8266. The code is minimalistic

#include "Arduino.h"
#include <Homie.h>


HomieNode sensorNode("sensor", "Sensor", "sensor");

unsigned long lastSent = 0;

void loopHandler() {
  unsigned long m = millis();
  if ( m - lastSent >= 10 * 1000UL || lastSent == 0) {
    Homie.getLogger() << "Fake value: " << m << endl;
    sensorNode.setProperty("value").send(String(m));
    lastSent = millis();
  }
}

void setup() {
  Serial.begin(115200);
  Homie.getLogger() << "kkkkkkkkkkkkkkkkkkkkkk" << endl;
  Homie.getMqttClient().setKeepAlive(120);
  Homie_setFirmware("test", "1.0.0");
  Homie.setLoopFunction(loopHandler);
  sensorNode.advertise("value").setName("Value").setDatatype("double");
  Homie.setup();
}

void loop(){
  Homie.loop();
}

The platformio.ini file is :

[env:nodemcuv2]
platform = espressif8266
board = nodemcuv2
framework = arduino
build_flags =
    -D HOMIE_MDNS=0
    -D PIO_FRAMEWORK_ARDUINO_LWIP2_LOW_MEMORY
upload_speed = 115200
lib_ldf_mode = deep
lib_deps =
    ArduinoJson@6.13.0
    https://github.com/homieiot/homie-esp8266.git#develop-v3

I also repartitioned the mcu to make sure there is enough space:

# Name,   Type, SubType, Offset,  Size, Flags
nvs,      data, nvs,     0x9000,  0x6000,
phy_init, data, phy,     0xf000,  0x1000,
factory,  app,  factory, 0x10000, 3M,

Here are my libs :

Dependency Graph
|-- <ArduinoJson> 6.13.0
|-- <Homie> 3.0.0 #7818c73
|   |-- <ArduinoJson> 6.13.0
|   |-- <AsyncMqttClient> 0.8.2
|   |   |-- <ESPAsyncTCP> 1.2.2
|   |   |   |-- <ESP8266WiFi> 1.0
|   |-- <Bounce2> 2.52
|   |-- <ESP Async WebServer> 1.2.3
|   |   |-- <ESPAsyncTCP> 1.2.2
|   |   |   |-- <ESP8266WiFi> 1.0
|   |   |-- <Hash> 1.0
|   |   |-- <ESP8266WiFi> 1.0
|   |   |-- <ArduinoJson> 6.13.0
|   |-- <ESP8266mDNS> 1.2
|   |   |-- <ESP8266WiFi> 1.0
|   |-- <ESP8266WiFi> 1.0
|   |-- <DNSServer> 1.1.1
|   |   |-- <ESP8266WiFi> 1.0
|   |-- <ESPAsyncTCP> 1.2.2
|   |   |-- <ESP8266WiFi> 1.0
|   |-- <Ticker> 1.0
|   |-- <ESP8266HTTPClient> 1.2
|   |   |-- <ESP8266WiFi> 1.0

Also for some reason, not sure why, platform IO is not recognizing the size of my partitions …

DATA:    [=====     ]  48.9% (used 40048 bytes from 81920 bytes)
PROGRAM: [=====     ]  49.8% (used 520632 bytes from 1044464 bytes)

Shouldn’t it read 3M and not 1MB (1044464 bytes)

Anyways, then what I do is modify the sketch to change the logging line after initializing Serial. Then build and then invoke :

$ python ota_updater.py -l 192.168.0.227 -p 1883 -t devices/ -i testbox /home/nemi/workspace/green-jac/homie_i2c/.pio/build/nodemcuv2/firmware.bin
Connecting to mqtt broker 192.168.0.227 on port 1883
Connected with result code 0
Waiting for device to come online...
Waiting for device info...
Publishing new firmware with checksum 3373df653b869cf431689d254d3b1e37
[                              ] 989/545344Waiting for device info...
[                              ] 989/545344Expecting checksum 3373df653b869cf431689d254d3b1e37, got b6b11dc49ab3c6400b1995025dc996f1, update failed!

Sometimes the serial output shows and blocks:

Receiving OTA payload
↕ OTA started
Triggering OTA_STARTED event...
Firmware is binary
✖ OTA failed (500 INTERNAL_ERROR 11)
Triggering OTA_FAILED event...

ERROR 11 according to this means UPDATE_ERROR_BAD_ARGUMENT

Some other times I get :

Receiving OTA payload
↕ OTA started
Triggering OTA_STARTED event...
Firmware is binary
Receiving OTA firmware (453/545344)...
Receiving OTA firmware (989/545344)...

 ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
v951aeffa
~ld

Here is what gets published :

------- SOMETHING THAT SEEMS TO BE THE BINARY BEING SENT OVER MQTT --------
devices/testbox/$implementation/ota/status 202
devices/testbox/$implementation/ota/status 206 453/545344
devices/testbox/$implementation/ota/status 206 989/545344
------- RESET -------
devices/testbox/$state init
devices/testbox/$homie 3.0.1
devices/testbox/$name Test
....

As u see in the output the board restarts but with the old firmware, it’s like it starts getting the binary but then decides (?) to restart before it got it all. I read through this other issue https://github.com/homieiot/homie-esp8266/issues/613 and tried the keep alive thing but did not work either

Then I also added the flag but it makes it even worse

-D HOMIE_CONFIG=0

I get the following crash if I flash the board with a firmware built with that flag and then OTA:

0x40225144: operator delete[](void*) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/del_opv.cc line 33
0x40225174: operator new(unsigned int) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/new_op.cc line 45
0x40225135: std::set_terminate(void (*)()) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/eh_terminate.cc line 70
0x40225120: __fpclassifyd at ../../../../../../dl/newlib-xtensa/newlib/libm/common/s_fpclassify.c line 26
0x402252b1: glue2esp_linkoutput at glue-esp/lwip-esp.c line 239
0x40225533: new_linkoutput at glue-lwip/lwip-git.c line 216
0x40225174: operator new(unsigned int) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/new_op.cc line 45
0x4022ce14: etharp_output_to_arp_index at core/ipv4/etharp.c line 764
0x4022e6da: inet_chksum_pseudo at core/inet_chksum.c line 266
0x4022dba2: ip4_output_if_opt at core/ipv4/ip4.c line 820
0x4022e717: inet_chksum_pseudo at core/inet_chksum.c line 327
0x4022a8a2: tcp_output at core/tcp_out.c line 1312
0x40228ba4: tcp_input at core/tcp_in.c line 487
0x4022d919: ip4_input at core/ipv4/ip4.c line 696
0x4022589d: ethernet_input_LWIP2 at netif/ethernet.c line 170
0x402256d3: esp2glue_alloc_for_recv at glue-lwip/lwip-git.c line 419
0x4024a51e: ethernet_input at glue-esp/lwip-esp.c line 353
0x4024a52f: ethernet_input at glue-esp/lwip-esp.c line 355
0x40225135: std::set_terminate(void (*)()) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/eh_terminate.cc line 70
0x40225120: __fpclassifyd at ../../../../../../dl/newlib-xtensa/newlib/libm/common/s_fpclassify.c line 26
0x40225135: std::set_terminate(void (*)()) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/eh_terminate.cc line 70
0x40225174: operator new(unsigned int) at ../../../../../dl/gcc-xtensa/libstdc++-v3/libsupc++/new_op.cc line 45

Any ideas what might be wrong or what else to try before diving into the code? I still have to try with ESP32 to see if the same thing happens. Thanks!

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 27 (17 by maintainers)

Most upvoted comments

Hey guys, I found the problem and learned a few things along the way.

The hardware WD timer was being hit as direct consequence of a memory bug. I timed the memcpy + unique_ptr piece of code to be aprox at 50 microsecs.

Here is what’s wrong : When you get to ota topic that has as payload the binary firmware it first goes through

void BootNormal::_onMqttMessage(char* topic, char* payload, AsyncMqttClientMessageProperties properties, size_t len, size_t index, size_t total) {
  if (total == 0) return;  // no empty message possible

  size_t topicLength = strlen(topic);
  _mqttTopicCopy = std::unique_ptr<char[]>(new char[topicLength+1]);
  memcpy(_mqttTopicCopy.get(), topic, topicLength);
  _mqttTopicCopy.get()[topicLength] = '\0';

  // split topic on each "/"
  if (index == 0) {
    __splitTopic(_mqttTopicCopy.get());
  }

  // 1. Handle OTA firmware (not copied to payload buffer)
  if (__handleOTAUpdates(_mqttTopicCopy.get(), payload, properties, len, index, total))
    return;
  1. copies the topic into _mqttTopicCopy
  2. since index == 0 calls splitTopic. That function uses the internal object (char*) of _mqttTopicCopy to tokenize the topic and then store all the tokens in _mqttTopicLevels.

The bug resides in that the tokens are pieces of the internal buffer copied into _mqttTopicCopy. When doing OTA, the _onMqttMessage is “artificially” invoked for the chunked FW (here), every time a new call to _onMqttMessage is done kills the previous instance of the _mqttTopicCopy (here) and thus the pointers that were stored on _mqttTopicLevels become invalid, leading to undefined behavior and eventually a timeout.

I’ll work on a fix and send the PR.

Thanks !

@nemidiy your support is greatly appreciated!

@luebbe @sradner13 thanks for the input. the memcpy code is copying the mqtt topic, not the payload. https://github.com/homieiot/homie-esp8266/blob/develop-v3/src/Homie/Boot/BootNormal.cpp#L786 The topic should be string no more that 50 chars. Anyways, since I am the rookie here I will time that piece of code to see how much time it takes. Regardless there is clearly something wrong on the copying that’s causing the issue. I will do some debugging and come back. My end goal is having OTA over SSL working eventually.