pycom-micropython-sigfox: GPY Modem - ESP32 Cannot Communicate, LTE() function error

GPY Modem-ESP32 Communication Issue

This document describes a hardware/firmware issue that severely impacts reliability and limits potential applications.

Issue Description

The fundamental issue revolves around this call in main.py:

lte = LTE()

The ESP32 is unable to connect to the modem, resulting in the error below from pycom-micropython-sigfox/esp32/mods/modlte.c:

OSError: Couldn't connect to Modem (modem_state=disconnected)

After resetting, the problem repeats itself.

machine.reset()

After hundreds of soft resets, the LTE() function occasionally returns successfully and we are able to proceed with full functionality from then on. But after one (or more) soft resets or spontaneous disconnects, the problem returns. The same happens when resets occur via WDT.

Power cycling will usually cause the module to operate properly on the first try, but not always. And the issue always returns after a period of time.

Firmware Version Tests

This problem is present with many firmware versions, including the experimental Dev branch of pycom-micropython-sigfox:

  • v1.18.2
  • v1.18.2.r7
  • v1.20.1
  • v1.20.1.r2
  • v1.20.2.rc6
  • Development

I have used both the pre-compiled firmware and firmware that I built myself with pycom-esp-idf and the toolchain. Nothing fixes the problem.

I am using the latest stable modem firmware, CATM1-41065.dup. It is almost impossible to downgrade because, obviously, I cannot communicate with the modem.

Hardware Tests

This problem is present in two different GPY modules using the Expansion Board 3.1 and the provided cellular antenna.

One module with a firmware test: (sysname='GPy', nodename='GPy', release='1.20.1.r2', version='v1.20.1.r2-5-gdb7f895-dirty on 2020-03-25', machine='GPy with ESP32')

Fix Attempts

Calling lte = LTE() from the CLI fails in exactly the same way.

Adding pycom.lte_modem_en_on_boot(False) makes no difference.

Other Instances

I am clearly not the only one with this issue. See these forum threads from 2018:

https://forum.pycom.io/topic/3675/despite-heavy-investment-in-fipy-gpy-not-possible-to-use-board-as-anything-more-than-lte-modem-and-even-that-s-problematic

https://forum.pycom.io/topic/3129/lte-lte-getting-stuck-after-reset-fw-1-17-3-b1-on-fipy

Relevant Code

import pycom
import time
import os
from machine import WDT
from machine import SD
import machine

pycom.wdt_on_boot(True)
pycom.wdt_on_boot_timeout(240000)

wdt = WDT(timeout=240000)  # enable it with a timeout of 240 seconds
wdt.init(240000)
wdt.feed()

import ujson

pycom.wifi_on_boot(False)
pycom.heartbeat(False)

from machine import Pin
from network import WLAN

wlan = WLAN()
wlan.antenna(WLAN.EXT_ANT)
wlan.deinit()

import socket
import ssl
from network import LTE
from network import Bluetooth
from simple import MQTTClient
import ubinascii
import array
from machine import RTC

if pycom.lte_modem_en_on_boot():
    print("LTE on boot was enabled. Disabling.")
    pycom.lte_modem_en_on_boot(False)

print("LTE()")

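try:
    lte = LTE()
except OSError as e:
    # LTE() raises OSError when the ESP32 cannot reach the modem
    print("LTE() failed:", e)
    time.sleep(6)
    machine.reset()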

#we rarely get here...
print("LTE() done")
#rest of program...
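The reset-on-failure pattern above could also be written as a bounded in-process retry before falling back to machine.reset(). This is only a sketch: init_modem and its parameters are hypothetical names, and given the behaviour described in this issue there is no reason to expect retries alone to fix the problem. It mainly makes each failure visible in the log.

```python
import time


def init_modem(make_lte, attempts=5, delay_s=6, sleep=time.sleep):
    """Call make_lte() up to `attempts` times; return the object or None.

    make_lte is any zero-argument callable (on the GPy, the LTE class).
    """
    for attempt in range(1, attempts + 1):
        try:
            return make_lte()
        except OSError as exc:
            # On the GPy this is "Couldn't connect to Modem (...)"
            print("LTE() attempt", attempt, "failed:", exc)
            if attempt < attempts:
                sleep(delay_s)
    return None
```

On the device this would be used as lte = init_modem(LTE), calling machine.reset() only when the result is None.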

Proposed Resolution

There must be a way to reset the communication lines between the ESP32 and the modem without first executing lte = LTE(). I am comfortable building my own updated firmware with a solution from Pycom. However, I need assistance due to the complexity of the LTE-related firmware and processes.

Thank you thank you thank you for any help in solving this issue!

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 3
  • Comments: 23 (2 by maintainers)

Most upvoted comments

@curtmiller, @jonnerd154’s information has all been based on CATM1-5.2-48829/1.20.2.r4.

One additional tidbit that has been helpful was from #585: Occasionally I have seen that the GPy has been running fine for weeks (with a few automated power cycles from the external watchdog) but will get into a cycle where it will no longer attach. Sending the AT&F command and then a reset has been able to get the GPy out of this state.
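A minimal sketch of that AT&F-then-reset recovery, assuming the LTE object could still be created and that lte.send_at_cmd() returns the modem's reply as a string. factory_reset_modem is a hypothetical helper, and whether "reset" in the comment means machine.reset() or a modem reset is not stated, so the reset action is passed in by the caller.

```python
def factory_reset_modem(lte, reset):
    """Send AT&F through lte.send_at_cmd(); call reset() only if the modem answers OK.

    lte   -- object with a send_at_cmd(str) method (assumed to return a string)
    reset -- zero-argument callable, e.g. machine.reset on the device
    """
    response = lte.send_at_cmd("AT&F") or ""
    print("AT&F response:", response.strip())
    if "OK" in response:
        reset()
        return True
    return False
```

On the device this might be called as factory_reset_modem(lte, machine.reset) once attach attempts have failed repeatedly.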

After working with the Pycom engineers, it appears that the “experimental” new Modem firmware they have fixes this problem. However, I am keeping this ticket open until I see that someone has posted a link to where people can get this firmware.

As indicated previously we are using Pycom MicroPython V1.20.0.rc13 release candidate in our application. We do not use the LTE module for connectivity but instead talk directly to the cell modem thru the serial port. It is possible that improvements have been made to the LTE module in newer versions that we have not tried. The external watchdog is definitely necessary for long-term reliability.

We ended up having to add an external watchdog circuit which toggles the +5V power to the GPy chip.

In reply to SebastiaanMerckx (Wednesday, November 24, 2021, 9:01 AM), who wrote:

I have that same 46262 firmware and, together with an unofficial 1.20.2.rc11 micropython binary, this is the most stable that I could get so far (using NB-IoT). So it means we have live setups with unofficial modem firmware and unofficial Pycom firmware, great 😄.


Thanks to you all for your patience. Unfortunately despite best efforts, we haven’t been able to get much closer to solving it once and for all, however we made some small steps and I can give a bit more information on what we’ve learned. I don’t have vast experience with LTE modems so might give some awkward descriptions or have missed some clues.

To recap (OP described it best), we were experiencing a problem failing to gain control of the LTE modem at times and so the LTE() object could not be created, and so it could not be lte.deinit()ed etc. We were running the NB1-41019 firmware. We need to deepsleep in our application but we can’t power cycle.

This would often happen to us when attempting an lte.attach(), whereby if it failed to attach in allotted time, the program would time out, reset, and be unable to reinitialize the modem with LTE(). The logic analyzer could see the Fipy trying to interrupt the modem with +++ while the modem appears to be stuck in data mode ignoring the interrupt (the PyJTAG is a bit out of reach for us).
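The attach-with-timeout step described here can be sketched as a polling loop that reports failure to the caller instead of resetting straight away. attach_with_timeout and its parameters are hypothetical; lte.attach() and lte.isattached() are the calls assumed from the Pycom LTE class. Nothing in this thread establishes that avoiding the reset prevents the stuck state.

```python
import time


def attach_with_timeout(lte, timeout_s=120, poll_s=1,
                        sleep=time.sleep, now=time.time):
    """Start an attach and poll until attached or timeout_s has elapsed."""
    lte.attach()
    deadline = now() + timeout_s
    while now() < deadline:
        if lte.isattached():
            return True
        sleep(poll_s)
    # One last check so an attach that lands exactly at the deadline counts
    return lte.isattached()
```

Returning False leaves the decision to the caller, for example to log the modem state before choosing between another attempt and a reset.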

We could neither predict the failure nor recreate the conditions (it would usually just happen from time to time). It became increasingly frequent on one of our test devices running NB1-41019 until the device was effectively bricked. With nothing to lose, we decided to flash the experimental NB1-46262 firmware, available from Pycom via a service request. Since flashing 46262, we can regain control over the modem almost every time (it failed twice, but we were able to retry and regain control). We are not sure whether it was the act of flashing the modem firmware that cleared its config or the version itself, but flashing it did mostly unblock it for us in this instance. Monitoring the current draw, we observed the modem running 46262 entering deepsleep every time with lte.deinit(reset=True).

Our capabilities will be limited with the equipment we have but if there are any suggestions we can try to get closer to resolving it, we can post our results.

Had to focus on some procurement here. The LTE module is still important to us - I haven’t forgotten. I have some current profile plots that I can put up… I’ll be back to post those when I get a moment.

We’ve been trying for some time to find a fix/workaround to these same LTE() issues you’re experiencing while trying to use the Fipy on NB-IoT in both the UK and the US. It would be ideal for us if the issues with the LTE class could be resolved for our existing code and hardware as we are also unable to power-cycle the modem in our intended application, and we’re able and willing to help do some testing here with the equipment we have and contribute our findings.

Our focus is on getting the Fipy working on NB-IoT in the US ideally, but the LTE functionality is equally as important to us in general. Our UK Fipy can send/receive on NB-IoT whenever the LTE class doesn’t error out, however we haven’t had our US Fipy work once on NB-IoT despite using a known-working SIM.

We have a few Fipys in the UK and US on firmware 1.20.0.rc13 or 1.20.2.r4, running basic NB-IoT Python code from file or REPL. Most are on custom boards with sensors but we also have Pysense/Expansion boards for testing. Modem firmware is LR6.0.0.0-41019 (NB-IoT). We have yet to try 46262, but I’ll report back on our findings when we do. For the avoidance of doubt, some tests were performed with a (1000 µF + 100 nF) capacitor pair added to each of the power input pins and to 3.3 V out, then powered by battery, USB, sometimes both, or an external power supply. We found no improvement or difference with or without any combination. (Notably, with the modem active and the Fipy idle in REPL, we sometimes observed power spikes of around 600 mA for 7 ms. The peaks were reduced slightly with capacitors added, but the behaviour was functionally the same.)

Some of our investigations / findings:

  • We experience the same LTE() module issues with both 1.20.0.rc13 previously and now 1.20.2.r4.
  • We have observed problems with initialization/creation of the LTE() object as well as lte.deinit() and find that either one can become a recurring problem on each machine.reset() cycle.
  • We don’t have a PyJTAG but instead connected an analyzer to the five accessible LTE UART pins on the Fipy with pycom.lte_modem_en_on_boot() == True and observed the modem responses.
  • As power is applied, and before manually calling LTE(), the firmware tries to put the modem in command mode with +++, sent a second time if no response after the first (as seen in lteppp.c)
  • The same modem initialization sequence happens when lte=LTE() is called. Sometimes when the call fails we have observed either:
    • no response from the modem (RX is silent)
    • or the modem ignores the +++ and stays in data mode sending HDLC frames
  • After a successful uplink (and downlink), lte.deinit() is called. The ESP32 sends an 'ATH\r' and then AT commands proceed up to AT+CFUN=4 which should turn off the transceiver. Here’s where the paths diverge:
    • Where the deinit() call succeeds, the modem sends an OK CEREG: 0 and WAKE goes low for 1 second
    • Where the call fails, RTS went high and there was no RX data from the modem. WAKE does not change.
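The +++ escape-and-retry behaviour seen on the analyzer can be modelled roughly as below. This is not the code in lteppp.c: the modem UART is driven by the firmware, so uart here is a hypothetical object with write() and read() methods, and the one-second guard time is an assumed value for illustration.

```python
import time


def enter_command_mode(uart, tries=2, guard_s=1.0, sleep=time.sleep):
    """Send +++ up to `tries` times; return True once the modem answers OK.

    uart    -- hypothetical object: write(bytes), read() -> bytes or None
    guard_s -- assumed silent period before and after the escape sequence
    """
    for _ in range(tries):
        sleep(guard_s)            # quiet line before the escape sequence
        uart.write(b"+++")
        sleep(guard_s)            # quiet line after it
        reply = uart.read() or b""
        if b"OK" in reply:
            return True           # modem is in command mode
        # Empty reply (RX silent) or non-OK bytes (still in data mode): retry
    return False
```

Both failure cases listed above, a silent RX line and a modem that keeps sending data-mode frames, end in the same False result in this model.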

We also tried some program-stress tests, repeatedly running the same LTE init/send/receive/deinit/reset code over 150 times to spot any patterns (varying our experimental setup after around 20 tries each). There was no clear difference varying power source or with/without capacitors. We did observe a pattern when calling deinit() quickly after receiving an NB-IoT downlink where it succeeded almost every time with only a few failures, however adding a few seconds extra processing delay between the downlink and the deinit() prompted it to fail almost every time with a few successful deinits.
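A stress test of this kind can be tallied with a small harness like the one below. It is a sketch with hypothetical names: trial stands for one init/send/receive/deinit cycle that takes the delay before deinit() in seconds and returns True on success. On a real device that resets between cycles, the counts would have to be stored somewhere that survives the reset, which this sketch does not do.

```python
def run_trials(trial, delays_s, repeats):
    """Run trial(delay) `repeats` times per delay; return {delay: (ok, failed)}."""
    results = {}
    for delay in delays_s:
        ok = 0
        for _ in range(repeats):
            try:
                if trial(delay):
                    ok += 1
            except OSError:
                # Count a modem error as a failed trial and keep going
                pass
        results[delay] = (ok, repeats - ok)
    return results
```

Comparing the tuples for a zero delay and for a delay of a few seconds would show directly the pattern described above.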

It seems the internal functions can’t take back control over the modem at times where the modem is expected to be in one given state. Maybe there’s another way of interrupting the modem or getting its attention (is DTR wired and does the modem use it?).

Let’s hope some clues lead to a better understanding and a solution. We have a lot of data. I can provide more detail on any of these points if needed.

We had to switch to a different product entirely as the pycom units we had in the field caused countless issues and we had site visits literally on a weekly basis. These are in no way ready for prime time.