core: Nest integration high CPU usage on armv7 / raspberry pi in pubsub subscriber native code (fixed in 2023.10.x)

The problem

When I enable the Google Nest integration, I see a continuous python3 CPU load of about 65%. Without this integration, the python3 CPU load is only about 2%.

Any recommendation to optimize the load of the Google Nest integration?

What version of Home Assistant Core has the issue?

2022.2.9

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Container

Integration causing the issue

nest

Link to integration documentation on our website

https://www.home-assistant.io/integrations/nest

Diagnostics information

{
  "home_assistant": {
    "installation_type": "Home Assistant Container",
    "version": "2022.2.9",
    "dev": false,
    "hassio": false,
    "virtualenv": false,
    "python_version": "3.9.7",
    "docker": true,
    "arch": "armv7l",
    "timezone": "Europe/Brussels",
    "os_name": "Linux",
    "os_version": "5.10.63-v7l+",
    "run_as_root": true
  },
  "custom_components": {
    "config_editor": {
      "version": "3.0",
      "requirements": []
    },
    "hacs": {
      "version": "1.22.0",
      "requirements": [
        "aiogithubapi>=21.11.0"
      ]
    },
    "bosch": {
      "version": "0.17.3",
      "requirements": [
        "bosch-thermostat-client==0.17.3"
      ]
    },
    "afvalbeheer": {
      "version": "4.9.2",
      "requirements": [
        "rsa",
        "pycryptodome"
      ]
    },
    "authenticated": {
      "version": "21.9.0",
      "requirements": []
    }
  },
  "integration_manifest": {
    "domain": "nest",
    "name": "Nest",
    "config_flow": true,
    "dependencies": [
      "ffmpeg",
      "http",
      "media_source"
    ],
    "documentation": "https://www.home-assistant.io/integrations/nest",
    "requirements": [
      "python-nest==4.2.0",
      "google-nest-sdm==1.7.1"
    ],
    "codeowners": [
      "@allenporter"
    ],
    "quality_scale": "platinum",
    "dhcp": [
      {
        "macaddress": "18B430*"
      },
      {
        "macaddress": "641666*"
      },
      {
        "macaddress": "D8EB46*"
      },
      {
        "macaddress": "1C53F9*"
      }
    ],
    "iot_class": "cloud_push",
    "is_built_in": true
  },
  "data": {
    "subscriber": {
      "start": 1,
      "message_received": 7,
      "message_acked": 7
    },
    "devices": [
      {
        "data": {
          "name": "**REDACTED**",
          "type": "sdm.devices.types.DOORBELL",
          "assignee": "**REDACTED**",
          "traits": {
            "sdm.devices.traits.Info": {
              "customName": "**REDACTED**"
            },
            "sdm.devices.traits.CameraLiveStream": {
              "maxVideoResolution": {
                "width": 640,
                "height": 480
              },
              "videoCodecs": [
                "H264"
              ],
              "audioCodecs": [
                "AAC"
              ],
              "supportedProtocols": [
                "RTSP"
              ]
            },
            "sdm.devices.traits.CameraImage": {
              "maxImageResolution": {
                "width": 1920,
                "height": 1200
              }
            },
            "sdm.devices.traits.CameraPerson": {},
            "sdm.devices.traits.CameraSound": {},
            "sdm.devices.traits.CameraMotion": {},
            "sdm.devices.traits.CameraEventImage": {},
            "sdm.devices.traits.DoorbellChime": {}
          },
          "parentRelations": [
            {
              "parent": "**REDACTED**",
              "displayName": "**REDACTED**"
            }
          ]
        },
        "command": {
          "sdm.devices.commands.CameraLiveStream.GenerateRtspStream": 1
        },
        "event_media": {
          "event": 2,
          "event.new": 2,
          "event.fetch": 2,
          "fetch_image": 2,
          "fetch_image.skip": 2,
          "event.notify": 2
        }
      }
    ]
  }
}

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments: 93 (44 by maintainers)

Most upvoted comments

Fantastic. That matches the description of the issue I was seeing when debugging.

Last time I bumped grpc there were some other dependency issues for protobuf 4. If these aren’t resolved yet it may not just be a trivial bump, but either way I’ll manage it.

Still an issue: I am running a new install on a Raspberry Pi 4, hosted via Docker Compose.

I spotted high CPU last night and narrowed it down to the Nest integration: CPU is around 30% with Nest enabled, 1-2% with it disabled.

I assume we need to wait for a fix further upstream based on the comment history, but I just wanted to confirm that it’s still an issue on the latest HA version.

I have been seeing a jump from 5% CPU without the Nest integration to 40% with the integration.

I have been running 32-bit Home Assistant OS on a Raspberry Pi 4. The other day I migrated my HAOS to the 64-bit version, and my CPU usage now stays at 5% while running the Nest integration. Yay!

I don’t know if this is already known but I thought it may be helpful to someone.

That’s great news. The lib bump will come out with 2023.10.x.

Did the same 64-bit swap as rgerbranda; my CPU finally dropped to around 5% usage and the low 50s in temperature, for the first time since last summer.

I switched to Raspberry Pi OS (64-bit) (thanks @Photoexploration!) and now the Nest integration is working fine! The CPU load is back to normal.
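For anyone unsure whether their install is the affected combination, a quick hedged check from a Python shell inside the container tells you whether the interpreter is a 32-bit armv7 build (the case hitting this issue) or a 64-bit one:

import platform
import struct

# "armv7l" together with a 32-bit interpreter is the combination showing the
# runaway grpc CPU usage in this thread; "aarch64" / 64-bit builds look fine.
print(platform.machine())        # e.g. "armv7l" or "aarch64"
print(struct.calcsize("P") * 8)  # pointer size of the interpreter: 32 or 64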

@lukakama would you be up to file another issue with grpc? You have pretty good profile data, and may be able to give them the detail they need. (I don’t currently have an RPi in my possession and haven’t been able to get one due to the chip shortage.)

Done. I’m also sharing another strace log showing an issue in futex syscall usage (some sort of thread synchronization locking, but with already-expired absolute timeouts), which should help them trace the source of the issue.

However, I think I will move to a 64-bit OS as soon as possible (thanks @Photoexploration!) in order to work around the problem and reduce my RPi’s power usage.

In addition, I tried to profile both the main HA process and the threads using the CPU with “strace -Tcfp <pid>”, and they always report a high “% time” on the futex syscall, along with a lot of errors…:

(from main process)

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.69   38.438107        2645     14527     13052 futex
 11.76    7.838140      712558        11         3 restart_syscall
 10.05    6.697177      148826        45           _newselect
  8.33    5.549870      168177        33           poll
  8.13    5.415075       13272       408           epoll_pwait
  3.76    2.506349          47     53049           clock_gettime64
  0.07    0.046078          88       521           write
  0.05    0.035340          66       533           _llseek
[...]

(from the thread constantly using the CPU)

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 46.67   34.166324        2457     13904     12848 futex
 16.63   12.172873     1106624        11         4 restart_syscall
 16.35   11.971184      244309        49           poll
  8.70    6.367453      219567        29           _newselect
  7.78    5.698243       11086       514           epoll_pwait
  3.40    2.487209          47     52787           clock_gettime64
  0.20    0.145667         446       326           read
[...]

(from the thread using CPU intermittently)

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 51.00   22.481324        2182     10299      9492 futex
 20.65    9.102114      827464        11         4 restart_syscall
 14.25    6.281863      224352        28           _newselect
  8.63    3.802922        9320       408           epoll_pwait
  4.09    1.803817          46     39178           clock_gettime64
  1.12    0.491646        8779        56           poll
  0.04    0.019802         119       166        23 recvfrom
[...]

So, if I’m interpreting it correctly, this should confirm that there are still some issues in the grpc library’s threading/polling management on ARM.
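In case it helps anyone reproduce this outside of Home Assistant, here is a minimal subscriber sketch along the same streaming-pull path (the project and subscription IDs are placeholders, and it assumes google-cloud-pubsub is installed and credentials are configured); left idle on armv7 it should show the same constant grpc CPU load:

# Minimal Pub/Sub subscriber to isolate the grpc CPU usage from Home Assistant.
# "my-project" and "my-subscription" are placeholders for illustration only.
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print(f"Received {message.data!r}")
    message.ack()

# The streaming pull runs on background grpc threads; watch htop while this
# sits idle -- on a 32-bit armv7 build it reproduces the constant load.
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull_future.result(timeout=300)  # keep it alive for 5 minutes
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()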

Hi, I just updated HA (Docker installation) to 2023.1.0, but my RPi 3 CPU usage has increased from ~50% to ~100%… From what I can see in “htop”, there are three HA threads causing all the CPU load, with one of them constantly using 50% of the CPU and the other two, at alternating times, using the remaining 50%.

[htop screenshots of the three busy threads: PIDs 48, 53 and 64]

The Python profiler still shows nothing, so the issue is still in some native code: [profiler screenshot]

I tried to retrieve some information for these threads by adding “gdb” to the Docker image and running a “thread apply all bt” command on HA’s Python process. All three threads report the library “/usr/local/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-arm-linux-gnueabihf.so” in their stacks, so there are still some other issues in the grpc library:

(PID 53)
Thread 44 (LWP 53 "python3"):
#0  0x76f6fa6e in syscall () from /lib/ld-musl-armhf.so.1
#1  0x57a075e2 in ?? () from /usr/local/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(PID 48 - captured while using CPU)
Thread 39 (LWP 48 "python3"):
#0  0x76f6fa6e in syscall () from /lib/ld-musl-armhf.so.1
#1  0x57a075e2 in ?? () from /usr/local/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(PID 64 - captured while using CPU)
Thread 53 (LWP 64 "python3"):
#0  0x76f6fa6c in syscall () from /lib/ld-musl-armhf.so.1
#1  0x57a075e2 in ?? () from /usr/local/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
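As an aside, a lighter-weight check than attaching gdb is to dump the Python-level stacks of all threads from inside the process; threads spinning in native code show up parked on their last Python call into the cygrpc extension. A minimal sketch (the helper name is my own):

import sys
import threading
import traceback

def dump_all_thread_stacks() -> None:
    # Print the current Python frame of every thread. Threads burning CPU
    # inside native code (e.g. grpc's cygrpc extension) appear stuck on the
    # last Python call that entered the extension.
    names = {t.ident: t.name for t in threading.enumerate()}
    for thread_id, frame in sys._current_frames().items():
        print(f"--- thread {names.get(thread_id, '?')} (id {thread_id}) ---")
        traceback.print_stack(frame)

dump_all_thread_stacks()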

I also tried to enable grpc logs (the GRPC_TRACE=api and GRPC_VERBOSITY=info environment variables), but it seems that the polling problem reported in https://github.com/googleapis/python-pubsub/issues/728 has been fixed, as the container’s grpc library is 1.51.1 and the logs report a poll message every 200ms, as they should:

I0105 11:25:21.678587881      83 completion_queue.cc:948]              grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914321, tv_nsec: 878563767, clock_type: 1 }, reserved=0)
I0105 11:25:21.879211785      83 completion_queue.cc:948]              grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914322, tv_nsec: 79186160, clock_type: 1 }, reserved=0)
I0105 11:25:22.080915739      83 completion_queue.cc:948]              grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914322, tv_nsec: 280890114, clock_type: 1 }, reserved=0)
I0105 11:25:22.281308965      83 completion_queue.cc:948]              grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914322, tv_nsec: 481280945, clock_type: 1 }, reserved=0)
I0105 11:25:22.483128597      83 completion_queue.cc:948]              grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914322, tv_nsec: 683102347, clock_type: 1 }, reserved=0)

so there should still be something wrong in the grpc implementation…
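For reference, a minimal sketch of how that tracing can be enabled when launching the Python process directly (in a Docker setup the variables would normally go into the container environment instead; they must be set before grpc is first imported):

import os

# GRPC_TRACE / GRPC_VERBOSITY are read when the grpc core initialises, so they
# have to be in the environment before the first "import grpc".
os.environ["GRPC_TRACE"] = "api"
os.environ["GRPC_VERBOSITY"] = "info"

import grpc  # imported only after setting the variables, on purpose

print(grpc.__version__)  # e.g. 1.51.1, the version shipped in the container above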

grpc 1.51.1 was integrated in https://github.com/home-assistant/core/pull/83420

Perhaps we can confirm that folks see this helps on a dev build, and then we can consider this closed.

Hey! Remember this is a volunteer project. I put a pretty significant amount of effort into diagnosing and profiling this, finding a potential cause, and filing the issue in the upstream project.

The right answer is that I disabled the add-on for now; running at 100% CPU for months in a row is unacceptable. I had hoped for a little more effort from the devs.