dapr: Data corruption in actor/service invocation under high rps

Expected Behavior

Actor HTTP endpoints return correctly serialized responses (99.9% of the time)

Actual Behavior

When invoking an actor method through the actor proxy, the call randomly fails and the response body is a malformed JSON string with extra trailing bytes, as if something went wrong at the transport layer. Below are two occurrences:

Occurrence #1:

# /home/xanrin/dapr-python-sdk/dapr/actor/client/proxy.py (line 71)
b'[[0.056063172301151286,0.43107083116247324,-0.21845425248263262,-1.2186224010316111],1.0,true,{}]lse,{}]'

# /home/xanrin/dapr-python-sdk/dapr/serializers/json.py (line 49)
b'[[0.056063172301151286,0.43107083116247324,-0.21845425248263262,-1.2186224010316111],1.0,true,{}]lse,{}]'

# What I expect:
b'[[0.056063172301151286,0.43107083116247324,-0.21845425248263262,-1.2186224010316111],1.0,true,{}]'

Occurrence #2:

# /home/xanrin/dapr-python-sdk/dapr/actor/client/proxy.py (line 71)
b'[[0.02180521361883352,0.9327880131478761,-0.04958054991416051,-1.4610110113961423],1.0,false,{}]]'

# /home/xanrin/dapr-python-sdk/dapr/serializers/json.py (line 49)
b'[[0.02180521361883352,0.9327880131478761,-0.04958054991416051,-1.4610110113961423],1.0,false,{}]]'

# What I expect (there is one ] too many):
b'[[0.02180521361883352,0.9327880131478761,-0.04958054991416051,-1.4610110113961423],1.0,false,{}]'

Stacktrace

== APP == Process ForkServerProcess-4:
== APP == Traceback (most recent call last):
== APP ==   File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
== APP ==     self.run()
== APP ==   File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
== APP ==     self._target(*self._args, **self._kwargs)
== APP ==   File "/usr/local/lib/python3.7/dist-packages/stable_baselines/common/vec_env/subproc_vec_env.py", line 18, in _worker
== APP ==     observation, reward, done, info = env.step(data)
== APP ==   File "/mnt/e/Projects/roadwork-rl/src/Lib/python/roadwork/roadwork/client/client_dapr.py", line 91, in step
== APP ==     obs, reward, done, info = asyncio.get_event_loop().run_until_complete(self.proxy.SimStep({ 'action': action }))
== APP ==   File "/usr/local/lib/python3.7/dist-packages/nest_asyncio.py", line 59, in run_until_complete
== APP ==     return f.result()
== APP ==   File "/usr/lib/python3.7/asyncio/futures.py", line 181, in result
== APP ==     raise self._exception
== APP ==   File "/usr/lib/python3.7/asyncio/tasks.py", line 249, in __step
== APP ==     result = coro.send(None)
== APP ==   File "/home/xanrin/dapr-python-sdk/dapr/actor/client/proxy.py", line 71, in __call__
== APP ==     return self._message_serializer.deserialize(rtnval, self._attr_call_type['return_types'])
== APP == Actor sending: b'[[0.056063172301151286,0.43107083116247324,-0.21845425248263262,-1.2186224010316111],1.0,true,{}]lse,{}]'
== APP == Decoding: b'[[0.056063172301151286,0.43107083116247324,-0.21845425248263262,-1.2186224010316111],1.0,true,{}]lse,{}]'
== APP ==   File "/home/xanrin/dapr-python-sdk/dapr/serializers/json.py", line 49, in deserialize
== APP == Actor sending: b'[[0.056063172301151286,0.43107083116247324,-0.21845425248263262,-1.2186224010316111],1.0,true,{}]lse,{}]'
== APP == Decoding: b'[[0.056063172301151286,0.43107083116247324,-0.21845425248263262,-1.2186224010316111],1.0,true,{}]lse,{}]'
== APP ==     obj = json.loads(data, cls=DaprJSONDecoder)
== APP ==   File "/usr/lib/python3.7/json/__init__.py", line 361, in loads
== APP ==     return cls(**kw).decode(s)
== APP ==   File "/usr/lib/python3.7/json/decoder.py", line 340, in decode
== APP ==     raise JSONDecodeError("Extra data", s, end)
== APP == json.decoder.JSONDecodeError: Extra data: line 1 column 96 (char 95)
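
The `Extra data` error in the trace above can be reproduced offline from the captured payloads. The following is a minimal sketch using only the Python standard library; the helper name `split_valid_prefix` is hypothetical and is not part of the Dapr SDK. It decodes the first complete JSON value and returns whatever bytes follow it, which makes the stray tail easy to see.

```python
import json

# Payload copied verbatim from occurrence #1 in this report.
corrupted = b'[[0.056063172301151286,0.43107083116247324,-0.21845425248263262,-1.2186224010316111],1.0,true,{}]lse,{}]'


def split_valid_prefix(data: bytes):
    """Hypothetical helper: decode the first JSON value and return the leftover text."""
    text = data.decode("utf-8")
    # raw_decode stops at the end of the first complete JSON value
    # instead of raising on trailing characters.
    value, end = json.JSONDecoder().raw_decode(text)
    return value, text[end:]


# json.loads refuses the payload because of the trailing bytes.
try:
    json.loads(corrupted)
except json.JSONDecodeError as exc:
    print("json.loads failed:", exc.msg)

value, leftover = split_valid_prefix(corrupted)
print("decoded prefix:", value)
print("leftover text:", leftover)
```

For occurrence #1 the leftover is `lse,{}]`, which resembles the tail of a different response ending in `false,{}]`. That is only an observation about the bytes, not a confirmed diagnosis.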

Steps to Reproduce the Problem

This happens when running an application that uses a ThreadPool and issues more than 1000 requests per second (estimated).
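
A harness along the following lines could be used to count how often corrupted responses appear under concurrent load. This is a rough sketch, not the code from the affected application: `count_corrupted`, `is_clean_json`, and `fake_invoke` are hypothetical names, the default of 8 workers is arbitrary, and `fake_invoke` is a stand-in that deliberately appends a stray `]` to every 100th response so the script runs without a Dapr sidecar. In a real test it would be replaced by the actual actor call.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def is_clean_json(payload: bytes) -> bool:
    """Return True if the payload parses as exactly one JSON document."""
    try:
        json.loads(payload)
        return True
    except ValueError:  # JSONDecodeError is a subclass of ValueError
        return False


def count_corrupted(invoke, total_calls: int, workers: int = 8) -> int:
    """Call invoke(i) from a thread pool and count responses that are not valid JSON."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(invoke, range(total_calls))
        return sum(1 for payload in results if not is_clean_json(payload))


def fake_invoke(i: int) -> bytes:
    """Hypothetical stand-in for the real actor call; corrupts every 100th response."""
    good = b'[[0.1,0.2],1.0,false,{}]'
    return good + b']' if i % 100 == 99 else good


if __name__ == "__main__":
    print("corrupted responses:", count_corrupted(fake_invoke, 1000))
```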

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 45 (33 by maintainers)

Most upvoted comments

Awesome, thanks! It worked!! 😃 I ran 8 processes @ ~42 req / s resulting in 100k steps in 294s without a crash!

@XavierGeerinck we merged the fix. You can use this edge version of dapr.

Selfhost mode:

  1. dapr uninstall --all
  2. dapr init
  3. Download the edge version of the daprd binary from https://github.com/dapr/dapr/actions/runs/180628900
  4. Replace daprd with the edge version

Kubernetes with helm:

helm install helm dapr/dapr --namespace=dapr-system --set-string global.tag=edge