apm-agent-python: v4.1.0 Introduces `http.client.RemoteDisconnected` error
Describe the bug: … Events getting reported to APM-Server are randomly getting rejected. It appears that every so often the event getting reported is being met with a RemoteDisconnected exception.
To Reproduce
- Setup a django project as per directions
- Hit the homepage of the project
- Wait for several events to hit the APM server
- One will eventually cause the exception
Expected behavior: … No Exception
Environment (please complete the following information)
- OS: Mojave but running under
python:3
docker container - Python version: 3.7.2
- Framework and version: Django version 2.1.5
- APM Server version: apm-server version 6.6.0 (amd64), libbeat 6.6.0
- Agent version: 4.1.0
Additional context
downgrading to 4.0.3 remedies the issue.
Directly related: https://discuss.elastic.co/t/unable-to-reach-apm-server-connection-aborted-remotedisconnected-remote-end-closed-connection-without-response/166500
-
Agent config options
Click to expand
ELASTIC_APM = { 'SERVICE_NAME': 'web-backend', 'SECRET_TOKEN': '8149e743-2524-2524-2524-94aadd3fcde7', 'SERVER_URL': 'http://192.168.16.100:8200', 'DEBUG': True, }
-
requirements.txt
:Click to expand
django-allauth django-guardian django-qr-code django-icons django-phonenumber-field django-crispy-forms django-autocomplete-light django-money django-anymail django-debug-toolbar django-extensions django-countries djangorestframework psycopg2-binary stripe geoip2 redis celery sentry-sdk googlemaps django pillow phonenumbers gunicorn whitenoise elastic-apm
web_1 | Failed to submit message: "Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) (url: http://192.168.16.100:8200/intake/v2/events)"
web_1 | Traceback (most recent call last):
web_1 | File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
web_1 | chunked=chunked)
web_1 | File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
web_1 | six.raise_from(e, None)
web_1 | File "<string>", line 2, in raise_from
web_1 | File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
web_1 | httplib_response = conn.getresponse()
web_1 | File "/usr/local/lib/python3.7/site-packages/sentry_sdk/integrations/stdlib.py", line 48, in getresponse
web_1 | rv = real_getresponse(self, *args, **kwargs)
web_1 | File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
web_1 | response.begin()
web_1 | File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
web_1 | version, status, reason = self._read_status()
web_1 | File "/usr/local/lib/python3.7/http/client.py", line 265, in _read_status
web_1 | raise RemoteDisconnected("Remote end closed connection without"
web_1 | http.client.RemoteDisconnected: Remote end closed connection without response
web_1 |
web_1 | During handling of the above exception, another exception occurred:
web_1 |
web_1 | Traceback (most recent call last):
web_1 | File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/http.py", line 40, in send
web_1 | "POST", url, body=data, headers=self._headers, timeout=self._timeout, preload_content=False
web_1 | File "/usr/local/lib/python3.7/site-packages/urllib3/poolmanager.py", line 323, in urlopen
web_1 | response = conn.urlopen(method, u.request_uri, **kw)
web_1 | File "/usr/local/lib/python3.7/site-packages/elasticapm/instrumentation/packages/base.py", line 106, in call_if_sampling
web_1 | return wrapped(*args, **kwargs)
web_1 | File "/usr/local/lib/python3.7/site-packages/elasticapm/instrumentation/packages/base.py", line 106, in call_if_sampling
web_1 | return wrapped(*args, **kwargs)
web_1 | File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
web_1 | _stacktrace=sys.exc_info()[2])
web_1 | File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 367, in increment
web_1 | raise six.reraise(type(error), error, _stacktrace)
web_1 | File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 685, in reraise
web_1 | raise value.with_traceback(tb)
web_1 | File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
web_1 | chunked=chunked)
web_1 | File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
web_1 | six.raise_from(e, None)
web_1 | File "<string>", line 2, in raise_from
web_1 | File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
web_1 | httplib_response = conn.getresponse()
web_1 | File "/usr/local/lib/python3.7/site-packages/sentry_sdk/integrations/stdlib.py", line 48, in getresponse
web_1 | rv = real_getresponse(self, *args, **kwargs)
web_1 | File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
web_1 | response.begin()
web_1 | File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
web_1 | version, status, reason = self._read_status()
web_1 | File "/usr/local/lib/python3.7/http/client.py", line 265, in _read_status
web_1 | raise RemoteDisconnected("Remote end closed connection without"
web_1 | urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
web_1 |
web_1 | During handling of the above exception, another exception occurred:
web_1 |
web_1 | Traceback (most recent call last):
web_1 | File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/base.py", line 184, in send_sync
web_1 | self.sync_transport.send(self, data)
web_1 | File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/http.py", line 53, in send
web_1 | raise TransportException(message, data, print_trace=print_trace)
web_1 | elasticapm.transport.base.TransportException: Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) (url: http://192.168.16.100:8200/intake/v2/events)
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 5
- Comments: 53 (16 by maintainers)
Commits related to this issue
- set SO_KEEPALIVE on underlying TCP socket of http connection to APM Server this is a tentative fix for #409 — committed to beniwohli/apm-agent-python by beniwohli 5 years ago
@ciarancourtney @MrBones757 sorry to bug you folks again. I’ve implemented another approach that might work: https://github.com/elastic/apm-agent-python/pull/426. Specifically, in this branch, we add some fuzziness to the time between requests to the APM Server (between 90-110% of the configured value). The original reason I implemented this was to avoid stampedes when multiple workers are booted up at the same time, but it might help with this race condition as well.
I started to look at APM server a little bit closely and tried version 7.0.0-alpha2 and tried to find something relevant in server debug logs but got nothing valuable so went back to 6.6.0.
Then I found this setting: https://www.elastic.co/guide/en/apm/server/current/configuration-process.html#read_timeout When changed from default 30s to:
read_timeout: 45s
I’m now unable to reproduce the issue with my script. Also there are no more FIN packets from the server in wireshark. Even those not harmful between HTTP requests.I also noticed that the 30s may very well be racing with 30000ms from here: https://github.com/elastic/apm-agent-python/blob/v4.1.0/elasticapm/conf/__init__.py#L223
Also changing it to 20000ms makes the issue disappear even if read_timeout on the server is still 30s. Not sure if changing it is desired though.
So maybe to workaround this issue either of these setting need to be changed so that the server read_timeout is just longer than agent metrics_interval by changing either of these settings. I’m not sure if this is a proper fix but much better than the current situation.
As for reproduction on your side maybe using the docker-compose.yml for the apm server I posted above is required? It’s also somehow a little bit harder to reproduce if apm server debug logging is enabled.
Also if there is some ticket on apm-server side could you please link it here?
I am getting this error currently. I’m using the elastic.co managed elastic stack, version v7.3.2. I suppose the fix they made to the apm server should be integrated already?
@reveil awesome, thanks! As far as I understand, a
FIN
per se isn’t a problem, if it is only sent after the HTTP response was received. In this case, urllib3 will recognize that the connection was closed and recycle it.I will go ahead and merge #426, even if it isn’t a 100% fix for the problem. It seems to at least substantially reduce the number of connection issues. It’ll take some time until a server side fix can be released (currently slated for 7.1, which is some time out).