apm-agent-python: v4.1.0 Introduces `http.client.RemoteDisconnected` error

Describe the bug: Events reported to the APM Server are randomly rejected. It appears that every so often an event being reported is met with a RemoteDisconnected exception.

To Reproduce

  1. Set up a Django project as per the directions (a minimal settings sketch follows this list)
  2. Hit the homepage of the project
  3. Wait for several events to hit the APM server
  4. One will eventually cause the exception

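A minimal sketch of the Django setup for step 1, based on the agent's documented Django integration (the service name and server URL match the agent config further below):

    # settings.py (sketch only; values mirror the "Agent config options" block below)
    INSTALLED_APPS = [
        # ... the usual Django apps ...
        'elasticapm.contrib.django',
    ]

    ELASTIC_APM = {
        'SERVICE_NAME': 'web-backend',
        'SERVER_URL': 'http://192.168.16.100:8200',
    }

    MIDDLEWARE = [
        'elasticapm.contrib.django.middleware.TracingMiddleware',
        # ... the rest of the middleware stack ...
    ]
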
Expected behavior: No exception.

Environment (please complete the following information)

  • OS: macOS Mojave, running in a python:3 Docker container
  • Python version: 3.7.2
  • Framework and version: Django version 2.1.5
  • APM Server version: apm-server version 6.6.0 (amd64), libbeat 6.6.0
  • Agent version: 4.1.0

Additional context

Downgrading to 4.0.3 remedies the issue.

Directly related: https://discuss.elastic.co/t/unable-to-reach-apm-server-connection-aborted-remotedisconnected-remote-end-closed-connection-without-response/166500

  • Agent config options

    ELASTIC_APM = {
        'SERVICE_NAME': 'web-backend',
        'SECRET_TOKEN': '8149e743-2524-2524-2524-94aadd3fcde7',
        'SERVER_URL': 'http://192.168.16.100:8200',
        'DEBUG': True,
    }
    
    
  • requirements.txt:

    django-allauth
    django-guardian
    django-qr-code
    django-icons
    django-phonenumber-field
    django-crispy-forms
    django-autocomplete-light
    django-money
    django-anymail
    django-debug-toolbar
    django-extensions
    django-countries
    djangorestframework
    psycopg2-binary
    stripe
    geoip2
    redis
    celery
    sentry-sdk
    googlemaps
    django
    pillow
    phonenumbers
    gunicorn
    whitenoise
    elastic-apm
    
web_1      | Failed to submit message: "Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) (url: http://192.168.16.100:8200/intake/v2/events)"
web_1      | Traceback (most recent call last):
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
web_1      |     chunked=chunked)
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
web_1      |     six.raise_from(e, None)
web_1      |   File "<string>", line 2, in raise_from
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
web_1      |     httplib_response = conn.getresponse()
web_1      |   File "/usr/local/lib/python3.7/site-packages/sentry_sdk/integrations/stdlib.py", line 48, in getresponse
web_1      |     rv = real_getresponse(self, *args, **kwargs)
web_1      |   File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
web_1      |     response.begin()
web_1      |   File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
web_1      |     version, status, reason = self._read_status()
web_1      |   File "/usr/local/lib/python3.7/http/client.py", line 265, in _read_status
web_1      |     raise RemoteDisconnected("Remote end closed connection without"
web_1      | http.client.RemoteDisconnected: Remote end closed connection without response
web_1      | 
web_1      | During handling of the above exception, another exception occurred:
web_1      | 
web_1      | Traceback (most recent call last):
web_1      |   File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/http.py", line 40, in send
web_1      |     "POST", url, body=data, headers=self._headers, timeout=self._timeout, preload_content=False
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/poolmanager.py", line 323, in urlopen
web_1      |     response = conn.urlopen(method, u.request_uri, **kw)
web_1      |   File "/usr/local/lib/python3.7/site-packages/elasticapm/instrumentation/packages/base.py", line 106, in call_if_sampling
web_1      |     return wrapped(*args, **kwargs)
web_1      |   File "/usr/local/lib/python3.7/site-packages/elasticapm/instrumentation/packages/base.py", line 106, in call_if_sampling
web_1      |     return wrapped(*args, **kwargs)
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
web_1      |     _stacktrace=sys.exc_info()[2])
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 367, in increment
web_1      |     raise six.reraise(type(error), error, _stacktrace)
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 685, in reraise
web_1      |     raise value.with_traceback(tb)
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
web_1      |     chunked=chunked)
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
web_1      |     six.raise_from(e, None)
web_1      |   File "<string>", line 2, in raise_from
web_1      |   File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
web_1      |     httplib_response = conn.getresponse()
web_1      |   File "/usr/local/lib/python3.7/site-packages/sentry_sdk/integrations/stdlib.py", line 48, in getresponse
web_1      |     rv = real_getresponse(self, *args, **kwargs)
web_1      |   File "/usr/local/lib/python3.7/http/client.py", line 1321, in getresponse
web_1      |     response.begin()
web_1      |   File "/usr/local/lib/python3.7/http/client.py", line 296, in begin
web_1      |     version, status, reason = self._read_status()
web_1      |   File "/usr/local/lib/python3.7/http/client.py", line 265, in _read_status
web_1      |     raise RemoteDisconnected("Remote end closed connection without"
web_1      | urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
web_1      | 
web_1      | During handling of the above exception, another exception occurred:
web_1      | 
web_1      | Traceback (most recent call last):
web_1      |   File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/base.py", line 184, in send_sync
web_1      |     self.sync_transport.send(self, data)
web_1      |   File "/usr/local/lib/python3.7/site-packages/elasticapm/transport/http.py", line 53, in send
web_1      |     raise TransportException(message, data, print_trace=print_trace)
web_1      | elasticapm.transport.base.TransportException: Unable to reach APM Server: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) (url: http://192.168.16.100:8200/intake/v2/events)

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 53 (16 by maintainers)

Most upvoted comments

@ciarancourtney @MrBones757 sorry to bug you folks again. I’ve implemented another approach that might work: https://github.com/elastic/apm-agent-python/pull/426. Specifically, in this branch, we add some fuzziness to the time between requests to the APM Server (between 90-110% of the configured value). The original reason I implemented this was to avoid stampedes when multiple workers are booted up at the same time, but it might help with this race condition as well.

pip install "https://github.com/beniwohli/apm-agent-python/archive/fuzzy-api-request-time.zip#egg=elastic-apm"
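
To illustrate the idea (this is not the agent's actual code from that branch; flush_loop and send_queued_events are hypothetical names), the jitter boils down to scaling the configured interval by a random factor between 0.9 and 1.1:

    import random
    import time


    def jittered_interval(configured_seconds):
        # Scale the configured interval by a random factor in [0.9, 1.1],
        # so concurrent workers don't all flush at exactly the same moment.
        return configured_seconds * random.uniform(0.9, 1.1)


    def flush_loop(api_request_time, send_queued_events):
        # Hypothetical flush loop: wait a jittered interval, then send queued events.
        while True:
            time.sleep(jittered_interval(api_request_time))
            send_queued_events()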

I started to look at the APM Server a little more closely, tried version 7.0.0-alpha2, and looked for something relevant in the server debug logs, but found nothing valuable, so I went back to 6.6.0.

Then I found this setting: https://www.elastic.co/guide/en/apm/server/current/configuration-process.html#read_timeout. When it is changed from the default 30s to read_timeout: 45s, I'm no longer able to reproduce the issue with my script. There are also no more FIN packets from the server in Wireshark, not even the harmless ones between HTTP requests.

I also noticed that the 30s may very well be racing with the 30000ms default here: https://github.com/elastic/apm-agent-python/blob/v4.1.0/elasticapm/conf/__init__.py#L223

Changing that value to 20000ms also makes the issue disappear, even if read_timeout on the server is still 30s. I'm not sure whether changing it is desirable, though.

So, to work around this issue, one of these two settings needs to be changed so that the server's read_timeout is longer than the agent's metrics_interval. I'm not sure this is a proper fix, but it is much better than the current situation.
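
Assuming the racing agent-side value is metrics_interval (as suggested above; that mapping is an assumption, not confirmed), the agent-side half of this workaround could be sketched in the Django settings like this, with the server-side alternative noted in a comment:

    # Sketch only: keep the agent's reporting interval below the APM Server's
    # read_timeout (default 30s). 'METRICS_INTERVAL' is assumed here to be the
    # setting behind the 30000ms default linked above.
    ELASTIC_APM = {
        'SERVICE_NAME': 'web-backend',
        'SERVER_URL': 'http://192.168.16.100:8200',
        'METRICS_INTERVAL': '20s',
    }

    # Server-side alternative, in apm-server.yml:
    #   apm-server:
    #     read_timeout: 45s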

As for reproducing this on your side, maybe using the docker-compose.yml for the APM Server that I posted above is required? It's also somewhat harder to reproduce when APM Server debug logging is enabled.

Also, if there is a ticket on the apm-server side, could you please link it here?

I am currently getting this error. I'm using the elastic.co managed Elastic Stack, version 7.3.2. I suppose the fix they made to the APM Server should already be integrated?

@reveil awesome, thanks! As far as I understand, a FIN per se isn't a problem if it is only sent after the HTTP response has been received. In this case, urllib3 will recognize that the connection was closed and recycle it.

I will go ahead and merge #426, even if it isn't a 100% fix for the problem. It seems to at least substantially reduce the number of connection issues. It'll take some time until a server-side fix can be released (currently slated for 7.1, which is some time out).