nni: Failed to establish a new connection

I am trying to use nni on the HPC cluster at our school. The code works on my own computer. The HPC has many compute nodes, and we have to submit tasks from the manager node. But this error is raised:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=17513): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2b352d9c6f28>: Failed to establish a new connection: [Errno 111] Connection refused',))

I think it might be related to the URL. Maybe I should use nniManagerIP to fix this problem? What host should I specify?

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 31 (10 by maintainers)

Most upvoted comments

Hi, everyone! I ran into the same problem when running my code with nni v2.8, but the same code works with nni v2.5. Installing nni v2.5 may be a workaround, and I hope someone can find out what is wrong in the newest version.

The details:

(pytorch) wy@Tiger:~/mnist-pytorch$ nnictl create --config config_windows.yml
[2022-06-09 13:32:46] Creating experiment, Experiment ID: k5doghe7
[2022-06-09 13:32:46] Starting web server...
[2022-06-09 13:32:47] WARNING: Timeout, retry...
[2022-06-09 13:32:48] WARNING: Timeout, retry...
[2022-06-09 13:32:49] ERROR: Create experiment failed
Traceback (most recent call last):
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/adapters.py", line 450, in send
    timeout=timeout
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 786, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/wy/.local/lib/python3.7/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wy/.local/bin/nnictl", line 8, in <module>
    sys.exit(parse_args())
  File "/home/wy/.local/lib/python3.7/site-packages/nni/tools/nnictl/nnictl.py", line 497, in parse_args
    args.func(args)
  File "/home/wy/.local/lib/python3.7/site-packages/nni/tools/nnictl/launcher.py", line 92, in create_experiment
    exp.start(port, debug, run_mode)
  File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/experiment.py", line 117, in start
    self._proc = launcher.start_experiment(self._action, self.id, config, port, debug, run_mode, self.url_prefix)
  File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/launcher.py", line 119, in start_experiment
    raise e
  File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/launcher.py", line 97, in start_experiment
    _check_rest_server(port, url_prefix=url_prefix)
  File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/launcher.py", line 258, in _check_rest_server
    rest.get(port, '/check-status', url_prefix)
  File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/rest.py", line 43, in get
    return request('get', port, api, prefix=prefix)
  File "/home/wy/.local/lib/python3.7/site-packages/nni/experiment/rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/home/wy/miniconda3/envs/pytorch/lib/python3.7/site-packages/requests/adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused'))

Rolling back to version 2.5 worked for me; the error no longer appears.

We have the same problem:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /api/v1/nni/check-status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdbe7b0a250>: Failed to establish a new connection: [Errno 111] Connection refused'))

Hi, @SparkSnail @kvartet! I have left the institute and no longer use the HPC, so I can hardly test the new version. Sorry about that. If I get the chance, I will try it ASAP.

I think the confusing part is that we submit tasks through a queue system like PBS, so I am not sure how to write the script so that the trials do not run on the management node. If you have any ideas, please update the tutorial 😃 It would be much more helpful for those who are not familiar with Linux!

Thanks again for your selfless help!

I have a simple fix for this issue: give it more retries.

https://github.com/microsoft/nni/blob/e101717234a9c2b44ea62cea4492b9f391824c0f/nni/experiment/launcher.py#L125

Change the line to the following:

_check_rest_server(port, retry=30, url_prefix=url_prefix)

Many people work on clusters with limited CPU resources, where 3 seconds may be too short for the server to start.
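To illustrate the idea of "more retries" outside of NNI's own code, here is a minimal standalone sketch that polls the check-status URL from the error message until it answers. The names `check_status_once` and `wait_for_rest_server`, and the `delay` and `probe` parameters, are hypothetical and not part of NNI; only the `/api/v1/nni/check-status` path and the `retry=30` value come from this thread.

```python
import time
import urllib.error
import urllib.request


def check_status_once(port, timeout=1.0):
    """Return True if the check-status endpoint answers with HTTP 200."""
    # URL path taken from the error message in this thread.
    url = f"http://localhost:{port}/api/v1/nni/check-status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, etc.: the server is not up (yet).
        return False


def wait_for_rest_server(port, retry=30, delay=1.0, probe=check_status_once):
    """Poll up to `retry` times, sleeping `delay` seconds between attempts.

    Hypothetical helper, not NNI's implementation. Returns the number of
    attempts needed, or raises ConnectionError if the server never answers.
    """
    for attempt in range(1, retry + 1):
        if probe(port):
            return attempt
        if attempt < retry:
            time.sleep(delay)
    raise ConnectionError(
        f"REST server on port {port} did not respond after {retry} attempts"
    )
```

Calling `wait_for_rest_server(8080)` after starting an experiment on port 8080 would show roughly how long the server takes to come up on a slow node, which can help decide whether a larger retry count is really what is needed.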

As of v2.10, this error generally means "NNI manager failed to start". Because NNI manager runs in a separate process, we have trouble displaying the real reason it failed to start; we can only tell that we can't connect to that process.

To see the real exception, please go to ~/nni-experiments/<experiment_id> and check the logs inside.
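As a convenience, the log check can be scripted. The sketch below is only an illustration: `tail_experiment_logs` is a hypothetical helper, and it assumes the logs are files ending in `.log` somewhere under the experiment directory, which may not match every NNI version.

```python
from pathlib import Path


def tail_experiment_logs(experiment_id, lines=20, base_dir=None):
    """Return {log_path: last `lines` lines} for each *.log file of an experiment.

    Hypothetical helper. Assumes logs live under
    ~/nni-experiments/<experiment_id> and end in ".log".
    """
    base = Path(base_dir) if base_dir is not None else Path.home() / "nni-experiments"
    exp_dir = base / experiment_id
    if not exp_dir.is_dir():
        return {}  # nothing to show: wrong ID or experiment directory missing
    tails = {}
    # Search recursively so the exact sub-folder layout does not matter.
    for log_file in sorted(exp_dir.rglob("*.log")):
        text = log_file.read_text(encoding="utf-8", errors="replace")
        tails[str(log_file)] = text.splitlines()[-lines:]
    return tails
```

For the experiment shown earlier in this thread, `tail_experiment_logs("k5doghe7")` would return the last lines of each log file found, which is usually where the real exception appears.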

I got the same error in v2.9.