nni: STATUS: Training service error: No such file
First of all, thanks for this great library.
Short summary about the issue/question:
NNI is started on server 1 and connects to server 2 via SSH. The MNIST example runs successfully.
The same scenario is repeated between server 1 and server 3, but the experiment fails. The following error message appears in the web interface: STATUS: Training service error: No such file
On servers 1-3, NNI, TensorFlow, etc. are installed in a virtualenv. This virtualenv is activated automatically after SSH login (via a .bashrc entry).
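The .bashrc entry is essentially just the activation line; the exact venv path below is an assumption based on the paths that appear in the log excerpt further down:

# appended to ~/.bashrc on servers 2 and 3 (venv path assumed from the log below)
source /home/msalz/venv_nni/bin/activate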
nnimanager.log (Server 3):
[2019-8-7 11:19:30] ERROR [ 'Error: No such file\n at SFTPStream._transform (/home/msalz/venv_nni/nni/node_modules/ssh2-streams/lib/sftp.js:412:27)\n at SFTPStream.Transform._read (_stream_transform.js:190:10)\n at SFTPStream._read (/home/msalz/venv_nni/nni/node_modules/ssh2-streams/lib/sftp.js:183:15)\n at SFTPStream.Transform._write (_stream_transform.js:178:12)\n at doWrite (_stream_writable.js:410:12)\n at writeOrBuffer (_stream_writable.js:394:5)\n at SFTPStream.Writable.write (_stream_writable.js:294:11)\n at Channel.ondata (_stream_readable.js:666:20)\n at Channel.emit (events.js:182:13)\n at addChunk (_stream_readable.js:283:12)' ]
[2019-8-7 11:19:30] INFO [ 'Change NNIManager status from: RUNNING to: ERROR' ]
The error occurs with different NNI versions (tested with 0.9 and 0.7).
The web interface uses the default port 8080, and the NNI participants connect via SSH. Does NNI use any other ports in the background? What exactly does the error message mean, and what causes it?
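Since the stack trace points at the SFTP layer of ssh2-streams, a basic SSH/SFTP check from server 1 to server 3 can rule out a plain connectivity problem (IP and username are the same placeholders as in the config below):

# run on server 1
ssh <username>@<Server_3_IP> 'echo ssh ok'
echo 'ls /tmp' | sftp <username>@<Server_3_IP>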
Brief what process you are following:
How to reproduce it:
On all servers I have executed the following setup:
git clone https://github.com/microsoft/nni.git
python3.5 -m venv venv_nni
source venv_nni/bin/activate
cd nni
pip install tensorflow==1.5.0
pip install keras
python -m pip install --upgrade nni
pip install numpy==1.16.4
cd ..
This command is executed only on server 1:
nnictl create --config nni/examples/trials/mnist/config.yml
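When the experiment fails, the status and the manager log on server 1 can be inspected like this (a sketch; the exact nnictl subcommands may differ slightly between NNI versions):

nnictl experiment show     # overall experiment status
nnictl log stderr          # stderr of the NNI manager on server 1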
nni Environment:
- nni version: 0.9.1.1
- nni mode(local|pai|remote): remote
- OS: CentOS Linux 7
- python version: Python 3.5
- is conda or virtualenv used?: virtualenv
- is running in docker?: no
need to update document(yes/no): no
Anything else we need to know:
Here is the config.yml file from server 1:
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: remote
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 0
machineList:
  - ip: <Server_3_IP>
    username: <username>
    passwd: <passwd>
    #port can be skipped if using the default ssh port 22
    #port: 22
nniManagerIp: <Server_1_IP>
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 26 (11 by maintainers)
Your GPUs are already in use. By default NNI will not use active GPUs. You may want to set useActiveGpu to change this behaviour. Please refer to the doc for details.

@martsalz Please check the permission of /tmp/nni on server 3 and chmod it to 777 if it's not. Maybe you have run NNI in local mode on server 3 before? This would create the directory with bad permissions.
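Following that suggestion, checking and fixing the directory on server 3 would look roughly like this (removal/sudo is only needed if the directory is owned by a different account):

ls -ld /tmp/nni        # check owner and permissions
chmod 777 /tmp/nni     # if the directory is owned by the SSH user
# otherwise remove it and let NNI recreate it, e.g.: sudo rm -rf /tmp/nni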