nni: STATUS: Training service error: No such file
First of all, thanks for this great library.
Short summary about the issue/question:
NNI is started on server 1 and connects to server 2 via SSH. The MNIST example runs successfully.
The same scenario is repeated between server 1 and server 3, but the experiment fails. The following error message appears in the web interface: STATUS: Training service error: No such file
On servers 1-3, NNI, TensorFlow, etc. are installed in a virtualenv. This virtualenv is activated automatically after SSH login (via a .bashrc entry).
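The .bashrc entry is essentially just the activation line; the exact venv path below is an assumption based on the paths that appear in the log excerpt further down:

# appended to ~/.bashrc on servers 2 and 3 (venv path assumed from the log below)
source /home/msalz/venv_nni/bin/activate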
nnimanager.log (Server 3):
[2019-8-7 11:19:30] ERROR [ 'Error: No such file\n at SFTPStream._transform (/home/msalz/venv_nni/nni/node_modules/ssh2-streams/lib/sftp.js:412:27)\n at SFTPStream.Transform._read (_stream_transform.js:190:10)\n at SFTPStream._read (/home/msalz/venv_nni/nni/node_modules/ssh2-streams/lib/sftp.js:183:15)\n at SFTPStream.Transform._write (_stream_transform.js:178:12)\n at doWrite (_stream_writable.js:410:12)\n at writeOrBuffer (_stream_writable.js:394:5)\n at SFTPStream.Writable.write (_stream_writable.js:294:11)\n at Channel.ondata (_stream_readable.js:666:20)\n at Channel.emit (events.js:182:13)\n at addChunk (_stream_readable.js:283:12)' ]
[2019-8-7 11:19:30] INFO [ 'Change NNIManager status from: RUNNING to: ERROR' ]
The error occurs with different NNI versions (tested with 0.9 and 0.7).
The web interface uses the default port 8080, and the NNI participants connect via SSH. Does NNI use any other ports in the background? What exactly does the error message mean, and what causes it?
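Since the stack trace points at the SFTP layer of ssh2-streams, a basic SSH/SFTP check from server 1 to server 3 can rule out a plain connectivity problem (IP and username are the same placeholders as in the config below):

# run on server 1
ssh <username>@<Server_3_IP> 'echo ssh ok'
echo 'ls /tmp' | sftp <username>@<Server_3_IP>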
Brief what process you are following:
How to reproduce it:
On all servers I have executed the following setup:
git clone https://github.com/microsoft/nni.git
python3.5 -m venv venv_nni
source venv_nni/bin/activate
cd nni
pip install tensorflow==1.5.0
pip install keras
python -m pip install --upgrade nni
pip install numpy==1.16.4
cd ..
This command is executed only on server 1:
nnictl create --config nni/examples/trials/mnist/config.yml
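When the experiment fails, the status and the manager log on server 1 can be inspected like this (a sketch; the exact nnictl subcommands may differ slightly between NNI versions):

nnictl experiment show     # overall experiment status
nnictl log stderr          # stderr of the NNI manager on server 1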
nni Environment:
- nni version: 0.9.1.1
- nni mode(local|pai|remote): remote
- OS: CentOS Linux 7
- python version: Python 3.5
- is conda or virtualenv used?: virtualenv
- is running in docker?: no
need to update document(yes/no): no
Anything else we need to know:
Here is the config.yml file from server 1:
authorName: default
experimentName: example_mnist
trialConcurrency: 1
maxExecDuration: 1h
maxTrialNum: 10
#choice: local, remote, pai
trainingServicePlatform: remote
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python3 mnist.py
  codeDir: .
  gpuNum: 0
machineList:
  - ip: <Server_3_IP>
    username: <username>
    passwd: <passwd>
    #port can be skipped if using the default ssh port 22
    #port: 22
nniManagerIp: <Server_1_IP>
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 26 (11 by maintainers)
Your GPUs are already in use. By default NNI will not use active GPUs. You may want to set useActiveGpu to change this behaviour. Please refer to the doc for details.

@martsalz Please check the permission of /tmp/nni on server 3 and chmod it to 777 if it's not. Maybe you have run NNI in local mode on server 3 before? This would create the directory with bad permissions.
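Following that suggestion, checking and fixing the directory on server 3 would look roughly like this (removal/sudo is only needed if the directory is owned by a different account):

ls -ld /tmp/nni        # check owner and permissions
chmod 777 /tmp/nni     # if the directory is owned by the SSH user
# otherwise remove it and let NNI recreate it, e.g.: sudo rm -rf /tmp/nni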