moby: Cannot start Apache Spark on multi-host in Swarm mode
I simply cannot schedule instances of Apache Spark over the multi-host network in Swarm mode.
Output of `docker version`:

```
Client:
 Version:      1.12.0-rc4
 API version:  1.24
 Go version:   go1.6.2
 Git commit:   e4a0dbc
 Built:        Wed Jul 13 04:15:13 2016
 OS/Arch:      linux/amd64
 Experimental: true

Server:
 Version:      1.12.0-rc4
 API version:  1.24
 Go version:   go1.6.2
 Git commit:   e4a0dbc
 Built:        Wed Jul 13 04:15:13 2016
 OS/Arch:      linux/amd64
 Experimental: true
```
Output of `docker info`:

```
Containers: 3
 Running: 3
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 1.12.0-rc4
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 12
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host bridge overlay null
Swarm: active
 NodeID: 452huztay9dzxtehh50mzgf9o
 IsManager: Yes
 Managers: 1
 Nodes: 2
 CACertHash: sha256:2445358bb26a09d3b7cf67e0324b187fc6c0c4f71f0ae1413bd7b0838b26594f
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-28-generic
Operating System: Ubuntu 16.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.797 GiB
Name: master
ID: QEVB:ADKU:BJFR:U7PB:NG2T:SBQH:B5NN:ZTNT:BLED:PE57:OY32:NXZQ
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
WARNING: No swap limit support
```
Additional environment details (AWS, VirtualBox, physical, etc.): DigitalOcean
Steps to reproduce the issue:
- Create 2 nodes, `master` and `node1`. Do `docker swarm init` on `master` and `docker swarm join` on `node1` (see the sketch after this list).
- Create an overlay network, in my case: `docker network create --driver overlay spark`
- Run `docker service create --name master --network spark -p 7077:7077 -p 8080:8080 gettyimages/spark:1.6.2-hadoop-2.6 bin/spark-class org.apache.spark.deploy.master.Master`
- Run `docker service create --name worker --network spark -p 8081:8081 gettyimages/spark:1.6.2-hadoop-2.6 bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077`
- Scale `worker` to 2. You'll see it try to schedule a container on `node1`, then fail; both workers end up on `master`.
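For completeness, here is a minimal sketch of the bootstrap and scaling commands left implicit above. The advertise address `10.0.0.1` and the token placeholder are hypothetical, and the sketch uses the released 1.12 `join-token` flow (the rc4 syntax differed slightly):

```bash
# On master (10.0.0.1 is a hypothetical address):
docker swarm init --advertise-addr 10.0.0.1

# Still on master, print the worker join command and token:
docker swarm join-token worker

# On node1, join using the token printed above:
docker swarm join --token <worker-token> 10.0.0.1:2377

# Step 5: scale the worker service to two replicas:
docker service scale worker=2
```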
Describe the results you received: Apache Spark instances were scheduled only on the `master` node.
Describe the results you expected:
Apache Spark instances should be scheduled on both `master` and `node1`. It is likely that containers on `node1` cannot resolve the name `master`, or cannot connect to the `master` IP correctly.
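A quick way to test that hypothesis (illustrative only; the container ID is a placeholder) is to exec into the worker task on `node1` and check name resolution and TCP reachability:

```bash
# Find the worker task's container on node1:
docker ps --filter name=worker

# Check that Docker's embedded DNS resolves the service name:
docker exec -it <container-id> getent hosts master

# Check TCP reachability of the Spark master port (bash /dev/tcp built-in):
docker exec -it <container-id> bash -c 'exec 3<>/dev/tcp/master/7077 && echo reachable'
```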
cc @mavenugo
About this issue
- State: closed
- Created 8 years ago
- Comments: 62 (26 by maintainers)
I still have this problem; any elegant solution?
BTW, I worked around the initial issue of the Spark workers not being able to connect to the Spark master by passing `--host 0.0.0.0` on the master's command line so that it would not bind to a specific IP. This allowed the workers to connect to the master regardless of which IP the `spark-master` hostname resolved to for that particular worker. Unfortunately, the workers' web UI then reports the master URL as `spark://0.0.0.0:7077`, which may cause trouble later…
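Applied to the repro above, that workaround would look roughly like this (a sketch only; it assumes the same image and the Spark master's standard `--host` option):

```bash
# Recreate the master service, telling Spark to bind to all
# interfaces (--host 0.0.0.0) instead of a specific IP:
docker service create --name master --network spark \
  -p 7077:7077 -p 8080:8080 \
  gettyimages/spark:1.6.2-hadoop-2.6 \
  bin/spark-class org.apache.spark.deploy.master.Master --host 0.0.0.0
```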
It's important to notice that the `nginx` approach is not optimal; we should be able to expose the ports on the `master` nodes.

@doxxx The relevant change that went into 1.12.2 was to disable service discovery in the ingress network. Looks like that fixes it, and the libnetwork PR is not necessary for this issue.
@chanwit @rogaha Can we close this? If you want to give it a try with 1.12.2, please do…
@rogaha Thinking about the resolution ambiguity, I think this is easy to fix. We can make a fix to not resolve names in the `ingress` network scope at all, because `ingress` is really a special network that exists to facilitate ingress routing, and we can disable service discovery in that network.

But I did not fully understand the `bind` issue. How does Spark binding to `INADDR_ANY` affect workers seeing the master as `0.0.0.0`? That just seems odd. What protocol do the workers use to resolve the master's IP? If it is just DNS, then after fixing the above ambiguity the workers should be able to resolve the master's IP on the `spark` network. So I don't understand how they see the master as `0.0.0.0`. Can you please clarify?
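For context on the ambiguity: a service published with `-p` is attached to both the `ingress` overlay and the user-defined `spark` overlay, so before the 1.12.2 change a lookup of `master` could return its address on either network. An illustrative check (the container ID is a placeholder; `10.255.0.0/16` is the default `ingress` subnet):

```bash
# The master task has an interface (and IP) on each attached network:
docker network inspect ingress   # default subnet 10.255.0.0/16
docker network inspect spark

# From a worker task, see which address the service name resolves to:
docker exec -it <worker-container-id> getent hosts master
# May return the ingress IP (10.255.0.x) instead of the spark-network IP.
```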