alluxio: Unable to resolve nameservices for HA HDFS when HA HDFS in kubernetes

Alluxio Version: 2.4.1-1

Describe the bug

1. High-availability HDFS is deployed in Kubernetes with the following configuration files: core-site.txt, hdfs-site.txt

2. Alluxio starts the master using a ConfigMap, as shown in alluxio-configmap.txt, with -Dalluxio.master.mount.table.root.ufs=hdfs://hdfs-k8s/alluxio

3. But the Alluxio master fails with java.net.UnknownHostException: hdfs-k8s

2021-03-08 08:10:20,883 WARN  MetricRegistriesImpl - First MetricRegistry has been created without registering reporters. You may need to call MetricRegistries.global().addReportRegistration(...) before.
2021-03-08 08:10:20,885 INFO  RaftJournalSystem - Performing catchup. Last applied SN: 4. Catchup ID: -3575425480893704492
2021-03-08 08:10:20,886 INFO  RaftServerConfigKeys - raft.server.write.element-limit = 4096 (default)
2021-03-08 08:10:20,887 INFO  RaftServerConfigKeys - raft.server.write.byte-limit = 167772160 (custom)
2021-03-08 08:10:20,890 INFO  RaftJournalSystem - Exception submitting term start entry: java.util.concurrent.ExecutionException: org.apache.ratis.protocol.LeaderNotReadyException: alluxio-master-1_19200@group-ABB3109A44C1 is in LEADER state but not ready yet.
2021-03-08 08:10:20,894 INFO  RaftServerConfigKeys - raft.server.watch.timeout = 10s (default)
2021-03-08 08:10:20,895 INFO  RaftServerConfigKeys - raft.server.watch.timeout.denomination = 1s (default)
2021-03-08 08:10:20,896 INFO  RaftServerConfigKeys - raft.server.watch.element-limit = 65536 (default)
2021-03-08 08:10:20,908 INFO  RaftServerConfigKeys - raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
2021-03-08 08:10:20,908 INFO  RaftServerConfigKeys - raft.server.log.appender.buffer.byte-limit = 10485760 (custom)
2021-03-08 08:10:20,908 INFO  RaftServerConfigKeys - raft.server.log.appender.buffer.element-limit = 0 (default)
2021-03-08 08:10:20,913 INFO  GrpcConfigKeys - raft.grpc.server.leader.outstanding.appends.max = 128 (default)
2021-03-08 08:10:20,913 INFO  RaftServerConfigKeys - raft.server.rpc.request.timeout = 5000ms (custom)
2021-03-08 08:10:20,914 INFO  RaftServerConfigKeys - raft.server.log.appender.install.snapshot.enabled = false (custom)
2021-03-08 08:10:20,914 INFO  RatisMetrics - Creating Metrics Registry : ratis_grpc.log_appender.alluxio-master-1_19200@group-ABB3109A44C1
2021-03-08 08:10:20,914 WARN  MetricRegistriesImpl - First MetricRegistry has been created without registering reporters. You may need to call MetricRegistries.global().addReportRegistration(...) before.
2021-03-08 08:10:20,919 INFO  RaftServerConfigKeys - raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
2021-03-08 08:10:20,919 INFO  RaftServerConfigKeys - raft.server.log.appender.buffer.byte-limit = 10485760 (custom)
2021-03-08 08:10:20,919 INFO  RaftServerConfigKeys - raft.server.log.appender.buffer.element-limit = 0 (default)
2021-03-08 08:10:20,919 INFO  GrpcConfigKeys - raft.grpc.server.leader.outstanding.appends.max = 128 (default)
2021-03-08 08:10:20,920 INFO  RaftServerConfigKeys - raft.server.rpc.request.timeout = 5000ms (custom)
2021-03-08 08:10:20,920 INFO  RaftServerConfigKeys - raft.server.log.appender.install.snapshot.enabled = false (custom)
2021-03-08 08:10:20,922 INFO  RoleInfo - alluxio-master-1_19200: start LeaderState
2021-03-08 08:10:20,936 INFO  SegmentedRaftLogWorker - alluxio-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: Rolling segment log-47_50 to index:50
2021-03-08 08:10:20,946 INFO  SegmentedRaftLogWorker - alluxio-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: Rolled log segment from /journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_47 to /journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_47-50
2021-03-08 08:10:21,090 INFO  SegmentedRaftLogWorker - alluxio-master-1_19200@group-ABB3109A44C1-SegmentedRaftLogWorker: created new log segment /journal/raft/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_51
2021-03-08 08:10:21,890 INFO  RaftJournalSystem - Performing catchup. Last applied SN: 4. Catchup ID: -7761596533413317960
2021-03-08 08:10:41,917 INFO  RaftJournalSystem - Caught up in 21032ms. Last sequence number from previous term: 4.
2021-03-08 08:10:41,923 INFO  AbstractMaster - MetricsMaster: Starting primary master.
2021-03-08 08:10:41,925 INFO  MetricsSystem - Reset all metrics in the metrics system in 1ms
2021-03-08 08:10:41,925 INFO  MetricsStore - Cleared the metrics store and metrics system in 1 ms
2021-03-08 08:10:41,926 INFO  AbstractMaster - BlockMaster: Starting primary master.
2021-03-08 08:10:41,927 INFO  AbstractMaster - FileSystemMaster: Starting primary master.
2021-03-08 08:10:41,928 INFO  DefaultFileSystemMaster - Starting fs master as primary
2021-03-08 08:10:41,948 INFO  AbstractMaster - MetaMaster: Starting primary master.
2021-03-08 08:10:41,971 INFO  DefaultMetaMaster - Detected existing cluster ID 0efda228-6f86-4bb4-b467-3dc68899d970
2021-03-08 08:10:41,998 ERROR HeartbeatThread - Uncaught exception in heartbeat executor, Heartbeat Thread shutting down
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2051)
	at com.google.common.cache.LocalCache.get(LocalCache.java:3951)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3974)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4958)
	at alluxio.underfs.hdfs.HdfsUnderFileSystem.getFs(HdfsUnderFileSystem.java:811)
	at alluxio.underfs.hdfs.HdfsUnderFileSystem.getSpace(HdfsUnderFileSystem.java:388)
	at alluxio.underfs.UnderFileSystemWithLogging$26.call(UnderFileSystemWithLogging.java:595)
	at alluxio.underfs.UnderFileSystemWithLogging$26.call(UnderFileSystemWithLogging.java:592)
	at alluxio.underfs.UnderFileSystemWithLogging.call(UnderFileSystemWithLogging.java:1208)
	at alluxio.underfs.UnderFileSystemWithLogging.getSpace(UnderFileSystemWithLogging.java:592)
	at alluxio.master.file.DefaultFileSystemMaster$Metrics.lambda$registerGauges$3(DefaultFileSystemMaster.java:4368)
	at alluxio.master.file.DefaultFileSystemMaster$TimeSeriesRecorder.heartbeat(DefaultFileSystemMaster.java:4137)
	at alluxio.heartbeat.HeartbeatThread.run(HeartbeatThread.java:119)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
	at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
	at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
	at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
	at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
	at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at alluxio.underfs.hdfs.HdfsUnderFileSystem$1.load(HdfsUnderFileSystem.java:169)
	at alluxio.underfs.hdfs.HdfsUnderFileSystem$1.load(HdfsUnderFileSystem.java:155)
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
	... 17 more
Caused by: java.net.UnknownHostException: hdfs-k8s
	... 32 more
2021-03-08 08:10:42,013 INFO  BackupTracker - Resetting backup tracker.
2021-03-08 08:10:42,015 INFO  BackupLeaderRole - Creating backup-leader role.
2021-03-08 08:10:42,015 INFO  AbstractMaster - TableMaster: Starting primary master.
2021-03-08 08:10:42,017 INFO  AlluxioMasterProcess - All masters started
2021-03-08 08:10:42,022 INFO  MetricsSystem - Starting sinks with config: {}.
2021-03-08 08:10:42,022 INFO  AlluxioMasterProcess - Alluxio master web server version 2.4.1-1 starting (gained leadership). webAddress=/0.0.0.0:19999
2021-03-08 08:10:42,049 INFO  log - Logging initialized @68234ms to org.eclipse.jetty.util.log.Slf4jLog
2021-03-08 08:10:42,357 INFO  WebServer - Alluxio Master Web service starting @ /0.0.0.0:19999
2021-03-08 08:10:42,360 INFO  Server - jetty-9.4.31.v20200723; built: 2020-07-23T17:57:36.812Z; git: 450ba27947e13e66baa8cd1ce7e85a4461cacc1d; jvm 1.8.0_212-b04
2021-03-08 08:10:42,413 INFO  ContextHandler - Started o.e.j.s.ServletContextHandler@b4836c5{/metrics/prometheus,null,AVAILABLE}
2021-03-08 08:10:42,414 INFO  ContextHandler - Started o.e.j.s.ServletContextHandler@7048da95{/metrics/json,null,AVAILABLE}
2021-03-08 08:10:42,416 WARN  SecurityHandler - ServletContext@o.e.j.s.ServletContextHandler@5b1b1956{/,file:///opt/alluxio-2.4.1-1/webui/master/build/,STARTING} has uncovered http methods for path: /
2021-03-08 08:11:00,293 INFO  ContextHandler - Started o.e.j.s.ServletContextHandler@5b1b1956{/,file:///opt/alluxio-2.4.1-1/webui/master/build/,AVAILABLE}
2021-03-08 08:11:00,311 INFO  AbstractConnector - Started ServerConnector@4081b016{HTTP/1.1, (http/1.1)}{0.0.0.0:19999}
2021-03-08 08:11:00,311 INFO  Server - Started @86496ms
2021-03-08 08:11:00,311 INFO  WebServer - Alluxio Master Web service started @ /0.0.0.0:19999
2021-03-08 08:11:00,333 INFO  AlluxioMasterProcess - Alluxio master version 2.4.1-1 started (gained leadership). bindAddress=/0.0.0.0:19998, connectAddress=alluxio-master-1:19998, webAddress=/0.0.0.0:19999
2021-03-08 08:11:00,335 INFO  AlluxioMasterProcess - Starting Alluxio master gRPC server on address /0.0.0.0:19998
2021-03-08 08:11:00,502 INFO  MasterProcess - registered service METRICS_MASTER_CLIENT_SERVICE
2021-03-08 08:11:00,700 INFO  MasterProcess - registered service BLOCK_MASTER_CLIENT_SERVICE
2021-03-08 08:11:00,700 INFO  MasterProcess - registered service BLOCK_MASTER_WORKER_SERVICE
2021-03-08 08:11:01,696 INFO  MasterProcess - registered service FILE_SYSTEM_MASTER_JOB_SERVICE
2021-03-08 08:11:01,697 INFO  MasterProcess - registered service FILE_SYSTEM_MASTER_WORKER_SERVICE
2021-03-08 08:11:01,698 INFO  MasterProcess - registered service FILE_SYSTEM_MASTER_CLIENT_SERVICE
2021-03-08 08:11:01,842 INFO  MasterProcess - registered service META_MASTER_CONFIG_SERVICE
2021-03-08 08:11:01,842 INFO  MasterProcess - registered service META_MASTER_BACKUP_MESSAGING_SERVICE
2021-03-08 08:11:01,842 INFO  MasterProcess - registered service RAFT_JOURNAL_SERVICE
2021-03-08 08:11:01,842 INFO  MasterProcess - registered service META_MASTER_CLIENT_SERVICE
2021-03-08 08:11:01,842 INFO  MasterProcess - registered service META_MASTER_MASTER_SERVICE
2021-03-08 08:11:01,891 INFO  MasterProcess - registered service TABLE_MASTER_CLIENT_SERVICE
2021-03-08 08:11:01,963 INFO  DefaultSafeModeManager - Rpc server started, waiting 5000ms for workers to register
2021-03-08 08:11:01,964 INFO  AlluxioMasterProcess - Started Alluxio master gRPC server on address alluxio-master-1:19998
2021-03-08 08:11:01,972 INFO  FaultTolerantAlluxioMasterProcess - Primary started
2021-03-08 08:11:02,563 WARN  DefaultBlockMaster - Could not find worker id: 4512984809611543378 for heartbeat.
2021-03-08 08:11:02,628 INFO  DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=192.168.1.116, containerHost=172.31.141.184, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.116, rack=null)} id: 7934004638968114946
2021-03-08 08:11:02,678 INFO  DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=7934004638968114946, workerAddress=WorkerNetAddress{host=192.168.1.116, containerHost=172.31.141.184, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.116, rack=null)}, capacityBytes=34359738368, usedBytes=0, lastUpdatedTimeMs=1615191062678, blocks=[], lostStorage={}}
2021-03-08 08:11:02,807 INFO  DefaultMetaMaster - getMasterId(): MasterAddress: alluxio-master-0:19998 id: 2798209153424597338
2021-03-08 08:11:02,864 INFO  DefaultMetaMaster - registerMaster(): master: MasterInfo{id=2798209153424597338, address=alluxio-master-0:19998, lastUpdatedTimeMs=1615191062862}
2021-03-08 08:11:03,782 WARN  DefaultBlockMaster - Could not find worker id: 5061862824333568095 for heartbeat.
2021-03-08 08:11:03,801 INFO  DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=192.168.1.117, containerHost=172.31.228.236, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.117, rack=null)} id: 951288178340284032
2021-03-08 08:11:03,816 INFO  DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=951288178340284032, workerAddress=WorkerNetAddress{host=192.168.1.117, containerHost=172.31.228.236, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.117, rack=null)}, capacityBytes=34359738368, usedBytes=0, lastUpdatedTimeMs=1615191063815, blocks=[], lostStorage={}}
2021-03-08 08:11:04,544 INFO  DefaultMetaMaster - getMasterId(): MasterAddress: alluxio-master-2:19998 id: 6474051121844857814
2021-03-08 08:11:04,584 INFO  DefaultMetaMaster - registerMaster(): master: MasterInfo{id=6474051121844857814, address=alluxio-master-2:19998, lastUpdatedTimeMs=1615191064583}
2021-03-08 08:11:04,726 WARN  DefaultBlockMaster - Could not find worker id: 7799114670528034177 for heartbeat.
2021-03-08 08:11:04,747 INFO  DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=192.168.1.115, containerHost=172.31.134.240, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.115, rack=null)} id: 243034216886158838
2021-03-08 08:11:04,764 INFO  DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=243034216886158838, workerAddress=WorkerNetAddress{host=192.168.1.115, containerHost=172.31.134.240, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.115, rack=null)}, capacityBytes=34359738368, usedBytes=0, lastUpdatedTimeMs=1615191064763, blocks=[], lostStorage={}}
2021-03-08 08:11:04,981 WARN  DefaultBlockMaster - Could not find worker id: 8145672099464782622 for heartbeat.
2021-03-08 08:11:04,998 INFO  DefaultBlockMaster - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=192.168.1.118, containerHost=172.31.229.245, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.118, rack=null)} id: 1427187059885272881
2021-03-08 08:11:05,035 INFO  DefaultBlockMaster - registerWorker(): MasterWorkerInfo{id=1427187059885272881, workerAddress=WorkerNetAddress{host=192.168.1.118, containerHost=172.31.229.245, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=192.168.1.118, rack=null)}, capacityBytes=34359738368, usedBytes=0, lastUpdatedTimeMs=1615191065035, blocks=[], lostStorage={}}
2021-03-08 08:11:09,220 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:11:24,145 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:11:39,115 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:11:54,117 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:12:09,156 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:12:24,115 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:12:39,115 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
2021-03-08 08:12:54,149 WARN  RestUtils - Unexpected error invoking rest endpoint: java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfs-k8s
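
The second "Caused by" section of the trace above goes through NameNodeProxies.createNonHAProxy, which suggests the HDFS client did not see HA settings for hdfs-k8s and treated it as a plain hostname. Below is a minimal sketch of checking whether an hdfs-site.xml carries the client-side HA keys, assuming the Hadoop 2.x property names (the same keys that appear in the Spark hadoopConf later in this thread). The function names and the sample XML, including the nn0.example address, are hypothetical and not part of Alluxio or Hadoop:

```python
import xml.etree.ElementTree as ET


def load_hadoop_props(xml_text):
    """Parse Hadoop-style <configuration><property> XML into a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def missing_ha_keys(props, nameservice):
    """Return the client-side HA keys that are absent for the nameservice."""
    missing = []
    declared = [s.strip() for s in props.get("dfs.nameservices", "").split(",") if s.strip()]
    if nameservice not in declared:
        missing.append("dfs.nameservices")
    provider_key = "dfs.client.failover.proxy.provider." + nameservice
    if provider_key not in props:
        missing.append(provider_key)
    nn_key = "dfs.ha.namenodes." + nameservice
    nn_ids = [n.strip() for n in props.get(nn_key, "").split(",") if n.strip()]
    if not nn_ids:
        missing.append(nn_key)
    for nn_id in nn_ids:
        rpc_key = "dfs.namenode.rpc-address.%s.%s" % (nameservice, nn_id)
        if rpc_key not in props:
            missing.append(rpc_key)
    return missing


# Hypothetical, deliberately incomplete sample: no failover provider, no nn1 address.
SAMPLE = """<configuration>
  <property><name>dfs.nameservices</name><value>hdfs-k8s</value></property>
  <property><name>dfs.ha.namenodes.hdfs-k8s</name><value>nn0,nn1</value></property>
  <property><name>dfs.namenode.rpc-address.hdfs-k8s.nn0</name><value>nn0.example:8020</value></property>
</configuration>"""

if __name__ == "__main__":
    for key in missing_ha_keys(load_hadoop_props(SAMPLE), "hdfs-k8s"):
        print("missing:", key)
```

An empty result only says the keys are present in that file. It does not prove that the Alluxio master actually loaded the file.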

Expected behavior

In Alluxio, I want to access HA HDFS through its nameservice.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (10 by maintainers)

Most upvoted comments

@gaozhenhai I just finished testing this out in my own environment; I’ll share the details with you at the end.

Secret Permissions

Regarding the permissions of your core-site.xml and hdfs-site.xml, it turns out Kubernetes does not support changing ownership of secret-mounted volumes. As a workaround, you can use an initContainer to copy the secrets into a separate volume which you can change the permissions for:

    spec:
      securityContext:
        fsGroup: 1000
      initContainers:
        - name: fix-permissions
          image: debian:buster-slim
          command: ["/bin/bash", "-c"]
          args: 
          - cp -RL /mnt/secrets/hdfsconfig/* /secrets/hdfsconfig;
            chown -R 1000:1000 /secrets/hdfsconfig/;
            chmod -R 755 /secrets/hdfsconfig;
            ls -l /secrets/hdfsconfig/;
          volumeMounts:
          - name: hdfs-secret
            mountPath: /secrets/hdfsconfig
          - name: hdfs-secret-mount
            mountPath: /mnt/secrets/hdfsconfig
          securityContext:
            runAsUser: 0
 ...
      volumes:
      - name: hdfs-secret
        emptyDir: {}
      - name: hdfs-secret-mount
        secret:
          secretName: alluxio-hdfs-config

After doing this, you should see the following permissions in your pods’ containers:

$ kubectl exec -it alluxio-master-0 -c alluxio-master /bin/bash
bash-4.4$ ls -l /secrets
total 0
drwxr-sr-x    2 alluxio  alluxio         48 Mar 26 00:40 hdfsconfig
bash-4.4$ ls -l /secrets/hdfsconfig/
total 8
-rwxr-xr-x    1 alluxio  alluxio        493 Mar 26 00:40 core-site.xml
-rwxr-xr-x    1 alluxio  alluxio       2302 Mar 26 00:40 hdfs-site.xml

You should add this initContainer and volumes to both alluxio-master-statefulset.yaml and alluxio-worker-daemonset.yaml. You’ll also need to add the following volumeMount to both the main container and the ‘job’ container:

            volumeMounts:
            - name: hdfs-secret
              mountPath: /secrets/hdfsconfig

Alluxio Configuration Properties

Regarding alluxio-configmap.yaml you’ll need to change -Dalluxio.underfs.hdfs.configuration=/secrets/hdfsconfig/core-site.xml:/secrets/hdfsConfig/hdfs-site.xml to -Dalluxio.master.mount.table.root.option.alluxio.underfs.hdfs.configuration=/secrets/hdfsconfig/core-site.xml:/secrets/hdfsconfig/hdfs-site.xml

  • We need to add the prefix alluxio.master.mount.table.root.option. to alluxio.underfs.hdfs.configuration
  • Also note the corrected typo: /secrets/hdfsConfig becomes /secrets/hdfsconfig
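
The hdfsConfig typo is easy to miss because the property value is just a colon-separated list of file paths, and the thread reports no explicit error pointing at the bad path. Below is a minimal sketch of a check that could be run inside the master container, assuming the value is split on ":". The helper name unreadable_conf_files is hypothetical and not part of Alluxio:

```python
import os


def unreadable_conf_files(value):
    """Return the paths in a colon-separated list that are missing or unreadable."""
    bad = []
    for path in value.split(":"):
        path = path.strip()
        if not path:
            continue
        # A path must exist as a regular file and be readable by the current user.
        if not (os.path.isfile(path) and os.access(path, os.R_OK)):
            bad.append(path)
    return bad


if __name__ == "__main__":
    # The original (mistyped) value from alluxio-configmap.yaml.
    value = "/secrets/hdfsconfig/core-site.xml:/secrets/hdfsConfig/hdfs-site.xml"
    for path in unreadable_conf_files(value):
        print("cannot read:", path)
```

Because the check runs as the current user, it should be executed as the same user as the Alluxio process to be meaningful.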

Testing Environment

alluxio-hdfs-k8s.tar.gz

  • hdfs-k8s.yaml is the HA Hadoop YAML files generated by Helm from this HDFS Helm chart (with some manual tweaks to fix typos)
    • config.yaml is the values used to generate that Helm template via helm template r1 charts/hdfs-k8s -f config.yaml > hdfs-k8s.yaml
  • pvs/ contains some scratch PersistentVolume definitions for the HA Hadoop pods
  • alluxio/ contains the Alluxio YAML files derived from our Helm chart:
    • alluxio-configmap.yaml
    • alluxio-master-statefulset.yaml
    • alluxio-master-service.yaml
    • alluxio-worker-daemonset.yaml
    • secret.yaml is a Secret containing the base64-encoded HDFS XMLs
      • hdfs-site.xml
      • core-site.xml
  1. kubectl apply -f pvs/
  2. kubectl apply -f hdfs-k8s.yaml and wait for the Zookeeper -> Namenodes -> Datanodes to all be Running
  3. kubectl apply -f alluxio/ and wait for the Master and Worker Pods to be started

With this setup, Alluxio was able to connect to the HA HDFS nameservice as its UFS. Unfortunately, I wasn’t able to configure HDFS permissions properly to get Alluxio to persist files into HDFS, but it did connect to the nameservice endpoint successfully.

Conclusion

Let me know if this resolves your issue, thanks!

@gaozhenhai Something was brought to my attention: can you kubectl exec into your Alluxio master Pod(s) and show the permissions of the mounted HDFS configs? E.g.:

$ kubectl -n gaozh exec -it alluxio-master-0 -c alluxio-master /bin/bash
# ls -l /secrets/

I suspect those will be owned by root:root and aren’t readable by the Alluxio process (which runs as 1000:1000). This is an issue with our Helm templates, which we are fixing in #13061. In the meantime, you can adjust your alluxio-master-statefulset.yaml to contain the following:

spec:
  template:
    spec:
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000

Adding this change should allow the Secret passed as a Volume to be owned by 1000:1000. Let me know if this is the case or not and whether that solves your issue. Thanks!
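
Whether the Alluxio process can read those files comes down to the usual POSIX owner/group/other mode bits. Below is a minimal sketch of that rule, assuming the process runs as UID and GID 1000 as stated above, and ignoring ACLs and root's override. The helper name can_read and the example modes are hypothetical:

```python
import stat


def can_read(mode, owner_uid, owner_gid, uid, gids):
    """Apply POSIX read-permission rules: owner bits, then group bits, then other bits."""
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


if __name__ == "__main__":
    # Hypothetical case: file owned by root:root with mode 0600, process is 1000:1000.
    print(can_read(0o600, 0, 0, uid=1000, gids={1000}))
    # Hypothetical case: group is 1000 and the group-read bit is set (mode 0640).
    print(can_read(0o640, 0, 1000, uid=1000, gids={1000}))
```

The mode, owner and group to feed in are the ones shown by ls -l /secrets/ in the master container.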

@ZhuTopher The permissions for the mounted HDFS configuration are shown in the attached image.

I updated the spec.template.spec.securityContext fields and restarted the pod, but I still get the error: java.net.UnknownHostException: hdfs-k8s

...
spec:
  selector:
    matchLabels:
      app: alluxio
      role: alluxio-master
      name: alluxio-master
  serviceName: alluxio-master
  replicas: 3
  template:
    metadata:
      labels:
        name: alluxio-master
        app: alluxio
        chart: alluxio-0.6.11
        release: alluxio
        heritage: Helm
        role: alluxio-master
    spec:
      hostNetwork: false
      dnsPolicy: ClusterFirst
      nodeSelector:
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
...

I’m not sure it’s a permission issue, because Alluxio doesn’t print any permission errors. You can use my YAML files and images to install an HA HDFS environment on Kubernetes to find the root cause of the problem.

I have no problem defining the HDFS namenode in hdfs-config.yaml using the HDFS headless service, since all my applications, including Alluxio, are deployed in Kubernetes.

If HA HDFS is accessed through hdfs://my-hdfs-namenode.gaozh.svc.cluster.local/alluxio, the following two problems may appear:

1. An HA HDFS deployment has an active and a standby namenode. Accessing it through the namenode’s service name may send requests to either the active or the standby node, randomly or in rotation.

2. After an active/standby switch of the namenodes, Alluxio cannot use hdfs://my-hdfs-namenode.gaozh.svc.cluster.local/alluxio to find the currently active node.

When I run Spark on Kubernetes, I can successfully access HA HDFS using the nameservice in hdfs-config.yaml with the following simple configuration:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: atlas-stream-gaozh
spec:
  type: Scala
  mode: cluster
  sparkVersion: "2.4.5"
  image: "192.168.3.44/system_containers/spark-jar:v2.4.5"
  imagePullPolicy: Always
  mainClass: com.kubedata.insertdemo.Lineage
  mainApplicationFile: "hdfs://hdfs-k8s/jars/InsertDemo-1.0-SNAPSHOT.jar"
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "hdfs://hdfs-k8s/spark-event/"
    "spark.extraListeners": "com.hortonworks.spark.atlas.SparkAtlasEventTracker"
    "spark.sql.queryExecutionListeners": "com.hortonworks.spark.atlas.SparkAtlasEventTracker"
    "spark.sql.streaming.streamingQueryListeners": "com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker"
    "spark.driver.extraClassPath": "/root/config"
  hadoopConf:
    dfs.nameservices: hdfs-k8s
    dfs.ha.namenodes.hdfs-k8s: nn0,nn1
    dfs.namenode.rpc-address.hdfs-k8s.nn0: my-hdfs-namenode-0.my-hdfs-namenode.test.svc.cluster.local:8020
    dfs.namenode.rpc-address.hdfs-k8s.nn1: my-hdfs-namenode-1.my-hdfs-namenode.test.svc.cluster.local:8020
    dfs.client.failover.proxy.provider.hdfs-k8s: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
...

The purpose of using a nameservice is to let the client find the currently active node by parsing core-site.xml and hdfs-site.xml, even after the active and standby nodes have switched.

If Alluxio does not properly load core-site.xml and hdfs-site.xml, it cannot find the two namenodes behind the nameservice and will not be aware of an active/standby switch.
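
The lookup described above can be sketched as follows: given the client-side properties, a nameservice maps to a list of namenode RPC addresses, and the failover proxy provider then picks the active one among them. This sketch only covers the mapping step, using the property values from the Spark hadoopConf above. The function name namenode_addresses is hypothetical, and this is not how Hadoop or Alluxio implements it:

```python
def namenode_addresses(conf, nameservice):
    """Map an HA nameservice to its namenode RPC addresses using client-side properties."""
    declared = [s.strip() for s in conf.get("dfs.nameservices", "").split(",")]
    if nameservice not in declared:
        # Without this declaration the name would be treated as a plain hostname.
        raise KeyError("not declared in dfs.nameservices: " + nameservice)
    ids = conf.get("dfs.ha.namenodes." + nameservice, "")
    addresses = []
    for nn_id in (i.strip() for i in ids.split(",")):
        if nn_id:
            addresses.append(conf["dfs.namenode.rpc-address.%s.%s" % (nameservice, nn_id)])
    return addresses


# Values copied from the Spark hadoopConf shown earlier in this comment.
HADOOP_CONF = {
    "dfs.nameservices": "hdfs-k8s",
    "dfs.ha.namenodes.hdfs-k8s": "nn0,nn1",
    "dfs.namenode.rpc-address.hdfs-k8s.nn0":
        "my-hdfs-namenode-0.my-hdfs-namenode.test.svc.cluster.local:8020",
    "dfs.namenode.rpc-address.hdfs-k8s.nn1":
        "my-hdfs-namenode-1.my-hdfs-namenode.test.svc.cluster.local:8020",
}

if __name__ == "__main__":
    for address in namenode_addresses(HADOOP_CONF, "hdfs-k8s"):
        print(address)
```

With an empty configuration the function raises KeyError. That mirrors the situation in the log, where hdfs-k8s ends up being resolved as a hostname.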

Below are the YAML files and images that I used to create HA HDFS: hdfs-client.yaml.txt, hdfs-config.yaml.txt, hdfs-scripts.yaml.txt, journalnode.yaml.txt, namenode.yaml.txt, datanode.yaml.txt, zookeeper.yaml.txt

gaozhenhai/hadoop:uhopper_2.7.2 gaozhenhai/hadoopclient:uhopper_2.7.2 gaozhenhai/zookeeper:google_samples_k8szk_v3

Here’s how I created HA HDFS.

Note: If you want to deploy HA HDFS locally, you will need to update the image names in the YAML files and change the StorageClass name field to your own StorageClass name.

1. Create the HA HDFS config and scripts

kubectl -n gaozh create -f hdfs-config.yaml
kubectl -n gaozh create -f hdfs-scripts.yaml

2. Create ZooKeeper

kubectl -n gaozh create -f zookeeper.yaml

3. Create an HA HDFS cluster

kubectl -n gaozh create -f journalnode.yaml
kubectl -n gaozh create -f namenode.yaml
kubectl -n gaozh create -f datanode.yaml

Verify that HA HDFS is available

kubectl -n gaozh create -f hdfs-client.yaml

The nameservice address hdfs://hdfs-k8s/ is valid for HA HDFS.

@ZhuTopher I used kubectl to create Alluxio from manifest files. The manifest files are as follows: alluxio-configmap.yaml.txt, alluxio-master-service.yaml.txt, alluxio-master-statefulset.yaml.txt, alluxio-worker-daemonset.yaml.txt, secret.yaml.txt

My HA HDFS is deployed via a self-developed Operator, and I can provide separate statefulset.yaml and service.yaml files if needed