external-provisioner: Wrong AccessibilityRequirement passed in CreateVolumeRequest

My CSI driver deals with storage local to a node, so I would like to limit the topology segment to the individual node by returning the following in NodeGetInfoResponse:

```go
AccessibleTopology: &csi.Topology{
	Segments: map[string]string{
		"pmem-csi/node": nodeID,
	},
},
```
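
For context, here is a minimal sketch of the surrounding NodeGetInfo implementation; the nodeServer type and its nodeID field are illustrative, not from the actual driver:

```go
import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

// NodeGetInfo reports a single topology segment that is unique per node, so
// that a volume provisioned against it is only accessible on that node.
// nodeServer and nodeID are hypothetical names for illustration.
func (ns *nodeServer) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	return &csi.NodeGetInfoResponse{
		NodeId: ns.nodeID,
		AccessibleTopology: &csi.Topology{
			Segments: map[string]string{
				"pmem-csi/node": ns.nodeID,
			},
		},
	}, nil
}
```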

I am using delayed binding (volumeBindingMode: WaitForFirstConsumer) so that the volume gets provisioned only after the scheduler picks a node for the pod that consumes the PVC.
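
For reference, a minimal sketch of creating such a StorageClass with client-go; the class name and the "pmem-csi" provisioner string are assumptions for illustration:

```go
import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createDelayedBindingClass creates a StorageClass whose volume binding (and
// thus provisioning) is delayed until a pod using the claim is scheduled.
func createDelayedBindingClass(clientset kubernetes.Interface) error {
	mode := storagev1.VolumeBindingWaitForFirstConsumer
	sc := &storagev1.StorageClass{
		ObjectMeta:        metav1.ObjectMeta{Name: "pmem-csi-sc"}, // assumed name
		Provisioner:       "pmem-csi",                             // assumed driver name
		VolumeBindingMode: &mode,
	}
	_, err := clientset.StorageV1().StorageClasses().Create(sc)
	return err
}
```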

My expectation was that CreateVolumeRequest.AccessibilityRequirements would be filled with the topology of the selected node only. Instead it is filled with all the nodes in the cluster on which the driver is running:

```
GRPC call: /csi.v1.Controller/CreateVolume
I0131 11:23:54.079912       1 glog.go:58] GRPC request: name:"pvc-ab1837f9-254a-11e9-8f0d-deadbeef0100" capacity_range:<required_bytes:8589934592 > volume_capabilities:<mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > > accessibility_requirements:<requisite:<segments:<key:"pmem-csi/node" value:"host-3" > > requisite:<segments:<key:"pmem-csi/node" value:"host-1" > > requisite:<segments:<key:"pmem-csi/node" value:"host-2" > > preferred:<segments:<key:"pmem-csi/node" value:"host-1" > > preferred:<segments:<key:"pmem-csi/node" value:"host-2" > > preferred:<segments:<key:"pmem-csi/node" value:"host-3" > > >
```

My observation is that aggregateTopologies() considers only the topology keys from the CSINodeInfo of the selected node when looking for nodes with similar topology, and ignores the keys' values. As a result it ends up with every node on which my driver runs, because all of those nodes carry a topology constraint with the same key but a unique value:

```go
		// TODO (verult) retry
		selectedNodeInfo, err := csiAPIClient.CsiV1alpha1().CSINodeInfos().Get(selectedNode.Name, metav1.GetOptions{})
		if err != nil {
			// We must support provisioning if CSINodeInfo is missing, for backward compatibility.
			glog.Warningf("error getting CSINodeInfo for selected node %q: %v; proceeding to provision without topology information", selectedNode.Name, err)
			return nil, nil
		}
		topologyKeys = getTopologyKeys(selectedNodeInfo, driverName)
	}

	if len(topologyKeys) == 0 {
		// Assuming the external provisioner is never running during node driver upgrades.
		// If selectedNode != nil, the scheduler selected a node with no topology information.
		// If selectedNode == nil, all nodes in the cluster are missing topology information.
		// In either case, provisioning needs to be allowed to proceed.
		return nil, nil
	}

	selector, err := buildTopologyKeySelector(topologyKeys)
	if err != nil {
		return nil, err
	}
	nodes, err := kubeClient.CoreV1().Nodes().List(metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return nil, fmt.Errorf("error listing nodes: %v", err)
	}

	var terms []topologyTerm
	for _, node := range nodes.Items {
		// missingKey bool can be ignored because nodes were selected by these keys.
		term, _ := getTopologyFromNode(&node, topologyKeys)
		terms = append(terms, term)
	}
	return terms, nil
```
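
For a driver with per-node topology, the aggregation would have to take the selected node's label values into account, not just its keys. A minimal sketch of that idea follows, using a hypothetical helper topologyFromSelectedNode (this is not the actual upstream fix; v1 is k8s.io/api/core/v1 and topologyTerm is the provisioner's own map type):

```go
// topologyFromSelectedNode builds the topology term directly from the
// selected node's labels instead of listing every node that shares the
// same topology keys.
func topologyFromSelectedNode(node *v1.Node, topologyKeys []string) (topologyTerm, error) {
	term := make(topologyTerm)
	for _, key := range topologyKeys {
		value, ok := node.Labels[key]
		if !ok {
			return nil, fmt.Errorf("selected node %q is missing topology label %q", node.Name, key)
		}
		term[key] = value
	}
	return term, nil
}
```

With that, the requisite list for a delayed-binding volume would contain exactly one term: the segments of the node the scheduler picked.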

Environment:

  • external-provisioner v1.0.1, with the Topology feature gate enabled
  • Kubernetes v1.13.2, with the CSINodeInfo and CSIDriverRegistry feature gates enabled

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 24 (14 by maintainers)

Most upvoted comments

Let’s ignore replication for now; I believe it isn’t relevant here.

I think some of the confusion comes from a different understanding of what topology means and how strict it is. I’m getting the feeling that currently topology is treated as a suggestion, not a hard requirement. That’s why, even though Kubernetes knows that a pod using the volume will run in zone1, it (through external-provisioner) allows the CSI driver to create the volume in zone2, and thus we get both zone1 and zone2 in expectedRequisite = CreateVolumeRequest.AccessibilityRequirements.

For local storage we have to replace “zone” with “host”, and now adhering to that topology becomes a hard requirement. A CSI driver can still satisfy it; it just has to know that it must create the volume in the first topology listed in CreateVolumeRequest.AccessibilityRequirements. This is where it gets tricky: how can the driver know that? To the driver, a CreateVolumeRequest looks exactly the same regardless of whether “wait for first consumer” was set. A sketch of that “trust the first entry” approach follows below.
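
To make that concrete, here is a minimal sketch of a hypothetical pickTargetNode helper for a driver’s CreateVolume, which simply assumes the first preferred topology entry is the scheduler’s choice; the "pmem-csi/node" key is from this issue, everything else is illustrative:

```go
import (
	"errors"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

// pickTargetNode assumes the first preferred (or requisite) topology entry
// names the node the scheduler selected. That assumption is exactly what
// the driver cannot verify from the request alone.
func pickTargetNode(req *csi.CreateVolumeRequest) (string, error) {
	reqs := req.GetAccessibilityRequirements()
	if reqs == nil {
		return "", errors.New("no accessibility requirements: cannot place a local volume")
	}
	if preferred := reqs.GetPreferred(); len(preferred) > 0 {
		return preferred[0].GetSegments()["pmem-csi/node"], nil
	}
	if requisite := reqs.GetRequisite(); len(requisite) > 0 {
		return requisite[0].GetSegments()["pmem-csi/node"], nil
	}
	return "", errors.New("empty accessibility requirements")
}
```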

I tried to define some usage models for local volumes here: https://github.com/intel/pmem-CSI/pull/160

The case that is affected by this ambiguity in CreateVolumeRequest.AccessibilityRequirements is “Delayed Persistent Volume”.

This does bring up an interesting scalability problem, though, if it’s going to list every single node. Maybe we should consider truncating the results? cc @verult @ddebroy
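
A minimal sketch of what such truncation could look like; the cap and the helper are assumptions for illustration, not anything from the provisioner:

```go
// maxRequisiteTerms is an assumed cap on how many topology terms get passed
// to the driver; the actual limit, if any, would have to be decided upstream.
const maxRequisiteTerms = 50

// truncateTerms drops terms beyond the cap so that CreateVolumeRequest
// stays bounded even in very large clusters.
func truncateTerms(terms []topologyTerm) []topologyTerm {
	if len(terms) <= maxRequisiteTerms {
		return terms
	}
	return terms[:maxRequisiteTerms]
}
```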