istio: istio-proxy v1.8.0-alpha2 crashing on aggregate cluster

Bug description

Following creation of an EnvoyFilter and ServiceEntry, istio-injected pods are unable to start up due to istio-proxy crashing with a segfault. Also note that this error seems indeterministic, you may have to run through the reproduction procedure a few times to trigger it.

[ ] Docs [X] Installation [ ] Networking [ ] Performance and Scalability [ ] Extensions and Telemetry [ ] Security [ ] Test and Release [X] User Experience [ ] Developer Infrastructure

Expected behavior Istio-proxy starts without crashing.

Steps to reproduce the bug

  1. Install istio v1.8.0-alpha2
  2. Deploy the bookinfo example application
  3. Configure Istio with a custom root certificate as described in this guide
  4. Create the following ServiceEntry and corresponding EnvoyFilter:
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: reviews-test
  namespace: service-mesh-hub
spec:
  addresses:
  - 244.57.37.240
  hosts:
  - reviews-test.bookinfo.global
  location: MESH_INTERNAL
  ports:
  - name: http
    number: 9080
    protocol: http
  resolution: DNS
---
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: reviews-envoy-filter
  namespace: istio-system
spec:
  configPatches:
  - applyTo: CLUSTER
    match:
      cluster:
        name: outbound|9080||reviews-test.bookinfo.global
    patch:
      operation: REMOVE
  - applyTo: CLUSTER
    match:
      cluster:
        name: outbound|9080||reviews-test.bookinfo.global
    patch:
      operation: ADD
      value:
        cluster_type:
          name: envoy.clusters.aggregate
          typed_config:
            '@type': type.googleapis.com/udpa.type.v1.TypedStruct
            type_url: type.googleapis.com/envoy.config.cluster.aggregate.v2alpha.ClusterConfig
            value:
              clusters:
              - outbound|9080||reviews.bookinfo.svc.cluster.local
              - outbound|9080||reviews.bookinfo.svc.remote-cluster.global
        connect_timeout: 1s
        lb_policy: CLUSTER_PROVIDED
        name: outbound|9080||reviews-test.bookinfo.global

Version (include the output of istioctl version --remote and kubectl version --short and helm version if you used Helm)

Istio: v1.8.0-alpha2

Istio-proxy image: docker.io/istio/proxyv2:1.8.0-alpha.2

How was Istio installed?

cat << EOF | istioctl manifest install -y --context "kind-${cluster}" -f -
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: example-istiooperator
  namespace: istio-system
spec:
  profile: minimal
  addonComponents:
    istiocoredns:
      enabled: true
  components:
    # Istio Gateway feature
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        env:
          - name: ISTIO_META_ROUTER_MODE
            value: "sni-dnat"
        service:
          type: NodePort
          ports:
            - port: 80
              targetPort: 8080
              name: http2
            - port: 443
              targetPort: 8443
              name: https
            - port: 15443
              targetPort: 15443
              name: tls
              nodePort: ${port}
  meshConfig:
    enableAutoMtls: true
  values:
    global:
      pilotCertProvider: kubernetes
      podDNSSearchNamespaces:
      - global
EOF

Environment where bug was observed (cloud vendor, OS, etc)

Kubernetes ENV: KinD v0.8.1

Kubernetes Version: 1.17

OS: OSX Catalina

istio-proxy log output

2020-11-04T21:00:15.617742Z	info	FLAG: --concurrency="2"
2020-11-04T21:00:15.617784Z	info	FLAG: --domain="bookinfo.svc.cluster.local"
2020-11-04T21:00:15.617878Z	info	FLAG: --help="false"
2020-11-04T21:00:15.617886Z	info	FLAG: --log_as_json="false"
2020-11-04T21:00:15.617898Z	info	FLAG: --log_caller=""
2020-11-04T21:00:15.617910Z	info	FLAG: --log_output_level="default:info"
2020-11-04T21:00:15.617921Z	info	FLAG: --log_rotate=""
2020-11-04T21:00:15.617949Z	info	FLAG: --log_rotate_max_age="30"
2020-11-04T21:00:15.617958Z	info	FLAG: --log_rotate_max_backups="1000"
2020-11-04T21:00:15.617973Z	info	FLAG: --log_rotate_max_size="104857600"
2020-11-04T21:00:15.617984Z	info	FLAG: --log_stacktrace_level="default:none"
2020-11-04T21:00:15.617999Z	info	FLAG: --log_target="[stdout]"
2020-11-04T21:00:15.618011Z	info	FLAG: --meshConfig="./etc/istio/config/mesh"
2020-11-04T21:00:15.618020Z	info	FLAG: --outlierLogPath=""
2020-11-04T21:00:15.618050Z	info	FLAG: --proxyComponentLogLevel="misc:error"
2020-11-04T21:00:15.618061Z	info	FLAG: --proxyLogLevel="warning"
2020-11-04T21:00:15.618073Z	info	FLAG: --serviceCluster="ratings.bookinfo"
2020-11-04T21:00:15.618084Z	info	FLAG: --stsPort="0"
2020-11-04T21:00:15.618094Z	info	FLAG: --templateFile=""
2020-11-04T21:00:15.618106Z	info	FLAG: --tokenManagerPlugin="GoogleTokenExchange"
2020-11-04T21:00:15.618123Z	info	Version 1.8.0-alpha.2-ec349ef2a8a0bc846a87c8c10e02511e4865c9cc-Clean
2020-11-04T21:00:15.618264Z	info	Obtained private IP [10.244.0.30 fe80::481:beff:fe68:b1a5]
2020-11-04T21:00:15.618358Z	info	Apply proxy config from env {"proxyMetadata":{"DNS_AGENT":""}}

2020-11-04T21:00:15.619009Z	info	Effective config: binaryPath: /usr/local/bin/envoy
concurrency: 2
configPath: ./etc/istio/proxy
controlPlaneAuthPolicy: MUTUAL_TLS
discoveryAddress: istiod.istio-system.svc:15012
drainDuration: 45s
envoyAccessLogService: {}
envoyMetricsService: {}
parentShutdownDuration: 60s
proxyAdminPort: 15000
proxyMetadata:
  DNS_AGENT: ""
serviceCluster: ratings.bookinfo
statNameLength: 189
statusPort: 15020
terminationDrainDuration: 5s
tracing:
  zipkin:
    address: zipkin.istio-system:9411

2020-11-04T21:00:15.619084Z	info	Proxy role: &model.Proxy{RWMutex:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, Type:"sidecar", IPAddresses:[]string{"10.244.0.30", "fe80::481:beff:fe68:b1a5"}, ID:"ratings-v1-7466f6f858-ngm9x.bookinfo", Locality:(*envoy_config_core_v3.Locality)(nil), DNSDomain:"bookinfo.svc.cluster.local", ConfigNamespace:"", Metadata:(*model.NodeMetadata)(nil), SidecarScope:(*model.SidecarScope)(nil), PrevSidecarScope:(*model.SidecarScope)(nil), MergedGateway:(*model.MergedGateway)(nil), ServiceInstances:[]*model.ServiceInstance(nil), IstioVersion:(*model.IstioVersion)(nil), VerifiedIdentity:(*spiffe.Identity)(nil), ipv6Support:false, ipv4Support:false, GlobalUnicastIP:"", XdsResourceGenerator:model.XdsResourceGenerator(nil), WatchedResources:map[string]*model.WatchedResource(nil)}
2020-11-04T21:00:15.619101Z	info	JWT policy is first-party-jwt
2020-11-04T21:00:15.619133Z	info	PilotSAN []string{"istiod.istio-system.svc"}
2020-11-04T21:00:15.619185Z	info	sa.serverOptions.CAEndpoint == istiod.istio-system.svc:15012 Citadel
2020-11-04T21:00:15.619223Z	info	Using CA istiod.istio-system.svc:15012 cert with certs: ./var/run/secrets/kubernetes.io/serviceaccount/ca.crt
2020-11-04T21:00:15.619294Z	info	citadelclient	Citadel client using custom root: istiod.istio-system.svc:15012 -----BEGIN CERTIFICATE-----
MIICyDCCAbCgAwIBAgIBADANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwprdWJl
cm5ldGVzMB4XDTIwMTEwNDE4MTg0MFoXDTMwMTEwMjE4MTg0MFowFTETMBEGA1UE
AxMKa3ViZXJuZXRlczCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBANcN
+nDNxSKceeAl4Feghw7sTxc6tNg2HZRONkqXbfrQL/vEIwwtrbiA84MqoiiAEADs
B3q18bh/e5WEU655p1nzbMvcchqTzwgpHSubUW7BsIOWmUN5MnsfEQwDALLfjvQA
AWDMJz0jQsJ+/x4V3tY4F14+mTsssW8SY8vFl5h1KVzTzEQ8MF/jgSnFAnebvhaw
JYDqDc60cI4VSiMoDx/kGbRWYlmMbG3A5uO2hmitN+mCoCnciRgq2FEI5LGYgDYu
tqK/rl78V350Zdj5dIwkJ8iFTVPIQaEwd/gVsxTdUYiJK+wNoOqgkht9UPfy2qrQ
HHsbUdpbXdUOlKDPRZ8CAwEAAaMjMCEwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB
/wQFMAMBAf8wDQYJKoZIhvcNAQELBQADggEBAER+0ygpXY0fYkpi15Cljvrw99An
IZNmbV0HwDLF+rq0XvNl2uFX9HTJFCXAskZRR4TN7+64A0oW5Kz3LIjWhSSmRkob
UsuHdSx7cmbyYuZNqkw4oOb8jD/J3pOV19qX+teZTxDi2vAx3UwE3l1DCI1R2KEt
gU/7clRsIrOGknwXJOK832ZewxoHeINvmes3MZ/klwRbsXiLOkz7xgQGWbXrcsfp
XFJrUZJ8jYudroKsw5Pt1IzwMogQ9ypSw5vgsmXkYIcrn6sz5sdMej8Dpf0wNHJo
FDGKyJ2FfNEUcazsoqYv08bRO5qqfEhPtmlDXl/yshqsGRCUMK43oHilsa4=
-----END CERTIFICATE-----

2020-11-04T21:00:15.644178Z	info	sds	SDS gRPC server for workload UDS starts, listening on "./etc/istio/proxy/SDS"

2020-11-04T21:00:15.644222Z	info	xdsproxy	Initializing with upstream address istiod.istio-system.svc:15012 and cluster Kubernetes
2020-11-04T21:00:15.644385Z	info	xdsproxy	adding watcher for certificate ./var/run/secrets/kubernetes.io/serviceaccount/ca.crt
2020-11-04T21:00:15.644315Z	info	sds	Start SDS grpc server
2020-11-04T21:00:15.644600Z	info	Starting proxy agent
2020-11-04T21:00:15.644661Z	info	Received new config, creating new Envoy epoch 0
2020-11-04T21:00:15.644644Z	info	Opening status port 15020

2020-11-04T21:00:15.644824Z	info	Epoch 0 starting
2020-11-04T21:00:15.652712Z	info	Envoy command: [-c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster ratings.bookinfo --service-node sidecar~10.244.0.30~ratings-v1-7466f6f858-ngm9x.bookinfo~bookinfo.svc.cluster.local --local-address-ip-version v4 --bootstrap-version 3 --log-format-prefix-with-location 0 --log-format %Y-%m-%dT%T.%fZ	%l	envoy %n	%v -l warning --component-log-level misc:error --concurrency 2]
2020-11-04T21:00:15.684710Z	warning	envoy runtime	Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size
2020-11-04T21:00:15.684984Z	warning	envoy runtime	Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size
2020-11-04T21:00:15.685367Z	warning	envoy runtime	Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size
2020-11-04T21:00:15.685509Z	warning	envoy runtime	Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size
2020-11-04T21:00:15.711445Z	info	xdsproxy	Envoy ADS stream established
2020-11-04T21:00:15.715022Z	warning	envoy main	there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
2020-11-04T21:00:15.719983Z	info	xdsproxy	connecting to upstream XDS server: istiod.istio-system.svc:15012
2020-11-04T21:00:15.742274Z	info	sds	resource:default new connection
2020-11-04T21:00:15.742306Z	info	sds	resource:ROOTCA new connection
2020-11-04T21:00:15.742362Z	info	sds	Skipping waiting for gateway secret
2020-11-04T21:00:15.742378Z	info	sds	Skipping waiting for gateway secret
2020-11-04T21:00:15.833695Z	info	cache	Root cert has changed, start rotating root cert for SDS clients
2020-11-04T21:00:15.833838Z	info	cache	GenerateSecret default
2020-11-04T21:00:15.834220Z	info	sds	resource:default pushed key/cert pair to proxy
2020-11-04T21:00:15.943182Z	info	cache	Loaded root cert from certificate ROOTCA
2020-11-04T21:00:15.943429Z	info	sds	resource:ROOTCA pushed root cert to proxy
2020-11-04T21:00:15.988392Z	warning	envoy filter	mTLS PERMISSIVE mode is used, connection can be either plaintext or TLS, and client cert can be omitted. Please consider to upgrade to mTLS STRICT mode for more secure configuration that only allows TLS connection with client cert. See https://istio.io/docs/tasks/security/mtls-migration/
2020-11-04T21:00:15.989310Z	warning	envoy filter	mTLS PERMISSIVE mode is used, connection can be either plaintext or TLS, and client cert can be omitted. Please consider to upgrade to mTLS STRICT mode for more secure configuration that only allows TLS connection with client cert. See https://istio.io/docs/tasks/security/mtls-migration/
2020-11-04T21:00:15.996381Z	critical	envoy backtrace	Caught Segmentation fault, suspect faulting address 0x17
2020-11-04T21:00:15.996413Z	critical	envoy backtrace	Backtrace (use tools/stack_decode.py to get line numbers):
2020-11-04T21:00:15.996422Z	critical	envoy backtrace	Envoy version: 211ca038383b61c9fb600758bf170aea5508800c/1.16.0/Clean/RELEASE/BoringSSL
2020-11-04T21:00:15.996661Z	critical	envoy backtrace	#0: __restore_rt [0x7f0fdb3ec8a0]
2020-11-04T21:00:16.006971Z	critical	envoy backtrace	#1: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x564f361f1336]
2020-11-04T21:00:16.016383Z	critical	envoy backtrace	#2: Envoy::Event::DispatcherImpl::run() [0x564f361f11cc]
2020-11-04T21:00:16.024630Z	critical	envoy backtrace	#3: Envoy::Server::WorkerImpl::threadRoutine() [0x564f361e3c53]
2020-11-04T21:00:16.033115Z	critical	envoy backtrace	#4: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x564f367ff8c3]
2020-11-04T21:00:16.033175Z	critical	envoy backtrace	#5: start_thread [0x7f0fdb3e16db]
2020-11-04T21:00:16.037086Z	error	Epoch 0 exited with error: signal: segmentation fault
2020-11-04T21:00:16.037273Z	info	xdsproxy	disconnected from XDS server: istiod.istio-system.svc:15012
2020-11-04T21:00:16.037344Z	info	No more active epochs, terminating

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 34 (31 by maintainers)

Most upvoted comments

With a debug build I get basically the same thing as https://github.com/istio/istio/issues/28620#issuecomment-722658196 although it reproduces much more easily (I haven’t seen the proxy succeed to start up yet). I guess I should’ve paid more attention to the assertion there earlier though. The line is

2020-12-10T19:00:19.934766Z	critical	envoy assert	assert failure: currentThreadRegisteredWorker(index).

which points to the assertion here. This function is called by the return value of dataCallback which is called by runOnAllThreads. The comment in dataCallback points to these comments, pointing to the this capture being a problem, as EItanya pointed out…

It needs a debug build of the istio-proxy to get the detailed stack trace. So swap the binary with debug build or ASAN build, you should be able to get them.