argo-cd: i/o timeout errors in redis and argocd-repo-server pods

Hi All,

I am exploring Argo CD. It's quite a neat project. I have deployed Argo CD on a Kubernetes 1.17 cluster (1 master, 2 workers) running on 3 LXD containers. Other things like MetalLB, ingress, Rancher etc. work fine on this cluster.

For some reason, my Argo CD isn't working the expected way. I was able to get the Argo CD UI login working by using the bypass method in bug 4148 that I reported earlier.

Here are the services in the argocd namespace:

NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/argocd-dex-server       ClusterIP   10.102.198.189   <none>        5556/TCP,5557/TCP,5558/TCP   25h
service/argocd-metrics          ClusterIP   10.104.80.68     <none>        8082/TCP                     25h
service/argocd-redis            ClusterIP   10.105.201.92    <none>        6379/TCP                     25h
service/argocd-repo-server      ClusterIP   10.98.76.94      <none>        8081/TCP,8084/TCP            25h
service/argocd-server           NodePort    10.101.169.46    <none>        80:32046/TCP,443:31275/TCP   25h
service/argocd-server-metrics   ClusterIP   10.107.61.179    <none>        8083/TCP                     25h

After I got the UI working, I tried creating a new sample project from the GUI, but it failed. Below are the argocd-server logs from around that time:

time="2020-08-27T09:22:21Z" level=info msg="received unary call /repository.RepositoryService/List" grpc.method=List grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content= grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:22:21Z" span.kind=server system=grpc
time="2020-08-27T09:22:21Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=List grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:22:21Z" grpc.time_ms=0.318 span.kind=server system=grpc
time="2020-08-27T09:22:21Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=List grpc.service=project.ProjectService grpc.start_time="2020-08-27T09:22:21Z" grpc.time_ms=3.441 span.kind=server system=grpc
time="2020-08-27T09:23:52Z" level=info msg="received unary call /repository.RepositoryService/ListApps" grpc.method=ListApps grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="repo:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" revision:\"HEAD\" " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:23:52Z" span.kind=server system=grpc
_time="2020-08-27T09:26:39Z" level=warning msg="finished unary call with code Unavailable" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.98.76.94:8081: i/o timeout\"" grpc.code=Unavailable grpc.method=ListApps grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:23:52Z" grpc.time_ms=167124.3 span.kind=server system=grpc_
time="2020-08-27T09:28:16Z" level=info msg="Alloc=10005 TotalAlloc=1978587 Sys=71760 NumGC=257 Goroutines=158"
time="2020-08-27T09:28:31Z" level=info msg="received unary call /repository.RepositoryService/GetAppDetails" grpc.method=GetAppDetails grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="source:<repoURL:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" path:\"y\" targetRevision:\"HEAD\" chart:\"\" > " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:28:31Z" span.kind=server system=grpc
2020/08/27 09:28:48 proto: tag has too few fields: "-"
time="2020-08-27T09:28:48Z" level=info msg="received unary call /application.ApplicationService/Create" grpc.method=Create grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="application:<TypeMeta:<kind:\"\" apiVersion:\"\" > metadata:<name:\"app1\" generateName:\"\" namespace:\"\" selfLink:\"\" uid:\"\" resourceVersion:\"\" generation:0 creationTimestamp:<0001-01-01T00:00:00Z> clusterName:\"\" > spec:<source:<repoURL:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" path:\"yamls\" targetRevision:\"HEAD\" chart:\"\" > destination:<server:\"https://kubernetes.default.svc\" namespace:\"default\" > project:\"default\" > status:<sync:<status:\"\" comparedTo:<source:<repoURL:\"\" path:\"\" targetRevision:\"\" chart:\"\" > destination:<server:\"\" namespace:\"\" > > revision:\"\" > health:<status:\"\" message:\"\" > sourceType:\"\" summary:<> > > " grpc.service=application.ApplicationService grpc.start_time="2020-08-27T09:28:48Z" span.kind=server system=grpc
time="2020-08-27T09:31:11Z" level=info msg="received unary call /repository.RepositoryService/ListApps" grpc.method=ListApps grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="repo:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" revision:\"HEAD\" " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:31:11Z" span.kind=server system=grpc
time="2020-08-27T09:31:11Z" level=info msg="received unary call /repository.RepositoryService/GetAppDetails" grpc.method=GetAppDetails grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="source:<repoURL:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" path:\"yamls\" targetRevision:\"HEAD\" chart:\"\" > " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:31:11Z" span.kind=server system=grpc
time="2020-08-27T09:31:29Z" level=info msg="finished unary call with code InvalidArgument" error="rpc error: code = InvalidArgument desc = application spec is invalid: InvalidSpecError: Unable to get app details: rpc error: code = DeadlineExceeded desc = context deadline exceeded" grpc.code=InvalidArgument grpc.method=Create grpc.service=application.ApplicationService grpc.start_time="2020-08-27T09:28:48Z" grpc.time_ms=161011.11 span.kind=server system=grpc
time="2020-08-27T09:31:33Z" level=warning msg="finished unary call with code DeadlineExceeded" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" grpc.code=DeadlineExceeded grpc.method=GetAppDetails grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:28:31Z" grpc.time_ms=182001.84 span.kind=server system=grpc
time="2020-08-27T09:31:33Z" level=info msg="received unary call /repository.RepositoryService/GetAppDetails" grpc.method=GetAppDetails grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="source:<repoURL:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" path:\"y\" targetRevision:\"HEAD\" chart:\"\" > " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:31:33Z" span.kind=server system=grpc
time="2020-08-27T09:33:31Z" level=warning msg="finished unary call with code Unavailable" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.98.76.94:8081: i/o timeout\"" grpc.code=Unavailable grpc.method=ListApps grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:31:11Z" grpc.time_ms=140004.14 span.kind=server system=grpc

I even tried creating the app the declarative way: I created a YAML manifest and applied it with kubectl apply -f <yaml> (roughly the sketch shown below). This created an app visible in the GUI, but it was never deployed. The health status eventually became Healthy, but the sync status remained Unknown.
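
For reference, this is roughly what I applied (a sketch reconstructed from memory; the apiVersion/kind are the standard Argo CD Application CRD, and the repo, path and destination mirror the values I used in the GUI attempt above):

cat <<'EOF' | kubectl apply -n argocd -f -
# Sketch of the Application manifest I used
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app1
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/Cloud-Mak/Demo_ArgoCD.git
    path: yamls
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: default
EOF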

From the GUI, I can see the errors below under application conditions, one after another:

ComparisonError
rpc error: code = DeadlineExceeded desc = context deadline exceeded

ComparisonError
rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.98.76.94:8081: i/o timeout"

When I tried deleting the app from the GUI, it got stuck deleting, with the errors below visible under events in the GUI:

DeletionError
dial tcp 10.105.201.92:6379: i/o timeout
Unable to load data: dial tcp 10.105.201.92:6379: i/o timeout
Unable to delete application resources: dial tcp 10.105.201.92:6379: i/o timeout

As of now, nothing is working for me in Argo CD. I am clueless as to what to do next.

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 18
  • Comments: 70 (9 by maintainers)

Most upvoted comments

Hi, I did some testing, deleting all the policies step by step. The one that resolved my problem was deleting the argocd-repo-server-network-policy.

In my case, there was something wrong with the k8s network. The Kubernetes network (calico: ipip) doesn't only use TCP/UDP. If you use AWS, check whether your security group allows all protocols.

In my case the problem was that terraform was overriding the default AWS EKS security groups (allow all), and so the server pod couldn’t communicate with the redis pod. When I added the correct security groups everything started to work as expected.

To help diagnose this problem, use "kubectl logs pod/argocd-server…". Using that I could see that the server pod was timing out when trying to connect to the redis pod, and that helped me narrow it down.
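
For example, something like this (a sketch assuming the default install, where the components run as Deployments in the argocd namespace):

kubectl -n argocd get pods
kubectl -n argocd logs deploy/argocd-server --tail=100
kubectl -n argocd logs deploy/argocd-repo-server --tail=100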

Hi, I am also facing the same problem. Any new workarounds?

Given there were so many different issues, I don't believe the problem is with Argo CD.

I had been struggling with similar problems this week getting ArgoCD working in a small Vagrant/VirtualBox environment. I switched from Flannel to Calico and everything just magically started working.

Can you confirm the application controller is able to reach the managed cluster’s API server?

Hi, thanks for the reply. Can you confirm how exactly I do that?

In my case, there was something wrong with the k8s network. The Kubernetes network (calico: ipip) doesn't only use TCP/UDP. If you use AWS, check whether your security group allows all protocols.

This got me on the right path, thank you so much!

We're using the terraform-aws-eks module, which only configures security groups for the control plane by default. By adding the basic rules as per the complete example, I was able to resolve this issue.
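
For anyone not using the complete example, a rough AWS CLI equivalent is a self-referencing all-protocol rule on the worker node security group (a sketch only; the group ID is a placeholder and your setup may want narrower rules):

# Allow all traffic between nodes that share the worker security group (placeholder ID)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol -1 \
  --source-group sg-0123456789abcdef0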

Thank you. Solved by adding the SGs for basic rules as mentioned here.

Hi, I am also facing the same problem. Any solution yet?

I've experienced a similar issue, where removing the NetworkPolicy for redis temporarily restored connectivity. I restored the NetworkPolicy, then restarted the CNI agents on the nodes running redis and argocd-server (Cilium in my case), and connectivity was restored. I'd be cautious before restarting the CNI agents: there was a blip in service communication (as expected). Proceed with caution.
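
Roughly what I ran (names are from my cluster and the standard install manifest, so adjust to yours; the CNI restart is disruptive):

kubectl -n argocd get networkpolicy
kubectl -n argocd delete networkpolicy argocd-redis-network-policy    # temporary test only
# restore the policy afterwards, e.g. by re-applying the install manifest:
# kubectl -n argocd apply -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
kubectl -n kube-system rollout restart daemonset/cilium               # restart CNI agents (causes a brief blip)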

I saw the same issue in v2.0.1 as well; restarting all the pods fixed it, but I'm not sure what the cause is.

Thanks for the hint @jrhoward, but unfortunately this did not solve the issue for me 😞

It seems that none of the Argo services can talk to redis.

argocd-server has logs like:

time="2021-06-09T09:42:15Z" level=warning msg="Failed to resync revoked tokens. retrying again in 1 minute: dial tcp 10.43.248.24:6379: i/o timeout"

These are the startup logs of argocd-application-controller:

time="2021-06-09T09:40:06Z" level=info msg="Processing all cluster shards"
time="2021-06-09T09:40:06Z" level=info msg="appResyncPeriod=3m0s"
time="2021-06-09T09:40:06Z" level=info msg="Application Controller (version: v2.0.3+8d2b13d, built: 2021-05-27T17:38:37Z) starting (namespace: argocd)"
time="2021-06-09T09:40:06Z" level=info msg="Starting configmap/secret informers"
time="2021-06-09T09:40:06Z" level=info msg="Configmap/secret informer synced"
time="2021-06-09T09:40:06Z" level=info msg="Ignore status for CustomResourceDefinitions"
time="2021-06-09T09:40:06Z" level=info msg="0xc00097b1a0 subscribed to settings updates"
time="2021-06-09T09:40:06Z" level=info msg="Refreshing app status (normal refresh requested), level (2)" application=drone
time="2021-06-09T09:40:06Z" level=info msg="Starting clusterSecretInformer informers"
time="2021-06-09T09:40:06Z" level=info msg="Ignore status for CustomResourceDefinitions"
time="2021-06-09T09:40:06Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: drone)" application=drone
time="2021-06-09T09:40:06Z" level=info msg="Start syncing cluster" server="https://kubernetes.default.svc"
W0609 09:40:06.719086       1 warnings.go:70] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
W0609 09:40:06.815954       1 warnings.go:70] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
time="2021-06-09T09:40:26Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:40:56Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:41:16Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:41:36Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:41:56Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:42:16Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:42:37Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
...

It also seems that the services cannot talk to each other. I found this in the argocd-server logs:

time="2021-06-09T08:58:15Z" level=warning msg="finished unary call with code Unavailable" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.43.55.98:8081: connect: connection refused\"" grpc.code=Unavailable grpc.method=GetAppDetails grpc.service=repository.RepositoryService grpc.start_time="2021-06-09T08:58:13Z" grpc.time_ms=2007.533 span.kind=server system=grpc

where 10.43.55.98 is the ClusterIP of the argocd-repo-server service.

I’m very puzzled.

Hi there, I had the same issue and was stuck on it for several days. The argocd-server logs showed that its connection to argocd-redis was never established, or timed out on every action it took, so it might be a problem with the argocd-server or its network. Check whether the argocd-redis pod is ready; if so, check whether CoreDNS in the k8s cluster is ready and healthy. In my case it was my calico-node pod that was not in Running state!

kubectl -n kube-system delete pods calico-node-876rj

and tadaaaa 😃 it worked for me. Hope it works for you too.
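
The checks boil down to something like this (a sketch; the calico name/label depends on your CNI):

kubectl -n argocd get pods                               # is argocd-redis Running and Ready?
kubectl -n kube-system get pods -l k8s-app=kube-dns      # is CoreDNS healthy?
kubectl -n kube-system get pods -o wide | grep calico    # any calico-node pod not Running?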

In my case, I was not specifying a nodeSelector in my Helm values. This caused some of the pods to sometimes land on nodes in a worker group that (similar to what others have described above) did not have the proper security group rules in place.

Using a nodeSelector/affinity to force the pods not to land on these worker node groups solved the issue.
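
Something along these lines did it for me (a sketch assuming the community argo-cd Helm chart; the value keys and the node label are examples and may differ for your chart version):

cat <<'EOF' > argocd-values.yaml
# Pin each Argo CD component to a labelled node group (example label)
controller:
  nodeSelector:
    workload: argocd
server:
  nodeSelector:
    workload: argocd
repoServer:
  nodeSelector:
    workload: argocd
redis:
  nodeSelector:
    workload: argocd
EOF
helm upgrade argocd argo/argo-cd -n argocd -f argocd-values.yaml --reuse-values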

@Tylermarques I resolved the issue. You have to perform the following steps:

  1. Add the cluster properly. Download the Argo CD CLI; I use the following command: argocd cluster add kubernetes-admin@kubernetes --in-cluster # The status will be Unknown until an app is deployed
  2. Remove the following network policies and restart the pods by deleting them: I. argocd-repo-server-network-policy II. argocd-server-network-policy
  3. Please don't use the argo-example repo for the deployment; it won't work. Instead, create a public project in your own repo and push the changes, then create SSH public and private keys using this command:

ssh-keygen -t ed25519 -f argocd

Copy the public key into the SSH and GPG keys section of GitHub as an SSH key. Then go to the Argo CD settings in the GUI and add the GitHub repo using SSH; there you will need your private key. It should be added successfully. Create the application. Once the application is created, the cluster status will change from Unknown to Successful automatically.
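
The same repo registration can be done from the CLI instead of the GUI; roughly like this (the repo URL is a placeholder, and the key path matches the ssh-keygen command above):

argocd repo add git@github.com:<your-user>/<your-repo>.git \
  --ssh-private-key-path ./argocd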

I am facing the following issue.

Unable to create application: application spec for app1 is invalid: InvalidSpecError: repository not accessible: rpc error: code = Unavailable desc = connection error: desc = “transport: Error while dialing dial tcp: lookup argocd-repo-server: i/o timeout”

I am running my cluster on bare-metal Linux VMs. I have deleted all the network policies after reading the above comments. I also tried different versions, but still the same issue.

If someone has resolved the above issue, could you please share?

By adding the basic rules as per the complete example, I was able to resolve this issue.

This was the solution that worked for me too.

I am still getting the error below after deleting all the Argo CD network policies. I have deployed Argo CD on a minikube cluster; it seems Argo CD does not work well with minikube: Unable to connect SSH repository: connection error: desc = “transport: Error while dialing dial tcp: lookup argocd-repo-server: i/o timeout”

Can anyone help, please? It is very annoying.
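
Since the error is a DNS lookup failure, I guess the next thing to check is whether the service name even resolves from inside the cluster; something like this (the pod name and image are arbitrary):

kubectl -n argocd run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup argocd-repo-server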

Removing the network policies helped in my case. I'm using Weave with no other policies; the Argo CD ones were the only NetworkPolicies there.

OK, on delving deeper into my issue, it was actually an SDN issue. I'm running on bare metal. Machines could not reach CoreDNS if it was not on the same machine, and even when it was, they couldn't reach the Redis server when that was on another machine, so a mixture of DNS lookup failures and network connectivity issues to Redis.

I spoke too soon. The errors are back

Just asking for a method to verify, because I can kubectl exec into the app-controller pod. It uses the non-root user "argocd" to log into the pod. It's a Debian buster container, where I can't install ping or even sudo (to install ping). The plan was to ping the kube API server IP (which is the K8s master IP) to see if there is communication between the two; a ping-less alternative is sketched below the output.

argocd@argocd-application-controller-d9d496bdc-hcv7t:~$ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"