argo-rollouts: Failed job results in successful analysis run
I'm using the Job metric provider for pre-promotion validation in a blue-green scenario. The Job fails (as expected), but the AnalysisRun still reports Successful. I expect the AnalysisRun to fail as well, making the revision ineligible for promotion (automated or manual) unless the failure is explicitly ignored. If I set autoPromotionEnabled to true on my Rollout, the revision with the failed Job is promoted automatically.
Rollout Status
$ kubectl argo rollouts -n example get rollout myapp
Name: myapp
Namespace: example
Status: ॥ Paused
Strategy: BlueGreen
Images: registry.company.io/myapp:1.0.0 (active, preview)
Replicas:
Desired: 1
Current: 2
Updated: 1
Ready: 2
Available: 1
NAME KIND STATUS AGE INFO
⟳ myapp Rollout ॥ Paused 41h
├──# revision:32
│ ├──⧉ myapp-64fc844b69 ReplicaSet ✔ Healthy 113s preview
│ │ └──□ myapp-64fc844b69-ptx8v Pod ✔ Running 113s ready:1/1
│ └──α myapp-64fc844b69-32 AnalysisRun ✔ Successful 48s ✖ 1
│ └──⊞ e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d.smoketest.1 Job ✖ Failed 48s
├──# revision:31
│ ├──⧉ myapp-56c5c9749d ReplicaSet ✔ Healthy 4m35s active
│ │ └──□ myapp-56c5c9749d-5qnjr Pod ✔ Running 4m35s ready:1/1
│ ├──α myapp-56c5c9749d-31.1 AnalysisRun ✔ Successful 3m36s ✔ 1
│ │ └──⊞ 739b3158-dc61-4131-a09a-2b0f09a074a2.smoketest.1 Job ✔ Successful 3m36s
$ kubectl -n example get pods
NAME READY STATUS RESTARTS AGE
e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d.smoketest.1-f4n9k 0/1 Error 0 2m20s
myapp-56c5c9749d-5qnjr 1/1 Running 0 6m7s
myapp-64fc844b69-ptx8v 1/1 Running 0 3m25s
$ kubectl -n example get jobs
NAME COMPLETIONS DURATION AGE
e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d.smoketest.1 0/1 2m29s 2m29s
$ kubectl -n example describe job e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d.smoketest.1
Name: e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d.smoketest.1
Namespace: example
Selector: controller-uid=fc389554-01a1-4fee-84e8-76777e857e14
Labels: analysisrun.argoproj.io/uid=e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d
Annotations: analysisrun.argoproj.io/metric-name: smoketest
analysisrun.argoproj.io/name: myapp-64fc844b69-32
Controlled By: AnalysisRun/myapp-64fc844b69-32
Parallelism: 1
Completions: 1
Start Time: Thu, 28 May 2020 12:05:22 -0700
Pods Statuses: 0 Running / 0 Succeeded / 1 Failed
Pod Template:
Labels: controller-uid=fc389554-01a1-4fee-84e8-76777e857e14
job-name=e2f2718a-e8d4-4a7c-8867-7d8e17e6b01d.smoketest.1
Analysis Template
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoketest
spec:
  args:
  - name: service-url
  metrics:
  - name: smoketest
    failureLimit: 1
    provider:
      job:
        spec:
          backoffLimit: 0
          template:
            spec:
              containers:
              - name: smoketest
                image: smoketest:image
                args:
                - "{{ args.service-url }}"
Rollout
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  annotations:
    rollout.argoproj.io/revision: "32"
  name: myapp
  namespace: example
  resourceVersion: "3948767"
spec:
  progressDeadlineSeconds: 300
  replicas: 1
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp
  strategy:
    blueGreen:
      activeService: myapp
      autoPromotionEnabled: false
      prePromotionAnalysis:
        args:
        - name: service-url
          value: http://myapp-preview.example.svc.cluster.local:8080
        templates:
        - templateName: smoketest
      previewService: myapp-preview
  template:
    metadata:
      labels:
        app.kubernetes.io/name: myapp
        app.kubernetes.io/version: 1.0.0
    spec:
      containers: [...]
      restartPolicy: Always
      terminationGracePeriodSeconds: 160
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 19 (6 by maintainers)
OK, I figured it out. It seemed like I had used the wrong image to execute my job. I was using curlimages/curl:latest at first, and replacing that image with a different one from my personal library worked. Anyway, thanks a lot for your help, dude!
EDIT: Actually, I was wrong. The problem was not the image but the options I put into my job, which caused the strange behaviour of a failed job with a successful AnalysisRun. So maybe there is a bug here.
RE-EDIT: OK, sorry for the bullshit. I finally understood the true reason: my count was equal to 1 and my failureLimit was also equal to 1. You need count > failureLimit to make it work. Anyway, it's late, I'm tired, and I should have gone to bed instead of posting nonsense. Maybe it will help someone 😃 Good night.