milvus: [Bug]: panic: failed to flush delete data
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: v2.1.0-hotfix-dcd6c9e
- Deployment mode(standalone or cluster): standalone
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.0.2
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory:
Limits:
cpu: 32
memory: 16Gi
Requests:
cpu: 4
memory: 16Gi
- GPU:
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
- Others:
Deployment is running on EKS
Current Behavior
When I deploy a fresh Milvus standalone setup via the Helm chart and we start inserting and deleting data, the setup panics after a short time (sometimes minutes, sometimes hours). The worst thing is that once it has panicked, it is not possible to get it running again. The only way to get a working system is to stop everything, delete the PVCs and recreate an empty cluster from scratch without any data; the system with the inserted data cannot be revived.
We consume from Kafka with a Python Kafka consumer and use the Milvus Python SDK to frequently insert and delete vectors with 100-300 dimensions (we experiment with different setups).
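For reference, the ingestion loop looks roughly like the sketch below (endpoint, collection name, topic and the embedding helper are placeholders, not our actual code):

from kafka import KafkaConsumer                       # kafka-python consumer
from pymilvus import connections, Collection

def vectorize(payload):
    # placeholder for the real embedding model (100-300 dimensions in our experiments)
    return [0.0] * 128

connections.connect(host="milvus2-standalone", port="19530")   # placeholder endpoint
collection = Collection("documents")                            # placeholder, existing collection

consumer = KafkaConsumer("documents-topic", bootstrap_servers="kafka:9092")
for message in consumer:
    doc_id = int(message.key)                         # assumes the document id is the message key
    # the same id may be deleted and re-inserted several times within seconds
    collection.delete(expr=f"id in [{doc_id}]")
    collection.insert([[doc_id], [vectorize(message.value)]])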
We were running Milvus 1.1.1 and, as we had some performance problems, we recently upgraded to Milvus 2. We tried Milvus 2.0.2, 2.1.0 and v2.1.0-hotfix-dcd6c9e, first in the clustered version and then, after that failed, in standalone mode. Our use case is that we consume messages from Kafka, vectorize them and put them into Milvus via the Python SDK; we also have a webservice which queries Milvus. We do one insert every couple of seconds, so about 20-60 inserts into Milvus per minute.

After some minutes or hours the clustered setup failed, with querynodes going offline and not being able to come back online (the logs do not show any meaningful message). When I try to reproduce the issues locally with some dummy generators/searches and docker-compose, it seems to work quite well, without errors, even with 500,000 vectors. But in the setup deployed on Kubernetes we get issues of all kinds (dying queryNode -> queryCoord, failed to find shard leader, etc.). Standalone deployed on Kubernetes also does not run stably, even with very little load and few vectors (20-60 inserts per minute, about 1-10k vectors).
open pid file: /run/milvus/standalone.pid
lock pid file: /run/milvus/standalone.pid
---Milvus Proxy successfully initialized and ready to serve!---
panic: failed to flush delete data, err = All attempts results:
attempt #1:cannot find segment, id = 435148680510046209
attempt #2:cannot find segment, id = 435148680510046209
attempt #3:cannot find segment, id = 435148680510046209
attempt #4:cannot find segment, id = 435148680510046209
attempt #5:cannot find segment, id = 435148680510046209
goroutine 1551 [running]:
github.com/milvus-io/milvus/internal/datanode.(*deleteNode).Operate(0xc000f0db90, 0xc0079068c0, 0x1, 0x1, 0x0, 0x0, 0x0)
/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_delete_node.go:249 +0x10c5
github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work(0xc002408680)
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:102 +0x23b
created by github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).Start
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:70 +0x70
Other logs:
__ _________ _ ____ ______
/ |/ / _/ /| | / / / / / __/
/ /|_/ // // /_| |/ / /_/ /\ \
/_/ /_/___/____/___/\____/___/
Welcome to use Milvus!
Version: v2.1.0-hotfix-dcd6c9e
Built: Mon Aug 1 08:54:24 UTC 2022
GitCommit: dcd6c9e5
GoVersion: go version go1.16.9 linux/amd64
open pid file: /run/milvus/standalone.pid
lock pid file: /run/milvus/standalone.pid
---Milvus Proxy successfully initialized and ready to serve!---
panic: insertNode processDeleteMessages failed, collectionID = 435147331058208065, err = partition 435148989329309697 hasn't been loaded or has been released, channel: by-dev-rootcoord-dml_1_435147331058208065v1
goroutine 191864 [running]:
github.com/milvus-io/milvus/internal/querynode.(*insertNode).Operate(0xc00b047170, 0xc00caef170, 0x1, 0x1, 0x0, 0x0, 0x0)
/go/src/github.com/milvus-io/milvus/internal/querynode/flow_graph_insert_node.go:234 +0x3c45
github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work(0xc005772780)
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:102 +0x23b
created by github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).Start
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:70 +0x70
milvus2-etcd-0 1/1 Running 0 19h
milvus2-minio-7f7556f4b7-xp7r8 1/1 Running 0 18h
milvus2-standalone-c7bd4b9d9-gf8pj 0/1 Running 51 19h
Expected Behavior
- Milvus should run stably and not shut down because of (recoverable?) errors
- Milvus should be able to recover itself if it finds an inconsistent state
- Milvus should never end up in a state where it can’t heal itself
Steps To Reproduce
1. Deploy Milvus standalone on an EKS cluster
2. Insert and delete vectors in parallel (the same id can be inserted and deleted multiple times in relatively short succession, within several seconds) at a rate of about 20-60 operations per minute (see the sketch after this list)
3. After a short time some race conditions (?) appear which bring down Milvus
4. Milvus is not able to recover from this and restarts endlessly
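A minimal sketch of step 2, assuming a pre-created collection with an INT64 primary key "id" and a FLOAT_VECTOR field (names, dimension and endpoint are placeholders):

import random
import time
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")   # placeholder endpoint
collection = Collection("repro_collection")           # placeholder, collection created beforehand

ids = list(range(1000))
while True:
    doc_id = random.choice(ids)
    vector = [random.random() for _ in range(128)]    # placeholder dimension
    collection.insert([[doc_id], [vector]])           # the same id is re-inserted over time
    time.sleep(random.uniform(1, 3))
    collection.delete(expr=f"id in [{doc_id}]")       # ...and deleted again seconds later
    time.sleep(random.uniform(1, 3))                  # overall roughly 20-60 operations per minute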
Milvus Log
I have 60 MB of logs (2.3 MB gzipped), where can I drop them?
Some excerpt:
[2022/08/09 04:37:59.035 +00:00] [ERROR] [datacoord/garbage_collector.go:148] ["parse segment id error"] [infoKey=file/delta_log/435147331058208065/435147415318167553/435148894957731843/435148894957731849] [error="file/delta_log/435147331058208065/435147415318167553/435148894957731843/435148894957731849 is not a valid binlog path"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).scan\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:148\ngithub.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).work\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:98"]
[2022/08/09 04:37:59.035 +00:00] [ERROR] [datacoord/garbage_collector.go:148] ["parse segment id error"] [infoKey=file/delta_log/435147331058208065/435147415318167553/435149549579796488/435149549592903686] [error="file/delta_log/435147331058208065/435147415318167553/435149549579796488/435149549592903686 is not a valid binlog path"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).scan\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:148\ngithub.com/milvus-io/milvus/
...
[2022/08/09 04:38:00.026 +00:00] [DEBUG] [client/client.go:91] ["QueryCoordClient msess key not existed"] [key=querycoord]
[2022/08/09 04:38:00.026 +00:00] [ERROR] [grpcclient/client.go:140] ["failed to get client address"] [error="find no available querycoord, check querycoord state"] [stack="github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).connect\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:140\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).GetGrpcClient\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:112\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).callOnce\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:202\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).ReCall\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:257\ngithub.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).GetComponentStates\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:118\ngithub.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentStates.func1\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:65\ngithub.com/milvus-io/milvus/internal/util/retry.Do\n\t/go/src/github.com/milvus-io/milvus/internal/util/retry/retry.go:37\ngithub.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentStates\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:89\ngithub.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentHealthy\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:104\ngithub.com/milvus-io/milvus/internal/distributed/proxy.(*Server).init\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:461\ngithub.com/milvus-io/milvus/internal/distributed/proxy.(*Server).Run\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:288\ngithub.com/milvus-io/milvus/cmd/components.(*Proxy).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/components/proxy.go:50\ngithub.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).runProxy.func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:155"]
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 15 (13 by maintainers)
Commits related to this issue
- Fix flush panic after compaction See also: #18565 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Fix flush panic after compaction See also: #18565 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Fix flush panic after compaction See also: #18565 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Fix flush panic after compaction See also: #18565 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Fix flush panic after compaction (#18677) See also: #18565 Signed-off-by: yangxuan <xuan.yang@zilliz.com> Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to milvus-io/milvus by XuanYang-cn 2 years ago
- Fix flush panic after compaction (#18678) See also: #18565 Signed-off-by: yangxuan <xuan.yang@zilliz.com> Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to milvus-io/milvus by XuanYang-cn 2 years ago
@datenhahn I would like to examine the logs, please email them to me at xuan.yang@zilliz.com
@XuanYang-cn I have uploaded the logs here now, maybe that works: https://drive.google.com/file/d/1TJYGAx_H2emOcF0XQruhlozDzGI9008I/view?usp=sharing
The queries use vector search:
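For illustration only, a pymilvus search of that kind looks like the following (field name, metric type and parameters are assumed; this is not the original snippet):

from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")    # placeholder endpoint
collection = Collection("documents")                   # placeholder collection name
collection.load()

query_vector = [0.0] * 128                             # placeholder query embedding
results = collection.search(
    data=[query_vector],
    anns_field="embedding",                            # assumed vector field name
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=10,
    partition_names=["2022-08-10"],                    # optionally restrict to time partitions
)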
Regarding the problems:
Regarding the rest, maybe you find something in the logfiles.
Here are even more logs in the other issue (https://github.com/milvus-io/milvus/issues/18595#issuecomment-1210780825), it is the same setup. I tried Milvus Cluster and Milvus Standalone on Kubernetes, but both run into issues.
For me I see the following problems:
Currently I have the feeling:
@XuanYang-cn I have mailed you the other logs; I didn’t realize the logs had been rotated, so the new file is 300 MB.
Maybe some more information about our setup which might be special: we use lots of partitions to speed up searches and add every vector to multiple partitions:
- Year (e.g. 2022)
- Month (e.g. 2022-08)
- Week (e.g. 2022-08-KW32)
- Day (e.g. 2022-08-10)
So we have lots of partitions with (initially) few vectors; as we add more vectors, the partitions fill up with more data.
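A sketch of how such a multi-partition insert could look in pymilvus (endpoint, collection name and field layout are placeholders; the partition names follow the patterns above):

from datetime import date
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")    # placeholder endpoint
collection = Collection("documents")                   # placeholder, existing collection

def time_partitions(d):
    # year, month, calendar week and day, e.g. 2022 / 2022-08 / 2022-08-KW32 / 2022-08-10
    return [
        f"{d.year}",
        f"{d:%Y-%m}",
        f"{d:%Y-%m}-KW{d.isocalendar()[1]}",
        f"{d:%Y-%m-%d}",
    ]

doc_id, vector = 1, [0.0] * 128                        # placeholder id and embedding
for name in time_partitions(date.today()):
    if not collection.has_partition(name):
        collection.create_partition(name)
    # the same vector goes into every matching time partition
    collection.insert([[doc_id], [vector]], partition_name=name)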