milvus: [Bug]: panic: failed to flush delete data
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: v2.1.0-hotfix-dcd6c9e
- Deployment mode(standalone or cluster): standalone
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.0.2
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory:
Limits:
cpu: 32
memory: 16Gi
Requests:
cpu: 4
memory: 16Gi
- GPU:
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
- Others:
Deployment is running on EKS
Current Behavior
When I deploy a fresh Milvus standalone setup via the Helm chart and we start inserting and deleting data, the setup panics after a short time (sometimes minutes, sometimes hours). The worst thing is that once it has panicked, it is not possible to get it running again. The only way to get a working system is to stop everything, delete the PVCs and recreate an empty cluster from scratch without any data; the system with the inserted data cannot be revived.
We consume from Kafka with a Python Kafka consumer and use the Milvus Python SDK to frequently insert and delete vectors with 100-300 dimensions (we experiment with different setups).
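For reference, the ingestion loop looks roughly like the sketch below (endpoint, collection name, topic and the embedding helper are placeholders, not our actual code):

from kafka import KafkaConsumer                       # kafka-python consumer
from pymilvus import connections, Collection

def vectorize(payload):
    # placeholder for the real embedding model (100-300 dimensions in our experiments)
    return [0.0] * 128

connections.connect(host="milvus2-standalone", port="19530")   # placeholder endpoint
collection = Collection("documents")                            # placeholder, existing collection

consumer = KafkaConsumer("documents-topic", bootstrap_servers="kafka:9092")
for message in consumer:
    doc_id = int(message.key)                         # assumes the document id is the message key
    # the same id may be deleted and re-inserted several times within seconds
    collection.delete(expr=f"id in [{doc_id}]")
    collection.insert([[doc_id], [vectorize(message.value)]])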
We were running Milvus 1.1.1 and, as we had some performance problems, we recently upgraded to Milvus 2. We tried Milvus 2.0.2, 2.1.0 and v2.1.0-hotfix-dcd6c9e, first in the clustered version and then, after that failed, in standalone mode. Our use case is that we consume messages from Kafka, vectorize them and put them into Milvus via the Python SDK; we also have a webservice which queries Milvus. We do one insert every couple of seconds, so about 20-60 inserts into Milvus per minute.

After some minutes or hours the clustered setup failed, with querynodes going offline and not being able to come back online (the logs do not show any meaningful message). When I try to reproduce the issues locally with some dummy generators/searches and docker-compose, it seems to work quite well, without errors, even with 500,000 vectors. But in the setup deployed on Kubernetes we get issues of all kinds (dying queryNode -> queryCoord, failed to find shard leader, etc.). Standalone deployed on Kubernetes also does not run stably, even with very little load and few vectors (20-60 inserts per minute, about 1-10k vectors).
open pid file: /run/milvus/standalone.pid
lock pid file: /run/milvus/standalone.pid
---Milvus Proxy successfully initialized and ready to serve!---
panic: failed to flush delete data, err = All attempts results:
attempt #1:cannot find segment, id = 435148680510046209
attempt #2:cannot find segment, id = 435148680510046209
attempt #3:cannot find segment, id = 435148680510046209
attempt #4:cannot find segment, id = 435148680510046209
attempt #5:cannot find segment, id = 435148680510046209
goroutine 1551 [running]:
github.com/milvus-io/milvus/internal/datanode.(*deleteNode).Operate(0xc000f0db90, 0xc0079068c0, 0x1, 0x1, 0x0, 0x0, 0x0)
/go/src/github.com/milvus-io/milvus/internal/datanode/flow_graph_delete_node.go:249 +0x10c5
github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work(0xc002408680)
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:102 +0x23b
created by github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).Start
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:70 +0x70
Other logs:
__ _________ _ ____ ______
/ |/ / _/ /| | / / / / / __/
/ /|_/ // // /_| |/ / /_/ /\ \
/_/ /_/___/____/___/\____/___/
Welcome to use Milvus!
Version: v2.1.0-hotfix-dcd6c9e
Built: Mon Aug 1 08:54:24 UTC 2022
GitCommit: dcd6c9e5
GoVersion: go version go1.16.9 linux/amd64
open pid file: /run/milvus/standalone.pid
lock pid file: /run/milvus/standalone.pid
---Milvus Proxy successfully initialized and ready to serve!---
panic: insertNode processDeleteMessages failed, collectionID = 435147331058208065, err = partition 435148989329309697 hasn't been loaded or has been released, channel: by-dev-rootcoord-dml_1_435147331058208065v1
goroutine 191864 [running]:
github.com/milvus-io/milvus/internal/querynode.(*insertNode).Operate(0xc00b047170, 0xc00caef170, 0x1, 0x1, 0x0, 0x0, 0x0)
/go/src/github.com/milvus-io/milvus/internal/querynode/flow_graph_insert_node.go:234 +0x3c45
github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work(0xc005772780)
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:102 +0x23b
created by github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).Start
/go/src/github.com/milvus-io/milvus/internal/util/flowgraph/node.go:70 +0x70
milvus2-etcd-0 1/1 Running 0 19h
milvus2-minio-7f7556f4b7-xp7r8 1/1 Running 0 18h
milvus2-standalone-c7bd4b9d9-gf8pj 0/1 Running 51 19h
Expected Behavior
- Milvus should run stably and not shut down because of (recoverable?) errors
- Milvus should be able to recover itself if it finds an inconsistent state
- Milvus should never end up in a state where it can’t heal itself
Steps To Reproduce
1. Deploy Milvus standalone on an EKS cluster
2. Insert and delete vectors in parallel (the same id can be inserted and deleted multiple times in relatively short succession, within several seconds) at a rate of about 20-60 operations per minute (see the sketch after this list)
3. After a short time some race conditions (?) appear which bring down Milvus
4. Milvus is not able to recover from this and restarts endlessly
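A minimal sketch of step 2, assuming a pre-created collection with an INT64 primary key "id" and a FLOAT_VECTOR field (names, dimension and endpoint are placeholders):

import random
import time
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")   # placeholder endpoint
collection = Collection("repro_collection")           # placeholder, collection created beforehand

ids = list(range(1000))
while True:
    doc_id = random.choice(ids)
    vector = [random.random() for _ in range(128)]    # placeholder dimension
    collection.insert([[doc_id], [vector]])           # the same id is re-inserted over time
    time.sleep(random.uniform(1, 3))
    collection.delete(expr=f"id in [{doc_id}]")       # ...and deleted again seconds later
    time.sleep(random.uniform(1, 3))                  # overall roughly 20-60 operations per minute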
Milvus Log
I have 60 MB of logs (2.3 MB gzipped), where can I drop them?
Some excerpt:
[2022/08/09 04:37:59.035 +00:00] [ERROR] [datacoord/garbage_collector.go:148] ["parse segment id error"] [infoKey=file/delta_log/435147331058208065/435147415318167553/435148894957731843/435148894957731849] [error="file/delta_log/435147331058208065/435147415318167553/435148894957731843/435148894957731849 is not a valid binlog path"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).scan\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:148\ngithub.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).work\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:98"]
[2022/08/09 04:37:59.035 +00:00] [ERROR] [datacoord/garbage_collector.go:148] ["parse segment id error"] [infoKey=file/delta_log/435147331058208065/435147415318167553/435149549579796488/435149549592903686] [error="file/delta_log/435147331058208065/435147415318167553/435149549579796488/435149549592903686 is not a valid binlog path"] [stack="github.com/milvus-io/milvus/internal/datacoord.(*garbageCollector).scan\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/garbage_collector.go:148\ngithub.com/milvus-io/milvus/
...
[2022/08/09 04:38:00.026 +00:00] [DEBUG] [client/client.go:91] ["QueryCoordClient msess key not existed"] [key=querycoord]
[2022/08/09 04:38:00.026 +00:00] [ERROR] [grpcclient/client.go:140] ["failed to get client address"] [error="find no available querycoord, check querycoord state"] [stack="github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).connect\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:140\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).GetGrpcClient\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:112\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).callOnce\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:202\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase).ReCall\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:257\ngithub.com/milvus-io/milvus/internal/distributed/querycoord/client.(*Client).GetComponentStates\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/client/client.go:118\ngithub.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentStates.func1\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:65\ngithub.com/milvus-io/milvus/internal/util/retry.Do\n\t/go/src/github.com/milvus-io/milvus/internal/util/retry/retry.go:37\ngithub.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentStates\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:89\ngithub.com/milvus-io/milvus/internal/util/funcutil.WaitForComponentHealthy\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/func.go:104\ngithub.com/milvus-io/milvus/internal/distributed/proxy.(*Server).init\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:461\ngithub.com/milvus-io/milvus/internal/distributed/proxy.(*Server).Run\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/proxy/service.go:288\ngithub.com/milvus-io/milvus/cmd/components.(*Proxy).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/components/proxy.go:50\ngithub.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).runProxy.func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:155"]
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 15 (13 by maintainers)
Commits related to this issue
- Fix flush panic after compaction See also: #18565 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Fix flush panic after compaction See also: #18565 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Fix flush panic after compaction See also: #18565 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Fix flush panic after compaction See also: #18565 Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to XuanYang-cn/milvus by XuanYang-cn 2 years ago
- Fix flush panic after compaction (#18677) See also: #18565 Signed-off-by: yangxuan <xuan.yang@zilliz.com> Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to milvus-io/milvus by XuanYang-cn 2 years ago
- Fix flush panic after compaction (#18678) See also: #18565 Signed-off-by: yangxuan <xuan.yang@zilliz.com> Signed-off-by: yangxuan <xuan.yang@zilliz.com> — committed to milvus-io/milvus by XuanYang-cn 2 years ago
@datenhahn I would like to examine the logs, please email them to me at xuan.yang@zilliz.com
@XuanYang-cn I have uploaded the logs here now, maybe that works: https://drive.google.com/file/d/1TJYGAx_H2emOcF0XQruhlozDzGI9008I/view?usp=sharing
The queries use vector search:
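For illustration only, a pymilvus search of that kind looks like the following (field name, metric type and parameters are assumed; this is not the original snippet):

from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")    # placeholder endpoint
collection = Collection("documents")                   # placeholder collection name
collection.load()

query_vector = [0.0] * 128                             # placeholder query embedding
results = collection.search(
    data=[query_vector],
    anns_field="embedding",                            # assumed vector field name
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=10,
    partition_names=["2022-08-10"],                    # optionally restrict to time partitions
)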
Regarding the problems:
Regarding the rest, maybe you find something in the logfiles.
Here are even more logs in the other issue (https://github.com/milvus-io/milvus/issues/18595#issuecomment-1210780825), it is the same setup. I tried Milvus Cluster and Milvus Standalone on Kubernetes, but both run into issues.
For me I see the following problems:
Currently I have the feeling:
@XuanYang-cn I have mailed you the other logs; I didn’t realize the logs had been rotated, so the new file is 300 MB.
Maybe some more information about our setup which might be special: we use lots of partitions to speed up searches and add every vector to multiple partitions:
- Year (e.g. 2022)
- Month (e.g. 2022-08)
- Week (e.g. 2022-08-KW32)
- Day (e.g. 2022-08-10)
So we have lots of partitions with (initially) few vectors; as we add more vectors, the partitions fill up with more data.
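A sketch of how such a multi-partition insert could look in pymilvus (endpoint, collection name and field layout are placeholders; the partition names follow the patterns above):

from datetime import date
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")    # placeholder endpoint
collection = Collection("documents")                   # placeholder, existing collection

def time_partitions(d):
    # year, month, calendar week and day, e.g. 2022 / 2022-08 / 2022-08-KW32 / 2022-08-10
    return [
        f"{d.year}",
        f"{d:%Y-%m}",
        f"{d:%Y-%m}-KW{d.isocalendar()[1]}",
        f"{d:%Y-%m-%d}",
    ]

doc_id, vector = 1, [0.0] * 128                        # placeholder id and embedding
for name in time_partitions(date.today()):
    if not collection.has_partition(name):
        collection.create_partition(name)
    # the same vector goes into every matching time partition
    collection.insert([[doc_id], [vector]], partition_name=name)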