dvc: repro: rwlock file corrupted

Bug Report

Issue name

repro: rwlock file corrupted

Description

When running multiple pipelines that read the same file as input, the rwlock file gets corrupted partway through the process.

Exception

[Screenshot of the exception: image (1)]

Reproduce

We run a large number of pipelines that share the same file as input. All processes run on a compute cluster with GPU support, but I don't think the GPU has any impact on this bug. It looks like a concurrency bug that is hard to reproduce. rwlock file: see the last line in this file. It has been reformatted and renamed to JSON for easy navigation.
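For anyone trying to confirm the same symptom, a quick way to locate the damage is to parse the file and report where parsing fails. This is only a minimal sketch: the helper name check_rwlock is hypothetical, and it assumes nothing beyond the rwlock file being expected to contain JSON, as the attached copy suggests.

```python
import json
from pathlib import Path


def check_rwlock(path):
    """Return (ok, message) describing whether `path` holds valid JSON.

    Hypothetical helper for diagnosis only; it is not part of DVC.
    """
    text = Path(path).read_text()
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # lineno/colno point at the first position where parsing failed,
        # e.g. trailing garbage left behind by an interleaved write.
        return False, f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return True, "valid JSON"
```

Usage would be something like check_rwlock(".dvc/tmp/rwlock"), where the path is an assumption about where the rwlock file lives in this DVC version; pass whatever path the exception reports.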

Environment information

DVC version: 2.0.18 (pip)
---------------------------------
Platform: Python 3.8.8 on Linux-4.18.0-240.15.1.el8_3.x86_64-x86_64-with-glibc2.10
Supports: http, https, s3

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17 (7 by maintainers)

Most upvoted comments

Hi, dvc config core.hardlink_lock true did a great job, and we managed to finish the bulk of the running jobs without a crash. Now I am going to check whether the rwlock issue is also solved by this change. Will update.
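For context on why switching lock types might matter on a cluster: hard-link locks rely on link creation failing atomically when the target already exists, instead of on flock-style locking, which some shared filesystems do not support reliably. The following is a minimal sketch of that general technique only. It is not DVC's implementation, and the names try_acquire and release are hypothetical.

```python
import os
import socket
import uuid


def try_acquire(lock_path):
    """Try once to take the lock; return the claim file path, or None if busy."""
    # A unique claim file next to the lock. Hard links require both paths
    # to be on the same filesystem.
    claim = f"{lock_path}.{socket.gethostname()}.{os.getpid()}.{uuid.uuid4().hex}"
    with open(claim, "w") as fh:
        fh.write(claim)
    try:
        # Creating the link fails if lock_path already exists, so at most
        # one claimant can succeed.
        os.link(claim, lock_path)
    except FileExistsError:
        os.remove(claim)
        return None
    return claim


def release(lock_path, claim):
    """Drop the lock by removing both the lock link and the claim file."""
    os.remove(lock_path)
    os.remove(claim)
```

A caller would loop on try_acquire with a short sleep until it returns a claim, do its work, and then call release. A production lock would also need timeouts and stale-lock cleanup, which this sketch leaves out.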

Sure, will try. Thanks.