tensorflow: [Bug] The file copied by TF from HDFS to local may be wrong when the HDFS file is being overwritten

This is an issue from the TaiJi AI platform at Tencent.

System information

  • OS Platform and Distribution: Linux version 4.14.105-1-tlinux3-0010
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.13.1 (the version we use); the latest version also has this problem
  • Python version: 3.6
  • C++ version: C++11

Describe the current behavior Our training sample data is generated by a Spark program and stored on HDFS; an example training sample file is hdfs://xxxx/example/20200822/part-r-0000036.tfr.gz, and the file data is gzip-compressed. The training program is triggered when the _SUCCESS file appears under hdfs://xxxx/example/20200822/. The training program first downloads the training samples from HDFS to the local file system, and then reads the local data for training. When the training program and the Spark program run at the same time, the HDFS file being downloaded may be overwritten by the Spark program, so the gzip file downloaded to the local disk can be corrupted. Once the gzip file is corrupt, our TensorFlow training program gets stuck decompressing it and CPU utilization stays high. The corrupt local gzip file is composed of parts of the HDFS file's data from both before and after the overwrite.

code:

 auto env = tensorflow::Env::Default();
 // Copy the hdfs:// source file to a local destination path.
 auto st = env->CopyFile(src_file, des_file);

process pstack info and top info were attached as images (not reproduced here)

Describe the expected behavior The local gzip file should match either the HDFS file's data from before the overwrite or its data from after the overwrite, instead of containing a mixture of data from before and after the overwrite.

issue analysis In order to solve issue 5438 (TensorBoard needs to see the latest data written), the HDFS file is reopened inside HDFSRandomAccessFile::Read: when n > 0 and r == 0 (bytes are still requested but the read returns EOF), hdfsOpenFile is called to reopen the HDFS file; please refer to this commit for details. If the HDFS file is overwritten before hdfsOpenFile is called, a new HDFS file is created, and after the call the handle points to the new file. If the new HDFS file is larger than the old one, the file copied to the local file system by FileSystemCopyFile contains parts of both the old and the new HDFS file, which corrupts the local gzip file. A simplified sketch of this retry behavior follows.
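To make the mechanism concrete, here is a minimal sketch of the EOF-retry loop described above, written against the plain libhdfs C API (hdfsPread, hdfsOpenFile, hdfsCloseFile). It only illustrates the behavior; the function name ReadWithEofRetry and its parameters are invented for this example, and the real hadoop_file_system.cc code differs in detail.

 #include <fcntl.h>      // O_RDONLY
 #include <sys/types.h>  // ssize_t
 #include "hdfs.h"       // libhdfs C API: hdfsFS, hdfsFile, hdfsPread, ...

 // Reads up to n bytes starting at `offset`; on the first EOF it reopens
 // `path` so that data appended by a concurrent writer becomes visible
 // (the behavior added for issue 5438).
 ssize_t ReadWithEofRetry(hdfsFS fs, hdfsFile* file, const char* path,
                          tOffset offset, char* dst, size_t n) {
   size_t total = 0;
   bool eof_retried = false;
   while (total < n) {
     tSize r = hdfsPread(fs, *file, offset + static_cast<tOffset>(total),
                         dst + total, static_cast<tSize>(n - total));
     if (r > 0) {
       total += static_cast<size_t>(r);
     } else if (r == 0 && !eof_retried) {
       // EOF while the caller still wants bytes: close and reopen the file.
       // If the file was *overwritten* (not appended) in the meantime, the
       // new handle points at a different file, so the bytes read from here
       // on come from the new file while the bytes above came from the old
       // one -- exactly the corruption described in this issue.
       hdfsCloseFile(fs, *file);
       *file = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
       if (*file == nullptr) return -1;
       eof_retried = true;
     } else {
       break;  // second EOF, or a read error (r < 0)
     }
   }
   return static_cast<ssize_t>(total);
 }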

temp solution: patch-1 FileSystemCopyFile avoids triggering the hdfsOpenFile reopen inside HDFSRandomAccessFile::Read: the amount of data copied is bounded by the file size obtained up front, instead of reading kCopyFileBufferSize chunks until EOF. The implementation principle of this temporary solution is the same as that of ReadFileToString, but the local copy can still be wrong, because GetFileSize and Read do not form an atomic operation: for example, the HDFS file may be overwritten right after GetFileSize returns, and the new file's data is then read using the old file's size. However, the probability that the local file is wrong under this temporary solution is far lower than with the original code, because reading the file data to the end of the file is time-consuming, while obtaining the file size takes negligible time. A hedged sketch of this size-bounded copy follows.
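As an illustration of that principle, here is a sketch of a size-bounded copy using only the public Env API; the function name CopyBySize and the buffer size are made up for this example, and the actual patch-1 change to FileSystemCopyFile may look different.

 #include <algorithm>
 #include <memory>
 #include <string>
 #include <vector>
 #include "tensorflow/core/lib/core/errors.h"
 #include "tensorflow/core/platform/env.h"

 // Copies src to dst, reading exactly GetFileSize() bytes so that Read()
 // never hits EOF and never triggers the hidden reopen.  Note the window
 // between GetFileSize() and the reads: the two are not atomic.
 tensorflow::Status CopyBySize(tensorflow::Env* env, const std::string& src,
                               const std::string& dst) {
   tensorflow::uint64 size = 0;
   TF_RETURN_IF_ERROR(env->GetFileSize(src, &size));

   std::unique_ptr<tensorflow::RandomAccessFile> in;
   TF_RETURN_IF_ERROR(env->NewRandomAccessFile(src, &in));
   std::unique_ptr<tensorflow::WritableFile> out;
   TF_RETURN_IF_ERROR(env->NewWritableFile(dst, &out));

   constexpr size_t kBufSize = 128 * 1024;  // illustrative chunk size
   std::vector<char> buf(kBufSize);
   tensorflow::uint64 offset = 0;
   while (offset < size) {
     size_t to_read = std::min<tensorflow::uint64>(kBufSize, size - offset);
     tensorflow::StringPiece chunk;
     TF_RETURN_IF_ERROR(in->Read(offset, to_read, &chunk, buf.data()));
     TF_RETURN_IF_ERROR(out->Append(chunk));
     offset += chunk.size();
   }
   return out->Close();
 }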

may be a better solution: patch-2 I am not sure whether this is a better solution; in scenarios I am not aware of, it may require further discussion. RandomAccessFile::Read is an abstraction over the operations supported by each file system, and the specific implementation is transparent to users. Adding a hidden hdfsOpenFile call inside HDFSRandomAccessFile::Read in order to return the latest data is dangerous behavior, because the reopened handle may point to a new file and data inconsistencies can occur. More fatally, a large number of methods depend on Read, which may lead to behaviors that are not what we expect; this is the root of all the errors here. I think it is better for users to call Read and then reopen the file explicitly in their own program when they want the latest data, as sketched below.
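A minimal sketch of that caller-driven approach, using only the public Env / RandomAccessFile API; the helper name ReopenForLatestData is hypothetical.

 #include <memory>
 #include <string>
 #include "tensorflow/core/platform/env.h"

 // Hypothetical helper: when the reader has hit EOF and wants data written
 // since the file was opened, it reopens the file explicitly instead of
 // relying on a hidden reopen inside Read().
 tensorflow::Status ReopenForLatestData(
     tensorflow::Env* env, const std::string& fname,
     std::unique_ptr<tensorflow::RandomAccessFile>* file) {
   file->reset();                                 // drop the old handle
   return env->NewRandomAccessFile(fname, file);  // explicit, visible reopen
 }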

accepted solution: patch-3 Quoting mihaimaruseac’s comment:

Patch-1 has the issue of breaking separation of concern design principles (what happens if there is a new scheme for hdfs? We would have a bug in there until someone remembers the additional if). 
Patch-2 has the issue of removing a test that was added for creating a bug.

In order to overcome the shortcomings of patch-1 and patch-2, patch-3 adds a switch; by default HDFS_DISABLE_READ_EOF_RETRIED is false, so the existing behavior is kept. Patch-3 does not remove the WriteWhileReading test case, and it still solves the problem we encountered. If you need to turn off the read-EOF retry, set the environment variable:

export HDFS_DISABLE_READ_EOF_RETRIED=1
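For illustration only, the switch could be read roughly like this (the actual patch-3 implementation may differ):

 #include <cstdlib>
 #include <cstring>

 // Returns true when the user has disabled the EOF-retry via the
 // HDFS_DISABLE_READ_EOF_RETRIED environment variable.
 static bool HdfsReadEofRetryDisabled() {
   const char* v = std::getenv("HDFS_DISABLE_READ_EOF_RETRIED");
   return v != nullptr && std::strcmp(v, "1") == 0;
 }

 // Inside the Read() retry shown earlier, the reopen branch would then be
 // guarded: } else if (r == 0 && !eof_retried && !HdfsReadEofRetryDisabled()) {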

For more details, please refer to the comments below.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 15 (8 by maintainers)


Most upvoted comments

This has been solved by the third patch (patch-3).