mpifileutils: dsync not updating times correctly

Hello,

I recently discovered mpifileutils and they are awesome. I moved about 200 TB in 4 million files between two different Lustre filesystems over the weekend. Thank you for your work on these tools. After doing an initial data move, my intent was to keep the data in sync for a period of time using rsync. With the bulk of the data transferred, this should go fast, and it would also catch any file types that dsync doesn’t handle (hard links, etc.). Here is the script I’m using.

source=/nobackup/PROJECTS/ccdev/
dest=/nobackup2/PROJECTS/ccdev/

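# bulk copy; -X non-lustre copies extended attributes except Lustre-internal ones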
mpiexec --allow-run-as-root dsync -X non-lustre $source $dest > dsync-ccdev.out.$( date +%F )

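# catch-up pass; -H preserves hard links (which dsync doesn’t handle) and -x keeps rsync on this filesystem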
rsync -aPHASx $source $dest > rsync-ccdev.out.$( date +%F )

The problem I’m running into is that rsync is copying files that dsync already copied. Here is one example of a file that rsync is trying to transfer again.

Source:

238412051 -rw-r----- 1 dvicker eg3 244210724864 Aug 23 16:38 /nobackup/PROJECTS/ccdev/boeing/aero/adb/ar/ar-204.squashfs

Destination:

238487228 -rw-r----- 1 dvicker eg3 244210724864 Sep 29 17:15 /nobackup2/PROJECTS/ccdev/boeing/aero/adb/ar/ar-204.squashfs

The file size is correct, but it looks like the date on the destination is wrong - it is the date the destination copy was created, not the source file’s modification time. After poking around in the destination more, it looks like this is prevalent - modification times were not preserved. This will trip up rsync’s default quick-check algorithm, which compares only size and mtime. I could switch to rsync’s checksum mode to get around this, but that would be quite a bit more time consuming for this dataset.
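
A quick way to see why rsync wants to re-send each file, without actually copying anything, is a dry run with itemized changes (standard rsync flags, sketched here for reference; in the itemized output, a 't' means the mtimes differ and an 's' means the sizes differ):

# dry run only: list what would be transferred and why, but copy nothing
rsync -aPHASx --dry-run --itemize-changes $source $dest | head -20

# slower fallback: compare file contents instead of size+mtime
# rsync -aPHASx --checksum $source $dest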

Is this a bug with dsync or am I possibly doing something wrong? I’m using version 0.11.1 of mpifileutils.

About this issue

  • State: open
  • Created 9 months ago
  • Comments: 87 (46 by maintainers)

Most upvoted comments

I’ve started my test with the updated code.

Hi @dvickernasa Since the destination LFS is ldiskfs, that rules out my theory. Back to the drawing board for us. Thank you.

I’m re-running my test with the updated code. Should be finished in the morning. Since I’ve transferred this same set of files a few times now, we should get a good idea of whether the extra fsyncs slow things down much.

I’ve compiled the update and I’ll repeat the test. I’m not going to use the new option for now, just to be consistent. I’ll email the output for the latest results when it’s done.

Sorry, this fell off my radar. The “production” transfers I needed to do are done and I didn’t get back to my testing. I’m restarting a transfer as a test. I’m going to recreate the original transfer this time, including the original options I used (most notably, without --batch-files). I don’t think I’ve seen the problem on any transfer that used --batch-files, and I’m starting to suspect that’s why I haven’t been able to reproduce it. This new transfer is the exact same set of files from the first post in this issue. The source and destination filesystems are the same as well; I’m just using a different path in the destination FS. Fingers crossed that this recreates it. Details below. I’m using the debugtimestamps branch for both dsync runs, so if we do catch the problem, we should have some good data to go over.

ml load mpifileutils/debugtimestamps

source=/nobackup/PROJECTS/ccdev.to-delete/
dest=/nobackup2/dvicker/dsync_test/ccdev/

dsync_opts='-X non-lustre'

mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-testing.$( date +%FT%H-%M-%S ).out
mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-testing.$( date +%FT%H-%M-%S ).out
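
As a follow-up check (a sketch, not part of the original test - it assumes GNU stat and that $source and $dest are still set), a per-file mtime comparison between the two trees would flag anything left with the wrong timestamp. For 4 million files this serial loop is slow; mpifileutils’ dcmp can do a similar comparison in parallel.

# report files whose mtimes differ between source and destination
cd "$source"
find . -type f | while read -r f; do
    src_mtime=$( stat -c %Y "$f" )        # mtime as epoch seconds
    dst_mtime=$( stat -c %Y "$dest/$f" )
    [ "$src_mtime" = "$dst_mtime" ] || echo "mtime mismatch: $f"
done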