runtime: TarReader throws on various archives that other tools accept

I tried opening:

  1. each of the tar files used to test Golang’s tar package (here with details about each in the tests here).
  2. each of the tar files used to test node-tar, found here.
  3. each of the tar files used to test libarchive, found here. Note I had to uudecode these.

Note all the above have permissive licenses so it may be possible to borrow these tars for our test assets.

I used the test code below to open each, ignored those that opened successfully, and for those that failed compared whether some other tools could open them. The interesting cases are where other tools (particularly GNU tar) can open them, but we cannot. Note: I mostly didn’t extract the entries, just checked they could be listed. In some cases, the tar can be listed, but extraction will fail.

test code I used
// See https://aka.ms/new-console-template for more information
using System.Formats.Tar;
using Xunit;

public static class C
{

    public async static Task Main()
    {
        List<Task> tasks = new();
        foreach (string path in Directory.EnumerateFiles(@"C:\git\go\src\archive\tar\testdata", "*.tar"))
        {
            tasks.Add(Task.Run(async () =>
            {
                TarEntry? entry = null;

                try
                {
                    //Console.WriteLine($"{path} opening...");
                    using FileStream fs = new(path, FileMode.Open);
                    using TarReader reader = new(fs, leaveOpen: false);

                    while ((entry = await reader.GetNextEntryAsync()) != null)
                    {
                        // Basic sanity checks on the parsed header.
                        Assert.NotEmpty(entry.Name);
                        Assert.True(Enum.IsDefined(entry.EntryType));
                        Assert.True(Enum.IsDefined(entry.Format));

                        if (entry.EntryType == TarEntryType.Directory)
                            continue;

                        // Drain the data so read errors surface, without buffering it in memory.
                        Stream? ds = entry.DataStream;
                        if (ds != null && ds.Length > 0)
                        {
                            await ds.CopyToAsync(Stream.Null);
                        }
                    }
                }
                catch (Exception ex) //when (!(ex is FormatException))
                {
                    Console.WriteLine($"{path} opening {entry?.Name} threw {ex.Message}");
                }
            }));
        }

        await Task.WhenAll(tasks);
    }
}
 
source | file | issue | gnu tar; 7z; golang (in that order; rows with fewer results had blank cells) | .NET | .NET exception
golang | gnu-multi-hdrs.tar | duplicate headers | reads one; reads one w/warning; reads one | ERROR | A metadata entry of type ‘LongPath’ was unexpectedly found after a metadata entry of type ‘LongPath’.
golang | gnu-incremental.tar | incremental format | reads ok; reads ok | ERROR | Unable to read beyond the end of the stream.
golang | invalid-go17.tar | ?? | reads ok; reads ok; reads ok | ERROR | Could not find any recognizable digits.
golang | hdr-only.tar | just header | reads with errors; reads ok; reads ok | ERROR | Additional non-parsable characters are at the end of the string.
golang | nil-uid.tar | zero uid | reads ok; reads w/warnings; reads ok | ERROR | Unable to read beyond the end of the stream.
golang | pax-multi-hdrs.tar | 2 headers | reads ok; reads w/warnings; reads ok | ERROR | A metadata entry of type ‘ExtendedAttributes’ was unexpectedly found after a metadata entry of type ‘ExtendedAttributes’.
golang | pax-bad-mtime-file.tar | bad modified time | reads ok; reads w/warnings | ERROR | Unable to read beyond the end of the stream.
golang | pax-pos-size-file.tar | ? | reads ok; reads w/warnings; reads ok | ERROR | Unable to read beyond the end of the stream.
golang | v7.tar | v7 | reads ok; reads ok; reads ok | ERROR | Could not find any recognizable digits.
golang | sparse-formats.tar | something about sparseness | reads ok; reads ok | ERROR | Additional non-parsable characters are at the end of the string.
golang | ustar-file-reg.tar | non-zero device numbers. | reads ok; reads ok | ERROR | Unable to read beyond the end of the stream.
golang | writer-big.tar | truncated huge | ERROR; reads ok | ERROR | Could not find any recognizable digits.
golang | pax-path-hdr.tar | ? | reads empty; ERROR; reads header | ERROR | Unable to read beyond the end of the stream.
golang | writer-big-long.tar | truncated huge | ERROR; reads w/ unexpected end of data; reads ok | ERROR | Unable to read beyond the end of the stream.
mine | huge.tar | dd if=/dev/zero bs=1G count=16 > huge.tar | reads | ERROR | Value was either too large or too small for a UInt32
golang | issue10968.tar | garbled header | ERROR | ERROR (but OK) | Could not find any recognizable digits.
golang | issue11169.tar | ?? | ERROR | ERROR (but OK) | Additional non-parsable characters are at the end of the string.
golang | neg-size.tar | negative size | ERROR; refuses; ERROR | ERROR (but OK) | Could not find any recognizable digits.
golang | pax-bad-hdr-file.tar | bad header | reads with errors; reads ok; ERROR | ERROR (but OK) | Unable to read beyond the end of the stream.
node | long-pax.tar | 120 byte filename (pax limit 100) | reads headers; reads w/ unexpected end of data | ERROR | 120-byte-filename-cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc threw Unable to read beyond the end of the stream.
node | next-file-has-long.tar | link to 170 byte name in GNU | | ERROR | Entry ‘NextFileHasLongPath’ was expected to be in the GNU format, but did not have the expected version data.
node | path-missing.tar | empty name | “Substituting `.’ for empty member name” (but not clear this is useful…); silently uses tar file name | ERROR on extraction | Cannot create ‘c:\tar’ because a file or directory with the same name already exists (NOTE – we should probably fix to fail earlier, in GetDestinationAndLinkPaths())
node | links-strip.tar | ?symlink and hardlinks | reads ok; reads w/ unexpected end of data | ERROR | Unable to read beyond the end of the stream.
mine | empty.tar | 0 bytes | reads OK; reads ok | OK |
libarchive | test_compat_gtar_2.tar | huge gid | reads OK; reads ok | ERROR | Could not find any recognizable digits.
libarchive | test_compat_perl_archive_tar.tar | ? | reads OK; reads ok | ERROR | Could not find any recognizable digits.
libarchive | test_compat_gtar_1.tar | 200 byte filenames and symlink? | reads OK; reads ok | ERROR | Could not find any recognizable digits.
libarchive | test_compat_plexus_archiver_tar.tar | | reads OK w/tar: A lone zero block at 3; reads w/ There are some data after the end of the payload data | ERROR | Could not find any recognizable digits.
libarchive | test_compat_solaris_tar_acl.tar | | reads OK w/Unknown file type ‘A’; reads ok | OK | (no exception, but unexpected TarEntryType 65 = ‘A’ … A custom extension)
libarchive | test_compat_tar_hardlink_1.tar | | reads OK; reads w/ unexpected end of data | ERROR | Could not find any recognizable digits.
libarchive | test_read_format_gtar_sparse_1_17_posix00.tar | | reads OK; reads ok | ERROR | The entry ‘./PaxHeaders.38659/sparse’ has a duplicate extended attribute.
libarchive | test_read_format_tar_invalid_pax_size.tar | | ERRORS; ERROR | ERROR | Could not find any recognizable digits.

Possibly some of these are expected limitations, but for the others we should add checkboxes and work through and fix them.

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 60 (56 by maintainers)

Most upvoted comments

Would we accept a change that, on failure to decompress, included the compression format in the message (presumably by looking at magic numbers)? That might help in cases like this. It could just be best effort. Hmm, maybe we already decided we didn’t want to; I can’t remember. But IIRC others have had this kind of confusion.

@NickeManarin the *.xz extension means the archive was compressed using the LZMA algorithm: https://www.tutorialspoint.com/using-xz-compression-in-linux#:~:text=The%20xz%20compression%20algorithm%20works,block%20independently%20using%20LZMA%20algorithm.

We don’t yet support the LZMA algorithm in System.IO.Compression, but we do have an issue tracking the request to eventually add it: https://github.com/dotnet/runtime/issues/1542

A very easy workaround is to import SharpCompress to read the LZMA part of your archive, and then pass the resulting stream to the stream-based System.Formats.Tar.TarFile extraction method. I tested it and it works:

using SharpCompress.Compressors.Xz;
using System.Formats.Tar;
using System.IO;

class CSharpTestClass
{
    static void Main()
    {
        string tarXzArchivePath = @"D:\Downloads\gifski-1.11.0.tar.xz";
        string destinationDirectoryPath = @"D:\Downloads\extractedxz";

        if (!Directory.Exists(destinationDirectoryPath))
        {
            Directory.CreateDirectory(destinationDirectoryPath);
        }

        using FileStream file = File.Open(tarXzArchivePath, FileMode.Open);
        using XZStream xzStream = new(file);
        TarFile.ExtractToDirectory(xzStream, destinationDirectoryPath, overwriteFiles: false);
    }
}

Hope that helps!

How confident are we in the opposite direction, that tars produced by TarWriter are consumable by all commonly-used tools? Do we have tests for that direction, e.g. generate various outputs with TarWriter, shell out to tar to unpack, and compare that everything roundtripped as expected?
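A rough sketch of what such a check in that direction could look like: write an archive with TarWriter, list it with an external tar, and compare the names. The class and method names here are hypothetical (not part of the runtime test suite), and it assumes a `tar` executable is on the PATH.

```csharp
using System;
using System.Diagnostics;
using System.Formats.Tar;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class RoundTripCheck
{
    // Writes one small regular-file entry per name, using the PAX format.
    public static void WriteArchive(string archivePath, params string[] entryNames)
    {
        using FileStream fs = File.Create(archivePath);
        using TarWriter writer = new(fs, TarEntryFormat.Pax, leaveOpen: false);

        foreach (string name in entryNames)
        {
            PaxTarEntry entry = new(TarEntryType.RegularFile, name)
            {
                DataStream = new MemoryStream(Encoding.UTF8.GetBytes($"contents of {name}"))
            };
            writer.WriteEntry(entry);
        }
    }

    // Lists the archive with the external tool (assumed to be "tar" on the PATH)
    // and returns the entry names it reports, one per output line.
    public static string[] ListWithExternalTar(string archivePath)
    {
        ProcessStartInfo psi = new("tar")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,
        };
        psi.ArgumentList.Add("-tf");
        psi.ArgumentList.Add(archivePath);

        using Process process = Process.Start(psi)
            ?? throw new InvalidOperationException("Could not start tar.");
        string output = process.StandardOutput.ReadToEnd();
        process.WaitForExit();

        if (process.ExitCode != 0)
            throw new InvalidOperationException($"tar exited with code {process.ExitCode}.");

        return output.Split('\n', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);
    }
}
```

A test would call `RoundTripCheck.WriteArchive(path, "a.txt", "dir/b.txt")` and then assert that `RoundTripCheck.ListWithExternalTar(path)` returns the same names. A fuller version would also extract the entries and compare contents, permissions, and timestamps for each TarEntryFormat.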

This delta fixes all issues with node-tar fixtures: https://github.com/dotnet/runtime/compare/main...am11:runtime:feature/system.formats.tar/hardlinks-support (working on tests).

@MichalPetryka it means I didn’t try it.

Using File.Exists before opening the file is unreliable due to FS changes from other sources.

Jared’s point is something like: even if you check File.Exists, you still need to handle the possibility of the file not existing a moment later when you try to read it. In this case, if such a thing happened, an exception would be thrown to the caller, which is fine. Yes, as a matter of style, or for maximum efficiency, we could catch FileNotFoundException instead, so long as the message was just as good.
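To make the alternative concrete, here is a minimal sketch of opening the file directly and catching the exception, rather than checking File.Exists first. `SafeOpen` and `TryOpenRead` are hypothetical names used only for illustration; this is not the code under review.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only.
public static class SafeOpen
{
    // Opens the file for reading. Returns false (and a null stream) if the file
    // or its directory does not exist at the moment of the open call, so there
    // is no window between a separate existence check and the open.
    public static bool TryOpenRead(string path, out FileStream? stream)
    {
        try
        {
            stream = new FileStream(path, FileMode.Open, FileAccess.Read);
            return true;
        }
        catch (FileNotFoundException)
        {
            stream = null;
            return false;
        }
        catch (DirectoryNotFoundException)
        {
            stream = null;
            return false;
        }
    }
}
```

A caller would write `if (SafeOpen.TryOpenRead(path, out FileStream? fs)) { using (fs) { ... } }` and report its own message in the `false` case. Other failures, such as access-denied errors, still propagate to the caller.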

I agree, we could add logic to TarReader to detect a compressed archive by reading the magic numbers. I opened issue https://github.com/dotnet/runtime/issues/89056 to track that request specifically. I would like people searching for the error “Unable to parse number” to get directed there.

This issue can be closed since it was tracking a different problem (missing edge cases that I already addressed).
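For reference, the magic-number detection discussed above could look roughly like the sketch below. It is not what TarReader does today; `CompressionSniffer` is a hypothetical name, and the signature list only covers a few common formats (gzip, bzip2, xz, zstd).

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only.
public static class CompressionSniffer
{
    // Well-known leading bytes ("magic numbers") of some common compressed formats.
    private static readonly (string Name, byte[] Magic)[] s_signatures =
    {
        ("gzip",  new byte[] { 0x1F, 0x8B }),
        ("bzip2", new byte[] { 0x42, 0x5A, 0x68 }),                    // "BZh"
        ("xz",    new byte[] { 0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00 }),
        ("zstd",  new byte[] { 0x28, 0xB5, 0x2F, 0xFD }),
    };

    // Returns the format name, or null if the prefix matches none of the signatures.
    public static string? Detect(ReadOnlySpan<byte> prefix)
    {
        foreach ((string name, byte[] magic) in s_signatures)
        {
            if (prefix.StartsWith<byte>(magic))
                return name;
        }
        return null;
    }

    // Reads up to 6 bytes (the longest signature above) from the start of the file.
    public static string? DetectFile(string path)
    {
        Span<byte> buffer = stackalloc byte[6];
        using FileStream fs = File.OpenRead(path);
        int read = fs.ReadAtLeast(buffer, buffer.Length, throwOnEndOfStream: false);
        return Detect(buffer.Slice(0, read));
    }
}
```

On a parse failure, the reader could call something like `Detect` on the first bytes it read and, on a match, append a hint such as "the stream appears to be xz-compressed" to the exception message. This stays best effort: an unrecognized prefix simply yields no hint.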

FWIW, I’ve run into a real-world scenario of a tarball that can’t be read with .NET runtime 7.0.0. I’m attempting to read the tarball for a Fedora container image layer. It fails with this callstack:

We had the following patch in #74358, which matched the behavior of BSD and GNU tar(1) as well as libarchive: ignore non-octal bytes when reading the attributes.

--- a/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs
+++ b/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs
@@ -221,6 +221,11 @@ internal static TarEntryType GetCorrectTypeFlagForFormat(TarEntryFormat format,
             buffer = TrimEndingNullsAndSpaces(buffer);
             buffer = TrimLeadingNullsAndSpaces(buffer);
 
+            // skip leading non-octal bytes
+            int offset = 0;
+            for (; offset < buffer.Length && (buffer[offset] < (byte)'0' || buffer[offset] > (byte)'7'); ++offset);
+            buffer = buffer.Slice(offset);
+
             if (buffer.Length == 0)
             {
                 return T.Zero;

It was rejected because it seemed too permissive going by the standard, which suggests ignoring only NUL (0) and space (32) from the ASCII table; the Fedora image has ACK (ASCII 06) at index 0…
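As a standalone illustration of the leniency in that patch (not the actual TarHelpers code), the sketch below trims NUL and space padding, skips any leading non-octal bytes, and then parses the remaining octal digits. `OctalField.ParseLenient` is a hypothetical name, and the choice to throw on a non-octal byte in the middle of the field is an assumption.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class OctalField
{
    private static bool IsPadding(byte b) => b == 0 || b == (byte)' ';
    private static bool IsOctalDigit(byte b) => b >= (byte)'0' && b <= (byte)'7';

    public static long ParseLenient(ReadOnlySpan<byte> field)
    {
        int start = 0;
        int end = field.Length;

        // Trim NUL and space padding on both ends (what the standard suggests).
        while (start < end && IsPadding(field[start])) start++;
        while (end > start && IsPadding(field[end - 1])) end--;

        // The extra leniency: skip leading bytes that are not octal digits,
        // such as the ACK (0x06) seen in the Fedora image.
        while (start < end && !IsOctalDigit(field[start])) start++;

        // An empty field is treated as zero, as in the patched method.
        long value = 0;
        for (int i = start; i < end; i++)
        {
            byte b = field[i];
            if (!IsOctalDigit(b))
                throw new FormatException($"Unexpected byte 0x{b:X2} in octal field.");
            value = checked(value * 8 + (b - (byte)'0'));
        }
        return value;
    }
}
```

For example, a field holding the bytes `06 '0' '0' '0' '1' '7' '5' '0' 00` parses as octal 1750, which is 1000 in decimal, whereas a strict parser fails on the leading ACK.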

No more work for 7.0 in my opinion, @danmoseley. Thanks for moving the milestone.

Gah, totally forgot about that PR. 😄 Thanks for the reminder.

But at least we now know that the exceptions for those files are ok. So there’s no rush.

What remains here to close it?

The libarchive tests.

Hmm not ready yet. The latest package version is from last Friday. I’ll post the number when I see it.

Here is another corpus under permissive license. I don’t have time to run the code above on those too, but we should do that after we fix the bugs above. https://github.com/alexcrichton/tar-rs/tree/master/tests/archives