runtime: TarReader throws on various archives that other tools accept

I tried opening:

each of the tar files used to test Golang’s tar package (here with details about each in the tests here).
each of the tar files used to test node-tar, found here.
each of the tar files used to test libarchive, found here. Note I had to uudecode these.

Note all the above have permissive licenses so it may be possible to borrow these tars for our test assets.

I used the test code below to open each, ignored those that opened successfully, and for those that failed compared whether some other tools could open them. The interesting cases are where other tools (particularly GNU tar) can open them, but we cannot. Note: I mostly didn’t extract the entries, just checked they could be listed. In some cases, the tar can be listed, but extraction will fail.

test code I used

// See https://aka.ms/new-console-template for more information
using System.Formats.Tar;
using Xunit;

public static class C
{

    public async static Task Main()
    {
        List<Task> tasks = new();
        foreach (string path in Directory.EnumerateFiles(@"C:\git\go\src\archive\tar\testdata", "*.tar"))
        {
            tasks.Add(Task.Run(async () =>
            {
                TarEntry? entry = null;

                try
                {
                    //Console.WriteLine($"{path} opening...");
                    using FileStream fs = new(path, FileMode.Open);
                    using TarReader reader = new(fs, leaveOpen: false);

                    while ((entry = await reader.GetNextEntryAsync()) != null)
                    {
                        using var ms = new MemoryStream();

                        Assert.NotEmpty(entry.Name);
                        Assert.True(Enum.IsDefined(entry.EntryType));
                        Assert.True(Enum.IsDefined(entry.Format));

                        if (entry.EntryType == TarEntryType.Directory)
                            continue;

                        var ds = entry.DataStream;
                        if (ds != null && ds.Length > 0)
                        {
                            await ds.CopyToAsync(ms);
                        }
                    }
                }
                catch (Exception ex) //when (!(ex is FormatException))
                {
                    Console.WriteLine($"{path} opening {entry?.Name} threw {ex.Message}");
                }
            }));
        }

        await Task.WhenAll(tasks);
    }
}

source	file	issue	gnu tar	7z	golang	.NET	.NET Exception
golang	gnu-multi-hdrs.tar	duplicate headers	reads one	reads one w/warning	reads one	ERROR	A metadata entry of type ‘LongPath’ was unexpectedly found after a metadata entry of type ‘LongPath’.
golang	gnu-incremental.tar	incremental format	reads ok		reads ok	ERROR	Unable to read beyond the end of the stream.
golang	invalid-go17.tar	??	reads ok	reads ok	reads ok	ERROR	Could not find any recognizable digits.
golang	hdr-only.tar	just header	reads with errors	reads ok	reads ok	ERROR	Additional non-parsable characters are at the end of the string.
golang	nil-uid.tar	zero uid	reads ok	reads w/warnings	reads ok	ERROR	Unable to read beyond the end of the stream.
golang	pax-multi-hdrs.tar	2 headers	reads ok	reads w/warnings	reads ok	ERROR	A metadata entry of type ‘ExtendedAttributes’ was unexpectedly found after a metadata entry of type ‘ExtendedAttributes’.
golang	pax-bad-mtime-file.tar	bad modified time	reads ok	reads w/warnings		ERROR	Unable to read beyond the end of the stream.
golang	pax-pos-size-file.tar	?	reads ok	reads w/warnings	reads ok	ERROR	Unable to read beyond the end of the stream.
golang	v7.tar	v7	reads ok	reads ok	reads ok	ERROR	Could not find any recognizable digits.
golang	sparse-formats.tar	something about sparseness	reads ok		reads ok	ERROR	Additional non-parsable characters are at the end of the string.
golang	ustar-file-reg.tar	non-zero device numbers.	reads ok		reads ok	ERROR	Unable to read beyond the end of the stream.
golang	writer-big.tar	truncated huge	ERROR		reads ok	ERROR	Could not find any recognizable digits.
golang	pax-path-hdr.tar	?	reads empty	ERROR	reads header	ERROR	Unable to read beyond the end of the stream.
golang	writer-big-long.tar	truncated huge	ERROR	reads w/ unexpected end of data	reads ok	ERROR	Unable to read beyond the end of the stream.
mine	huge.tar	dd if=/dev/zero bs=1G count=16 > huge.tar		reads		ERROR	Value was either too large or too small for a UInt32
golang	issue10968.tar	garbled header			ERROR	ERROR (but OK)	Could not find any recognizable digits.
golang	issue11169.tar	??			ERROR	ERROR (but OK)	Additional non-parsable characters are at the end of the string.
golang	neg-size.tar	negative size	ERROR	refuses	ERROR	ERROR (but OK)	Could not find any recognizable digits.
golang	pax-bad-hdr-file.tar	bad header	reads with errors	reads ok	ERROR	ERROR (but OK)	Unable to read beyond the end of the stream.
node	long-pax.tar	120 byte filename (pax limit 100)	reads headers	reads w/ unexpected end of data		ERROR	120-byte-filename-cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc threw Unable to read beyond the end of the stream.
node	next-file-has-long.tar	link to 170 byte name in GNU				ERROR	Entry ‘NextFileHasLongPath’ was expected to be in the GNU format, but did not have the expected version data.
node	path-missing.tar	empty name	“Substituting `.’ for empty member name” (but not clear this is useful…)	silently uses tar file name		ERROR on extraction	Cannot create ‘c:\tar’ because a file or directory with the same name already exists (NOTE – we should probably fix to fail earlier, in GetDestinationAndLinkPaths())
node	links-strip.tar	?symlink and hardlinks	reads ok	reads w/ unexpected end of data		ERROR	Unable to read beyond the end of the stream.
mine	empty.tar	0 bytes	reads OK	reads ok		OK
libarchive	test_compat_gtar_2.tar	huge gid	reads OK	reads ok		ERROR	Could not find any recognizable digits.
libarchive	test_compat_perl_archive_tar.tar	?	reads OK	reads ok		ERROR	Could not find any recognizable digits.
libarchive	test_compat_gtar_1.tar	200 byte filenames and symlink?	reads OK	reads ok		ERROR	Could not find any recognizable digits.
libarchive	test_compat_plexus_archiver_tar.tar		reads OK w/tar: A lone zero block at 3	reads w/ There are some data after the end of the payload data		ERROR	Could not find any recognizable digits.
libarchive	test_compat_solaris_tar_acl.tar		reads OK w/Unknown file type ‘A’	reads ok		OK	(no exception, but unexpected TarEntryType 65 = ‘A’ … A custom extension)
libarchive	test_compat_tar_hardlink_1.tar		reads OK	reads w/ unexpected end of data		ERROR	Could not find any recognizable digits.
libarchive	test_read_format_gtar_sparse_1_17_posix00.tar		reads OK	reads ok		ERROR	The entry ‘./PaxHeaders.38659/sparse’ has a duplicate extended attribute.
libarchive	test_read_format_tar_invalid_pax_size.tar		ERRORS	ERROR		ERROR	Could not find any recognizable digits.

Possibly some of these are expected limitations, but for the others we should add checkboxes and work through and fix them.

About this issue

State: closed
Created 2 years ago
Reactions: 3
Comments: 60 (56 by maintainers)

Most upvoted comments

Would we accept a change that, on failure to decompress, included the compression format in the message (by looking at magic numbers presumably). That might help in cases like this. It could just be best effort. Hmm maybe we already decided we didn’t want to, I can’t remember. But IIRC others have had this kind of confusion.
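A minimal, best-effort sketch of the magic-number idea, for illustration only. `CompressionSniffer` and `DetectCompression` are hypothetical names, not part of System.Formats.Tar, and the signature list covers only gzip, xz, bzip2 and zstd. It assumes .NET 7 or later (for `Stream.ReadAtLeast`) and a seekable stream.

```csharp
#nullable enable
using System;
using System.IO;

// Hypothetical helper, not an API in System.Formats.Tar.
public static class CompressionSniffer
{
    // Well-known leading signatures of common compression containers.
    private static ReadOnlySpan<byte> Gzip => new byte[] { 0x1F, 0x8B };
    private static ReadOnlySpan<byte> Xz => new byte[] { 0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00 };
    private static ReadOnlySpan<byte> Bzip2 => new byte[] { 0x42, 0x5A, 0x68 }; // "BZh"
    private static ReadOnlySpan<byte> Zstd => new byte[] { 0x28, 0xB5, 0x2F, 0xFD };

    // Returns the name of the matching format, or null if none of the
    // signatures above matches the start of the data.
    public static string? DetectCompression(ReadOnlySpan<byte> header)
    {
        if (header.StartsWith(Gzip)) return "gzip";
        if (header.StartsWith(Xz)) return "xz";
        if (header.StartsWith(Bzip2)) return "bzip2";
        if (header.StartsWith(Zstd)) return "zstd";
        return null;
    }

    // Peeks at the first bytes of a seekable stream and restores its position.
    // Returns null for non-seekable streams, since this sketch cannot rewind them.
    public static string? DetectCompression(Stream stream)
    {
        if (!stream.CanSeek) return null;

        long start = stream.Position;
        Span<byte> buffer = stackalloc byte[6];
        int read = stream.ReadAtLeast(buffer, buffer.Length, throwOnEndOfStream: false);
        stream.Position = start;

        return DetectCompression(buffer[..read]);
    }
}
```

A caller could run this after a TarReader failure and append something like "the stream looks xz-compressed" to the exception message, which keeps the check off the success path.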

danmoseley on Jul 17, 2023

@NickeManarin the *.xz extension means the archive was compressed using the LZMA algorithm: https://www.tutorialspoint.com/using-xz-compression-in-linux

We don’t yet support the LZMA algorithm in System.IO.Compression but we do have an issue tracking the request to eventually add it: https://github.com/dotnet/runtime/issues/1542

A very easy workaround is to import SharpCompress to read the LZMA part of your archive, and then pass the decompressed stream to the stream-based System.Formats.Tar.TarFile.ExtractToDirectory method. I tested it and it works:

using SharpCompress.Compressors.Xz;
using System.Formats.Tar;
using System.IO;

class CSharpTestClass
{
    static void Main()
    {
        string tarXzArchivePath = @"D:\Downloads\gifski-1.11.0.tar.xz";
        string destinationDirectoryPath = @"D:\Downloads\extractedxz";

        if (!Directory.Exists(destinationDirectoryPath))
        {
            Directory.CreateDirectory(destinationDirectoryPath);
        }

        using FileStream file = File.Open(tarXzArchivePath, FileMode.Open);
        using XZStream xzStream = new(file);
        TarFile.ExtractToDirectory(xzStream, destinationDirectoryPath, overwriteFiles: false);
    }
}

Hope that helps!

carlossanlop on Jul 16, 2023

How confident are we in the opposite direction, that tars produced by TarWriter are consumable by all commonly-used tools? Do we have tests for that direction, e.g. generate various outputs with TarWriter, shell out to tar to unpack, and compare that everything roundtripped as expected?
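One possible shape for such a round-trip check, as a sketch only: write an archive with TarWriter, then ask an external `tar` to list it. `TarRoundTrip` and its method names are hypothetical, and the sketch assumes a `tar` executable is on PATH; it is not taken from the runtime's test suite.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Formats.Tar;
using System.IO;
using System.Text;

// Hypothetical helper for illustration, not part of the runtime's tests.
public static class TarRoundTrip
{
    // Writes a single regular-file entry in PAX format using TarWriter.
    public static void WriteSample(string archivePath, string entryName, string content)
    {
        using FileStream fs = File.Create(archivePath);
        using TarWriter writer = new(fs, TarEntryFormat.Pax, leaveOpen: false);

        PaxTarEntry entry = new(TarEntryType.RegularFile, entryName)
        {
            DataStream = new MemoryStream(Encoding.UTF8.GetBytes(content))
        };
        writer.WriteEntry(entry);
    }

    // Shells out to an external tar (assumed to be on PATH) and returns the
    // entry names it lists. Throws if tar exits with a non-zero code.
    public static List<string> ListWithExternalTar(string archivePath)
    {
        ProcessStartInfo psi = new("tar")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        psi.ArgumentList.Add("-tf");
        psi.ArgumentList.Add(archivePath);

        using Process process = Process.Start(psi)
            ?? throw new InvalidOperationException("Could not start tar.");

        List<string> names = new();
        string? line;
        while ((line = process.StandardOutput.ReadLine()) != null)
        {
            if (line.Length > 0) names.Add(line);
        }

        process.WaitForExit();
        if (process.ExitCode != 0)
        {
            throw new InvalidOperationException($"tar exited with code {process.ExitCode}.");
        }
        return names;
    }
}
```

A test would call `WriteSample`, then compare the names returned by `ListWithExternalTar` against what was written. Extracting and comparing file contents would be the natural next step, and the same could be repeated per TarEntryFormat.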

stephentoub on Aug 27, 2022

This delta fixes all issues with node-tar fixtures: https://github.com/dotnet/runtime/compare/main...am11:runtime:feature/system.formats.tar/hardlinks-support (working on tests).

am11 on Aug 22, 2022

@MichalPetryka it means I didn’t try it.

using File.Exists before opening the file is unreliable due to FS changes from other sources.

Jared’s point is something like: even if you check File.Exists, you still need to handle the possibility of the file not existing a moment later when you try to read it. In this case, if such a thing happened, an exception would be thrown to the caller, which is fine. Yes, as a matter of style, or for maximum efficiency, we could catch FileNotFoundException instead, so long as the message was just as good.
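To illustrate the catch-instead-of-check style mentioned above, here is a small sketch. `SafeOpen.TryOpenRead` is a hypothetical helper, not an API from the runtime: it attempts the open directly and turns the "not found" exceptions into a message.

```csharp
#nullable enable
using System.IO;

// Hypothetical helper for illustration only.
public static class SafeOpen
{
    // Opens the file without a prior File.Exists check, so there is no window
    // in which another process can delete the file between check and open.
    public static FileStream? TryOpenRead(string path, out string? error)
    {
        try
        {
            error = null;
            return new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
        }
        catch (FileNotFoundException ex)
        {
            error = $"File '{path}' was not found: {ex.Message}";
            return null;
        }
        catch (DirectoryNotFoundException ex)
        {
            error = $"Directory for '{path}' was not found: {ex.Message}";
            return null;
        }
    }
}
```

Other failures, such as UnauthorizedAccessException, still propagate to the caller, which matches the point that an exception reaching the caller is acceptable here.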

danmoseley on Aug 22, 2022

I agree, we could add logic to TarReader to detect a compressed archive by reading the magic numbers. I opened issue https://github.com/dotnet/runtime/issues/89056 to track that request specifically. I would like people searching for the error "Unable to parse number" to get directed there.

This issue can be closed since it was tracking a different problem (missing edge cases that I already addressed).

carlossanlop on Jul 17, 2023

FWIW, I’ve run into a real-world scenario of a tarball that can’t be read with .NET runtime 7.0.0. I’m attempting to read the tarball for a Fedora container image layer. It fails with this callstack:

We had the following patch in #74358, which matched the behavior of BSD and GNU tar(1) as well as libarchive: skip leading non-octal bytes when reading the attributes.

--- a/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs
+++ b/src/libraries/System.Formats.Tar/src/System/Formats/Tar/TarHelpers.cs
@@ -221,6 +221,11 @@ internal static TarEntryType GetCorrectTypeFlagForFormat(TarEntryFormat format,
             buffer = TrimEndingNullsAndSpaces(buffer);
             buffer = TrimLeadingNullsAndSpaces(buffer);
 
+            // skip leading non-octal bytes
+            int offset = 0;
+            for (; offset < buffer.Length && (buffer[offset] < (byte)'0' || buffer[offset] > (byte)'7'); ++offset);
+            buffer = buffer.Slice(offset);
+
             if (buffer.Length == 0)
             {
                 return T.Zero;

It was rejected because it seemed too permissive going by the standard, which suggests ignoring only NUL (ASCII 0) and space (ASCII 32). The Fedora image has ACK (ASCII 06) at index 0…
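The difference between the two behaviors can be sketched as follows. This is an illustrative stand-alone version, not the code in TarHelpers.cs: `OctalField`, `TryParseStrict` and `TryParseLenient` are hypothetical names. The strict variant trims only NUL and space, and the lenient variant additionally skips leading non-octal bytes as the rejected patch did.

```csharp
using System;

// Hypothetical stand-alone sketch, not the runtime's TarHelpers implementation.
public static class OctalField
{
    // Strict: after trimming NUL (0) and space (32) from both ends, every
    // remaining byte must be an octal digit. An empty field parses as zero.
    public static bool TryParseStrict(ReadOnlySpan<byte> field, out long value)
    {
        value = 0;
        field = Trim(field);
        foreach (byte b in field)
        {
            if (b < (byte)'0' || b > (byte)'7')
            {
                value = 0;
                return false;
            }
            value = (value << 3) | (long)(b - (byte)'0');
        }
        return true;
    }

    // Lenient: additionally skips any leading bytes that are not octal digits
    // before applying the strict parse to the rest.
    public static bool TryParseLenient(ReadOnlySpan<byte> field, out long value)
    {
        field = Trim(field);
        int offset = 0;
        while (offset < field.Length && (field[offset] < (byte)'0' || field[offset] > (byte)'7'))
        {
            offset++;
        }
        return TryParseStrict(field.Slice(offset), out value);
    }

    // Removes leading and trailing NUL and space bytes.
    private static ReadOnlySpan<byte> Trim(ReadOnlySpan<byte> s)
    {
        int start = 0, end = s.Length;
        while (start < end && (s[start] == 0 || s[start] == 32)) start++;
        while (end > start && (s[end - 1] == 0 || s[end - 1] == 32)) end--;
        return s.Slice(start, end - start);
    }
}
```

With a field that starts with ACK (0x06) followed by octal digits, `TryParseStrict` returns false while `TryParseLenient` parses the digits, which is the trade-off discussed above.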

am11 on Nov 11, 2022

No more work for 7.0 in my opinion, @danmoseley. Thanks for moving the milestone.

carlossanlop on Aug 25, 2022

Gah, totally forgot about that PR. 😄 Thanks for the reminder.

But at least we now know that the exceptions for those files are ok. So there’s no rush.

carlossanlop on Aug 24, 2022

What remains here to close it?

The libarchive tests.

carlossanlop on Aug 24, 2022

Hmm not ready yet. The latest package version is from last Friday. I’ll post the number when I see it.

carlossanlop on Aug 22, 2022

Here is another corpus under permissive license. I don’t have time to run the code above on those too, but we should do that after we fix the bugs above. https://github.com/alexcrichton/tar-rs/tree/master/tests/archives

danmoseley on Aug 22, 2022