runtime: TarReader throws on various archives that other tools accept
I tried opening:
- each of the tar files used to test Golang’s tar package (found here, with details about each in the tests here).
- each of the tar files used to test node-tar, found here.
- each of the tar files used to test libarchive, found here. Note I had to uudecode these.
Note all the above have permissive licenses so it may be possible to borrow these tars for our test assets.
I used the test code below to open each one, ignored those that opened successfully, and for those that failed checked whether other tools could open them. The interesting cases are where other tools (particularly GNU tar) can open them, but we cannot. Note: I mostly didn’t extract the entries, just checked they could be listed. In some cases, the tar can be listed, but extraction will fail.
test code I used
// See https://aka.ms/new-console-template for more information
using System.Formats.Tar;
using Xunit;

public static class C
{
    public static async Task Main()
    {
        List<Task> tasks = new();
        // Open every archive in the corpus concurrently, one task per file.
        foreach (string path in Directory.EnumerateFiles(@"C:\git\go\src\archive\tar\testdata", "*.tar"))
        {
            tasks.Add(Task.Run(async () =>
            {
                // Declared outside the try so the catch can report the last entry read.
                TarEntry? entry = null;
                try
                {
                    //Console.WriteLine($"{path} opening...");
                    using FileStream fs = new(path, FileMode.Open, FileAccess.Read);
                    using TarReader reader = new(fs, leaveOpen: false);
                    while ((entry = await reader.GetNextEntryAsync()) != null)
                    {
                        Assert.NotEmpty(entry.Name);
                        Assert.True(Enum.IsDefined(entry.EntryType));
                        Assert.True(Enum.IsDefined(entry.Format));
                        if (entry.EntryType == TarEntryType.Directory)
                            continue;
                        // Drain the entry's data to check that it is readable.
                        Stream? ds = entry.DataStream;
                        if (ds != null && ds.Length > 0)
                        {
                            using var ms = new MemoryStream();
                            await ds.CopyToAsync(ms);
                        }
                    }
                }
                catch (Exception ex) //when (!(ex is FormatException))
                {
                    Console.WriteLine($"{path} opening {entry?.Name} threw {ex.Message}");
                }
            }));
        }
        await Task.WhenAll(tasks);
    }
}
source | file | issue | gnu tar | 7z | golang | .NET | .NET Exception |
---|---|---|---|---|---|---|---|
golang | gnu-multi-hdrs.tar | duplicate headers | reads one | reads one w/warning | reads one | ERROR | A metadata entry of type ‘LongPath’ was unexpectedly found after a metadata entry of type ‘LongPath’. |
golang | gnu-incremental.tar | incremental format | reads ok | reads ok | | ERROR | Unable to read beyond the end of the stream. |
golang | invalid-go17.tar | ?? | reads ok | reads ok | reads ok | ERROR | Could not find any recognizable digits. |
golang | hdr-only.tar | just header | reads with errors | reads ok | reads ok | ERROR | Additional non-parsable characters are at the end of the string. |
golang | nil-uid.tar | zero uid | reads ok | reads w/warnings | reads ok | ERROR | Unable to read beyond the end of the stream. |
golang | pax-multi-hdrs.tar | 2 headers | reads ok | reads w/warnings | reads ok | ERROR | A metadata entry of type ‘ExtendedAttributes’ was unexpectedly found after a metadata entry of type ‘ExtendedAttributes’. |
golang | pax-bad-mtime-file.tar | bad modified time | reads ok | reads w/warnings | | ERROR | Unable to read beyond the end of the stream. |
golang | pax-pos-size-file.tar | ? | reads ok | reads w/warnings | reads ok | ERROR | Unable to read beyond the end of the stream. |
golang | v7.tar | v7 | reads ok | reads ok | reads ok | ERROR | Could not find any recognizable digits. |
golang | sparse-formats.tar | something about sparseness | reads ok | reads ok | | ERROR | Additional non-parsable characters are at the end of the string. |
golang | ustar-file-reg.tar | non-zero device numbers. | reads ok | reads ok | | ERROR | Unable to read beyond the end of the stream. |
golang | writer-big.tar | truncated huge | ERROR | reads ok | | ERROR | Could not find any recognizable digits. |
golang | pax-path-hdr.tar | ? | reads empty | ERROR | reads header | ERROR | Unable to read beyond the end of the stream. |
golang | writer-big-long.tar | truncated huge | ERROR | reads w/ unexpected end of data | reads ok | ERROR | Unable to read beyond the end of the stream. |
mine | huge.tar | dd if=/dev/zero bs=1G count=16 > huge.tar | reads | | | ERROR | Value was either too large or too small for a UInt32 |
golang | issue10968.tar | garbled header | ERROR | | | ERROR (but OK) | Could not find any recognizable digits. |
golang | issue11169.tar | ?? | ERROR | | | ERROR (but OK) | Additional non-parsable characters are at the end of the string. |
golang | neg-size.tar | negative size | ERROR | refuses | ERROR | ERROR (but OK) | Could not find any recognizable digits. |
golang | pax-bad-hdr-file.tar | bad header | reads with errors | reads ok | ERROR | ERROR (but OK) | Unable to read beyond the end of the stream. |
node | long-pax.tar | 120 byte filename (pax limit 100) | reads headers | reads w/ unexpected end of data | | ERROR | 120-byte-filename-cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc threw Unable to read beyond the end of the stream. |
node | next-file-has-long.tar | link to 170 byte name in GNU | | | | ERROR | Entry ‘NextFileHasLongPath’ was expected to be in the GNU format, but did not have the expected version data. |
node | path-missing.tar | empty name | “Substituting `.’ for empty member name” (but not clear this is useful…) | silently uses tar file name | | ERROR on extraction | Cannot create ‘c:\tar’ because a file or directory with the same name already exists (NOTE – we should probably fix to fail earlier, in GetDestinationAndLinkPaths()) |
node | links-strip.tar | ?symlink and hardlinks | reads ok | reads w/ unexpected end of data | | ERROR | Unable to read beyond the end of the stream. |
mine | empty.tar | 0 bytes | reads OK | reads ok | | OK | |
libarchive | test_compat_gtar_2.tar | huge gid | reads OK | reads ok | | ERROR | Could not find any recognizable digits. |
libarchive | test_compat_perl_archive_tar.tar | ? | reads OK | reads ok | | ERROR | Could not find any recognizable digits. |
libarchive | test_compat_gtar_1.tar | 200 byte filenames and symlink? | reads OK | reads ok | | ERROR | Could not find any recognizable digits. |
libarchive | test_compat_plexus_archiver_tar.tar | | reads OK w/tar: A lone zero block at 3 | reads w/ There are some data after the end of the payload data | | ERROR | Could not find any recognizable digits. |
libarchive | test_compat_solaris_tar_acl.tar | | reads OK w/Unknown file type ‘A’ | reads ok | | OK | (no exception, but unexpected TarEntryType 65 = ‘A’ … A custom extension) |
libarchive | test_compat_tar_hardlink_1.tar | | reads OK | reads w/ unexpected end of data | | ERROR | Could not find any recognizable digits. |
libarchive | test_read_format_gtar_sparse_1_17_posix00.tar | | reads OK | reads ok | | ERROR | The entry ‘./PaxHeaders.38659/sparse’ has a duplicate extended attribute. |
libarchive | test_read_format_tar_invalid_pax_size.tar | | ERRORS | ERROR | | ERROR | Could not find any recognizable digits. |
Possibly some of these are expected limitations, but for the others we should add checkboxes and work through and fix them.
About this issue
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 60 (56 by maintainers)
Would we accept a change that, on failure to decompress, included the compression format in the message (presumably by looking at magic numbers)? That might help in cases like this. It could just be best effort. Hmm, maybe we already decided we didn’t want to; I can’t remember. But IIRC others have had this kind of confusion.
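A minimal sketch of what such best-effort detection could look like, assuming a hypothetical helper named `CompressionSniffer` (it is not part of `System.Formats.Tar`). It only compares the first bytes of the input against the well-known signatures of gzip, xz, bzip2, and zstd, so a caller could mention the likely format in an error message:

```csharp
using System;

public static class CompressionSniffer
{
    // Well-known leading signatures ("magic numbers") of common compressed formats.
    private static readonly (string Name, byte[] Magic)[] Signatures =
    {
        ("gzip",  new byte[] { 0x1F, 0x8B }),
        ("xz",    new byte[] { 0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00 }),
        ("bzip2", new byte[] { 0x42, 0x5A, 0x68 }),        // "BZh"
        ("zstd",  new byte[] { 0x28, 0xB5, 0x2F, 0xFD }),
    };

    // Returns the name of the first matching format, or null when none matches.
    // Hypothetical API: best effort only, it never inspects more than the prefix.
    public static string? Detect(ReadOnlySpan<byte> header)
    {
        foreach ((string name, byte[] magic) in Signatures)
        {
            if (header.Length >= magic.Length &&
                header.Slice(0, magic.Length).SequenceEqual(magic))
            {
                return name;
            }
        }
        return null;
    }
}
```

A reader that fails to parse the first header could call `CompressionSniffer.Detect` on the bytes it already buffered and, when the result is not null, append something like "the stream appears to be xz-compressed" to the exception message.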
@NickeManarin the `*.xz` extension means the archive was compressed using the LZMA algorithm: https://www.tutorialspoint.com/using-xz-compression-in-linux#:~:text=The xz compression algorithm works,block independently using LZMA algorithm.
We don’t yet support the LZMA algorithm in `System.IO.Compression`, but we do have an issue tracking the request to eventually add it: https://github.com/dotnet/runtime/issues/1542
A very easy workaround is to import `CSharpCompress` to read the LZMA part of your archive, and then pass it to the `System.Formats.Tar.TarFile` stream-based extraction method. I tested it and it works.
Hope that helps!
How confident are we in the opposite direction, that tars produced by TarWriter are consumable by all commonly-used tools? Do we have tests for that direction, e.g. generate various outputs with TarWriter, shell out to `tar` to unpack, and compare that everything roundtripped as expected?
This delta fixes all issues with node-tar fixtures: https://github.com/dotnet/runtime/compare/main...am11:runtime:feature/system.formats.tar/hardlinks-support (working on tests).
@MichalPetryka it means I didn’t try it.
Jared’s point is something like: even if you check File.Exists, you still need to handle the possibility of the file not existing a moment later when you try to read it. In this case, if such a thing happened, an exception would be thrown to the caller, which is fine. Yes, as a matter of style, or for maximum efficiency, we could catch FileNotFoundException instead, so long as the message was just as good.
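To illustrate the catch-instead-of-check style mentioned above, here is a small sketch. `ArchiveOpener.OpenArchive` is a hypothetical helper (not the code under discussion), and the message text is only an assumption about what "just as good" might look like:

```csharp
using System;
using System.IO;

public static class ArchiveOpener
{
    // Hypothetical helper: opens the file directly instead of calling File.Exists first.
    // There is no window between a check and the open, and a missing file still
    // surfaces to the caller as one clear error.
    public static FileStream OpenArchive(string path)
    {
        try
        {
            return new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
        }
        catch (FileNotFoundException ex)
        {
            // Rethrow with a message written for the caller, keeping the original as InnerException.
            throw new FileNotFoundException($"Archive '{path}' does not exist.", path, ex);
        }
    }
}
```

Other failures, such as a missing parent directory or a sharing violation, are deliberately left to propagate unchanged in this sketch.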
I agree, we could add logic to TarReader to detect a compressed archive by reading the magic numbers. I opened issue https://github.com/dotnet/runtime/issues/89056 to track that request specifically. I would like people looking for the error `Unable to parse number` to get directed there.
This issue can be closed since it was tracking a different problem (missing edge cases that I already addressed).
We had the following patch in #74358, which matched the behavior of BSD and GNU tar(1) as well as libarchive:
ignore non-octal bytes when reading the attributes
It was rejected because it seemed too permissive going by the standard, which suggests ignoring only bytes 0 and 32 of the ASCII table (the Fedora image has ACK (ASCII 06) at index 0…).
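The difference between the two behaviors can be sketched as follows. This is an illustration only, not the patch from #74358: the type `OctalField` and its methods are hypothetical, an all-padding field is assumed to mean zero, and overflow is not handled.

```csharp
using System;

public static class OctalField
{
    // Strict reading: trim only NUL (0) and space (32) at both ends,
    // then require every remaining byte to be an octal digit.
    public static bool TryParseStrict(ReadOnlySpan<byte> field, out long value)
    {
        value = 0;
        int start = 0, end = field.Length;
        while (start < end && (field[start] == 0 || field[start] == 32)) start++;
        while (end > start && (field[end - 1] == 0 || field[end - 1] == 32)) end--;

        for (int i = start; i < end; i++)
        {
            byte b = field[i];
            if (b < (byte)'0' || b > (byte)'7')
            {
                value = 0;
                return false; // any other byte makes the field invalid
            }
            value = (value << 3) | (long)(b - (byte)'0');
        }
        return true; // assumption: an empty or all-padding field parses as 0
    }

    // Permissive reading: skip every byte that is not an octal digit.
    public static long ParseLenient(ReadOnlySpan<byte> field)
    {
        long value = 0;
        foreach (byte b in field)
        {
            if (b >= (byte)'0' && b <= (byte)'7')
            {
                value = (value << 3) | (long)(b - (byte)'0');
            }
        }
        return value;
    }
}
```

With a field such as ACK followed by `0000644` and a NUL terminator, `TryParseStrict` returns false while `ParseLenient` returns 420 (octal 644), which is the kind of input the Fedora image produces.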
No more work for 7.0 in my opinion, @danmoseley. Thanks for moving the milestone.
Gah, totally forgot about that PR. 😄 Thanks for the reminder.
But at least we now know that the exceptions for those files are ok. So there’s no rush.
The libarchive tests.
Hmm not ready yet. The latest package version is from last Friday. I’ll post the number when I see it.
Here is another corpus under permissive license. I don’t have time to run the code above on those too, but we should do that after we fix the bugs above. https://github.com/alexcrichton/tar-rs/tree/master/tests/archives