runtime: TarReader throws on archive that other tools accept

Description

I have a *.tar.gz file that I am trying to process in C# by passing the GZipStream to TarReader.

This fails at the line

TarHelpers.ParseOctal<ulong>(buffer.Slice(124, 12))

The

if (num >= 8)

check inside ParseOctal reaches ThrowInvalidNumber because the num being tested is 80.
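To illustrate where the 80 comes from, here is a minimal sketch. It assumes ParseOctal turns each byte into a digit by subtracting ASCII '0'; that is an assumption about the implementation, and the variable names below are illustrative only.

```csharp
using System;

// First byte of the 12-byte size field in the header from this report.
byte first = 0x80;

// Assumed digit conversion (not confirmed from the runtime source):
// subtract ASCII '0' (0x30) from the raw byte.
uint num = (uint)(first - '0');

Console.WriteLine(num);      // 80
Console.WriteLine(num >= 8); // True, so the byte is rejected as a non-octal digit
```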

I have seen a similar issue where nonconformance to the spec was blamed for this kind of failure.

However, Azure Data Factory processes this *.tar.gz without problems, and Windows Explorer also seems to understand its contents: it shows the file as a compressed archive, and clicking through shows the details of the single file it contains.

The complete 512-byte header is

6E625F726D73655F727061735F696E76656E746F72792E64617400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003030303036363600303135323036310030313532303631008000000000000011E7310C9D3134353134303736373333003031363332350020300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000075737461722020006F7261636C6500000000000000000000000000000000000000000000000000006F696E7374616C6C000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

So the problematic part is

8000000000000011E7310C9D

The underlying file size is 76893195421 bytes.

Reproduction Steps

It is not practical for me to share the actual file, but it consists of the 512-byte header already given,

followed by 76893195421 bytes of arbitrary data (the actual file contents),

followed by a newline and 8547 null bytes.

You can use

var header = Convert.FromHexString(
    "6E625F726D73655F727061735F696E76656E746F72792E64617400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003030303036363600303135323036310030313532303631008000000000000011E7310C9D3134353134303736373333003031363332350020300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000075737461722020006F7261636C6500000000000000000000000000000000000000000000000000006F696E7374616C6C000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000");


System.IO.File.WriteAllBytes(@"C:\somefile.tar", header);

Then open the archive in 7-Zip, as one example, to see that this utility can cope with it.

Expected behavior

These types of file have been generated by a system in my company for years.

Every utility used to read them has apparently had no issues with them, except for the new .NET System.Formats.Tar classes.

I have no idea whether this file format in fact violates any spec, but since other utilities can cope with it, I would expect to be able to open it (maybe by setting a mode parameter that ignores such issues as long as they aren’t fatal to the extraction).

Actual behavior

System.IO.InvalidDataException: ‘Unable to parse number.’ is thrown on the initial tarReader.GetNextEntry() call.

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response

About this issue

State: open
Created 8 months ago
Comments: 15 (11 by maintainers)

Most upvoted comments

Ah here’s another doc mentioning the high bit.

https://manpages.debian.org/testing/libarchive-dev/tar.5.en.html

Another extension, utilized by GNU tar, star, and other newer tar implementations, permits binary numbers in the standard numeric fields. This is flagged by setting the high bit of the first byte. The remainder of the field is treated as a signed twos-complement value.

danmoseley on Oct 20, 2023

 	System.Formats.Tar.dll!System.Formats.Tar.TarHelpers.ThrowInvalidNumber() Line 199	C#
 	System.Formats.Tar.dll!System.Formats.Tar.TarHelpers.ParseOctal<ulong>(System.ReadOnlySpan<byte> buffer) Line 191	C#
 	System.Formats.Tar.dll!System.Formats.Tar.TarHeader.TryReadCommonAttributes(System.Span<byte> buffer, System.Formats.Tar.TarEntryFormat initialFormat) Line 390	C#
 	System.Formats.Tar.dll!System.Formats.Tar.TarHeader.TryReadAttributes(System.Formats.Tar.TarEntryFormat initialFormat, System.Span<byte> buffer) Line 170	C#
 	System.Formats.Tar.dll!System.Formats.Tar.TarHeader.TryGetNextHeader(System.IO.Stream archiveStream, bool copyData, System.Formats.Tar.TarEntryFormat initialFormat, bool processDataBlock) Line 145	C#
 	System.Formats.Tar.dll!System.Formats.Tar.TarReader.TryGetNextEntryHeader(bool copyData) Line 212	C#
 	System.Formats.Tar.dll!System.Formats.Tar.TarReader.GetNextEntry(bool copyData) Line 92	C#

The magic number in this data is "ustar  " (with 2 trailing spaces, bytes 75 73 74 61 72 20 20 00 in the header), which is apparently OLDGNU_MAGIC. 7z reports Characteristics: 0 GNU ASCII bin_psize bin_size

The size (12 bytes at offset 124) is 8000000000000011E7310C9D and it is failing on the first byte.

The code expects this field to be octal digits encoded as ASCII characters (i.e., '0' = 0x30 through '7' = 0x37), which it isn’t.

Instead, the high bit of the first byte is set, and the remaining bytes are a base-256 (big-endian binary) number. 0x11 0xE7 0x31 0x0C 0x9D is 76 893 195 421, which is what 7z reports.
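The decoding described above can be sketched as follows. This is a minimal illustration, not the System.Formats.Tar implementation: TarSizeSketch and ParseSizeField are hypothetical names, only non-negative base-256 values are handled, and the octal branch simply skips NUL and space padding wherever it appears.

```csharp
using System;

// The 12-byte size field from the header in this report.
byte[] size = Convert.FromHexString("8000000000000011E7310C9D");
Console.WriteLine(TarSizeSketch.ParseSizeField(size)); // 76893195421

// Hypothetical helper for illustration; not part of System.Formats.Tar.
static class TarSizeSketch
{
    public static long ParseSizeField(ReadOnlySpan<byte> field)
    {
        if (field.IsEmpty)
            throw new ArgumentException("Field is empty.", nameof(field));

        if ((field[0] & 0x80) != 0)
        {
            // Base-256 extension: the high bit of the first byte is a flag,
            // the remaining bits form a big-endian binary number.
            long value = field[0] & 0x7F;
            if ((value & 0x40) != 0)
                throw new NotSupportedException("Negative values are not handled in this sketch.");

            for (int i = 1; i < field.Length; i++)
            {
                // Refuse to shift if the result would not fit in a long.
                if (value > (long.MaxValue >> 8))
                    throw new OverflowException("Value does not fit in 64 bits.");
                value = (value << 8) | field[i];
            }
            return value;
        }

        // Classic encoding: ASCII octal digits with NUL or space padding.
        long result = 0;
        foreach (byte b in field)
        {
            if (b == 0 || b == (byte)' ')
                continue; // simplification: padding is skipped wherever it appears
            int digit = b - '0';
            if (digit < 0 || digit > 7)
                throw new FormatException("Invalid octal digit.");
            result = checked(result * 8 + digit);
        }
        return result;
    }
}
```

Checking the flag bit first means archives that use plain octal fields are parsed exactly as before, and only fields that opt in through the high bit take the binary path.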

As to why the high bit is set, I cannot find mention in the GNU tar spec, but Wikipedia says “2001 star introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field.[citation needed] GNU-tar and BSD-tar followed this idea”

I think the tar reader just can’t handle the format used for archives larger than 8GB?
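A quick check of where that limit comes from, as a sketch. It assumes the usual layout in which the 12-byte size field holds 11 octal digits plus one terminator byte; the variable names are illustrative.

```csharp
using System;

// Largest value that 11 ASCII octal digits can express.
long maxOctalSize = Convert.ToInt64("77777777777", 8);

// File size reported in this issue.
long reportedSize = 76893195421;

Console.WriteLine(maxOctalSize);                // 8589934591 (8 GiB minus 1 byte)
Console.WriteLine(reportedSize > maxOctalSize); // True, so octal cannot represent this size
```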

This isn’t my area, deferring to @carlossanlop

danmoseley on Oct 20, 2023