runtime: TarReader throws on archive that other tools accept
Description
I have a *.tar.gz file that I am trying to process in C# by passing the GZipStream to TarReader.
This fails at the line
TarHelpers.ParseOctal<ulong>(buffer.Slice(124, 12))
The
if (num >= 8)
check inside ParseOctal reaches the ThrowInvalidNumber path because the num being tested is 80 (consistent with the field's first byte, 0x80, minus ASCII '0', 0x30).
I have seen a similar issue where non-conformance to the spec was blamed for this type of failure.
But Azure Data Factory processes this *.tar.gz fine, and Windows Explorer also seems to understand its contents: it shows the file as a compressed archive, and clicking through shows the details of the single file it contains.
The complete 512-byte header is
6E625F726D73655F727061735F696E76656E746F72792E64617400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003030303036363600303135323036310030313532303631008000000000000011E7310C9D3134353134303736373333003031363332350020300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000075737461722020006F7261636C6500000000000000000000000000000000000000000000000000006F696E7374616C6C000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
So the problematic part is the 12-byte size field at offset 124:
8000000000000011E7310C9D
The underlying file size is 76893195421 bytes.
Reproduction Steps
It is not practical for me to share the actual file, but it consists of:
the 512-byte header already given,
followed by 76893195421 bytes of arbitrary data (the actual file contents),
followed by a newline and 8547 null bytes.
You can use
var header = Convert.FromHexString(
"6E625F726D73655F727061735F696E76656E746F72792E64617400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003030303036363600303135323036310030313532303631008000000000000011E7310C9D3134353134303736373333003031363332350020300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000075737461722020006F7261636C6500000000000000000000000000000000000000000000000000006F696E7374616C6C000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000");
System.IO.File.WriteAllBytes(@"C:\somefile.tar", header);
Then open the resulting archive in 7-Zip, as one example, to see that this utility can cope with the header.
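The same header can also be fed to TarReader in memory, without touching the disk. The following is only a sketch: `ReproHeader` and its `Ascii` helper are hypothetical names, the header is rebuilt field by field from the hex dump above using the standard ustar offsets, and what the runtime does with it depends on the .NET version in use (the report describes an InvalidDataException).

```csharp
using System;
using System.Formats.Tar;
using System.IO;
using System.Text;

// Header only, no file data: per the report, the failure happens while the
// size field is parsed, before any entry data would be read.
byte[] header = ReproHeader.Build();
using var stream = new MemoryStream(header);
using var reader = new TarReader(stream);
try
{
    var entry = reader.GetNextEntry();
    Console.WriteLine(entry is null
        ? "No entry returned."
        : $"Parsed entry: {entry.Name}, {entry.Length} bytes");
}
catch (InvalidDataException ex)
{
    Console.WriteLine($"InvalidDataException: {ex.Message}");
}
catch (EndOfStreamException ex)
{
    // Possible on runtimes that accept the header but then look for the data.
    Console.WriteLine($"EndOfStreamException: {ex.Message}");
}

// Hypothetical helper that rebuilds the 512-byte header from this issue.
static class ReproHeader
{
    public static byte[] Build()
    {
        var h = new byte[512];
        Ascii("nb_rmse_rpas_inventory.dat", h, 0);   // name
        Ascii("0000666\0", h, 100);                  // mode
        Ascii("0152061\0", h, 108);                  // uid
        Ascii("0152061\0", h, 116);                  // gid
        Convert.FromHexString("8000000000000011E7310C9D").CopyTo(h, 124); // size
        Ascii("14514076733\0", h, 136);              // mtime
        Ascii("016325\0 ", h, 148);                  // checksum, as in the dump
        h[156] = (byte)'0';                          // typeflag: regular file
        Ascii("ustar  \0", h, 257);                  // magic + version
        Ascii("oracle", h, 265);                     // uname
        Ascii("oinstall", h, 297);                   // gname
        return h;
    }

    private static void Ascii(string s, byte[] dest, int offset) =>
        Encoding.ASCII.GetBytes(s).CopyTo(dest, offset);
}
```

Building the header from named fields rather than one long hex string makes it easier to see which field the parser is looking at when it fails.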
Expected behavior
These types of file have been generated by a system in my company for years.
Every utility used to read them has apparently had no issues with them except for the new .NET framework classes.
I have no idea whether this file format violates any spec, but I would expect to be able to open it, since other utilities can cope (perhaps via a mode parameter that ignores such issues as long as they aren’t fatal to the extraction).
Actual behavior
A System.IO.InvalidDataException (‘Unable to parse number.’) is thrown on the initial tarReader.GetNextEntry() call.
Regression?
No response
Known Workarounds
No response
Configuration
No response
Other information
No response
About this issue
- State: open
- Created 8 months ago
- Comments: 15 (11 by maintainers)
Ah here’s another doc mentioning the high bit.
https://manpages.debian.org/testing/libarchive-dev/tar.5.en.html
The magic number in this data is "ustar  " (with 2 spaces), which is apparently OLDGNU_MAGIC. 7z reports:
Characteristics: 0 GNU ASCII bin_psize bin_size
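To make the magic check concrete, here is a small sketch that classifies the 8 bytes at header offset 257 (magic plus version). `TarMagic.Describe` is a hypothetical helper written for this comment, not an API in System.Formats.Tar; the two byte patterns it compares against are the POSIX ustar value ("ustar", NUL, then "00") and the old GNU value ("ustar", two spaces, NUL).

```csharp
using System;
using System.Text;

// Bytes 257..264 of the header in this issue.
byte[] magicAndVersion = Convert.FromHexString("7573746172202000");
Console.WriteLine(TarMagic.Describe(magicAndVersion)); // old GNU (OLDGNU_MAGIC)

// Hypothetical helper, for illustration only.
static class TarMagic
{
    public static string Describe(ReadOnlySpan<byte> magicAndVersion)
    {
        if (magicAndVersion.Length != 8)
            throw new ArgumentException("Expected the 8 bytes at offset 257.");

        string text = Encoding.ASCII.GetString(magicAndVersion);
        if (text == "ustar\0" + "00")
            return "POSIX ustar";
        if (text == "ustar  \0")
            return "old GNU (OLDGNU_MAGIC)";
        return "unknown (possibly V7 or not a tar header)";
    }
}
```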
The size (12 bytes at offset 124) is 8000000000000011E7310C9D and it is failing on the first byte.
The code expects this field to be octal digits encoded as ASCII characters (i.e., '0' = 0x30 through '7' = 0x37), which it isn’t.
Instead, the high bit of the first byte is set and the remaining bytes hold the value in base 256 (big-endian). 0x11 0xE7 0x31 0x0C 0x9D is 76 893 195 421, which is what 7z reports.
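As a sketch of that decoding rule (not the runtime's actual implementation), the helper below reads a tar numeric field either as base-256 when the high bit of the first byte is set, or as ASCII octal otherwise. `TarNumericField.Parse` is a hypothetical name, and the sketch assumes non-negative values only.

```csharp
using System;

// The 12-byte size field from the header in this issue (offset 124).
byte[] size = Convert.FromHexString("8000000000000011E7310C9D");
Console.WriteLine(TarNumericField.Parse(size)); // 76893195421

// Hypothetical helper, for illustration only.
static class TarNumericField
{
    public static ulong Parse(ReadOnlySpan<byte> field)
    {
        if (field.Length == 0)
            throw new FormatException("Empty numeric field.");

        if ((field[0] & 0x80) != 0)
        {
            // Base-256: drop the marker bit, then read the rest big-endian.
            // Assumption: negative (0xFF-prefixed) values are not handled.
            ulong value = (ulong)(field[0] & 0x7F);
            for (int i = 1; i < field.Length; i++)
            {
                if ((value >> 56) != 0)
                    throw new OverflowException("Value does not fit in 64 bits.");
                value = (value << 8) | field[i];
            }
            return value;
        }

        // Classic encoding: optional leading spaces, then octal digits,
        // terminated by a NUL or a space.
        int pos = 0;
        while (pos < field.Length && field[pos] == (byte)' ')
            pos++;

        ulong octal = 0;
        for (; pos < field.Length && field[pos] != 0 && field[pos] != (byte)' '; pos++)
        {
            uint digit = (uint)(field[pos] - '0');
            if (digit >= 8)
                throw new FormatException("Invalid octal digit.");
            octal = checked(octal * 8 + digit);
        }
        return octal;
    }
}
```

Checking the high bit first means archives that only use octal fields take exactly the same path as before, so a reader extended this way would not change behavior for them.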
As to why the high bit is set, I cannot find mention in the GNU tar spec, but Wikipedia says “2001 star introduced a base-256 coding that is indicated by setting the high-order bit of the leftmost byte of a numeric field.[citation needed] GNU-tar and BSD-tar followed this idea”
So I think the tar reader just can’t handle the format used for entries of 8 GiB or more, which is the point where the size no longer fits in the field's 11 octal digits?
This isn’t my area, deferring to @carlossanlop