libexpat: Support for XML 1.0r5/1.1 (was: XML Parser does not respect valid XML Start and Name chars from XML specification 1.0r5 and after)

I found that the set of characters the XML parser accepts as valid name-start and name characters within XML documents does not seem to be correct, so I put together a test in which each character from 0x0001 through 0xFFFD (converted to UTF-8) is inserted into the following two documents:

<?xml version="1.1" encoding="utf-8"?>
<data>
	<CHARACTERa>
		CHARACTERa
	</CHARACTERa>
</data>

and

<?xml version="1.1" encoding="utf-8"?>
<data>
	<aCHARACTER>
		aCHARACTER
	</aCHARACTER>
</data>
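
For reference, a generator for these test documents could be sketched in C roughly as follows. The helper names encode_utf8_bmp and build_test_doc are hypothetical (they come neither from the original test nor from libexpat), and the sketch assumes the surrogate range 0xD800-0xDFFF is skipped, since those code points have no valid UTF-8 encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Encode one BMP code point as UTF-8 into out (NUL-terminated).
 * Returns the number of bytes written (1-3), or 0 for code points
 * that cannot be encoded here (0, surrogates, anything above 0xFFFF). */
static size_t encode_utf8_bmp(unsigned int cp, char out[4])
{
    if (cp == 0 || cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        out[0] = (char)cp;
        out[1] = '\0';
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        out[2] = '\0';
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    out[3] = '\0';
    return 3;
}

/* Write one test document into buf. If as_start is non-zero the
 * character is placed before 'a' (name-start test), otherwise after
 * it (name-char test). Returns the snprintf result, or -1 if the
 * code point cannot be encoded. */
static int build_test_doc(unsigned int cp, int as_start,
                          char *buf, size_t buflen)
{
    char ch[4];
    char name[5]; /* up to 3 UTF-8 bytes + 'a' + NUL */
    size_t n;

    assert(buf != NULL && buflen > 0);
    n = encode_utf8_bmp(cp, ch);
    if (n == 0)
        return -1;
    if (as_start) {
        memcpy(name, ch, n);
        name[n] = 'a';
    } else {
        name[0] = 'a';
        memcpy(name + 1, ch, n);
    }
    name[n + 1] = '\0';

    return snprintf(buf, buflen,
                    "<?xml version=\"1.1\" encoding=\"utf-8\"?>\n"
                    "<data>\n\t<%s>\n\t\t%s\n\t</%s>\n</data>\n",
                    name, name, name);
}
```

A driver would loop cp from 0x0001 to 0xFFFD, call build_test_doc for both positions, hand each buffer to the parser, and record which code points parse without error.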

According to the XML spec here: https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-common-syn the valid NameStartChars should match these ranges: [#0x3A - #0x3A] [#0x41 - #0x5A] [#0x5F - #0x5F] [#0x61 - #0x7A] [#0xC0 - #0xD6] [#0xD8 - #0xF6] [#0xF8 - #0x2FF] [#0x370 - #0x37D] [#0x37F - #0x1FFF] [#0x200C - #0x200D] [#0x2070 - #0x218F] [#0x2C00 - #0x2FEF] [#0x3001 - #0xD7FF] [#0xF900 - #0xFDCF] [#0xFDF0 - #0xFFFD]

while the valid NameChars should match: [#0x2D - #0x2E] [#0x30 - #0x3A] [#0x41 - #0x5A] [#0x5F - #0x5F] [#0x61 - #0x7A] [#0xB7 - #0xB7] [#0xC0 - #0xD6] [#0xD8 - #0xF6] [#0xF8 - #0x37D] [#0x37F - #0x1FFF] [#0x200C - #0x200D] [#0x203F - #0x2040] [#0x2070 - #0x218F] [#0x2C00 - #0x2FEF] [#0x3001 - #0xD7FF] [#0xF900 - #0xFDCF] [#0xFDF0 - #0xFFFD]
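
As a cross-check on these lists, the two productions can be written down as plain range tables. This is a minimal sketch transcribed from the NameStartChar and NameChar productions in section 2.3 of the XML 1.1 spec; the names cp_range, in_ranges, is_name_start_char and is_name_char are hypothetical and do not exist in libexpat. It also includes the supplementary range #x10000-#xEFFFF, which lies outside the 0x0001-0xFFFD range tested here.

```c
#include <assert.h>
#include <stddef.h>

struct cp_range {
    unsigned long lo;
    unsigned long hi;
};

/* NameStartChar production, XML 1.1 section 2.3 */
static const struct cp_range name_start_ranges[] = {
    {0x3A, 0x3A},     {0x41, 0x5A},     {0x5F, 0x5F},
    {0x61, 0x7A},     {0xC0, 0xD6},     {0xD8, 0xF6},
    {0xF8, 0x2FF},    {0x370, 0x37D},   {0x37F, 0x1FFF},
    {0x200C, 0x200D}, {0x2070, 0x218F}, {0x2C00, 0x2FEF},
    {0x3001, 0xD7FF}, {0xF900, 0xFDCF}, {0xFDF0, 0xFFFD},
    {0x10000, 0xEFFFF}
};

/* Characters NameChar allows in addition to NameStartChar:
 * '-', '.', digits, middle dot, combining marks, undertie/tie. */
static const struct cp_range name_extra_ranges[] = {
    {0x2D, 0x2E}, {0x30, 0x39}, {0xB7, 0xB7},
    {0x300, 0x36F}, {0x203F, 0x2040}
};

/* Linear scan; the tables are small enough that this is adequate
 * for a test harness. */
static int in_ranges(unsigned long cp, const struct cp_range *r, size_t n)
{
    size_t i;

    assert(r != NULL);
    for (i = 0; i < n; i++) {
        if (cp >= r[i].lo && cp <= r[i].hi)
            return 1;
    }
    return 0;
}

static int is_name_start_char(unsigned long cp)
{
    return in_ranges(cp, name_start_ranges,
                     sizeof name_start_ranges / sizeof name_start_ranges[0]);
}

static int is_name_char(unsigned long cp)
{
    return is_name_start_char(cp)
        || in_ranges(cp, name_extra_ranges,
                     sizeof name_extra_ranges / sizeof name_extra_ranges[0]);
}
```

Comparing the parser's accept/reject result for each code point against these two predicates gives the expected-versus-actual lists.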

However, what I found from expat’s output was lots and lots of missing valid characters: For instance, running the test with libexpat, the NameStart chars looked as follows (Once you get past 0x00FF, things get very messy): [#0x3A - #0x3A] [#0x41 - #0x5A] [#0x5F - #0x5F] [#0x61 - #0x7A] [#0xC0 - #0xD6] [#0xD8 - #0xF6] [#0xF8 - #0x131] [#0x134 - #0x13E] [#0x141 - #0x148] [#0x14A - #0x17E] [#0x180 - #0x1C3] [#0x1CD - #0x1F0] [#0x1F4 - #0x1F5] [#0x1FA - #0x217] [#0x250 - #0x2A8] [#0x2BB - #0x2C1] [#0x386 - #0x386] [#0x388 - #0x38A] [#0x38C - #0x38C] [#0x38E - #0x3A1] [#0x3A3 - #0x3CE] [#0x3D0 - #0x3D6] [#0x3DA - #0x3DA] [#0x3DC - #0x3DC] [#0x3DE - #0x3DE] [#0x3E0 - #0x3E0] [#0x3E2 - #0x3F3] [#0x401 - #0x40C] [#0x40E - #0x44F] [#0x451 - #0x45C] [#0x45E - #0x481] [#0x490 - #0x4C4] [#0x4C7 - #0x4C8] [#0x4CB - #0x4CC] [#0x4D0 - #0x4EB] [#0x4EE - #0x4F5] [#0x4F8 - #0x4F9] [#0x531 - #0x556] [#0x559 - #0x559] [#0x561 - #0x586] [#0x5D0 - #0x5EA] [#0x5F0 - #0x5F2] [#0x621 - #0x63A] [#0x641 - #0x64A] [#0x671 - #0x6B7] [#0x6BA - #0x6BE] [#0x6C0 - #0x6CE] [#0x6D0 - #0x6D3] [#0x6D5 - #0x6D5] [#0x6E5 - #0x6E6] [#0x905 - #0x939] [#0x93D - #0x93D] [#0x958 - #0x961] [#0x985 - #0x98C] [#0x98F - #0x990] [#0x993 - #0x9A8] [#0x9AA - #0x9B0] [#0x9B2 - #0x9B2] [#0x9B6 - #0x9B9] [#0x9DC - #0x9DD] [#0x9DF - #0x9E1] [#0x9F0 - #0x9F1] [#0xA05 - #0xA0A] [#0xA0F - #0xA10] [#0xA13 - #0xA28] [#0xA2A - #0xA30] [#0xA32 - #0xA33] [#0xA35 - #0xA36] [#0xA38 - #0xA39] [#0xA59 - #0xA5C] [#0xA5E - #0xA5E] [#0xA72 - #0xA74] [#0xA85 - #0xA8B] [#0xA8D - #0xA8D] [#0xA8F - #0xA91] [#0xA93 - #0xAA8] [#0xAAA - #0xAB0] [#0xAB2 - #0xAB3] [#0xAB5 - #0xAB9] [#0xABD - #0xABD] [#0xAE0 - #0xAE0] [#0xB05 - #0xB0C] [#0xB0F - #0xB10] [#0xB13 - #0xB28] [#0xB2A - #0xB30] [#0xB32 - #0xB33] [#0xB36 - #0xB39] [#0xB3D - #0xB3D] [#0xB5C - #0xB5D] [#0xB5F - #0xB61] [#0xB85 - #0xB8A] [#0xB8E - #0xB90] [#0xB92 - #0xB95] [#0xB99 - #0xB9A] [#0xB9C - #0xB9C] [#0xB9E - #0xB9F] [#0xBA3 - #0xBA4] [#0xBA8 - #0xBAA] [#0xBAE - #0xBB5] 
[#0xBB7 - #0xBB9] [#0xC05 - #0xC0C] [#0xC0E - #0xC10] [#0xC12 - #0xC28] [#0xC2A - #0xC33] [#0xC35 - #0xC39] [#0xC60 - #0xC61] [#0xC85 - #0xC8C] [#0xC8E - #0xC90] [#0xC92 - #0xCA8] [#0xCAA - #0xCB3] [#0xCB5 - #0xCB9] [#0xCDE - #0xCDE] [#0xCE0 - #0xCE1] [#0xD05 - #0xD0C] [#0xD0E - #0xD10] [#0xD12 - #0xD28] [#0xD2A - #0xD39] [#0xD60 - #0xD61] [#0xE01 - #0xE2E] [#0xE30 - #0xE30] [#0xE32 - #0xE33] [#0xE40 - #0xE45] [#0xE81 - #0xE82] [#0xE84 - #0xE84] [#0xE87 - #0xE88] [#0xE8A - #0xE8A] [#0xE8D - #0xE8D] [#0xE94 - #0xE97] [#0xE99 - #0xE9F] [#0xEA1 - #0xEA3] [#0xEA5 - #0xEA5] [#0xEA7 - #0xEA7] [#0xEAA - #0xEAB] [#0xEAD - #0xEAE] [#0xEB0 - #0xEB0] [#0xEB2 - #0xEB3] [#0xEBD - #0xEBD] [#0xEC0 - #0xEC4] [#0xF40 - #0xF47] [#0xF49 - #0xF69] [#0x10A0 - #0x10C5] [#0x10D0 - #0x10F6] [#0x1100 - #0x1100] [#0x1102 - #0x1103] [#0x1105 - #0x1107] [#0x1109 - #0x1109] [#0x110B - #0x110C] [#0x110E - #0x1112] [#0x113C - #0x113C] [#0x113E - #0x113E] [#0x1140 - #0x1140] [#0x114C - #0x114C] [#0x114E - #0x114E] [#0x1150 - #0x1150] [#0x1154 - #0x1155] [#0x1159 - #0x1159] [#0x115F - #0x1161] [#0x1163 - #0x1163] [#0x1165 - #0x1165] [#0x1167 - #0x1167] [#0x1169 - #0x1169] [#0x116D - #0x116E] [#0x1172 - #0x1173] [#0x1175 - #0x1175] [#0x119E - #0x119E] [#0x11A8 - #0x11A8] [#0x11AB - #0x11AB] [#0x11AE - #0x11AF] [#0x11B7 - #0x11B8] [#0x11BA - #0x11BA] [#0x11BC - #0x11C2] [#0x11EB - #0x11EB] [#0x11F0 - #0x11F0] [#0x11F9 - #0x11F9] [#0x1E00 - #0x1E9B] [#0x1EA0 - #0x1EF9] [#0x1F00 - #0x1F15] [#0x1F18 - #0x1F1D] [#0x1F20 - #0x1F45] [#0x1F48 - #0x1F4D] [#0x1F50 - #0x1F57] [#0x1F59 - #0x1F59] [#0x1F5B - #0x1F5B] [#0x1F5D - #0x1F5D] [#0x1F5F - #0x1F7D] [#0x1F80 - #0x1FB4] [#0x1FB6 - #0x1FBC] [#0x1FBE - #0x1FBE] [#0x1FC2 - #0x1FC4] [#0x1FC6 - #0x1FCC] [#0x1FD0 - #0x1FD3] [#0x1FD6 - #0x1FDB] [#0x1FE0 - #0x1FEC] [#0x1FF2 - #0x1FF4] [#0x1FF6 - #0x1FFC] [#0x2126 - #0x2126] [#0x212A - #0x212B] [#0x212E - #0x212E] [#0x2180 - #0x2182] [#0x3007 - #0x3007] [#0x3021 - #0x3029] [#0x3041 - #0x3094] [#0x30A1 - #0x30FA] 
[#0x3105 - #0x312C] [#0x4E00 - #0x9FA5] [#0xAC00 - #0xD7A3]

The NameChars output is similarly messy.

When I looked at Expat’s code, the problem stemmed from lib/nametab.h, which holds the lookup table of valid characters that lib/xmltok.c uses via the various _GET_NAMING() macros. I tried to modify the existing mapping table but could not figure out how to do so, so I took a somewhat simpler approach: full-sized bitmap tables for name-start and name characters.

I changed the nametab.h tables to a straightforward one-way bit mapping of this form: nametab_h.txt

static const unsigned int nameStartBitmap[] = {
	0x00000000, 0x04000000, 0x87FFFFFE, 0x07FFFFFE, //0x00
	0x00000000, 0x00000000, 0xFF7FFFFF, 0xFF7FFFFF,
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, //0x01
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF,
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, //0x02
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF,
	0x00000000, 0x00000000, 0x00000000, 0xBFFF0000, //0x03
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF,
//etc...
};

static const unsigned int nameCharBitmap[] = {
	0x00000000, 0x07FF6000, 0x87FFFFFE, 0x07FFFFFE, //0x00
	0x00000000, 0x00800000, 0xFF7FFFFF, 0xFF7FFFFF,
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, //0x01
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF,
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, //0x02
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF,
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xBFFFFFFF, //0x03
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF,
//etc... up to 0xFF
};

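Each 32-bit entry covers 32 consecutive code points within the 16-bit Unicode range (0x0000 through 0xFFFF), and each bit corresponds to a single Unicode code point referred to within the W3C spec. Each commented group of eight entries therefore covers one 256-code-point page. Within the XML 1.1 spec, code points that need 4-byte UTF-8 sequences are allowed in names (the spec lists #x10000 through #xEFFFF), so in the macros I immediately return 1 for 4-byte characters and changed isNever for 4 bytes to isAlways.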
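
A table in this layout does not have to be written by hand. The following is a minimal sketch of how it could be generated from a list of ranges; the names cp_range, BMP_WORDS, build_bitmap and bitmap_has are hypothetical, and the use of uint32_t rather than unsigned int is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One bit per code point in 0x0000..0xFFFF, 32 code points per word. */
#define BMP_WORDS (0x10000 / 32)

struct cp_range {
    uint32_t lo;
    uint32_t hi;
};

/* Clear the bitmap, then set the bit for every code point covered
 * by one of the given ranges. Word index is cp / 32, bit is cp % 32. */
static void build_bitmap(const struct cp_range *ranges, size_t count,
                         uint32_t bitmap[BMP_WORDS])
{
    size_t i;
    uint32_t cp;

    memset(bitmap, 0, BMP_WORDS * sizeof(uint32_t));
    for (i = 0; i < count; i++) {
        assert(ranges[i].lo <= ranges[i].hi && ranges[i].hi <= 0xFFFF);
        for (cp = ranges[i].lo; cp <= ranges[i].hi; cp++)
            bitmap[cp >> 5] |= (uint32_t)1 << (cp & 0x1F);
    }
}

/* Test a single code point against a bitmap built above. */
static int bitmap_has(const uint32_t *bitmap, uint32_t cp)
{
    if (cp > 0xFFFF)
        return 0;
    return (int)((bitmap[cp >> 5] >> (cp & 0x1F)) & 1u);
}
```

Feeding it the NameStartChar ranges from the spec should reproduce the leading entries of nameStartBitmap, for example 0x04000000 at index 1 (only ':' set) and 0xBFFF0000 at index 27 (0x370 through 0x37D plus 0x37F).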

I likewise changed the xmltok.c UTF8_GET_NAMING*() macros as follows: xmltok_c.txt

//Basically, the UTF8 to UCS conversion is applied and mapped into the bit maps above. 
#define UTF8_GET_NAMING2(pages, byte) \
		(pages[(((((byte)[0]) & 0x1C) >> 2) << 3) \
			+ ((((((byte)[0]) & 0x03) << 6) \
			+ (((byte)[1]) & 0x3F)) >> 5)] \
		 & (1u << ((((((byte)[0]) & 0x03) << 6) \
			+ (((byte)[1]) & 0x3F)) & 0x1F)))

#define UTF8_GET_NAMING3(pages, byte) \
		(pages[((((((byte)[0]) & 0x0F) << 4) \
				+ ((((byte)[1]) & 0x3C) >> 2)) << 3) \
			+ ((((((byte)[1]) & 0x03) << 6) \
				+ (((byte)[2]) & 0x3F)) >> 5)] \
		& (1u << ((((((byte)[1]) & 0x03) << 6) \
			+ (((byte)[2]) & 0x3F)) & 0x1F)))

/* 4-byte sequences are always accepted; this must be an expression, */
/* not a return statement, because it is used inside the ?: chain below. */
#define UTF8_GET_NAMING4(pages, byte) (1)

#define UTF8_GET_NAMING(pages, p, n) \
  ((n) == 2 \
  ? UTF8_GET_NAMING2(pages, (const unsigned char *)(p)) \
  : ((n) == 3 \
     ? UTF8_GET_NAMING3(pages, (const unsigned char *)(p)) \
     : ((n) == 4 \
        ? UTF8_GET_NAMING4(pages, (const unsigned char *)(p)) \
        : 0)))
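
The index arithmetic in UTF8_GET_NAMING2 and UTF8_GET_NAMING3 amounts to decoding the code point and then using cp >> 5 as the word index and cp & 0x1F as the bit number. The sketch below spells that out as functions so the equivalence can be checked; decode_utf8, word_index, bit_mask, macro_index2 and macro_index3 are hypothetical names, and the two macro_index functions are transcriptions of the index expressions from the macros above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 2- or 3-byte UTF-8 sequence into its code point.
 * The caller is assumed to have validated the byte sequence already. */
static uint32_t decode_utf8(const unsigned char *p, int n)
{
    assert(n == 2 || n == 3);
    if (n == 2)
        return ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
    return ((uint32_t)(p[0] & 0x0F) << 12)
         | ((uint32_t)(p[1] & 0x3F) << 6)
         | (uint32_t)(p[2] & 0x3F);
}

/* Position of a code point in a one-bit-per-code-point table. */
static size_t word_index(uint32_t cp)
{
    return (size_t)(cp >> 5);
}

static uint32_t bit_mask(uint32_t cp)
{
    return (uint32_t)1 << (cp & 0x1F);
}

/* Index expression of UTF8_GET_NAMING2: page * 8 + (low byte / 32). */
static size_t macro_index2(const unsigned char *b)
{
    return (size_t)((((b[0] & 0x1C) >> 2) << 3)
                    + ((((b[0] & 0x03) << 6) + (b[1] & 0x3F)) >> 5));
}

/* Index expression of UTF8_GET_NAMING3: page * 8 + (low byte / 32). */
static size_t macro_index3(const unsigned char *b)
{
    return (size_t)(((((b[0] & 0x0F) << 4) + ((b[1] & 0x3C) >> 2)) << 3)
                    + ((((b[1] & 0x03) << 6) + (b[2] & 0x3F)) >> 5));
}

/* Look up a 2- or 3-byte character in a bitmap laid out as above. */
static int utf8_bitmap_lookup(const uint32_t *bitmap,
                              const unsigned char *p, int n)
{
    uint32_t cp = decode_utf8(p, n);
    return (bitmap[word_index(cp)] & bit_mask(cp)) != 0;
}
```

For every 2-byte and 3-byte sequence, macro_index2 and macro_index3 should agree with word_index(decode_utf8(...)), which is an easy property to assert in a unit test before touching xmltok.c.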

I then updated the usages of the macros to reference either nameStartBitmap or nameCharBitmap. When I reran the test through our product, the output matched the specification.

About this issue

State: open
Created 7 years ago
Comments: 16 (8 by maintainers)

Most upvoted comments

Since there was activity on this and I’ve been poor at providing a fix for this, I created a pull request https://github.com/libexpat/libexpat/pull/711 that contains the patches that I was applying when building libexpat to sort this out and updated the runtests accordingly. There might be documentation that will need to be updated if this is applied.

mazer1310 on May 9, 2023

@hartwork thank you for the data.

> Could you elaborate how XML 1.0r5+ support is relevant in 2022?

The main (and only) reason is this issue: iBotPeaches/Apktool#1407. I’ve got a Fire TV Stick behind an SSL proxy and I need to tweak a few apps (e.g., Apple TV and Netflix) to make them respect user-installed CAs, so they can work behind the proxy.

According to this article: https://android-developers.googleblog.com/2016/07/changes-to-trusted-certificate.html, I just need to add:

<certificates src="user" />

to their network config.

dimitry-ishenko on Feb 2, 2022

Thanks @timbray!

So it would need to be XML 1.0 fifth edition plus XML 1.0 namespaces third edition at the same time.

Questions remaining to be answered:

How to deal with XML declared as non-1.0? (Three options as @timbray mentioned above)
Who puts in the time to iterate proper code review for this?
Who puts in the time to iterate a correct and well-tested implementation? (May need conflict-resolving rebases, since some form of #674 is scheduled to be merged first.)

hartwork on May 23, 2023

I think this is the right place to reply. Here’s my use case: I’m building a general purpose XML parser and DOM on top of expat for node. I chose expat because I’d used it successfully before, it has streaming properties I like, and because it compiled cleanly to WASM with only the frustration that anyone feels using emscripten. Plain C interfaces using pointers to UTF8 for the win.

I would like to be able to test my work for spec compliance with the W3C’s XML Conformance Test Suite. Ideally, it would eventually support XML 1.1 and Namespaces 1.1. I see XML 1.0r5 support as a stepping stone toward that.

XML still matters to me, even though I wouldn’t recommend it to people who are starting new projects.
Spec compliance matters to me. The time to argue over whether the spec is right was years ago, and I wasn’t in the room at the time.
More full-featured Unicode support is important to me, even if I would urge protocol designers to use ASCII-7 element and attribute names whenever possible. I don’t want to dictate to those designers that they have to agree with me.
There exist interesting documents in the wild that can’t be parsed without this work. @dimitry-ishenko gave an example, and it sounds like @mazer1310 has run into those documents as well.
I do care about performance, stability, and backward-compatibility. I agree that might raise the bar for how much testing and proof is required for a non-trivial change.
If you want to avoid ANY non-trivial changes, that’s a stance I can understand and respect. I wish there had been something in the README that let me know.

hildjj on May 9, 2023