libexpat: Support for XML 1.0r5/1.1 (was: XML Parser does not respect valid XML Start and Name chars from XML specification 1.0r5 and after)

I found that the characters the XML parser accepts as valid name start and name characters in XML documents do not seem to be correct, so I put together a test in which each character from 0x0001 through 0xFFFD (converted to UTF-8) is inserted into the following documents:

<?xml version="1.1" encoding="utf-8"?>
<data>
	<CHARACTERa>
		CHARACTERa
	</CHARACTERa>
</data>

and

<?xml version="1.1" encoding="utf-8"?>
<data>
	<aCHARACTER>
		aCHARACTER
	</aCHARACTER>
</data>
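
For anyone who wants to reproduce the test, here is a minimal sketch of how the name-start documents could be generated. The helper names (encode_utf8_bmp, build_name_start_doc) are made up for illustration and are not part of expat; surrogates (0xD800 to 0xDFFF) are skipped because they cannot be encoded as UTF-8.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode one BMP code point (0x0001..0xFFFD, surrogates excluded) as a
 * NUL-terminated UTF-8 string. Returns the number of bytes written
 * (1..3), or 0 if the code point is outside the tested range. */
static int encode_utf8_bmp(unsigned cp, char out[4])
{
    if (cp == 0 || cp > 0xFFFD || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        out[0] = (char)cp;
        out[1] = '\0';
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        out[2] = '\0';
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    out[3] = '\0';
    return 3;
}

/* Build the first test document, with the character in name-start
 * position. Returns the snprintf() result, or -1 if cp is not testable. */
static int build_name_start_doc(unsigned cp, char *buf, size_t size)
{
    char ch[4];
    int len = encode_utf8_bmp(cp, ch);

    if (len == 0)
        return -1;
    assert(strlen(ch) == (size_t)len); /* sanity check on the encoder */
    return snprintf(buf, size,
                    "<?xml version=\"1.1\" encoding=\"utf-8\"?>\n"
                    "<data>\n\t<%sa>\n\t\t%sa\n\t</%sa>\n</data>\n",
                    ch, ch, ch);
}
```

The second document would be built the same way with the character placed after the "a". Each generated buffer can then be handed to the parser and the accept/reject result recorded per code point.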

According to the XML spec here: https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-common-syn the valid NameStartChars should match these ranges: [#0x3A - #0x3A] [#0x41 - #0x5A] [#0x5F - #0x5F] [#0x61 - #0x7A] [#0xC0 - #0xD6] [#0xD8 - #0xF6] [#0xF8 - #0x2FF] [#0x370 - #0x37D] [#0x37F - #0x1FFF] [#0x200C - #0x200D] [#0x2070 - #0x218F] [#0x2C00 - #0x2FEF] [#0x3001 - #0xD7FF] [#0xF900 - #0xFDCF] [#0xFDF0 - #0xFFFD]

while the valid NameChars should match: [#0x2D - #0x2E] [#0x30 - #0x3A] [#0x41 - #0x5A] [#0x5F - #0x5F] [#0x61 - #0x7A] [#0xB7 - #0xB7] [#0xC0 - #0xD6] [#0xD8 - #0xF6] [#0xF8 - #0x37D] [#0x37F - #0x1FFF] [#0x200C - #0x200D] [#0x203F - #0x2040] [#0x2070 - #0x218F] [#0x2C00 - #0x2FEF] [#0x3001 - #0xD7FF] [#0xF900 - #0xFDCF] [#0xFDF0 - #0xFFFD]
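
As a reference oracle to compare parser output against, the ranges can be written as a small lookup. This is only a sketch: the type and function names are hypothetical, it covers the BMP only (as the test does), and it writes NameChar the way the spec's production does, as NameStartChar plus a few extra ranges, which yields the same merged ranges as listed above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { unsigned lo, hi; } CpRange;

/* NameStartChar ranges from the XML 1.1 spec, BMP part only. */
static const CpRange kNameStartRanges[] = {
    {0x3A, 0x3A},     {0x41, 0x5A},     {0x5F, 0x5F},     {0x61, 0x7A},
    {0xC0, 0xD6},     {0xD8, 0xF6},     {0xF8, 0x2FF},    {0x370, 0x37D},
    {0x37F, 0x1FFF},  {0x200C, 0x200D}, {0x2070, 0x218F}, {0x2C00, 0x2FEF},
    {0x3001, 0xD7FF}, {0xF900, 0xFDCF}, {0xFDF0, 0xFFFD}
};

/* Ranges that are NameChar but not NameStartChar. */
static const CpRange kNameExtraRanges[] = {
    {0x2D, 0x2E}, {0x30, 0x39}, {0xB7, 0xB7}, {0x300, 0x36F}, {0x203F, 0x2040}
};

/* Linear scan; fine for a test oracle, too slow for a tokenizer. */
static int in_ranges(unsigned cp, const CpRange *r, size_t n)
{
    size_t i;

    assert(cp <= 0x10FFFF); /* must be a Unicode code point */
    for (i = 0; i < n; i++) {
        if (cp >= r[i].lo && cp <= r[i].hi)
            return 1;
    }
    return 0;
}

static int is_name_start_char(unsigned cp)
{
    return in_ranges(cp, kNameStartRanges,
                     sizeof kNameStartRanges / sizeof kNameStartRanges[0]);
}

static int is_name_char(unsigned cp)
{
    return is_name_start_char(cp)
        || in_ranges(cp, kNameExtraRanges,
                     sizeof kNameExtraRanges / sizeof kNameExtraRanges[0]);
}
```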

However, what I found from expat’s output was lots and lots of missing valid characters: For instance, running the test with libexpat, the NameStart chars looked as follows (Once you get past 0x00FF, things get very messy): [#0x3A - #0x3A] [#0x41 - #0x5A] [#0x5F - #0x5F] [#0x61 - #0x7A] [#0xC0 - #0xD6] [#0xD8 - #0xF6] [#0xF8 - #0x131] [#0x134 - #0x13E] [#0x141 - #0x148] [#0x14A - #0x17E] [#0x180 - #0x1C3] [#0x1CD - #0x1F0] [#0x1F4 - #0x1F5] [#0x1FA - #0x217] [#0x250 - #0x2A8] [#0x2BB - #0x2C1] [#0x386 - #0x386] [#0x388 - #0x38A] [#0x38C - #0x38C] [#0x38E - #0x3A1] [#0x3A3 - #0x3CE] [#0x3D0 - #0x3D6] [#0x3DA - #0x3DA] [#0x3DC - #0x3DC] [#0x3DE - #0x3DE] [#0x3E0 - #0x3E0] [#0x3E2 - #0x3F3] [#0x401 - #0x40C] [#0x40E - #0x44F] [#0x451 - #0x45C] [#0x45E - #0x481] [#0x490 - #0x4C4] [#0x4C7 - #0x4C8] [#0x4CB - #0x4CC] [#0x4D0 - #0x4EB] [#0x4EE - #0x4F5] [#0x4F8 - #0x4F9] [#0x531 - #0x556] [#0x559 - #0x559] [#0x561 - #0x586] [#0x5D0 - #0x5EA] [#0x5F0 - #0x5F2] [#0x621 - #0x63A] [#0x641 - #0x64A] [#0x671 - #0x6B7] [#0x6BA - #0x6BE] [#0x6C0 - #0x6CE] [#0x6D0 - #0x6D3] [#0x6D5 - #0x6D5] [#0x6E5 - #0x6E6] [#0x905 - #0x939] [#0x93D - #0x93D] [#0x958 - #0x961] [#0x985 - #0x98C] [#0x98F - #0x990] [#0x993 - #0x9A8] [#0x9AA - #0x9B0] [#0x9B2 - #0x9B2] [#0x9B6 - #0x9B9] [#0x9DC - #0x9DD] [#0x9DF - #0x9E1] [#0x9F0 - #0x9F1] [#0xA05 - #0xA0A] [#0xA0F - #0xA10] [#0xA13 - #0xA28] [#0xA2A - #0xA30] [#0xA32 - #0xA33] [#0xA35 - #0xA36] [#0xA38 - #0xA39] [#0xA59 - #0xA5C] [#0xA5E - #0xA5E] [#0xA72 - #0xA74] [#0xA85 - #0xA8B] [#0xA8D - #0xA8D] [#0xA8F - #0xA91] [#0xA93 - #0xAA8] [#0xAAA - #0xAB0] [#0xAB2 - #0xAB3] [#0xAB5 - #0xAB9] [#0xABD - #0xABD] [#0xAE0 - #0xAE0] [#0xB05 - #0xB0C] [#0xB0F - #0xB10] [#0xB13 - #0xB28] [#0xB2A - #0xB30] [#0xB32 - #0xB33] [#0xB36 - #0xB39] [#0xB3D - #0xB3D] [#0xB5C - #0xB5D] [#0xB5F - #0xB61] [#0xB85 - #0xB8A] [#0xB8E - #0xB90] [#0xB92 - #0xB95] [#0xB99 - #0xB9A] [#0xB9C - #0xB9C] [#0xB9E - #0xB9F] [#0xBA3 - #0xBA4] [#0xBA8 - #0xBAA] [#0xBAE - #0xBB5] 
[#0xBB7 - #0xBB9] [#0xC05 - #0xC0C] [#0xC0E - #0xC10] [#0xC12 - #0xC28] [#0xC2A - #0xC33] [#0xC35 - #0xC39] [#0xC60 - #0xC61] [#0xC85 - #0xC8C] [#0xC8E - #0xC90] [#0xC92 - #0xCA8] [#0xCAA - #0xCB3] [#0xCB5 - #0xCB9] [#0xCDE - #0xCDE] [#0xCE0 - #0xCE1] [#0xD05 - #0xD0C] [#0xD0E - #0xD10] [#0xD12 - #0xD28] [#0xD2A - #0xD39] [#0xD60 - #0xD61] [#0xE01 - #0xE2E] [#0xE30 - #0xE30] [#0xE32 - #0xE33] [#0xE40 - #0xE45] [#0xE81 - #0xE82] [#0xE84 - #0xE84] [#0xE87 - #0xE88] [#0xE8A - #0xE8A] [#0xE8D - #0xE8D] [#0xE94 - #0xE97] [#0xE99 - #0xE9F] [#0xEA1 - #0xEA3] [#0xEA5 - #0xEA5] [#0xEA7 - #0xEA7] [#0xEAA - #0xEAB] [#0xEAD - #0xEAE] [#0xEB0 - #0xEB0] [#0xEB2 - #0xEB3] [#0xEBD - #0xEBD] [#0xEC0 - #0xEC4] [#0xF40 - #0xF47] [#0xF49 - #0xF69] [#0x10A0 - #0x10C5] [#0x10D0 - #0x10F6] [#0x1100 - #0x1100] [#0x1102 - #0x1103] [#0x1105 - #0x1107] [#0x1109 - #0x1109] [#0x110B - #0x110C] [#0x110E - #0x1112] [#0x113C - #0x113C] [#0x113E - #0x113E] [#0x1140 - #0x1140] [#0x114C - #0x114C] [#0x114E - #0x114E] [#0x1150 - #0x1150] [#0x1154 - #0x1155] [#0x1159 - #0x1159] [#0x115F - #0x1161] [#0x1163 - #0x1163] [#0x1165 - #0x1165] [#0x1167 - #0x1167] [#0x1169 - #0x1169] [#0x116D - #0x116E] [#0x1172 - #0x1173] [#0x1175 - #0x1175] [#0x119E - #0x119E] [#0x11A8 - #0x11A8] [#0x11AB - #0x11AB] [#0x11AE - #0x11AF] [#0x11B7 - #0x11B8] [#0x11BA - #0x11BA] [#0x11BC - #0x11C2] [#0x11EB - #0x11EB] [#0x11F0 - #0x11F0] [#0x11F9 - #0x11F9] [#0x1E00 - #0x1E9B] [#0x1EA0 - #0x1EF9] [#0x1F00 - #0x1F15] [#0x1F18 - #0x1F1D] [#0x1F20 - #0x1F45] [#0x1F48 - #0x1F4D] [#0x1F50 - #0x1F57] [#0x1F59 - #0x1F59] [#0x1F5B - #0x1F5B] [#0x1F5D - #0x1F5D] [#0x1F5F - #0x1F7D] [#0x1F80 - #0x1FB4] [#0x1FB6 - #0x1FBC] [#0x1FBE - #0x1FBE] [#0x1FC2 - #0x1FC4] [#0x1FC6 - #0x1FCC] [#0x1FD0 - #0x1FD3] [#0x1FD6 - #0x1FDB] [#0x1FE0 - #0x1FEC] [#0x1FF2 - #0x1FF4] [#0x1FF6 - #0x1FFC] [#0x2126 - #0x2126] [#0x212A - #0x212B] [#0x212E - #0x212E] [#0x2180 - #0x2182] [#0x3007 - #0x3007] [#0x3021 - #0x3029] [#0x3041 - #0x3094] [#0x30A1 - #0x30FA] 
[#0x3105 - #0x312C] [#0x4E00 - #0x9FA5] [#0xAC00 - #0xD7A3]

The NameChars output is similarly messy.

When I took a look in Expat’s code, the problems stemmed from lib/nametab.h, which has a lookup table for valid characters that is used within lib/xmltok.c via the various _GET_NAMING() macros. I tried to modify the existing mapping table but could not figure out how, so I took a somewhat simpler approach with full-sized mapping tables for name start and name characters.

I changed the nametab.h tables to a straightforward one-way bitmap of this form (attachment: nametab_h.txt):

static const unsigned int nameStartBitmap[] = {
	0x00000000, 0x04000000, 0x87FFFFFE, 0x07FFFFFE, //0x00
	0x00000000, 0x00000000, 0xFF7FFFFF, 0xFF7FFFFF,
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, //0x01
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF,
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, //0x02
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF,
	0x00000000, 0x00000000, 0x00000000, 0xBFFF0000, //0x03
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF,
//etc...
};

static const unsigned int nameCharBitmap[] = {
	0x00000000, 0x07FF6000, 0x87FFFFFE, 0x07FFFFFE, //0x00
	0x00000000, 0x00800000, 0xFF7FFFFF, 0xFF7FFFFF,
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, //0x01
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF,
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, //0x02
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF,
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xBFFFFFFF, //0x03
	0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF,
//etc... up to 0xFF
};

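Each 32-bit word in these tables covers 32 consecutive code points within the Unicode double-byte (16-bit) range, and each bit corresponds to a single Unicode code point referred to within the W3C spec. The //0x00, //0x01, ... comments mark the high byte of the code point, so each marked group of eight words covers 256 code points.

Within the XML 1.1 spec, almost every character that needs a 4-byte UTF-8 sequence is allowed in names (the spec's range is [#x10000 - #xEFFFF]), so in the macros I immediately return 1 for 4-byte chars and changed the isNever for 4 bytes to isAlways.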
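
To make the layout concrete, here is a small sketch of how such a bitmap can be filled and queried by code point. The helper names are hypothetical, uint32_t stands in for the unsigned int used in the tables above, and only the ASCII name-start ranges are filled in, which is enough to reproduce the first four words of nameStartBitmap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BMP_WORDS (0x10000 / 32) /* one bit per code point 0x0000..0xFFFF */

/* Set the bits for every code point in [lo, hi]. */
static void bitmap_set_range(uint32_t *bitmap, unsigned lo, unsigned hi)
{
    unsigned cp;

    assert(lo <= hi && hi <= 0xFFFF);
    for (cp = lo; cp <= hi; cp++)
        bitmap[cp >> 5] |= (uint32_t)1 << (cp & 0x1F);
}

/* Word index is cp / 32, bit position is cp % 32. */
static int bitmap_test(const uint32_t *bitmap, unsigned cp)
{
    return (int)((bitmap[cp >> 5] >> (cp & 0x1F)) & 1u);
}

/* Fill in only the ASCII part of NameStartChar: ':', A-Z, '_', a-z. */
static void fill_ascii_name_start(uint32_t *bitmap)
{
    memset(bitmap, 0, BMP_WORDS * sizeof bitmap[0]);
    bitmap_set_range(bitmap, 0x3A, 0x3A);
    bitmap_set_range(bitmap, 0x41, 0x5A);
    bitmap_set_range(bitmap, 0x5F, 0x5F);
    bitmap_set_range(bitmap, 0x61, 0x7A);
}
```

With only those four ranges set, words 0 to 3 come out as 0x00000000, 0x04000000, 0x87FFFFFE, 0x07FFFFFE, matching the first row of the table. Feeding in the remaining ranges from the spec would generate the rest, which avoids hand-editing hex words.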

I likewise changed the xmltok.c UTF8_GET_NAMING*() macros as follows (attachment: xmltok_c.txt):

//Basically, the UTF8 to UCS conversion is applied and mapped into the bit maps above. 
#define UTF8_GET_NAMING2(pages, byte) \
		(pages[(((((byte)[0]) & 0x1C) >> 2) << 3) \
			+ ((((((byte)[0]) & 0x03) << 6) \
			+ (((byte)[1]) & 0x3F)) >> 5)] \
		 & (1u << ((((((byte)[0]) & 0x03) << 6) \
			+ (((byte)[1]) & 0x3F)) & 0x1F)))

#define UTF8_GET_NAMING3(pages, byte) \
		(pages[((((((byte)[0]) & 0x0F) << 4) \
				+ ((((byte)[1]) & 0x3C) >> 2)) << 3) \
			+ ((((((byte)[1]) & 0x03) << 6) \
				+ (((byte)[2]) & 0x3F)) >> 5)] \
		& (1u << ((((((byte)[1]) & 0x03) << 6) \
			+ (((byte)[2]) & 0x3F)) & 0x1F)))

/* 4-byte sequences are always accepted. This must be an expression, not a
   statement, because it is used inside the ?: chain below. */
#define UTF8_GET_NAMING4(pages, byte) (1)

#define UTF8_GET_NAMING(pages, p, n) \
  ((n) == 2 \
  ? UTF8_GET_NAMING2(pages, (const unsigned char *)(p)) \
  : ((n) == 3 \
     ? UTF8_GET_NAMING3(pages, (const unsigned char *)(p)) \
     : ((n) == 4 \
        ? UTF8_GET_NAMING4(pages, (const unsigned char *)(p)) \
        : 0)))

I then updated the usages of the macros to reference either nameStartBitmap or nameCharBitmap. When I reran the test through our product, the output matched the specification.
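
The index arithmetic in the macros is easier to check when written as plain functions. The sketch below (hypothetical helper names, no validation of continuation bytes) decodes a 2- or 3-byte sequence to a code point and compares the word index computed by the UTF8_GET_NAMING3 expression with the direct cp >> 5 form.

```c
#include <assert.h>
#include <stddef.h>

/* Decode a 2- or 3-byte UTF-8 sequence into its code point.
 * Assumes the bytes are already known to be well-formed. */
static unsigned utf8_decode_bmp(const unsigned char *p, int n)
{
    assert(n == 2 || n == 3);
    if (n == 2)
        return ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
    return ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
}

/* Position of a code point in a one-bit-per-code-point table. */
static size_t bitmap_word_index(unsigned cp) { return cp >> 5; }
static unsigned bitmap_bit_mask(unsigned cp) { return 1u << (cp & 0x1F); }

/* The word-index expression from UTF8_GET_NAMING3, copied as a function:
 * (high byte of the code point) * 8 + (low byte / 32). */
static size_t naming3_word_index(const unsigned char *b)
{
    return (size_t)(((((b[0] & 0x0F) << 4) + ((b[1] & 0x3C) >> 2)) << 3)
                    + ((((b[1] & 0x03) << 6) + (b[2] & 0x3F)) >> 5));
}
```

For example, the bytes E2 82 AC (U+20AC) give word index 261 and bit 12 in both forms, so the macro and the straightforward decode address the same bit.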

About this issue

  • State: open
  • Created 7 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

Since there was activity on this and I’ve been poor at providing a fix for this, I created a pull request https://github.com/libexpat/libexpat/pull/711 that contains the patches that I was applying when building libexpat to sort this out and updated the runtests accordingly. There might be documentation that will need to be updated if this is applied.

@hartwork thank you for the data.

Could you elaborate how XML 1.0r5+ support is relevant in 2022?

The main (and only) reason is this issue: iBotPeaches/Apktool#1407. I’ve got a Fire TV Stick behind an SSL proxy and I need to tweak a few apps (e.g., Apple TV and Netflix) to make them respect user-installed CAs, so they can work behind the proxy.

According to this article: https://android-developers.googleblog.com/2016/07/changes-to-trusted-certificate.html, I just need to add:

<certificates src="user" />

to their network config.

Thanks @timbray!

So it would need to be XML 1.0 fifth edition plus XML 1.0 namespaces third edition at the same time.

Questions remaining to be answered:

  • How to deal with XML declared as non-1.0? (Three options as @timbray mentioned above)
  • Who puts in the time to iterate proper code review for this?
  • Who puts in the time to iterate a correct and well-tested implementation? (May need conflict-resolving rebases, since some form of #674 is scheduled to be merged first.)

I think this is the right place to reply. Here’s my use case: I’m building a general purpose XML parser and DOM on top of expat for node. I chose expat because I’d used it successfully before, it has streaming properties I like, and because it compiled cleanly to WASM with only the frustration that anyone feels using emscripten. Plain C interfaces using pointers to UTF8 for the win.

I would like to be able to test my work for spec compliance with the W3C’s XML Conformance Test Suite. Ideally, it would eventually support XML 1.1 and Namespaces 1.1. I see XML 1.0r5 support as a stepping stone toward that.

  • XML still matters to me, even though I wouldn’t recommend it to people who are starting new projects.
  • Spec compliance matters to me. The time to argue over whether the spec is right was years ago, and I wasn’t in the room at the time.
  • More full-featured Unicode support is important to me, even if I would urge protocol designers to use ASCII-7 element and attribute names whenever possible. I don’t want to dictate to those designers that they have to agree with me.
  • There exist interesting documents in the wild that can’t be parsed without this work. @dimitry-ishenko gave an example, and it sounds like @mazer1310 has run into those documents as well.
  • I do care about performance, stability, and backward-compatibility. I agree that might raise the bar for how much testing and proof is required for a non-trivial change.
  • If you want to avoid ANY non-trivial changes, that’s a stance I can understand and respect. I wish there had been something in the README that let me know.