crawler4j: Problems with "?" in robots.txt

In https://www.welt.de/robots.txt there are entries containing "?", such as Disallow: /*?config. Hence https://www.welt.de/test?config should be allowed, but it is not. Entries like Disallow: /*.xmli, on the other hand, work properly and disallow https://www.welt.de/test.xmli. After investigating, I found that "?" is the problematic character.

I use RobotstxtServer#allow("https://www.welt.de/test?config") for testing.
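The report does not include the full setup, but a minimal reproduction along the following lines should exercise the same code path. This is only a sketch: the class and method names follow the crawler4j 4.x API as I understand it (e.g. RobotstxtServer#allows(WebURL)), the user-agent name "crawler4j" is an assumption, and the call fetches the live https://www.welt.de/robots.txt.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class WeltRobotsTxtRepro {

    public static void main(String[] args) throws Exception {
        CrawlConfig crawlConfig = new CrawlConfig();
        PageFetcher pageFetcher = new PageFetcher(crawlConfig);

        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setUserAgentName("crawler4j");  // assumed user-agent name
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        // The URL from the report; allows() fetches and caches
        // https://www.welt.de/robots.txt before matching the rules.
        WebURL url = new WebURL();
        url.setURL("https://www.welt.de/test?config");

        // According to the report this prints false, i.e. the URL is
        // treated as disallowed even though it is expected to be allowed.
        System.out.println(robotstxtServer.allows(url));

        pageFetcher.shutDown();
    }
}
```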

About this issue

  • State: open
  • Created 6 years ago
  • Comments: 15 (2 by maintainers)

Most upvoted comments

I’ll keep this on the radar, and will add a unit test to crawler-commons’ robots.txt parser, just to make sure that it continues to work. Thanks!
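A unit test along those lines might look roughly like the sketch below. It targets crawler-commons' SimpleRobotRulesParser and BaseRobotRules (JUnit 5 assumed); the Disallow rule and the test URLs are made up for illustration and are not copied from the actual welt.de robots.txt, and the String-based parseContent signature may be deprecated in newer crawler-commons releases.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.nio.charset.StandardCharsets;

import org.junit.jupiter.api.Test;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

class QuestionMarkRulesTest {

    @Test
    void disallowRuleContainingQuestionMarkIsHonored() {
        // Hypothetical rule with a literal '?' in it.
        String robotsTxt = "User-agent: *\n"
                + "Disallow: /test?config\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "https://www.welt.de/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "crawler4j");

        // "/test?config" is a plain prefix rule, so a URL with that exact
        // path and query string should be blocked...
        assertFalse(rules.isAllowed("https://www.welt.de/test?config"));
        // ...while the same path without the query string stays allowed.
        assertTrue(rules.isAllowed("https://www.welt.de/test"));
    }
}
```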