crawler4j: Problems with "?" in robots.txt

In https://www.welt.de/robots.txt there are entries containing "?", such as Disallow: /*?config. Hence https://www.welt.de/test?config should be allowed, but it is not. Entries like Disallow: /*.xmli, on the other hand, work properly and disallow https://www.welt.de/test.xmli. After investigating, I found that "?" is the problematic character.

I use RobotstxtServer#allow("https://www.welt.de/test?config") for testing.
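The report does not include the full setup, but a minimal reproduction along the following lines should exercise the same code path. This is only a sketch: the class and method names follow the crawler4j 4.x API as I understand it (e.g. RobotstxtServer#allows(WebURL)), the user-agent name "crawler4j" is an assumption, and the call fetches the live https://www.welt.de/robots.txt.

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class WeltRobotsTxtRepro {

    public static void main(String[] args) throws Exception {
        CrawlConfig crawlConfig = new CrawlConfig();
        PageFetcher pageFetcher = new PageFetcher(crawlConfig);

        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setUserAgentName("crawler4j");  // assumed user-agent name
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        // The URL from the report; allows() fetches and caches
        // https://www.welt.de/robots.txt before matching the rules.
        WebURL url = new WebURL();
        url.setURL("https://www.welt.de/test?config");

        // According to the report this prints false, i.e. the URL is
        // treated as disallowed even though it is expected to be allowed.
        System.out.println(robotstxtServer.allows(url));

        pageFetcher.shutDown();
    }
}
```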

About this issue

  • State: open
  • Created 6 years ago
  • Comments: 15 (2 by maintainers)

Most upvoted comments

I’ll keep this on the radar, and will add a unit test to crawler-commons’ robots.txt parser, just to make sure that it continues to work. Thanks!
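A unit test along those lines might look roughly like the sketch below. It targets crawler-commons' SimpleRobotRulesParser and BaseRobotRules (JUnit 5 assumed); the Disallow rule and the test URLs are made up for illustration and are not copied from the actual welt.de robots.txt, and the String-based parseContent signature may be deprecated in newer crawler-commons releases.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.nio.charset.StandardCharsets;

import org.junit.jupiter.api.Test;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

class QuestionMarkRulesTest {

    @Test
    void disallowRuleContainingQuestionMarkIsHonored() {
        // Hypothetical rule with a literal '?' in it.
        String robotsTxt = "User-agent: *\n"
                + "Disallow: /test?config\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "https://www.welt.de/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "crawler4j");

        // "/test?config" is a plain prefix rule, so a URL with that exact
        // path and query string should be blocked...
        assertFalse(rules.isAllowed("https://www.welt.de/test?config"));
        // ...while the same path without the query string stays allowed.
        assertTrue(rules.isAllowed("https://www.welt.de/test"));
    }
}
```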