crawler4j: Problems with "?" in robots.txt
In https://www.welt.de/robots.txt there are entries containing "?", such as Disallow: /*?config. Hence https://www.welt.de/test?config should be disallowed, but it is not. Entries like Disallow: /*.xmli, on the other hand, work properly and disallow https://www.welt.de/test.xmli. After some investigation I figured out that "?" is the problematic character.
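To make the expected semantics concrete, here is a small, self-contained sketch (plain java.util.regex, deliberately not crawler4j's actual matching code) that treats "*" as a wildcard and "?" as a literal character, which is how the rules above should behave:

```java
import java.util.regex.Pattern;

public class RobotsPatternCheck {

    // Illustrative only: translate a robots.txt path pattern into a regex,
    // treating '*' as a wildcard and every other character (including '?')
    // as a literal. This is NOT crawler4j's implementation.
    static Pattern toRegex(String robotsPattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : robotsPattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    static boolean disallowed(String rule, String pathAndQuery) {
        // robots.txt rules are matched from the start of the path
        return toRegex(rule).matcher(pathAndQuery).lookingAt();
    }

    public static void main(String[] args) {
        // Expected semantics of the two rules from welt.de's robots.txt:
        System.out.println(disallowed("/*?config", "/test?config")); // true  -> should be disallowed
        System.out.println(disallowed("/*.xmli", "/test.xmli"));     // true  -> disallowed (works in crawler4j)
        System.out.println(disallowed("/*?config", "/test"));        // false -> allowed
    }
}
```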
I use RobotstxtServer#allow("https://www.welt.de/test?config") for testing.
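In the crawler4j versions I'm aware of, the check is exposed as RobotstxtServer#allows(WebURL) rather than allow(String); if your version has an allow(String) overload, the idea is the same. A rough reproduction, assuming the standard setup from the crawler4j README (constructor and method signatures may differ slightly between versions):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class RobotsReproduction {
    public static void main(String[] args) throws Exception {
        CrawlConfig crawlConfig = new CrawlConfig();
        PageFetcher pageFetcher = new PageFetcher(crawlConfig);

        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setEnabled(true);
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        WebURL url = new WebURL();
        url.setURL("https://www.welt.de/test?config");

        // Disallow: /*?config in welt.de's robots.txt should make this print false
        System.out.println("allowed: " + robotstxtServer.allows(url));

        pageFetcher.shutDown();
    }
}
```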
About this issue
- State: open
- Created 6 years ago
- Comments: 15 (2 by maintainers)
I’ll keep this on the radar, and will add a unit test to crawler-commons’ robots.txt parser, just to make sure that it continues to work. Thanks!
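Since the comment refers to crawler-commons' robots.txt parser, an independent check against it could look roughly like the following. The four-argument parseContent(String, byte[], String, String) form shown here comes from the older crawler-commons API; newer releases accept a collection of robot names instead, so adjust to your version:

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

import java.nio.charset.StandardCharsets;

public class QuestionMarkRuleCheck {
    public static void main(String[] args) {
        // Minimal robots.txt with the two rule shapes from the report
        String robotsTxt = "User-agent: *\n"
                + "Disallow: /*?config\n"
                + "Disallow: /*.xmli\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "https://www.welt.de/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "crawler4j");

        System.out.println(rules.isAllowed("https://www.welt.de/test?config")); // expected: false
        System.out.println(rules.isAllowed("https://www.welt.de/test.xmli"));   // expected: false
        System.out.println(rules.isAllowed("https://www.welt.de/test"));        // expected: true
    }
}
```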