requests: requests can't properly handle redirects if the response body is encoded in something other than 'utf8'

Just as the title says. The response body is encoded in iso-8859-2 and the Location header happens to contain a non-ASCII character, which results in a UnicodeDecodeError being thrown.

Expected Result

Flawless execution of the code.

Actual Result

UnicodeDecodeError

Reproduction Steps

import requests
requests.get("http://www.biblia.deon.pl/ksiega.php?id=3")

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  }, 
  "cryptography": {
    "version": "2.3"
  }, 
  "idna": {
    "version": "2.7"
  }, 
  "implementation": {
    "name": "CPython", 
    "version": "2.7.15+"
  }, 
  "platform": {
    "release": "4.18.0-13-generic", 
    "system": "Linux"
  }, 
  "pyOpenSSL": {
    "openssl_version": "1010100f", 
    "version": "18.0.0"
  }, 
  "requests": {
    "version": "2.19.0"
  }, 
  "system_ssl": {
    "version": "1010100f"
  }, 
  "urllib3": {
    "version": "1.23"
  }, 
  "using_pyopenssl": true
}


About this issue

  • State: open
  • Created 5 years ago
  • Comments: 17 (5 by maintainers)

Most upvoted comments

I have also recently run into this issue and would like to see #4933 merged.

@tomchristie Thank you for the answer. Technically speaking it might not be a bug, but I still maintain that handling this is expected behaviour for a library which advertises itself as “HTTP for Humans”.

The following Python 3 code works as expected:

import urllib.request
contents = urllib.request.urlopen("http://www.biblia.deon.pl/ksiega.php?id=3").read()
print(contents)

The following Go code works as expected:

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://www.biblia.deon.pl/ksiega.php?id=3")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s", body)
}

Both of them use only the standard library.

The encoding of the response body is irrelevant here. The Location header should be strictly ASCII encoded. (See e.g. https://stackoverflow.com/questions/7654207/what-charset-should-be-used-for-a-location-header-in-a-301-response.)
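
For illustration, a server that needs a non-ASCII character in a redirect target can percent-encode it so the header stays pure ASCII. This is only a minimal Python 3 sketch: the target string and the choice of iso-8859-2 for the percent-encoded bytes are assumptions, not something taken from the site in question.

```python
from urllib.parse import quote

# Hypothetical redirect target containing a non-ASCII character.
target = "Kpł 1"

# Percent-encode using the (assumed) page charset; the result is plain ASCII.
location = "otworz.php?skrot=" + quote(target, encoding="iso-8859-2")

# Raises UnicodeEncodeError if any non-ASCII character were left in the value.
header_bytes = location.encode("ascii")
print(location)
```

A header built this way can be decoded by any client, regardless of which charset the client guesses.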

Requests will (reasonably enough) decode it as UTF-8, since UTF-8 is ASCII-compatible and ends up being more robust in practice.
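
A rough sketch of why that decode step fails here, using plain Python 3 and no Requests code at all. The header bytes below are hypothetical (the exact bytes the server sends are not shown in this thread); they only assume a single iso-8859-2 encoded character in the value.

```python
# Hypothetical raw header value: one non-ASCII character, encoded as iso-8859-2.
raw_location = "otworz.php?skrot=Kpł 1".encode("iso-8859-2")

# A pure-ASCII value decodes identically as ASCII or UTF-8.
assert b"otworz.php?skrot=Kp".decode("utf-8") == "otworz.php?skrot=Kp"

# The iso-8859-2 byte is not a valid UTF-8 sequence, so a strict decode fails.
try:
    raw_location.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)
```

Decoding the same bytes as iso-8859-2 would succeed, but nothing in the response tells the client to do so.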

In short: The http://www.biblia.deon.pl/ksiega.php?id=3 address is serving an invalid HTTP response.

$ curl -v http://www.biblia.deon.pl/ksiega.php?id=3
*   Trying 104.25.144.117...
* TCP_NODELAY set
* Connected to www.biblia.deon.pl (104.25.144.117) port 80 (#0)
> GET /ksiega.php?id=3 HTTP/1.1
> Host: www.biblia.deon.pl
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 301 Moved Permanently
< Date: Tue, 08 Jan 2019 14:25:32 GMT
< Content-Type: text/html
< Transfer-Encoding: chunked
< Connection: keep-alive
< Set-Cookie: __cfduid=d73c8f399ac453a2e4fe967faaa1251c81546957532; expires=Wed, 08-Jan-20 14:25:32 GMT; path=/; domain=.deon.pl; HttpOnly
< Location: otworz.php?skrot=Kp? 1
< J-Cache: HIT
< Server: cloudflare
< CF-RAY: 495f558234763572-LHR

(As an aside, it also doesn’t include ‘iso-8859-2’ in the Content-Type, so there’s really no way to determine what the intended encoding of the byte sequence might be.)

Requests could decode the header with errors="ignore" or something like that, in order to be more robust against malformed headers, but it’d just be masking the issue that the response header is malformed.
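
To make the trade-off concrete, here is a small Python 3 sketch of what lenient decoding would do to such a header. The byte string is a hypothetical stand-in for the malformed value, and this is not how Requests itself is implemented.

```python
# Hypothetical malformed header value (iso-8859-2 bytes in a Location header).
raw_location = "otworz.php?skrot=Kpł 1".encode("iso-8859-2")

# errors="ignore" silently drops the undecodable byte.
ignored = raw_location.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD for the undecodable byte.
replaced = raw_location.decode("utf-8", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

Neither call raises, but both yield a redirect target that differs from what the server intended, so the client may follow the redirect to the wrong URL without any error being reported.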