google-cloud-python: BigQuery upload_from_file: a file-like object with Unicode content must be opened in binary mode if it is larger than RESUMABLE_UPLOAD_THRESHOLD, otherwise in text (str) mode.

Steps to reproduce

import sys
from gcloud.bigquery import Client, SchemaField
from gcloud.bigquery.job import CreateDisposition, WriteDisposition

csv_filename = 'sandwiches.csv'

if len(sys.argv) > 1:
    csv_filename = sys.argv[1]

bq = Client()

ds = bq.dataset('test_unicode')
ds.location = 'EU'


def test_unicode_upload(filename, mode):

    if not ds.exists():
        print('Creating dataset: {}'.format(ds.name))
        ds.create()

    fields = [
        SchemaField('name', 'STRING'),
        SchemaField('main_ingredient', 'STRING'),
    ]

    table = ds.table('sandwiches', fields)

    print('Uploading CSV: {}, mode={!r}'.format(filename, mode))
    table.upload_from_file(
        open(filename, mode),
        encoding='UTF-8',
        source_format='CSV',
        write_disposition=WriteDisposition.WRITE_TRUNCATE,
        create_disposition=CreateDisposition.CREATE_IF_NEEDED)

test_unicode_upload(csv_filename, 'r') # Works

test_unicode_upload(csv_filename, 'rb')
# Fails in http.client.HTTPConnection()._send_request
#
# /usr/lib/python3.4/http/client.py in _send_request(self, method, url,
# body, headers)
#    1178         if isinstance(body, str):
#    1179             # RFC 2616 Section 3.7.1 says that text default has a
#    1180             # default charset of iso-8859-1.
# -> 1181             body = body.encode('iso-8859-1')
#    1182         self.endheaders(body)
#
# UnicodeEncodeError: 'latin-1' codec can't encode characters in position
# 649-650: ordinal not in range(256)

sandwiches.csv

name,main_ingredient
Räksmörgås,Räkor
Baguette,Bröd

Expected behavior

When I pass in a binary-mode file-like object, I expect upload_from_file to pass the data through to BigQuery as-is, and the BigQuery load job to decode it using the encoding= argument.
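
To illustrate the pass-through expected here, the following is a minimal sketch of the str/bytes branch visible in the traceback above. The function name encode_body is hypothetical; it only mimics the two http.client lines quoted in the traceback and is not the library's own code. The euro sign is an arbitrary character outside ISO-8859-1, chosen purely for the demonstration.

```python
def encode_body(body):
    """Mimic the str branch of http.client's _send_request (hypothetical helper)."""
    if isinstance(body, str):
        # Text bodies are encoded with the HTTP default charset.
        body = body.encode('iso-8859-1')
    return body


# Bytes read from a binary-mode file are left untouched.
utf8_payload = 'Räksmörgås,Räkor\n'.encode('utf-8')
assert encode_body(utf8_payload) is utf8_payload

# A str body is re-encoded, which fails as soon as it contains a
# character that ISO-8859-1 cannot represent.
try:
    encode_body('\u20ac')
except UnicodeEncodeError as exc:
    print('str body rejected:', exc.reason)
```

The point of the sketch is that a body which is still bytes when it reaches the HTTP layer never goes through the ISO-8859-1 step, so the UTF-8 data would arrive at BigQuery unchanged.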

Environment

$ python --version
Python 3.4.3+
$ pip freeze | egrep 'httplib2|gcloud'
gcloud==0.13.0
httplib2==0.9.2

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 18 (16 by maintainers)

Most upvoted comments

@joar Thanks for your efforts. #1779 has my re-working of your patch:

  • I reused the existing gcloud._helpers._to_bytes.
  • I ensured that the tests all pass for both Python2 and Python3.
  • I fixed comments that your changes invalidated.

@thobrla gsutil doesn’t seem to run on Python 3, so it might be unaffected by this bug.

@tseaver I think that the use of six.StringIO in Upload._configure_multipart_request might be close to the root cause of this issue.

  • PY2: six.StringIO == six.BytesIO == StringIO.StringIO
  • PY3: six.StringIO == io.StringIO != io.BytesIO
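
The difference listed above can be sketched on Python 3 with the standard io module alone. The to_bytes helper below is a hypothetical illustration written for this example; it is not gcloud._helpers._to_bytes and makes no claim about that function's signature. UTF-8 is assumed as the encoding because that is what the report passes as encoding=.

```python
import io


def to_bytes(value, encoding='utf-8'):
    """Return value as bytes, encoding str input (hypothetical helper)."""
    if isinstance(value, str):
        return value.encode(encoding)
    if isinstance(value, bytes):
        return value
    raise TypeError('expected str or bytes, got {!r}'.format(type(value)))


def text_buffer_accepts_bytes():
    """Check whether io.StringIO (six.StringIO on PY3) takes bytes."""
    try:
        io.StringIO().write(b'data')
    except TypeError:
        return False
    return True


# A bytes buffer accepts both inputs once they are normalised to bytes.
buf = io.BytesIO()
buf.write(to_bytes('Räksmörgås,Räkor\n'))       # text-mode style input
buf.write(to_bytes(b'Baguette,Br\xc3\xb6d\n'))  # binary-mode style input

print(text_buffer_accepts_bytes())
print(buf.getvalue().decode('utf-8'))
```

Normalising to bytes before writing into a bytes buffer is one way to make the same code path handle both text-mode and binary-mode files; whether the eventual fix takes exactly this shape is not shown in this thread.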