pip: pip doesn't properly parse git URL if branch name contains @ or #

Description

Trying pip install git+https://example.com/repository@branch fails if branch contains characters @ or #, even percent-encoded.

Expected behavior

pip must parse percent-encoded special characters in the branch name, split the branch name from the URL, clone the repository and check out the named branch with the special characters decoded. I.e.

pip install git+https://example.com/repository@master%40test

must clone https://example.com/repository and check out the master@test branch. The same applies to the # character, percent-encoded as %23.
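As an illustration of the encoding expected above, here is a minimal Python sketch that builds such a URL. The helper name encode_branch is hypothetical; it uses the same urllib.parse.quote call as the test script below, with the default safe="/" spelled out.

```python
from urllib.parse import quote, unquote


def encode_branch(branch: str) -> str:
    """Percent-encode a branch name for use after the '@' in a VCS URL.

    Hypothetical helper for illustration. safe="/" keeps slashes literal,
    while '@' and '#' are encoded as %40 and %23.
    """
    return quote(branch, safe="/")


url = "git+https://example.com/repository@" + encode_branch("master@test")
print(url)  # git+https://example.com/repository@master%40test

# Decoding restores the original branch name.
print(unquote("master%23test"))  # master#test
```

The expectation in this report is that pip performs the unquote step itself after splitting the branch name from the URL.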

pip version

Any; tested with 21.1.3

Python version

Any; tested with Python 3.9

OS

Any; tested with Debian 10 buster

How to Reproduce

Here is a test program, test-pip-git, that creates a repository, tries pip download and cleans up:

#! /bin/sh
set -e

PERCENT_ENCODING=0
while getopts p: opt; do
    case $opt in
        p ) PERCENT_ENCODING="${OPTARG:-1}" ;;
    esac
done
shift `expr $OPTIND - 1`

if [ -z "$1" ]; then
    echo "Usage: $0 [-p1|2] test_char" >&2
    exit 1
fi

TEST_CHAR1="$1"
if [ $PERCENT_ENCODING -ge 1 ]; then

    py_ver=`python -c "import sys; print(sys.version_info[0])"`
    if [ $py_ver -eq 2 ]; then
        percent_encode() {
            python -c "import urllib; print(urllib.quote('$1'))"
        }
    elif [ $py_ver -eq 3 ]; then
        percent_encode() {
            python -c "import urllib.parse; print(urllib.parse.quote('$1'))"
        }
    else
        echo "Unknown python version" >&2
        exit 1
    fi
    TEST_CHAR2=`percent_encode "$1"`
    if [ $PERCENT_ENCODING -eq 2 ]; then
        TEST_CHAR2=`percent_encode "$TEST_CHAR2"`
    fi
else
    TEST_CHAR2="$1"
fi

rm -rf test-pip-git-repo test-pip-git-spec-char-0.0.1.zip
git init test-pip-git-repo
cd test-pip-git-repo

echo test >test
git add test
git commit -m test

git branch -M master # to fixed name
git checkout -b test${TEST_CHAR1}test # new branch

cat >setup.py <<EOF
#!/usr/bin/env python

from setuptools import setup

setup(
    name='test_pip_git_spec_char',
    version='0.0.1',
    description='Test pip+git+special characters',
    author='Oleg Broytman',
    author_email='phd@phdru.name',
    keywords=['pip', 'git', '@', '!', '#', '/'],
    platforms='Any',
)
EOF

git add setup.py
git commit -m setup.py
git checkout master # make test branch non-current

cd ..
pip download git+file://`pwd`/test-pip-git-repo@test${TEST_CHAR2}test | grep '\(clone\|checkout\)' || : # ignore errors

rm -rf test-pip-git-repo test-pip-git-spec-char-0.0.1.zip

Output

./test-pip-git @

Running command git clone -q file:///home/phd/tmp/test-pip-git-repo@master /tmp/pip-req-build-v1v16zoe
fatal: '/home/phd/tmp/test-pip-git-repo@master' does not appear to be a git repository

pip clones the wrong repository, test-pip-git-repo@master; the repo must be test-pip-git-repo.

./test-pip-git -p1 @

Running command git clone -q file:///home/phd/tmp/test-pip-git-repo@master /tmp/pip-req-build-__0b6wh5
fatal: '/home/phd/tmp/test-pip-git-repo@master' does not appear to be a git repository

The same incorrect repo.

./test-pip-git \!

Running command git clone -q file:///home/phd/tmp/test-pip-git-repo /tmp/pip-req-build-fsnq5vt_
Running command git checkout -b 'master!test' --track 'origin/master!test'

Just a test with another less special character. pip clones correct repository test-pip-git-repo and checks out correct branch master!test. Doesn’t even require %-encoding. Test passed!

./test-pip-git \#

Running command git clone -q file:///home/phd/tmp/test-pip-git-repo /tmp/pip-req-build-i0wy10_1
ERROR: File "setup.py" not found

pip clones correct repository test-pip-git-repo but doesn’t check out branch master#test. It just uses branch master and ignores everything after #.

./test-pip-git -p1 \#

Exactly the same problem.
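Both unencoded failures can be reproduced with the standard library alone. The sketch below is only a reading of the output above, not pip's actual code: it assumes the revision is taken from whatever follows the last @ in the path, and that a literal # starts the URL fragment. Why the percent-encoded variants behave the same way is not shown here.

```python
from urllib.parse import urlsplit

# Case "#": a literal '#' starts the URL fragment, so the path (and hence
# the branch name) ends at "master".
parts = urlsplit("file:///home/phd/tmp/test-pip-git-repo@master#test")
print(parts.path)      # /home/phd/tmp/test-pip-git-repo@master
print(parts.fragment)  # test

# Case "@": splitting the path at the LAST '@' leaves "@master" attached
# to the repository path, matching the failing "git clone" above.
repo, rev = "/home/phd/tmp/test-pip-git-repo@master@test".rsplit("@", 1)
print(repo)  # /home/phd/tmp/test-pip-git-repo@master
print(rev)   # test
```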

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 15 (8 by maintainers)

Most upvoted comments

The VCS URLs should be first parsed by urlsplit, and then we apply our custom parsing logic to the path part. The #egg= part belongs to the fragment, not the path.

>>> from urllib.parse import urlsplit
>>> urlsplit('https://example.com/%40uranusjr/pkg@dev#egg=myproj')
SplitResult(scheme='https', netloc='example.com', path='/%40uranusjr/pkg@dev', query='', fragment='egg=myproj')

Not with a backslash (\). Special characters in the URL need to be percent-encoded.

Thanks for looking into this! I took a quick look at the implementation and, well, let’s say the original code author probably didn’t think too hard before deciding on the syntax and implementation. Both @ and / are unfortunately perfectly valid characters in a Git branch name and the path part of the URL, so there are quite a few problematic edge cases:

  • https://domain/@username/repo.git (implies default branch)
  • https://domain/username/repo.git@branch
  • https://domain/username/repo.git@branch/with/slash
  • https://domain/username/repo.git@branch-with-#-a-hash

I think at this point, the only reasonable-ish approach is to require the user to escape @ in both the URL and the branch name, and have pip unescape it when parsing the branch out. The / character is left alone, since it is quite common in both the path (obviously) and the branch (many people use patterns like stable/1.2.3). The parsing part is in VersionControl.get_url_rev_and_auth, so we should add a urllib.parse.unquote call somewhere in there. Would you be interested in working on a PR and some test cases to cover this? I would be happy to offer assistance.
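A rough sketch of what that parsing could look like, combining the urlsplit suggestion with the unquote call. The function split_url_and_rev is hypothetical and standalone; it is not pip's get_url_rev_and_auth, and it ignores auth handling and the #egg= fragment entirely.

```python
from typing import Optional, Tuple
from urllib.parse import unquote, urlsplit, urlunsplit


def split_url_and_rev(url: str) -> Tuple[str, Optional[str]]:
    """Split 'URL@rev' into (URL, decoded rev).

    Hypothetical helper. Assumes any '@' that belongs to the path or to
    the branch name itself has been percent-encoded as %40 by the user.
    """
    scheme, netloc, path, query, _fragment = urlsplit(url)
    rev = None
    if "@" in path:
        # The only literal '@' left in the path separates URL and revision.
        path, raw_rev = path.rsplit("@", 1)
        # Decode %40, %23 and friends only after the split.
        rev = unquote(raw_rev)
    # The fragment (e.g. egg=...) is dropped in this sketch.
    return urlunsplit((scheme, netloc, path, query, "")), rev


print(split_url_and_rev("https://example.com/repository@master%40test"))
print(split_url_and_rev("https://domain/username/repo.git@branch/with/slash"))
print(split_url_and_rev("https://domain/%40username/repo.git"))
```

Under these assumptions, an escaped @ in the path (%40username) no longer triggers the split, slashes in the branch name pass through unchanged, and an encoded # reaches git as part of the branch name instead of being cut off as a fragment.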