yt-dlp: [prosiebensat1] Unable to extract clip id

Region

Accessible in Germany/Austria/Switzerland. Possibly worldwide.

Provide a description that is worded well enough to be understood

steps to reproduce this issue

  1. disable Widevine as instructed to confirm video is not DRM-protected
  2. open URL in Firefox
  3. click on video
  4. site redirects to an authentication page
  5. login (signup is free and fast as only email/firstname/bday are required, but can share login credentials if needed)
  6. site redirects back to URL from step 2
  7. video is now playable in Firefox
  8. run yt-dlp --no-config -f- -v --cookies-from-browser firefox "URL"

expected result

yt-dlp download should start

actual result

ERROR: [prosiebensat1] tv/videos/der-sat-1-bio-check-aldi-rewe-denns-co-ganze-folge: Unable to extract clip id; please report this issue on https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template.

possibly related issues

Provide verbose output that clearly demonstrates the problem

Complete Verbose Output

yt-dlp --no-config -f- -v --cookies-from-browser firefox "https://www.sat1.at/tv/videos/der-sat-1-bio-check-aldi-rewe-denns-co-ganze-folge"
[debug] Command-line config: ['--no-config', '-f-', '-v', '--cookies-from-browser', 'firefox', 'https://www.sat1.at/tv/videos/der-sat-1-bio-check-aldi-rewe-denns-co-ganze-folge']
[debug] Encodings: locale cp1252, fs utf-8, pref cp1252, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version nightly@2023.06.05.155301 [59d9fe083] (win_exe)
[debug] Python 3.8.10 (CPython AMD64 64bit) - Windows-10-10.0.19045-SP0 (OpenSSL 1.1.1k  25 Mar 2021)
[debug] exe versions: none
[debug] Optional libraries: Cryptodome-3.18.0, brotli-1.0.9, certifi-2023.05.07, mutagen-1.46.0, sqlite3-2.6.0, websockets-11.0.3
[Cookies] Extracting cookies from firefox
[debug] Extracting cookies from: "C:\Users\User\AppData\Roaming\Mozilla\Firefox\Profiles\4ad8do09.monika\cookies.sqlite"
[Cookies] Extracted 719 cookies from firefox
[debug] Proxy map: {}
[debug] Loaded 1840 extractors
[prosiebensat1] Extracting URL: https://www.sat1.at/tv/videos/der-sat-1-bio-check-aldi-rewe-denns-co-ganze-folge
[prosiebensat1] tv/videos/der-sat-1-bio-check-aldi-rewe-denns-co-ganze-folge: Downloading webpage
ERROR: [prosiebensat1] tv/videos/der-sat-1-bio-check-aldi-rewe-denns-co-ganze-folge: Unable to extract clip id; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U
  File "yt_dlp\extractor\common.py", line 703, in extract
  File "yt_dlp\extractor\prosiebensat1.py", line 491, in _real_extract
  File "yt_dlp\extractor\prosiebensat1.py", line 431, in _extract_clip
  File "yt_dlp\extractor\common.py", line 1287, in _html_search_regex
  File "yt_dlp\extractor\common.py", line 1251, in _search_regex

About this issue

  • State: open
  • Created a year ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

I haven’t looked at the atv.at pages or (recently) any of the sites supported in the existing ProSiebenSat1 extractor. However, you can get metadata from the page under test in two ways (at least). The method InfoExtractor._search_ld_json() returns an info_dict like this:

{
  'description': 'Wir haben eine Million Euro versteckt. In einem Koffer. Irgendwo in Deutschland. Und jede:r von euch hat die Chance, die Million zu finden.',
  'duration': 283,
  'thumbnail': 'https://mim.p7s1.io/pis/mw/3c4cjq5FgV8h73OMk30hSmQ3ksnTujGLThfZrJTemL4yREOdH4xmFJqCsiW3b1ZT2Z_aOLVU57XF-vIBm_Gxs8m-84sW3ArA5w2ymcfYya-jeg/profile:ezone-teaser940x528?w=1200',
  'timestamp': 1699557600,
  'title': 'Lesung & Konzert'
}
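As a rough illustration of where such an info_dict can come from, the sketch below pulls a schema.org VideoObject out of a JSON-LD script tag using only the standard library. It is not yt-dlp's implementation: the page fragment is hypothetical (its values are chosen to mirror the dict above), the helper names are made up for this example, and it only handles a single top-level object.

```python
import json
import re
from datetime import datetime

# Hypothetical page fragment; the values mirror the info_dict quoted above.
SAMPLE_HTML = '''
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "VideoObject",
 "name": "Lesung & Konzert", "duration": "PT4M43S",
 "uploadDate": "2023-11-09T19:20:00+00:00"}
</script>
'''


def parse_iso8601_duration(value):
    """Convert a simple PT#H#M#S duration to seconds (no days or fractions)."""
    match = re.fullmatch(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?', value or '')
    if not match:
        return None
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def search_ld_json(webpage):
    """Return an info_dict-like mapping for the first VideoObject, else None."""
    blocks = re.findall(
        r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.+?)</script>',
        webpage, flags=re.DOTALL)
    for block in blocks:
        data = json.loads(block)
        # Assumption: each block is a single JSON object, not a list or @graph.
        if not isinstance(data, dict) or data.get('@type') != 'VideoObject':
            continue
        upload_date = data.get('uploadDate')
        return {
            'title': data.get('name'),
            'duration': parse_iso8601_duration(data.get('duration')),
            'timestamp': (int(datetime.fromisoformat(upload_date).timestamp())
                          if upload_date else None),
        }
    return None


ld_info = search_ld_json(SAMPLE_HTML)
print(ld_info)
```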

The method InfoExtractor._search_nextjs_data() returns the page hydration JSON, where the .props.pageProps.info member is full of metadata.
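For readers unfamiliar with Next.js hydration data, here is a minimal standard-library sketch of the idea: find the script tag with id __NEXT_DATA__, parse its JSON, and walk to .props.pageProps.info. The sample HTML and the field names inside info are assumptions for illustration only; the real page carries far more data, and inside an extractor one would call the yt-dlp method instead.

```python
import json
import re

# Hypothetical, trimmed-down page; the fields inside "info" are assumed.
SAMPLE_HTML = '''
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"info": {"title": "Lesung & Konzert", "duration": 283}}}}
</script>
</body></html>
'''


def search_nextjs_data(webpage):
    """Return the parsed __NEXT_DATA__ JSON, or None if the tag is absent."""
    match = re.search(
        r'<script[^>]+id=["\']__NEXT_DATA__["\'][^>]*>(?P<json>.+?)</script>',
        webpage, flags=re.DOTALL)
    return json.loads(match.group('json')) if match else None


def page_info(webpage):
    """Walk .props.pageProps.info, tolerating missing levels."""
    data = search_nextjs_data(webpage) or {}
    return data.get('props', {}).get('pageProps', {}).get('info', {})


info = page_info(SAMPLE_HTML)
print(info.get('title'), info.get('duration'))
```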

The atv.at extractor logic looks for hydration JSON in the webpage containing video IDs and then makes the vas-v4.p7s1video.net request using a JWT with those IDs (content_ids). However, for prosieben.de the videos object was empty ({}) when the web client made the equivalent request, and the video ID that served as its key can be extracted from the end of the URL:

    _VALID_URL = r'''(?x)
                    https?://
                        (?:www\.)?
                        prosieben(?:maxx)?\.(?:de|at|ch)/
                        serien/(?:[a-z\d-]+/)+videos/
                        (?P<slug>[a-z\d-]+)-(?P<id>[a-z]_[a-z\d]+)
                 '''
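A quick way to check the pattern is to run it against a URL of the expected shape. In the sketch below the prosieben.de URL is made up (only the video ID v_yd0czlu26gxi appears elsewhere in this thread), and parse_video_url is a hypothetical helper, not part of yt-dlp.

```python
import re

VALID_URL = r'''(?x)
    https?://
        (?:www\.)?
        prosieben(?:maxx)?\.(?:de|at|ch)/
        serien/(?:[a-z\d-]+/)+videos/
        (?P<slug>[a-z\d-]+)-(?P<id>[a-z]_[a-z\d]+)
'''


def parse_video_url(url):
    """Return the slug and id groups, or None if the URL does not match."""
    match = re.match(VALID_URL, url)
    return match.groupdict() if match else None


# Hypothetical URL with the expected shape.
example = ('https://www.prosieben.de/serien/example-show/videos/'
           'example-episode-v_yd0czlu26gxi')
parsed = parse_video_url(example)
print(parsed)

# The sat1.at URL from the original report is not covered by this pattern.
op_url = ('https://www.sat1.at/tv/videos/'
          'der-sat-1-bio-check-aldi-rewe-denns-co-ganze-folge')
print(parse_video_url(op_url))
```

Because the slug group is greedy, the split relies on the ID starting with a single letter followed by an underscore; a slug cannot contain an underscore, so the last "-x_" boundary is unambiguous.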

The vas-v4.p7s1video.net getsources API is geo-restricted. As you saw, this API host is already used in the existing yt-dlp ProSiebenSat1 extractor, but with different endpoints (eg geturls) and relying on other API hosts as removed in PR #5593. Presumably it’s geturls vs getsources that causes the old code to return the “eingestellt” video.

The problem mentioned above is that ProSiebenSat1BaseIE is the base class of Puls4IE, which apparently still works. Therefore a separate or intermediate base class is needed for ProSiebenSat1IE, ideally one that could also serve as a base class to simplify ATVAtIE. If any of the other sites supported by the existing ProSiebenSat1IE extractor are still handled correctly, the extractor will have to be cloned, so that the module has ProSiebenSat1IE for those sites and ProSiebenSat1v2IE for prosieben.de and any other broken sites (e.g. sat1.de from the OP).
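The class layout described above might look roughly like the sketch below. It uses stand-in classes rather than importing yt-dlp, the intermediate class name P7S1VasV4BaseIE and the API_ENDPOINT attribute are invented for illustration, and which endpoint each existing class really uses is an assumption based on this discussion.

```python
class InfoExtractor:
    """Stand-in for yt-dlp's InfoExtractor, so the sketch runs on its own."""


class ProSiebenSat1BaseIE(InfoExtractor):
    """Existing base class; left untouched so Puls4IE keeps working."""
    API_ENDPOINT = 'geturls'  # assumed: the older endpoint family


class Puls4IE(ProSiebenSat1BaseIE):
    """Apparently still works with the old logic."""


class ProSiebenSat1IE(ProSiebenSat1BaseIE):
    """Kept only for sites the old logic still handles, if any."""


class P7S1VasV4BaseIE(InfoExtractor):  # hypothetical name
    """New intermediate base: builds the JWT and calls getsources."""
    API_ENDPOINT = 'getsources'


class ProSiebenSat1v2IE(P7S1VasV4BaseIE):
    """prosieben.de and any other sites broken under the old logic."""


class ATVAtIE(P7S1VasV4BaseIE):
    """Could shed its own JWT code by sharing the new base."""


for cls in (Puls4IE, ProSiebenSat1IE, ProSiebenSat1v2IE, ATVAtIE):
    print(f'{cls.__name__}: {cls.API_ENDPOINT}')
```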

Thanks for the hint with the HAR files!

Here are more findings…

The JWT key is not in the HAR file. It is computed in JavaScript, in https://oasis-player-prod.p7s1.io/web/15.18.0/bootstrap/bootstrap.js:

        t.getJWT = function(e, t) {
            var n = e.encryption_key
              , a = e.access_id
              , o = Math.round(Date.now() / 1e3)
              , s = r.__assign(r.__assign({}, t), {
                iat: o,
                nbf: o + -300,
                exp: o + 300
            });
            return (0,
            i.default)(s, n, {
                kid: a
            })
        }

The JWT token is signed using algorithm “HS256”. The secret can be found using the Chrome Developer Tools (“Sources” tab). I could set a breakpoint at the sign() method in “webpack:///node_modules/.pnpm/jwt-encode@1.0.1/node_modules/jwt-encode/src/index.js” to see the secrets.

The secrets are:

  • access_id = “x_supernovatvc-de”
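  • encryption_key = “Ahsh3soxiemusijophoophiodeevujup” (used for HMAC-SHA256)

    // Requires CryptoJS. base64url() was not shown in the original snippet;
    // this is a typical definition (URL-safe alphabet, no padding).
    function base64url (wordArray) {
      return CryptoJS.enc.Base64.stringify(wordArray)
        .replace(/=+$/, '').replace(/\+/g, '-').replace(/\//g, '_');
    }
    function encode (data) {
      const stringifiedData = CryptoJS.enc.Utf8.parse(JSON.stringify(data));
      return base64url(stringifiedData);
    }
    // JWT times are in seconds, as in getJWT above (Date.now() is in ms)
    var now = Math.round(Date.now() / 1000);
    var header = {
      "alg": "HS256",
      "typ": "JWT",
      "kid": "x_supernovatvc-de"
    };
    var payload = {
      "content_ids": {
        "v_yd0czlu26gxi": {}
      },
      "secure_delivery": true,
      "iat": now,
      "nbf": now - 300,
      "exp": now + 300
    };
    // Sign "<header>.<payload>" and append the signature as the third part
    var signingInput = encode(header) + "." + encode(payload);
    var jwt = signingInput + "." + base64url(CryptoJS.HmacSHA256(signingInput, "Ahsh3soxiemusijophoophiodeevujup"));
    console.log(jwt);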

(It still needs to be tested whether the encryption key is the same for every video or specific to each one.)

Unfortunately, I don’t have experience with Python or the yt-dlp code, so I have no idea how to implement it…
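For anyone picking this up, a minimal Python sketch of the same HS256 token construction, using only the standard library, could look like this. It follows getJWT in bootstrap.js in using timestamps in seconds. The function names are hypothetical, the key and access ID are the values reported above (whether they are stable or per-video is untested), and the sketch only builds the token; it does not call the API.

```python
import base64
import hashlib
import hmac
import json
import time

ACCESS_ID = 'x_supernovatvc-de'                      # value reported above
ENCRYPTION_KEY = 'Ahsh3soxiemusijophoophiodeevujup'  # value reported above


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b'=').decode('ascii')


def make_jwt(video_id, now=None):
    """Build an HS256-signed token for one video ID (hypothetical helper)."""
    now = int(time.time()) if now is None else now
    header = {'alg': 'HS256', 'typ': 'JWT', 'kid': ACCESS_ID}
    payload = {
        'content_ids': {video_id: {}},
        'secure_delivery': True,
        'iat': now,
        'nbf': now - 300,  # valid from five minutes ago
        'exp': now + 300,  # expires in five minutes
    }
    signing_input = '.'.join(
        b64url(json.dumps(part, separators=(',', ':')).encode('utf-8'))
        for part in (header, payload))
    signature = hmac.new(
        ENCRYPTION_KEY.encode('utf-8'),
        signing_input.encode('ascii'),
        hashlib.sha256).digest()
    return f'{signing_input}.{b64url(signature)}'


print(make_jwt('v_yd0czlu26gxi'))
```

Passing a fixed now makes the output reproducible, which helps when comparing against a token captured from the browser.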