yt-dlp: [twitch] Cannot download more than 100 VODs from a channel when providing a `/channel/videos` URL

DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE

  • I understand that I will be blocked if I intentionally remove or skip any mandatory* field

Checklist

Region

Canada

Provide a description that is worded well enough to be understood

Scraping a Twitch channel’s VODs only seems to yield up to 100 videos, even if the channel has more. I’m not sure how this would be fixed, but the same page does seem to yield more videos in the browser, so it’s possible an API endpoint isn’t paginating properly.

Provide verbose output that clearly demonstrates the problem

  • Run your yt-dlp command with -vU flag added (yt-dlp -vU <your command line>)
  • If using API, add 'verbose': True to YoutubeDL params instead
  • Copy the WHOLE output (starting with [debug] Command-line config) and insert it below

Complete Verbose Output

[debug] Command-line config: ['--ignore-config', '--skip-download', 'https://www.twitch.tv/beyondthesummit/videos?filter=archives&sort=time', '-vU']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.03.04 [392389b7d] (pip)
[debug] Python 3.9.2 (CPython x86_64 64bit) - Linux-5.10.0-20-amd64-x86_64-with-glibc2.31 (OpenSSL 1.1.1n  15 Mar 2022, glibc 2.31)
[debug] exe versions: ffmpeg 4.4.4 (fdk,setts), ffprobe 4.4.4
[debug] Optional libraries: Cryptodome-3.9.7, brotli-1.0.9, certifi-2022.12.07, mutagen-1.45.1, sqlite3-2.6.0, websockets-11.0.3
[debug] Proxy map: {}
[debug] Loaded 1786 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Available version: stable@2023.03.04, Current version: stable@2023.03.04
yt-dlp is up to date (stable@2023.03.04)
[TwitchVideos] Extracting URL: https://www.twitch.tv/beyondthesummit/videos?filter=archives&sort=time
[download] Downloading playlist: beyondthesummit - Past Broadcasts sorted by Date
[TwitchVideos] beyondthesummit: Downloading Videos GraphQL page 1
[TwitchVideos] beyondthesummit: Downloading Videos GraphQL page 2
[TwitchVideos] Playlist beyondthesummit - Past Broadcasts sorted by Date: Downloading 100 items of 100
[download] Downloading item 1 of 100

About this issue

  • Original URL
  • State: open
  • Created a year ago
  • Comments: 21 (16 by maintainers)

Most upvoted comments

The _PAGE_LIMIT variable sets how many VOD items are requested with each GraphQL call. The extractor cannot know ahead of time how many VODs a channel has in total, and some APIs have a maximum (or even fixed) number of items that can be requested in a single call.
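To put that in concrete terms, the extractor has to walk the listing cursor by cursor until the API stops returning a next page. The sketch below is illustrative only (the field names follow the usual Relay-style connection layout, not necessarily yt-dlp’s actual code), with fetch_page standing in for one FilterableVideoTower_Videos call:

def iter_channel_vods(fetch_page, page_limit=100):
    # Yield a channel's VOD entries one GraphQL page at a time.
    # The total count is unknown up front, so paging continues until
    # the API reports that there is no further page.
    cursor = None
    while True:
        connection = fetch_page(first=page_limit, after=cursor)  # one GraphQL call
        edges = connection.get('edges') or []
        for edge in edges:
            yield edge['node']
        if not edges or not connection.get('pageInfo', {}).get('hasNextPage'):
            break
        cursor = edges[-1]['cursor']  # resume after the last item of this page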

_PAGE_LIMIT is not the real issue, though; the problem is that the request for the second page fails with this JSON error response:

[
  {
    "errors": [
      {
        "message": "failed integrity check",
        "path": [
          "user",
          "videos"
        ]
      }
    ],
    "data": {
      "user": {
        "id": "29578325",
        "videos": null,
        "__typename": "User"
      }
    },
    "extensions": {
      "challenge": {
        "type": "integrity"
      },
      "durationMilliseconds": 22,
      "operationName": "FilterableVideoTower_Videos",
      "requestID": "01H0MTRTVFB2R7F6DRRZSDXH05"
    }
  }
]

The browser now calls an integrity endpoint between every pagination request, which the extractor is not doing.

I can reproduce this when including cookies with --cookies "/mnt/W/ytdl/cookies.txt" or --cookies-from-browser firefox; excluding cookies returns normal behavior.

With cookies (abnormal):

[debug] Command-line config: ["https://www.twitch.tv/amemiyanazuna/videos?filter=all'&'sort=views", '-s', '--cookies', '/mnt/W/ytdl/cookies.txt', '-vU']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.11.16 from yt-dlp/yt-dlp [24f827875] (linux_exe)
[debug] Python 3.10.13 (CPython x86_64 64bit) - Linux-6.1.0-16-amd64-x86_64-with-glibc2.36 (OpenSSL 3.1.4 24 Oct 2023, glibc 2.36)
[debug] exe versions: ffmpeg 5.1.4-0 (setts), ffprobe 5.1.4-0
[debug] Optional libraries: Cryptodome-3.19.0, brotli-1.1.0, certifi-2023.07.22, mutagen-1.47.0, requests-2.31.0, secretstorage-3.3.3, sqlite3-3.44.0, urllib3-2.1.0, websockets-12.0
[debug] Proxy map: {}
[debug] Request Handlers: urllib, requests
[debug] Loaded 1901 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
[debug] Downloading _update_spec from https://github.com/yt-dlp/yt-dlp/releases/latest/download/_update_spec
[debug] Downloading SHA2-256SUMS from https://github.com/yt-dlp/yt-dlp/releases/download/2023.12.30/SHA2-256SUMS
Current version: stable@2023.11.16 from yt-dlp/yt-dlp
Latest version: stable@2023.12.30 from yt-dlp/yt-dlp
Current Build Hash: 331d8637a0000633c74b7ba7e3c5ce8cfd19940fe7b8ba8bcc3fb771fe182220
Updating to stable@2023.12.30 from yt-dlp/yt-dlp ...
ERROR: Unable to write to /usr/local/bin/yt-dlp; try running as administrator
[TwitchVideos] Extracting URL: https://www.twitch.tv/amemiyanazuna/videos?filter=all'&'sort=views
[download] Downloading playlist: amemiyanazuna - All Videos sorted by Date
[TwitchVideos] amemiyanazuna: Downloading Videos GraphQL page 1
[TwitchVideos] amemiyanazuna: Downloading Videos GraphQL page 2
[TwitchVideos] Playlist amemiyanazuna - All Videos sorted by Date: Downloading 100 items of 100
[download] Downloading item 1 of 100

Without:

[debug] Command-line config: ["https://www.twitch.tv/amemiyanazuna/videos?filter=all'&'sort=views", '-s', '-vU']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.11.16 from yt-dlp/yt-dlp [24f827875] (linux_exe)
[debug] Python 3.10.13 (CPython x86_64 64bit) - Linux-6.1.0-16-amd64-x86_64-with-glibc2.36 (OpenSSL 3.1.4 24 Oct 2023, glibc 2.36)
[debug] exe versions: ffmpeg 5.1.4-0 (setts), ffprobe 5.1.4-0
[debug] Optional libraries: Cryptodome-3.19.0, brotli-1.1.0, certifi-2023.07.22, mutagen-1.47.0, requests-2.31.0, secretstorage-3.3.3, sqlite3-3.44.0, urllib3-2.1.0, websockets-12.0
[debug] Proxy map: {}
[debug] Request Handlers: urllib, requests
[debug] Loaded 1901 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
[debug] Downloading _update_spec from https://github.com/yt-dlp/yt-dlp/releases/latest/download/_update_spec
[debug] Downloading SHA2-256SUMS from https://github.com/yt-dlp/yt-dlp/releases/download/2023.12.30/SHA2-256SUMS
Current version: stable@2023.11.16 from yt-dlp/yt-dlp
Latest version: stable@2023.12.30 from yt-dlp/yt-dlp
Current Build Hash: 331d8637a0000633c74b7ba7e3c5ce8cfd19940fe7b8ba8bcc3fb771fe182220
Updating to stable@2023.12.30 from yt-dlp/yt-dlp ...
ERROR: Unable to write to /usr/local/bin/yt-dlp; try running as administrator
[TwitchVideos] Extracting URL: https://www.twitch.tv/amemiyanazuna/videos?filter=all'&'sort=views
[download] Downloading playlist: amemiyanazuna - All Videos sorted by Date
[TwitchVideos] amemiyanazuna: Downloading Videos GraphQL page 1
[TwitchVideos] amemiyanazuna: Downloading Videos GraphQL page 2
[TwitchVideos] Playlist amemiyanazuna - All Videos sorted by Date: Downloading 117 items of 117
[download] Downloading item 1 of 117

Downloading chat history (operation VideoCommentsByOffsetOrCursor) is also affected by this integrity API. I think it started somewhere between May 2nd and May 4th, since a VOD published on the 4th is when my JSON files containing the live chat started to be much smaller than usual. It is still possible to download the chat history if you query the next page with contentOffsetSeconds instead of a cursor, but that has its disadvantages.
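For context, the difference boils down to which variable goes into the persisted-query request. The field names below are my reading of the web player’s requests and the sha256Hash is a placeholder, so treat this purely as a sketch:

def comments_page_request(video_id, cursor=None, offset_seconds=0):
    # Build the GraphQL operation for one page of chat messages.
    variables = {'videoID': video_id}
    if cursor is not None:
        # Cursor-based paging: the path that now trips the integrity check.
        variables['cursor'] = cursor
    else:
        # Offset-based paging: still usable without an integrity token,
        # with the disadvantages mentioned above.
        variables['contentOffsetSeconds'] = offset_seconds
    return {
        'operationName': 'VideoCommentsByOffsetOrCursor',
        'variables': variables,
        'extensions': {'persistedQuery': {'version': 1, 'sha256Hash': '<persisted query hash>'}},
    }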

sending a header like this

if you decode the JWT: "is_bad_bot": "true"

It is good to know that one can quickly check whether the returned integrity token is actually useful. I don't understand, though, how one decodes the token to find the is_bad_bot key. Could you please elaborate?
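(For anyone else wondering: a JWT is three base64url-encoded segments separated by dots, and the claims live in the middle segment, so it can be inspected without verifying the signature. A minimal sketch, with integrity_token standing in for the token returned by the integrity request:

import base64
import json

def jwt_payload(token):
    # Decode the claims (second) segment of a JWT without verifying its signature.
    payload_b64 = token.split('.')[1]
    payload_b64 += '=' * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# e.g.: print(jwt_payload(integrity_token).get('is_bad_bot'))
)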


In the last few days I have started looking into how the integrity token is generated. It is not yet clear to me exactly which headers must be present, but x-kpsdk-cd and x-kpsdk-ct are important. These are calculated by a script, which is currently loaded from here: https://k.twitchcdn.net/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/p.js If you look into it, you should pretty-print it with something for readability. From names I have seen in the code, the first header probably stands for "client data" and the second one for "client token".
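If you want to follow along, fetching the script and pretty-printing it can be as simple as the following (jsbeautifier is just one option; any JS formatter will do):

import urllib.request
import jsbeautifier  # pip install jsbeautifier

URL = ('https://k.twitchcdn.net/149e9513-01fa-4fb0-aad4-566afd725d1b/'
       '2d206a39-8ed7-437e-a3be-862e0f06eea3/p.js')

with urllib.request.urlopen(URL) as resp:
    raw = resp.read().decode('utf-8')

# Re-indent the obfuscated source so strings like 'solveChallenge'
# can be located and read in context.
with open('p.pretty.js', 'w', encoding='utf-8') as f:
    f.write(jsbeautifier.beautify(raw))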

The generation of the JSON object found in x-kpsdk-cd starts in the function stored in _0x297f6a['solveChallenge']; if you search for this string in the code, the only result should be where the function is defined. Following what it calls, it looks like it basically iterates on hashing a piece of data, possibly with a time limit as well (the workTime in the JSON). It's a bit picky about which hashes it accepts… that's the challenge. That part doesn't seem to be too complicated, though.
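From that description it sounds like a small proof-of-work loop. Purely to illustrate the idea (this is not the actual algorithm from p.js; the hash function, the acceptance rule and the exact role of workTime are guesses on my part):

import hashlib
import time

def solve_challenge(seed, is_acceptable, work_time_ms=500):
    # Keep hashing seed+nonce until a digest passes the (unknown) acceptance
    # rule or the time budget runs out; return whatever was found.
    deadline = time.monotonic() + work_time_ms / 1000
    nonce = 0
    while time.monotonic() < deadline:
        digest = hashlib.sha256(f'{seed}{nonce}'.encode()).hexdigest()
        if is_acceptable(digest):
            return nonce, digest
        nonce += 1
    return None

# "Picky about which hashes it accepts" could mean something like a
# leading-zero rule: solve_challenge('data', lambda d: d.startswith('0000'))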

I have not yet looked into how x-kpsdk-ct is made.

Also, obtaining the right x-kpsdk-ct and x-kpsdk-cd values may be a one- or two-step process. It starts with a GET request to https://gql.twitch.tv/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fp?x-kpsdk-v=j-0.0.0 If you are a first-time visitor, i.e. you don't yet have the KP_UIDz and KP_UIDz-ssn cookies (which so far have had the same value for me: the x-kpsdk-ct returned in the headers of this same request), then this will return HTTP 429. The API is (probably) still working correctly, but the headers you get won't be the final ones yet; you get those by sending these on to this other endpoint: https://gql.twitch.tv/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/tl If you already have those cookies, /fp responds with the final headers, which can be used in the integrity request.
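The first step of that flow, as far as I understand it, looks roughly like this; what exactly has to be sent to /tl afterwards is not covered here, so this is only a sketch:

import requests

# Base path of the endpoints quoted above.
BASE = ('https://gql.twitch.tv/149e9513-01fa-4fb0-aad4-566afd725d1b/'
        '2d206a39-8ed7-437e-a3be-862e0f06eea3')

session = requests.Session()

# First-time visit, i.e. no KP_UIDz / KP_UIDz-ssn cookies yet: per the above,
# expect HTTP 429, but the response headers already carry a preliminary token.
resp = session.get(f'{BASE}/fp', params={'x-kpsdk-v': 'j-0.0.0'})
print(resp.status_code)                # 429 on a first visit
print(resp.headers.get('x-kpsdk-ct'))  # preliminary client token

# Next step (not shown): talk to f'{BASE}/tl' to obtain the final headers.
# With the KP_UIDz / KP_UIDz-ssn cookies present (reportedly the same value as
# the x-kpsdk-ct above), /fp responds with the final headers directly.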

Interestingly, the /fp and /tl endpoints can be found on both the https://gql.twitch.tv/ and the https://passport.twitch.tv/ domains. However, if you only keep /fp and /tl on https://gql.twitch.tv/, plus https://gql.twitch.tv/integrity, and block everything else, you'll still get a working integrity token, which you can hack into yt-dlp, e.g. by setting the X-Device-Id and Client-Integrity headers in the yt_dlp.extractor.twitch.TwitchBaseIE._download_base_gql function. See the screenshot of the exact blocking setup; the * filter was enabled once the /integrity request had finished.
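As a rough illustration of the "hack it into yt-dlp" part, one could monkeypatch the extractor as below. This is only a sketch: it assumes a device id and integrity token captured from a browser session, and that _download_base_gql passes its headers to _download_json as a keyword argument, which I believe is the case but have not pinned to a specific version.

import yt_dlp
from yt_dlp.extractor import twitch

DEVICE_ID = '<X-Device-Id captured from the browser>'          # placeholder
INTEGRITY_TOKEN = '<token from a working /integrity request>'  # placeholder

_orig_base_gql = twitch.TwitchBaseIE._download_base_gql

def _patched_base_gql(self, *args, **kwargs):
    # Temporarily wrap _download_json so every GraphQL request made from
    # _download_base_gql carries the extra headers.
    orig_download_json = self._download_json

    def download_json(*a, **kw):
        headers = dict(kw.get('headers') or {})
        headers.update({'X-Device-Id': DEVICE_ID, 'Client-Integrity': INTEGRITY_TOKEN})
        kw['headers'] = headers
        return orig_download_json(*a, **kw)

    self._download_json = download_json
    try:
        return _orig_base_gql(self, *args, **kwargs)
    finally:
        self._download_json = orig_download_json

twitch.TwitchBaseIE._download_base_gql = _patched_base_gql

with yt_dlp.YoutubeDL({'skip_download': True}) as ydl:
    ydl.extract_info('https://www.twitch.tv/beyondthesummit/videos?filter=archives&sort=time')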


This is so deep (and I'm sure I forgot something, but I have things written up) that I think it would be worth putting together some kind of document, but honestly I don't even know where to start, and I'm also not sure it's a good idea to discuss all the details publicly (though at some point it will end up in the code anyway…).

Indeed, if only www.twitch.tv and its CDN are enabled for JS, the same error is returned to the browser. Therefore some other part of the vast amount of JS served as part of the page must be required to generate the parameters for the integrity request.