yt-dlp: [twitch] Cannot download more than 100 VODs from a channel when providing a `/channel/videos` URL
DO NOT REMOVE OR SKIP THE ISSUE TEMPLATE
- I understand that I will be blocked if I intentionally remove or skip any mandatory* field
Checklist
- I’m reporting that a supported site is broken
- I’ve verified that I’m running yt-dlp version 2023.03.04 (update instructions) or later (specify commit)
- I’ve checked that all provided URLs are playable in a browser with the same IP and same login details
- I’ve checked that all URLs and arguments with special characters are properly quoted or escaped
- I’ve searched known issues and the bugtracker for similar issues including closed ones. DO NOT post duplicates
- I’ve read the guidelines for opening an issue
- I’ve read about sharing account credentials and I’m willing to share it if required
Region
Canada
Provide a description that is worded well enough to be understood
Scraping a Twitch channel's VODs only yields up to 100 videos, even when the channel has more. I'm not sure how this would be fixed, but the same page does show more videos in a browser, so it's possible an API endpoint isn't paginating properly.
Provide verbose output that clearly demonstrates the problem
- Run your yt-dlp command with the `-vU` flag added (`yt-dlp -vU <your command line>`)
- If using the API, add `'verbose': True` to the `YoutubeDL` params instead
- Copy the WHOLE output (starting with `[debug] Command-line config`) and insert it below
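For the API case, a minimal embedding example with verbose logging enabled (using the URL from this report) might look like:

```python
import yt_dlp

URL = 'https://www.twitch.tv/beyondthesummit/videos?filter=archives&sort=time'

# 'verbose' mirrors -v and 'skip_download' mirrors --skip-download from this report
with yt_dlp.YoutubeDL({'verbose': True, 'skip_download': True}) as ydl:
    ydl.download([URL])
```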
Complete Verbose Output
[debug] Command-line config: ['--ignore-config', '--skip-download', 'https://www.twitch.tv/beyondthesummit/videos?filter=archives&sort=time', '-vU']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.03.04 [392389b7d] (pip)
[debug] Python 3.9.2 (CPython x86_64 64bit) - Linux-5.10.0-20-amd64-x86_64-with-glibc2.31 (OpenSSL 1.1.1n 15 Mar 2022, glibc 2.31)
[debug] exe versions: ffmpeg 4.4.4 (fdk,setts), ffprobe 4.4.4
[debug] Optional libraries: Cryptodome-3.9.7, brotli-1.0.9, certifi-2022.12.07, mutagen-1.45.1, sqlite3-2.6.0, websockets-11.0.3
[debug] Proxy map: {}
[debug] Loaded 1786 extractors
[debug] Fetching release info: https://api.github.com/repos/yt-dlp/yt-dlp/releases/latest
Available version: stable@2023.03.04, Current version: stable@2023.03.04
yt-dlp is up to date (stable@2023.03.04)
[TwitchVideos] Extracting URL: https://www.twitch.tv/beyondthesummit/videos?filter=archives&sort=time
[download] Downloading playlist: beyondthesummit - Past Broadcasts sorted by Date
[TwitchVideos] beyondthesummit: Downloading Videos GraphQL page 1
[TwitchVideos] beyondthesummit: Downloading Videos GraphQL page 2
[TwitchVideos] Playlist beyondthesummit - Past Broadcasts sorted by Date: Downloading 100 items of 100
[download] Downloading item 1 of 100
About this issue
- State: open
- Created a year ago
- Comments: 21 (16 by maintainers)
Commits related to this issue
- [extractor/twitch] Update `_CLIENT_ID` and add extractor-arg (#7200) Closes #7058, Closes #7183 Authored by: bashonly — committed to yt-dlp/yt-dlp by bashonly a year ago
The `_PAGE_LIMIT` variable sets how many VOD items are requested with each GraphQL call. The extractor cannot know ahead of time how many VOD items a channel has in total, and some APIs have a maximum (or even fixed) number that can be requested in one call (a generic sketch of this kind of cursor pagination is shown below). `_PAGE_LIMIT` is not the real issue though; the problem is that the request for the 2nd page fails with this JSON error response:

The browser now calls an `integrity` endpoint in between every pagination request, which the extractor is not doing.

I can reproduce this when including cookies with `--cookies "/mnt/W/ytdl/cookies.txt"` or `--cookies-from-browser firefox`; excluding cookies returns normal behavior.

With cookies (abnormal):

Without:
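For context, a generic sketch of cursor-based GraphQL pagination with a per-call page limit; the names are illustrative and `fetch_page` is a hypothetical helper, not the actual extractor code:

```python
# Illustrative only: a generic cursor-paginated GraphQL fetch loop.
# `fetch_page(channel, first, after)` is a hypothetical helper that performs
# one GraphQL call returning at most `first` items plus paging info.
def iter_videos(fetch_page, channel, page_limit=100):
    cursor = None
    while True:
        page = fetch_page(channel, first=page_limit, after=cursor)
        edges = page['videos']['edges']
        yield from (edge['node'] for edge in edges)
        if not page['videos']['pageInfo']['hasNextPage']:
            break
        # It is this follow-up request for the next page that now fails
        # unless the integrity headers are present.
        cursor = edges[-1]['cursor']
```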
Downloading chat history (operation `VideoCommentsByOffsetOrCursor`) is also affected by this integrity API. I think it started somewhere between May 2nd and 4th, as a VOD published on the 4th is when my JSON files containing the live chat started to be much smaller than usual. It is still possible to download the chat history if you query for the next page using `contentOffsetSeconds` instead of a cursor, but that has its disadvantages.

It is good to know that it can be quickly checked whether the integrity token returned is actually useful. I don't understand, though, how one decodes the token to find the `is_bad_bot` key. Could you please elaborate?
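Assuming the integrity token is JWT-shaped (three base64url segments separated by dots), a minimal sketch of inspecting its payload for that key could look like this; the exact payload layout is an assumption, not something confirmed in this thread:

```python
import base64
import json

def integrity_token_looks_usable(token: str) -> bool:
    """Decode the middle (payload) segment of a JWT-shaped token and
    check the `is_bad_bot` key. Payload layout is an assumption."""
    payload_b64 = token.split('.')[1]
    payload_b64 += '=' * (-len(payload_b64) % 4)  # restore stripped base64url padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    # A token flagged as coming from a "bad bot" is not useful for pagination
    return str(payload.get('is_bad_bot', 'false')).lower() == 'false'
```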
In the last few days I have started looking into how the integrity token is generated. It is not yet clear to me exactly which headers must be present, but `x-kpsdk-cd` and `x-kpsdk-ct` are important. These are calculated by a script, which is currently loaded from here: https://k.twitchcdn.net/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/p.js If you look into it, you should pretty-print it with something for readability. From names I have seen in the code, the first header probably stands for client data, and the second one for client token.

The generation of the JSON object found in `x-kpsdk-cd` starts in the function stored in `_0x297f6a['solveChallenge']`; if you search for this string in the code, the only result should be where the function is defined. Following what it calls, it looks like it basically iterates on hashing a piece of data, maybe also with a time limit (the `workTime` in the JSON). It's a bit picky about which hashes it accepts… that's the challenge. That part doesn't seem to be too complicated, though.

I have not yet looked into how `x-kpsdk-ct` is made.
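As a rough illustration of that "keep hashing until a hash is accepted" pattern, assuming it behaves like a generic proof-of-work; the seed, acceptance rule and hash function below are placeholders, not what p.js actually uses:

```python
import hashlib
import time

# Placeholder proof-of-work: iterate a nonce until the hash of seed+nonce
# satisfies some acceptance rule, tracking how long it took (cf. workTime).
# Seed, rule and hash function are assumptions, not taken from p.js.
def solve_challenge(seed: bytes, difficulty: int = 2, time_limit: float = 1.0):
    start = time.monotonic()
    nonce = 0
    while time.monotonic() - start < time_limit:
        digest = hashlib.sha256(seed + str(nonce).encode()).hexdigest()
        if digest.startswith('0' * difficulty):  # "picky about which hashes it accepts"
            return {'nonce': nonce, 'hash': digest,
                    'workTime': time.monotonic() - start}
        nonce += 1
    return None  # ran out of time without an acceptable hash
```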
Also, obtaining the right `x-kpsdk-ct` and `x-kpsdk-cd` values may be a 1- or 2-step process. It starts with a GET request to: https://gql.twitch.tv/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fp?x-kpsdk-v=j-0.0.0 If you are a first-time visitor, i.e. you don't yet have the `KP_UIDz` and `KP_UIDz-ssn` cookies (which so far have had the same value for me: the `x-kpsdk-ct` returned in the headers of this same request), then this will return HTTP 429. The API is (probably) still working correctly, but the headers you get won't be the final ones yet; you get those by sending these to this other endpoint: https://gql.twitch.tv/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/tl If you already have those cookies, /fp will respond with the final headers, which can be used in the integrity request.
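A minimal sketch of the first step only, using the /fp URL quoted above; the payload expected by the follow-up /tl endpoint is not described here, so it is left out:

```python
import urllib.error
import urllib.request

FP_URL = ('https://gql.twitch.tv/149e9513-01fa-4fb0-aad4-566afd725d1b/'
          '2d206a39-8ed7-437e-a3be-862e0f06eea3/fp?x-kpsdk-v=j-0.0.0')

def fetch_fp_headers(cookie_header=None):
    # With the KP_UIDz/KP_UIDz-ssn cookies attached, the response headers
    # should already be the final ones usable for the /integrity request.
    req = urllib.request.Request(FP_URL)
    if cookie_header:
        req.add_header('Cookie', cookie_header)
    try:
        with urllib.request.urlopen(req) as resp:
            headers = resp.headers
    except urllib.error.HTTPError as err:
        # First-time visitors (no cookies yet) reportedly get HTTP 429 here;
        # the headers of that response still feed the follow-up /tl request.
        headers = err.headers
    return headers.get('x-kpsdk-ct'), headers.get_all('Set-Cookie') or []
```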
Interestingly, the /fp and /tl endpoints can be found on both the `https://gql.twitch.tv/` and the `https://passport.twitch.tv/` domains. However, if you only keep /fp and /tl on `https://gql.twitch.tv/`, plus `https://gql.twitch.tv/integrity`, but block everything else, you'll still get a working integrity token, which you can hack into yt-dlp e.g. by setting the X-Device-Id and Client-Integrity headers in the yt_dlp.extractor.twitch.TwitchBaseIE._download_base_gql function. See a screenshot of the exact blocking setup. The `*` filter was enabled once the /integrity request had finished.

This is so deep (and I'm sure I forgot something, but I have things written up) that I think it would be worth putting together some kind of document, but honestly I don't even know where to start, and I'm also not sure it's a good idea to discuss all the details publicly (though at some point it will get into the code anyway…).
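For illustration, replaying a GQL request with those two headers outside of yt-dlp might look roughly like this; the Client-ID placeholder and the Content-Type value are assumptions mirroring what the extractor/browser appear to send, and the `ops` payload shape is up to the caller:

```python
import json
import urllib.request

# Placeholder for the value of yt_dlp.extractor.twitch.TwitchBaseIE._CLIENT_ID
CLIENT_ID = '<twitch web client id>'

def gql_request(ops, device_id, integrity_token):
    # `ops` is the usual GraphQL operations payload; the Content-Type below
    # mirrors what the browser sends and is an assumption here.
    req = urllib.request.Request(
        'https://gql.twitch.tv/gql',
        data=json.dumps(ops).encode(),
        headers={
            'Client-ID': CLIENT_ID,
            'X-Device-Id': device_id,
            'Client-Integrity': integrity_token,
            'Content-Type': 'text/plain;charset=UTF-8',
        })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```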
Indeed, if only www.twitch.tv and its CDN are enabled for JS, the same error is returned to the browser. Therefore some other part of the vast amount of JS served as part of the page must be required to generate the parameters for the `integrity` request.