Parsing videos from Youtube channel (Python, BeautifulSoup, requests)

Question

I have the following code:

from bs4 import BeautifulSoup
import requests

ycid = input('Enter the channel ID: ')  # get the channel ID

url = f'https://www.youtube.com/channel/{ycid}'  # build the channel URL
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 7.0; Win32; x32) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
print(f'Parsing channel: {url}')
html = requests.get(url, headers = HEADERS )
html = html.text
soup = BeautifulSoup(html, 'lxml')

div_tags = soup.find_all('div', {'class': 'style-scope ytd-grid-video-renderer', 'id': 'dismissable'})
print(div_tags)
a_tags = [div.find('a') for div in div_tags]
print(a_tags)
url_img = [a['href'] for a in a_tags]
print(url_img)

The code should parse the first video from the specified channel, but it produces empty lists.

Result:

Enter the channel ID: UCviSYAcwdnDX1UoRzAHYgNg
Parsing channel: https://www.youtube.com/channel/UCviSYAcwdnDX1UoRzAHYgNg
[]
[]
[]

How can I fix the code?

python python-3.x requests beautiful-soup

Author: Евгений, 2020-04-10

1 answer

score 3 · Accepted Answer

Take any channel and view the page source with Ctrl+U.

If you search the source for the name of a video, you will find it, but not as HTML markup. BS sees the page as the server sent it, before any scripts have run, so it can differ from the HTML shown in your browser's developer tools. That is why your find_all calls return empty lists. You can take r.content (the response body from requests) and use a JSON parser to extract the information. That said, you can also get the video links without going this route.
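For the JSON route mentioned above, here is a minimal sketch. It assumes the page embeds its data as a JavaScript assignment such as `var ytInitialData = {...};` and that video IDs sit under a `videoId` key. Both names are assumptions about the page layout, which may change at any time, and the helper names `extract_json_after` and `find_values` are hypothetical. The sample HTML is made up for illustration.

```python
import json


def extract_json_after(html, marker):
    """Return the JSON object that follows `marker` in `html`, or None."""
    start = html.find(marker)
    if start == -1:
        return None
    brace = html.find("{", start + len(marker))
    if brace == -1:
        return None
    # raw_decode parses one JSON value and ignores whatever follows it
    obj, _end = json.JSONDecoder().raw_decode(html, brace)
    return obj


def find_values(node, key):
    """Recursively yield every value stored under `key` in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Made-up page fragment; the variable name and key are assumptions
sample_html = '<script>var ytInitialData = {"items": [{"videoId": "HQxZaeGxwQs"}]};</script>'
data = extract_json_after(sample_html, "ytInitialData")
video_ids = list(find_values(data, "videoId"))
print(video_ids)
```

With a real page you would pass the text of the response instead of `sample_html`, and check that `data` is not None before walking it.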

Since you are a beginner, I would recommend the combination of Selenium + geckodriver + BeautifulSoup. Selenium opens the page, executes the JavaScript, and hands the resulting HTML to BS. Each video is a separate ytd-grid-video-renderer element with the "style-scope ytd-grid-renderer" class.

The video title is contained in the title attribute of the tag with id="video-title".

The href attribute of that tag contains a link without the domain (/watch?v=HQxZaeGxwQs).
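Because the href is relative, it has to be joined with the site's domain before it can be opened. One way to do that, sketched here with the standard library's urljoin rather than string concatenation (the `BASE` name is just for illustration):

```python
from urllib.parse import urljoin

BASE = "https://www.youtube.com"  # the domain the relative link belongs to
href = "/watch?v=HQxZaeGxwQs"     # relative link as it appears in the href

# urljoin handles the slash between the two parts for us
link = urljoin(BASE, href)
print(link)
```

This avoids the doubled slash you get when the base already ends in "/" and the href starts with one.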

Sample code:

import time

from selenium import webdriver
from bs4 import BeautifulSoup as BS

URL = ""  # your URL

driver = webdriver.Chrome()  # or webdriver.Firefox() if you installed geckodriver
driver.get(URL)
time.sleep(10)  # you could wait for the page to finish loading, but sleeping 10 seconds is simpler and more than enough
html = driver.page_source

The page's HTML is now in the html variable. Let's find all the videos and their titles in it:

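soup = BS(html, "html.parser")
# each video on the channel page is a separate ytd-grid-video-renderer element
videos = soup.find_all("ytd-grid-video-renderer", {"class": "style-scope ytd-grid-renderer"})
for video in videos:
    a = video.find("a", {"id": "video-title"})
    if a is None:  # skip entries without a title link
        continue
    name = a.get_text(strip=True)
    # href is relative (/watch?v=...), so prepend the domain without a trailing slash
    link = "https://www.youtube.com" + a.get("href")
    print(name, link)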
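The sample code sleeps for a fixed 10 seconds, and its comment notes that you could wait for the page instead. A rough sketch of that idea is a small polling helper. The `wait_until` name and its parameters are hypothetical, not part of Selenium; Selenium ships its own WebDriverWait class for the same purpose, which is likely the better choice in real code.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll `predicate` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met in time")
        time.sleep(interval)


# Intended use with the driver from the sample code (not executed here):
#   wait_until(lambda: "ytd-grid-video-renderer" in driver.page_source)
#   html = driver.page_source
```

The helper returns as soon as the condition holds, so a fast page does not cost the full 10 seconds, and a page that never loads raises TimeoutError instead of silently producing an empty result.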