How to bypass a site's anti-scraping protection?

This code should print everything in the page body, but it seems this site has some kind of protection against scraping.

import requests
from bs4 import BeautifulSoup

URL = 'https://edadeal.ru/kazan'
HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
       ' Chrome/86.0.4240.185 YaBrowser/20.11.2.78 Yowser/2.5 Safari/537.36', 'accept': '*/*'}

def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r


def get_content(html):
    print(BeautifulSoup(html, 'html.parser'))


def parse():
    html = get_html(URL)
    if html.status_code == 200:
        get_content(html.text)
    else:
        print('Error')

parse()

As a result, all that is parsed from the body is:

<div id="root"></div>
Author: McRishka, 2020-12-08

1 answer

The point is that this site really does serve only <div id="root"></div>. Sites like this are called SPAs (single-page applications): all the rest of the content and the DOM tree are built in the browser by JS scripts, which `requests` does not execute.

To get the full page source after all the JS scripts have run, you need a tool that executes JavaScript, such as Selenium; once the page has loaded, you can pass its source to BeautifulSoup for parsing just as before.
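A minimal sketch of that approach with headless Chrome (it assumes Chrome and a matching chromedriver are installed; the helper names and the final print line are illustrative, not part of the original code):

```python
from bs4 import BeautifulSoup

def render_page(url):
    """Load the page in a headless browser so its JS can build the full DOM."""
    # Imported lazily so the parsing helper below works even without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument('--headless')  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source  # HTML after the scripts have run
    finally:
        driver.quit()

def get_content(html):
    """Same parsing step as in the question, now fed the rendered HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('div', id='root')

# print(get_content(render_page('https://edadeal.ru/kazan')))
```

With `requests`, the `div#root` found by `get_content` would be empty; after rendering with Selenium it contains the page content the scripts generated.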

Here is an article with an example on parsing dynamic sites.

Author: Vlad Safonichev, 2020-12-09 12:10:12