Scraping Single page applications
These frontend frameworks are complicated to deal with because there are often using the newest features of the HTML5 API.
Headless Chrome with Python
You will need to install the selenium package:
pip install selenium
And of course, you need a Chrome browser, and Chromedriver installed on your system.
On macOS, you can simply use brew:
brew install chromedriver
Taking a screenshot
The code is really straightforward, I just added a parameter --window-size because the default size was too small.
You should now have a nice screenshot of the Nintendo's home page:
Waiting for the page load
Most of the times, lots of AJAX calls are triggered on a page, and you will have to wait for these calls to load to get the fully rendered page.
A simple solution to this is to just
time.sleep() en arbitrary amount of time. The problem with this method is that you are either waiting too long, or too little depending on your latency and internet connexion speed.
The other solution is to use the
WebDriverWait object from the Selenium API:
This is a great solution because it will wait the exact amount of time necessary for the element to be rendered on the page.
As you can see, setting up Chrome in headless mode is really easy in Python. The most challenging part is to manage it in production. If you scrape lots of different websites, the resource usage will be volatile.
This is one of the reason we started ScrapingNinja, so that developers can focus on extracting the data they want, not managing Headless browsers and proxies!
This was our first post on this blog, I hope you enjoyed it!