Many websites now rely on JavaScript to render their content. Requests, a
Python HTTP library, only fetches the raw HTML and does not run scripts, so it
is not enough to scrape such pages. In this post we will use dryscrape, a
lightweight web scraping library for Python, to scrape webpages whose content
is rendered dynamically by JavaScript.
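
To see why a plain HTTP client falls short, here is a minimal sketch that uses only the standard library and is written for Python 3. The page in `STATIC_HTML`, the `content` id, and the helper names are made up for illustration. The snippet parses the markup the way a non-JavaScript client would see it and shows that the element the script is supposed to fill is still empty.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div> stays empty until a browser runs the script.
STATIC_HTML = """
<html><body>
<div id="content"></div>
<script>
document.getElementById("content").textContent = "Hello from JavaScript";
</script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collect the text found inside <div id="content">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "content":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        # Script source also arrives here, but only while self.inside is False.
        if self.inside:
            self.text.append(data)


def visible_text(html):
    """Return the text of <div id="content"> as a static parser sees it."""
    parser = DivTextCollector()
    parser.feed(html)
    return "".join(parser.text).strip()


if __name__ == "__main__":
    # The script was never executed, so the div has no text.
    print(repr(visible_text(STATIC_HTML)))
```

A tool such as dryscrape drives a real WebKit engine instead, so the script runs before the page is queried.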

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    import dryscrape

    # make sure you have xvfb installed
    dryscrape.start_xvfb()

    root_url = 'YOUR_BASE_URL'

    if __name__ == '__main__':
        # set up a web scraping session
        sess = dryscrape.Session(base_url=root_url)

        # we don't need images
        sess.set_attribute('auto_load_images', False)

        # visit webpage
        sess.visit('YOUR_RELATIVE_PATH_TO_BASE_URL')

        # search for iframe with id="mainframe"
        frame = sess.at_xpath('//*[@id="mainframe"]')

        # get the URL of iframe
        frameURL = root_url + frame['src']

        # visit the URL of iframe
        sess2 = dryscrape.Session()
        sess2.visit(frameURL)

        # fill in the form in iframe
        name = sess2.at_xpath('//*[@id="username"]')
        name.set("John")
        pid = sess2.at_xpath('//*[@id="person_id"]')
        pid.set("Q123446589")
        year = sess2.at_xpath('//*[@id="bornyear"]')
        year.set("2000")
        mobile = sess2.at_xpath('//*[@id="mobile"]')
        mobile.set("5631365976")

        # submit form
        name.form().submit()

        # save a screenshot of the web page
        sess2.render("test.png")
        print("Session rendered")

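
The script builds the iframe URL by concatenating `root_url` and `frame['src']`, which only works when the `src` value is a path that lines up with the base URL. As a hedged alternative, the standard library's `urljoin` resolves relative, root-relative, and absolute `src` values the way a browser would. The helper name `iframe_url` and the example URLs below are hypothetical.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def iframe_url(page_url, src):
    """Resolve an iframe's src attribute against the URL of the page holding it."""
    return urljoin(page_url, src)


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    print(iframe_url("http://example.com/app/", "frame.html"))
    print(iframe_url("http://example.com/app/page", "/frame.html"))
    print(iframe_url("http://example.com/app/", "http://other.example/x"))
```

In the script above this would replace the concatenation, for example `frameURL = iframe_url(root_url, frame['src'])`, assuming `root_url` is the URL of the page that contains the iframe.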
Tested on: Ubuntu Linux 15.10, Python 2.7.10, dryscrape 1.0.