[Python] Scraping JavaScript-Rendered Webpages with dryscrape

Many websites today rely on JavaScript to render their content, so fetching the raw HTML with Requests, a Python HTTP library, is often not enough for web scraping. In this post we use dryscrape, a lightweight web scraping library for Python, to scrape webpages that are rendered dynamically by JavaScript.

Install dryscrape

$ sudo apt-get install qt5-default libqt5webkit5-dev build-essential python-lxml python-pip xvfb
$ sudo pip install dryscrape
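
Before going further, it can help to confirm that the pieces are in place. The following is a minimal sketch, assuming Python 3.4 or later (the script in this post was tested on Python 2.7). The helper name `missing_prerequisites` is hypothetical and not part of dryscrape. It only checks that the `dryscrape` module can be found and that an `Xvfb` executable is on the PATH.

```python
import importlib.util
import shutil


def missing_prerequisites(modules=("dryscrape",), programs=("Xvfb",)):
    """Return the names of required modules/programs that cannot be found.

    Hypothetical helper for illustration; not part of dryscrape.
    """
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    for name in programs:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    print("missing: " + ", ".join(problems) if problems else "all prerequisites found")
```

An empty list means both checks passed. It does not prove that the Qt/WebKit build behind dryscrape works, only that the package and the Xvfb binary are present.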

Real World Example

We will write a Python script that visits a webpage containing an iframe, gets the URL of the iframe, fills in the form inside the iframe, and submits it.

submit.py

#!/usr/bin/env python
# -*- coding:utf-8 -*-

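import sys

import dryscrape

# start a virtual X display in case no X server is running (Linux only);
# make sure you have xvfb installed, otherwise this won't work
if 'linux' in sys.platform:
  dryscrape.start_xvfb()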

root_url = 'YOUR_BASE_URL'

if __name__ == '__main__':
  # set up a web scraping session
  sess = dryscrape.Session(base_url=root_url)

  # we don't need images
  sess.set_attribute('auto_load_images', False)

  # visit webpage
  sess.visit('YOUR_RELATIVE_PATH_TO_BASE_URL')
  # search for iframe with id="mainframe"
  frame = sess.at_xpath('//*[@id="mainframe"]')

  # get the URL of iframe (assumes its src is a path relative to root_url)
  frameURL = root_url + frame['src']
  # visit the URL of iframe
  sess2 = dryscrape.Session()
  sess2.visit(frameURL)

  # fill in the form in iframe
  name = sess2.at_xpath('//*[@id="username"]')
  name.set("John")
  pid = sess2.at_xpath('//*[@id="person_id"]')
  pid.set("Q123446589")
  year = sess2.at_xpath('//*[@id="bornyear"]')
  year.set("2000")
  mobile = sess2.at_xpath('//*[@id="mobile"]')
  mobile.set("5631365976")

  # submit form
  name.form().submit()

  # save a screenshot of the web page
  sess2.render("test.png")
  print("Session rendered")

Tested on: Ubuntu Linux 15.10, Python 2.7.10, dryscrape 1.0.
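
The script builds the iframe URL by plain string concatenation, which only works when the iframe's `src` is a relative path that lines up with the base URL. As a hedged alternative, the standard library's `urljoin` handles relative, root-relative, and absolute `src` values. This is a minimal sketch: the helper name `resolve_frame_url` and the example.com URLs are placeholders, not taken from the original script.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2


def resolve_frame_url(base_url, src):
    """Resolve an iframe src attribute against the URL of the page holding it.

    Hypothetical helper for illustration; not part of dryscrape.
    """
    return urljoin(base_url, src)


if __name__ == "__main__":
    base = "http://example.com/app/"  # placeholder base URL
    # relative path: appended to the base directory
    print(resolve_frame_url(base, "frame/form.html"))
    # root-relative path: replaces the whole path of the base URL
    print(resolve_frame_url(base, "/frame/form.html"))
    # absolute URL: returned unchanged
    print(resolve_frame_url(base, "http://other.example/form.html"))
```

In the script above, this would replace the concatenation with something like `frameURL = resolve_frame_url(root_url, frame['src'])`, assuming `root_url` is the address of the page that embeds the iframe.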
