[Python] Scraping JavaScript-Rendered Webpages with dryscrape


Today JavaScript is heavily used to render website content. Requests, a Python HTTP library, is often not enough for web scraping on its own: it fetches the HTML the server sends but does not execute JavaScript. In this post we will use dryscrape, a lightweight web scraping library for Python, to scrape webpages whose content is rendered dynamically by JavaScript.
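
To make the limitation concrete, here is a minimal sketch that needs no network access. The HTML string is a made-up page (an assumption for illustration, not taken from any real site) whose greeting element only exists after a browser runs the script. Parsing the raw markup with the standard library, which is roughly what you get from an HTTP response body, finds the empty container but not the greeting.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

# Hypothetical page: the <p id="greeting"> element is created by JavaScript,
# so it is not part of the static markup.
RAW_HTML = """
<html><body>
<div id="content"></div>
<script>
document.getElementById("content").innerHTML = "<p id='greeting'>Hello</p>";
</script>
</body></html>
"""


class IdCollector(HTMLParser):
    """Collect the id attribute of every element found in the static markup."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key == "id":
                self.ids.append(value)


collector = IdCollector()
collector.feed(RAW_HTML)
print(collector.ids)  # only the container is present; "greeting" is missing
```

A tool that actually runs the page's JavaScript, such as dryscrape, is what closes this gap.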

Install dryscrape

$ sudo apt-get install qt5-default libqt5webkit5-dev build-essential python-lxml python-pip xvfb
$ sudo pip install dryscrape

Real World Example

We will write a Python script that visits a webpage containing an iframe, gets the URL of the iframe, fills in the form inside the iframe, and submits it.
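
The script below builds the iframe URL by concatenating root_url and the iframe's src attribute, which works when src is a relative path. As a hedged alternative, the sketch here resolves src with urljoin from the standard library, so absolute and root-relative values are handled too. The helper name resolve_frame_url and the example.com URLs are assumptions for illustration only.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def resolve_frame_url(page_url, frame_src):
    """Resolve an iframe's src attribute against the URL of the page that contains it."""
    return urljoin(page_url, frame_src)


# Hypothetical values for illustration
page_url = "http://example.com/app/index.html"

print(resolve_frame_url(page_url, "form.html"))
# http://example.com/app/form.html
print(resolve_frame_url(page_url, "/static/form.html"))
# http://example.com/static/form.html
print(resolve_frame_url(page_url, "http://other.example.com/form.html"))
# http://other.example.com/form.html
```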

submit.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import dryscrape

# make sure you have xvfb installed
dryscrape.start_xvfb()

root_url = 'YOUR_BASE_URL'

if __name__ == '__main__':
  # set up a web scraping session
  sess = dryscrape.Session(base_url = root_url)

  # we don't need images
  sess.set_attribute('auto_load_images', False)

  # visit webpage
  sess.visit('YOUR_RELATIVE_PATH_TO_BASE_URL')
  # search for iframe with id="mainframe"
  frame = sess.at_xpath('//*[@id="mainframe"]')

  # get the URL of iframe (assumes frame['src'] is a path relative to root_url)
  frameURL = root_url + frame['src']
  # visit the URL of iframe
  sess2 = dryscrape.Session()
  sess2.visit(frameURL)

  # fill in the form in iframe
  name = sess2.at_xpath('//*[@id="username"]')
  name.set("John")
  pid = sess2.at_xpath('//*[@id="person_id"]')
  pid.set("Q123446589")
  year = sess2.at_xpath('//*[@id="bornyear"]')
  year.set("2000")
  mobile = sess2.at_xpath('//*[@id="mobile"]')
  mobile.set("5631365976")

  # submit form
  name.form().submit()

  # save a screenshot of the web page
  sess2.render("test.png")
  print("Session rendered")
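
The four at_xpath/set pairs in the script follow the same pattern, so they can be folded into a small helper. This is only a sketch: fill_form is a hypothetical name, not part of dryscrape, and it assumes the session object offers at_xpath(xpath), returning a node with a set(value) method or None when nothing matches.

```python
def fill_form(session, values):
    """Set each form field, looked up by its id attribute, to the given value.

    `session` is assumed to behave like the dryscrape session in the script:
    at_xpath(xpath) returns a node with a set(value) method, or None when
    no element matches. Returns the list of ids that were not found.
    """
    missing = []
    for field_id, value in values.items():
        node = session.at_xpath('//*[@id="%s"]' % field_id)
        if node is None:
            # remember the field instead of failing on the first miss
            missing.append(field_id)
            continue
        node.set(value)
    return missing


# Usage with the script's second session (field ids taken from the script):
# missing = fill_form(sess2, {
#     "username": "John",
#     "person_id": "Q123446589",
#     "bornyear": "2000",
#     "mobile": "5631365976",
# })
```

Returning the missing ids rather than raising makes it easier to notice when the page layout changes and a field can no longer be found.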

Tested on: Ubuntu Linux 15.10, Python 2.7.10, dryscrape 1.0.

