[Golang] Web Scrape Facebook Post via goquery


Introduction

When it comes to web scraping, people usually think of Python. The Python ecosystem has many powerful libraries and tools for web scraping and HTML parsing, but Go is catching up too. Here we show how to extract data from a public Facebook post with Go and goquery, an HTML parsing library whose API is similar to JavaScript's jQuery.

We will use this Facebook post as an example and show how to extract data from it. Remember that Facebook serves different HTML depending on whether the user is logged in. Here we fetch and parse only public posts, so login is not required. If you view the source of the Facebook post, log out of all Facebook accounts first, so that you are not overwhelmed by too much HTML.

The steps for scraping are as follows:

  1. Get the part of the HTML which contains the Facebook post.
  2. Get the timestamp of the post.
  3. Get the profile link of the post.
  4. Get the image URL of the post.
  5. Get the content of the post.

Extract HTML String of Post

First we need to get the part of the HTML which contains the Facebook post. After checking the source of the post, I found that, when not logged in, the part of the HTML containing the post looks like this:

<div class="hidden_elem"><code id="u_0_f"><!-- {{POST_HTML}} --></code></div>

It's embedded as a comment and used by JavaScript. The following code can extract {{POST_HTML}}:

We use goquery's Find function with the CSS selector div.hidden_elem > code. Two elements match this selector and the post is in the first one, so we use the First function to select it. We then retrieve the inner HTML of that element with the Html function, strip the leading <!-- and trailing --> comment delimiters, and return the HTML string of the post.

Post Timestamp

Given the HTML string of the post, the next step is to get its timestamp. Looking at the HTML string again, we find that the time is embedded in the following element:

<abbr title="Wednesday, February 15, 2017 at 7:00am" data-utime="1487113202" data-shorten="1" class="_5ptz"><span class="timestampContent">Yesterday at 7:00am</span></abbr>

We can use the following code to extract the data-utime value (Unix time in seconds) and convert it to a human-readable form:

See my post on parsing Unix time [5] for more details.

Post Profile Link

Get the name and profile URL of the user who made the post.

The logic is the same as before: find the element that contains the data you are looking for, and use the right CSS selector to get it.

Post Image

Retrieve the URL of the image of the post:

Post Content

Get the content of the post:

Summary

Use all of the above code to extract data from the post:

The complete code can be found in my GitHub repo [4].


Tested on: Ubuntu Linux 16.10, Go 1.8.


References:

[1] GitHub - PuerkitoBio/goquery: A little like that j-thing, only in Go (godoc)
[2] Tips and tricks · PuerkitoBio/goquery Wiki · GitHub
[3] goquery querySelector
[4] GitHub - siongui/go-facebook-post-parser: web scrape facebook post and extract data
[5] [Golang] Parse Unix Time (utime) Example