[Golang] Web Scrape Facebook Post via goquery


Introduction

When it comes to web scraping, people usually think of Python. The Python ecosystem has many powerful libraries and tools for web scraping and HTML parsing, but Go is catching up. Here we show how to extract data from a public Facebook post with Go and goquery, a library that offers jQuery-like syntax for HTML parsing.

We will use this Facebook post as an example and show how to extract data from it. Remember that Facebook serves different HTML depending on whether the user is logged in. Here we fetch and parse only public posts, so login is not required. If you view the source of the Facebook post, log out of all Facebook accounts first, so that you see the same HTML the scraper receives and are not overwhelmed by extra markup.

The following are steps for web scraping:

  1. Get the part of the HTML that contains the Facebook post.
  2. Get the timestamp of the post.
  3. Get the profile link of the post.
  4. Get the image URL of the post.
  5. Get the content of the post.

Extract HTML String of Post

First we need to get the part of the HTML that contains the Facebook post. In the page source served to logged-out visitors, that part looks like this:

<div class="hidden_elem"><code id="u_0_f"><!-- {{POST_HTML}} --></code></div>

It's embedded as a comment and used by JavaScript. The following code can extract {{POST_HTML}}:

fb.go

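package parsefb

import (
	"errors"

	"github.com/PuerkitoBio/goquery"
)

// Parse fetches url and returns the post HTML that Facebook embeds as a
// comment inside <div class="hidden_elem"><code>...</code></div>.
func Parse(url string) (string, error) {
	// Newer goquery releases mark NewDocument as deprecated in favor of
	// net/http plus NewDocumentFromReader.
	doc, err := goquery.NewDocument(url)
	if err != nil {
		return "", err
	}

	// The post is in the first element that matches the selector.
	s := doc.Find("div.hidden_elem > code").First()
	cmt, err := s.Html()
	if err != nil {
		return "", err
	}

	// Guard the slice below: "<!-- " and " -->" take 9 bytes together.
	if len(cmt) < 9 {
		return "", errors.New("cannot find post html")
	}
	postHtml := cmt[5 : len(cmt)-4]
	return postHtml, nil
}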

We use the goquery Find function with the CSS selector div.hidden_elem > code. Two elements match this selector, and the post is in the first one, so we call First to select it. The Html function then returns the innerHTML of that element. Finally we strip the leading and trailing comment delimiters (<!-- and -->) and return the HTML string of the post.
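
The comment-stripping step can be tried on its own, without goquery or a network request. The following is a minimal sketch that uses only the standard library. The helper name stripComment is hypothetical and not part of fb.go, and the sketch assumes the whole input string is a single HTML comment.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// stripComment removes the HTML comment delimiters around s and trims the
// surrounding whitespace. It returns an error if s is not a comment.
// Hypothetical helper for illustration; not part of the original fb.go.
func stripComment(s string) (string, error) {
	s = strings.TrimSpace(s)
	if len(s) < len("<!---->") ||
		!strings.HasPrefix(s, "<!--") ||
		!strings.HasSuffix(s, "-->") {
		return "", errors.New("not an HTML comment")
	}
	inner := s[len("<!--") : len(s)-len("-->")]
	return strings.TrimSpace(inner), nil
}

func main() {
	// Made-up input that mimics the shape of the embedded post.
	post, err := stripComment(`<!-- <div class="post">hello</div> -->`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(post) // <div class="post">hello</div>
}
```

Checking the prefix and suffix instead of slicing at fixed offsets means unexpected input produces an error rather than a panic or silently truncated HTML.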

Post Timestamp

Given the HTML string of the post, the next step is to get the timestamp of the post. Looking at the HTML string again, we find that the time is embedded in the following element:

<abbr title="Wednesday, February 15, 2017 at 7:00am" data-utime="1487113202" data-shorten="1" class="_5ptz"><span class="timestampContent">Yesterday at 7:00am</span></abbr>

We can use the following code to extract the utime and convert it to human-readable form:

time.go

package parsefb

import (
	"errors"
	"strconv"
	"strings"
	"time"

	"github.com/PuerkitoBio/goquery"
)

// ParseTimeStamp converts a Unix time string (seconds since the epoch)
// to an RFC 3339 string in the local time zone.
func ParseTimeStamp(utime string) (string, error) {
	i, err := strconv.ParseInt(utime, 10, 64)
	if err != nil {
		return "", err
	}
	t := time.Unix(i, 0)
	return t.Format(time.RFC3339), nil
}

// GetTimeStamp reads the data-utime attribute of the first abbr._5ptz
// element in postHtml and returns it in human-readable form.
func GetTimeStamp(postHtml string) (string, error) {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(postHtml))
	if err != nil {
		return "", err
	}

	s := doc.Find("abbr._5ptz").First()
	utime, ok := s.Attr("data-utime")
	if ok {
		return ParseTimeStamp(utime)
	}

	return "", errors.New("cannot find timestamp")
}

See my post for parsing Unix time [4] for more details.

Post Profile Link

Get the name and URL of the user who published the post.

profilelink.go

package parsefb

import (
	"errors"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// ProfileLink holds the display name and profile URL of the poster.
type ProfileLink struct {
	Name string
	Url  string
}

// GetProfileLink returns the name and URL of the user of the post.
func GetProfileLink(postHtml string) (*ProfileLink, error) {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(postHtml))
	if err != nil {
		return nil, err
	}

	// Try the primary selector first, then fall back to a second one.
	s := doc.Find("a.profileLink").First()
	if s.Length() == 0 {
		s = doc.Find("span.fwb.fcg > a").First()
	}

	pl := ProfileLink{}

	pl.Name = s.Text()
	if pl.Name == "" {
		return nil, errors.New("cannot find name of profile link")
	}

	url, ok := s.Attr("href")
	if !ok {
		return nil, errors.New("cannot find url of profile link")
	}
	pl.Url = url

	return &pl, nil
}

The logic is the same as in the previous steps: find the element that contains the data you are looking for, and use a CSS selector that matches it. Here, if a.profileLink matches nothing, the code falls back to span.fwb.fcg > a.

Post Image

Retrieve the URL of the image of the post:

image.go

package parsefb

import (
	"errors"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// GetImageUrl returns the src attribute of the post image.
func GetImageUrl(postHtml string) (string, error) {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(postHtml))
	if err != nil {
		return "", err
	}

	// The image carries one of two classes; try each in turn.
	s := doc.Find("img.scaledImageFitHeight").First()
	if s.Length() == 0 {
		s = doc.Find("img.scaledImageFitWidth").First()
	}

	url, ok := s.Attr("src")
	if !ok {
		return "", errors.New("cannot find image url")
	}

	return url, nil
}

Post Content

Get the content of the post:

content.go

package parsefb

import (
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// GetContent returns the inner HTML of the post body, or the string
// "no content" if the post has no text.
func GetContent(postHtml string) (string, error) {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(postHtml))
	if err != nil {
		return "", err
	}

	s := doc.Find("div._5pbx.userContent").First()
	if s.Length() == 0 {
		return "no content", nil
	}

	content, err := s.Html()
	if err != nil {
		return "", err
	}

	return content, nil
}

Summary

Putting all of the above code together to extract data from the post:

fb_test.go

package parsefb

import (
	"testing"
)

// TestParse fetches a live public post, so it needs network access.
func TestParse(t *testing.T) {
	url := "https://www.facebook.com/jayasaro.panyaprateep.org/posts/1095007907274561:0"
	postHtml, err := Parse(url)
	if err != nil {
		t.Fatal(err)
	}
	t.Log(postHtml)

	timestamp, err := GetTimeStamp(postHtml)
	if err != nil {
		t.Fatal(err)
	}
	t.Log(timestamp)

	pl, err := GetProfileLink(postHtml)
	if err != nil {
		t.Fatal(err)
	}
	t.Log(pl.Name)
	t.Log(pl.Url)

	imgurl, err := GetImageUrl(postHtml)
	if err != nil {
		t.Fatal(err)
	}
	t.Log(imgurl)

	content, err := GetContent(postHtml)
	if err != nil {
		t.Fatal(err)
	}
	t.Log(content)
}

Tested on: Ubuntu Linux 16.10, Go 1.8.


References:

[1] GitHub - PuerkitoBio/goquery: A little like that j-thing, only in Go. (godoc)
[2] Tips and tricks · PuerkitoBio/goquery Wiki · GitHub
[3] goquery querySelector
[4] [Golang] Parse Unix Time (utime) Example