[Golang] Extract Text via State Machine and goquery


Introduction

Extract text (for example, a footnote section) from HTML with a state machine and goquery in Golang (the Go programming language).

Assume we have the following HTML:

index.html
<!DOCTYPE html>
<html>
<head><title>[Golang] Extract Footnote via State Machine</title></head>
<body>
<p>I am paragraph #1</p>
<p>I am paragraph #2</p>
<p>I am paragraph #3</p>
<p>I am paragraph #4</p>
<hr>
<div>Reference:</div>
<div>[1] I am footnote #1</div>
<div>[2] I am footnote #2</div>
<div>[3] I am footnote #3</div>
<hr>
<p>Updated: 2016-04-11</p>
</body>
</html>

We want to extract the text (i.e., the footnotes) starting at the line that begins with Reference and ending just before the line that begins with Updated.

Install goquery Package

$ go get -u github.com/PuerkitoBio/goquery

Read HTML

html.go
package main

import (
	"flag"
	"fmt"
	"os"
)

func parseCommandLineArguments() string {
	pPath := flag.String("input", "", "Path of HTML file to be processed")
	flag.Parse()
	path := *pPath
	if path == "" {
		fmt.Fprintf(os.Stderr, "Error: empty input file path!\n")
		// Stop here; otherwise main would go on to open an empty path.
		os.Exit(1)
	}

	return path
}

func main() {
	inputFilePath := parseCommandLineArguments()

	f, err := os.Open(inputFilePath)
	if err != nil {
		panic("Fail to open " + inputFilePath + ": " + err.Error())
	}
	defer f.Close()

	footnoteBody := extractFootnote(f)
	fmt.Println(footnoteBody)
}
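
One way to make the argument handling easier to test is to have the helper return an error instead of writing to stderr itself. The following is only a minimal sketch under that assumption: parseArgs and the flag set name "footnote" are hypothetical and not part of the original program, and main uses a fixed argument list so the sketch runs as-is.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// parseArgs is a hypothetical variant of parseCommandLineArguments.
// It parses the given arguments with its own FlagSet and reports a
// missing -input value as an error instead of printing it.
func parseArgs(args []string) (string, error) {
	fs := flag.NewFlagSet("footnote", flag.ContinueOnError)
	path := fs.String("input", "", "Path of HTML file to be processed")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if *path == "" {
		return "", errors.New("empty input file path")
	}
	return *path, nil
}

func main() {
	// A fixed argument list keeps the sketch self-contained;
	// a real program would pass os.Args[1:] here.
	path, err := parseArgs([]string{"-input=index.html"})
	if err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
		os.Exit(1)
	}
	fmt.Println("input file:", path)
}
```

Returning the error leaves the decision to exit with the caller, which also makes the empty-path case checkable without running the whole program.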

Extract Text (Footnote)

Find all child nodes of the body element with the Contents() method, and convert each node to text with the Text() method. Feed the text to the state machine one node at a time. When the text starts with Reference, the state machine enters the InFootnote state and begins storing the text. When the text starts with Update (which also matches Updated), the state machine leaves the InFootnote state and stops storing the text. After all nodes are processed, the text stored in the state machine is the text we want.

footnote.go
package main

import (
	"os"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

const (
	InFootnote = iota
	NotInFootnote
)

type StateMachine struct {
	State        int
	FootnoteBody string
}

func NewStateMachine() *StateMachine {
	return &StateMachine{
		State: NotInFootnote,
	}
}

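// ProcessLine updates the state for one chunk of text and, while in the
// InFootnote state, appends that text to FootnoteBody.
func (s *StateMachine) ProcessLine(line string) {
	// "Reference" opens the footnote block; the marker text itself is kept.
	if strings.HasPrefix(line, "Reference") {
		s.State = InFootnote
	}

	// "Update" also matches "Updated" and closes the block; the marker text is dropped.
	if strings.HasPrefix(line, "Update") && s.State == InFootnote {
		s.State = NotInFootnote
	}

	if s.State == InFootnote {
		s.FootnoteBody += line
	}
}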

func extractFootnote(f *os.File) string {
	doc, err := goquery.NewDocumentFromReader(f)
	if err != nil {
		panic(err)
	}

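	sm := NewStateMachine()
	// Contents() returns text nodes as well as elements, so the whitespace
	// between elements (including newlines) also passes through ProcessLine.
	doc.Find("body").Contents().Each(func(_ int, s *goquery.Selection) {
		sm.ProcessLine(s.Text())
	})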

	return sm.FootnoteBody
}
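
The state machine itself does not depend on goquery. The following is a minimal standard-library sketch of the same idea, assuming the input has already been reduced to plain text with one item per line. The names state, footnoteMachine, and extractFromLines are hypothetical. It also differs from the code above in two ways: it matches the full word Updated, and it appends a newline after each stored line.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// state is a hypothetical named type for the two states.
type state int

const (
	notInFootnote state = iota // zero value: outside the footnote block
	inFootnote
)

// footnoteMachine stores every line seen while in the inFootnote state.
type footnoteMachine struct {
	st   state
	body strings.Builder
}

// processLine applies the two transitions, then stores the line if the
// machine is inside the footnote block.
func (m *footnoteMachine) processLine(line string) {
	switch {
	case strings.HasPrefix(line, "Reference"):
		m.st = inFootnote
	case strings.HasPrefix(line, "Updated") && m.st == inFootnote:
		m.st = notInFootnote
	}
	if m.st == inFootnote {
		m.body.WriteString(line)
		m.body.WriteString("\n")
	}
}

// extractFromLines feeds text to the machine line by line and returns
// the stored footnote text.
func extractFromLines(text string) string {
	var m footnoteMachine
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		m.processLine(sc.Text())
	}
	return m.body.String()
}

func main() {
	input := "I am paragraph #1\n" +
		"Reference:\n" +
		"[1] I am footnote #1\n" +
		"[2] I am footnote #2\n" +
		"Updated: 2016-04-11\n"
	fmt.Print(extractFromLines(input))
}
```

Making notInFootnote the zero value means a freshly declared footnoteMachine starts outside the footnote block without needing a constructor.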

Usage

Put the three files above (index.html, html.go, footnote.go) together in the current directory. Run the following command:

$ go run html.go footnote.go -input=index.html

Tested on: Ubuntu Linux 15.10, Go 1.6.

