[Golang] Extract Text via State Machine and goquery
Introduction
Extract text (in this example, the footnotes) from HTML with a state machine and goquery in Go (Golang).
Assume we have the following HTML:
<!DOCTYPE html>
<html>
<head><title>[Golang] Extract Footnote via State Machine</title></head>
<body>
<p>I am paragraph #1</p>
<p>I am paragraph #2</p>
<p>I am paragraph #3</p>
<p>I am paragraph #4</p>
<hr>
<div>Reference:</div>
<div>[1] I am footnote #1</div>
<div>[2] I am footnote #2</div>
<div>[3] I am footnote #3</div>
<hr>
<p>Updated: 2016-04-11</p>
</body>
</html>

We want to extract the text (the footnotes) that starts at the line beginning with Reference and ends just before the line beginning with Updated.
Install goquery Package
$ go get -u github.com/PuerkitoBio/goquery
Read HTML
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseCommandLineArguments returns the value of the -input flag.
// It exits with a non-zero status if the path is empty.
func parseCommandLineArguments() string {
	pPath := flag.String("input", "", "Path of HTML file to be processed")
	flag.Parse()
	path := *pPath
	if path == "" {
		fmt.Fprintf(os.Stderr, "Error: empty input file path!\n")
		os.Exit(1)
	}
	return path
}

func main() {
	inputFilePath := parseCommandLineArguments()

	f, err := os.Open(inputFilePath)
	if err != nil {
		panic("Failed to open " + inputFilePath + ": " + err.Error())
	}
	defer f.Close()

	// extractFootnote is defined in the next section.
	footnoteBody := extractFootnote(f)
	fmt.Println(footnoteBody)
}
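Argument handling can also be written as a function that returns an error, which makes it easier to test than code that reads the global flag set. The following is only a minimal sketch of that alternative and is not part of the program above; the helper name parseInputPath and the flag set name "footnote" are hypothetical and chosen for illustration.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// parseInputPath is a hypothetical helper. It parses args (without the
// program name) and returns the value of -input, or an error if the
// flag is missing, empty, or cannot be parsed.
func parseInputPath(args []string) (string, error) {
	fs := flag.NewFlagSet("footnote", flag.ContinueOnError)
	path := fs.String("input", "", "Path of HTML file to be processed")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if *path == "" {
		return "", errors.New("empty input file path")
	}
	return *path, nil
}

func main() {
	path, err := parseInputPath(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
		os.Exit(1)
	}
	fmt.Println("input file:", path)
}
```

With this shape, main decides how to report the error, and the parsing logic can be called with any slice of arguments.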
Extract Text (Footnote)
Find all child nodes of the body element in the HTML document (Contents() also returns the text nodes between elements). Convert each child to text with the Text() method and feed the results to the state machine one at a time. If the text starts with Reference, the state machine enters the InFootnote state and starts storing the text. If the text starts with Update (which matches Updated in the sample HTML), the state machine leaves the InFootnote state and stops storing the text. When all children have been processed, the text stored in the state machine is the text we want.
package main

import (
	"os"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// States of the state machine.
const (
	InFootnote = iota
	NotInFootnote
)

// StateMachine stores the text seen while in the InFootnote state.
type StateMachine struct {
	State        int
	FootnoteBody string
}

// NewStateMachine returns a state machine that starts outside the footnote.
func NewStateMachine() *StateMachine {
	return &StateMachine{
		State: NotInFootnote,
	}
}

// ProcessLine updates the state from the prefix of line and stores
// the line if the machine is in the InFootnote state.
func (s *StateMachine) ProcessLine(line string) {
	if strings.HasPrefix(line, "Reference") {
		s.State = InFootnote
	}

	if strings.HasPrefix(line, "Update") && s.State == InFootnote {
		s.State = NotInFootnote
	}

	if s.State == InFootnote {
		s.FootnoteBody += line
	}
}

// extractFootnote parses the HTML in f and returns the footnote text.
func extractFootnote(f *os.File) string {
	doc, err := goquery.NewDocumentFromReader(f)
	if err != nil {
		panic(err)
	}

	sm := NewStateMachine()
	// Contents() yields element and text nodes, so the whitespace
	// between elements is passed to the state machine as well.
	doc.Find("body").Contents().Each(func(_ int, s *goquery.Selection) {
		sm.ProcessLine(s.Text())
	})
	return sm.FootnoteBody
}
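The transitions are easier to follow when the state machine is fed plain strings instead of goquery selections. The following is a minimal sketch that uses only the standard library; the state type, the extract helper, and the sample lines are hypothetical names introduced for illustration. It assumes each string is the text of one child of body, and unlike the listing above it appends a newline after every stored line rather than relying on whitespace text nodes.

```go
package main

import (
	"fmt"
	"strings"
)

// state is a hypothetical typed version of the two states.
type state int

const (
	notInFootnote state = iota // zero value: outside the footnote block
	inFootnote
)

// extract runs the two-state machine over lines and returns the lines
// seen from the one starting with "Reference" up to, but not including,
// the one starting with "Update".
func extract(lines []string) string {
	var body strings.Builder
	st := notInFootnote
	for _, line := range lines {
		switch {
		case strings.HasPrefix(line, "Reference"):
			st = inFootnote
		case st == inFootnote && strings.HasPrefix(line, "Update"):
			st = notInFootnote
		}
		if st == inFootnote {
			body.WriteString(line)
			body.WriteString("\n")
		}
	}
	return body.String()
}

func main() {
	lines := []string{
		"I am paragraph #4",
		"Reference:",
		"[1] I am footnote #1",
		"[2] I am footnote #2",
		"Updated: 2016-04-11",
	}
	fmt.Print(extract(lines))
}
```

Using strings.Builder avoids repeated string concatenation, and making notInFootnote the zero value means an uninitialized state starts outside the footnote block.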
Usage
Put the three files above (index.html, html.go, footnote.go) in the current directory and run the following command:
$ go run html.go footnote.go -input=index.html
Tested on: Ubuntu Linux 15.10, Go 1.6.