[Golang] Determine Encoding of HTML Document
Given a URL, determine the encoding of the HTML document in Go using the golang.org/x/net/html and golang.org/x/text packages. I came across the code snippet in [1], so I extracted and reorganized it to make it easier to find via search engines.
Install the packages first:
$ go get -u golang.org/x/text
$ go get -u golang.org/x/net/html
The following code shows how to determine the encoding of an HTML document given the URL:
package guess

import (
	"bufio"
	"fmt"
	"io"
	"net/http"

	"golang.org/x/net/html/charset"
	"golang.org/x/text/encoding"
)

// UrlEncoding fetches url and guesses the encoding of the returned HTML.
func UrlEncoding(url string) (name string, certain bool, err error) {
	resp, err := http.Get(url)
	if err != nil {
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		err = fmt.Errorf("response status code: %d", resp.StatusCode)
		return
	}

	_, name, certain, err = DetermineEncodingFromReader(resp.Body)
	return
}

// DetermineEncodingFromReader guesses the encoding from the first 1024 bytes of r.
func DetermineEncodingFromReader(r io.Reader) (e encoding.Encoding, name string, certain bool, err error) {
	// Peek returns io.EOF when the document is shorter than 1024 bytes.
	// The bytes read so far are still usable, so only other errors are fatal.
	head, err := bufio.NewReader(r).Peek(1024)
	if err != nil && err != io.EOF {
		return
	}
	err = nil

	e, name, certain = charset.DetermineEncoding(head, "")
	return
}
Usage of the above code:
package guess

import (
	"testing"
)

func TestUrlEncoding(t *testing.T) {
	name, _, err := UrlEncoding("http://shenfang.com.tw/")
	if err != nil {
		t.Error(err)
		return
	}
	if name != "big5" {
		t.Error("bad guess!")
		return
	}

	name, _, err = UrlEncoding("https://siongui.github.io/")
	if err != nil {
		t.Error(err)
		return
	}
	if name != "utf-8" {
		t.Error("bad guess!")
		return
	}
}
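The detection above only inspects the body, because an empty content type is passed to charset.DetermineEncoding. Servers often declare the encoding in the Content-Type header as well, and that value (for example resp.Header.Get("Content-Type")) could be passed as the second argument instead. As a rough stdlib-only sketch of what such a header carries, the hypothetical helper headerCharset below pulls out the charset parameter. The helper name and the sample header values are assumptions for illustration.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// headerCharset (hypothetical helper) extracts the charset parameter from a
// Content-Type header value, lowercased. It returns an empty string when the
// header is missing, cannot be parsed, or declares no charset.
func headerCharset(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	fmt.Println(headerCharset("text/html; charset=Big5")) // prints: big5
	fmt.Println(headerCharset("text/html") == "")         // prints: true
}
```

A header-declared charset is only a hint. Pages can still be mislabeled, so treating it as one input alongside the body-based guess seems the safer design.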
If you want to convert non-UTF-8 HTML to UTF-8, see [3].
Tested on: Ubuntu 18.04, Go 1.11.1
References:
[1]
[2] A small crawler written in Go with /x/net/html for scraping novels - 简书 (Jianshu)
[3] [Golang] Auto-Detect and Convert Encoding of HTML to UTF-8