Handling Non-UTF-8 HTML Web Pages with goquery
goquery only handles UTF-8 encoded web pages. The goquery wiki [1] describes a way to handle non-UTF-8 HTML pages when the character encoding (charset) of the page is known: use iconv to convert the content to UTF-8 before parsing. I rewrote the code from the wiki to make it more modular.
Install goquery and the Go iconv binding:
$ go get -u github.com/PuerkitoBio/goquery
$ go get -u github.com/djimenez/iconv-go
Source code
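package main

import (
	"net/http"

	"github.com/PuerkitoBio/goquery"
	iconv "github.com/djimenez/iconv-go"
)

// NewDocumentFromNonUtf8Url fetches url, converts the response body from
// charset to UTF-8, and parses the result into a goquery document.
func NewDocumentFromNonUtf8Url(url, charset string) (doc *goquery.Document, err error) {
	resp, err := http.Get(url)
	if err != nil {
		return
	}
	defer resp.Body.Close()

	// Wrap the body in a reader that converts from charset to UTF-8.
	utfBody, err := iconv.NewReader(resp.Body, charset, "utf-8")
	if err != nil {
		return
	}

	doc, err = goquery.NewDocumentFromReader(utfBody)
	return
}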
Example: Read Big5 webpage
func main() {
	doc, err := NewDocumentFromNonUtf8Url("http://shenfang.com.tw/product.htm", "big5")
	if err != nil {
		panic(err)
	}
	// do something with the doc
}
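The function above needs the charset to be known in advance. When it is not, one place to look is the Content-Type header of the HTTP response. The following is a minimal sketch using only the standard library; charsetFromContentType is a hypothetical helper name, not part of goquery or iconv-go, and it assumes the server actually declares a charset in that header.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType is a hypothetical helper (not part of goquery or
// iconv-go). It returns the lowercased charset parameter of a Content-Type
// header value, or "" if the value is malformed or names no charset.
func charsetFromContentType(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

func main() {
	// Header values as they might appear in resp.Header.Get("Content-Type").
	fmt.Println(charsetFromContentType("text/html; charset=Big5")) // big5
	fmt.Println(charsetFromContentType("text/html") == "")         // true
}
```

A non-empty result could be passed as the charset argument of NewDocumentFromNonUtf8Url. Many pages declare their charset only in a meta tag inside the HTML, so an empty result does not mean the page is UTF-8.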
Tested on: Ubuntu Linux 18.04, Go 1.10.1.
References:
[1] Tips and tricks · PuerkitoBio/goquery Wiki · GitHub