PARSE XML/HTML FROM A FILE
This post gives a real-world example of how to parse and retrieve data from
an XML/HTML file using Python's xml.dom.minidom library. The following
is an XML file that contains explanations of the Pāli word abbhāna from several dictionaries.
We want to parse the file and extract the information.
example.xml:
<?xml version="1.0" encoding="utf-8"?>
<cd>
<item>
<dict> ◎ 《汉译パーリ语辞典》 黃秉榮譯 词数 7735.</dict>
<word> abbhāna</word>
<explain> %3a%6e%2e%20%5b%61%62%68%69%2d%c4%81%79%c4%81%6e%61%5d%20%e5%87%ba%e7%bd%aa%2c%20%e5%ae%b9%e8%a8%b1%2c%20%e5%be%a9%e6%ad%b8%28%e6%81%a2%e5%be%a9%e5%8e%9f%e7%8b%80%29%2e</explain>
</item>
<item>
<dict> ◎ 《パーリ语辞典》 日本水野弘元教授 词数 13772.</dict>
<word> abbhāna</word>
<explain> %3a%6e%2e%20%5b%61%62%68%69%2d%c4%81%79%c4%81%6e%61%5d%20%e5%87%ba%e7%bd%aa%2c%20%e8%a8%b1%e5%ae%b9%2c%20%e5%be%a9%e5%b8%b0%2e</explain>
</item>
<item>
<dict> ◎ 《巴汉词典》 明法尊者增订</dict>
<word> Abbhāna</word>
<explain> %2c%20%28%61%62%68%69%20%2b%20%c4%81%79%61%6e%61%20%6f%66%20%c4%81%20%2b%20%79%c4%81%20%28%69%29%29%2c%e3%80%90%e4%b8%ad%e3%80%91%e5%a4%8d%e5%bd%92%28%e6%af%94%e4%b8%98%e8%ba%ab%e4%bb%bd%29%28%63%6f%6d%69%6e%67%20%62%61%63%6b%2c%20%72%65%68%61%62%69%6c%69%74%61%74%69%6f%6e%20%6f%66%20%61%20%62%68%69%6b%6b%68%75%20%77%68%6f%20%68%61%73%20%75%6e%64%65%72%67%6f%6e%65%20%61%20%70%65%6e%61%6e%63%65%20%66%6f%72%20%61%6e%20%65%78%70%69%61%62%6c%65%20%6f%66%66%65%6e%63%65%29%e3%80%82</explain>
</item>
<item>
<dict> ◎ 《PTS Pali-English dictionary》 The Pali Text Society's Pali-English dictionary</dict>
<word> Abbhāna</word>
<explain> %2c%28%6e%74%2e%29%20%5b%61%62%68%69%20%2b%20%c4%81%79%61%6e%61%20%6f%66%20%c4%81%20%2b%20%3c%65%6d%3e%79%c4%81%3c%2f%65%6d%3e%3c%69%3e%20%28%3c%2f%69%3e%3c%65%6d%3e%69%3c%2f%65%6d%3e%3c%69%3e%29%3c%2f%69%3e%5d%20%63%6f%6d%69%6e%67%20%62%61%63%6b%2c%20%72%65%68%61%62%69%6c%69%74%61%74%69%6f%6e%20%6f%66%20%61%20%62%68%69%6b%6b%68%75%20%77%68%6f%20%68%61%73%20%75%6e%64%65%72%67%6f%6e%65%20%61%20%70%65%6e%61%6e%63%65%20%66%6f%72%20%61%6e%20%65%78%70%69%61%62%6c%65%20%6f%66%66%65%6e%63%65%20%56%69%6e%2e%49%2c%34%39%20%28%c2%b0%c3%a2%72%61%68%61%29%2c%20%35%33%20%28%69%64%2e%29%2c%20%31%34%33%2c%20%33%32%37%3b%20%49%49%2c%33%33%2c%20%34%30%2c%20%31%36%32%3b%20%41%2e%49%2c%39%39%2e%20%2d%2d%20%43%70%2e%20%3c%69%3e%61%62%62%68%65%74%69%3c%2f%69%3e%2e%20%28%50%61%67%65%20%36%30%29</explain>
</item>
</cd>
The following Python script parses the XML file above. main() first parses the
file (line 21) and then collects the item elements by calling
getElementsByTagName (line 23). Each item is passed to decodeItem, which reads
the text node of the dict, word, and explain elements (lines 11 to 13) and
prints the three values (lines 15 to 17). The code is straightforward.
minidom-howto-7.py:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import xml.dom.minidom


def decodeItem(item):
    dict_node = item.getElementsByTagName("dict")[0]  # avoid shadowing the built-in dict
    word = item.getElementsByTagName("word")[0]
    explain = item.getElementsByTagName("explain")[0]

    dictstr = dict_node.childNodes[0].data  # the first child is the text node
    wordstr = word.childNodes[0].data
    explainstr = explain.childNodes[0].data

    print("dict: %s" % dictstr)
    print("word: %s" % wordstr)
    print("explain: %s" % explainstr)


def main():
    dom = xml.dom.minidom.parse("example.xml")  # parse the whole file into a DOM tree

    items = dom.getElementsByTagName("item")  # every <item>, in document order
    for item in items:
        decodeItem(item)


if __name__ == '__main__':
    main()
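As a variation on the script above, the sketch below returns each item as a dictionary instead of printing it, and joins all text children so that an empty element gives an empty string rather than an IndexError. The helper names text_of and item_to_record and the shortened SAMPLE document are hypothetical, made up for illustration; only parseString, getElementsByTagName, childNodes, nodeType, and data come from xml.dom.minidom.

```python
import xml.dom.minidom

# Hypothetical, shortened sample in the same shape as example.xml.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<cd>
<item>
<dict>Dictionary A</dict>
<word>abbhāna</word>
<explain>%3a%6e%2e</explain>
</item>
<item>
<dict>Dictionary B</dict>
<word>Abbhāna</word>
<explain></explain>
</item>
</cd>"""


def text_of(parent, tag):
    """Return the stripped text of the first <tag> under parent, or ""."""
    elements = parent.getElementsByTagName(tag)
    if not elements:
        return ""
    # Join every text/CDATA child; an empty element has no child nodes at all.
    parts = [node.data for node in elements[0].childNodes
             if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)]
    return "".join(parts).strip()


def item_to_record(item):
    """Convert one <item> element into a plain dictionary."""
    return {tag: text_of(item, tag) for tag in ("dict", "word", "explain")}


# Parse from bytes so the encoding declaration in the prolog is honoured.
dom = xml.dom.minidom.parseString(SAMPLE.encode("utf-8"))
records = [item_to_record(item) for item in dom.getElementsByTagName("item")]
for record in records:
    print(record)
```

Returning records keeps parsing separate from output, so the same data can be printed, filtered, or written elsewhere. To read the real file, xml.dom.minidom.parse("example.xml") can be swapped in for the parseString call.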
Running minidom-howto-7.py on example.xml prints:
dict: ◎ 《汉译パーリ语辞典》 黃秉榮譯 词数 7735.
word: abbhāna
explain: %3a%6e%2e%20%5b%61%62%68%69%2d%c4%81%79%c4%81%6e%61%5d%20%e5%87%ba%e7%bd%aa%2c%20%e5%ae%b9%e8%a8%b1%2c%20%e5%be%a9%e6%ad%b8%28%e6%81%a2%e5%be%a9%e5%8e%9f%e7%8b%80%29%2e
dict: ◎ 《パーリ语辞典》 日本水野弘元教授 词数 13772.
word: abbhāna
explain: %3a%6e%2e%20%5b%61%62%68%69%2d%c4%81%79%c4%81%6e%61%5d%20%e5%87%ba%e7%bd%aa%2c%20%e8%a8%b1%e5%ae%b9%2c%20%e5%be%a9%e5%b8%b0%2e
dict: ◎ 《巴汉词典》 明法尊者增订
word: Abbhāna
explain: %2c%20%28%61%62%68%69%20%2b%20%c4%81%79%61%6e%61%20%6f%66%20%c4%81%20%2b%20%79%c4%81%20%28%69%29%29%2c%e3%80%90%e4%b8%ad%e3%80%91%e5%a4%8d%e5%bd%92%28%e6%af%94%e4%b8%98%e8%ba%ab%e4%bb%bd%29%28%63%6f%6d%69%6e%67%20%62%61%63%6b%2c%20%72%65%68%61%62%69%6c%69%74%61%74%69%6f%6e%20%6f%66%20%61%20%62%68%69%6b%6b%68%75%20%77%68%6f%20%68%61%73%20%75%6e%64%65%72%67%6f%6e%65%20%61%20%70%65%6e%61%6e%63%65%20%66%6f%72%20%61%6e%20%65%78%70%69%61%62%6c%65%20%6f%66%66%65%6e%63%65%29%e3%80%82
dict: ◎ 《PTS Pali-English dictionary》 The Pali Text Society's Pali-English dictionary
word: Abbhāna
explain: %2c%28%6e%74%2e%29%20%5b%61%62%68%69%20%2b%20%c4%81%79%61%6e%61%20%6f%66%20%c4%81%20%2b%20%3c%65%6d%3e%79%c4%81%3c%2f%65%6d%3e%3c%69%3e%20%28%3c%2f%69%3e%3c%65%6d%3e%69%3c%2f%65%6d%3e%3c%69%3e%29%3c%2f%69%3e%5d%20%63%6f%6d%69%6e%67%20%62%61%63%6b%2c%20%72%65%68%61%62%69%6c%69%74%61%74%69%6f%6e%20%6f%66%20%61%20%62%68%69%6b%6b%68%75%20%77%68%6f%20%68%61%73%20%75%6e%64%65%72%67%6f%6e%65%20%61%20%70%65%6e%61%6e%63%65%20%66%6f%72%20%61%6e%20%65%78%70%69%61%62%6c%65%20%6f%66%66%65%6e%63%65%20%56%69%6e%2e%49%2c%34%39%20%28%c2%b0%c3%a2%72%61%68%61%29%2c%20%35%33%20%28%69%64%2e%29%2c%20%31%34%33%2c%20%33%32%37%3b%20%49%49%2c%33%33%2c%20%34%30%2c%20%31%36%32%3b%20%41%2e%49%2c%39%39%2e%20%2d%2d%20%43%70%2e%20%3c%69%3e%61%62%62%68%65%74%69%3c%2f%69%3e%2e%20%28%50%61%67%65%20%36%30%29
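The explain values printed above look like percent-encoded UTF-8 text. The post does not say so, so treat this as an assumption based on the data. Under that assumption, a minimal sketch for turning one value back into readable text with the standard library's urllib.parse.unquote could look like this (the variable names are illustrative only):

```python
from urllib.parse import unquote

# Opening portion of the first explain value from the output above.
encoded = "%3a%6e%2e%20%5b%61%62%68%69%2d%c4%81%79%c4%81%6e%61%5d"

# unquote() replaces each %xx escape and decodes the bytes as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)  # :n. [abhi-āyāna]
```

The last entry decodes to text that contains HTML tags such as <em> and <i>, so the decoded string may need a further pass before it is displayed as plain text.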
Reference: MiniDom - Python Wiki