Parsing HTML with Python to extract data and generate a Word document
Introduction
Today I tried using Python to scrape web page content and generate a Word document. The functionality is very simple; I am recording it here for future reference.
The third-party package python-docx is used to generate the Word file, so it must be installed first. Because Python on Windows does not include the setuptools module by default, you must install setuptools before python-docx.
Install
1. Download https://bootstrap.pypa.io/ez_setup.py, save the script locally, and run: python ez_setup.py
2. Download python-docx (https://pypi.python.org/pypi/python-docx/0.7.4), unzip it, change into XXX\python-docx-0.7.4, and install it with: python setup.py install
With python-docx installed, you can use it to create Word documents; see the documentation at https://python-docx.readthedocs.org/en/latest/index.html for how to build a document.
HTML parsing uses SGMLParser from the sgmllib module, and the URL content is fetched with urllib and urllib2.
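Note that sgmllib and urllib2 exist only in Python 2; both were removed in Python 3, where html.parser.HTMLParser plays the role of SGMLParser. As a minimal sketch of the same subclassing pattern used below (a flag toggled by a target div, collecting the hrefs inside it), assuming a made-up class name and class attribute value:

```python
# Python 3 sketch: HTMLParser as a stand-in for SGMLParser.
# "LinkCollector" and the "target" class value are illustrative,
# not from the original page.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.inside = False   # True while inside the target <div>
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "target":
            self.inside = True
        elif tag == "a" and self.inside and "href" in attrs:
            self.urls.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

html = '<div class="target"><a href="/a">A</a><a href="/b">B</a></div>'
parser = LinkCollector()
parser.feed(html)
print(parser.urls)  # → ['/a', '/b']
```

The original code uses the same idea: start_div sets a flag when the marker div is seen, and start_a only records links while the flag is set.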
Implementation Code
# -*- coding: cp936 -*-
from sgmllib import SGMLParser
import os
import sys
import urllib
import urllib2
from docx import Document
from docx.shared import Inches
import time

# Collect the urls to be parsed
class GetUrl(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.start = False
        self.urlArr = []

    def start_div(self, attr):
        for name, value in attr:
            if value == "ChairmanCont Bureau":  # fixed value in the page js
                self.start = True

    def end_div(self):
        self.start = False

    def start_a(self, attr):
        if self.start:
            for name, value in attr:
                self.urlArr.append(value)

    def getUrlArr(self):
        return self.urlArr

# Parse each obtained url and collect the useful data
class getManInfo(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.start = False
        self.p = False
        self.dl = False
        self.manInfo = []
        self.subInfo = []

    def start_div(self, attr):
        for name, value in attr:
            if value == "SpeakerInfo":  # fixed value in the page js
                self.start = True

    def end_div(self):
        self.start = False

    def start_p(self, attr):
        if self.dl:
            self.p = True

    def end_p(self):
        self.p = False

    def start_img(self, attr):
        if self.dl:
            for name, value in attr:
                self.subInfo.append(value)

    def handle_data(self, data):
        if self.p:
            self.subInfo.append(data.decode('utf-8'))

    def start_dl(self, attr):
        if self.start:
            self.dl = True

    def end_dl(self):
        self.manInfo.append(self.subInfo)
        self.subInfo = []
        self.dl = False

    def getManInfo(self):
        return self.manInfo

urlSource = "http://www.XXX"
sourceData = urllib2.urlopen(urlSource).read()

startTime = time.clock()
# get urls
getUrl = GetUrl()
getUrl.feed(sourceData)
urlArr = getUrl.getUrlArr()
getUrl.close()
print "get url use:" + str(time.clock() - startTime)

startTime = time.clock()
# get maninfos
manInfos = getManInfo()
for url in urlArr:  # one url per person
    data = urllib2.urlopen(url).read()
    manInfos.feed(data)
infos = manInfos.getManInfo()
manInfos.close()
print "get maninfos use:" + str(time.clock() - startTime)

startTime = time.clock()
# word
saveFile = os.getcwd() + "\\xxx.docx"
doc = Document()
# word title
doc.add_heading("HEAD".decode('gbk'), 0)
p = doc.add_paragraph("HEADCONTENT:".decode('gbk'))
# write info
for infoArr in infos:
    i = 0
    for info in infoArr:
        if i == 0:  # img url
            arr1 = info.split('.')
            suffix = arr1[len(arr1) - 1]
            arr2 = info.split('/')
            preffix = arr2[len(arr2) - 2]
            imgFile = os.getcwd() + "\\imgs\\" + preffix + "." + suffix
            if not os.path.exists(os.getcwd() + "\\imgs"):
                os.mkdir(os.getcwd() + "\\imgs")
            imgData = urllib2.urlopen(info).read()
            try:
                f = open(imgFile, 'wb')
                f.write(imgData)
                f.close()
                doc.add_picture(imgFile, width=Inches(1.25))
                os.remove(imgFile)
            except Exception as err:
                print(err)
        elif i == 1:
            doc.add_heading(info + ":", level=1)
        else:
            doc.add_paragraph(info, style='ListBullet')
        i = i + 1
doc.save(saveFile)
print "word use:" + str(time.clock() - startTime)
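The image-saving step above derives the local filename by splitting the URL on '.' and '/' by hand. The same result can be had more robustly with posixpath, which handles URL-style paths regardless of platform. A small sketch, assuming a made-up example URL:

```python
# Alternative to the manual suffix/preffix splitting: use posixpath to
# take the parent-folder name plus the original extension, mirroring
# the naming scheme in the code above. The URL here is made up.
import posixpath

def local_image_name(url):
    # strip the scheme and host, keep only the path part
    path = url.split("://", 1)[-1]
    folder, filename = posixpath.split(path)
    base, ext = posixpath.splitext(filename)   # ext includes the dot
    parent = posixpath.basename(folder)
    return parent + ext

print(local_image_name("http://www.example.com/imgs/person1/photo.jpg"))
# → person1.jpg
```

Unlike split('.'), posixpath.splitext is not confused by dots elsewhere in the URL (for example in the host name).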
Summary
That is the whole of this example of parsing HTML to extract data and generating a Word document. I hope it is helpful to you; if you are interested, you can continue with the related topics on this site. If anything is lacking, please leave a comment. Thank you for your support!