Installing libxml2 on Windows and Using XPath in Python

Contents
  • Preparation
  • Installation
  • Extracting with XPath

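I wanted to use XPath to extract data (titles, body text, and so on) from web pages fetched by a crawler, so I spent a day getting familiar with Python. Today I tried installing the libxml2 module on Windows, and this is a short record of what I learned along the way.

When installing an extension module, Python can use the Setuptools helper package to install new Python packages and to manage packages that are already installed. Installers for the different platforms are available at http://pypi.python.org/pypi/setuptools; the tooling mainly consists of Python Eggs and Easy Install. After a good deal of searching online, Easy Install appears to be the more commonly used of the two, and http://peak.telecommunity.com/DevCenter/EasyInstall introduces it as follows: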

Easy Install is a python module (easy_install) bundled with setuptools that lets you automatically download, build, install, and manage Python packages.

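In other words, Easy Install is a Python module that makes it easy to install additional Python modules.

The sections below walk through preparation, installation, and configuration step by step.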

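Preparation

The required packages and their download links are listed below:

Python 2.6 (the official Python site seemed to be unreachable at the time, and I no longer remember where I downloaded it; search online for it)
libxml2-python-2.7.7.win32-py2.7.exe  (http://xmlsoft.org/sources/win32/python/libxml2-python-2.7.7.win32-py2.7.exe, directory listing: http://xmlsoft.org/sources/win32/python/; note that this filename indicates a Python 2.7 build, so check the directory listing for the build that matches your Python version)
setuptools-0.6c11.win32-py2.6.exe (http://pypi.python.org/packages/2.6/s/setuptools/setuptools-0.6c11.win32-py2.6.exe#md5=1509752c3c2e64b5d0f9589aafe053dc, download page: http://pypi.python.org/pypi/setuptools#downloads)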
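The setuptools download link above carries an `#md5=...` fragment, which can be used to check that the installer arrived intact. The sketch below is one possible way to do that, assuming a current Python 3 interpreter (not the Python 2.6 setup described here); the function names `expected_md5`, `file_md5`, and `matches_published_md5` are made up for illustration.

```python
import hashlib
from urllib.parse import urlparse


def expected_md5(download_url):
    """Return the hex digest from a '#md5=...' URL fragment, or None."""
    fragment = urlparse(download_url).fragment
    if fragment.startswith("md5="):
        return fragment[len("md5="):].lower()
    return None


def file_md5(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, download_url):
    """True when the file's MD5 equals the digest embedded in the URL."""
    expected = expected_md5(download_url)
    return expected is not None and file_md5(path) == expected
```

Calling `matches_published_md5` with the path of the downloaded installer and the full download URL returns `True` only when the two digests agree. An MD5 match guards against a corrupted download, not against deliberate tampering.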

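Installation

Step 1: Install Python

On Windows, simply run the installer's .exe file; no further detail is needed here.

Step 2: Install the Easy Install tool

Python must already be installed. Then run the setuptools-0.6c11.win32-py2.6.exe installer prepared above; it locates the Python installation directory automatically and installs the tools into the corresponding directory. For example, my scripts directory is E:\Program Files\Python26\Scripts. To verify: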

    E:\>cd E:\Program Files\Python26\Scripts

    E:\Program Files\Python26\Scripts>easy_install --help
    Global options:
      --verbose (-v)  run verbosely (default)
      --quiet (-q)    run quietly (turns verbosity off)
      --dry-run (-n)  don't actually do anything
      --help (-h)     show detailed help message

    Options for 'easy_install' command:
      --prefix                       installation prefix
      --zip-ok (-z)                  install package as a zipfile
      --multi-version (-m)           make apps have to require() a version
      --upgrade (-U)                 force upgrade (searches PyPI for latest
                                     versions)
      --install-dir (-d)             install package to DIR
      --script-dir (-s)              install scripts to DIR
      --exclude-scripts (-x)         Don't install scripts
      --always-copy (-a)             Copy all needed packages to install dir
      --index-url (-i)               base URL of Python Package Index
      --find-links (-f)              additional URL(s) to search for packages
      --delete-conflicting (-D)      no longer needed; don't use this
      --ignore-conflicts-at-my-risk  no longer needed; don't use this
      --build-directory (-b)         download/extract/build in DIR; keep the
                                     results
      --optimize (-O)                also compile with optimization: -O1 for
                                     "python -O", -O2 for "python -OO", and -O0 to
                                     disable [default: -O0]
      --record                       filename in which to record list of installed
                                     files
      --always-unzip (-Z)            don't install as a zipfile, no matter what
      --site-dirs (-S)               list of directories where .pth files work
      --editable (-e)                Install specified packages in editable form
      --no-deps (-N)                 don't install dependencies
      --allow-hosts (-H)             pattern(s) that hostnames must match
      --local-snapshots-ok (-l)      allow building eggs from local checkouts

    usage: easy_install-script.py [options] requirement_or_url ...
       or: easy_install-script.py --help

If the easy_install command options shown above are printed, the installation succeeded.
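If the scripts directory is not where you expect it, the interpreter itself can report it. The following is a minimal sketch, assuming a current Python 3 interpreter (the `sysconfig` module is not available in Python 2.6); `scripts_dir` and `has_script` are hypothetical helper names used only for this illustration.

```python
import os
import sysconfig


def scripts_dir():
    """Return the directory where this interpreter installs scripts."""
    return sysconfig.get_path("scripts")


def has_script(name):
    """True if the scripts directory holds a file whose base name is `name`."""
    folder = scripts_dir()
    if not os.path.isdir(folder):
        return False
    # Compare without the extension so that easy_install.exe matches "easy_install".
    return any(os.path.splitext(entry)[0] == name for entry in os.listdir(folder))


if __name__ == "__main__":
    print(scripts_dir())
    print(has_script("easy_install"))
```

On a Windows installation the reported directory is typically the `Scripts` folder under the Python installation directory, like the one used in this article.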

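Step 3: Install libxml2

Install libxml2 by running libxml2-python-2.7.7.win32-py2.7.exe. This installer only unpacks the corresponding modules into the Python directory; the extraction code in this article relies on the lxml library, which still has to be installed through Easy Install. lxml is a Python binding to the C libraries libxml2 and libxslt, which makes parsing and processing HTML or XML fast; see http://lxml.de/index.html for a detailed introduction. Installing lxml requires the Easy Install script. For example, my scripts directory is E:\Program Files\Python26\Scripts, so I ran the installation from there:

    E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2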

The installation output looks like this:

    E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2
    Searching for lxml==2.2.2
    Reading http://pypi.python.org/simple/lxml/
    Reading http://codespeak.net/lxml
    Best match: lxml 2.2.2
    Downloading http://pypi.python.org/packages/2.6/l/lxml/lxml-2.2.2-py2.6-win32.egg#md5=dc73ae17e486037580371077efdc13e9
    Processing lxml-2.2.2-py2.6-win32.egg
    creating e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg
    Extracting lxml-2.2.2-py2.6-win32.egg to e:\program files\python26\lib\site-packages
    Adding lxml 2.2.2 to easy-install.pth file

    Installed e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg
    Processing dependencies for lxml==2.2.2
    Finished processing dependencies for lxml==2.2.2
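Once the installer reports success, it can be worth confirming from Python that the package really is importable. The sketch below is one way to check, assuming a current Python 3 interpreter (`importlib.metadata` does not exist in Python 2.6); `is_importable` and `installed_version` are hypothetical helper names.

```python
import importlib.metadata


def is_importable(module_name):
    """Return True if `import module_name` would succeed."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("lxml importable:", is_importable("lxml"))
    print("lxml version:", installed_version("lxml"))
```

When lxml is importable, `lxml.etree.LXML_VERSION` and `lxml.etree.LIBXML_VERSION` additionally report the versions of lxml and of the libxml2 library it runs against.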

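Extracting with XPath

Next, we use XPath to extract data from a web page. I used EasyEclipse for Python (Version: 1.3.1) as the Python IDE; it can create a Pydev Project directly. See its documentation for details on usage.

The following Python code verifies that XPath can be used to extract targeted data from a web page: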

    import codecs

    from lxml import etree


    def read_file(path, encoding):
        # Read the whole file and decode it to a Unicode string.
        # Errors are no longer swallowed, so a missing file fails loudly.
        with codecs.open(path, 'r', encoding) as f:
            return f.read()


    def extract(path, encoding, xpath):
        # Parse the HTML and return the nodes matched by the XPath expression.
        tree = etree.HTML(read_file(path, encoding))
        return tree.xpath(xpath)


    if __name__ == '__main__':
        sections = extract('peak.txt', 'utf-8', "//h3//a[@class='toc-backref']")
        for title in sections:
            print(title.text)

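First, download the source of the page http://peak.telecommunity.com/DevCenter/EasyInstall and save it to the file peak.txt, encoded as UTF-8.

Then the script reads the file, uses XPath to extract the title of each section on the page, and prints the titles to the console. The result is as follows: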

    Troubleshooting
    Windows Notes
    Multiple Python Versions
    Restricting Downloads with 
    Installing on Un-networked Machines
    Packaging Others' Projects As Eggs
    Creating your own Package Index
    Password-Protected Sites
    Controlling Build Options
    Editing and Viewing Source Packages
    Dealing with Installation Conflicts
    Compressed Installation
    Administrator Installation
    Mac OS X "User" Installation
    Creating a "Virtual" Python
    "Traditional" 
    Backward Compatibility
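Some titles in this output stop short (for example "Restricting Downloads with"). The likely reason is that an element's `.text` holds only the text before its first child element, so anything inside nested markup is left out. The sketch below illustrates this with the standard library's `xml.etree.ElementTree` as a stand-in, since lxml elements follow the same ElementTree API for `.text` and `itertext()`. The markup in `SNIPPET` is hand-written and hypothetical, not copied from the real page, and `heading_texts` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for two headings of the downloaded page.
# This markup is an assumption made for illustration only.
SNIPPET = (
    '<div>'
    '<h3><a class="toc-backref" href="#id1">'
    'Restricting Downloads with <tt>--allow-hosts</tt>'
    '</a></h3>'
    '<h3><a class="toc-backref" href="#id2">Windows Notes</a></h3>'
    '</div>'
)


def heading_texts(markup):
    """Return (text_before_first_child, full_text) for each matching link."""
    root = ET.fromstring(markup)
    # ElementTree supports only a subset of XPath, hence the leading "."
    links = root.findall(".//h3//a[@class='toc-backref']")
    return [(a.text, "".join(a.itertext())) for a in links]


if __name__ == "__main__":
    for short, full in heading_texts(SNIPPET):
        print(repr(short), "->", repr(full))
```

With lxml, the same full text can be obtained with `"".join(title.itertext())`, or by evaluating the XPath `string()` function on the element.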

If you are familiar enough with XPath, then with the help of libxml2 you can extract any content you want from a web page.
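The peak.txt file used in the extraction step was saved by hand, but it could also be produced by a short script. The following is a sketch under stated assumptions: it targets a current Python 3 interpreter (Python 2.6 would need `urllib2` instead), `save_page` is a hypothetical helper name, and the page at the URL above may no longer be reachable.

```python
import io
from urllib.request import urlopen


def save_page(url, path, fallback_encoding="utf-8"):
    """Download `url` and store its text at `path`, encoded as UTF-8.

    Returns the number of characters written.
    """
    with urlopen(url) as response:
        raw = response.read()
        # Use the charset declared in the response headers when there is one.
        charset = response.headers.get_content_charset() or fallback_encoding
    text = raw.decode(charset, "replace")
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(text)
    return len(text)


if __name__ == "__main__":
    save_page("http://peak.telecommunity.com/DevCenter/EasyInstall", "peak.txt")
```

Writing the file as UTF-8 regardless of the source encoding keeps it consistent with the 'utf-8' argument passed to the extraction script.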
