Install libxml2 on Windows and use XPath in Python

Source: Internet
Author: User
Tags readfile
Document directory
  • Preparation
  • Install
  • Use XPath for Extraction

To use XPath to extract data (such as the title and body) from web pages captured by crawlers, I spent a day getting familiar with the Python language. Today I tried installing the libxml2 module on Windows, and this article records what I learned.

Python extension modules can be installed and managed with an auxiliary toolkit, setuptools, which provides eggs and Easy Install. After searching the Internet, I found that Easy Install is what people usually use. It is introduced on the website http://peak.telecommunity.com/devcenter/easyinstall as follows:

Easy Install is a python module (easy_install) bundled with setuptools that lets you automatically download, build, install, and manage Python packages.

In other words, Easy Install is a Python module that makes it convenient to install Python extension modules.

Next we will prepare, install, and configure step by step.

Preparation

The required software packages and their download locations are as follows:

Python 2.6 (the official Python website did not seem to open for me, and I forgot where I downloaded it; please search for it on the Internet)
libxml2-python-2.7.7.win32-py2.7.exe (http://xmlsoft.org/sources/win32/python/libxml2-python-2.7.7.win32-py2.7.exe, http://xmlsoft.org/sources/win32/python)
setuptools-0.6c11.win32-py2.6.exe (http://pypi.python.org/packages/2.6/s/setuptools/setuptools-0.6c11.win32-py2.6.exe#md5=1509752c3c2e64b5d0f9589aafe053dc, http://pypi.python.org/pypi/setuptools#downloads)
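The installer file names above appear to encode the Python version they target (py2.7 for the libxml2 binding, py2.6 for setuptools), so it may be worth checking that each one matches your interpreter before running it. The following is a minimal sketch, assuming the names follow the pattern "name-version.win32-pyX.Y.exe"; the helper names target_python and matches_interpreter are hypothetical and not part of any library.

```python
import re
import sys

# Assumed naming convention: "<name>-<version>.win32-py<major>.<minor>.exe"
_PY_TAG = re.compile(r"-py(\d+)\.(\d+)\.exe$")


def target_python(installer_name):
    """Return the (major, minor) Python version an installer name targets, or None."""
    match = _PY_TAG.search(installer_name)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def matches_interpreter(installer_name, version=None):
    """True if the installer's py tag equals the given (or the running) version."""
    if version is None:
        version = sys.version_info[:2]
    return target_python(installer_name) == tuple(version)


if __name__ == "__main__":
    for name in ("libxml2-python-2.7.7.win32-py2.7.exe",
                 "setuptools-0.6c11.win32-py2.6.exe"):
        print("%s -> Python %s" % (name, target_python(name)))
```

With the two file names listed above, the first is tagged for Python 2.7 and the second for Python 2.6, so only one of them matches a Python 2.6 installation.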

Install


Step 1: Install Python

On Windows, you only need to run the installer's executable (.exe) file.

Step 2: Install the easy install Tool

As mentioned above, Python must be installed first. After that, run the setuptools installer, setuptools-0.6c11.win32-py2.6.exe. It automatically finds the Python installation directory and installs the toolkit into the corresponding directory. For example, my scripts directory is E:\Program Files\Python26\Scripts, so I verified the installation like this:

E:\>cd E:\Program Files\Python26\Scripts
E:\Program Files\Python26\Scripts>easy_install --help
Global options:
  --verbose (-v)  run verbosely (default)
  --quiet (-q)    run quietly (turns verbosity off)
  --dry-run (-n)  don't actually do anything
  --help (-h)     show detailed help message

Options for 'easy_install' command:
  --prefix                       installation prefix
  --zip-ok (-z)                  install package as a zipfile
  --multi-version (-m)           make apps have to require() a version
  --upgrade (-U)                 force upgrade (searches PyPI for latest
                                 versions)
  --install-dir (-d)             install package to DIR
  --script-dir (-s)              install scripts to DIR
  --exclude-scripts (-x)         Don't install scripts
  --always-copy (-a)             Copy all needed packages to install dir
  --index-url (-i)               base URL of Python Package Index
  --find-links (-f)              additional URL(s) to search for packages
  --delete-conflicting (-D)      no longer needed; don't use this
  --ignore-conflicts-at-my-risk  no longer needed; don't use this
  --build-directory (-b)         download/extract/build in DIR; keep the
                                 results
  --optimize (-O)                also compile with optimization: -O1 for
                                 "python -O", -O2 for "python -OO", and -O0 to
                                 disable [default: -O0]
  --record                       filename in which to record list of installed
                                 files
  --always-unzip (-Z)            don't install as a zipfile, no matter what
  --site-dirs (-S)               list of directories where .pth files work
  --editable (-e)                Install specified packages in editable form
  --no-deps (-N)                 don't install dependencies
  --allow-hosts (-H)             pattern(s) that hostnames must match
  --local-snapshots-ok (-l)      allow building eggs from local checkouts

usage: easy_install-script.py [options] requirement_or_url ...
   or: easy_install-script.py --help

If you see the easy_install command options shown above, the installation was successful.
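The same check can also be made from inside Python rather than from the command prompt. The sketch below is only an illustration: it asks the interpreter where it installs scripts and whether setuptools can be imported. It assumes an interpreter that ships the standard sysconfig module (Python 2.7 or later, so not the 2.6 used in this article), and the helper names scripts_dir and is_importable are made up for this example.

```python
import sysconfig


def scripts_dir():
    """Return the directory where this interpreter installs command-line scripts."""
    return sysconfig.get_path("scripts")


def is_importable(module_name):
    """Return True if the named module can be imported by this interpreter."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("Scripts directory: %s" % scripts_dir())
    print("setuptools importable: %s" % is_importable("setuptools"))
```

If the reported scripts directory is not the one you changed into above, more than one Python installation may be present on the machine.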

Step 3: Install libxml2

Install libxml2 by running libxml2-python-2.7.7.win32-py2.7.exe. The installer extracts the module into the corresponding directory, but it cannot yet be used in Python programming. You also need to install the lxml library through Easy Install. lxml is a library built on C code (libxml2 and libxslt) that speeds up the parsing of HTML and XML; for a detailed introduction, see http://lxml.de/index.html. To install lxml, run the easy_install script from the scripts directory, which in my case is E:\Program Files\Python26\Scripts:

E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2

You can see the installation information:

E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2
Searching for lxml==2.2.2
Reading http://pypi.python.org/simple/lxml/
Reading http://codespeak.net/lxml
Best match: lxml 2.2.2
Downloading http://pypi.python.org/packages/2.6/l/lxml/lxml-2.2.2-py2.6-win32.egg#md5=dc73ae17e486037580371077efdc13e9
Processing lxml-2.2.2-py2.6-win32.egg
creating e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg
Extracting lxml-2.2.2-py2.6-win32.egg to e:\program files\python26\lib\site-packages
Adding lxml 2.2.2 to easy-install.pth file
Installed e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg
Processing dependencies for lxml==2.2.2
Finished processing dependencies for lxml==2.2.2
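After the install log finishes, a quick way to confirm that lxml is really usable is to import it and look at its version information. This is a small sketch, not part of the original steps; the function name lxml_versions is hypothetical, while LXML_VERSION and LIBXML_VERSION are attributes that lxml.etree provides. The function returns None instead of failing when lxml is not installed.

```python
def lxml_versions():
    """Return lxml and libxml2 version tuples, or None if lxml cannot be imported."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,       # version of the lxml package itself
        "libxml2": etree.LIBXML_VERSION,  # version of the libxml2 library in use
    }


if __name__ == "__main__":
    versions = lxml_versions()
    if versions is None:
        print("lxml is not installed for this interpreter")
    else:
        print("lxml %s, libxml2 %s" % (versions["lxml"], versions["libxml2"]))
```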

Use XPath for Extraction

Next, we use XPath to extract web page data. I use a Python IDE, EasyEclipse for Python (version 1.3.1), in which you can directly create a PyDev project.

Verify that you can use XPath to extract webpage data. The Python code is as follows:

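import codecs
from lxml import etree


def readFile(file, decoding):
    # Read the whole file as text; return an empty string if it cannot be read or decoded.
    try:
        with codecs.open(file, 'r', decoding) as f:
            return f.read()
    except (IOError, UnicodeDecodeError):
        return u''


def extract(file, decoding, xpath):
    # Parse the text as HTML and return the nodes matched by the XPath expression.
    html = readFile(file, decoding)
    if not html:
        return []  # nothing to parse
    tree = etree.HTML(html)
    return tree.xpath(xpath)


if __name__ == '__main__':
    sections = extract('peak.txt', 'utf-8', "//h3//a[@class='toc-backref']")
    for title in sections:
        print(title.text)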

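First, save the HTML source of the page to be parsed as peak.txt, the file name the script reads (judging by the output below, the page was the Easy Install page mentioned earlier). Then the script reads the file content, uses XPath to extract the title of each section on the page, and prints the titles to the console. The result is as follows: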

Troubleshooting
Windows Notes
Multiple Python Versions
Restricting Downloads with
Installing on Un-networked Machines
Packaging Others' Projects As Eggs
Creating your own Package Index
Password-Protected Sites
Controlling Build Options
Editing and Viewing Source Packages
Dealing with Installation Conflicts
Compressed Installation
Administrator Installation
Mac OS X "User" Installation
Creating a "Virtual" Python
"Traditional"
Backward Compatibility
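Some of the titles in this output look cut short, for example "Restricting Downloads with". A likely reason is that an element's .text holds only the text before its first child element, so a title that contains nested markup is truncated. The sketch below illustrates this on a made-up two-heading snippet (the markup is an assumption, not the real page). So that it runs even without lxml, it falls back to the standard library's xml.etree.ElementTree, whose element API lxml mirrors; that module needs well-formed markup and supports only a subset of XPath through findall(), unlike etree.HTML() and xpath() in lxml.

```python
try:
    from lxml import etree as ET           # preferred when lxml is installed
except ImportError:
    import xml.etree.ElementTree as ET     # standard-library fallback

# Hypothetical, well-formed sample modeled on the headings printed above.
SAMPLE = (
    "<html><body>"
    "<h3><a class='toc-backref' href='#id1'>Windows Notes</a></h3>"
    "<h3><a class='toc-backref' href='#id2'>Restricting Downloads with "
    "<tt>--allow-hosts</tt></a></h3>"
    "</body></html>"
)


def section_titles(markup):
    """Return (leading_text, full_text) lists for every matching heading link."""
    root = ET.fromstring(markup)
    links = root.findall(".//h3//a[@class='toc-backref']")
    leading_text = [a.text for a in links]               # text before the first child only
    full_text = ["".join(a.itertext()) for a in links]   # text of the element and all descendants
    return leading_text, full_text


if __name__ == "__main__":
    leading, full = section_titles(SAMPLE)
    for short, complete in zip(leading, full):
        print("%r -> %r" % (short, complete))
```

In the script above, joining itertext() in place of title.text should recover the full titles in the same way.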

Once you are familiar with XPath, you can use lxml (built on libxml2) to extract whatever content you want from a web page.
