Contents
- Preparation
- Install
- Use XPath for Extraction
I want to use XPath to extract data (such as the title and body) from web pages fetched by a crawler, so I spent a day getting familiar with Python. Today I tried installing the libxml2 module on Windows, and this post records what I learned.
To install extension modules in Python, you can first install the helper toolkit setuptools, which installs new Python packages and manages installed ones through eggs and Easy Install. I searched a lot on the Internet, and most guides use Easy Install. The website http://peak.telecommunity.com/devcenter/easyinstall introduces Easy Install as follows:
Easy Install is a python module (easy_install) bundled with setuptools that lets you automatically download, build, install, and manage Python packages.
In other words, Easy Install is a Python module that makes it convenient to install Python extension packages.
Next we will prepare, install, and configure step by step.
Preparation
The required software packages and their download links are as follows:
- Python 2.6 (the official Python website would not open for me, and I no longer remember where I downloaded it; please search for it online)
- libxml2-python-2.7.7.win32-py2.7.exe (http://xmlsoft.org/sources/win32/python/libxml2-python-2.7.7.win32-py2.7.exe; other builds are listed at http://xmlsoft.org/sources/win32/python). Note that the py2.7 suffix marks a build for Python 2.7, so pick the installer from that directory that matches your Python version.
- setuptools-0.6c11.win32-py2.6.exe (http://pypi.python.org/packages/2.6/s/setuptools/setuptools-0.6c11.win32-py2.6.exe#md5=1509752c3c2e64b5d0f9589aafe053dc; other downloads are listed at http://pypi.python.org/pypi/setuptools#downloads)
Install
Step 1: Install Python
On Windows, you only need to run the installer's executable (.exe) file.
Step 2: Install the Easy Install tool
As mentioned above, Python must be installed first. After that, run the matching setuptools-0.6c11.win32-py2.6.exe. It automatically finds the Python installation directory and installs the toolkit into the corresponding directory. For example, my scripts directory is E:\Program Files\Python26\Scripts, so I verify the installation there:
    E:\>cd E:\Program Files\Python26\Scripts
    E:\Program Files\Python26\Scripts>easy_install --help
    Global options:
      --verbose (-v)  run verbosely (default)
      --quiet (-q)    run quietly (turns verbosity off)
      --dry-run (-n)  don't actually do anything
      --help (-h)     show detailed help message

    Options for 'easy_install' command:
      --prefix                       installation prefix
      --zip-ok (-z)                  install package as a zipfile
      --multi-version (-m)           make apps have to require() a version
      --upgrade (-U)                 force upgrade (searches PyPI for latest versions)
      --install-dir (-d)             install package to DIR
      --script-dir (-s)              install scripts to DIR
      --exclude-scripts (-x)         Don't install scripts
      --always-copy (-a)             Copy all needed packages to install dir
      --index-url (-i)               base URL of Python Package Index
      --find-links (-f)              additional URL(s) to search for packages
      --delete-conflicting (-D)      no longer needed; don't use this
      --ignore-conflicts-at-my-risk  no longer needed; don't use this
      --build-directory (-b)         download/extract/build in DIR; keep the results
      --optimize (-O)                also compile with optimization: -O1 for "python -O", -O2 for "python -OO", and -O0 to disable [default: -O0]
      --record                       filename in which to record list of installed files
      --always-unzip (-Z)            don't install as a zipfile, no matter what
      --site-dirs (-S)               list of directories where .pth files work
      --editable (-e)                Install specified packages in editable form
      --no-deps (-N)                 don't install dependencies
      --allow-hosts (-H)             pattern(s) that hostnames must match
      --local-snapshots-ok (-l)      allow building eggs from local checkouts

    usage: easy_install-script.py [options] requirement_or_url ...
       or: easy_install-script.py --help
If you can see the easy_install Command Options above, the installation is successful.
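If your Python is installed somewhere other than E:\Program Files\Python26, a small sketch like the one below may help you find your own scripts directory. It assumes the standard Windows layout, where scripts live in a Scripts folder under the interpreter's installation prefix; the helper name scripts_dir is made up for illustration and is not part of setuptools.

```python
import os
import sys


def scripts_dir(prefix=None):
    """Return the Scripts directory for a Python installation prefix.

    Assumes the standard Windows layout (a "Scripts" folder directly under
    the prefix). 'scripts_dir' is a hypothetical helper for illustration.
    """
    if prefix is None:
        prefix = sys.prefix  # installation directory of the running interpreter
    return os.path.join(prefix, "Scripts")


if __name__ == "__main__":
    path = scripts_dir()
    print(path)
    # On Windows, easy_install.exe should exist here once setuptools is installed.
    print(os.path.exists(os.path.join(path, "easy_install.exe")))
```

Run it with the interpreter you installed setuptools for; if the second line prints False, the setuptools installer probably targeted a different Python installation.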
Step 3: Install libxml2
Install libxml2 by running libxml2-python-2.7.7.win32-py2.7.exe. After the installation, the module is extracted into the corresponding directory, but it cannot yet be used in Python programs. You also need to install the lxml library through Easy Install. lxml is a Python binding built on the C libraries libxml2 and libxslt, which makes parsing HTML and XML fast; for a detailed introduction, see http://lxml.de/index.html. To install lxml, run the easy_install script from the scripts directory, which in my case is E:\Program Files\Python26\Scripts:
    E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2
You can see the installation information:
    E:\Program Files\Python26\Scripts>easy_install lxml==2.2.2
    Searching for lxml==2.2.2
    Reading http://pypi.python.org/simple/lxml/
    Reading http://codespeak.net/lxml
    Best match: lxml 2.2.2
    Downloading http://pypi.python.org/packages/2.6/l/lxml/lxml-2.2.2-py2.6-win32.egg#md5=dc73ae17e486037580371077efdc13e9
    Processing lxml-2.2.2-py2.6-win32.egg
    creating e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg
    Extracting lxml-2.2.2-py2.6-win32.egg to e:\program files\python26\lib\site-packages
    Adding lxml 2.2.2 to easy-install.pth file
    Installed e:\program files\python26\lib\site-packages\lxml-2.2.2-py2.6-win32.egg
    Processing dependencies for lxml==2.2.2
    Finished processing dependencies for lxml==2.2.2
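Before moving on, it may be worth confirming from Python itself that lxml can be imported. The following is a minimal sketch; the function name lxml_version is made up for illustration, and etree.LXML_VERSION is the version tuple that lxml exposes.

```python
def lxml_version():
    """Return the installed lxml version as a tuple, or None if lxml is missing."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return etree.LXML_VERSION  # a tuple of integers, e.g. (2, 2, 2, 0)


if __name__ == "__main__":
    version = lxml_version()
    if version is None:
        print("lxml is not installed for this interpreter")
    else:
        print("lxml %s" % ".".join(str(part) for part in version))
```

If this reports that lxml is missing even though easy_install finished without errors, the egg was probably installed into a different Python installation than the one running the script.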
Use XPath for Extraction
Next, we use XPath to extract web page data. I use a Python IDE, EasyEclipse for Python (version 1.3.1), in which you can directly create a PyDev project.
Verify that you can use XPath to extract webpage data. The Python code is as follows:
    import codecs

    from lxml import etree


    def readFile(path, decoding):
        # Read the file and decode it to text. Errors are left to propagate,
        # because returning '' would only fail later inside the parser.
        with codecs.open(path, 'r', decoding) as f:
            return f.read()


    def extract(path, decoding, xpath):
        # Parse the HTML text and return the nodes matched by the XPath expression.
        html = readFile(path, decoding)
        tree = etree.HTML(html)
        return tree.xpath(xpath)


    if __name__ == '__main__':
        sections = extract('peak.txt', 'utf-8', "//h3//a[@class='toc-backref']")
        for title in sections:
            print(title.text)
First, save the page http://peak.telecommunity.com/devcenter/easyinstall locally as peak.txt. Then the script reads the file content, uses XPath to extract the title of each section on the page, and prints the titles to the console. The result is as follows:
    Troubleshooting
    Windows Notes
    Multiple Python Versions
    Restricting Downloads with
    Installing on Un-networked Machines
    Packaging Others' Projects As Eggs
    Creating your own Package Index
    Password-Protected Sites
    Controlling Build Options
    Editing and Viewing Source Packages
    Dealing with Installation Conflicts
    Compressed Installation
    Administrator Installation
    Mac OS X "User" Installation
    Creating a "Virtual" Python
    "Traditional"
    Backward Compatibility
If you are familiar with XPath and use libxml2, you can extract whatever content you want from the webpage.
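To try other XPath expressions without saving a page first, the same idea can be sketched on an inline HTML string. The sample page and the helper name extract_text are made up for illustration; only the toc-backref class is taken from the script above. The sketch requires lxml.

```python
from lxml import etree

# Made-up sample page; only the toc-backref class mirrors the XPath used above.
SAMPLE_HTML = """
<html>
  <head><title>Easy Install</title></head>
  <body>
    <h3><a class="toc-backref" href="#id1">Troubleshooting</a></h3>
    <h3><a class="toc-backref" href="#id2">Windows Notes</a></h3>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""


def extract_text(html, xpath):
    """Parse an HTML string and return the text of each element matched by xpath."""
    tree = etree.HTML(html)
    return [node.text for node in tree.xpath(xpath)]


page_title = extract_text(SAMPLE_HTML, "//title")
sections = extract_text(SAMPLE_HTML, "//h3//a[@class='toc-backref']")
paragraphs = extract_text(SAMPLE_HTML, "//body/p")

if __name__ == "__main__":
    print(page_title)
    print(sections)
    print(paragraphs)
```

Note that node.text only holds the text before an element's first child, so a heading that contains nested markup comes back truncated; that is why "Restricting Downloads with" appears cut off in the output above.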