When it comes to crawlers, Python offers a large ecosystem of packages for fetching web pages and processing the data they contain. If you need a Python library for a crawling task, refer to the list below.
Network
General
Urllib-network library (stdlib).
Requests-network library.
Grab-network library (based on pycurl).
Pycurl-network library (binding to libcurl).
Urllib3-Python HTTP library with thread-safe connection pooling, file post support, and high availability.
Httplib2-network library.
RoboBrowser-a simple, Pythonic library for browsing web pages without a standalone browser.
MechanicalSoup-a Python library for automating interaction with websites.
Mechanize-stateful, programmable web browsing library.
Socket-low-level networking interface (stdlib).
Unirest for Python-Unirest is a set of lightweight HTTP libraries available in multiple languages.
Hyper-HTTP/2 client for Python.
PySocks-an updated and actively maintained fork of SocksiPy, with bug fixes and additional features. Works as a drop-in replacement for the socket module.
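As a minimal sketch of the standard-library option above, the following builds a request with urllib without sending it. The helper names build_request and fetch, the User-Agent string, and the example.com URL are illustrative assumptions, not part of the original list.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="example-crawler/0.1"):
    """Return a GET request that identifies the crawler via User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10.0):
    """Download a page and return its body as text (performs network I/O)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request is offline; only fetch() touches the network.
request = build_request("https://example.com/index.html")
print(request.get_method(), request.full_url)
```

Calling fetch("https://example.com/") would perform the actual download; third-party libraries such as Requests wrap the same steps in a shorter API.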
Asynchronous
Web crawler framework
Full-featured crawlers
Grab-web crawler framework (based on pycurl/multicurl).
Scrapy-web crawler framework (based on Twisted). Older releases did not support Python 3; current releases do.
Pyspider-a powerful crawler system.
Cola-a distributed crawler framework.
Others
Portia-Scrapy-based visual crawler.
Restkit-HTTP resource toolkit for Python. It lets you easily access HTTP resources and build objects around them.
Demiurge-PyQuery-based crawler microframework.
HTML/XML parser
General
Lxml-efficient HTML/XML processing library written in C. Supports XPath.
Cssselect-queries the DOM tree with CSS selectors.
Pyquery-queries the DOM tree with jQuery-style selectors.
BeautifulSoup-less efficient HTML/XML processing library, implemented in pure Python.
Html5lib-builds the DOM of an HTML/XML document according to the WHATWG specification, the specification followed by all modern browsers.
Feedparser-parse RSS/ATOM feeds.
MarkupSafe-provides safely escaped strings for XML, HTML, and XHTML.
Xmltodict-a Python module that makes working with XML feel like working with JSON.
Xhtml2pdf-converts HTML/CSS to PDF.
Untangle-easily converts an XML file into a Python object.
Cleaning
Text processing
Libraries for parsing and manipulating plain text.
Difflib-(Python standard library) helps compute differences between sequences.
Levenshtein-fast computation of Levenshtein distance and string similarity.
Fuzzywuzzy-fuzzy string matching.
Esmre-regular expression accelerator.
Ftfy-automatically fixes broken Unicode text and makes it more consistent.
Uniout-prints readable characters instead of escaped strings.
Chardet-character encoding detector, compatible with Python 2 and 3.
Xpinyin-a library for converting Chinese characters into pinyin.
Pangu.py-inserts spacing between CJK and alphanumeric characters in text.
Awesome-slugify-a Python slugify library that can retain unicode.
Python-slugify-a Python slugify library that can convert Unicode to ASCII.
Unicode-slugify-a tool that can generate Unicode slugs.
Pytils-simple tools for processing Russian strings (including pytils.translit.slugify).
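To illustrate the kind of comparison the text-processing libraries above perform, here is a minimal sketch using the standard-library difflib. The similarity helper and the sample titles are assumptions made up for this example; third-party libraries such as Levenshtein or Fuzzywuzzy offer faster or richer variants of the same idea.

```python
import difflib


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Hypothetical page titles collected by a crawler.
titles = [
    "Python crawler tools",
    "Python crawler tool list",
    "Cooking recipes",
]

# Keep at most two titles whose similarity to the query is at least 0.6,
# best match first.
close = difflib.get_close_matches("Python crawler tool", titles, n=2, cutoff=0.6)
print(close)
```

This kind of check is handy for spotting near-duplicate pages before storing them.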
Processing specific format files
Libraries for parsing and processing specific text formats.
Tablib-a module for exporting tabular data to XLS, CSV, JSON, YAML, and other formats.
Textract-extract text from various files, such as Word, PowerPoint, and PDF.
Messytables-a tool for parsing messy table data.
Rows-a common data interface that supports many formats (currently CSV, HTML, XLS, and TXT, with more planned).
Python-docx-reads, queries, and modifies Microsoft Word 2007/2008 docx files.
Xlwt/xlrd-write and read data and formatting information to and from Excel files.
XlsxWriter-a Python module for creating Excel .xlsx files.
Xlwings-a BSD-licensed library that makes it easy to call Python from Excel, and vice versa.
Openpyxl-a library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
Marmir-takes Python data structures and converts them into spreadsheets.
PDFMiner-a tool for extracting information from PDF documents.
PyPDF2-a library that can split, merge, and convert PDF pages.
ReportLab-allows quick creation of rich PDF documents.
Pdftables-extract a table directly from a PDF file.
Python-Markdown-a Python implementation of John Gruber's Markdown.
Mistune-billed as the fastest full-featured Markdown parser in pure Python.
Markdown2-a fast Markdown implementation written entirely in Python.
Natural language processing
Libraries for working with human language.
NLTK-a leading platform for building Python programs that work with human language data.
Pattern-a web mining module for Python. It includes tools for natural language processing, machine learning, and more.
TextBlob-provides a consistent API for diving into common NLP tasks. It stands on the shoulders of NLTK and Pattern.
Jieba-Chinese word segmentation tool.
SnowNLP-Chinese Text Processing library.
Loso-another Chinese word segmentation library.
Genius-Chinese word segmentation based on conditional random fields.
Langid.py-a standalone language identification system.
Korean-a library for Korean morphology.
Pymorphy2-Russian morphological analyzer (part-of-speech tagger + inflection engine).
PyPLN-a distributed natural language processing pipeline written in Python. The goal of the project is to provide an easy way to use NLTK to process large corpora through a web interface.
Browser Automation and simulation
Selenium-automates real browsers (Chrome, Firefox, Opera, and IE).
Ghost.py-a wrapper around PyQt WebKit (requires PyQt).
Spynner-a wrapper around PyQt WebKit (requires PyQt).
Splinter-a common API for browser emulators (Selenium WebDriver, Django client, Zope).
Multiprocessing
Threading-the Python standard library module for running threads. It is effective for I/O-bound tasks but offers little benefit for CPU-bound tasks because of Python's GIL.
Multiprocessing-the Python standard library module for running multiple processes.
Celery-asynchronous task queue/job queue based on distributed message passing.
Concurrent.futures-the concurrent.futures module provides a high-level interface for asynchronously executing callables.
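The point about threads suiting I/O-bound work can be sketched with concurrent.futures from the standard library. In this example fake_fetch is a hypothetical stand-in that sleeps instead of making a real network call, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for a network call: wait briefly, then return a result."""
    time.sleep(0.05)  # simulates waiting on I/O; the GIL is released here
    return url, len(url)


urls = [f"https://example.com/page/{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # map() preserves input order even though the calls overlap in time.
    results = list(pool.map(fake_fetch, urls))
elapsed = time.perf_counter() - start

print(f"fetched {len(results)} pages in {elapsed:.2f}s")
```

Because the waits overlap, the total time is typically close to one sleep rather than eight. For CPU-bound work, swapping in ProcessPoolExecutor from the same module sidesteps the GIL at the cost of process start-up and pickling overhead.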
Asynchronous
Asynchronous network programming library
Asyncio-(Python standard library, Python 3.4 and later) asynchronous I/O, event loop, coroutines, and tasks.
Twisted-event-driven network engine framework.
Tornado-a network framework and an asynchronous network Library.
Pulsar-Python event-driven concurrency framework.
Diesel-greenlet-based event I/O framework for Python.
Gevent-a coroutine-based Python networking library built on greenlet.
Eventlet-asynchronous framework with WSGI support.
Tomorrow-magic decorator syntax for asynchronous code.
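As a rough illustration of the coroutine style these libraries share, here is a minimal asyncio sketch. It assumes Python 3.7 or later for asyncio.run; fake_fetch and crawl are hypothetical names, and the sleep stands in for a real non-blocking request.

```python
import asyncio


async def fake_fetch(url, delay=0.05):
    """Stand-in for a non-blocking HTTP request."""
    await asyncio.sleep(delay)  # yields control to the event loop
    return f"body of {url}"


async def crawl(urls):
    """Run all fetches concurrently and return results in input order."""
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


pages = asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))
print(pages)
```

A single thread handles every request here; concurrency comes from each coroutine yielding while it waits.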
Queue
Celery-asynchronous task queue/job queue based on distributed message passing.
Huey-a small multi-threaded task queue.
Mrq-Mr. Queue, a distributed job queue for Python using Redis and Gevent.
RQ-Redis-based lightweight task queue manager.
Simpleq-a simple, infinitely scalable queue based on Amazon SQS.
Python-gearman-Python API for Gearman.
Cloud computing
Picloud-run Python code on the cloud.
Dominoup.com-executes R, Python, and MATLAB code in the cloud.
Email
Libraries for parsing email.
Website and network address operations
Libraries for parsing and modifying URLs and network addresses.
URL
Furl-a small Python library that simplifies URL manipulation.
Purl-a simple, immutable URL class with a clean API for interrogation and manipulation.
Urllib.parse-breaks a Uniform Resource Locator (URL) string into its components (addressing scheme, network location, path, etc.), combines components back into a URL string, and converts a "relative URL" into an absolute URL given a "base URL".
Tldextract-accurately separates the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
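The three urllib.parse operations described above (splitting, recombining, and resolving relative URLs) can be sketched as follows. The example.com URLs are placeholders chosen for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Split a URL into its components.
parts = urlsplit("https://example.com:8080/docs/index.html?page=2#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Combine the components back into a URL string.
rebuilt = urlunsplit(parts)

# Resolve a relative URL against a base URL, as a crawler does for each link.
absolute = urljoin("https://example.com/docs/index.html", "../img/logo.png")
print(absolute)
```

Resolving every extracted href through urljoin before queueing it keeps a crawler from treating relative links as new hosts or broken paths.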
Webpage Content Extraction
Libraries for extracting web content.
WebSocket
Libraries for working with WebSocket.
Crossbar-open-source application messaging router (a Python WebSocket and WAMP implementation built on Autobahn).
AutobahnPython-open-source Python implementation of the WebSocket and WAMP protocols.
WebSocket-for-Python-WebSocket client and server library for Python 2 and 3 and PyPy.
DNS resolution
Computer vision
OpenCV-open-source computer vision library.
SimpleCV-a concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
Mahotas-fast computer vision algorithms (fully implemented in C++) that operate on numpy arrays.
Proxy Server
Shadowsocks-a fast tunnel proxy that helps you get through firewalls (supports TCP and UDP, TFO, multiple users, graceful restart, and destination IP blacklists).
Tproxy-a simple TCP routing proxy (layer 7) built on Gevent and configured in Python.
List of other Python tools