This list contains Python libraries for web crawling and data processing.
Network
General
- urllib - Network library (standard library)
- requests - Network library
- grab - Network library (based on pycurl)
- pycurl - Network library (binding to libcurl)
- urllib3 - Python HTTP library with thread-safe connection pooling and file POST support
- httplib2 - Network library
- RoboBrowser - A simple, Pythonic library for browsing the web without a standalone browser
- MechanicalSoup - A Python library for automating interaction with websites
- mechanize - Stateful, programmable web browsing library
- socket - Low-level networking interface (standard library)
- Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages
- hyper - HTTP/2 client for Python
- PySocks - Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement for the socket module
Asynchronous
- treq - requests-like API built on Twisted
- aiohttp - HTTP client/server for asyncio (PEP 3156)
Web crawler frameworks
Full-featured crawlers
- grab - Web crawler framework (based on pycurl/multicurl)
- Scrapy - Web crawler framework (based on Twisted)
- pyspider - A powerful spider system
- cola - A distributed crawler framework
Other
- portia - Visual scraping for Scrapy
- restkit - HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them
- demiurge - A crawler micro-framework based on PyQuery
HTML/XML parsing
General
- lxml - Efficient HTML/XML processing library written in C. Supports XPath
- cssselect - Works with the DOM tree using CSS selectors
- pyquery - Works with the DOM tree using jQuery-like selectors
- BeautifulSoup - Slow HTML/XML processing library, implemented in pure Python
- html5lib - Builds the DOM of HTML/XML documents according to the WHATWG specification, which is the specification modern browsers follow
- feedparser - Parses RSS/Atom feeds
- MarkupSafe - Implements a markup-safe string for XML/HTML/XHTML in Python
- xmltodict - Lets you work with XML as if it were JSON
- xhtml2pdf - HTML/CSS to PDF converter
- untangle - Converts XML documents into Python objects for easy access
- hodor - Configuration-driven wrapper around lxml and cssselect
Cleaning
- bleach - Cleans HTML (requires html5lib)
- sanitize - Brings sanity to messed-up data
Text Processing
Libraries for parsing and manipulating text
General
- difflib - Helpers for computing differences between sequences (Python standard library)
- Levenshtein - Fast computation of edit distance and string similarity
- fuzzywuzzy - Fuzzy string matching
- esmre - Regular expression accelerator
- ftfy - Automatically fixes broken Unicode text and makes it more consistent
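
The difflib entry above can be tried without installing anything. The following is a minimal sketch of string similarity and fuzzy lookup using only the standard library; the helper names `similarity` and `closest` and the sample strings are illustrative assumptions, not part of any listed library.

```python
import difflib


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    # SequenceMatcher's first argument is an optional "junk" filter; None disables it.
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest(word, candidates, cutoff=0.6):
    """Return up to three candidates that resemble `word`, best match first."""
    return difflib.get_close_matches(word, candidates, n=3, cutoff=cutoff)


if __name__ == "__main__":
    print(similarity("web crawler", "web crawlers"))
    print(closest("scarpy", ["scrapy", "scipy", "requests"]))
```

For large volumes of comparisons, the C-backed Levenshtein package listed above is likely to be faster; difflib is convenient when no extra dependency is wanted.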
Transliteration
- unidecode - ASCII transliterations of Unicode text
Character encoding
- uniout - Prints readable characters instead of escaped strings
- chardet - Python 2/3 compatible character encoding detector
- xpinyin - A library for converting Chinese characters to pinyin
- pangu.py - Adds spacing between CJK and alphanumeric text
Slugify
- awesome-slugify - A Python slugify library that can preserve Unicode
- python-slugify - A Python slugify library that converts Unicode to ASCII
- unicode-slugify - A tool for generating Unicode slugs
- pytils - Simple tools for handling Russian strings (includes pytils.translit.slugify)
General parsers
- PLY - Python implementation of the lex and yacc parsing tools
- pyparsing - A general-purpose framework for generating parsers
Human names
- python-nameparser - Parses human names into their individual components
Phone numbers
- phonenumbers - Parses, formats, stores, and validates international phone numbers
User agent strings
- python-user-agents - Browser user agent parser
- HTTP Agent Parser - Python HTTP agent parser
- fake-useragent - User agent faker for Python based on global browser statistics
- user_agent - User agent data generator
Specific format processing
Libraries for parsing and manipulating specific text formats
General
- tablib - A library for tabular data in XLS, CSV, JSON, YAML, and other formats
- textract - Extracts text from any document: Word, PowerPoint, PDF, etc.
- messytables - Parses messy tabular data
- rows - A common, beautiful interface to tabular data regardless of format (currently CSV, HTML, XLS, TXT; more to come)
Office
- python-docx - Reads, queries, and modifies Microsoft Word 2007/2008 docx files
- xlwt/xlrd - Write and read data and formatting information in Excel files
- XlsxWriter - A Python module for writing Excel .xlsx files
- xlwings - A BSD-licensed library that makes it easy for Excel and Python to call each other
- openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files
- Marmir - Takes Python data structures and turns them into spreadsheets
PDF
- PDFMiner - A tool for extracting information from PDF documents
- PyPDF2 - A library that splits, merges, and transforms PDF files
- ReportLab - Allows rapid creation of rich PDF documents
- pdftables - Extracts tables directly from PDF files
Markdown
- Python-Markdown - A Python implementation of John Gruber's Markdown
- Mistune - The fastest full-featured pure Python Markdown parser
- markdown2 - A fast and complete pure Python implementation of Markdown
YAML
- PyYAML - A YAML parser for Python
CSS
- cssutils - A CSS library for Python
Atom/RSS
- feedparser - Universal feed parser
SQL
- sqlparse - A non-validating SQL statement parser
HTTP
- http-parser - HTTP request/response message parser implemented in C
Microformats
- opengraph - A Python module for parsing Open Graph Protocol tags
Portable Executable
- pefile - A multi-platform module for parsing and working with Portable Executable (PE) files
PSD
- psd-tools - Reads Adobe Photoshop PSD files into Python data structures
Natural language processing
Libraries for working with human languages
- NLTK - The leading platform for building Python programs that work with human language data
- Pattern - A web mining module for Python, with tools for natural language processing, machine learning, and more
- TextBlob - Provides a consistent API for common natural language processing tasks. Built on NLTK and Pattern
- jieba - Chinese word segmentation
- SnowNLP - Chinese text processing library
- loso - Chinese word segmentation library
- genius - Chinese word segmentation based on conditional random fields
- langid.py - Stand-alone language identification system
- korean - Korean morphology library
- pymorphy2 - Russian morphological analyzer (POS tagging + inflection engine)
- PyPLN - A distributed pipeline for natural language processing, written in Python. The goal of the project is to provide an easy way to use NLTK for processing large corpora through a web interface
- langdetect - Python port of Google's language-detection library
Browser automation and emulation
Browsers
- selenium - Automates real browsers (Chrome, Firefox, Opera, IE)
- Ghost.py - Wrapper around QtWebKit (requires PyQt)
- Spynner - Programmatic web browsing module with AJAX support
- Splinter - Universal API for browser emulators (Selenium WebDriver, Django client, Zope)
Headless tools
- xvfbwrapper - Python wrapper for running a display inside the X virtual framebuffer (Xvfb)
Multiprocessing
- threading - Multi-threading in the Python standard library. Because of the Python GIL, it works well for I/O-bound tasks but gives no speedup for CPU-bound tasks
- multiprocessing - Standard library for running multiple processes
- celery - Asynchronous task queue/job queue based on distributed message passing
- concurrent.futures - Standard library module that provides a high-level interface for asynchronously executing callables
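
The point about threads suiting I/O-bound work can be illustrated with a thread pool from concurrent.futures. This is a sketch under stated assumptions: `fake_download` is a hypothetical stand-in that sleeps instead of performing a real network request, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, delay=0.2):
    """Hypothetical stand-in for a network request.

    time.sleep releases the GIL, much like waiting on a real socket does,
    so several of these calls can overlap across threads.
    """
    time.sleep(delay)
    return url, len(url)


def fetch_all(urls, max_workers=8):
    """Run fake_download over all URLs in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_download, urls))


if __name__ == "__main__":
    urls = ["https://example.com/page/%d" % i for i in range(8)]
    start = time.perf_counter()
    results = fetch_all(urls)
    elapsed = time.perf_counter() - start
    print(len(results), "pages in", round(elapsed, 2), "seconds")
```

For CPU-bound work, swapping in `ProcessPoolExecutor` from the same module is the usual way to sidestep the GIL, at the cost of pickling arguments and results between processes.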
Asynchronous
Libraries for asynchronous network programming
- asyncio - Asynchronous I/O, event loop, coroutines, and tasks (in the Python standard library since version 3.4)
- Twisted - Event-driven networking engine
- Tornado - A web framework and asynchronous networking library
- pulsar - Event-driven concurrency framework for Python
- diesel - Greenlet-based I/O framework for Python
- gevent - A coroutine-based Python networking library that uses greenlet
- eventlet - Asynchronous framework with WSGI support
- Tomorrow - Magic decorator syntax for asynchronous code
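
As a rough illustration of the event loop, coroutines, and tasks that asyncio provides, the sketch below runs several simulated fetches concurrently. `fake_fetch` and `crawl` are hypothetical names, the sleep stands in for real network I/O, and `asyncio.run` assumes Python 3.7 or newer.

```python
import asyncio


async def fake_fetch(name, delay):
    """Hypothetical coroutine: awaiting sleep yields control to the event loop."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def crawl(jobs):
    """Schedule one task per job and wait for all of them.

    gather returns results in the order the tasks were passed in,
    regardless of which one finishes first.
    """
    tasks = [asyncio.create_task(fake_fetch(name, delay)) for name, delay in jobs]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(crawl([("a", 0.2), ("b", 0.1)])))
```

In a real crawler the sleep would be replaced by an awaitable HTTP call, for example from an asyncio-based client such as aiohttp.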
Queue
- celery - Asynchronous task queue/job queue based on distributed message passing
- huey - Small multi-threaded task queue
- mrq - Mr. Queue, a distributed worker task queue in Python using Redis and gevent
- RQ - A lightweight task queue manager based on Redis
- simpleq - A simple, infinitely scalable queue based on Amazon SQS
- python-gearman - Python API for Gearman
Cloud computing
- picloud - Executes Python code in the cloud
- dominoup.com - Executes R, Python, and MATLAB code in the cloud
Email
Libraries for email processing
- flanker - Email address and MIME parsing library
- Talon - Mailgun library for extracting quotations and signatures from messages
URL and network address manipulation
Libraries for manipulating URLs and network addresses
URL
- furl - A small Python library that makes manipulating URLs simple
- purl - A simple, immutable URL class with a clean API for inspection and manipulation
- urllib.parse - Splits a Uniform Resource Locator (URL) string into its components (addressing scheme, network location, path, etc.), combines components back into a URL string, and converts a relative URL to an absolute URL given a base URL (standard library)
- tldextract - Uses the Public Suffix List to accurately separate the TLD from the registered domain and subdomains of a URL
Network address
- netaddr - Python library for representing and manipulating network addresses
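
The three urllib.parse operations described above (splitting, recombining, and resolving relative URLs) can be sketched with the standard library alone. The helper names `describe`, `absolutize`, and `strip_query` and the example.com URLs are illustrative assumptions.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def describe(url):
    """Split a URL into its main components."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
    }


def absolutize(base, link):
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(base, link)


def strip_query(url):
    """Rebuild a URL from its components, dropping the query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


if __name__ == "__main__":
    print(describe("https://example.com/a/b.html?x=1"))
    print(absolutize("https://example.com/a/b.html", "../c.html"))
    print(strip_query("https://example.com/a?x=1#frag"))
```

A crawler would typically call something like `absolutize` on every extracted link and something like `strip_query` when deduplicating URLs, although whether dropping the query string is safe depends on the site.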
Web content extraction
Libraries for extracting web content
Text and metadata of HTML pages
- newspaper - News extraction, article extraction, and content curation in Python
- html2text - Converts HTML to Markdown-formatted text
- python-goose - HTML content/article extractor
- lassie - Human-friendly web content retrieval tool
- micawber - A small library for extracting rich content from URLs
- sumy - A module that automatically summarizes text files and HTML pages
- Haul - An extensible image crawler
- python-readability - Fast Python port of arc90's readability tool
- scrapely - A library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely builds a parser for all similar pages
- libextract - Extracts data from websites
Video
- youtube-dl - A small command-line tool for downloading videos from YouTube
- you-get - YouTube/Youku/Niconico video downloader written in Python 3
Wiki
- WikiTeam - Tools for downloading and preserving wikis
WebSocket
Libraries for working with WebSocket
- Crossbar - Open-source application messaging router (Python implementation of WebSocket and WAMP for Autobahn)
- AutobahnPython - Open-source Python implementation of the WebSocket and WAMP protocols
- WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy
DNS resolution
- dnsyo - Checks your DNS against more than 1500 DNS servers worldwide
- pycares - Interface to c-ares, a C library that performs DNS requests and name resolution asynchronously
Computer vision
- OpenCV - Open source computer vision library
- SimpleCV - Concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV)
- mahotas - Fast computer vision algorithms (implemented entirely in C++) that operate on NumPy arrays
Proxy server
- shadowsocks - A fast tunnel proxy that helps you get through firewalls (supports TCP and UDP, TFO, multiple users, graceful restart, and destination IP blacklisting)
- tproxy - A simple TCP routing proxy (layer 7) built on gevent and configurable in Python
Miscellaneous
- user_agent - Generates random, valid web navigator configurations and User-Agent HTTP headers
Other
- Awesome-python
- Pycrumbs
- Python-github-projects
- Python_reference
- Pythonidae