This list collects Python libraries for web crawling and data processing.

Network
    • General

    • urllib - network library (standard library)

    • requests - network library

    • grab - network library (based on pycurl)

    • pycurl - network library (binding to libcurl)

    • urllib3 - a Python HTTP library with thread-safe connection pooling, file POST support, and high availability

    • httplib2 - network library

    • RoboBrowser - a simple, Pythonic library for browsing the web without a standalone browser

    • MechanicalSoup - a Python library for automating interaction with websites

    • mechanize - stateful, programmable web browsing library

    • socket - low-level network interface (standard library)

    • Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages

    • hyper - HTTP/2 client for Python

    • PySocks - a continuously updated and maintained version of SocksiPy, with bug fixes and extra features; can be used as a replacement for the socket module

    • Asynchronous

    • treq - requests-like API (based on Twisted)

    • aiohttp - HTTP client/server for asyncio (PEP 3156)

Web crawler Framework
    • Full-featured crawlers

    • grab - web crawler framework (based on pycurl/multicurl)

    • Scrapy - web crawler framework (based on Twisted)

    • pyspider - a powerful crawler system

    • cola - a distributed crawler framework

    • Other

    • portia - visual crawler based on Scrapy

    • restkit - HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them

    • demiurge - a micro crawler framework based on PyQuery

HTML/XML parsing
    • General

    • lxml - efficient HTML/XML processing library written in C. Supports XPath

    • cssselect - parses the DOM tree with CSS selectors

    • pyquery - parses the DOM tree with jQuery-like selectors

    • BeautifulSoup - a relatively slow HTML/XML processing library, written in pure Python

    • html5lib - builds the DOM of HTML/XML documents according to the WHATWG specification, which is the specification all current browsers follow

    • feedparser - parses RSS/Atom feeds

    • MarkupSafe - safely escaped strings for XML/HTML/XHTML in Python

    • xmltodict - lets you work with XML as if it were JSON

    • xhtml2pdf - HTML/CSS to PDF converter

    • untangle - converts XML documents into Python objects for easier processing

    • hodor - configuration-driven wrapper around lxml and cssselect

    • Sanitizing

    • Bleach - sanitizes HTML (requires html5lib)

    • sanitize - brings sanity to the chaotic world of data

Text Processing

Libraries for parsing and manipulating text

    • difflib - helpers for computing differences (Python standard library)
    • Levenshtein - fast computation of edit distance and string similarity
    • fuzzywuzzy - fuzzy string matching
    • esmre - regular expression accelerator
    • ftfy - automatically cleans up Unicode text to reduce breakage
    • Unidecode - converts Unicode text to ASCII
    • Character encoding
    • uniout - prints readable characters instead of escaped strings
    • chardet - character encoding detector compatible with Python 2 and 3
    • xpinyin - a library for converting Chinese characters to pinyin
    • pangu.py - inserts spacing between CJK and alphanumeric characters in text
    • Slugify
    • awesome-slugify - a Python slugify library that can preserve Unicode
    • python-slugify - a Python slugify library that converts Unicode to ASCII
    • unicode-slugify - a tool for generating Unicode slugs
    • pytils - simple tools for processing Russian strings (includes pytils.translit.slugify)
    • General parser
    • PLY - Python implementation of the lex and yacc parsing tools
    • pyparsing - a general framework for generating parsers
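
As a small illustration of the difflib entry above, here is a minimal sketch that uses only the standard library. The sample strings and word lists are made up for this example and do not come from any of the listed projects.

```python
import difflib

# Similarity ratio between two strings: 0.0 means nothing in common, 1.0 means identical.
ratio = difflib.SequenceMatcher(None, "web crawler", "web crawlers").ratio()
print(round(ratio, 3))

# Closest candidates to a misspelled word, best match first.
matches = difflib.get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
print(matches)  # ['apple', 'ape']

# Unified diff of two small lists of lines (illustrative data).
old = ["requests", "urllib3", "grab"]
new = ["requests", "httplib2", "grab"]
diff = list(difflib.unified_diff(old, new, lineterm=""))
print("\n".join(diff))
```

Third-party packages such as Levenshtein or fuzzywuzzy cover similar ground with their own APIs; difflib is a reasonable first choice when no extra dependency is wanted.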


    • python-nameparser - parses human names into their components

    • Phone number

    • phonenumbers - parses, formats, stores, and validates international phone numbers

    • User Agent String

    • python-user-agents - browser user agent parser

    • HTTP Agent Parser - Python HTTP agent parser

    • fake-useragent - fake user agent generator for Python based on real-world browser statistics

    • user_agent - user agent data generator

Special format processing

Libraries for parsing and manipulating specific text formats


    • tablib - a library for tabular data in XLS, CSV, JSON, YAML, and other formats

    • textract - extracts text from any document: Word, PowerPoint, PDF, etc.

    • messytables - parses messy tabular data

    • rows - a common, beautiful interface to tabular data in multiple formats (currently CSV, HTML, XLS, TXT; more to come)


    • python-docx - reads, queries, and modifies Microsoft Word 2007/2008 docx files

    • xlwt/xlrd - write and read data and formatting information in Excel files

    • XlsxWriter - a Python module for writing Excel .xlsx files

    • xlwings - a BSD-licensed library that makes it easy to call Python from Excel and Excel from Python

    • openpyxl - a library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files

    • Marmir - takes Python data structures and turns them into spreadsheets


    • pdfminer - a tool for extracting information from PDF documents

    • PyPDF2 - a library for splitting, merging, and transforming PDF files

    • ReportLab - quickly creates rich PDF documents

    • pdftables - extracts tables from PDF files precisely


    • Python-Markdown - a Python implementation of John Gruber's Markdown

    • mistune - the fastest full-featured Markdown parser in pure Python

    • markdown2 - a fast and complete implementation of Markdown in pure Python


    • PyYAML - a YAML parser for Python


    • cssutils - a CSS library for Python


    • feedparser - universal feed parser


    • sqlparse - a non-validating SQL parser


    • http-parser - HTTP request/response message parser, implemented in C


    • opengraph - a Python module for parsing Open Graph Protocol tags

    • Portable Executable

    • pefile - a multi-platform module for parsing and working with Portable Executable (PE) files


    • psd-tools - reads Adobe Photoshop PSD files into Python data structures

Natural language Processing

Libraries for working with natural language

    • NLTK - the leading platform for natural language processing in Python
    • Pattern - a web mining module for Python, with tools for natural language processing, machine learning, and more
    • TextBlob - provides a consistent API for common natural language processing tasks; builds on NLTK and Pattern
    • jieba - Chinese word segmentation
    • SnowNLP - Chinese text processing library
    • loso - Chinese word segmentation library
    • genius - Chinese word segmentation based on conditional random fields
    • langid.py - stand-alone language identification system
    • Korean - a library for Korean morphology
    • pymorphy2 - morphological analyzer for Russian (POS tagging + inflection engine)
    • PyPLN - a distributed natural language processing pipeline written in Python. The project aims to make it easy to use NLTK to process large corpora through a web interface
    • langdetect - Python port of Google's language-detection library
Browser automation and Simulation
    • Browser

    • Selenium - automates real browsers (Chrome, Firefox, Opera, IE)

    • Ghost.py - wrapper around QtWebKit (requires PyQt)

    • Spynner - programmatic web browsing module with AJAX support

    • Splinter - browser emulator with a generic API (Selenium WebDriver, Django client, Zope)

    • Headless Tools

    • xvfbwrapper - Python wrapper for running a display inside the X virtual framebuffer (Xvfb)

Multi-process concurrency
    • threading - the standard library's multithreading module. Because of Python's GIL, it works well for I/O-bound tasks but gives no speedup for CPU-bound tasks
    • multiprocessing - the standard library's multiprocess module
    • celery - asynchronous task queue/job queue based on distributed message passing
    • concurrent.futures - provides a high-level interface for asynchronously executing callables
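
The note on threading and I/O-bound work can be illustrated with a minimal sketch that uses only the standard library. The fake_download function and the example URLs are hypothetical stand-ins: time.sleep simulates waiting on the network, and nothing is actually fetched.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Hypothetical stand-in for a network request; sleeping simulates I/O wait."""
    time.sleep(0.2)
    return url, len(url)


# Placeholder URLs, never actually fetched.
urls = ["http://example.com/page/%d" % i for i in range(5)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() preserves input order and runs the calls in worker threads.
    results = list(pool.map(fake_download, urls))
elapsed = time.perf_counter() - start

print(results)
print("elapsed: %.2fs" % elapsed)
```

While a thread sleeps it does not hold the GIL, so the five calls overlap and the total time should be close to a single wait rather than five. For CPU-bound work, ProcessPoolExecutor from the same module is the usual alternative, since each process has its own interpreter.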

Asynchronous network programming libraries

    • asyncio - asynchronous I/O, event loop, coroutines, and tasks (in the Python standard library since Python 3.4)
    • Twisted - event-driven networking engine framework
    • Tornado - a web framework and asynchronous networking library
    • pulsar - event-driven concurrency framework for Python
    • diesel - greenlet-based I/O framework for Python
    • gevent - a coroutine-based Python networking library that uses greenlet
    • eventlet - asynchronous framework with WSGI support
    • Tomorrow - magic decorator syntax for asynchronous code
    • celery - asynchronous task queue/job queue based on distributed message passing
    • huey - small multithreaded task queue
    • mrq - Mr. Queue, a distributed worker task queue in Python using Redis and gevent
    • RQ - a lightweight Redis-based task queue manager
    • simpleq - a simple, infinitely scalable, Amazon SQS-based queue
    • python-gearman - Python API for Gearman
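
To make the asyncio entry in the list above concrete, here is a minimal sketch of coroutines running concurrently on one event loop. The fake_fetch coroutine is a hypothetical stand-in that only sleeps, and asyncio.run assumes Python 3.7 or later.

```python
import asyncio


async def fake_fetch(name, delay):
    """Hypothetical stand-in for a network call; asyncio.sleep simulates the wait."""
    await asyncio.sleep(delay)
    return name


async def main():
    # gather() runs the coroutines concurrently on one event loop
    # and returns their results in the order they were passed in.
    return await asyncio.gather(
        fake_fetch("first", 0.2),
        fake_fetch("second", 0.1),
        fake_fetch("third", 0.05),
    )


results = asyncio.run(main())
print(results)  # ['first', 'second', 'third']
```

The results come back in argument order even though the third coroutine finishes first, which is convenient when responses need to be matched to their requests.
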
Cloud computing
    • PiCloud - executes Python code in the cloud
    • Dominoup.com - executes R, Python, and MATLAB code in the cloud

Email processing libraries

    • flanker - email address and MIME parsing library
    • Talon - Mailgun library for extracting quotations and signatures from messages
URL and network address operations

Libraries for parsing and manipulating URLs and network addresses

    • URL

    • furl - a small Python library that makes manipulating URLs simple

    • purl - a simple, immutable URL class with a clean API for inspecting and manipulating URLs

    • urllib.parse - splits a Uniform Resource Locator (URL) string into its components (addressing scheme, network location, path, etc.), combines components back into a URL string, and converts a "relative URL" into an absolute URL given a "base URL" (standard library)

    • tldextract - uses the Public Suffix List to accurately separate the TLD from the registered domain and subdomains of a URL

    • Network address

    • netaddr - Python library for representing and manipulating network addresses
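
The three operations described in the urllib.parse entry above (splitting, recombining, and resolving relative URLs) can be sketched as follows. The example.com URLs are placeholders chosen for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Placeholder URL used only for illustration.
url = "http://example.com:8080/docs/index.html?lang=en#intro"

# Split the URL string into its components.
parts = urlsplit(url)
print(parts.scheme)    # http
print(parts.netloc)    # example.com:8080
print(parts.path)      # /docs/index.html
print(parts.query)     # lang=en
print(parts.fragment)  # intro

# Combine the components back into a URL string.
rebuilt = urlunsplit(parts)
print(rebuilt == url)  # True

# Resolve a relative URL against a base URL.
absolute = urljoin("http://example.com/docs/index.html", "../images/logo.png")
print(absolute)        # http://example.com/images/logo.png
```

Crawlers typically need urljoin for every link they extract, since href values in HTML pages are often relative to the page they appear on.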

Page Content Extraction

Libraries for extracting web page content

    • Text and metadata of HTML pages

    • newspaper - news extraction, article extraction, and content curation in Python

    • html2text - converts HTML to Markdown-formatted text

    • python-goose - HTML content/article extractor

    • lassie - web content retrieval for humans

    • micawber - a small library for extracting rich content from URLs

    • sumy - a module for automatic summarization of text documents and HTML pages

    • Haul - an extensible image crawler

    • python-readability - fast Python port of arc90's readability tool

    • scrapely - a library for extracting structured data from HTML pages. Given some example pages and the data to be extracted, scrapely builds a parser for all similar pages

    • libextract - extracts data from websites

    • Video

    • youtube-dl - a small command-line tool for downloading videos from YouTube

    • you-get - YouTube/Youku/Niconico video downloader written in Python 3

    • Wiki

    • WikiTeam - tools for downloading and preserving wikis


Libraries for WebSocket

    • Crossbar - open-source application messaging router (WebSocket and WAMP implemented in Python, for Autobahn)
    • AutobahnPython - open-source Python implementation of the WebSocket and WAMP protocols
    • WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 and PyPy
DNS resolution
    • dnsyo - checks your DNS against more than 1500 DNS servers worldwide
    • pycares - interface to c-ares. c-ares is a C library for DNS requests and asynchronous name resolution
Computer Vision
    • OpenCV - open-source computer vision library
    • SimpleCV - concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV)
    • mahotas - fast computer vision algorithms (implemented entirely in C++) that operate on NumPy arrays as their data type
Proxy Server
    • shadowsocks - a fast tunnel proxy that helps you get through firewalls (supports TCP and UDP, TFO, multiple users, graceful restart, and destination IP blacklists)
    • tproxy - a simple TCP routing proxy (layer 7) built on gevent and configured in Python
    • user_agent - generates random, valid web navigator configurations and User-Agent HTTP headers
    • Awesome-python
    • Pycrumbs
    • Python-github-projects
    • Python_reference
    • Pythonidae

