156 Python web crawler Resources

Source: Internet
Author: User
Tags connection pooling python web crawler nltk



This list covers Python libraries for web crawling and data processing.


Network
    • General

        • urllib - network library (standard library)

        • requests - network library

        • grab - network library (based on pycurl)

        • pycurl - network library (binding to libcurl)

        • urllib3 - Python HTTP library with thread-safe connection pooling and file POST support

        • httplib2 - network library

        • RoboBrowser - a simple, Pythonic library for browsing the web without a standalone web browser

        • MechanicalSoup - Python library for automating interaction with websites

        • mechanize - stateful, programmatic web browsing library

        • socket - low-level networking interface (standard library)

        • Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages

        • hyper - HTTP/2 client for Python

        • PySocks - updated and actively maintained version of SocksiPy, with bug fixes and extra features; acts as a drop-in replacement for the socket module

    • Asynchronous

        • treq - requests-like API built on Twisted

        • aiohttp - HTTP client/server for asyncio (PEP 3156)

Web crawler frameworks
    • Full-featured crawlers

        • grab - web crawler framework (based on pycurl/multicurl)

        • Scrapy - web crawler framework (based on Twisted)

        • pyspider - a powerful spider (crawler) system

        • cola - a distributed crawler framework

    • Other

        • portia - visual crawler built on Scrapy

        • restkit - HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them

        • demiurge - PyQuery-based scraping micro-framework

HTML/XML parsing
    • General

        • lxml - efficient HTML/XML processing library written in C; supports XPath

        • cssselect - works with a DOM tree using CSS selectors

        • pyquery - works with a DOM tree using jQuery-like selectors

        • BeautifulSoup - slower HTML/XML processing library, written in pure Python

        • html5lib - builds the DOM of HTML/XML documents according to the WHATWG specification, the specification followed by current browsers

        • feedparser - parses RSS/Atom feeds

        • MarkupSafe - safely escaped strings for XML/HTML/XHTML in Python

        • xmltodict - lets you work with XML as if it were JSON

        • xhtml2pdf - HTML/CSS to PDF converter

        • untangle - converts XML documents into Python objects for easy access

        • hodor - configuration-driven wrapper around lxml and cssselect

    • Sanitizing

        • Bleach - cleans HTML (requires html5lib)

        • sanitize - bringing sanity to a world of messy data

Text Processing


Libraries for parsing and manipulating text

    • General
        • difflib - helpers for computing differences between sequences (Python standard library)
        • Levenshtein - fast computation of edit distance and string similarity
        • fuzzywuzzy - fuzzy string matching
        • esmre - regular expression accelerator
        • ftfy - automatically fixes broken Unicode text and makes it more consistent
    • Transliteration
        • Unidecode - ASCII transliterations of Unicode text
    • Character encoding
        • uniout - prints readable characters instead of escaped strings
        • chardet - character encoding detector compatible with Python 2 and 3
        • xpinyin - a library for converting Chinese characters to pinyin
        • pangu.py - inserts spacing between CJK and alphanumeric text
    • Slugify
        • awesome-slugify - a Python slugify library that can preserve Unicode
        • python-slugify - a Python slugify library that translates Unicode to ASCII
        • unicode-slugify - a slugifier that generates Unicode slugs
        • pytils - simple tools for handling Russian strings (includes pytils.translit.slugify)
    • General parsers
        • PLY - Python implementation of the lex and yacc parsing tools
        • pyparsing - a general-purpose framework for generating parsers
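
As a small illustration of the string-similarity tools in the General group above, here is a minimal sketch using only the standard-library difflib. The helper names `similarity` and `closest` are hypothetical and exist only for this example; third-party libraries such as Levenshtein or fuzzywuzzy expose their own, different APIs.

```python
import difflib


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    # SequenceMatcher compares the two sequences; None means "no junk filter".
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest(word, candidates):
    """Return up to three candidates that resemble `word`, best match first."""
    return difflib.get_close_matches(word, candidates, n=3, cutoff=0.6)


if __name__ == "__main__":
    print(similarity("web crawler", "web crawlers"))
    print(closest("scrapy", ["scrapely", "scrapy", "pyspider"]))
```

The ratio is handy for spotting near-duplicate titles or URLs in crawled data; the 0.6 cutoff is simply difflib's default and would need tuning for real data.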

    • Names

        • python-nameparser - parses human names into their individual components

    • Phone numbers

        • phonenumbers - parses, formats, stores, and validates international phone numbers

    • User agent strings

        • python-user-agents - browser user agent parser

        • HTTP Agent Parser - Python HTTP user agent parser

        • fake-useragent - user agent faker for Python based on real-world browser usage statistics

        • user_agent - user agent data generator

Specific format processing


Libraries for parsing and processing specific text formats


    • General

        • tablib - a library for tabular data in XLS, CSV, JSON, YAML, and other formats

        • textract - extracts text from any document, including Word, PowerPoint, and PDF

        • messytables - parses messy tabular data

        • rows - a common, beautiful interface to tabular data in many formats (currently CSV, HTML, XLS, and TXT; more planned)

    • Office

        • python-docx - reads, queries, and modifies Microsoft Word 2007/2008 docx files

        • xlwt / xlrd - write and read data and formatting information to and from Excel files

        • XlsxWriter - Python module for writing Excel .xlsx files

        • xlwings - a BSD-licensed library that makes it easy to call Python from Excel and Excel from Python

        • openpyxl - a library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files

        • Marmir - takes Python data structures and turns them into spreadsheets

    • PDF

        • PDFMiner - a tool for extracting information from PDF documents

        • PyPDF2 - a library that splits, merges, and transforms PDF files

        • ReportLab - allows rapid creation of rich PDF documents

        • pdftables - extracts tables from PDF files precisely

    • Markdown

        • Python-Markdown - a Python implementation of John Gruber's Markdown

        • mistune - a fast, full-featured Markdown parser in pure Python

        • markdown2 - a fast and complete implementation of Markdown in pure Python

    • YAML

        • PyYAML - a YAML parser for Python

    • CSS

        • cssutils - a CSS library for Python

    • Atom/RSS

        • feedparser - universal feed parser

    • SQL

        • sqlparse - a non-validating SQL parser

    • HTTP

        • http-parser - HTTP request/response message parser implemented in C

    • Microformats

        • opengraph - a Python module for parsing Open Graph protocol tags

    • Portable Executable

        • pefile - a multi-platform module for parsing and working with Portable Executable (PE) files

    • PSD

        • psd-tools - reads Adobe Photoshop PSD files into Python data structures

Natural language processing


Libraries for working with human languages


    • NLTK - the leading platform for natural language processing in Python
    • Pattern - a web mining module for Python, with tools for natural language processing, machine learning, and more
    • TextBlob - provides a consistent API for common natural language processing tasks; built on NLTK and Pattern
    • jieba - Chinese word segmentation
    • SnowNLP - Chinese text processing library
    • loso - Chinese word segmentation library
    • genius - Chinese word segmentation based on conditional random fields
    • langid.py - stand-alone language identification system
    • Korean - a library for Korean morphology
    • pymorphy2 - Russian morphological analyzer (POS tagging + inflection engine)
    • PyPLN - a distributed natural language processing pipeline written in Python. The project aims to provide an easy way to use NLTK to process large corpora through a web interface
    • langdetect - Python port of Google's language-detection library
Browser automation and emulation
    • Browsers

        • selenium - automates real browsers (Chrome, Firefox, Opera, IE)

        • Ghost.py - wrapper around QtWebKit (requires PyQt)

        • Spynner - programmatic web browsing module with AJAX support

        • Splinter - common API for browser emulators (Selenium WebDriver, Django client, Zope)

    • Headless tools

        • xvfbwrapper - Python wrapper for running a display inside an X virtual framebuffer (Xvfb)

Multiprocessing and concurrency
    • threading - the standard library's interface for running multiple threads. Because of Python's GIL, it works well for I/O-bound tasks but offers little benefit for CPU-bound tasks
    • multiprocessing - standard library for running multiple processes
    • celery - asynchronous task queue/job queue based on distributed message passing
    • concurrent.futures - standard-library module that provides a high-level interface for asynchronously executing callables
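
The threading entry above notes that threads suit I/O-bound work. A minimal sketch of that idea with concurrent.futures follows. `fake_fetch` is a hypothetical stand-in that sleeps instead of making a real network request, so the example runs offline; in a real crawler it would be replaced by an actual HTTP call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Hypothetical stand-in for a blocking download: sleeps, then returns a string."""
    time.sleep(0.1)  # simulates waiting on the network
    return f"fetched {url}"


def fetch_all(urls, max_workers=4):
    """Run fake_fetch over all URLs in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input iterable.
        return list(pool.map(fake_fetch, urls))


if __name__ == "__main__":
    urls = [f"http://example.com/page/{i}" for i in range(4)]
    start = time.perf_counter()
    results = fetch_all(urls)
    print(results)
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Because the threads spend their time waiting rather than computing, the four simulated downloads overlap instead of running back to back. For CPU-bound work, swapping in `ProcessPoolExecutor` from the same module is the usual alternative.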
Asynchronous


Asynchronous network programming libraries


    • asyncio - asynchronous I/O, event loop, coroutines, and tasks (Python standard library in Python 3.4 and later)
    • Twisted - event-driven networking engine framework
    • Tornado - a web framework and asynchronous networking library
    • pulsar - event-driven concurrency framework for Python
    • diesel - greenlet-based event I/O framework for Python
    • gevent - a coroutine-based Python networking library that uses greenlet
    • eventlet - asynchronous framework with WSGI support
    • Tomorrow - magic decorator syntax for asynchronous code
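
For a feel of the coroutine style that asyncio provides, here is a minimal sketch using only the standard library. `fake_fetch` and `crawl` are hypothetical names for this example, and the network wait is simulated with `asyncio.sleep`; a real crawler would pair this pattern with an asynchronous HTTP client such as aiohttp.

```python
import asyncio


async def fake_fetch(url, delay=0.05):
    """Hypothetical stand-in for an async download: waits, then returns a string."""
    await asyncio.sleep(delay)  # yields control to the event loop while "waiting"
    return f"fetched {url}"


async def crawl(urls):
    """Run all fetches concurrently and return results in input order."""
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


if __name__ == "__main__":
    # asyncio.run requires Python 3.7 or later.
    print(asyncio.run(crawl(["a", "b", "c"])))
```

All coroutines run on a single thread; concurrency comes from each one suspending at `await` while the others make progress.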
Queues
    • celery - asynchronous task queue/job queue based on distributed message passing
    • huey - small multi-threaded task queue
    • MRQ (Mr. Queue) - distributed worker task queue in Python using Redis and gevent
    • RQ - lightweight task queue manager based on Redis
    • simpleq - a simple, infinitely scalable queue based on Amazon SQS
    • python-gearman - Python API for Gearman
Cloud computing
    • PiCloud - executes Python code in the cloud
    • dominoup.com - executes R, Python, and MATLAB code in the cloud
Email


Libraries for processing email


    • flanker - email address and MIME parsing library
    • Talon - Mailgun's library for extracting quotations and signatures from messages
URL and network address manipulation


Libraries for manipulating URLs and network addresses


    • URL

        • furl - a small Python library that makes manipulating URLs simple

        • purl - a simple, immutable URL class with a clean API for interrogation and manipulation

        • urllib.parse - splits a URL string into its components (addressing scheme, network location, path, etc.), combines components back into a URL string, and converts a relative URL to an absolute URL given a base URL (standard library)

        • tldextract - accurately separates the TLD from the registered domain and subdomains of a URL, using the Public Suffix List

    • Network address

        • netaddr - Python library for representing and manipulating network addresses
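
The three urllib.parse operations described above (splitting, recombining, and resolving relative URLs) can be sketched as follows. `normalize_link` is a hypothetical helper written for this example, and dropping the fragment is an assumed crawler convention rather than anything the library requires.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_link(base_url, href):
    """Resolve href against base_url and drop any #fragment."""
    absolute = urljoin(base_url, href)  # relative URL -> absolute URL
    parts = urlsplit(absolute)          # scheme, netloc, path, query, fragment
    # Recombine the components with an empty fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


if __name__ == "__main__":
    print(normalize_link("http://example.com/a/b.html", "../c.html?x=1#top"))
    print(urlsplit("http://example.com:8080/p?q=1").netloc)
```

Normalizing links this way helps a crawler avoid visiting the same page twice under different fragment identifiers.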

Web content extraction


Libraries for extracting content from web pages


    • Text and metadata from HTML pages

        • newspaper - news extraction, article extraction, and content curation in Python

        • html2text - converts HTML to Markdown-formatted text

        • python-goose - HTML content/article extractor

        • lassie - web content retrieval for humans

        • micawber - a small library that extracts rich content from URLs

        • sumy - a module that automatically summarizes text files and HTML pages

        • Haul - an extensible image crawler

        • python-readability - fast Python port of arc90's readability tool

        • scrapely - a library that extracts structured data from HTML pages. Given some example pages and the data to extract from them, scrapely builds a parser for all similar pages

        • libextract - extracts data from websites

    • Video

        • youtube-dl - a small command-line tool for downloading videos from YouTube

        • you-get - YouTube/Youku/Niconico video downloader written in Python 3

    • Wiki

        • WikiTeam - tools for downloading and preserving wikis

WebSocket


Libraries for working with WebSocket


    • Crossbar - open-source application messaging router (Python implementation of WebSocket and WAMP for Autobahn)
    • AutobahnPython - open-source Python implementation of the WebSocket and WAMP protocols
    • WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 and PyPy
DNS resolution
    • dnsyo - checks your DNS against more than 1500 DNS servers worldwide
    • pycares - interface to c-ares, a C library for DNS requests and asynchronous name resolution
Computer vision
    • OpenCV - open-source computer vision library
    • SimpleCV - a concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV)
    • mahotas - fast computer vision algorithms (implemented entirely in C++) that operate on NumPy arrays
Proxy servers
    • shadowsocks - a fast tunnel proxy that helps you bypass firewalls (supports TCP and UDP, TFO, multi-user mode, graceful restart, and a destination IP blacklist)
    • tproxy - a simple TCP routing proxy (layer 7) built on gevent and configured in Python
Miscellaneous
    • user_agent - generates random, valid web navigator configurations and User-Agent HTTP headers
Other
    • Awesome-python
    • Pycrumbs
    • Python-github-projects
    • Python_reference
    • Pythonidae

