List of tools for Python crawlers

Last Update:2017-07-11 Source: Internet

Author: User

Tags xml parser yaml parser nltk


0x00 Network

1) General purpose

urllib– network library (standard library).

requests– network library.

grab– network library (based on pycurl).

pycurl– network library (binding for libcurl).

urllib3– Python HTTP library with thread-safe connection pooling and file POST support.

httplib2– network library.

robobrowser– a simple, Pythonic library for browsing the web without a standalone browser.

MechanicalSoup– a Python library for automating interaction with websites.

mechanize– stateful, programmable web browsing library.

socket– low-level networking interface (standard library).

Unirest for Python– Unirest is a set of lightweight HTTP libraries available in multiple languages.

hyper– HTTP/2 client for Python.

PySocks– an updated and actively maintained version of SocksiPy, with bug fixes and extra features. Works as a drop-in replacement for the socket module.
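
As a rough illustration of the standard-library end of this list, the sketch below prepares a request with urllib and only sends it inside a separate function. The URL, the user agent string, and the helper names build_request and fetch are assumptions made up for this example, not part of any library.

```python
import urllib.request

# Hypothetical values chosen for illustration only.
EXAMPLE_URL = "http://example.com/index.html"
EXAMPLE_AGENT = "example-crawler/0.1"


def build_request(url, user_agent=EXAMPLE_AGENT):
    """Prepare a GET request with a custom User-Agent header (nothing is sent yet)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10.0):
    """Send the request and return the response body as bytes (needs network access)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request is purely local, so it is safe to inspect offline.
request = build_request(EXAMPLE_URL)
print(request.get_method(), request.full_url)
```

Third-party libraries such as requests wrap the same steps in a shorter API; this sketch only shows what the standard library offers on its own.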

2) Asynchronous

treq– an API similar to requests (based on Twisted).

aiohttp– HTTP client/server for asyncio (PEP 3156).

0x01 Web crawler framework

1) Full-featured crawler

grab– web crawler framework (based on pycurl/multicurl).

scrapy– web crawler framework (based on Twisted). Python 3 is supported as of Scrapy 1.1.

pyspider– a powerful spider (crawler) system.

cola– a distributed crawler framework.

2) Other

portia– visual crawler based on Scrapy.

restkit– HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.

demiurge– crawler micro-framework based on PyQuery.

0x02 HTML/XML parser

1) General purpose

lxml– efficient HTML/XML processing library written in C. Supports XPath.

cssselect– query a DOM tree with CSS selectors.

pyquery– query a DOM tree with jQuery-style selectors.

beautifulsoup– slower HTML/XML processing library, implemented in pure Python.

html5lib– builds the DOM of HTML/XML documents according to the WHATWG specification, which all current browsers follow.

feedparser– parses RSS/Atom feeds.

MarkupSafe– provides safely escaped strings for XML/HTML/XHTML.

xmltodict– a Python module that makes working with XML feel like working with JSON.

xhtml2pdf– converts HTML/CSS to PDF.

untangle– easily converts an XML file into Python objects.

2) Clean up

bleach– cleans up HTML (requires html5lib).

sanitize– brings sanity to the world of messy data.

0x03 Text Processing

Libraries for parsing and manipulating simple text.

1) General purpose

difflib– (Python standard library) helpers for computing differences between sequences.

Levenshtein– fast computation of Levenshtein distance and string similarity.

fuzzywuzzy– fuzzy string matching.

esmre– regular expression accelerator.

ftfy– automatically cleans up Unicode text to make it less broken.
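
To make "string similarity" and "Levenshtein distance" concrete, here is a small sketch. The similarity function uses difflib from the standard library; the levenshtein function is a plain dynamic-programming version written for this example, not the code of the Levenshtein package, which is implemented in C and much faster.

```python
import difflib


def similarity(a, b):
    """Return difflib's similarity ratio in [0, 1] for two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def levenshtein(a, b):
    """Edit distance counting single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(similarity("crawler", "crawling"))
```

A crawler might use either measure to spot near-duplicate titles or URLs; which threshold counts as "similar" depends on the data.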

2) Conversion

unidecode– convert Unicode text to ASCII.

3) Character encoding

uniout– prints readable characters instead of escaped strings.

chardet– character encoding detector, compatible with Python 2 and 3.

xpinyin– a library to convert Chinese characters to pinyin.

pangu.py– adds spacing between CJK and alphanumeric characters in text.

4) Slug

awesome-slugify– a Python slugify library that can preserve Unicode.

python-slugify– a Python slugify library that can convert Unicode to ASCII.

unicode-slugify– a tool that can generate Unicode slugs.

pytils– simple tools (including pytils.translit.slugify) for handling Russian strings.
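
For readers unfamiliar with the term, a slug is a URL-safe version of a title. The sketch below builds an ASCII-only slug with the standard library alone. It is a simplified illustration, not how any of the libraries above are implemented, and the function name slugify is chosen only for this example. Note that text with no ASCII equivalent (such as Chinese) is dropped entirely, which is the gap the Unicode-preserving libraries fill.

```python
import re
import unicodedata


def slugify(text):
    """Convert text to a lowercase ASCII slug with hyphen separators."""
    # Decompose accented characters, then drop anything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


print(slugify("Héllo Wörld!"))
```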

5) General parsers

PLY– Python implementation of the lex and yacc parsing tools.

pyparsing– a general framework for building parsers.

6) Personal names

python-nameparser– parses human names into their individual components.

7) Phone number

phonenumbers– parses, formats, stores and validates international phone numbers.

8) User Agent string

python-user-agents– browser user agent parser.

HTTP Agent Parser– HTTP user agent parser for Python.

0x04 Processing of specific file formats

Libraries that parse and process specific text formats.

1) General purpose

tablib– a module that exports data to XLS, CSV, JSON, YAML, and more.

textract– extracts text from a variety of files, such as Word, PowerPoint, PDF, and more.

messytables– tools for parsing messy tabular data.

rows– a common data interface supporting many formats (currently CSV, HTML, XLS, TXT; more to come).

2) Office

python-docx– reads, queries, and modifies Microsoft Word 2007/2008 docx files.

xlwt/xlrd– write and read data and formatting information to and from Excel files.

xlsxwriter– a Python module for creating Excel .xlsx files.

xlwings– a BSD-licensed library that makes it easy to call Python from Excel and vice versa.

openpyxl– a library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.

marmir– takes Python data structures and converts them into spreadsheets.

3) PDF

pdfminer– a tool that extracts information from a PDF document.

PyPDF2– a library that splits, merges, and transforms PDF pages.

reportlab– allows you to quickly create rich PDF documents.

pdftables– extracts the table directly from the PDF file.

4) Markdown

python-markdown– a Python implementation of John Gruber's Markdown.

mistune– the fastest full-featured Markdown parser in pure Python.

markdown2– a fast and complete implementation of Markdown in Python.

5) YAML

PyYAML– a YAML parser for Python.

6) CSS

cssutils– a Python CSS library.

7) Atom/RSS

feedparser– a generic feed parser.

8) SQL

sqlparse– a non-validating SQL statement parser.

9) HTTP

http-parser– HTTP request/response message parser implemented in C.

10) Microformats

opengraph– a Python module for parsing Open Graph protocol tags.

11) Portable Executable

pefile– a multi-platform module for parsing and working with Portable Executable (PE) files.

12) PSD

psd-tools– reads Adobe Photoshop PSD files into Python data structures.

0x05 Natural language processing

Libraries for working with human language.

NLTK– a leading platform for building Python programs that work with human language data.

Pattern– a web mining module for Python, with tools for natural language processing, machine learning, and more.

TextBlob– provides a consistent API for common natural language processing tasks. Built on top of NLTK and Pattern.

jieba– Chinese word segmentation tool.

snownlp– Chinese text processing library.

loso– another Chinese word segmentation library.

genius– Chinese word segmentation based on conditional random fields.

langid.py– standalone language identification system.

korean– a Korean morphological library.

pymorphy2– Russian morphological analyzer (POS tagging + inflection engine).

pypln– a distributed natural language processing pipeline written in Python. The goal of the project is to provide an easy way to use NLTK to process large corpora through a web interface.

0x06 Browser automation and emulation

selenium– automates real browsers (Chrome, Firefox, Opera, Internet Explorer).

ghost.py– wrapper around PyQt WebKit (requires PyQt).

spynner– wrapper around PyQt WebKit (requires PyQt).

splinter– generic browser emulation API (Selenium WebDriver, Django client, Zope).

0x07 Multiprocessing

threading– Python standard library threads. Works well for I/O-bound tasks, but is of little use for CPU-bound tasks because of the Python GIL.

multiprocessing– Python standard library for running multiple processes.

celery– asynchronous task queue/job queue based on distributed message passing.

concurrent.futures– provides a high-level interface for asynchronously executing callables.
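
The point about threads and I/O-bound work can be sketched with concurrent.futures. In this example the "download" is simulated with time.sleep so that it runs without a network; fake_download and run_in_threads are hypothetical names invented for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(page_id):
    """Stand-in for an I/O-bound task such as an HTTP request (hypothetical)."""
    time.sleep(0.05)  # simulate waiting on the network; the GIL is released here
    return page_id * 2


def run_in_threads(page_ids, workers=4):
    """Run fake_download concurrently and return the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_download, page_ids))


results = run_in_threads(range(8))
print(results)
```

Because the threads spend their time waiting rather than computing, the eight calls overlap. For CPU-bound work, a process pool (ProcessPoolExecutor or multiprocessing) is the usual substitute, since each process has its own interpreter and GIL.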

0x08 Asynchronous

Asynchronous network programming libraries.

asyncio– (Python standard library in Python 3.4+) asynchronous I/O, event loop, coroutines and tasks.

twisted– an event-driven networking engine framework.

tornado– a web framework and asynchronous networking library.

pulsar– event-driven concurrency framework for Python.

diesel– greenlet-based event I/O framework for Python.

gevent– a greenlet-based Python networking library.

eventlet– asynchronous framework with WSGI support.

tomorrow– magic decorator syntax for asynchronous code.
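
As a minimal sketch of the event loop, coroutine, and task model that asyncio provides, the example below runs two simulated fetches concurrently. The coroutine names and delays are made up for illustration, and asyncio.run assumes Python 3.7 or later.

```python
import asyncio


async def fake_fetch(name, delay):
    """Stand-in coroutine for a network call (hypothetical)."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return f"{name} done"


async def crawl_all():
    """Run both coroutines concurrently and collect results in call order."""
    return await asyncio.gather(
        fake_fetch("page-a", 0.02),
        fake_fetch("page-b", 0.01),
    )


results = asyncio.run(crawl_all())
print(results)
```

asyncio.gather returns results in the order the coroutines were passed in, regardless of which one finishes first. A real crawler would replace fake_fetch with a client such as aiohttp.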

0x09 Queue

celery– asynchronous task queue/job queue based on distributed message passing.

huey– small multi-threaded task queue.

mrq– Mr. Queue, a distributed task queue in Python using Redis and gevent.

rq– a lightweight, Redis-based task queue manager.

SimpleQ– a simple, infinitely scalable, Amazon SQS-based queue.

python-gearman– Python API for Gearman.

0x0A Cloud Computing

picloud– executes Python code in the cloud.

dominoup.com– executes R, Python and MATLAB code in the cloud.

0x0B Email

Email parsing libraries.

flanker– email address and MIME parsing library.

talon– Mailgun library for extracting quotations and signatures from messages.

0x0C URL and network address operations

Libraries for parsing and modifying URLs and network addresses.

1) URL

furl– a small Python library that makes manipulating URLs simple.

purl– a simple, immutable URL class with a clean API for interrogation and manipulation.

urllib.parse– splits a Uniform Resource Locator (URL) string into its components (addressing scheme, network location, path, etc.), combines components back into a URL string, and converts a "relative URL" to an absolute URL given a "base URL".

tldextract– accurately separates the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
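
The three operations described for urllib.parse (split, recombine, resolve a relative URL against a base) can be sketched as follows. The URLs are made-up examples.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Hypothetical URL used only for illustration.
url = "http://example.com/docs/index.html?page=2#top"

# 1. Split the URL into scheme, network location, path, query and fragment.
parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# 2. Combine the components back into a URL string.
rebuilt = urlunsplit(parts)

# 3. Resolve a relative link, as found in a crawled page, against its base URL.
absolute = urljoin("http://example.com/docs/index.html", "../img/logo.png")
print(absolute)
```

Step 3 is the one crawlers rely on most: links extracted from a page are often relative and have to be resolved against the URL of the page they came from before they can be queued.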

2) Network Address

netaddr– a Python library for representing and manipulating network addresses.

0x0D Page content extraction

Libraries for extracting the content of web pages.

1) Text and metadata of HTML pages

newspaper– news extraction, article extraction, and content curation in Python.

html2text– converts HTML to Markdown-formatted text.

python-goose– HTML content/article extractor.

lassie– human-friendly web content retrieval tool.

micawber– a small library that extracts rich content from URLs.

sumy– a module that automatically summarizes text files and HTML pages.

haul– an extensible image crawler.

python-readability– fast Python port of arc90's readability tool.

scrapely– a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely builds a parser for all similar pages.

2) Video

youtube-dl– a small command-line program to download videos from YouTube.

you-get– YouTube/Youku/Niconico video downloader written in Python 3.

3) Wiki

wikiteam– tools for downloading and preserving wikis.

0x0E WebSocket

Libraries for working with WebSocket.

crossbar– open-source application messaging router (Python implementation of WebSocket and WAMP for Autobahn).

AutobahnPython– provides open-source Python implementations of the WebSocket protocol and the WAMP protocol.

WebSocket-for-Python– WebSocket client and server library for Python 2 and 3 as well as PyPy.

0x11 DNS resolution

dnsyo– checks your DNS against more than 1500 DNS servers worldwide.

pycares– interface to c-ares. c-ares is a C library that performs DNS requests and name resolution asynchronously.

0x12 Computer Vision

opencv– Open source Computer Vision Library.

SimpleCV– a readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).

mahotas– fast computer vision algorithms (implemented entirely in C++), operating on NumPy arrays as its data type.

0x13 Proxy server

shadowsocks– a fast tunnel proxy that helps you get through firewalls (supports TCP and UDP, TFO, multi-user, graceful restart, and destination IP blacklist).

tproxy– a simple TCP routing proxy (layer 7) built on gevent and configured in Python.

0x14 Other Python tool lists

Awesome-python

Pycrumbs

Python-github-projects

Python_reference

Pythonidae

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
