This list contains Python libraries for crawling the web and processing data from web pages.
Network
- General
- urllib – network library (stdlib).
- requests – network library.
- grab – network library (based on pycurl).
- pycurl – network library (binding to libcurl).
- urllib3 – Python HTTP library with thread-safe connection pooling, file post support, and high availability.
- httplib2 – network library.
- RoboBrowser – a simple, Pythonic library for browsing the web without a standalone browser.
- MechanicalSoup – a Python library for automating interaction with websites.
- mechanize – stateful, programmatic web browsing library.
- socket – low-level networking interface (stdlib).
- Unirest for Python – Unirest is a set of lightweight HTTP libraries available in multiple languages.
- hyper – HTTP/2 client for Python.
- PySocks – an updated, actively maintained fork of SocksiPy, with bug fixes and some extra features. Acts as a drop-in replacement for the socket module.
- Asynchronous
- treq – an API similar to requests (based on Twisted).
- aiohttp – HTTP client/server for asyncio (PEP 3156).
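
As a rough illustration of what the network libraries above do, here is a minimal sketch that uses only the standard-library urllib. The helper name `build_request`, the `example.com` URLs and the User-Agent string are assumptions made for this example. The request objects are built but never sent, so the snippet runs without network access.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, params=None, form=None, user_agent="example-crawler/0.1"):
    """Build (but do not send) a urllib Request with a custom User-Agent.

    build_request is a hypothetical helper for this sketch, not part of urllib.
    """
    if params:
        # Append query parameters to the URL.
        url = f"{url}?{urlencode(params)}"
    # A bytes body turns the request into a POST; no body means GET.
    data = urlencode(form).encode("ascii") if form else None
    return Request(url, data=data, headers={"User-Agent": user_agent})


get_req = build_request("http://example.com/search", params={"q": "python"})
post_req = build_request("http://example.com/login", form={"user": "alice"})

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Passing either object to `urllib.request.urlopen()` would actually perform the request. Third-party libraries such as requests wrap the same steps in a shorter API.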
Web crawler frameworks
- Full-featured crawlers
- grab – web crawler framework (based on pycurl/multicurl).
- scrapy – web crawler framework (based on Twisted). Python 3 was not supported in early releases.
- pyspider – a powerful spider system.
- cola – a distributed crawler framework.
- Other
- portia – visual crawler based on Scrapy.
- restkit – HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.
- demiurge – scraping micro-framework based on PyQuery.
HTML/XML parsers
- General
- lxml – efficient HTML/XML processing library written in C. Supports XPath.
- cssselect – parses DOM trees with CSS selectors.
- pyquery – parses DOM trees with jQuery-like selectors.
- BeautifulSoup – slower HTML/XML processing library, implemented in pure Python.
- html5lib – builds the DOM of HTML/XML documents according to the WHATWG specification, which all current browsers follow.
- feedparser – parses RSS/Atom feeds.
- MarkupSafe – provides safely escaped strings for XML/HTML/XHTML.
- xmltodict – a Python module that makes working with XML feel like working with JSON.
- xhtml2pdf – converts HTML/CSS to PDF.
- untangle – easily converts XML files into Python objects.
- Sanitizing
- bleach – cleans up HTML (requires html5lib).
- sanitize – brings sanity to messy data.
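
The parsers above are all third-party packages. To show the kind of work they do, the sketch below uses the standard-library `html.parser` module as a stand-in (it is not one of the listed libraries) to collect links from a page. The `LinkCollector` class and the sample HTML are assumptions made for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


html = '<p><a href="/a.html">A</a> and <a href="http://example.com/b">B</a></p>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

Libraries such as lxml or pyquery express the same extraction as a single XPath or CSS selector query and cope better with malformed markup.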
Text processing
Libraries for parsing and manipulating plain text.
- difflib – (Python standard library) helps compute differences between sequences.
- Levenshtein – fast computation of Levenshtein distance and string similarity.
- fuzzywuzzy – fuzzy string matching.
- esmre – regular expression accelerator.
- ftfy – automatically fixes broken Unicode text and makes it more consistent.
- unidecode – converts Unicode text to ASCII.
- uniout – prints readable characters instead of escaped strings.
- chardet – character encoding detector compatible with Python 2 and 3.
- xpinyin – a library for converting Chinese characters to pinyin.
- pangu.py – adds spacing between CJK and alphanumeric characters in text.
- awesome-slugify – a Python slugify library that can preserve Unicode.
- python-slugify – a Python slugify library that can convert Unicode to ASCII.
- unicode-slugify – a tool that generates Unicode slugs.
- pytils – simple tools (including pytils.translit.slugify) for handling Russian strings.
- PLY – Python implementation of the lex and yacc parsing tools.
- pyparsing – a general framework for generating parsers.
- python-nameparser – parses human names into their individual components.
- phonenumbers – parses, formats, stores and validates international phone numbers.
- python-user-agents – browser user agent parser.
- HTTP Agent Parser – Python HTTP user agent parser.
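
Since difflib ships with Python, string comparison can be tried without installing anything. The following is a minimal sketch. The `similarity` helper and the sample titles are assumptions made for this example.

```python
import difflib


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


titles = ["Python crawler tools", "Python parser tools", "Browser automation"]

# Closest candidate to a misspelled query; candidates below the default
# cutoff of 0.6 are discarded.
matches = difflib.get_close_matches("Pyhton crawler tool", titles, n=1)
print(matches)
print(similarity("crawler", "crawlers"))
```

Levenshtein and fuzzywuzzy from the list above cover the same ground with different scoring and, in the case of Levenshtein, a faster C implementation.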
Specific format file processing
Libraries for parsing and processing specific text formats.
- tablib – a module that exports data to XLS, CSV, JSON, YAML, and more.
- textract – extracts text from a variety of files, such as Word, PowerPoint, PDF, and more.
- messytables – tools for parsing messy tabular data.
- rows – a common data interface that supports many formats (currently CSV, HTML, XLS and TXT, with more planned).
- python-docx – reads, queries, and modifies Microsoft Word 2007/2008 docx files.
- xlwt/xlrd – write and read data and formatting information in Excel files.
- XlsxWriter – a Python module for creating Excel .xlsx files.
- xlwings – a BSD-licensed library that makes it easy to call Python from Excel and vice versa.
- openpyxl – a library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
- marmir – takes Python data structures and converts them into spreadsheets.
- pdfminer – a tool that extracts information from PDF documents.
- PyPDF2 – a library that splits, merges, and transforms PDF pages.
- reportlab – lets you quickly create rich PDF documents.
- pdftables – extracts tables directly from PDF files.
- Python-Markdown – a Python implementation of John Gruber's Markdown.
- mistune – a fast, full-featured Markdown parser in pure Python.
- markdown2 – a fast Markdown implementation written entirely in Python.
- PyYAML – a YAML parser for Python.
- cssutils – a CSS library for Python.
- feedparser – a universal feed parser.
- sqlparse – a non-validating SQL statement parser.
- http-parser – HTTP request/response message parser implemented in C.
- opengraph – a Python module for parsing Open Graph protocol tags.
- pefile – a multi-platform module for parsing and working with Portable Executable (PE) files.
- psd-tools – reads Adobe Photoshop PSD files into Python data structures.
Natural language processing
Libraries for working with human languages.
- NLTK – a leading platform for building Python programs that work with human language data.
- Pattern – a web mining module for Python. It includes tools for natural language processing, machine learning, and more.
- TextBlob – provides a consistent API for common natural language processing tasks. Built on the shoulders of NLTK and Pattern.
- jieba – Chinese word segmentation tool.
- SnowNLP – Chinese text processing library.
- loso – another Chinese word segmentation library.
- genius – Chinese word segmentation based on conditional random fields.
- langid.py – stand-alone language identification system.
- korean – a Korean morphology library.
- pymorphy2 – Russian morphological analyzer (POS tagging + inflection engine).
- PyPLN – a distributed natural language processing pipeline written in Python. The goal of the project is to make it easy to use NLTK to process large corpora through a web interface.
Browser automation and simulation
- selenium – automates real browsers (Chrome, Mozilla Firefox, Opera, Internet Explorer).
- Ghost.py – wrapper around PyQt WebKit (requires PyQt).
- Spynner – wrapper around PyQt WebKit (requires PyQt).
- Splinter – generic API for browser emulators (Selenium WebDriver, Django client, Zope).
Multi-processing
- threading – Python standard library threads. Works well for I/O-bound tasks, but offers little benefit for CPU-bound tasks because of the Python GIL.
- multiprocessing – Python standard library module for running multiple processes.
- celery – asynchronous task queue/job queue based on distributed message passing.
- concurrent.futures – provides a high-level interface for asynchronously executing callables.
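
The note about threads and I/O-bound work can be sketched with concurrent.futures. In this minimal example, `fake_fetch` is a hypothetical stand-in that sleeps instead of touching the network, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for a network call: wait briefly, then return a result."""
    time.sleep(0.1)  # simulated I/O wait; the GIL is released while sleeping
    return url, len(url)


urls = [f"http://example.com/page/{i}" for i in range(5)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() returns results in input order even though the calls overlap.
    results = list(pool.map(fake_fetch, urls))
elapsed = time.perf_counter() - start

print(results[0])
print(f"{len(results)} fetches in {elapsed:.2f}s")
```

Because the waits overlap, the five calls should finish in roughly the time of one. For CPU-bound work, `ProcessPoolExecutor` from the same module offers the same interface backed by separate processes.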
Asynchronous
Libraries for asynchronous network programming.
- asyncio – (Python standard library since Python 3.4) asynchronous I/O, event loop, coroutines and tasks.
- Twisted – an event-driven networking engine framework.
- tornado – a web framework and asynchronous networking library.
- pulsar – event-driven concurrency framework for Python.
- diesel – greenlet-based event I/O framework for Python.
- gevent – a greenlet-based Python networking library.
- eventlet – asynchronous framework with WSGI support.
- tomorrow – magic decorator syntax for asynchronous code.
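
The event loop, coroutines and tasks mentioned for asyncio fit together roughly as in the sketch below. It assumes Python 3.7 or later for `asyncio.run`. `fake_fetch` and `crawl` are hypothetical names, and the sleep stands in for real network I/O.

```python
import asyncio


async def fake_fetch(url, delay):
    """Stand-in for an asynchronous network call."""
    await asyncio.sleep(delay)  # yields control to the event loop
    return f"fetched {url}"


async def crawl(urls):
    # Run all coroutines concurrently; results come back in input order.
    return await asyncio.gather(*(fake_fetch(u, 0.05) for u in urls))


pages = asyncio.run(crawl(["http://example.com/a", "http://example.com/b"]))
print(pages)
```

In a real crawler the sleep would be replaced by an asynchronous HTTP client such as aiohttp.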
Queue
- celery – asynchronous task queue/job queue based on distributed message passing.
- huey – small multi-threaded task queue.
- MRQ – Mr. Queue, a distributed task queue in Python using Redis and gevent.
- RQ – a lightweight task queue manager based on Redis.
- simpleq – a simple, infinitely scalable queue based on Amazon SQS.
- python-gearman – Python API for Gearman.
Cloud computing
- picloud – executes Python code in the cloud.
- dominoup.com – executes R, Python and MATLAB code in the cloud.
Email
Libraries for parsing email.
- flanker – email address and MIME parsing library.
- Talon – Mailgun library for extracting quotations and signatures from messages.
URL and network address operations
Libraries for parsing and modifying URLs and network addresses.
- URL
- furl – a small Python library that makes manipulating URLs simple.
- purl – a simple, immutable URL class with a clean API for interrogation and manipulation.
- urllib.parse – splits a Uniform Resource Locator (URL) string into its components (addressing scheme, network location, path, etc.), combines components back into a URL string, and converts a relative URL to an absolute URL given a base URL.
- tldextract – accurately separates the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
- Network address
- netaddr – a Python library for representing and manipulating network addresses.
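
The three jobs described for urllib.parse (splitting, recombining, and resolving relative URLs) can be sketched as follows. The `example.com` URL is a placeholder chosen for this example.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

base = "http://example.com/docs/index.html?lang=en#top"

# Split the URL into scheme, network location, path, query and fragment.
parts = urlsplit(base)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Recombine the components, dropping the query and fragment.
clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
print(clean)

# Resolve a relative link against the base URL.
absolute = urljoin(base, "../images/logo.png")
print(absolute)
```

A crawler typically runs every extracted link through `urljoin` with the page URL as the base before queueing it.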
Page content extraction
Libraries for extracting the content of web pages.
- Text and metadata of HTML pages
- newspaper – news extraction, article extraction, and content curation in Python.
- html2text – converts HTML to Markdown-formatted text.
- python-goose – HTML content/article extractor.
- lassie – human-friendly web content retrieval tool.
- micawber – a small library that extracts rich content from URLs.
- sumy – a module that automatically summarizes text files and HTML pages.
- Haul – an extensible image crawler.
- python-readability – fast Python port of arc90's readability tool.
- scrapely – a library for extracting structured data from HTML pages. Given some example web pages and the data to extract, scrapely builds a parser for all similar pages.
- Video
- youtube-dl – a small command-line program for downloading videos from YouTube.
- you-get – YouTube/Youku/Niconico video downloader written in Python 3.
- Wiki
- WikiTeam – tools for downloading and preserving wikis.
WebSocket
Libraries for working with WebSocket.
- Crossbar – open-source application messaging router (WebSocket and WAMP implementation in Python, for Autobahn).
- AutobahnPython – provides open-source Python implementations of the WebSocket and WAMP protocols.
- WebSocket-for-Python – WebSocket client and server library for Python 2 and 3 as well as PyPy.
DNS resolution
- dnsyo – checks your DNS against more than 1,500 DNS servers worldwide.
- pycares – interface to c-ares. c-ares is a C library for DNS requests and asynchronous name resolution.
Computer vision
- OpenCV – open-source computer vision library.
- SimpleCV – a concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
- mahotas – fast computer vision algorithms (implemented entirely in C++) that operate on NumPy arrays as their data type.
Proxy server
- shadowsocks – a fast tunnel proxy that helps you get through firewalls (supports TCP and UDP, TFO, multi-user and graceful restart, destination IP blacklist).
- tproxy – a simple TCP routing proxy (layer 7), built on gevent and configurable in Python.
Other lists of Python tools
- awesome-python
- pycrumbs
- python-github-projects
- python_reference
- pythonidae