Python crawler tool list
This list collects Python libraries for web scraping and data processing.
Network
- General
- urllib - network library (stdlib).
- requests - network library.
- grab - network library (based on pycurl).
- pycurl - network library (binding to libcurl).
- urllib3 - Python HTTP library with thread-safe connection pooling, file post support, and high availability.
- httplib2 - network library.
- RoboBrowser - a simple, Pythonic library for browsing the web without a standalone web browser.
- MechanicalSoup - a Python library for automating interaction with websites.
- mechanize - stateful, programmable web browsing library.
- socket - low-level networking interface (stdlib).
- Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages.
- hyper - HTTP/2 client for Python.
- PySocks - an updated, actively maintained fork of SocksiPy with bug fixes and extra features. Works as a drop-in replacement for the socket module.
- Asynchronous
- treq - requests-like API (based on Twisted).
- aiohttp - HTTP client/server for asyncio (PEP 3156).
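As a rough illustration of the stdlib entry above, here is a minimal sketch using urllib.request from Python 3. The names USER_AGENT, build_request, and fetch are hypothetical helpers introduced for this example only, and the URL is a placeholder. Building the request object needs no network access; only fetch() actually contacts a server.

```python
from urllib.request import Request, urlopen

# Hypothetical crawler identifier; choose your own value.
USER_AGENT = "example-crawler/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Create a GET request object carrying a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download a page and return its body as text (performs real network I/O)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    req = build_request("http://example.com/")
    # urllib stores header names capitalized, hence "User-agent".
    print(req.get_method(), req.full_url, req.get_header("User-agent"))
```

Third-party libraries such as requests wrap the same steps in a shorter API; the stdlib version is shown here only because it has no dependencies.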
Web crawler frameworks
- Full-featured crawlers
- grab - web crawler framework (based on pycurl/multicurl).
- Scrapy - web crawler framework (based on Twisted). Early releases did not support Python 3; releases since 1.1 do.
- pyspider - a powerful crawler system.
- cola - a distributed crawler framework.
- Others
- portia - visual crawler built on Scrapy.
- restkit - HTTP resource toolkit for Python. It lets you easily access HTTP resources and build objects around them.
- demiurge - PyQuery-based crawler micro-framework.
HTML/XML Parser
- General
- lxml - efficient HTML/XML processing library written in C. Supports XPath.
- cssselect - parses CSS selectors and translates them to XPath for querying the DOM tree.
- pyquery - queries the DOM tree with jQuery-like selectors.
- BeautifulSoup - slower HTML/XML processing library, pure Python implementation.
- html5lib - builds the DOM of HTML/XML documents according to the WHATWG specification, which all modern browsers follow.
- feedparser - parses RSS/Atom feeds.
- MarkupSafe - provides safely escaped strings for XML, HTML, and XHTML.
- xmltodict - a Python module that makes working with XML feel like working with JSON.
- xhtml2pdf - converts HTML/CSS to PDF.
- untangle - easily converts XML files into Python objects.
- Sanitization
- Bleach - sanitizes HTML (requires html5lib).
- sanitize - brings sanity to a world of messy data.
Text Processing
Libraries for parsing and manipulating plain text.
- difflib - (Python standard library) helpers for computing differences.
- Levenshtein - fast computation of Levenshtein distance and string similarity.
- fuzzywuzzy - fuzzy string matching.
- esmre - regular expression accelerator.
- ftfy - automatically cleans up Unicode text to make it less broken.
- Unidecode - converts Unicode text to ASCII.
- uniout - prints readable characters instead of escaped strings.
- chardet - character encoding detector compatible with Python 2 and 3.
- xpinyin - a library for converting Chinese characters into pinyin.
- pangu.py - inserts spacing between CJK and alphanumeric characters in text.
- awesome-slugify - a Python slugify library that can preserve Unicode.
- python-slugify - a Python slugify library that converts Unicode to ASCII.
- unicode-slugify - a tool that generates Unicode slugs.
- pytils - simple tools for processing Russian strings (including pytils.translit.slugify).
- PLY - Python implementation of the lex and yacc parsing tools.
- pyparsing - a general framework for generating parsers.
- python-nameparser - parses human names into their components.
- phonenumbers - parses, formats, stores, and validates international phone numbers.
- python-user-agents - browser user agent parser.
- HTTP Agent Parser - Python HTTP user agent parser.
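To make the difflib entry above more concrete, here is a small sketch of string comparison with the standard library. The helper names similarity and closest are hypothetical and exist only for this example; third-party libraries in the list, such as Levenshtein and fuzzywuzzy, cover similar ground with different scoring.

```python
import difflib


def similarity(a, b):
    """Return a ratio between 0 and 1; 1.0 means the strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest(word, candidates, cutoff=0.6):
    """Return up to three candidates that resemble `word`, best match first."""
    return difflib.get_close_matches(word, candidates, n=3, cutoff=cutoff)


if __name__ == "__main__":
    print(similarity("web crawler", "web crawlers"))
    print(closest("scrappy", ["scrapy", "pyspider", "grab"]))
```

A crawler might use this kind of check to spot near-duplicate titles or to correct a misspelled keyword before searching.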
Processing specific format files
Libraries for parsing and processing specific text formats.
- tablib - a module for exporting data in XLS, CSV, JSON, YAML, and other formats.
- textract - extracts text from many file types, such as Word, PowerPoint, and PDF.
- messytables - a tool for parsing messy tabular data.
- rows - a common data interface that supports many formats (currently CSV, HTML, XLS, and TXT, with more planned).
- python-docx - reads, queries, and modifies Microsoft Word 2007/2008 docx files.
- xlwt/xlrd - write and read data and formatting information in Excel files.
- XlsxWriter - a Python module for creating Excel .xlsx files.
- xlwings - a BSD-licensed library that makes it easy to call Python from Excel and vice versa.
- openpyxl - a library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
- Marmir - takes Python data structures and converts them to spreadsheets.
- PDFMiner - a tool for extracting information from PDF documents.
- PyPDF2 - a library that can split, merge, and transform PDF pages.
- ReportLab - allows rapid creation of rich PDF documents.
- pdftables - extracts tables directly from PDF files.
- Python-Markdown - a Python implementation of John Gruber's Markdown.
- mistune - a fast, full-featured Markdown parser in pure Python.
- markdown2 - a fast Markdown implementation written entirely in Python.
- PyYAML - a YAML parser for Python.
- cssutils - a CSS library for Python.
- feedparser - a universal feed parser.
- sqlparse - a non-validating SQL parser.
- http-parser - HTTP request/response message parser implemented in C.
- opengraph - a Python module for parsing Open Graph protocol tags.
- pefile - a multi-platform module for parsing and working with Portable Executable (PE) files.
- psd-tools - reads Adobe Photoshop PSD files into Python data structures.
Natural Language Processing
Libraries for working with human language.
- NLTK - a leading platform for building Python programs that work with human language data.
- Pattern - web mining module for Python. It has tools for natural language processing, machine learning, and more.
- TextBlob - provides a consistent API for common NLP tasks. It stands on the shoulders of NLTK and Pattern.
- jieba - Chinese word segmentation tool.
- SnowNLP - Chinese text processing library.
- loso - another Chinese word segmentation library.
- genius - Chinese word segmentation based on conditional random fields.
- langid.py - a standalone language identification system.
- Korean - a Korean morphology library.
- pymorphy2 - Russian morphological analyzer (part-of-speech tagging + inflection engine).
- PyPLN - a distributed natural language processing pipeline written in Python. The project aims to provide an easy way to use NLTK to process large corpora through a web interface.
Browser automation and simulation
- Selenium - automates real browsers (Chrome, Firefox, Opera, and IE).
- Ghost.py - wrapper around PyQt WebKit (requires PyQt).
- Spynner - wrapper around PyQt WebKit (requires PyQt).
- Splinter - generic API for browser emulators (Selenium WebDriver, Django client, Zope).
Multiprocessing
- threading - thread support in the Python standard library. Very effective for I/O-bound tasks, but of little benefit for CPU-bound tasks because of the Python GIL.
- multiprocessing - Python standard library module for running multiple processes.
- celery - asynchronous task queue/job queue based on distributed message passing.
- concurrent.futures - a standard library module that provides a high-level interface for executing callables asynchronously.
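The remark above about threads suiting I/O-bound work can be sketched with concurrent.futures. In this example fake_fetch is a hypothetical stand-in that sleeps instead of making a real request, and fetch_all is likewise an illustrative helper; the URLs are placeholders. Because sleeping (like blocking network I/O) releases the GIL, the waits overlap across threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.05):
    """Stand-in for a network request: wait, then return a fake result."""
    time.sleep(delay)  # releases the GIL, as real blocking I/O does
    return (url, len(url))


def fetch_all(urls, max_workers=4):
    """Run fake_fetch over all URLs in a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input URLs.
        return list(pool.map(fake_fetch, urls))


if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(8)]
    start = time.perf_counter()
    results = fetch_all(urls)
    elapsed = time.perf_counter() - start
    print(results[:2], "in %.2fs" % elapsed)
```

For CPU-bound work such as heavy parsing, swapping in ProcessPoolExecutor from the same module is the usual way around the GIL, at the cost of pickling arguments and results between processes.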
Asynchronous
Libraries for asynchronous network programming.
- asyncio - (Python standard library since Python 3.4) asynchronous I/O, event loop, coroutines, and tasks.
- Twisted - event-driven networking engine framework.
- Tornado - a web framework and asynchronous networking library.
- pulsar - event-driven concurrency framework for Python.
- diesel - greenlet-based event I/O framework for Python.
- gevent - a coroutine-based Python networking library that uses greenlet.
- eventlet - asynchronous framework with WSGI support.
- Tomorrow - magic decorator syntax for asynchronous code.
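The event loop, coroutines, and tasks mentioned for asyncio can be sketched as follows. This assumes Python 3.7 or later for asyncio.run. The functions fake_fetch, crawl, and run are hypothetical names for this example, and fake_fetch only simulates a download by sleeping.

```python
import asyncio


async def fake_fetch(url, delay=0.05):
    """Coroutine standing in for a download; yields to the event loop while waiting."""
    await asyncio.sleep(delay)
    return url.upper()


async def crawl(urls):
    """Schedule all fetches concurrently and collect results in input order."""
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def run(urls):
    """Synchronous entry point: start an event loop and run the crawl."""
    return asyncio.run(crawl(urls))


if __name__ == "__main__":
    print(run(["http://example.com/a", "http://example.com/b"]))
```

In a real crawler the sleep would be replaced by an asynchronous HTTP client such as aiohttp, listed in the Network section.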
Queue
- celery - asynchronous task queue/job queue based on distributed message passing.
- huey - small multi-threaded task queue.
- mrq - Mr. Queue, a distributed job queue in Python using Redis and gevent.
- RQ - lightweight task queue manager based on Redis.
- simpleq - a simple, infinitely scalable queue based on Amazon SQS.
- python-gearman - Python API for Gearman.
Cloud computing
- PiCloud - runs Python code in the cloud.
- Dominoup.com - runs R, Python, and MATLAB code in the cloud.
Email
Libraries for parsing email.
- flanker - email address and MIME parsing library.
- Talon - Mailgun library for extracting quotations and signatures from messages.
Website and network address operations
Libraries for parsing and modifying URLs and network addresses.
- URL
- furl - a small Python library that simplifies URL manipulation.
- purl - a simple, immutable URL class with a clean API for interrogation and manipulation.
- urllib.parse - splits a uniform resource locator (URL) string into its components (addressing scheme, network location, path, etc.), combines components back into a URL string, and converts a relative URL into an absolute URL given a base URL.
- tldextract - accurately separates the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
- Network Address
- netaddr - a Python library for representing and manipulating network addresses.
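The three operations described for urllib.parse (splitting a URL, recombining it, and resolving a relative URL against a base) can be sketched with the standard library as below. The helper name describe and the example.com URLs are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def describe(url):
    """Split a URL string and return its main components as a dict."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
    }


base = "http://example.com/docs/index.html?lang=en"

# Splitting and recombining round-trips this URL unchanged.
rebuilt = urlunsplit(urlsplit(base))

# Resolve a relative link, as found in an href attribute, against the base URL.
absolute = urljoin(base, "../img/logo.png")

if __name__ == "__main__":
    print(describe(base))
    print(rebuilt)
    print(absolute)
```

Resolving links with urljoin before queueing them is a common step in crawlers, since pages often contain relative hrefs.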
Webpage content extraction
Libraries for extracting web content.
- Text and metadata of HTML pages
- newspaper - news extraction, article extraction, and content curation in Python.
- html2text - converts HTML into Markdown-formatted text.
- python-goose - HTML content/article extraction tool.
- lassie - a user-friendly web content retrieval tool.
- micawber - a small library for extracting rich content from URLs.
- sumy - a module that automatically summarizes text files and HTML pages.
- Haul - an extensible image crawler.
- python-readability - fast Python port of arc90's readability tool.
- scrapely - a library for extracting structured data from HTML pages. Given some example web pages and the data to extract, scrapely builds a parser for all similar pages.
- Video
- youtube-dl - a small command-line program for downloading videos from YouTube.
- you-get - Python 3 video downloader for YouTube, Youku, and Niconico.
- Wikipedia
- WikiTeam - tools for downloading and preserving wikis.
WebSocket
Libraries for working with WebSocket.
- Crossbar - open-source application messaging router (Python-based WebSocket and WAMP implementation for Autobahn).
- AutobahnPython - open-source Python implementation of the WebSocket protocol and the WAMP protocol.
- WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy.
DNS resolution
- dnsyo - checks your DNS against more than 1,500 DNS servers worldwide.
- pycares - interface to c-ares, a C library for DNS requests and asynchronous name resolution.
Computer Vision
- OpenCV - open-source computer vision library.
- SimpleCV - a concise, highly readable interface for cameras, image processing, feature extraction, and format conversion (based on OpenCV).
- mahotas - fast computer vision and image processing algorithms (fully implemented in C++) that use numpy arrays as their data type.
Proxy Server
- shadowsocks - a fast tunnel proxy that helps you get through firewalls (supports TCP and UDP, TFO, multi-user operation, graceful restart, and destination IP blacklists).
- tproxy - a simple TCP routing proxy (layer 7) built on gevent, with routing logic configured in Python.
List of other Python tools
- awesome-python
- pycrumbs
- python-github-projects
- python_reference
- pythonidae