List of tools for Python crawlers

Source: Internet
Author: User

This list collects Python libraries for crawling web pages and processing the data they contain.

Network
    • General
      • urllib - Network library (stdlib).
      • requests - Network library.
      • grab - Network library (based on pycurl).
      • pycurl - Network library (binding to libcurl).
      • urllib3 - Python HTTP library with thread-safe connection pooling and file POST support.
      • httplib2 - Network library.
      • RoboBrowser - A simple, Pythonic library for browsing the web without a standalone browser.
      • MechanicalSoup - A Python library for automating interaction with websites.
      • mechanize - Stateful, programmatic web browsing library.
      • socket - Low-level networking interface (stdlib).
      • Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages.
      • hyper - HTTP/2 client for Python.
      • PySocks - An updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement for the socket module.
    • Asynchronous
      • treq - An API similar to requests (based on Twisted).
      • aiohttp - HTTP client/server for asyncio (PEP 3156).
Web crawler frameworks
    • Full-featured crawlers
      • grab - Web crawler framework (based on pycurl/multicurl).
      • Scrapy - Web crawler framework (based on Twisted). Python 3 is supported as of Scrapy 1.1.
      • pyspider - A powerful spider (crawler) system.
      • cola - A distributed crawler framework.
    • Other
      • portia - Visual scraping based on Scrapy.
      • restkit - HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.
      • demiurge - A scraping micro-framework based on PyQuery.
HTML/XML parsers
    • General
      • lxml - An efficient HTML/XML processing library written in C. Supports XPath.
      • cssselect - Work with the DOM tree using CSS selectors.
      • pyquery - Work with the DOM tree using jQuery-like selectors.
      • BeautifulSoup - A slower HTML/XML processing library, implemented in pure Python.
      • html5lib - Builds the DOM of HTML/XML documents according to the WHATWG specification, which is used by all current browsers.
      • feedparser - Parses RSS/Atom feeds.
      • MarkupSafe - Provides a string type that is safely escaped for XML/HTML/XHTML.
      • xmltodict - A Python module that makes working with XML feel like working with JSON.
      • xhtml2pdf - Converts HTML/CSS to PDF.
      • untangle - Converts XML documents to Python objects for easy access.
    • Sanitizing
      • Bleach - Cleans up HTML (requires html5lib).
      • sanitize - Brings sanity to the world of messed-up data.
Text processing

Libraries for parsing and manipulating plain text.

    • General
      • difflib - (Python standard library) Helpers for computing differences between sequences.
      • Levenshtein - Fast computation of Levenshtein distance and string similarity.
      • fuzzywuzzy - Fuzzy string matching.
      • esmre - Regular expression accelerator.
      • ftfy - Automatically repairs broken Unicode text and makes it more consistent.
    • Transliteration
      • unidecode - Transliterates Unicode text to ASCII.
    • Character encoding
      • uniout - Prints readable characters instead of escaped strings.
      • chardet - A character encoding detector compatible with Python 2 and 3.
      • xpinyin - A library for converting Chinese characters to pinyin.
      • pangu.py - Adds spacing between CJK and alphanumeric characters in text.
    • Slugify
      • awesome-slugify - A Python slugify library that can preserve Unicode.
      • python-slugify - A Python slugify library that converts Unicode to ASCII.
      • unicode-slugify - A tool that generates Unicode slugs.
      • pytils - Simple tools for handling Russian strings (including pytils.translit.slugify).
    • General parsers
      • PLY - A Python implementation of the lex and yacc parsing tools.
      • pyparsing - A general-purpose framework for generating parsers.
    • Human names
      • python-nameparser - Parses human names into their individual components.
    • Phone numbers
      • phonenumbers - Parses, formats, stores, and validates international phone numbers.
    • User agent strings
      • python-user-agents - A parser for browser user agent strings.
      • HTTP Agent Parser - An HTTP user agent parser for Python.
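
As an illustration of the difflib entry in the General group above, here is a minimal sketch that uses only the standard library. The titles and the misspelled input are made-up sample data, not taken from any real crawl, and the cutoff of 0.6 is just the library's default written out explicitly.

```python
import difflib

# Hypothetical data: a list of known titles and a misspelled scraped title.
known_titles = ["Python Web Scraping", "Data Cleaning Basics", "Async Crawlers"]
scraped_title = "Pyhton Web Scrapping"

# SequenceMatcher.ratio() returns a similarity score between 0.0 and 1.0.
ratio = difflib.SequenceMatcher(None, "crawler", "crawlers").ratio()

# get_close_matches() returns the best candidates whose score reaches the cutoff.
matches = difflib.get_close_matches(scraped_title, known_titles, n=1, cutoff=0.6)

print(round(ratio, 3))
print(matches)
```

For large volumes of strings, the Levenshtein and fuzzywuzzy entries in the same group are likely the more practical choice; difflib has the advantage of needing no installation.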
Processing specific file formats

Libraries for parsing and processing specific text formats.

    • General
      • tablib - A module for exporting data to XLS, CSV, JSON, YAML, and more.
      • textract - Extracts text from a variety of files, such as Word, PowerPoint, and PDF.
      • messytables - Tools for parsing messy tabular data.
      • rows - A common interface to tabular data, whatever the format (currently CSV, HTML, XLS, and TXT, with more planned).
    • Office
      • python-docx - Reads, queries, and modifies Microsoft Word 2007/2008 docx files.
      • xlwt/xlrd - Write and read data and formatting information in Excel files.
      • XlsxWriter - A Python module for creating Excel .xlsx files.
      • xlwings - A BSD-licensed library that makes it easy to call Python from Excel, and vice versa.
      • openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
      • Marmir - Takes Python data structures and turns them into spreadsheets.
    • PDF
      • PDFMiner - A tool for extracting information from PDF documents.
      • PyPDF2 - A library that splits, merges, and transforms PDF pages.
      • ReportLab - Lets you quickly create rich PDF documents.
      • pdftables - Extracts tables directly from PDF files.
    • Markdown
      • Python-Markdown - A Python implementation of John Gruber's Markdown.
      • Mistune - A fast, full-featured Markdown parser in pure Python.
      • markdown2 - A fast and complete Python implementation of Markdown.
    • YAML
      • PyYAML - A YAML parser for Python.
    • CSS
      • cssutils - A CSS library for Python.
    • Atom/RSS
      • feedparser - A universal feed parser.
    • SQL
      • sqlparse - A non-validating SQL statement parser.
    • HTTP
      • http-parser - An HTTP request/response message parser for Python, implemented in C.
    • Microformats
      • opengraph - A Python module for parsing Open Graph protocol tags.
    • Portable Executable
      • pefile - A multi-platform module for parsing and working with Portable Executable (PE) files.
    • PSD
      • psd-tools - Reads Adobe Photoshop PSD files into Python data structures.
Natural language processing

Libraries for working with human language.

    • NLTK - A leading platform for building Python programs that work with human language data.
    • Pattern - A web mining module for Python. It includes tools for natural language processing, machine learning, and more.
    • TextBlob - Provides a consistent API for common natural language processing tasks. Built on the shoulders of NLTK and Pattern.
    • jieba - Chinese word segmentation tool.
    • SnowNLP - Chinese text processing library.
    • loso - Another Chinese word segmentation library.
    • genius - Chinese word segmentation based on conditional random fields.
    • langid.py - A stand-alone language identification system.
    • Korean - A Korean morphology library.
    • pymorphy2 - Russian morphological analyzer (POS tagging + inflection engine).
    • PyPLN - A distributed natural language processing pipeline written in Python. The goal of the project is to provide an easy way to use NLTK to process large corpora through a web interface.
Browser automation and emulation
    • selenium - Automates real browsers (Chrome, Firefox, Opera, Internet Explorer).
    • Ghost.py - A wrapper around PyQt's WebKit (requires PyQt).
    • Spynner - A wrapper around PyQt's WebKit (requires PyQt).
    • Splinter - A universal API for browser emulators (Selenium WebDriver, Django client, Zope).
Multiprocessing
    • threading - The Python standard library module for running threads. Works well for I/O-bound tasks, but is of little use for CPU-bound tasks because of the Python GIL.
    • multiprocessing - The Python standard library module for running multiple processes.
    • celery - An asynchronous task queue/job queue based on distributed message passing.
    • concurrent-futures - The concurrent.futures module provides a high-level interface for asynchronously executing callables.
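
The note above that threads suit I/O-bound work can be sketched with the standard-library concurrent.futures module. The function fake_download is a hypothetical stand-in that only sleeps; a real crawler would perform an HTTP request there. The worker count and page IDs are arbitrary illustration values.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(page_id):
    """Hypothetical stand-in for an I/O-bound task such as an HTTP request."""
    time.sleep(0.1)  # the GIL is released while a thread waits here
    return page_id * 2


page_ids = [1, 2, 3, 4]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() runs the calls concurrently and yields results in input order.
    results = list(pool.map(fake_download, page_ids))
elapsed = time.perf_counter() - start

print(results)
print(f"finished in {elapsed:.2f}s")
```

Because the four waits overlap, the total time should be close to one sleep rather than four. For CPU-bound work, swapping in ProcessPoolExecutor (with the usual `if __name__ == "__main__":` guard) is the usual way around the GIL.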
Asynchronous

Libraries for asynchronous network programming.

    • asyncio - (Python standard library in Python 3.4 and later) Asynchronous I/O, event loop, coroutines, and tasks.
    • Twisted - An event-driven networking engine.
    • Tornado - A web framework and asynchronous networking library.
    • pulsar - An event-driven concurrency framework for Python.
    • diesel - A greenlet-based event I/O framework for Python.
    • gevent - A coroutine-based Python networking library built on greenlet.
    • eventlet - An asynchronous framework with WSGI support.
    • Tomorrow - Magic decorator syntax for asynchronous code.
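
The event loop, coroutines, and tasks mentioned for asyncio can be illustrated with a short standard-library sketch. The coroutine fake_fetch is hypothetical and only sleeps; in practice an asynchronous HTTP client such as aiohttp would do the actual request. It assumes Python 3.7 or later for asyncio.run.

```python
import asyncio


async def fake_fetch(name, delay):
    """Hypothetical coroutine standing in for an asynchronous HTTP request."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return f"{name} done"


async def main():
    # gather() runs both coroutines concurrently and keeps the argument order.
    return await asyncio.gather(
        fake_fetch("page-a", 0.02),
        fake_fetch("page-b", 0.01),
    )


results = asyncio.run(main())
print(results)
```

Even though page-b finishes first, gather returns the results in the order the coroutines were passed in.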
Queues
    • celery - An asynchronous task queue/job queue based on distributed message passing.
    • huey - A small multithreaded task queue.
    • mrq - Mr. Queue, a distributed task queue in Python using Redis and gevent.
    • RQ - A lightweight task queue manager based on Redis.
    • simpleq - A simple, infinitely scalable queue based on Amazon SQS.
    • python-gearman - A Python API for Gearman.
Cloud computing
    • picloud - Executes Python code in the cloud.
    • dominoup.com - Executes R, Python, and MATLAB code in the cloud.
Email

Libraries for parsing email.

    • flanker - An email address and MIME parsing library.
    • Talon - A Mailgun library for extracting quotations and signatures from messages.
URL and network address manipulation

Libraries for parsing and modifying URLs and network addresses.

    • URL
      • furl - A small Python library that makes manipulating URLs simple.
      • purl - A simple, immutable URL class with a clean API for inspection and manipulation.
      • urllib.parse - Breaks a Uniform Resource Locator (URL) string into components (addressing scheme, network location, path, etc.), combines components back into a URL string, and converts a "relative URL" to an absolute URL given a "base URL".
      • tldextract - Accurately separates the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
    • Network address
      • netaddr - A Python library for representing and manipulating network addresses.
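
The three operations described for urllib.parse (splitting a URL, recombining it, and resolving a relative URL against a base) can be sketched as follows. The example.com URL and the relative link are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Hypothetical page URL and a relative link found on that page.
base_url = "https://example.com/docs/index.html?lang=en#intro"
relative_link = "../images/logo.png"

# 1. Break the URL into its components.
parts = urlsplit(base_url)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# 2. Combine the components back into a URL string.
rebuilt = urlunsplit(parts)

# 3. Resolve the relative link against the base URL.
absolute_link = urljoin(base_url, relative_link)
print(absolute_link)  # https://example.com/images/logo.png
```

A crawler typically applies the third step to every href it extracts, so that links can be queued as absolute URLs.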

Page content extraction

Libraries for extracting the content of web pages.

    • Text and metadata of HTML pages
      • newspaper - News extraction, article extraction, and content curation in Python.
      • html2text - Converts HTML to Markdown-formatted text.
      • python-goose - HTML content/article extractor.
      • lassie - Web content retrieval for humans.
      • micawber - A small library that extracts rich content from URLs.
      • sumy - A module that automatically summarizes text files and HTML pages.
      • Haul - An extensible image crawler.
      • python-readability - A fast Python port of arc90's readability tool.
      • scrapely - A library that extracts structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely builds a parser for all similar pages.
    • Video
      • youtube-dl - A small command-line program for downloading videos from YouTube.
      • you-get - A video downloader for YouTube, Youku, and Niconico, written in Python 3.
    • Wiki
      • WikiTeam - Tools for downloading and preserving wikis.
WebSocket

Libraries for working with WebSocket.

    • Crossbar - An open source application messaging router (a Python implementation of WebSocket and WAMP for Autobahn).
    • AutobahnPython - Open source Python implementations of the WebSocket protocol and WAMP.
    • WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy.
DNS resolution
    • dnsyo - Checks your DNS against more than 1500 DNS servers worldwide.
    • pycares - An interface to c-ares, a C library that performs DNS requests and name resolution asynchronously.
Computer vision
    • OpenCV - Open source computer vision library.
    • SimpleCV - A concise, readable interface for cameras, image processing, feature extraction, and format conversion (based on OpenCV).
    • mahotas - Fast computer vision algorithms (implemented entirely in C++) that operate on NumPy arrays.
Proxy servers
    • shadowsocks - A fast tunnel proxy that helps you get through firewalls (TCP and UDP support, TFO, multi-user support, graceful restart, destination IP blacklist).
    • tproxy - A simple TCP routing proxy (layer 7) built on gevent and configured in Python.
Other lists of Python tools
    • awesome-python
    • pycrumbs
    • python-github-projects
    • python_reference
    • pythonidae
