Python crawler tools

Source: Internet
Author: User
This is a list of Python libraries for web crawling: packages that fetch web pages and process the data they contain. If you need a Python library for one of these tasks, refer to the categories below.

Network

  • General

    • Urllib-network library (stdlib).

    • Requests-network library.

    • Grab-network library (based on pycurl).

    • Pycurl-network library (binding to libcurl).

    • Urllib3-Python HTTP library with thread-safe connection pooling and file post support.

    • Httplib2-network library.

    • RoboBrowser-a simple, Pythonic library for browsing the web without a standalone web browser.

    • MechanicalSoup-a Python library for automating interaction with websites.

    • Mechanize-stateful, programmable web browsing library.

    • Socket-low-level networking interface (stdlib).

    • Unirest for Python-Unirest is a set of lightweight HTTP libraries available in multiple languages.

    • Hyper-HTTP/2 client for Python.

    • PySocks-an updated, actively maintained fork of SocksiPy with bug fixes and extra features. Works as a drop-in replacement for the socket module.

  • Asynchronous

    • Treq-requests-like API (based on Twisted).

    • Aiohttp-HTTP client/server for asyncio (PEP 3156).
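
As a rough illustration of the stdlib entry in the list above, the following sketch builds a request with urllib and, separately, defines a function that would send it. The names build_request and fetch, the User-Agent value, and the example.com URL are assumptions made for this example; only the request construction runs here, so no network access is needed.

```python
import urllib.request


def build_request(url, user_agent="example-crawler/0.1"):
    """Build a GET request with a custom User-Agent, without sending it."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Send the request and return the decoded body (requires network access)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Inspect the request object; fetch() is intentionally not called here.
req = build_request("https://example.com/index.html")
print(req.get_method(), req.full_url, req.get_header("User-agent"))
```

Third-party libraries such as Requests wrap the same steps in a shorter API; the stdlib version is shown only because it has no dependencies.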

Web crawler framework
  • Full-featured crawlers

    • Grab-web crawler framework (based on pycurl/multicurl).

    • Scrapy-web crawler framework (based on Twisted). Python 3 is supported as of Scrapy 1.1.

    • Pyspider-a powerful crawler system.

    • Cola-a distributed crawler framework.

  • Others

    • Portia-Scrapy-based visual crawler.

    • Restkit-HTTP resource toolkit for Python. It lets you easily access HTTP resources and build objects around them.

    • Demiurge-PyQuery-based crawler microframework.

HTML/XML parser
  • General

    • Lxml-efficient HTML/XML processing library written in C. Supports XPath.

    • Cssselect-work with the DOM tree using CSS selectors.

    • Pyquery-work with the DOM tree using jQuery-like selectors.

    • BeautifulSoup-slower HTML/XML processing library, implemented in pure Python.

    • Html5lib-builds the DOM of HTML/XML documents according to the WHATWG specification, the one followed by all modern browsers.

    • Feedparser-parse RSS/ATOM feeds.

    • MarkupSafe-implements an XML/HTML/XHTML markup-safe string, so that escaping is handled safely.

    • Xmltodict-a Python module that makes working with XML feel like working with JSON.

    • Xhtml2pdf-convert HTML/CSS to PDF.

    • Untangle-converts XML documents into Python objects for easy access.

  • Sanitizing

    • Bleach-sanitize HTML (requires html5lib).

    • Sanitize-bringing sanity to the world of messed-up data.
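
To give a sense of what these parsers do, here is a minimal sketch that pulls link targets out of an HTML fragment. It deliberately uses the standard library's html.parser rather than any of the packages listed above, so it runs without installation; the LinkCollector class, the extract_links helper, and the sample markup are made up for this example. Libraries such as lxml or BeautifulSoup express the same task more compactly with XPath or CSS selectors.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<p><a href="/a">A</a> <a name="x">no href</a> '
    '<a href="https://example.com/b">B</a></p>'
)
print(extract_links(sample))
```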

Text processing

Libraries for parsing and manipulating plain text.

  • General

    • Difflib-(Python standard library) helpers for computing differences between sequences.

    • Levenshtein-fast computation of Levenshtein distance and string similarity.

    • Fuzzywuzzy-fuzzy string matching.

    • Esmre-regular expression accelerator.

    • Ftfy-automatically fixes broken Unicode text and makes it more consistent.

  • Conversion

    • Unidecode-transliterate Unicode text to ASCII.

  • Character encoding

    • Uniout-print readable characters instead of escaped strings.

    • Chardet-character encoding detector compatible with Python 2 and 3.

    • Xpinyin-a library for converting Chinese characters (hanzi) to pinyin.

    • Pangu.py-adds spacing between CJK and alphanumeric characters in text.

  • Slug

    • Awesome-slugify-a Python slugify library that can preserve Unicode.

    • Python-slugify-a Python slugify library that translates Unicode to ASCII.

    • Unicode-slugify-a slugifier that generates Unicode slugs.

    • Pytils-simple tools for processing Russian strings (including pytils.translit.slugify).

  • General parser

    • PLY-Python implementation of the lex and yacc parsing tools.

    • Pyparsing-a general-purpose framework for building parsers.

  • Person names

    • Python-nameparser-parses human names into their individual components.

  • Phone number

    • Phonenumbers-parse, format, store, and validate international phone numbers.

  • User agent string

    • Python-user-agents-browser user agent parser.

    • HTTP Agent Parser-Python HTTP user agent parser.
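
As a small illustration of the string-comparison entries in this section, the sketch below uses the standard library's difflib to score similarity and to pick near matches, which is handy for spotting near-duplicate page titles in a crawl. The helper names similarity and closest and the sample titles are assumptions for this example; Levenshtein and Fuzzywuzzy cover the same ground with different metrics.

```python
import difflib


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest(query, candidates, n=3, cutoff=0.6):
    """Return up to n candidates that resemble query, best match first."""
    return difflib.get_close_matches(query, candidates, n=n, cutoff=cutoff)


titles = ["Python crawler tools", "Python crawler tool list", "Ruby HTTP clients"]
print(round(similarity(titles[0], titles[1]), 2))
print(closest("python crawler tools", titles))
```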

Processing specific file formats

Libraries for parsing and processing specific file formats.

  • General

    • Tablib-a module for tabular datasets that exports to XLS, CSV, JSON, YAML, and other formats.

    • Textract-extract text from various files, such as Word, PowerPoint, and PDF.

    • Messytables-a tool for parsing messy tabular data.

    • Rows-a common data interface that supports many formats (currently CSV, HTML, XLS, and TXT, with more planned).

  • Office

    • Python-docx-read, query, and modify Microsoft Word 2007/2008 docx files.

    • Xlwt/xlrd-write and read data and formatting information from Excel files.

    • XlsxWriter-a Python module for creating Excel .xlsx files.

    • Xlwings-a BSD-licensed library that makes it easy to call Python from Excel and vice versa.

    • Openpyxl-a library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.

    • Marmir-takes Python data structures and turns them into spreadsheets.

  • PDF

    • PDFMiner-a tool for extracting information from PDF documents.

    • PyPDF2-a library that can split, merge, and transform PDF pages.

    • ReportLab-allows rapid creation of rich PDF documents.

    • Pdftables-extract tables directly from PDF files.

  • Markdown

    • Python-Markdown-a Python implementation of John Gruber's Markdown.

    • Mistune-a fast, full-featured Markdown parser in pure Python.

    • Markdown2-a fast Markdown implementation written entirely in Python.

  • YAML

    • PyYAML-a Python YAML parser.

  • CSS

    • Cssutils-a Python CSS library.

  • ATOM/RSS

    • Feedparser-a universal feed parser.

  • SQL

    • Sqlparse-a non-validating SQL parser.

  • HTTP

    • Http-parser-HTTP request/response message parser implemented in C.

  • Microformats

    • Opengraph-a Python module for parsing Open Graph protocol tags.

  • Portable Executable

    • Pefile-a multi-platform module for parsing and working with Portable Executable (PE) files.

  • PSD

    • Psd-tools-read Adobe Photoshop PSD files into Python data structures.

Natural language processing

Libraries for working with human language.

  • NLTK-a leading platform for building Python programs that work with human language data.

  • Pattern-a web mining module for Python. It includes tools for natural language processing, machine learning, and more.

  • TextBlob-provides a consistent API for common NLP tasks. It stands on the shoulders of NLTK and Pattern.

  • Jieba-Chinese word segmentation tool.

  • SnowNLP-Chinese text processing library.

  • Loso-another Chinese word segmentation library.

  • Genius-Chinese word segmentation based on conditional random fields.

  • Langid.py-a standalone language identification system.

  • Korean-a library for Korean morphology.

  • Pymorphy2-morphological analyzer for Russian (part-of-speech tagger + inflection engine).

  • PyPLN-a distributed natural language processing pipeline written in Python. The goal of the project is to provide an easy way to use NLTK for processing large corpora through a web interface.

Browser automation and emulation
  • Selenium-automates real browsers (Chrome, Firefox, Opera, and IE).

  • Ghost.py-wrapper around PyQt WebKit (requires PyQt).

  • Spynner-wrapper around PyQt WebKit (requires PyQt).

  • Splinter-a common API for browser emulators (Selenium WebDriver, Django client, Zope).

Multiprocessing
  • Threading-the Python standard library module for running threads. Effective for I/O-bound tasks, but of little use for CPU-bound tasks because of the Python GIL.

  • Multiprocessing-the Python standard library module for running multiple processes.

  • Celery-asynchronous task queue/job queue based on distributed message passing.

  • Concurrent-futures-the concurrent.futures module provides a high-level interface for asynchronously executing callables.
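
The point about threads and I/O-bound work can be sketched with a thread pool from concurrent.futures. In this example fake_download is a hypothetical stand-in for a real network call: it only sleeps, which releases the GIL in the same way that waiting on a socket does. The URLs and the worker count are likewise assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, delay=0.2):
    """Stand-in for an I/O-bound call; sleeping releases the GIL."""
    time.sleep(delay)
    return url, len(url)


def download_all(urls, max_workers=8):
    """Run fake_download over urls in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_download, urls))


urls = ["https://example.com/page/%d" % i for i in range(8)]
start = time.perf_counter()
results = download_all(urls)
elapsed = time.perf_counter() - start
# Run one after another, the eight calls would take about 1.6 seconds.
print(len(results), "pages in %.2fs" % elapsed)
```

For CPU-bound work the same code shape applies with ProcessPoolExecutor, which sidesteps the GIL by using separate processes.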

Asynchronous

Libraries for asynchronous network programming.

  • Asyncio-(Python standard library in Python 3.4+) asynchronous I/O, event loop, coroutines, and tasks.

  • Twisted-event-driven network engine framework.

  • Tornado-a web framework and asynchronous networking library.

  • Pulsar-Python event-driven concurrency framework.

  • Diesel-greenlet-based event I/O framework for Python.

  • Gevent-a coroutine-based Python networking library that uses greenlet.

  • Eventlet-asynchronous framework with WSGI support.

  • Tomorrow-magic decorator syntax for asynchronous code.
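
For the standard library entry above, a minimal asyncio sketch looks like the following. It assumes Python 3.7 or later (for asyncio.run); fake_fetch is a hypothetical coroutine that sleeps instead of performing a real request, and the URLs are placeholders. A real crawler would swap in an asynchronous HTTP client such as aiohttp.

```python
import asyncio


async def fake_fetch(url, delay=0.1):
    """Stand-in for an asynchronous HTTP request."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return url


async def crawl(urls):
    """Start all fetches concurrently and collect results in input order."""
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


pages = asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))
print(pages)
```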

Queue
  • Celery-asynchronous task queue/job queue based on distributed message passing.

  • Huey-a small multi-threaded task queue.

  • Mrq-Mr. Queue, a distributed job queue for Python built on Redis and Gevent.

  • RQ-lightweight task queue manager based on Redis.

  • Simpleq-a simple, infinitely scalable queue based on Amazon SQS.

  • Python-gearman-Python API for Gearman.

Cloud computing
  • Picloud-run Python code in the cloud.

  • Dominoup.com-run R, Python, and MATLAB code in the cloud.

Email

Libraries for parsing email.

  • Flanker-email address and MIME parsing library.

  • Talon-a Mailgun library for extracting quotations and signatures from messages.

URL and network address manipulation

Libraries for parsing and modifying URLs and network addresses.

  • URL

    • Furl-a small Python library that simplifies URL manipulation.

    • Purl-a simple, immutable URL class with a clean API for interrogation and manipulation.

    • Urllib.parse-splits a uniform resource locator (URL) string into its components (addressing scheme, network location, path, etc.), combines components back into a URL string, and converts a "relative URL" into an absolute URL given a "base URL".

    • Tldextract-accurately separates the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.

  • Network address

    • Netaddr-Python library for representing and manipulating network addresses.
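
The three jobs attributed to urllib.parse above (splitting a URL, reading its parts, and resolving a relative URL against a base) can be sketched as follows. The example.com URL and the relative path are placeholders chosen for this example.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

base = "https://example.com/docs/index.html?page=2&tag=python#top"

# Split the URL into scheme, network location, path, query, and fragment.
parts = urlsplit(base)
print(parts.scheme, parts.netloc, parts.path, parts.fragment)

# Decode the query string into a dict that maps each key to a list of values.
query = parse_qs(parts.query)
print(query)

# Resolve a relative link, as found in a crawled page, against the base URL.
absolute = urljoin(base, "../img/logo.png")
print(absolute)
```

Resolving every extracted link with urljoin before queueing it is a common way to keep a crawler's frontier free of relative paths.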

Webpage Content Extraction

Libraries for extracting web content.

  • Text and metadata of HTML pages

    • Newspaper-news extraction, article extraction, and content curation in Python.

    • Html2text-convert HTML to Markdown-formatted text.

    • Python-goose-HTML content/article extractor.

    • Lassie-a human-friendly web content retrieval tool.

    • Micawber-a small library for extracting rich content from URLs.

    • Sumy-a module for automatically summarizing text documents and HTML pages.

    • Haul-an extensible image crawler.

    • Python-readability-fast Python port of arc90's readability tool.

    • Scrapely-a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely builds a parser for all similar pages.

  • Video

    • Youtube-dl-a small command-line program for downloading videos from YouTube.

    • You-get-video downloader for YouTube, Youku, and Niconico, written in Python 3.

  • Wikipedia

    • WikiTeam-tools for downloading and preserving wikis.

WebSocket

Libraries for working with WebSocket.

  • Crossbar-open-source application messaging router (a Python implementation of WebSocket and WAMP for Autobahn).

  • AutobahnPython-open-source Python implementations of the WebSocket and WAMP protocols.

  • WebSocket-for-Python-WebSocket client and server library for Python 2 and 3 as well as PyPy.

DNS resolution
  • Dnsyo-check your DNS against more than 1,500 DNS servers worldwide.

  • Pycares-interface to c-ares, a C library that performs DNS requests and name resolution asynchronously.

Computer vision
  • OpenCV-open-source computer vision library.

  • SimpleCV-a concise, readable interface to cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).

  • Mahotas-fast computer vision algorithms (implemented entirely in C++) that operate on numpy arrays.

Proxy Server
  • Shadowsocks-a fast tunnel proxy that helps you get through firewalls (supports TCP and UDP, TFO, multi-user mode, graceful restart, and destination IP blacklists).

  • Tproxy-a simple TCP routing proxy (layer 7) built on Gevent that lets you configure the routing logic in Python.

List of other Python tools
  • Awesome-python

  • Pycrumbs

  • Python-github-projects

  • Python_reference

  • Pythonidae
