How to Get Started with Python Web Crawlers

Source: Internet
Author: User
Tags: http, cookie, python, web crawler

Learning to write crawlers is a gradual process. For a complete beginner, it can be roughly divided into three stages. The first stage is getting started: mastering the necessary fundamentals. The second stage is imitation: following other people's crawler code and understanding every line of it. The third stage is doing it yourself: at this stage you begin to form your own ideas for solving problems and can design a crawler system independently.

Crawler work draws on many technologies, including but not limited to: proficiency in a programming language (Python, in this article), HTML, the basics of the HTTP/HTTPS protocol, regular expressions, databases, common packet-capture tools, and crawler frameworks. Large-scale crawling additionally requires an understanding of distributed systems, message queues, common data structures and algorithms, and caching, and can even involve machine learning; many technologies stand behind any large-scale system. Crawling only obtains the data; the real value lies in analyzing and mining it, so the work naturally extends into data analysis, data mining, and related fields that inform business decisions. A crawler engineer therefore has promising prospects.

So do you have to learn all of the above before you can start writing crawlers? Of course not. Learning is a lifelong endeavor; as long as you can write Python code, you can start on crawlers right away. It is like learning to drive: once you can start the car, you can get on the road, and writing code is considerably safer than driving.

To write crawlers in Python, you first need to know Python: understand the basic syntax and know how to use functions, classes, and common data structures such as list and dict. That is enough for a basic start. Next you need to understand HTML; HTML is a document tree structure, and a 30-minute introductory HTML tutorial is enough. Then comes knowledge of HTTP. The basic principle of a crawler is to download data from a remote server through network requests, and the technology behind those requests is the HTTP protocol. As a beginner you need to understand the fundamentals of HTTP; the full specification could fill more than a book, so the deeper material can be left for later and learned by combining theory with practice.
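To make that principle concrete, here is a minimal sketch of a crawler's core action: fetching a page over HTTP using nothing but the standard library. The URL is just a placeholder example.

    # Send an HTTP GET request and read the response body; this is the
    # "download data from a remote server" step in miniature.
    from urllib.request import urlopen

    with urlopen("https://example.com") as response:
        print(response.status)                  # HTTP status code, e.g. 200
        html = response.read().decode("utf-8")  # the raw HTML document
        print(html[:200])                       # first 200 characters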

A network request library is an implementation of the HTTP protocol. The well-known requests library, for example, is a network library that simulates a browser sending HTTP requests. After understanding HTTP, you can study Python's network-related modules in a targeted way, such as urllib, urllib2 (urllib in Python 3), httplib, and cookie handling; or, if you are already familiar with the basics of HTTP, you can skip these and learn to use requests directly. One book worth recommending here is "Illustrated HTTP". Most of the data you crawl comes back as HTML text, with a smaller share in XML or JSON format. To handle each of these correctly, you need to be familiar with the corresponding parsing approach: JSON can be handled directly with Python's built-in json module; for HTML you can use libraries such as BeautifulSoup and lxml; and for XML, besides the standard library, there are third-party libraries such as untangle and xmltodict.
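As a rough sketch of the fetch-then-parse flow described above (assuming the third-party packages requests and beautifulsoup4 are installed, and using a placeholder URL):

    import json

    import requests
    from bs4 import BeautifulSoup

    # HTML: download a page, then pull out the title and all links
    resp = requests.get("https://example.com", timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    print(soup.title.string)
    links = [a.get("href") for a in soup.find_all("a")]
    print(links)

    # JSON: API responses map directly onto Python dicts and lists
    data = json.loads('{"name": "crawler", "pages": 3}')
    print(data["name"], data["pages"])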

When starting out with crawlers, learning regular expressions is not mandatory; you can pick them up when you actually need them, for example when the data you crawl back needs cleaning and you find that ordinary string methods cannot handle it. At that point, learning regular expressions often pays off many times over, and Python's re module is what you use to apply them. A few tutorials are also worth recommending: the 30-minute introduction to regular expressions, the Python regular expression guide, and the complete guide to regular expressions.
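For example, here is a tiny, made-up cleaning task where a regular expression beats plain string methods: extracting prices from scraped text with the built-in re module.

    import re

    raw = "Price: $19.99 (was $24.50), ships in 3 days"
    prices = re.findall(r"\$(\d+\.\d{2})", raw)  # capture the number after each $
    print(prices)  # ['19.99', '24.50']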

Once the data has been cleaned, it needs to be stored. You can use file storage such as CSV files, or database storage: SQLite for simple cases, MySQL for more professional use, or a distributed document database such as MongoDB. All of these are very Python-friendly and have ready-made library support. (Related reading: operating a MySQL database in Python; connecting to a database from Python.)
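A minimal storage sketch using the built-in sqlite3 module; the database file, table name, and fields here are purely illustrative:

    import sqlite3

    rows = [("Example Domain", "https://example.com"),
            ("Example Page", "https://example.org")]

    conn = sqlite3.connect("crawl.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, url TEXT)")
    conn.executemany("INSERT INTO pages (title, url) VALUES (?, ?)", rows)
    conn.commit()

    for title, url in conn.execute("SELECT title, url FROM pages"):
        print(title, url)
    conn.close()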

That completes the basic flow from data capture through cleaning to storage, which counts as a basic introduction; what comes next is a test of deeper skill. Many sites have anti-crawler strategies and do everything they can to stop you from obtaining data by abnormal means: all kinds of strange CAPTCHAs to limit your requests, rate limits, IP bans, even encrypted data; in short, anything that raises the cost of obtaining it. At this point you need to master more knowledge: a deeper understanding of the HTTP protocol, common encryption and decryption algorithms, HTTP cookies, HTTP proxies, and the various HTTP headers. Crawlers and anti-crawler measures are locked in a love-hate struggle, each side forever trying to outsmart the other. There is no fixed, universal solution for dealing with anti-crawler measures; it depends on your experience and the body of knowledge you have built up. This is not a height you can reach with a 21-day introductory tutorial.
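As a rough illustration of only the simplest countermeasures (a browser-like User-Agent, a session that carries cookies, a proxy, and throttling), with a placeholder proxy address and URL; real anti-crawler systems usually demand far more than this:

    import time

    import requests

    session = requests.Session()  # carries cookies across requests
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # look like a browser
        "Referer": "https://example.com/",
    })
    proxies = {"http": "http://127.0.0.1:8080",   # placeholder proxy address
               "https": "http://127.0.0.1:8080"}

    for page in range(1, 4):
        resp = session.get(f"https://example.com/list?page={page}",
                           proxies=proxies, timeout=10)
        print(page, resp.status_code, len(resp.text))
        time.sleep(2)  # slow down to stay under rate limits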

Data structures and algorithms

A large-scale crawler usually starts from one or more seed URLs, parses new URLs out of each page, and adds them to the set of URLs to be crawled. A queue, or a priority queue, is needed to distinguish the sites that should be crawled first from those that can wait. Each time a page is crawled, the next link to follow is chosen using a depth-first or breadth-first strategy. Every network request also involves a DNS resolution step (converting the domain name in the URL to an IP address); to avoid resolving the same name repeatedly, the resolved IPs should be cached. With so many URLs, how do you keep track of which have already been crawled and which have not? The simple approach is to store crawled URLs in a dictionary or set, but when the number of URLs becomes enormous this takes up a great deal of memory, and you should consider a Bloom filter instead. Crawling with a single thread is inefficient, so to raise throughput you can use multiple threads, multiple processes, coroutines, or a distributed setup.
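A simplified sketch of such a crawl loop, using a plain set for deduplication in place of a Bloom filter (which only matters at much larger scales); it assumes requests and beautifulsoup4 are installed and uses a placeholder start URL:

    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    start_url = "https://example.com/"
    queue = deque([start_url])   # URLs waiting to be crawled (FIFO = breadth-first)
    seen = {start_url}           # URLs already queued, so nothing is crawled twice

    while queue and len(seen) < 50:           # small cap for this sketch
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                          # skip unreachable pages
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])    # resolve relative links
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
        print("crawled:", url)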

About Practice

Crawler tutorials abound online, and the underlying principles are basically the same; they just switch to a different target site. Following an online tutorial, you can practice simulating a login to a website, simulating check-ins, or crawling Douban movies, books, and so on. Through constant practice, going from hitting problems to solving them, you gain something that reading alone cannot give you.

Common Crawler Libraries

urllib, urllib2 (urllib in Python 3): Python's built-in network request libraries

urllib3: thread-safe HTTP request library

requests: the most widely used network request library, compatible with Python 2 and Python 3

grequests: asynchronous requests

BeautifulSoup: HTML/XML parsing library

lxml: another way to process HTML and XML

Tornado: asynchronous networking framework

gevent: coroutine-based asynchronous networking library

Scrapy: the most popular crawler framework

pyspider: a crawler framework

xmltodict: converts XML to Python dictionaries

pyquery: manipulate HTML the way jQuery does

jieba: Chinese word segmentation

SQLAlchemy: ORM framework

Celery: distributed task queue

RQ: simple job queue backed by Redis

python-goose: extracts article text from HTML

Books

"Graphic http"

"HTTP authoritative guide"

"Computer network: top-down approach"

Writing web crawler with Python

"Python Network data Acquisition"

"Mastering Regular Expressions"

"Getting started with Python to practice"

"Write your own web crawler"

"Crypto101"

"Graphic Cryptography Technology"

Tutorials

Python Crawler Learning Series Tutorials

Python Web Crawler for Beginners: The Essentials

Python Web Crawler

Crawler Getting Started Series

How to Get Started with Python Web Crawlers
