Python Crawler Starter Learning Program

This is a purely personal, somewhat rambling learning summary. There are plenty of crawler tutorials online, but hardly any of them offer a complete learning plan, so this article lays one out for beginners. If you can study on your own, know only a little Python, and know nothing about crawlers, you should get something out of it.

1. Learning Content

Recently I spent about seven days getting up to speed on the popular Python crawler tools. I covered the basics of crawling, including requests, urllib, BeautifulSoup, CSS selectors, the currently popular pyspider and Scrapy frameworks, and basic use of the Selenium browser automation tool.

Having finished this round of study, I can now complete small-scale crawlers with the help of various documentation. For sites with anti-crawling measures, I can also get by with three techniques: CAPTCHA recognition, simulated login, and proxy settings. My coding speed is still not high, though; there is no shortcut on the road to programming, and I will need to keep writing code.

For the next stretch of time I plan to pause the crawler study and move on to Linux, so this article records what I learned and serves as a summary of those seven days.

2. How Did I Learn?

2.1 A Few Detours

I have to admit I took quite a few detours at the beginning. I started from a blog that seemed serious at the time (looking back, it was fairly average) and followed it step by step, Googling whatever the blog left out. The big problem with this approach was that the knowledge was incomplete and not laid out in any careful order, so I constantly had to consult a pile of other material to fill the gaps, and that wasted a lot of time. In some places I even had to guess at what a function or a variable was originally meant to do; you can imagine the quality of that kind of learning was not high.

2.2 The Official Start

2.2.1 Re-choosing the Learning Material

After I decided to start over, the material I used was Python 3 Network Crawler Development in Practice, a book by the Chinese crawler blogger known as Jing Mi (静觅). It is a very solid book with detailed content, which matters a lot for beginners. I have not finished it yet: I skipped the two parts on app crawling and distributed crawling.

2.2.2 Learning Style

This part follows the chapter order of Python 3 Network Crawler Development in Practice and gives my personal impressions of each piece.

First, a few global suggestions:

    1. Read the English sources as they are; do not rely on translation. Accuracy is worth the effort.
    2. The official documentation is always the best place to find answers, though it takes patience.
    3. Make only the coding attempts that are necessary; busy hands are no substitute for a working brain, because much knowledge looks easy enough to invite blind experiments, and attempts made without thinking waste time.
    4. Do not practice for practice's sake; practice is for understanding the code. If you use PyCharm, Ctrl+B (go to definition) is worth knowing.
2.2.3 Basic Libraries

The first few chapters mainly introduce the basic request libraries. For this part, I think the focus should be on understanding the crawling workflow; agonizing over specific APIs is pointless, since API fluency comes with projects, and at the beginning you can simply write code with the documentation open. The abstract understanding is what is valuable: sort out what each API represents conceptually, because that is the necessary groundwork for the later chapters.
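To make that workflow concrete, here is a minimal sketch of the kind of basic request those chapters cover, using requests; the URL is just a placeholder.

```python
import requests

# Placeholder URL used purely for illustration.
url = "https://example.com/"

# A User-Agent header makes the request look like a normal browser visit.
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, timeout=10)

# Check the status code before using the body.
if response.status_code == 200:
    print(response.text[:200])  # first 200 characters of the HTML
else:
    print("Request failed:", response.status_code)
```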

2.2.4 Parsing Libraries

A crawler's workflow is roughly: get a URL, fetch the content at that URL, then parse the content to extract new URLs and the data you need. Parsing is therefore an essential part of writing a crawler. Parsing is, at bottom, just filtering text, which is exactly what regular expressions do, so regular expressions can be used for it. But they are troublesome to write, so various more convenient tools exist for parsing text; the mainstream options are XPath and CSS selectors. I recommend CSS selectors: having tried XPath, regular expressions, and CSS selectors, CSS wins for me.
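As a minimal sketch of what a CSS selector buys you (the HTML fragment is made up for illustration), BeautifulSoup's select() accepts this kind of expression:

```python
from bs4 import BeautifulSoup

# A small, made-up HTML fragment used purely for illustration.
html = """
<div class="post">
  <h2 class="title"><a href="/article/1">First article</a></h2>
  <h2 class="title"><a href="/article/2">Second article</a></h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <a> inside an <h2 class="title"> inside <div class="post">.
for link in soup.select("div.post h2.title a"):
    print(link.get_text(), link["href"])
```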

The difficulty of learning it came down to two things:

    1. I had never touched front-end development, so I needed a fair amount of background knowledge before the CSS part made sense; I plan to publish a summary-style post later once I have sorted that material out.
    2. A finished CSS selector expression needs to be verified somehow, and I recommend the browser's own developer tools: press F12, then Ctrl+F to open the search box, and most browsers let you search the page with a CSS selector expression. I recommend this because it is exactly the process you go through later when writing the selectors for a crawler.

One more point, and the most important: learn how to parse first, otherwise the later experiments will just waste your time.

2.2.5 Data Storage

This section is divided into two parts:

    1. File formats: CSV, TXT, JSON. I do not recommend skipping this; it is basic knowledge that you would only have to come back and learn later anyway (a small sketch follows this list).
    2. Databases. This depends on personal need; if you do not need one you can skip it, but it does no harm to understand the concepts of relational and non-relational databases.
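As a minimal sketch of the file-format side (the records and file names are made up for illustration), saving scraped items to JSON and CSV might look like this:

```python
import csv
import json

# Hypothetical scraped records used purely for illustration.
items = [
    {"title": "First article", "url": "/article/1"},
    {"title": "Second article", "url": "/article/2"},
]

# Write the items as JSON; ensure_ascii=False keeps non-ASCII text readable.
with open("items.json", "w", encoding="utf-8") as f:
    json.dump(items, f, ensure_ascii=False, indent=2)

# Write the same items as CSV with a header row.
with open("items.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(items)
```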
2.2.6 Ajax Data Crawling and Dynamically Rendered Pages

Both solve the same problem: how do you crawl content that is loaded dynamically?

    1. The solution for Ajax data crawling is to observe the pattern of the requests the page makes and simulate those requests yourself (the page asks the server for the content before displaying it).
    2. The solution for dynamically rendered pages is to use a browser automation tool to simulate a person operating the browser and so capture the dynamic content. I have to say it is quite slow; after all, the browser has to go through every step, whereas Ajax crawling just constructs a URL and asks for the data directly.

You should learn both. The former requires finding the pattern in the request links, which is hard, and some sites have no obvious pattern; the latter is simpler and more broadly applicable. This part is not difficult, and the book explains it clearly, but it takes time, because you need to try different sites to feel the difference between the two approaches and their respective strengths and weaknesses. If you have not learned the CSS selector syntax from the previous section, this is where time gets wasted. Read patiently, type the code out until it becomes a template, and then reuse it directly; a small sketch of both approaches follows.
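Here is a minimal sketch of both approaches under made-up URLs: the first part simulates an Ajax request found in the browser's Network panel, the second drives a real browser with Selenium (a matching browser driver is assumed to be installed).

```python
import json

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

# --- Approach 1: simulate the Ajax request directly ---
# Hypothetical JSON endpoint spotted in the browser's Network panel.
api_url = "https://example.com/api/articles?page=1"
resp = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
data = resp.json()  # the server answers with JSON, so no HTML parsing is needed
print(json.dumps(data, indent=2)[:200])

# --- Approach 2: drive a real browser ---
driver = webdriver.Chrome()  # assumes ChromeDriver is available
try:
    driver.get("https://example.com/")  # the browser executes the JavaScript for us
    for link in driver.find_elements(By.CSS_SELECTOR, "h2.title a"):
        print(link.text, link.get_attribute("href"))
finally:
    driver.quit()
```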

2.2.7 Use of Proxies

Proxies are an essential part of crawling: sites will block IPs that generate large volumes of abnormal traffic, and proxies are the way to break out of that.

The part of this chapter on building a proxy pool can be skipped if you do not feel like reading it; after all, there are plenty of ready-made proxy pools, and their APIs are very simple.

The book recommends several paid proxy services. I did not try them and used free proxies instead; perhaps because my experiments were small, the speed was not unacceptable.

This section is the basis for using proxies, and the Scrapy framework you will learn later also involves proxy configuration.
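A minimal sketch of routing a request through a proxy with requests; the proxy address is a placeholder, and httpbin.org/ip is used only because it echoes back the IP it saw, which makes it easy to confirm the proxy is in effect.

```python
import requests

# Placeholder proxy address; substitute a working proxy from your own pool.
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}

try:
    # httpbin.org/ip returns the client IP the request appeared to come from.
    resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
    print(resp.json())
except requests.RequestException as exc:
    print("Proxy request failed:", exc)
```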

2.2.8 Simulated Login

I have to admit I went through this part quickly.

I only learned the basic method of simulated login; that part is small and not hard to learn.

Beyond that there is the more advanced use of a cookie pool, which takes up a large portion of the chapter. Considering this is my first pass and a cookie pool is not strictly necessary, I skipped it for now and will come back to it when I work on a project next semester.
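As a minimal sketch of the basic method (the login URL and form field names are hypothetical and depend entirely on the target site, which may also require hidden tokens or a CAPTCHA), a form-based simulated login with a requests session might look like this:

```python
import requests

# A session keeps cookies across requests, which is what makes the login persist.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# Hypothetical login endpoint and form fields; a real site's names will differ.
login_url = "https://example.com/login"
payload = {"username": "my_user", "password": "my_password"}

resp = session.post(login_url, data=payload, timeout=10)
resp.raise_for_status()

# Later requests through the same session carry the login cookies automatically.
profile = session.get("https://example.com/profile", timeout=10)
print(profile.status_code)
```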

2.3 Crawler Frameworks

A crawler framework is, simply put, a semi-finished crawler project, and you need the basic knowledge above to turn it into a complete one. Even without a framework you can still write crawlers, but why build your own wheels? A framework is not omnipotent either: it is only a sensible combination of the basic capabilities above, and the core logic of the crawler still has to be written by you; you cannot just hand it a URL and get a satisfactory crawler back.

2.3.1 Pyspider: A Simple but Useful Crawler Framework

Pyspider is a crawler framework developed by the Chinese developer binux. It is easy to get started with, easy to debug, and comes with a web UI, but it is not very customizable and is less efficient than Scrapy; still, for a novice it really is a very good learning tool. Pyspider has no real pitfalls, but its online editor is not especially pleasant to use and has no code completion, so I recommend finishing your changes locally before testing.

I personally recommend that beginners use Pyspider to learn crawling, because Scrapy's debugging output really is not clear, and debugging is easier with Pyspider.
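For a feel of how little code Pyspider asks for, here is a minimal sketch based on its default script template, with example.com as a placeholder start URL; response.doc() takes the same CSS selector expressions discussed above.

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # Seed URL; Pyspider schedules it and calls index_page with the response.
        self.crawl("https://example.com/", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every absolute link found on the page.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        # Whatever is returned here becomes the crawl result.
        return {
            "url": response.url,
            "title": response.doc("title").text(),
        }
```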

2.3.2 Scrapy: An Engineering-Grade Crawler Framework

Compared with Pyspider, Scrapy is unstintingly complex: there is no graphical interface, everything runs from the command line, and the code is edited locally. Because of that complexity, you need to understand the framework's structure before it feels comfortable to use.

"Python3 web crawler Development in action" to scrapy have a detailed description, but I still recommend to read the official documents, official documents can provide more comprehensive information.

For Scrapy, I do not think it is a sticking point for people who like to dig into technology; as with Linux versus Windows, the complexity is part of the fun of exploring. On the other hand, once you actually use Scrapy you will find it handles data filtering better than Pyspider: the complexity exists to better support simple use. And although it is complex, once you have developed your first spider, development speed picks up a lot, because at the code level there is plenty of repetition; a minimal spider sketch follows.
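This is a minimal Scrapy spider sketch against the public practice site quotes.toscrape.com (the site Scrapy's own tutorial uses); it extracts each quote with CSS selectors and follows the pagination link.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal spider against the practice site quotes.toscrape.com."""

    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # CSS selectors again: each div.quote block yields one item.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the "next page" link, if there is one.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Inside a Scrapy project, something like `scrapy crawl quotes -o quotes.json` would run it and dump the results to a file.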

3. Summary

Although my starting point for learning crawlers was just to get more project-level experience with Python, I have to say that a crawler is a very fun tool, and learning it was very interesting. This article is mainly about getting started with Python crawlers and does not go into specific technical details, partly because I am still far too green compared with those who came before me, and partly because crawler technology blogs already repeat one another so much that another one would be meaningless. A learning write-up like this is also very subjective; everyone has heard the story of the pony crossing the river, so please combine it with your own situation when deciding what to adopt.

Please correct me if there are any mistakes in the article, and further suggestions are welcome.
