I have recently been working on a project that needs to crawl data from a specific site, so I am writing a series of articles to share how to write a web crawler. This is the first article in the series: a brief introduction to Python crawlers. I will keep updating the series as the project progresses.
I. What is a web crawler
The concept of a web crawler is not hard to understand. Think of the Internet as a huge net, and the crawler as a spider (it is literally called a Spider in English; personally I think "web spider" is the more vivid name) crawling across that net. Whenever the spider encounters a resource, it grabs it. Which resources to grab is entirely up to you: you decide what to crawl, you have full control, and in theory a crawler can fetch any information that exists on the Internet.
II. How browsing a web page works
To understand crawlers, we should first understand how browsing the web works, because a crawler is essentially a computer simulating a person browsing web pages. So what happens when you browse the web?
When browsing the web, users see many nice pictures, for example on http://image.baidu.com/, where we see some images and the Baidu search box. What actually happens is this: the user enters a URL, a DNS server resolves it and locates the host, the browser sends a request to that server, the server processes the request and returns HTML, JS, CSS and other files, and the browser parses them so that the user sees the images and text.
So the web page the user sees is essentially HTML code, and that is exactly what a crawler fetches. By analyzing and filtering this HTML, the crawler extracts the images, text, and other resources it wants.
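To make "analyzing and filtering the HTML" concrete, here is a small sketch (not from the original post) that fetches a page with urllib2 and filters its HTML for image addresses with a regular expression; the target URL and the pattern are only illustrative assumptions, written for Python 2.

    # A minimal sketch (Python 2): fetch a page and filter its HTML for image URLs.
    # The target URL and the regular expression are illustrative only.
    import re
    import urllib2

    html = urllib2.urlopen("http://image.baidu.com/").read()   # raw HTML returned by the server
    img_urls = re.findall(r'src="(http[^"]+\.(?:jpg|png|gif))"', html)  # crude filter for image links
    for img in img_urls:
        print img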
III. The meaning of a URL
URL stands for Uniform Resource Locator, which is what we commonly call a web address. It is a concise way to describe the location of a resource on the Internet and how to access it, and it serves as the standard address of resources on the Internet. Every file on the Internet has a unique URL, which contains the information indicating where the file is and how the browser should handle it.
The format of a URL consists of three parts:
① The first part is the protocol (or service mode).
② The second part is the IP address of the host where the resource is stored (sometimes followed by a port number).
③ The third part is the specific address of the resource on that host, such as the directory and file name.
A crawler must have a target URL before it can fetch any data, so the URL is the basic starting point for a crawler, and understanding it accurately helps a lot when learning to write one.
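To see the three parts concretely, here is a small sketch (Python 2, using the standard urlparse module; the URL is just an example) that splits a URL into them:

    # A small sketch (Python 2): split an example URL into the three parts above.
    from urlparse import urlparse

    parts = urlparse("http://www.cnblogs.com/ECJTUACM-873284962/index.html")
    print parts.scheme   # part 1: the protocol, e.g. "http"
    print parts.netloc   # part 2: the host (and port, if any), e.g. "www.cnblogs.com"
    print parts.path     # part 3: the resource's address on the host, e.g. "/ECJTUACM-873284962/index.html"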
IV. Setting up the environment
In theory you can write a web crawler in any language, but what I am sharing here is how to write one in Python. Python's flexibility, elegance, and strong support for network programming make it a first-choice language for writing crawlers. Installing Python is very simple, so I will not go over it here: just download an installer from the official website and run it. Python ships with its own editor, IDLE, which becomes available (for example from the right-click menu) once the installation finishes.
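Since the example below uses the urllib2 library, which only exists in Python 2, you can quickly check which version you installed with a short sketch like this:

    # Quick check: print the version of the Python interpreter you are running.
    import sys
    print sys.version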
V. A first crawler experience
Enough talk; let's get a feel for a crawler. Here we will crawl a page directly, for example: http://www.cnblogs.com/ECJTUACM-873284962/
This page is my blog homepage. We want to crawl its content down, and in fact it only takes two lines of code, using the urllib2 library.
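A minimal sketch of those two lines with urllib2 (Python 2), filling in my blog's URL as described above:

    # A minimal sketch (Python 2): fetch the page and print its HTML using urllib2.
    import urllib2
    print urllib2.urlopen("http://www.cnblogs.com/ECJTUACM-873284962/").read()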
Running the script prints the raw HTML of the page to the console.
As you can see, the entire content of my blog homepage has been crawled down; you can click the link, visit my blog, and check that the output matches the page.
A crawler really is that simple: once you understand the principle, nothing is a problem. Today was only a first taste of crawlers; in follow-up articles I will go deeper step by step and share more crawler knowledge.