Python Crawler Primer (2): Understanding Crawler Basics

1. What is a crawler?

A crawler, that is, a web crawler, can be understood as a spider crawling across the internet. Think of the internet as one large web; the crawler is the spider moving along it, and whenever it encounters a resource, it grabs it. What it grabs is up to you to control. For example, while crawling one web page it may discover a path, which is in fact a hyperlink to another page; it can then crawl over to that page and fetch its data. In this way, every page connected to the web is within the spider's reach, and crawling them all down is no trouble at all. (A toy spider along these lines is sketched at the end of this post.)

2. The process of browsing a web page

While browsing the web, users see many nice pictures, for example at http://image.baidu.com/, where we see several images together with the Baidu search box. What actually happens is this: the user enters a URL; a DNS server resolves it and locates the server host; the browser sends a request to that server; the server parses the request and returns HTML, JS, CSS, and other files to the browser; the browser renders them, and the user sees the page with all its pictures. So the page a user sees is essentially HTML code, and that is exactly what a crawler fetches; by parsing and filtering this HTML, it extracts images, text, and other resources. (A minimal fetch sketch appears at the end of this post.)

3. The meaning of a URL

A URL, the Uniform Resource Locator, is what we commonly call a web address. It is a concise representation of the location of a resource available on the internet and of the method for accessing it, and it is the standard address of a resource on the internet. Every file on the internet has a unique URL, which contains information indicating where the file is and how the browser should handle it.

The format of a URL consists of three parts:
① The first part is the protocol (or service scheme).
② The second part is the IP address of the host where the resource is stored (sometimes followed by a port number).
③ The third part is the specific address of the resource on the host, such as its directory and file name.

A crawler must have a target URL before it can fetch any data, so the URL is the basic starting point from which a crawler obtains data; understanding it accurately is a great help in learning to write crawlers. (A URL-splitting sketch appears at the end of this post.)

4. Environment configuration

Learning Python of course requires configuring an environment. At first I used Notepad++, but I found its code hints too weak, so on Windows I switched to PyCharm, and on Linux I used Eclipse for Python. There are several other excellent IDEs as well; you can refer to this article on recommended Python IDEs. Good development tools propel you forward, and I hope you can find the IDE that suits you.
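To make the spider metaphor from section 1 concrete, here is a minimal sketch, assuming Python 3 and only the standard library (HTMLParser in place of a third-party parser). It fetches a single page and collects the hyperlinks on it, which are exactly the paths a crawler could follow next. The target URL is only an illustration; a real crawler would also queue the collected links and visit them in turn.

```python
# A toy version of the spider described in section 1: fetch one page and
# collect the hyperlinks it contains. The URL is illustrative.
import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

with urllib.request.urlopen("http://www.baidu.com") as response:
    page = response.read().decode("utf-8", errors="ignore")

collector = LinkCollector()
collector.feed(page)
print(collector.links[:10])  # the first few pages the spider could visit next
```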
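Section 2 says a crawler fetches the same HTML document the browser would otherwise render. A minimal sketch, assuming Python 3's urllib.request; the target URL is again illustrative, and the page is assumed to be UTF-8 encoded:

```python
# Request a URL and receive the raw HTML the browser would render --
# the raw material a crawler then parses and filters.
import urllib.request

url = "http://www.baidu.com"
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8")  # assumes a UTF-8 page

print(html[:200])  # first 200 characters of the HTML, as a sanity check
```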
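Finally, the three-part URL structure from section 3 can be seen directly with the standard library's urllib.parse; the example URL below is made up for illustration:

```python
# Split a URL into the parts described in section 3.
from urllib.parse import urlsplit

parts = urlsplit("http://image.baidu.com:8080/search/index.html")
print(parts.scheme)  # part 1: the protocol (service scheme) -> 'http'
print(parts.netloc)  # part 2: the host, with an optional port -> 'image.baidu.com:8080'
print(parts.path)    # part 3: the resource's address on the host -> '/search/index.html'
```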