Web crawler: also known as a web spider. Its basic operation is to fetch (crawl) web pages.
Browsing the web: when you open Baidu (www.baidu.com) in the Firefox browser, the browser acts as a 'client'.
It sends a request to the server, 'crawls' the server's files to the local machine, and then interprets and displays them.
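The request/response step described above can be sketched with Python's standard library. This is only a minimal sketch, not a full crawler: the function name fetch, the User-Agent string, and the 10-second timeout are assumptions made for illustration.

```python
from urllib.request import Request, urlopen


def fetch(url, timeout=10.0):
    """Send a GET request to `url` and return (status code, decoded body)."""
    # Some servers reject requests without a User-Agent header.
    # The value below is a placeholder, not a required name.
    request = Request(url, headers={"User-Agent": "python-notes-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        body = response.read().decode(charset, errors="replace")
        return response.status, body
```

Calling fetch("http://www.baidu.com") would play the role of the browser in the description above: the script is the client, and the returned body is the raw HTML that a browser would go on to render.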
HTML: a markup language that uses tags to mark up content so that it can be parsed and its parts told apart.
What the browser does: it parses the HTML code it receives and turns that source code into the web page we see.
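To illustrate how tags let a program tell the parts of a page apart, here is a small sketch using the standard html.parser module. The class name TagAndTextCollector and the helper summarize are made up for this example. Unlike a real browser, it renders nothing and treats the contents of script or style elements as ordinary text.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Record the opening tags and the non-blank text found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, e.g. <p> or <a href="...">.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only chunks.
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


def summarize(html):
    """Return (list of tag names, list of text fragments) for `html`."""
    collector = TagAndTextCollector()
    collector.feed(html)
    collector.close()
    return collector.tags, collector.text
```

For a document such as "<h1>Title</h1><p>Hello</p>", summarize separates the markup (h1, p) from the content (Title, Hello), which is the first step a browser takes before laying out the page.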
URL (Uniform Resource Locator, sometimes expanded as Universal Resource Locator): the address of a resource on the web, also called a web address.
URL format:
- protocol
- server (domain name or IP address), sometimes followed by a port number (given as a number; it may be omitted when it is the protocol's default, such as 80 for HTTP)
- path (that is, the specific address of the resource on the host)
- query (starting with '?')
The protocol (first part) and the server (second part) are separated by '://'; the server (second part) and the path (third part) are separated by '/'.
Example: http://zh.wikipedia.org:80/w/index.php
http is the protocol
zh.wikipedia.org is the server
80 is the network port number on the server
/w/index.php is the path
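The same breakdown can be reproduced in Python with urllib.parse.urlsplit from the standard library. The helper name describe_url and the dictionary keys are chosen here only to mirror the terms used in these notes.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split `url` into the parts named in the notes above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # text before '://'
        "server": parts.hostname,   # domain name or IP address
        "port": parts.port,         # None when the URL omits the port
        "path": parts.path,         # address of the resource on the host
        "query": parts.query,       # text after '?', without the '?'
    }


print(describe_url("http://zh.wikipedia.org:80/w/index.php"))
```

When the port is left out of the URL, urlsplit reports it as None rather than filling in the default, so a crawler that needs the actual port has to supply the protocol's default itself.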
The main object a crawler works with is the URL.
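Since a crawler moves from URL to URL, one common step is to collect the links on a page it has already downloaded. The sketch below is one possible way to do that with the standard library. The names LinkCollector and extract_links are hypothetical, and it looks only at the href attribute of a tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the targets of <a href="..."> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links such as "/w/index.php" against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs linked from `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

Each URL returned by extract_links is a candidate for the crawler to request next. A real crawler would also need to remember which URLs it has already visited, which this sketch leaves out.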
Reference: http://blog.csdn.net/pleasecallmewhy/article/details/8922826
Python notes-crawler 1