need to combine: "Baidu search engine keyword URL Collection crawler optimization industry fixed investment Program efficient access to industry traffic-code" study together#百度搜索引擎关键字URL采集爬虫优化行业定投方案高效获得行业流量 #知识点" "1 web crawler2 Python development web crawler3 Requests Library4 file Operations" " #项目结构" "key.txt keyword document, crawling based on keywords in this documentdemo.py the contents of the crawler filesres/software development. TXT crawler-acquired URLs" " #在Pycharm中新建项目: C: ... 0501#该项目暂时没有多线程和多进程#在项目中新建脚本spider. PY #版本信息" "1 Environmental Python32 third-party module requests installation method PIP install requests3 IDE pycharm" " #数据在哪里? Where do you crawl the data? #打开浏览器, open Baidu, in the search box, enter the "program design" click on the "Baidu" button, in the information returned by Baidu, advertising part do not, the rest of the site of each website #爬虫其实就是在模拟浏览器, send an HTTP request to the target website, how is this HTTP request sent? #在浏览器按F12, can help us monitor requests sent by the browser, more than 90% of the site is based on the HTTP request#在搜索框中输入 "Program Design", click on the "Baidu" button, there will be a lot of data in the network bar, each data represents an HTTP request #点击 "Baidu Click" button, the display of the page hypertext is what it looks like? In the blank place right---"view page source code", you will find that the Web page is actually an HTML text, through the browser processing, display as the user sees the appearance. Each message that the front desk sees will be a <a></a> tag, which is a hyperlink. #所以首先需要找到html文本中的 The hyperlink information in the <a> tag, there is a lot of information under the Respont bar of each HTTP request in the network, want to get this data, go to headers bar down to find request URL information (for example: Request url:https://ss3.baidu.com/6onwsjip0qiz8tyhnq/ps_default.gif?_t=1525188253376). #要想访问一个网页首先需要知道的是, URLs in the URL bar, such as: https://www.baidu.com/s?wd=%E7%A8%8B%E5%BA%8F%E8%AE%BE%E8%AE%A1&rsv_spt=1 &rsv_iqid=0x967855b80019cdd1&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_ pg&rsv_enter=0&rsv_sug3=3&rsv_sug1=2&rsv_sug7=100&inputt=643536&rsv_sug4=644636 #过一遍原理#网络爬虫"' simply define what a reptile is .is essentially a program, before getting the text of these lines of code is a reptileThis app can simulate the browser to automatically download the Internet resources we need" " #网络资源" "images, videos, Web pages, documents, etc. that can be accessed on the Internetevery network resource, through what access? For example, Web pages to access URLs through the URL" " #url" "Global Uniform Resource Locator" " #浏览器的工作流程" "The first step: The browser first to access a resource, the first to have a URL, according to the URL to access network resources, the URL to execute its corresponding server, according to the URL, the browser sends HTTP requests (two common ways get/post) ignoring the server's processing (not web development, after all) The second step: The server returns the result to the browser, the return is the HTTP response, normally, the normal return data, the browser will unpack the data to render, display (if the picture is displayed as a picture) to the user different crawler differences are in the requested section, depending on the crawled site, get () need to take the parameters are different" "#爬虫的原理讲完了#难在分析过程#简单在就是发送一个http请求 #开发爬虫的几个步骤" "1. Find Target dataLocate the page where the target data resides or the URL where the target data resides 2. 
# After getting the response
This is the data-processing stage: extract the useful information from the returned HTML page, i.e. do the data cleaning. This stage uses regular expressions. Analysing the returned page shows that each result is wrapped in a div with class="result c-container", and the content of its href="" attribute is the URL we want; use a regular expression to extract it from that part of the text. Write this part with regular expressions as far as possible: parsers such as BeautifulSoup also come down to text matching underneath, and are not as efficient as using a regular expression directly.

# Things to review
Regular expressions, file operations, and basic HTML.

# How to learn regular expressions
Don't be greedy. Many people study greedily and think they can learn every metacharacter in one pass; thinking that way only makes you miserable. You do need to learn the metacharacters, but take them one at a time and practise a lot: design your own strings to match, and keep writing until you have truly mastered that metacharacter. What counts as mastered? Coming back three days later and still being able to write it.

# Study method
In class you are learning the teacher's way of thinking; listen until you know what to do, then work through the class material yourself. When you get stuck, rewatch the video and look at the notes.

# How to write the code: hard-code it first
First write the program with everything hard-coded, then keep optimizing it, and finally let the program adapt to a variety of situations.

# How to learn a third-party library
Study other people's code and read the documentation until you know how to use it.
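To make the data-cleaning step concrete, here is a minimal sketch of the extraction with a regular expression, assuming the class="result c-container" marker described above. The exact pattern and the helper name extract_urls are illustrative; Baidu's real markup changes over time, so the pattern may need adjusting.

```python
# Minimal sketch of step 4: pull the result URLs out of the returned HTML with a
# regular expression. The div class "result c-container" is the marker described in
# the notes; the attribute order assumed by the pattern may differ on the live page.
import re

def extract_urls(html):
    """Return the result URLs found in one Baidu results page."""
    # Each organic result sits in a div with class="result c-container"; the first
    # href that follows it is the link we want.
    pattern = r'class="result c-container[^"]*".*?href="(http[^"]*)"'
    return re.findall(pattern, html, re.S)
```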
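For the file-operation part, a minimal sketch matching the project structure (keywords read from key.txt, URLs written under res/). The helper names and the one-file-per-keyword layout are assumptions for illustration.

```python
# Minimal sketch of step 5: data persistence with plain file operations.
# Assumptions (not from the notes): the helper names and writing one res/<keyword>.txt
# file per keyword.
import os

def read_keywords(path="key.txt"):
    """Read one keyword per line from the keyword file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def save_urls(keyword, urls, folder="res"):
    """Append the collected URLs to res/<keyword>.txt, one URL per line."""
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, keyword + ".txt"), "a", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")
```

Combined with the two sketches above, the main loop of spider.py is then just: for every keyword returned by read_keywords(), fetch the results page, extract the URLs, and pass them to save_urls().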
---------------------------------------------- the crawler part

Baidu search engine keyword URL collection crawler: optimized industry targeted-placement plan for efficiently acquiring industry traffic (notes)