Python Crawler Learning Note 1

Source: Internet
Author: User
Tags hadoop mapreduce

After a period of study, I have finally gotten started with crawlers.

As a first exercise, I crawled a CSDN blog for practice.

The overall idea is to first determine how many listing pages the blog has,

then build the URL of each listing page from its page number,

and finally extract the title and URL of every post on each page.

BeautifulSoup is used here to parse the pages.

# coding=utf-8
# Python 2 script (uses urllib2, reload(sys) and the print statement).
import sys
import urllib2

from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding('utf-8')


def query_item(input, cla=None):
    """Return the div tags of the given class found in the page source."""
    soup = BeautifulSoup(input, "html.parser")
    if cla is None:
        return soup.find_all('div')
    else:
        return soup.find_all('div', class_=cla)


# Example listing page: http://blog.csdn.net/zhaoyl03/article/list/1
url = "http://blog.csdn.net/zhaoyl03/article/list/1"
req_header = {
    'Host': "blog.csdn.net",
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36",
    'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    'Accept-Language': "zh-CN,zh;q=0.8",
    'Connection': "keep-alive",
    "Cache-Control": "max-age=0",
    "Referer": "http://blog.csdn.net",
}
blog_art = []
i = 1
# This loop finds the largest page number and puts the posts found on each page in a list.
while True:
    url = "http://blog.csdn.net/zhaoyl03/article/list/"
    req = urllib2.Request(url + str(i), None, req_header)
    result = urllib2.urlopen(req, None)
    artcle_num = query_item(result.read(), 'list_item article_item')
    if len(artcle_num) < 15:
        # A page with fewer than 15 posts is the last one.
        for x in artcle_num:
            blog_art.append(x)
        break
    else:
        i += 1
        for x in artcle_num:
            blog_art.append(x)
# Now i is the number of listing pages and blog_art holds all the posts.
host_url = 'http://blog.csdn.net'
query_result = {}
for x in blog_art:
    for y in x.find('span', 'link_title'):
        # Get the title and URL of every post.
        query_result[str(y.get_text())] = str(host_url + y.get('href'))
print len(query_result)
for x, y in query_result.items():
    print x + ':' + y
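The paging logic above can be tried without touching the network. The following is a minimal Python 3 sketch, not the author's code: `urllib2` became `urllib.request` in Python 3, and the names `build_request`, `collect_items` and `fetch_items` are hypothetical helpers introduced only for illustration. It assumes, as the loop above does, that a full listing page holds 15 posts.

```python
from urllib.request import Request

LIST_URL = "http://blog.csdn.net/zhaoyl03/article/list/"
PAGE_SIZE = 15  # assumption taken from the loop above: a full page has 15 posts


def build_request(page, headers=None):
    """Build (but do not send) the request for one listing page."""
    return Request(LIST_URL + str(page), headers=headers or {})


def collect_items(fetch_items, page_size=PAGE_SIZE):
    """Collect items page by page until a page comes back short.

    fetch_items(page) is a hypothetical callable that returns the list of
    items on that page; a real crawler would download and parse the page.
    Returns the collected items and the number of the last page requested.
    """
    items = []
    page = 1
    while True:
        found = fetch_items(page)
        items.extend(found)
        if len(found) < page_size:
            return items, page
        page += 1


if __name__ == "__main__":
    # A fake blog with 32 posts served 15 per page, so nothing is downloaded.
    posts = ["post-%d" % n for n in range(32)]

    def fake_fetch(page):
        start = (page - 1) * PAGE_SIZE
        return posts[start:start + PAGE_SIZE]

    items, last_page = collect_items(fake_fetch)
    print(len(items), last_page)
    print(build_request(1, {"User-Agent": "Mozilla/5.0"}).full_url)
```

Passing the fetch step in as a function keeps the stopping rule separate from downloading and parsing, so the rule can be checked against fake data first.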

The results are as follows:

Open Source Robotics toolkits: use virtual arenas to test your robotics algorithms: http://blog.csdn.net/zhaoyl03/article/details/8179441
Design pattern: Observer pattern: http://blog.csdn.net/zhaoyl03/article/details/40223067
Hobbes: People are like wolves: http://blog.csdn.net/zhaoyl03/article/details/8158739
ChiMerge algorithm: Iris data set as an example: http://blog.csdn.net/zhaoyl03/article/details/8689440
Use Python to write a crawler and crawl CSDN content, the perfect solution to 403 Forbidden: http://blog.csdn.net/zhaoyl03/article/details/8631897
Solving boundary value problems using bvp4c in MATLAB: http://blog.csdn.net/zhaoyl03/article/details/8153140
Newton's Descent method: http://blog.csdn.net/zhaoyl03/article/details/8228732
Gadgets Series: Enhancing the functionality of the Windows Runtime Bar (i): http://blog.csdn.net/zhaoyl03/article/details/8887157
Replace with regular expression in UltraEdit: http://blog.csdn.net/zhaoyl03/article/details/8432129
Socrates: Knowing his ignorance: http://blog.csdn.net/zhaoyl03/article/details/8158793
How to remove duplicates of another file from one file: http://blog.csdn.net/zhaoyl03/article/details/8188264
To add a row or column to a PPT table: http://blog.csdn.net/zhaoyl03/article/details/8156308
Another solution to the beauty of programming "string shift contains problems": http://blog.csdn.net/zhaoyl03/article/details/8656755
C++ file terminator: http://blog.csdn.net/zhaoyl03/article/details/8165989
Some tips in MATLAB: http://blog.csdn.net/zhaoyl03/article/details/8155941
A function that clears a series of symbolic definitions in Mathematica: http://blog.csdn.net/zhaoyl03/article/details/8205689
Talking about the computational in SXSW2013: http://blog.csdn.net/zhaoyl03/article/details/8822284
Bruno Buchberger: A Life Devoted to Symbolic Computation: http://blog.csdn.net/zhaoyl03/article/details/8612627
Magical Windows "Run": http://blog.csdn.net/zhaoyl03/article/details/8874937
Usage of MATLAB function handle: http://blog.csdn.net/zhaoyl03/article/details/8215588
Clenshaw–Curtis quadrature: http://blog.csdn.net/zhaoyl03/article/details/8500408
Applications of cin.fail, cin.clear, cin.sync: http://blog.csdn.net/zhaoyl03/article/details/8167049
Visual Studio Command window: http://blog.csdn.net/zhaoyl03/article/details/8144816
Study notes: cin.clear(istream::failbit): http://blog.csdn.net/zhaoyl03/article/details/8197649
BloomFilter (Bloom filter): http://blog.csdn.net/zhaoyl03/article/details/8653391
Mass data processing (i): http://blog.csdn.net/zhaoyl03/article/details/8684006
The difference between a database and a data warehouse: http://blog.csdn.net/zhaoyl03/article/details/8655596
Design pattern: Singleton pattern: http://blog.csdn.net/zhaoyl03/article/details/40264363
C++ typedef usage in detail: http://blog.csdn.net/zhaoyl03/article/details/8195621
Python write crawler: crawl web pages and parse HTML: http://blog.csdn.net/zhaoyl03/article/details/8631645
Build a Hadoop environment on Ubuntu (standalone mode + pseudo-distributed mode): http://blog.csdn.net/zhaoyl03/article/details/8657104
Regular expression substitution in TeX: http://blog.csdn.net/zhaoyl03/article/details/8686915
Common DOS commands, complete collection: http://blog.csdn.net/zhaoyl03/article/details/8144856
Java's first program: http://blog.csdn.net/zhaoyl03/article/details/8457074
Rousseau: Man is not in Chains: http://blog.csdn.net/zhaoyl03/article/details/8158752
Mathematica function call stops calculation when an exception occurs: http://blog.csdn.net/zhaoyl03/article/details/8191083
Ada (Ada Lovelace): http://blog.csdn.net/zhaoyl03/article/details/8279768
Study notes: C++ pointer to a character array: http://blog.csdn.net/zhaoyl03/article/details/8274575
Gadgets Series: Enhancements to the Windows Runtime Bar (ii): http://blog.csdn.net/zhaoyl03/article/details/8887724
Python Sort: http://blog.csdn.net/zhaoyl03/article/details/8683091
Implementing Hadoop MapReduce programs using Python: http://blog.csdn.net/zhaoyl03/article/details/8657031
The beauty of mathematics: the Ordinary and Magical Bayes method: http://blog.csdn.net/zhaoyl03/article/details/8655464
Map for making Web visitors: http://blog.csdn.net/zhaoyl03/article/details/8531409
C++ atof(): http://blog.csdn.net/zhaoyl03/article/details/8176387
Data Mining Study notes: KNN algorithm (ii): http://blog.csdn.net/zhaoyl03/article/details/8679256
Data Mining Study notes: ID3 Algorithm (i): http://blog.csdn.net/zhaoyl03/article/details/8665663
C/C++ compiler cl.exe command options: http://blog.csdn.net/zhaoyl03/article/details/8144675
Excel table Multiplication function formula: http://blog.csdn.net/zhaoyl03/article/details/8208537
Using WinDbg to analyze Minidump: http://blog.csdn.net/zhaoyl03/article/details/8217337
The path of excellent ASP.NET Programmer: http://blog.csdn.net/zhaoyl03/article/details/8456466
Insert a mathematical formula on the CSDN Web page: http://blog.csdn.net/zhaoyl03/article/details/8153608
Using Chebfun to solve the Blasius equation (ii): http://blog.csdn.net/zhaoyl03/article/details/8266419
Python and the Simple web crawler: http://blog.csdn.net/zhaoyl03/article/details/8631928
[Scholar's] history: The Rise of great powers: from China to China Xuanshuo: http://blog.csdn.net/zhaoyl03/article/details/8177741
A study of the MathLink of the Mathematica System Communication mechanism: http://blog.csdn.net/zhaoyl03/article/details/8181690
Physicists Discover a whopping New Solutions to three-Body problem: http://blog.csdn.net/zhaoyl03/article/details/8822310
How to use Mathematica to invoke a function written in C: http://blog.csdn.net/zhaoyl03/article/details/8181706
Gadgets series: Python calls Google Translate: http://blog.csdn.net/zhaoyl03/article/details/8830806
First Glimpse of Applet: http://blog.csdn.net/zhaoyl03/article/details/8810940
Charles Babbage-the father of computer pioneers: http://blog.csdn.net/zhaoyl03/article/details/8279940
Lobatto quadrature: http://blog.csdn.net/zhaoyl03/article/details/8155438
Enter the Greek alphabet in MATLAB: http://blog.csdn.net/zhaoyl03/article/details/8147696
How Excel sets the Print area: http://blog.csdn.net/zhaoyl03/article/details/8144595
Batch for command explanation: http://blog.csdn.net/zhaoyl03/article/details/8886067
sizeof: http://blog.csdn.net/zhaoyl03/article/details/9090639
cin.get, cin.clear and cin.sync: http://blog.csdn.net/zhaoyl03/article/details/8167024
Data Mining Study notes: KNN Algorithm (i): http://blog.csdn.net/zhaoyl03/article/details/8666906
Chebyshev Expand: http://blog.csdn.net/zhaoyl03/article/details/8494474
Python yield: http://blog.csdn.net/zhaoyl03/article/details/8683936
Socrates: "Know Yourself": http://blog.csdn.net/zhaoyl03/article/details/8158812
cin.get(), stream, and buffer: http://blog.csdn.net/zhaoyl03/article/details/8165889
C++ call exe using the system with parameters: http://blog.csdn.net/zhaoyl03/article/details/8176699
Data Mining Study notes: KNN algorithm (iii): http://blog.csdn.net/zhaoyl03/article/details/8679378
Study notes: C++ pointer to function: http://blog.csdn.net/zhaoyl03/article/details/8195922
OpenCL Development Case Study: http://blog.csdn.net/zhaoyl03/article/details/8517369
Using Chebfun to solve the Blasius equation (i): http://blog.csdn.net/zhaoyl03/article/details/8263627
Embed search and access counters on a Web page: http://blog.csdn.net/zhaoyl03/article/details/8524693
Shanks transformation: http://blog.csdn.net/zhaoyl03/article/details/8607019
[Finished in 2.5s]
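One detail of the result handling is worth noting: the script stores results in a dict keyed by title, so two posts with the same title would collapse into one entry, and a Python 2 dict does not preserve insertion order. Below is a small Python 3 sketch of an alternative that keeps the posts in page order as a list of pairs and joins the host and path with `urljoin`. The helper names `to_absolute` and `format_lines` are hypothetical, and the padded sample title is made up for illustration.

```python
from urllib.parse import urljoin

HOST_URL = "http://blog.csdn.net"


def to_absolute(pairs, host=HOST_URL):
    """Turn (title, href) pairs into (title, absolute URL) pairs, keeping order."""
    return [(title.strip(), urljoin(host, href)) for title, href in pairs]


def format_lines(pairs):
    """Render each pair as one line: title, colon, URL."""
    return ["%s: %s" % (title, url) for title, url in pairs]


if __name__ == "__main__":
    # Relative hrefs as they would appear in the listing page (assumed shape).
    sample = [
        ("  Python Sort ", "/zhaoyl03/article/details/8683091"),
        ("sizeof", "/zhaoyl03/article/details/9090639"),
    ]
    for line in format_lines(to_absolute(sample)):
        print(line)
```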
