For the Java final exam, the teacher unexpectedly announced there would be no written test; instead, our grade would come from a program we write ourselves. I was caught completely off guard.
Anyway, I decided to write a Baidu Tieba crawler for the assignment.
I ran the experiment against our own school's Tieba forum (Guilin University of Technology). This is just a simple test, so please go easy on it.
Use Jsoup to parse the crawled pages.
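Jsoup is a Java HTML parser that selects elements by tag, id, or class and extracts their text. Since the code in this post is Python, here is a rough Python analogue of that idea, built only on the standard library's `html.parser`. The class name `d_post_content` is an assumption based on Tieba's typical markup, not something taken from the original post:

```python
# Rough Python analogue of Jsoup's "select elements by class, take their
# text" workflow, using only the standard library.
from html.parser import HTMLParser

class PostContentParser(HTMLParser):
    """Collect the text inside every element whose class list contains
    a target class (e.g. Tieba's assumed "d_post_content")."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0          # > 0 while inside a matching element
        self.posts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if self.depth > 0:
            self.depth += 1     # nested tag inside a matching element
        elif self.target_class in classes:
            self.depth = 1
            self.posts.append('')

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.posts[-1] += data

parser = PostContentParser('d_post_content')
parser.feed('<div class="d_post_content">first post</div>'
            '<div class="other">skip</div>'
            '<div class="d_post_content extra">second <b>post</b></div>')
print(parser.posts)  # ['first post', 'second post']
```

This is only a sketch: it ignores void tags such as `<br>` and malformed markup, which Jsoup handles robustly.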
# Get the source of each page and store the extracted posts in an array
def get_data(self, url, endpage):
    url = url + 'pn='
    for i in range(1, endpage + 1):
        print u'Crawler report: page %d is loading ...' % i
        myPage = urllib2.urlopen(url + str(i)).read()
        # Process the HTML code in myPage and store the results in self.datas
        self.deal_data(myPage.decode('GBK'))

# Pull content out of the page code
def deal_data(self, mypage):
    myItems = re.findall('id="post_content.*?>(.*?)</div>', mypage, re.S)
    for item in myItems:
        self.datas.append(item)
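The fragment above is Python 2 (`urllib2`, `print u'...'`). Below is a Python 3 sketch of the same two-method structure. The Tieba URL pattern, the GBK decoding, and the `post_content` regex are assumptions carried over from the snippet; the fetch function is injectable so the parsing logic can be exercised without a live network request:

```python
# Python 3 sketch of the two-method crawler above. URL pattern, encoding,
# and the post_content regex are assumptions, not a verified Tieba API.
import re
import urllib.request

class TiebaSpider:
    def __init__(self, fetch=None):
        # fetch(url) -> decoded HTML string; defaults to a plain urllib GET
        self.fetch = fetch or self._default_fetch
        self.datas = []

    @staticmethod
    def _default_fetch(url):
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode('gbk', errors='replace')

    def get_data(self, url, endpage):
        """Download pages 1..endpage and collect their post bodies."""
        for i in range(1, endpage + 1):
            print('Crawler report: page %d is loading ...' % i)
            page = self.fetch(url + 'pn=' + str(i))
            self.deal_data(page)

    def deal_data(self, page):
        """Pull the text of each post_content div out of the page source."""
        items = re.findall(r'id="post_content[^"]*"[^>]*>(.*?)</div>',
                           page, re.S)
        for item in items:
            self.datas.append(item.strip())

# Offline usage with a canned page instead of a live request:
fake_page = ('<div id="post_content_1" class="p">hello</div>'
             '<div id="post_content_2" class="p">world</div>')
spider = TiebaSpider(fetch=lambda url: fake_page)
spider.get_data('http://tieba.baidu.com/p/123?', 1)
print(spider.datas)  # ['hello', 'world']
```

Injecting the fetcher keeps the download and parsing concerns separate, which also makes it easy to swap in `requests` or add throttling later.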