Python crawler source for automatic crawling of 163 news

Source: Internet
Author: User

This article is part of a series on learning Python crawlers. It presents the source of a crawler, written in Python, that automatically crawls NetEase (163) news.

The crawler works in two steps:
(1) Fetch the target news page and pick out the links that begin with news.xxx.com (here, news.163.com).
(2) Fetch the content of each link and merge it into a prepared .txt file, so the news can be read there.
Note, however, that the test target, NetEase News, does not format its pages uniformly, so some articles will be missed. That is tolerable for this exercise, and readers who are able to are welcome to help improve the code.
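Step (1) can be sketched as follows. This is a minimal illustration written for Python 3, not the article's own listing: the function name `extract_links`, the constant `SAMPLE_HTML`, and the sample URLs are assumptions made up for the example, and it runs on an inline HTML string so that no network access is needed.

```python
import re

# Same idea as the article's pattern: "news.163.com/" followed by a digit,
# then everything up to the closing </a> of that anchor.
LINK_PATTERN = re.compile(r"news\.163\.com/\d.+?</a>")
TAG_PATTERN = re.compile(r"<[^>]+>")

# Hypothetical sample markup; the URLs are invented for illustration.
SAMPLE_HTML = (
    '<a href="http://news.163.com/14/0520/10/ABC123.html">First headline</a>\n'
    '<a href="http://news.163.com/special/topic/">Topic page</a>\n'
    '<a href="http://sports.163.com/14/0520/11/XYZ.html">Sports</a>\n'
    '<a href="http://news.163.com/14/0520/12/DEF456.html" target="_blank">'
    'Second <b>headline</b></a>\n'
)


def extract_links(html):
    """Return (url, title) pairs for article links found in html."""
    links = []
    for match in LINK_PATTERN.findall(html):
        # match looks like: news.163.com/14/0520/10/ABC123.html">Headline</a>
        url_part, _, rest = match.partition('"')
        if not url_part.endswith((".html", ".htm")):
            continue  # skip anything that is not an article page
        # Drop any remaining attributes, then strip tags nested in the title.
        title = TAG_PATTERN.sub("", rest.partition(">")[2]).strip()
        links.append(("http://" + url_part, title))
    return links


if __name__ == "__main__":
    for url, title in extract_links(SAMPLE_HTML):
        print(url, title)
```

Only the two article links under news.163.com survive the filter; the topic page (no digit after the slash) and the sports.163.com link are ignored.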
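One possible way to reduce the number of missed articles is to collect paragraph text with an HTML parser rather than a single regular expression. The sketch below uses `html.parser` from the Python 3 standard library; it is a suggested improvement, not the article's method, and the names `ParagraphCollector` and `extract_paragraphs` are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, ignoring case and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def flush_paragraph(self):
        # Store the buffered text (if any) as one finished paragraph.
        text = "".join(self._buffer).replace("\xa0", " ").strip()
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":  # the parser lowercases tag names, so <P> matches too
            self.flush_paragraph()  # a new <p> implicitly closes an open one
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the text of all paragraphs in html, with inline tags removed."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # pick up a final paragraph that was never closed
    return parser.paragraphs
```

Because the parser normalizes tag case and tolerates unclosed `<p>` tags, pages that differ slightly in markup are less likely to yield an empty result. A page that returns an empty list can then be logged for manual inspection.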

The Python crawler source for automatic crawling of 163 news is as follows:

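    # coding: utf-8
    # Python 2 code: it relies on urllib.urlopen, the print statement and the <> operator.
    import re, urllib

    strTitle = ""
    strTxtTmp = ""
    strTxtOK = ""

    f = open("163News.txt", "w+")

    # Find every link on the home page of the form news.163.com/<digit>...</a>
    m = re.findall(r"news\.163\.com/\d.+?<\/a>", urllib.urlopen("http://www.163.com").read(), re.M)

    for i in m:
        testUrl = i.split('"')[0]
        if testUrl[-4:-1] == "htm":
            strTitle = strTitle + "\n" + i.split('"')[0] + i.split('"')[1]  # merge the title header content

            okUrl = i.split('"')[0]  # rebuild the link
            UrlNews = "http://" + okUrl

            print UrlNews

            # Find and parse the body text behind the link. Because 163 news pages
            # are not formatted very uniformly, this only works for most of them.
            # Some HTML is stripped so that the text is easier to read.
            # NOTE: the HTML tags inside the patterns below were lost from the
            # published listing. <P.*?>, "&nbsp;", <STRONG>, </STRONG> and
            # <[Aa] .*?> are assumed reconstructions, not the confirmed original.
            n = re.findall(r"<P.*?>(.*?)<\/P>", urllib.urlopen(UrlNews).read(), re.M)

            for j in n:
                if len(j) <> 0:
                    j = j.replace("&nbsp;", "\n")
                    j = j.replace("<STRONG>", "\n_____")
                    j = j.replace("</STRONG>", "_____\n")
                    strTxtTmp = strTxtTmp + j + "\n"
                    strTxtTmp = re.sub(r"<[Aa] .*?>", r"", strTxtTmp)
                    strTxtTmp = re.sub(r"<\/[Aa]>", r"", strTxtTmp)

            # Combine the link, the title and the body text
            strTxtOK = strTxtOK + "\n\n\n===============" + i.split('"')[0] + i.split('"')[1] + "===============\n" + strTxtTmp
            strTxtTmp = ""
            print strTxtOK

    f.write(strTitle + "\n\n\n" + strTxtOK)  # once everything is parsed, write it to the file
    f.close()  # close the file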
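The listing above targets Python 2. A rough Python 3 equivalent of its fetch and write steps might look like the sketch below. The helper names (`decode_body`, `fetch`, `write_report`) are hypothetical, and the `gbk` fallback encoding is an assumption about the target pages rather than something the article states.

```python
import urllib.request


def decode_body(raw, declared_charset=None, fallback="gbk"):
    """Decode raw bytes: try the declared charset, then UTF-8, then the fallback."""
    for charset in (declared_charset, "utf-8", fallback):
        if not charset:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next candidate
    # Last resort: keep going, replacing bytes that cannot be decoded.
    return raw.decode(fallback, errors="replace")


def fetch(url, timeout=10):
    """Download url and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


def write_report(path, articles):
    """Write a title index followed by one section per article.

    articles is a list of (url, title, body) tuples.
    """
    index = "\n".join(f"{url} {title}" for url, title, _ in articles)
    sections = "".join(
        f"\n\n\n==============={url} {title}===============\n{body}\n"
        for url, title, body in articles
    )
    # The with-block closes the file even if writing fails.
    with open(path, "w", encoding="utf-8") as report:
        report.write(index + "\n\n\n" + sections)
```

In Python 3, `urlopen` returns bytes, so the explicit decoding step is needed before any string pattern can be applied. Collecting the articles in a list and writing once at the end mirrors the original's single `f.write` call.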

The code in this article is of limited robustness; adapt it as needed before using it.
