Learning Python crawlers
Python crawler source for automatically crawling 163 news. This article presents a crawler, written in Python, that automatically crawls NetEase (163) News.
The crawler works in two steps:
(1) Fetch the target page and extract the links that begin with news.xxx.com (here, news.163.com).
(2) Fetch the content of each link and merge it into a prepared .txt file, where the news can be read.
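Step (1) can be tried offline on a small HTML fragment. What follows is a minimal sketch in Python 3, not this article's own code: the sample markup and the helper name `extract_news_links` are invented for illustration, and the pattern assumes that article links look like `news.163.com/<digit>...html` inside a double-quoted `href`.

```python
import re

# Hypothetical sample markup standing in for the portal home page.
SAMPLE_HTML = '''
<a href="http://news.163.com/14/0101/10/ABC123.html">First headline</a>
<a href="http://sports.163.com/14/0101/10/XYZ.html">Sports item</a>
<a href="http://news.163.com/14/0102/11/DEF456.html">Second headline</a>
<a href="http://news.163.com/special/topic/">Topic page</a>
'''

# Assumed link shape: news.163.com/ followed by a digit and ending in .html
LINK_RE = re.compile(
    r'href="https?://(news\.163\.com/\d[^"]*?\.html)"[^>]*>(.*?)</a>',
    re.S,
)

def extract_news_links(html):
    """Return (url, title) pairs for links that look like news articles."""
    return [("http://" + path, title.strip())
            for path, title in LINK_RE.findall(html)]

links = extract_news_links(SAMPLE_HTML)
for url, title in links:
    print(url, title)
```

Only the two links under news.163.com whose path starts with a digit are kept; the sports link and the topic page are skipped.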
Note, however, that the test target, NetEase News, does not use a very uniform page format, so some content will be missed; this is forgivable for now. Readers who are able to are welcome to help improve the code.
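Because page formats vary, a pattern tied to one exact tag can miss paragraphs. One possible way to be more tolerant, sketched here in Python 3 with the standard library's `html.parser` rather than regular expressions, is to collect the text of every `<p>` element regardless of case or attributes. The names `ParagraphCollector` and `extract_paragraphs` are hypothetical, and the sample markup is invented.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, whatever attributes it has."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def _flush(self):
        # Store the buffered text as one paragraph if it is not empty.
        text = "".join(self._buffer).strip()
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <P ...> arrives as "p".
        if tag == "p":
            self._flush()  # tolerate a missing </p> before a new <p>
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        # Text inside inline tags such as <strong> or <a> is kept.
        if self._in_p:
            self._buffer.append(data)

def extract_paragraphs(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs

SAMPLE = (
    '<div><P style="TEXT-INDENT: 2em">First <strong>bold</strong> part.</P>'
    '<p class="lead">Second <a href="#">linked</a> part.</p>'
    '<span>not a paragraph</span></div>'
)
print(extract_paragraphs(SAMPLE))
```

This trades precision for coverage: it also picks up `<p>` elements outside the article body, so some filtering may still be needed on real pages.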
The Python crawler source for automatic crawling of 163 news is as follows:
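# coding: utf-8
# Python 2 script (uses urllib.urlopen and the print statement).
import re, urllib

strTitle = ""
strTxtTmp = ""
strTxtOK = ""

f = open("163News.txt", "w+")

# Each match runs from "news.163.com/<digit>" up to the closing </a>,
# i.e. it holds the link followed by its headline.
m = re.findall(r"news\.163\.com/\d.+?<\/a>", urllib.urlopen("http://www.163.com").read(), re.M)

for i in m:
    testUrl = i.split('"')[0]
    if testUrl[-4:-1] == "htm":
        # Append the link and its headline to the list of titles
        strTitle = strTitle + "\n" + i.split('"')[0] + i.split('"')[1]
        # Rebuild the full link
        okUrl = i.split('"')[0]
        UrlNews = "http://" + okUrl
        print UrlNews

        # Find and parse the body text in the linked page. Because the 163 News
        # format is not very uniform, this only works for most pages.
        # Strip some of the HTML code to make the text easier to read.
        #
        # Assumption: the HTML tags inside the next patterns were lost from the
        # published listing. "<P[^>]*>", "&nbsp;", "<STRONG>", "</STRONG>" and
        # "<[Aa][^>]*>" are stand-ins inferred from context, not the author's
        # confirmed originals.
        n = re.findall(r"<P[^>]*>(.*?)<\/P>", urllib.urlopen(UrlNews).read(), re.M)
        for j in n:
            if len(j) != 0:
                j = j.replace("&nbsp;", "\n")
                j = j.replace("<STRONG>", "\n_____")
                j = j.replace("</STRONG>", "_____\n")
                strTxtTmp = strTxtTmp + j + "\n"
        strTxtTmp = re.sub(r"<[Aa][^>]*>", r"", strTxtTmp)
        strTxtTmp = re.sub(r"<\/[Aa]>", r"", strTxtTmp)

        # Combine the link, the headline and the body text
        strTxtOK = strTxtOK + "\n\n\n===============" + i.split('"')[0] + i.split('"')[1] + "===============\n" + strTxtTmp
        strTxtTmp = ""
        print strTxtOK

# After all links have been processed, write everything to the file
f.write(strTitle + "\n\n\n" + strTxtOK)
# Close the file
f.close()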
The code in this article is of limited effectiveness; please adapt it as needed before using it.
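One adaptation that may be needed is a move to Python 3, since the listing relies on `urllib.urlopen` and the `print` statement from Python 2. The following is a minimal sketch of the fetch-and-merge step under that assumption, not a drop-in replacement. `fetch_html` and `merge_articles` are hypothetical names, the GBK fallback encoding is an assumption about the site, and the `fetch` parameter exists only so the logic can be exercised without a network connection.

```python
import re
import urllib.request

def fetch_html(url, default_charset="gbk"):
    """Download a page and decode it; the GBK default is an assumption."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or default_charset
        return resp.read().decode(charset, errors="replace")

# Any <p ...> element, case-insensitive, spanning line breaks.
PARA_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.I | re.S)
# Any remaining inline tag, removed from the paragraph text.
TAG_RE = re.compile(r"<[^>]+>")

def merge_articles(links, fetch=fetch_html):
    """Build one text report from (url, title) pairs."""
    sections = []
    for url, title in links:
        try:
            html = fetch(url)
        except OSError as exc:  # urllib's URLError derives from OSError
            sections.append("=== %s %s ===\n[skipped: %s]" % (url, title, exc))
            continue
        paragraphs = [TAG_RE.sub("", p).strip() for p in PARA_RE.findall(html)]
        body = "\n".join(p for p in paragraphs if p)
        sections.append("=== %s %s ===\n%s" % (url, title, body))
    return "\n\n".join(sections)

# Offline demonstration with a fake fetcher and invented pages.
pages = {"http://news.163.com/a.html": "<P>One</P><p class='x'>Two <b>bold</b></p>"}

def fake_fetch(url):
    if url not in pages:
        raise OSError("not found")
    return pages[url]

report = merge_articles(
    [("http://news.163.com/a.html", "T1"), ("http://news.163.com/b.html", "T2")],
    fetch=fake_fetch,
)
print(report)
```

A page that fails to download is recorded as skipped instead of stopping the whole run. The resulting string could then be written out with `open("163News.txt", "w", encoding="utf-8")`.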