1. Prepare the input file (one formula per line, tab-separated: category id, formula id, formula name)
fufang_list.txt
yaofang_a	aaiwan	A'ai Pill
yaofang_a	aaiwulingsan	A'ai Wuling Powder
yaofang_a	acaitang	Acai Soup
yaofang_a	afurongjiu	Afurong Lotus Wine
yaofang_a	aqietuoyao	Aqietuo Medicine
yaofang_a	aweichubisan	Awei Nasal-Clearing Powder
yaofang_a	aweigao	Awei Ointment
yaofang_a	aweigaoyao	Awei Plaster
yaofang_a	aweihuapigao	Awei Huapi Paste
yaofang_a	aweihuapisan	Awei Huapi Powder
yaofang_a	aweijikuaiwan	Awei Jikuai Pill
yaofang_a	aweileiwansan	Awei Leiwan Powder
yaofang_a	aweilizhongwan	Awei Lizhong Pill
yaofang_a	aweiliangjiangwan	Awei Liangjiang Pill
yaofang_a	aweiruanjiansan	Awei Ruanjian Powder
yaofang_a	aweisan	Awei Powder
yaofang_a	aweishexiangsan	Awei Shexiang Powder
yaofang_a	aweitongjingwan	Awei Tongjing Pill
yaofang_a	aweiwan	Awei Pill
yaofang_a	aweiwanlinggao	Awei Wanling Ointment
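Each tab-separated line maps directly to one page URL and one output filename: the first two fields joined with "/" form the URL path, and the same fields joined with "_" (plus a running line number) form the output file name. A minimal Python 3 sketch of that mapping, using a line from the list above (the helper name `line_to_targets` is my own, not from the crawler script):

```python
# Derive the query URL and output filename from one tab-separated line,
# mirroring how the crawler script below builds them.
BASE_URL = "http://www.zysj.com.cn/zhongyaofang/{}.html"

def line_to_targets(line, num):
    fields = line.strip("\n").split("\t")
    url_id = "/".join(fields[0:2])        # e.g. "yaofang_a/aweiwan"
    file_out = "_".join(fields[0:2])      # e.g. "yaofang_a_aweiwan"
    return BASE_URL.format(url_id), file_out + "_" + str(num) + ".txt"

url, out_name = line_to_targets("yaofang_a\taweiwan\tAwei Pill\n", 18)
print(url)       # http://www.zysj.com.cn/zhongyaofang/yaofang_a/aweiwan.html
print(out_name)  # yaofang_a_aweiwan_18.txt
```

The third column (the formula name) is ignored by the crawler; only the first two fields matter.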
2. Crawler scripts
get_tcmdata.py
#!/usr/bin/python
# coding: utf8
# Python 2 script: fetch each formula page from zysj.com.cn and save its text.
from __future__ import print_function
import click
import urllib2
import sys
import socket
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding("utf8")
socket.setdefaulttimeout(20)

base_url = "http://www.zysj.com.cn/zhongyaofang/{}.html"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; '
                         'rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

@click.command()
@click.argument('input1')
@click.option("--pos", "-p", default="0",
              help="Skip the first POS lines (resume an interrupted crawl).")
def query_tcm_info(input1, pos):
    """Skip the first POS lines, then crawl one page per remaining line."""
    zhongyaofang_list = open(input1)
    pos = int(pos)
    num = 0
    if pos:
        for i in range(0, pos):
            zhongyaofang_list.readline()
        num = num + pos
    for zhongyaofang_info in zhongyaofang_list:
        num = num + 1
        zhongyaofang_info_list = zhongyaofang_info.strip("\n").split("\t")
        url_id = "/".join(zhongyaofang_info_list[0:2])   # e.g. yaofang_a/aweiwan
        file_out = "_".join(zhongyaofang_info_list[0:2])
        file_out_name = "_".join([file_out, str(num)])   # numbered output file
        output_file = open(file_out_name + ".txt", "w")
        query_url = base_url.format(url_id)
        req = urllib2.Request(query_url, headers=headers)
        content = urllib2.urlopen(req, timeout=20).read()
        soup = BeautifulSoup(content, "html.parser")
        output_file.write(soup.get_text())
        output_file.close()

if __name__ == "__main__":
    query_tcm_info()
3. Run the script
python get_tcmdata.py fufang_list.txt --pos 0
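The --pos option exists so an interrupted crawl can be resumed: the script discards the first pos lines and keeps the running number consistent, so the resumed run produces the same filenames it would have the first time. That skip logic can be sketched in isolation (Python 3, no network; `skip_to_position` is an illustrative name, not a function from the script):

```python
def skip_to_position(lines, pos):
    """Yield (line_number, line) pairs, discarding the first `pos` lines
    and numbering from pos + 1, so resumed runs reuse the same filenames."""
    it = iter(lines)
    for _ in range(pos):
        next(it, None)   # already-crawled entries are skipped
    num = pos
    for line in it:
        num += 1
        yield num, line

entries = ["a\t1\n", "b\t2\n", "c\t3\n"]
print(list(skip_to_position(entries, 1)))  # [(2, 'b\t2\n'), (3, 'c\t3\n')]
```

With --pos 0 (as in the command above) nothing is skipped and the crawl starts from the first line.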
4. Simple Baidu Crawler
#!/usr/bin/python
# coding: utf8
# Minimal Python 2 example: fetch a page with urllib2.
from __future__ import print_function
import sys
reload(sys)
sys.setdefaultencoding("utf8")
import urllib2

url = "http://www.baidu.com"   # target URL (not set in the original snippet)
request = urllib2.Request(url)
request.add_data("a=1")        # POST body; add_data takes a single string
request.add_header('User-Agent', "Mozilla/5.0")
response = urllib2.urlopen(request)
cont = response.read()
print(cont)
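The snippet above is Python 2 only (urllib2, reload/setdefaultencoding). On Python 3 the same request is built with urllib.request; this sketch reuses the URL and header values from the example and stops short of actually sending the request:

```python
import urllib.parse
import urllib.request

url = "http://www.baidu.com"  # same target as the Python 2 example
data = urllib.parse.urlencode({"a": "1"}).encode("utf8")  # POST body "a=1"

request = urllib.request.Request(url, data=data)
request.add_header("User-Agent", "Mozilla/5.0")

# Sending it would be:
#   response = urllib.request.urlopen(request, timeout=20)
#   print(response.read().decode("utf8"))
print(request.get_header("User-agent"))  # Mozilla/5.0
print(request.get_method())              # POST (a Request with data is a POST)
```

Note that urllib.request normalizes header names, so the stored key is "User-agent"; supplying `data` switches the method from GET to POST, which is what add_data did in urllib2.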