To crawl an HTML page, you sometimes need to set POST request parameters, parse the response, generate JSON, and save the result to a file.
1) Importing modules

import requests
from bs4 import BeautifulSoup
import re

url = 'http://www.c..............'
2) Setting the parameters

datas = {
    'yyyy': 'the',
    'mm': '-12-31',
    'cwzb': "incomestatements",
    'button2': "%CC%E1%BD%BB",
}
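The %CC%E1%BD%BB value in the parameters is not noise: it is a percent-encoded GBK byte sequence for the Chinese label on the submit button. A minimal sketch of decoding it with the standard library (that the site uses the GBK codec is an assumption based on the escape bytes):

```python
from urllib.parse import unquote

# '%CC%E1%BD%BB' percent-encodes the GBK bytes CC E1 BD BB;
# decoding with the gbk codec recovers the button label.
label = unquote('%CC%E1%BD%BB', encoding='gbk')
print(label)  # -> 提交 ("submit")
```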
3) Sending the POST request

r = requests.post(url, data=datas)
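Under the hood, requests serializes the data dict into an application/x-www-form-urlencoded request body. A minimal illustration with the standard library (field values borrowed from the parameters above):

```python
from urllib.parse import urlencode

# requests.post(url, data=datas) sends exactly this form-encoded body.
datas = {'mm': '-12-31', 'cwzb': 'incomestatements'}
print(urlencode(datas))  # -> mm=-12-31&cwzb=incomestatements
```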
4) Setting the encoding
r.encoding = r.apparent_encoding
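Why set apparent_encoding? If the server does not declare a charset, requests falls back to a default decoding, and Chinese pages turn into mojibake. A small sketch of the failure and the fix (that the target page is GBK-encoded is an assumption):

```python
# Bytes as they might arrive over the wire from a GBK-encoded page.
raw = '提交'.encode('gbk')

print(raw.decode('latin-1'))  # -> Ìá½»  (mojibake: wrong codec assumed)
print(raw.decode('gbk'))      # -> 提交  (correct once the encoding is fixed)
```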
5) Parsing the response with BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')
6) Filtering with find_all

soup.find_all('strong', text=re.compile(u"stock code"))[0].parent.contents[1]
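A self-contained sketch of that navigation on a hypothetical table cell (the markup and the value 600000 are invented for illustration; in current BeautifulSoup versions the text= argument is named string=):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the target page's layout.
html = '<td><strong>Stock Code</strong>600000</td>'
soup = BeautifulSoup(html, 'html.parser')

# Find the <strong> whose text matches, then step to the sibling node
# inside the same <td> via .parent.contents[1].
tag = soup.find_all('strong', string=re.compile('Stock Code'))[0]
print(tag.parent.contents[1])  # -> 600000
```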
7) CSS selection with select

soup.select("option[selected]")[0].contents[0]
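The same attribute selector on a hypothetical year drop-down (markup and values invented for illustration):

```python
from bs4 import BeautifulSoup

# "option[selected]" matches any <option> carrying the selected attribute.
html = ('<select name="yyyy">'
        '<option value="2015">2015</option>'
        '<option value="2016" selected>2016</option>'
        '</select>')
soup = BeautifulSoup(html, 'html.parser')
print(soup.select("option[selected]")[0].contents[0])  # -> 2016
```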
For the full BeautifulSoup API, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#beautifulsoup
Demo: read URLs from a file, set the parameters, send POST requests, parse with BeautifulSoup, generate JSON, and save to a file
import requests
from bs4 import BeautifulSoup
import re
import json
import time

# Read the list of URLs to crawl
fd = open(r"E:\aa.txt", "r")
mylist = []
for line in fd:
    mylist.append(line)

url_pre = 'http://www.c................'
code = open(r"e:\a---. txt", "a", encoding="utf-8")

for index in range(0, len(mylist)):
    print(index)
    # Extract the stock id from the query string of each URL
    url_id = mylist[index].split('?')[-1]
    url_id = url_id[-7:-1]
    datas = {
        'yyyy': 'the',
        'mm': '-12-31',
        'cwzb': "incomestatements",
        'button2': "%CC%E1%BD%BB",
    }
    url = url_pre + str(url_id)
    print(url)
    print(datas)
    r = requests.post(url, data=datas)
    r.encoding = r.apparent_encoding
    soup = BeautifulSoup(r.text, 'html.parser')
    # Skip pages that do not contain an income statement
    if len(soup.find_all("td", text=re.compile(u"Operating Income"))) == 0:
        continue
    jsonmap = {}
    jsonmap[u'Stock Code'] = soup.find_all('strong', text=re.compile(u"Stock Code"))[0].parent.contents[1]
    jsonmap[u'Stock Abbreviation'] = soup.find_all('strong', text=re.compile(u"Stock Code"))[0].parent.contents[3]
    jsonmap[u'Year'] = soup.select("option[selected]")[0].contents[0]
    jsonmap[u'Reporting Period'] = soup.select("option[selected]")[1].contents[0]
    # The income figure sits in the row (parent <tr>) of the matched cell
    yysr = soup.find_all("td", text=re.compile(u"Operating Income"))[0].parent
    yysrsoup = BeautifulSoup(str(yysr), 'html.parser')
    jsonmap[u'Operating Income'] = yysrsoup.find_all('td')[1].contents[0]
    yylr = soup.find_all("td", text=re.compile(u"Operating Profit"))[0].parent
    yylrsoup = BeautifulSoup(str(yylr), 'html.parser')
    jsonmap[u'Operating Profit'] = yylrsoup.find_all('td')[3].contents[0]
    strjson = json.dumps(jsonmap, ensure_ascii=False)
    print(strjson)
    code.write(strjson + '\n')
    time.sleep(0.1)
    code.flush()
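One detail worth noting from the demo: json.dumps escapes non-ASCII characters by default, so ensure_ascii=False is what keeps the Chinese field values readable in the output file. A small illustration with hypothetical values:

```python
import json

row = {'Year': '2016', 'Report': '年报'}
print(json.dumps(row))                      # -> {"Year": "2016", "Report": "\u5e74\u62a5"}
print(json.dumps(row, ensure_ascii=False))  # -> {"Year": "2016", "Report": "年报"}
```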
Crawler 2: HTML page + BeautifulSoup module + POST mode + demo