Objective
As a beginner preparing to switch careers into data analysis, the first thing I tackled was web crawling. Every crawler run turned up a new bug, but after constant debugging I can finally crawl a bit of data, so I want to share it with you here ~
First, take a look at the last page of the search results.
PS: A tip: type a large number, such as 10000, into the page-jump box at the bottom of the page to jump straight to the last page.
Right-click and view the page source, then press Ctrl+F to search for the information you want to crawl, such as the content in the red box.
The "big data analyst" in the red box on the page can't be found!!!
It might be hidden in a JSON file.
So I tried again and searched for just "data analyst".
It's finally there.
Why is that? After inspecting the source, I found:
Between "big" and the "data analyst" that follows it there is a <b> tag. What does that mean? Alarmed, I hurried to Baidu it.
It sets the text to bold? Excuse me? Well, it does show up in bold on the page.
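Here is a small sketch of why the first search failed: once part of the phrase is wrapped in a <b> tag, the phrase is no longer one contiguous string in the source. The HTML fragment below is made up for illustration and is not the site's real markup.

```python
import re

# Hypothetical fragment of page source: the search keyword is wrapped in <b>.
source = '<a title="job">Big <b>data analyst</b></a>'

# The full phrase is interrupted by the tag, so a plain search misses it.
print("Big data analyst" in source)   # False

# The part inside the <b> tag is still contiguous, so this search succeeds.
print("data analyst" in source)       # True

# Removing the <b> and </b> tags restores the full phrase.
clean = re.sub(r"</?b>", "", source)
print("Big data analyst" in clean)    # True
```

Ctrl+F in the source view behaves like the plain `in` check here: it matches raw characters, tags included.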
Looking further through the source, I found that all the information I want is right here (red box), so it seems there is no need for packet-capture analysis ~
There's no time to explain, get in the car!
Well, there's no driver here at all, so let's start writing code ...
The code above sets the path and lays the groundwork for writing the final data to an Excel file.
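For readers who want something concrete, this is a minimal sketch of that kind of setup. It is not the original code: it writes a CSV file with the standard library instead of a real Excel workbook, and the function name `save_rows`, the file name, and the five column names are all placeholders I made up.

```python
import csv
import os

# Hypothetical column names; the post does not say which five fields are scraped.
HEADER = ["title", "company", "location", "salary", "date"]

def save_rows(rows, out_dir, filename="jobs.csv"):
    """Write the header plus all rows to a CSV file and return its path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    # utf-8-sig prepends a byte-order mark, which helps Excel detect UTF-8.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)
    return path
```

Calling `save_rows(rows, "output")` creates the `output` folder if needed and returns the path of the written file, which can then be opened in Excel or WPS.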
result11 = []
result21 = []
result31 = []
result41 = []
result51 = []
Set up five empty lists to hold the information I eventually want to scrape.
There's no Chinese in the URL, so I copied it out and visited it to check.
Sure enough, it works!!!
Notice that the URL ends with p=1, which is probably the page number. I'll try changing it to 5.
Look, sure enough! Now I'll try the last page, page 90.
range(1, 91) loops over pages 1~90, and "p=" + str(k) builds the URL for each pass of the loop (I'm going to crawl all 90 pages).
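The URL loop described above can be sketched like this. The base URL is a placeholder, since the real address is not shown in this post; only the `p=` page parameter and the 1~90 range come from the text.

```python
# Hypothetical base URL standing in for the real search address.
BASE_URL = "https://example.com/jobs/search?keyword=data+analyst&p="

def build_page_urls(first_page=1, last_page=90):
    """Return one URL per results page, from first_page to last_page inclusive."""
    # range() excludes its upper bound, so range(1, 91) yields 1..90.
    return [BASE_URL + str(k) for k in range(first_page, last_page + 1)]

urls = build_page_urls()
print(len(urls))   # 90
print(urls[0])     # ends with p=1
print(urls[-1])    # ends with p=90
```

The easy mistake is writing range(1, 90), which would silently skip the last page.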
Looking at how the page is built, I chose regular expressions for the extraction.
Each time a page is extracted, its information is appended to the lists result11~51, loop after loop.
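A rough sketch of this extract-and-append pattern is below. The markup, the class names, and the two patterns are invented for illustration, and two lists stand in for the five; the real patterns depend on the site's actual HTML.

```python
import re

# Hypothetical markup for one results page with two listings.
page_html = '''
<div class="job"><span class="title"><b>Data</b> Analyst</span><span class="company">Acme</span></div>
<div class="job"><span class="title">Senior <b>Data</b> Analyst</span><span class="company">Globex</span></div>
'''

# Non-greedy groups capture the text between each pair of tags.
TITLE_RE = re.compile(r'<span class="title">(.*?)</span>', re.S)
COMPANY_RE = re.compile(r'<span class="company">(.*?)</span>', re.S)

titles, companies = [], []

def extract_page(html):
    """Append every match found on one page to the running lists."""
    titles.extend(TITLE_RE.findall(html))
    companies.extend(COMPANY_RE.findall(html))

extract_page(page_html)   # first "page"
extract_page(page_html)   # second "page" (same sample reused)
print(len(titles), len(companies))   # 4 4
print(titles[0])                     # <b>Data</b> Analyst
```

Because the lists live outside the function, they keep growing across pages. The captured text still contains the <b> tags, since the patterns copy whatever sits between the outer tags.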
The results look like this:
A total of 5,221 records, not the 12,354 the web search reports. More than half got eaten!
I ran it again and, sure enough, the number was different. OK ... this problem still needs solving. If any experts out there understand what's going on, please leave a comment and enlighten me.
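I don't know the cause, but one way to narrow it down is to record how many records each page yields and flag the pages that come up short. The sketch below assumes that a full page always holds the same number of records; the function name and the sample counts are made up.

```python
def find_short_pages(counts_by_page):
    """Return the pages whose record count is below the largest count seen."""
    if not counts_by_page:
        return []
    # Assumption: the largest count observed is what a full page should yield.
    expected = max(counts_by_page.values())
    return sorted(p for p, n in counts_by_page.items() if n < expected)

# Made-up counts for illustration: page 3 came back empty, page 5 partial.
counts = {1: 50, 2: 50, 3: 0, 4: 50, 5: 20}
print(find_short_pages(counts))   # [3, 5]
```

The last page can legitimately be shorter than the rest, so a flagged final page is not necessarily a problem.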
These <b></b> tags look ugly, so I used Excel to do some post-processing.
Find and replace
Uh, an error.
My default was to open the file with WPS. After switching to Office Excel, the operation gives the following result.
Isn't that much better? When I get the chance, I'll continue with some follow-up analysis of this data ~
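The same cleanup could also be done in Python before the file is written, which avoids the spreadsheet step entirely. This is a minimal sketch with made-up sample strings; `strip_bold` is a helper name I invented.

```python
def strip_bold(values):
    """Remove <b> and </b> from every string in a list."""
    return [v.replace("<b>", "").replace("</b>", "") for v in values]

# Sample values for illustration only.
raw_titles = ["<b>Data</b> Analyst", "Senior <b>Data</b> Analyst"]
print(strip_bold(raw_titles))   # ['Data Analyst', 'Senior Data Analyst']
```

Plain str.replace is enough here because the tags are fixed strings with no attributes.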
The complete code is as follows:
The code takes about 15~20 seconds to run.