This post uses Python to crawl journal information from Medsci: set the query conditions, then retrieve each matching journal's impact factor rank, journal name, full English name, and impact factor. The main process is as follows:
First, analyze how the website http://www.medsci.cn/sci talks to the server. In Chrome or Firefox, open "Inspect Element --> Network", then perform the action on the page and watch the traffic it exchanges with the site. When you click "I want to query" on the page, the browser sends a POST request to the server, and the server returns the query results.
Then, extract the required data from the query results with regular expressions.
Finally, write the extracted data to a file.
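The last two steps can be tried offline before touching the network. The following is a minimal Python 3 sketch, not the article's code: the HTML fragment, journal names, and impact factors are made up for illustration, the helper names `extract_rows` and `write_rows` are hypothetical, and the patterns are simplified versions of the ones used in the article's code. The real Medsci markup may differ.

```python
import csv
import io
import re

# Made-up fragment standing in for the server's reply; the real markup may differ.
SAMPLE_HTML = (
    "<td><a href='#' target=_blank>EX J A</a><br>Example Journal A</td>\n"
    "<td>\n12.345</td>\n"
    "<td><a href='#' target=_blank>EX J B</a><br>Example Journal B</td>\n"
    "<td>\n6.789</td>\n"
)


def extract_rows(html):
    """Return (abbreviation, full name, impact factor) tuples found in html."""
    abbrs = re.findall(r"_blank>([^<]*)</a>", html)
    fulls = re.findall(r"<br>([^<]+)</td>", html)
    factors = re.findall(r"\n([0-9.]+)[\s<]", html)
    # zip() stops at the shortest list, like the min-length check in the article's code.
    return list(zip(abbrs, fulls, factors))


def write_rows(rows, out, first_rank=1):
    """Write rows as CSV lines, each prefixed with a running rank."""
    writer = csv.writer(out)
    for rank, row in enumerate(rows, start=first_rank):
        writer.writerow([rank, *row])


rows = extract_rows(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for open("res.csv", "w", newline="")
write_rows(rows, buffer)
print(buffer.getvalue())
```

Using the `csv` module instead of joining strings with commas keeps the file valid even if a journal name happens to contain a comma.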
The key to the code is analyzing the POST request: find the data that has to be sent to the server, and fill in the headers of the HTTP request.
In the browser, under "Inspect Element --> Network --> the POST request --> Headers", you can find a Form Data table. This table holds all the query criteria.
By assigning values to these form fields in your code, you can simulate the browser's POST request and get back the HTML. The next step is to process the returned data further.
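As a rough illustration of simulating the browser, here is a Python 3 sketch that builds such a POST request with the standard library and inspects it without sending anything. The URL, field names (written here in lowercase), and header values are taken from the article's code; whether the site still accepts them is an assumption, and `build_request` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

URL = "http://www.medsci.cn/sci/index.do?action=search"


def build_request(form, headers):
    """Encode the form fields and wrap them in a POST request object."""
    body = urlencode(form).encode("utf-8")
    # Passing data= makes urllib send the request as POST instead of GET.
    return Request(URL, data=body, headers=headers)


form = {
    "fullname": "",
    "impact_factor_b": "0",
    "impact_factor_s": "0",
    "rank": "if_rank_b",
}
headers = {
    "Referer": "http://www.medsci.cn/sci",
    "User-Agent": "Mozilla/5.0",
}

req = build_request(form, headers)
print(req.get_method())
print(parse_qs(req.data.decode("utf-8"), keep_blank_values=True))
```

To actually send the request, pass `req` to `urllib.request.urlopen` and read the response body.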
A brief description of some of the variables in the code:
num: the number of journals to get
value: holds the query criteria. The parameter names for each sub-condition are as follows:
fullname: journal keywords
province: major category of the journal's field
city: second-level category of the journal's field
impact_factor_b: upper bound of the impact factor (IF) range (IF less than this value)
impact_factor_s: lower bound of the impact factor (IF) range (IF greater than this value)
rank: sort order
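These parameters also drive the paging in the article's code: after each page, the last impact factor seen is fed back as impact_factor_b, and the loop stops once num journals are collected or a page comes back with fewer than 50 entries. The Python 3 sketch below reproduces that loop against an in-memory stand-in for the server. `crawl`, `fake_fetch`, and the generated data are hypothetical, and treating a bound of 0 as "no limit" is an assumption about the site's behavior.

```python
def crawl(fetch_page, num, page_size=50):
    """Collect up to num rows, paging by lowering impact_factor_b."""
    form = {
        "fullname": "",
        "province": "",
        "city": "",
        "impact_factor_b": "0",
        "impact_factor_s": "0",
        "rank": "if_rank_b",
    }
    rows = []
    while len(rows) < num:
        page = fetch_page(dict(form))
        rows.extend(page[: num - len(rows)])
        if len(page) < page_size:
            break  # a short page means there is nothing left to fetch
        # The last impact factor seen becomes the next upper bound.
        form["impact_factor_b"] = page[-1][2]
    return rows


# Stand-in for the server: 120 made-up journals sorted by descending IF.
DATA = [("J%03d" % i, "Journal %d" % i, "%.1f" % (200 - i)) for i in range(120)]


def fake_fetch(form):
    """Return the first 50 rows whose IF is below impact_factor_b (0 = no bound)."""
    bound = float(form["impact_factor_b"])
    matches = [r for r in DATA if bound == 0 or float(r[2]) < bound]
    return matches[:50]


top = crawl(fake_fetch, 100)
print(len(top), top[0][0], top[-1][0])
```

One consequence of paging on a strict "less than" bound: journals that share exactly the same impact factor as the last row of a page could be skipped on the next request.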
This code has a bug: if a journal's impact factor is empty or unknown, that journal must be the last one in the results, otherwise the code may raise an exception. Journals with an unknown impact factor are also left out of the final output.
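One possible way around this bug is to match the fields row by row instead of over the whole page, so that a missing impact factor leaves an empty field rather than shifting every later value. The Python 3 sketch below assumes each journal sits in its own `<tr>` element, which has not been checked against the real page; the sample HTML and the name `parse_rows` are made up for illustration.

```python
import re

ROW_RE = re.compile(r"<tr>(.*?)</tr>", re.S)
ABBR_RE = re.compile(r"_blank>([^<]*)</a>")
FULL_RE = re.compile(r"<br>([^<]+)</td>")
IF_RE = re.compile(r"\n([0-9.]+)[\s<]")

# Made-up table rows; the second journal has no impact factor.
SAMPLE = (
    "<tr><td><a href='#' target=_blank>EX J A</a><br>Example Journal A</td>\n"
    "<td>\n1.234</td></tr>\n"
    "<tr><td><a href='#' target=_blank>EX J B</a><br>Example Journal B</td>\n"
    "<td>\n</td></tr>\n"
    "<tr><td><a href='#' target=_blank>EX J C</a><br>Example Journal C</td>\n"
    "<td>\n0.567</td></tr>\n"
)


def parse_rows(html):
    """Return one (abbreviation, full name, impact factor) tuple per table row."""
    records = []
    for row in ROW_RE.findall(html):
        abbr = ABBR_RE.search(row)
        full = FULL_RE.search(row)
        if not (abbr and full):
            continue  # header or unrelated row
        factor = IF_RE.search(row)
        # Keep the journal even when its impact factor is missing.
        records.append((abbr.group(1), full.group(1), factor.group(1) if factor else ""))
    return records


records = parse_rows(SAMPLE)
for record in records:
    print(record)
```

Paging still needs a numeric bound, so a crawler built this way would have to carry forward the last non-empty impact factor when it requests the next page.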
The complete code of the crawler is as follows:
#!/usr/bin/python
# coding=utf-8
# Written for Python 2 (uses urllib2 and the print statement).
import urllib
import urllib2
import re
import time

url = 'http://www.medsci.cn/sci/index.do?action=search'
headers = {
    'POST': url,  # not a standard HTTP header; kept from the original
    'Host': 'www.medsci.cn',
    'Origin': 'http://www.medsci.cn',
    "Referer": "http://www.medsci.cn/sci",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36",
}
# Form data copied from the browser's Network panel. The "province" and
# "Submit" values are shown translated; the live form presumably expects
# the site's original strings.
value = {
    "fullname": "",
    "province": "Medical",
    "city": "",
    "impact_factor_b": "0",
    "impact_factor_s": "0",
    "rank": "if_rank_b",
    "Submit": "I want to inquire"
}

def getData(impact):
    # Request journals whose impact factor is below `impact`.
    value["impact_factor_b"] = impact
    data = urllib.urlencode(value)
    req = urllib2.Request(url, data)
    for key in headers:
        req.add_header(key, headers[key])
    response = urllib2.urlopen(req)
    html = response.read()
    return html

def saveData(html, rank):
    # r1: journal names, r2: full English names, r3: impact factors
    r1 = re.findall(r'_blank>([A-Z|\-|a-z|\s|\(|\)|&]*?)</a>', html)
    r2 = re.findall(r'<br>([A-Z|a-z|\-|\s|&|\(|\)]+?)</td>', html)
    r3 = re.findall(r'\n([0-9|.]+?)[\s|<]', html)

    # Only process as many entries as the shortest list holds.
    le1 = len(r1)
    le2 = len(r2)
    le3 = len(r3)
    le = le1
    if le2 < le:
        le = le2
    if le3 < le:
        le = le3

    count = 0
    flag = True
    if le < 50:
        # Fewer than 50 entries: this is the last page.
        flag = False
    # `num` and `f` are module-level names set in the main block.
    while count < le and rank < num:
        rank += 1
        #print 'count:', count, ', le:', le, ', rank:', rank, ', num:', num
        str1 = str(rank) + "," + r1[count] + "," + r2[count] + "," + r3[count] + "\n"
        count += 1
        f.write(str1)

    return r3[count-1], rank, flag  # return the last impact factor written

if __name__ == "__main__":
    f = open("res.csv", "w+")
    impact = 0
    rank = 0
    num = 100
    flag = True
    while rank < num and flag:
        html = getData(impact)
        impact, rank, flag = saveData(html, rank)
        print 'already get data number:', rank
        time.sleep(2)
    f.close()
    print 'finished!'
Use Python to crawl the top journals ranked by impact factor on Medsci