Use Python to crawl the top-ranked journals by impact factor on MedSci

Last Update:2014-11-19 Source: Internet

Author: User


Use Python to crawl journal information from MedSci: set the query conditions, then retrieve each matching journal's impact factor rank, journal name, full English name, and impact factor. The main process is as follows:

First, analyze how the website http://www.medsci.cn/sci interacts with the server. In Chrome or Firefox, open "Inspect Element" and switch to the Network tab to see the requests the page makes. When you click "I want to query" on the page, the browser sends a POST request to the server, and the server returns the query result.

Then, extract the required data from the returned query results using regular expressions.
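The extraction step can be sketched as follows. This is a minimal Python 3 illustration, not the article's script (which is Python 2): the HTML fragment, the journal names, and the simplified patterns are all made up for the example, and the real MedSci markup may differ.

```python
import re

# Hypothetical fragment of a result page; the real markup may differ.
SAMPLE_HTML = """<tr><td><a href="#" target=_blank>EXAMPLE J MED</a><br>Example Journal of Medicine</td><td>
12.5</td></tr>
<tr><td><a href="#" target=_blank>SAMPLE ACTA</a><br>Sample Acta of Biology</td><td>
3.75</td></tr>"""

# One pattern per column: short name, full name, impact factor.
NAME_RE = re.compile(r'_blank>([A-Za-z\s\-()&]*?)</a>')
FULL_NAME_RE = re.compile(r'<br>([A-Za-z\s\-()&]+?)</td>')
IMPACT_RE = re.compile(r'\n([0-9.]+?)[\s<]')

names = NAME_RE.findall(SAMPLE_HTML)
full_names = FULL_NAME_RE.findall(SAMPLE_HTML)
impacts = IMPACT_RE.findall(SAMPLE_HTML)

# zip() pairs the columns by position and stops at the shortest list.
rows = list(zip(names, full_names, impacts))
for row in rows:
    print(row)
```

Pairing three independent lists by position only works when every row contributes exactly one match to each list.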

Finally, write the extracted data to a file.
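One way to write the output, sketched in Python 3 with the standard csv module rather than manual string concatenation. The function name write_rows and the sample row are hypothetical; an in-memory buffer stands in for the output file so the example has no side effects.

```python
import csv
import io


def write_rows(rows, fileobj):
    """Write (name, full_name, impact) rows as CSV lines, prefixed with a rank."""
    writer = csv.writer(fileobj, lineterminator="\n")
    for rank, (name, full_name, impact) in enumerate(rows, start=1):
        writer.writerow([rank, name, full_name, impact])


# Made-up row; note the comma inside the full name.
rows = [("EXAMPLE J MED", "Example Journal of Medicine, Series A", "12.5")]

buffer = io.StringIO()  # stands in for open("res.csv", "w", newline="")
write_rows(rows, buffer)
print(buffer.getvalue())
```

The csv module quotes any field that contains a comma, so a journal name with a comma in it does not shift the columns.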

The key to the code is analyzing the POST request: find the data that must be sent to the server, and fill in the headers of the HTTP request.

In the browser, under "Inspect Element" > Network > the POST request > Headers, you can find a Form Data table. This table holds all the query criteria.

By assigning values to these form fields in your code, you can simulate the browser sending the POST request and get the returned HTML. The next step is to process the returned data further.
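Simulating the browser's POST can be sketched like this in Python 3 (the article's own script uses Python 2's urllib2). The helper name build_search_request is hypothetical, the field values are placeholders, and the lowercase field names are an assumption to check against the Form Data table. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "http://www.medsci.cn/sci/index.do?action=search"


def build_search_request(form_data, headers):
    """Build (but do not send) a POST request carrying the form fields."""
    body = urlencode(form_data).encode("utf-8")
    # Passing data= makes urllib issue a POST instead of a GET.
    return Request(SEARCH_URL, data=body, headers=headers)


# Placeholder values; copy the real ones from the browser's Form Data table.
form_data = {"fullname": "", "impact_factor_b": "0", "impact_factor_s": "0"}
headers = {"Referer": "http://www.medsci.cn/sci", "User-Agent": "Mozilla/5.0"}

request = build_search_request(form_data, headers)
print(request.get_method())          # POST
print(request.data.decode("utf-8"))  # fullname=&impact_factor_b=0&impact_factor_s=0
```

Sending it would be a call to urllib.request.urlopen(request), whose read() method returns the HTML as bytes.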

A brief description of some of the variables in the code:

num: the number of journals to fetch

value: holds the query criteria. The parameter names for each sub-condition are as follows:

fullname: journal keywords

province: major category of the journal's field

city: second-level category (subcategory) of the journal's field

impact_factor_b: the impact factor must be less than this value

impact_factor_s: the impact factor must be greater than this value

rank: the sort order

This code has a bug: a journal whose impact factor is empty or unknown must be in the last position of the results, otherwise the code may raise an exception. In addition, journals with an unknown impact factor are not written to the final output.
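A likely cause is that the names and the impact factors are collected into separate lists and paired by position, so a row with no impact factor shifts every later row. One possible workaround, sketched in Python 3 under the assumption of the made-up markup below, is to match each row as a whole with an optional impact factor. The pattern, the helper names parse_rows and last_known_impact, and the sample data are all hypothetical.

```python
import re

# Made-up rows; the second journal has no impact factor.
SAMPLE_HTML = (
    '<tr><td><a href="#" target=_blank>EXAMPLE J MED</a><br>Example Journal of Medicine</td><td>\n12.5</td></tr>\n'
    '<tr><td><a href="#" target=_blank>UNRATED REV</a><br>Unrated Review</td><td>\n</td></tr>\n'
    '<tr><td><a href="#" target=_blank>SAMPLE ACTA</a><br>Sample Acta of Biology</td><td>\n3.75</td></tr>\n'
)

# One match per row; the impact factor group may be empty.
ROW_RE = re.compile(
    r'_blank>(?P<name>[^<]*)</a><br>(?P<full>[^<]*)</td><td>\s*(?P<impact>[0-9.]*)\s*</td>'
)


def parse_rows(html):
    """Return (name, full_name, impact) tuples; impact is None when missing."""
    return [
        (m.group("name"), m.group("full"), m.group("impact") or None)
        for m in ROW_RE.finditer(html)
    ]


def last_known_impact(rows):
    """Return the impact factor of the last row that has one, or None."""
    for _, _, impact in reversed(rows):
        if impact is not None:
            return impact
    return None


rows = parse_rows(SAMPLE_HTML)
for row in rows:
    print(row)
print(last_known_impact(rows))
```

Because each tuple comes from a single match, the unrated journal keeps its own row and the journals after it keep their own impact factors.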

The code is as follows:

#!/usr/bin/python
#coding=utf-8
import urllib
import urllib2
import re
import time

global rank
global num
url = 'http://www.medsci.cn/sci/index.do?action=search'
# HTTP headers copied from the browser's request
headers = {
    'POST': url,
    'Host': 'www.medsci.cn',
    'Origin': 'http://www.medsci.cn',
    "Referer": "http://www.medsci.cn/sci",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36",
}
# Form data (query criteria). The values "Medical" and "I want to inquire"
# appear here in translated form; check the names and values against the
# Form Data table shown in the browser.
value = {
    "fullname": "",
    "province": "Medical",
    "city": "",
    "impact_factor_b": "0",
    "impact_factor_s": "0",
    "rank": "if_rank_b",
    "Submit": "I want to inquire"
}

def getData(impact):
    # Query journals whose impact factor is below `impact` and return the HTML
    value["impact_factor_b"] = impact
    data = urllib.urlencode(value)
    req = urllib2.Request(url, data)
    for key in headers:
        req.add_header(key, headers[key])
    response = urllib2.urlopen(req)
    html = response.read()
    return html

def saveData(html, rank):
    # r1: journal names, r2: full English names, r3: impact factors
    r1 = re.findall(r'_blank>([A-Z|\-|a-z|\s|\(|\)|&]*?)</a>', html)
    r2 = re.findall(r'<br>([A-Z|a-z|\-|\s|&|\(|\)]+?)</td>', html)
    r3 = re.findall(r'\n([0-9|.]+?)[\s|<]', html)

    # Use the shortest of the three lists as the row count
    le1 = len(r1)
    le2 = len(r2)
    le3 = len(r3)
    le = le1
    if le2 < le:
        le = le2
    if le3 < le:
        le = le3

    count = 0
    flag = True
    # Fewer than 50 rows means this is the last page
    if le < 50:
        flag = False

    while count < le and rank < num:
        rank += 1
        #print 'count:', count, ', le:', le, ', rank:', rank, ', num:', num
        str1 = str(rank) + "," + r1[count] + "," + r2[count] + "," + r3[count] + "\n"
        count += 1
        f.write(str1)

    return r3[count-1], rank, flag  # return the last impact

if __name__ == "__main__":
    f = open("res.csv", "w+")
    impact = 0
    rank = 0
    num = 100
    flag = True
    while rank < num and flag:
        html = getData(impact)
        impact, rank, flag = saveData(html, rank)
        print 'already get data number:', rank
        time.sleep(2)
    f.close()
    print 'finished!'
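The paging idea in the main loop (query again with the last impact factor seen as the new upper limit, and stop when a page comes back short) can be sketched separately from the network code. This is a Python 3 illustration under stated assumptions: collect and fake_fetch are hypothetical names, and a made-up in-memory list stands in for the website.

```python
PAGE_SIZE = 50  # the article's script treats a page with fewer than 50 rows as the last one


def collect(fetch_page, limit, page_size=PAGE_SIZE):
    """Gather up to `limit` (name, impact) rows, page by page."""
    rows = []
    upper_bound = None  # None: no upper limit on the first query
    while len(rows) < limit:
        page = fetch_page(upper_bound)
        rows.extend(page[: limit - len(rows)])
        if len(page) < page_size:
            break  # short page, so there is nothing further to fetch
        upper_bound = page[-1][1]  # continue below the last impact factor seen
    return rows


# Made-up data source, sorted by descending impact factor.
JOURNALS = [("J%d" % i, 100 - i * 0.5) for i in range(120)]


def fake_fetch(upper_bound):
    """Return the first page of journals whose impact factor is below the bound."""
    matches = [j for j in JOURNALS if upper_bound is None or j[1] < upper_bound]
    return matches[:PAGE_SIZE]


top = collect(fake_fetch, 100)
print(len(top), top[0][0], top[-1][0])  # 100 J0 J99
```

With the strict comparison used in this stand-in, journals that share the boundary impact factor would be skipped; how the real site treats the boundary is not stated in the article. A real fetch function should also pause between requests, as the script above does with time.sleep(2).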


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
