Use Python to crawl the top-ranked journals by impact factor on Medsci

Source: Internet
Author: User

This article uses Python to crawl journal information from Medsci. After setting the query conditions, it retrieves each matching journal's impact factor rank, journal name, full English name, and impact factor. The main process is as follows:

First, analyze how the website http://www.medsci.cn/sci interacts with its server. In Chrome or Firefox, open "Inspect Element --> Network" and then perform the action on the page to see the requests the site exchanges. When you click "I want to query" on the page, the browser sends a POST request to the server, and the server returns the query results.

Then, extract the required data from the query results using regular expressions.
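
As a rough illustration of this extraction step, the sketch below runs regular expressions over a small HTML fragment. The fragment, the journal names, and the function name extract_journals are all invented for the example; the real Medsci markup is likely different, so the patterns would need to be adjusted against the actual response. The sketch uses Python 3.

```python
import re

# Hypothetical markup and journal names, invented for this example.
SAMPLE_HTML = (
    '<tr><td><a href="#" target=_blank>J Ex A</a>'
    '<br>Journal of Examples A</td><td>12.5</td></tr>\n'
    '<tr><td><a href="#" target=_blank>J Ex B</a>'
    '<br>Journal of Examples B</td><td>3.1</td></tr>\n'
)

def extract_journals(html):
    """Return (short name, full name, impact factor) tuples found in html."""
    short_names = re.findall(r'_blank>([^<]*)</a>', html)
    full_names = re.findall(r'<br>([^<]+)</td>', html)
    impacts = re.findall(r'<td>([0-9.]+)</td>', html)
    # zip stops at the shortest list, so a short list truncates the result.
    return list(zip(short_names, full_names, impacts))

for row in extract_journals(SAMPLE_HTML):
    print(row)
```

Because the three lists are matched independently, a row with a missing field would shift the later entries out of step; the patterns only line up when every row is complete.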

Finally, the extracted data is output to a file.
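
The output step could be sketched with Python 3's csv module, as below. The function name write_journals and the sample row are hypothetical. Using csv.writer rather than joining strings with commas is a design choice made here, so that a journal name containing a comma is quoted instead of breaking the columns.

```python
import csv
import io

def write_journals(rows, stream, start_rank=1):
    """Write (name, full name, impact factor) rows to stream, prefixed by rank."""
    writer = csv.writer(stream, lineterminator="\n")
    for rank, (name, full_name, impact) in enumerate(rows, start=start_rank):
        writer.writerow([rank, name, full_name, impact])

# An in-memory buffer stands in for the output file in this example.
buffer = io.StringIO()
write_journals([("J Ex A", "Journal of Examples, Part A", "12.5")], buffer)
print(buffer.getvalue(), end="")
```

To write to a real file, pass a file object opened with open("res.csv", "w", newline="", encoding="utf-8") in place of the buffer.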

The key to the code is to analyze the POST request, find the data that must be sent to the server, and fill in the HTTP request headers.

In the browser, under "Inspect Element --> Network --> the POST request --> Headers", you can find a Form Data table, which holds all of the query conditions.

By assigning values to these form fields in your code, you can simulate the browser's POST request and receive the resulting HTML. The next step is to process the returned data further.
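
A minimal sketch of this step in Python 3 is shown below. It only builds the request and does not send it. The function name build_search_request is hypothetical, and the form field names are written in lowercase as an assumption; check their exact spelling in the browser's Form Data panel before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "http://www.medsci.cn/sci/index.do?action=search"

def build_search_request(form_fields, user_agent="Mozilla/5.0"):
    """Encode the form fields and wrap them in a POST request (nothing is sent)."""
    body = urlencode(form_fields).encode("utf-8")
    request = Request(SEARCH_URL, data=body, method="POST")
    request.add_header("Referer", "http://www.medsci.cn/sci")
    request.add_header("User-Agent", user_agent)
    return request

# Field names assumed from the query conditions described in this article.
request = build_search_request(
    {"fullname": "", "impact_factor_b": "0", "rank": "if_rank_b"}
)
print(request.get_method())
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it and return the response, whose body can then be read and decoded into the HTML to be parsed.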

A brief description of some of the variables in the code:

num: the number of journals to retrieve

value: holds the query conditions. The parameter names for each condition are as follows:

fullname: journal keywords

province: top-level category of the journal's field

city: second-level category of the journal's field

impact_factor_b: upper bound of the impact factor (IF) range; the IF is less than this value

impact_factor_s: lower bound of the IF range; the IF is greater than this value

rank: the sort order
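
The script pages through the results by reusing the upper IF bound: after each query it sets that bound to the last impact factor it received and queries again. The sketch below simulates this with in-memory data instead of the real site. The names collect_top and fake_fetch, the invented journals, and the page size are all hypothetical, and the sketch assumes the server returns only journals whose IF is strictly below the bound, which this article does not confirm.

```python
def collect_top(fetch_page, limit, page_size):
    """Collect up to `limit` (name, impact) pairs, one page at a time."""
    results = []
    upper_bound = 0  # 0 is used for the first query, as in the script
    while len(results) < limit:
        page = fetch_page(upper_bound)
        if not page:
            break
        results.extend(page[:limit - len(results)])
        upper_bound = page[-1][1]  # next query starts below the last IF seen
        if len(page) < page_size:  # a short page means nothing is left
            break
    return results

# Stand-in for the server: invented journals sorted by descending impact factor.
JOURNALS = [("J%d" % i, 10 - i) for i in range(7)]

def fake_fetch(upper_bound, page_size=3):
    """Return the next page of journals whose impact factor is below upper_bound."""
    matches = [j for j in JOURNALS if upper_bound == 0 or j[1] < upper_bound]
    return matches[:page_size]

top = collect_top(fake_fetch, limit=5, page_size=3)
print(top)
```

If several journals share the impact factor at a page boundary, a strict bound like this one would skip some of them, so a real crawler may need to deduplicate or compare names as well.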

This code has a bug: a journal whose impact factor is empty or unknown must be the last entry in the results; otherwise the code may raise an exception. In addition, journals with an unknown impact factor are not written to the final output.
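
One possible way around this, sketched below in Python 3, is to parse each table row as a unit so that a missing impact factor becomes an explicit None instead of leaving the lists of names and impact factors out of step. This is not the author's fix. The row markup, the journal names, and the helper names parse_rows and last_known_impact are hypothetical.

```python
import re

# One pattern per row; the impact factor group is optional.
ROW_PATTERN = re.compile(
    r'_blank>([^<]*)</a><br>([^<]+)</td><td>(\d+(?:\.\d+)?)?</td>'
)

def parse_rows(html):
    """Return (short name, full name, impact or None) for every row in html."""
    rows = []
    for short_name, full_name, impact in ROW_PATTERN.findall(html):
        rows.append((short_name, full_name, float(impact) if impact else None))
    return rows

def last_known_impact(rows):
    """Return the last impact factor that is not None, or None if there is none."""
    known = [impact for _, _, impact in rows if impact is not None]
    return known[-1] if known else None

# Hypothetical markup: the second journal has no impact factor.
SAMPLE_ROWS = (
    '<tr><td><a href="#" target=_blank>J Ex A</a><br>Journal of Examples A</td><td>12.5</td></tr>'
    '<tr><td><a href="#" target=_blank>J Ex B</a><br>Journal of Examples B</td><td></td></tr>'
    '<tr><td><a href="#" target=_blank>J Ex C</a><br>Journal of Examples C</td><td>3.1</td></tr>'
)

rows = parse_rows(SAMPLE_ROWS)
print(rows)
print(last_known_impact(rows))
```

With this shape, journals without an impact factor can still be written to the output with an empty field, and the next query bound is taken only from rows that have a value.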

The code is as follows:

#!/usr/bin/python
#coding=utf-8
# Python 2 code: it relies on urllib2 and the print statement.
import urllib
import urllib2
import re
import time

url = 'http://www.medsci.cn/sci/index.do?action=search'
headers = {
    'POST': url,
    'Host': 'www.medsci.cn',
    'Origin': 'http://www.medsci.cn',
    "Referer": "http://www.medsci.cn/sci",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36",
}
# Query conditions. The values "Medical" and "I want to inquire" are shown
# in translation; the live form probably expects the original Chinese strings.
value = {
    "fullname": "",
    "province": "Medical",
    "city": "",
    "impact_factor_b": "0",
    "impact_factor_s": "0",
    "rank": "if_rank_b",
    "Submit": "I want to inquire"
}

def getData(impact):
    # Use the last impact factor seen as the upper bound of the next query.
    value["impact_factor_b"] = impact
    data = urllib.urlencode(value)
    req = urllib2.Request(url, data)
    for key in headers:
        req.add_header(key, headers[key])
    response = urllib2.urlopen(req)
    html = response.read()
    return html

def saveData(html, rank):
    # r1: journal names, r2: full English names, r3: impact factors
    r1 = re.findall(r'_blank>([a-z|\-|A-Z|\s|\(|\)|&]*?)</a>', html)
    r2 = re.findall(r'<br>([A-Z|a-z|\-|\s|&|\(|\)]+?)</td>', html)
    r3 = re.findall(r'\n([0-9|.]+?)[\s|<]', html)

    # Only process as many rows as the shortest of the three lists.
    le1 = len(r1)
    le2 = len(r2)
    le3 = len(r3)
    le = le1
    if le2 < le:
        le = le2
    if le3 < le:
        le = le3

    count = 0
    flag = True
    # Fewer than 50 rows means this was the last page of results.
    if le < 50:
        flag = False

    while count < le and rank < num:
        rank += 1
        #print 'count:', count, ', le:', le, ', rank:', rank, ', num:', num
        str1 = str(rank) + "," + r1[count] + "," + r2[count] + "," + r3[count] + "\n"
        count += 1
        f.write(str1)

    return r3[count-1], rank, flag  # return the last impact factor

if __name__ == "__main__":
    f = open("res.csv", "w+")
    impact = 0
    rank = 0
    num = 100
    flag = True
    while rank < num and flag:
        html = getData(impact)
        impact, rank, flag = saveData(html, rank)
        print 'already get data number:', rank
        time.sleep(2)
    f.close()
    print 'finished!'
