Preface: Data science is getting hotter and hotter, and web pages are a major source of data. Recently many people have asked me how to scrape web page data. As far as I know, the common programming languages (C++, Java, Python) can all scrape web data, and many statistical computing languages (R, MATLAB) have packages for interacting with websites. I have tried scraping web pages with Java, Python, and R; the syntax differs, but the logic is the same. I am going to use Python to explain the concepts of web scraping; for the specifics, read the manuals or Google other people's blog posts. This is just a starting point. My level is limited, so if there are errors or better approaches, discussion is welcome.
Step One: Familiarize yourself with Python's basic syntax. If you are already familiar with Python, skip to Step Two.
Python is a relatively easy programming language to get started with; how to learn it depends on your programming background.
(1) If you have some programming background, I suggest Google's Python Class, link: https://developers.google.com/edu/python/?hl=zh-CN&csw=1
This is a two-day short course (two full days, of course) of roughly seven videos, each followed by a programming assignment that can be finished within an hour. This was my second class for learning Python (the first was Codecademy's Python, which I watched very early on and have mostly forgotten). At the time I watched the videos and programmed for about an hour a day, finishing in six days, and the results were good: afterwards, writing basic programs in Python was no problem.
(2) If you have no programming background, I recommend the Coursera course from Rice University, An Introduction to Interactive Programming in Python. I did not follow this course myself, but the reviews on CourseTalk are very positive, and there are comments from classmates on the forum as well. Course link: https://www.coursera.org/course/interactivepython. Udacity's CS101 is also a good choice; there is a related discussion post on the forum. That course is called "Build a Search Engine" and devotes special attention to some network-related modules. Other learning resources include Code School and Codecademy; these are quite good, but involve too little programming. Beginners should follow a systematic class and practice more to lay a good foundation.
Of course, everyone's preferences differ, and my recommendations may not suit you. You can take a look at the forum post introducing the open courses other people have taken, or go to coursetalk.org to read the course reviews, and then decide.
Step Two: Learn how to establish a connection with a website and get its data. To write scripts that interact with websites, familiarize yourself with one of Python's modules for working with web pages (urllib, urllib2, httplib); once you know one, the others are similar. These three are the basic modules Python provides for interacting with web pages. There are others, such as mechanize and Scrapy, which I have not used and which may perform better; additions are welcome. For basic web scraping, the first three modules are enough.
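To show how similar these modules are, here is a minimal fetch with urllib alone (my own sketch, not from the original project; the example.com URL is just a placeholder):
# a minimal fetch with urllib (Python 2); example.com is a placeholder URL
import urllib
page = urllib.urlopen('http://www.example.com')
html = page.read()   # the page source as a string with HTML tags
page.close()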
The following code shows how to use urllib2 to interact with Google Scholar and get the page information.
# import the urllib2 module
import urllib2

# Pick a query, for example "on random graph". Every Google Scholar query has a URL;
# you need to analyze how that URL is formed yourself.
query = 'on+random+graph'
url = 'http://scholar.google.com/scholar?hl=en&q=' + query + '&btnG=&as_sdt=1%2c5&as_sdtp='

# Set the header. For some pages you can scrape without setting a header, but here,
# if you don't set it, Google will assume you are a robot and deny access. Some sites
# also require cookies, which is more complicated and not covered here for now.
# To know what to write in the header, use a plugin that shows the headers your browser
# exchanges with the site (many browsers have this built in); I use Firefox's Firebug plugin.
header = {'Host': 'scholar.google.com',
          'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:26.0) Gecko/20100101 Firefox/26.0',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
          'Accept-Encoding': 'gzip, deflate',
          'Connection': 'keep-alive'}

# Create the request and open the connection; Google's server returns the page
# information into the variable con, which is an object
req = urllib2.Request(url, headers=header)
con = urllib2.urlopen(req)

# Call the read() method on con; it returns the HTML page, i.e. plain text with HTML tags
doc = con.read()

# Close the connection. Just like closing a file after reading it: sometimes it works
# without closing, but sometimes problems arise, so as a law-abiding good citizen,
# it is better to close the connection.
con.close()

The code above stores the result of searching "on random graph" in Google Scholar into the variable doc. It has the same effect as opening Google Scholar, searching for "on random graph", and then right-clicking to save the page.
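As a quick sanity check (my own addition, not part of the original project), you can confirm that the page actually came back:
# a small sanity check: make sure we got HTML back (Python 2 print statement)
print len(doc)       # size of the downloaded page in characters
print doc[:500]      # the beginning of the raw HTML source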
Step Three: Parse the web page.
The steps above get the information on the web page, but it still includes the HTML tags; you have to remove the tags and sort the useful information out of the HTML text.
So you need to parse the page.
Ways to parse a web page:
(1) Regular expressions. Regular expressions are very useful, and being familiar with them saves a lot of time. Sometimes, for cleaning data, you don't need to write a script or query a database at all; a few regular expressions in Notepad++ will do the job. For learning regular expressions, I suggest the "Regular Expressions 30-Minute Introductory Tutorial", link: http://deerchao.net/tutorials/regex/regex.htm
(2) The BeautifulSoup module. BeautifulSoup is a powerful module that parses an HTML file into an object, namely a tree. We all know an HTML file is tree-like, for example body -> table -> tbody -> tr: the tbody node has a number of tr child nodes. BeautifulSoup can conveniently fetch a particular node, and it can also get the sibling nodes of a given node. There are plenty of instructions about it on the web, so I won't go into detail here, just demo some simple code:
(3) The above two methods are used in combination.
# import the BeautifulSoup module and the re module; re is Python's regular expression module
import BeautifulSoup
import re

# Generate a soup object; doc is the page text obtained in Step Two
soup = BeautifulSoup.BeautifulSoup(doc)

# Grab the paper title, author, brief description, citation count, version count,
# and the hyperlink to the list of articles citing it.
# Some regular expressions are also used here; if you are unfamiliar with them, ignore
# them for now. As for where 'class': 'gs_rt' comes from: that is found by inspecting
# the HTML file with the naked eye. The Firebug plugin mentioned above makes this easy:
# point at an element on the page and you can see the position and attributes of the
# corresponding HTML tag. Quite handy.
paper_name = soup.html.body.find('h3', {'class': 'gs_rt'}).text
paper_name = re.sub(r'\[.*\]', '', paper_name)  # eliminate '[]' tags like '[PDF]'
paper_author = soup.html.body.find('div', {'class': 'gs_a'}).text
paper_desc = soup.html.body.find('div', {'class': 'gs_rs'}).text
temp_str = soup.html.body.find('div', {'class': 'gs_fl'}).text
temp_re = re.match(r'[A-Za-z\s]+(\d*)[A-Za-z\s]+(\d*)', temp_str)
citetimes = temp_re.group(1)
versionnum = temp_re.group(2)
if citetimes == '':
    citetimes = '0'
if versionnum == '':
    versionnum = '0'
citedpaper_href = soup.html.body.find('div', {'class': 'gs_fl'}).a.attrs[0][1]
The code above is from a project of mine analyzing a citation network. Incidentally, while grabbing paper information and citation lists from Google Scholar, I was blocked by Google after roughly 1,900 accesses, which left IP addresses in the area temporarily unable to reach Google Scholar.
Step Four: Store the data. Now that the data has finally been grabbed, it is only held in memory; it must be saved before it can be used.
(1) The simplest way is to write the data into a txt file, which can be done in Python with the following code:
# Open the file webdata.txt and create the file object file. The file does not have to
# exist beforehand; the parameter 'a' means append to the end.
# There are other parameters, e.g. 'r' is read-only, 'w' writes but erases the original content, and so on.
file = open('webdata.txt', 'a')
line = paper_name + '#' + paper_author + '#' + paper_desc + '#' + citetimes + '\n'
# The write method of the file object writes the string line to the file
file.write(line)
# Once again, be a good citizen and close the file
file.close()

That stores the data scraped and parsed from the website locally. Isn't that easy?
(2) Of course, instead of writing a txt file you can connect directly to a database. Python's MySQLdb module lets you interact with a MySQL database and put the data straight into it; the logic of establishing a connection with the MySQL database is similar to the logic of establishing a connection with a web server. If you have studied databases before, learning to use the MySQLdb module to interact with a database is simple; if not, you will have to learn the basics from the Introduction to Databases course offered on Coursera / Stanford OpenEdX, and use w3school as a reference or handbook.
The prerequisite for Python to connect to the database is that the database is running. I use Win7 + MySQL 5.5, with the database on the local machine.
You can start the database from cmd. The start command is:
net start mysql55
The stop command is:
net stop mysql55
Example code using the MySQLdb module:
# import the MySQLdb module
import MySQLdb

# Establish a connection with the server. host is the server IP; my MySQL database is on
# the local machine, so it is the default 127.0.0.1. Fill in the user, password, and
# database name accordingly. The default port is 3306; charset is the encoding,
# by default utf8 (it may also be gbk, depending on the installed version).
conn = MySQLdb.connect(host='127.0.0.1', user='root', passwd='yourPassword', db='dbname', port=3306, charset='utf8')

# Create a cursor
cur = conn.cursor()

# Execute an SQL statement through the execute() method of the cursor object cur
cur.execute("select * from citerelation where papername = 'On Random Graph'")

# The fetchall() method gets the query result and returns a list, which you can index
# directly as list[i][j], where i means the (i+1)-th record in the result and
# j means the (j+1)-th attribute of that record (don't forget that Python counts from 0)
list = cur.fetchall()

# You can also run delete, drop, insert, update and other statements, for example:
sql = "update studentCourseRecord set fail = 1 where studentID = '%s' and semesterID = '%s' and courseID = '%s'" % (studentID, course[0], course[1])
cur.execute(sql)

# Unlike queries, after delete, insert, or update statements you must execute the
# following command for the database to actually be updated
conn.commit()

# As always, close the cursor and close the connection when you are done
cur.close()
conn.close()
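As a small usage sketch (my own addition; the column meanings are assumed, not taken from the original project), the rows returned by fetchall() can be iterated like any list of tuples:
# iterate over the query result; each row is a tuple of column values
# (which column means what is assumed here, purely for illustration)
for row in list:
    print row[0], row[1]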
This is how Python interacts with the database. Besides MySQL, Python's PyGreSQL module supports the PostgreSQL database in a similar way. Also, if your pages contain Chinese, setting the encoding can be very troublesome: the web server, Python, the database, and the database interface all need to use the same encoding to avoid garbled text. If you really do run into garbled Chinese, believe me, you are not alone; Google it, and you will find thousands of people have hit this problem.
On encoding questions, here is a summary blog post I have read, "Python encoding issues":
http://www.xprogrammer.com/1258.html
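As a minimal sketch of the usual pattern (my own addition; the encoding name is an assumption, check what the page actually uses): decode the raw page bytes into Unicode first, then re-encode consistently before storing, so the file or database side always sees one encoding.
# -*- coding: utf-8 -*-
# decode the page bytes into Unicode, then re-encode as utf-8 before storage
page_encoding = 'utf-8'            # or 'gbk', depending on the page (an assumption)
text = doc.decode(page_encoding)   # byte string from Step Two -> Unicode
utf8_text = text.encode('utf-8')   # Unicode -> utf-8 bytes for storage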
Postscript: The above describes how to scrape web data. Scraping the data is only a small step; how to analyze the data is the real subject, and discussion is welcome.
If anything above is unclear, I welcome exchanges.
Special note: Large-scale crawling puts a lot of pressure on a site's servers, so try to choose a time when the server is relatively idle (such as the early morning). There are plenty of websites; don't use 1point3acres for your experiments. The sleep() method of Python's time module lets the program pause for a while: for example, time.sleep(1) pauses for 1 second when the program reaches that point. Pausing at appropriate times eases the pressure on the server and also protects your own hard drive; just sleep long enough, go to the gym, and the results will be out when you come back.
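A minimal sketch of what that pause looks like in practice (my own example; the list of queries and the fetch details are placeholders):
# pause between requests so the server is not hammered; the queries are illustrative
import time
import urllib2

queries = ['on+random+graph', 'random+graph+theory']
for q in queries:
    url = 'http://scholar.google.com/scholar?hl=en&q=' + q
    req = urllib2.Request(url, headers=header)   # header as defined in Step Two
    doc = urllib2.urlopen(req).read()
    # ... parse and store doc here ...
    time.sleep(1)   # wait 1 second before the next request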
Updates:
February 15, 2014: fixed several typos, added links to the related courses, added the Udacity CS101 introduction, added the MySQLdb module introduction.
February 16, 2014: added a link to the blog post on encoding.