Python for Informatics, Chapter 14: Databases and SQL, Part 4 (translated)

Source: Internet
Author: User

14.6 Crawling Twitter with a database

In this section, we will create a simple crawler program that goes through Twitter accounts and builds a database of them. Note: Be very careful when running this program. If you pull too much data or run the program for too long, your Twitter access may be shut off.

One problem with any crawler is that it needs to be stopped and restarted many times, and you do not want to lose the data you have retrieved so far. You also do not want to restart retrieval from the very beginning each time, so we store the data as we retrieve it. That way the program can start back up and pick up where it left off.

We start by retrieving one person's Twitter friends and their statuses, looping through the list of friends, and adding each friend to a database to be retrieved later. After we have processed one person's friends, we look in the database and retrieve the friends of one of those friends. We do this over and over, picking an unvisited person, retrieving their friend list, and adding friends we have not seen before to our list for a future visit.

We also track how many times we have seen a particular friend in the database, which gives a rough sense of their popularity.

By storing the list of known accounts, whether each account has been retrieved, and how popular each account is in a database on the computer's disk, we can stop and restart the program as many times as we like.
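
As a minimal sketch of why on-disk storage makes restarts safe, the following example writes one row, closes the connection, and reopens the file. The temporary path and the file name demo_spider.sqlite3 are assumptions made only for this illustration; the three-column layout matches the table used by the crawler in this section.

```python
import os
import sqlite3
import tempfile

# Hypothetical database path used only for this illustration.
db_path = os.path.join(tempfile.mkdtemp(), 'demo_spider.sqlite3')

# First "run": create the table, record one account, commit, and stop.
conn = sqlite3.connect(db_path)
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS Twitter
               (name TEXT, retrieved INTEGER, friends INTEGER)''')
cur.execute('INSERT INTO Twitter (name, retrieved, friends) VALUES (?, 0, 1)',
            ('drchuck',))
conn.commit()
conn.close()

# Second "run": reopen the same file; the committed row is still present.
conn = sqlite3.connect(db_path)
rows = conn.execute('SELECT name, retrieved, friends FROM Twitter').fetchall()
conn.close()
print(rows)
```

Only committed changes survive a restart, which is why the crawler calls commit() after processing each account.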

This program is a bit complex. It is based on the code from the earlier exercise in this book that uses the Twitter API.

The source code for our Twitter crawler application is as follows:

import urllib.request
import twurl
import json
import sqlite3

TWITTER_URL = 'https://api.twitter.com/1.1/friends/list.json'

conn = sqlite3.connect('Spider.sqlite3')
cur = conn.cursor()

cur.execute('''
CREATE TABLE IF NOT EXISTS Twitter
(name TEXT, retrieved INTEGER, friends INTEGER)''')

while True:
    acct = input('Enter a Twitter account, or quit: ')
    if acct == 'quit':
        break
    if len(acct) < 1:
        # No account typed: pick one stored account we have not visited yet
        cur.execute('SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1')
        try:
            acct = cur.fetchone()[0]
        except:
            print('No unretrieved Twitter accounts found')
            continue
    url = twurl.augment(TWITTER_URL,
                        {'screen_name': acct, 'count': '20'})
    print('Retrieving', url)
    connection = urllib.request.urlopen(url)
    data = connection.read()
    headers = dict(connection.getheaders())  # response headers (not used below)
    js = json.loads(data)

    # Mark this account as visited
    cur.execute('UPDATE Twitter SET retrieved=1 WHERE name = ?', (acct,))

    countnew = 0
    countold = 0
    for u in js['users']:
        friend = u['screen_name']
        print(friend)
        cur.execute('SELECT friends FROM Twitter WHERE name = ? LIMIT 1',
                    (friend,))
        try:
            # Known friend: add 1 to its count
            count = cur.fetchone()[0]
            cur.execute('UPDATE Twitter SET friends = ? WHERE name = ?',
                        (count+1, friend))
            countold = countold + 1
        except:
            # New friend: insert it as not yet retrieved, seen once
            cur.execute('''INSERT INTO Twitter (name, retrieved, friends)
                VALUES (?, 0, 1)''', (friend,))
            countnew = countnew + 1
    print('New accounts=', countnew, 'revisited=', countold)
    conn.commit()

cur.close()

Our database is stored in the file Spider.sqlite3. It has one table named Twitter, and each row in the table has three columns: the account name (name), whether we have retrieved the friends of this account (retrieved), and how many times this account has been seen as a friend (friends).
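
The table layout can be checked directly. The sketch below is an illustration that uses an in-memory database rather than Spider.sqlite3, so it leaves no file behind; it asks SQLite for the column names and declared types with PRAGMA table_info.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS Twitter
               (name TEXT, retrieved INTEGER, friends INTEGER)''')

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in cur.execute('PRAGMA table_info(Twitter)')]
print(columns)
conn.close()
```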

In the main loop of the program, we prompt the user for a Twitter account name or "quit" to exit the program. If the user enters a Twitter account, we retrieve that user's list of friends and their statuses, and add each friend to the database if it is not already there. If the friend is already in the database, we add 1 to the friends field of its row.

If the user presses Enter without typing anything, we look in the database for the next Twitter account that we have not yet retrieved and retrieve the friends and statuses of that account. We then add those friends to the database or, if they are already there, increase their friends count.

Once we have the list of friends and statuses, we loop through all of the user items in the returned JSON and read the screen_name of each user. Then we use a SELECT statement to see whether we have already stored this screen_name in the database and, if the record exists, retrieve its friend count (friends).

countnew = 0
countold = 0
for u in js['users']:
    friend = u['screen_name']
    print(friend)
    cur.execute('SELECT friends FROM Twitter WHERE name = ? LIMIT 1',
                (friend,))
    try:
        count = cur.fetchone()[0]
        cur.execute('UPDATE Twitter SET friends = ? WHERE name = ?',
                    (count+1, friend))
        countold = countold + 1
    except:
        cur.execute('''INSERT INTO Twitter (name, retrieved, friends)
            VALUES (?, 0, 1)''', (friend,))
        countnew = countnew + 1
print('New accounts=', countnew, 'revisited=', countold)
conn.commit()


Once the cursor has executed the SELECT statement, we must retrieve the rows. We could do this with a for statement, but because we limit the result to one row (LIMIT 1), we can use the fetchone() method to fetch the first (and only) row returned by the query. Because fetchone() returns the row as a tuple (even though there is only one field), we use the index [0] to take the first value from the tuple and store the current friend count in the variable count.
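
The behaviour of fetchone() can be tried in isolation. The following is a small sketch against an in-memory database; the account name nobody is a made-up value that is deliberately absent from the table.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Twitter (name TEXT, retrieved INTEGER, friends INTEGER)')
cur.execute("INSERT INTO Twitter VALUES ('opencontent', 0, 1)")

cur.execute('SELECT friends FROM Twitter WHERE name = ? LIMIT 1', ('opencontent',))
found = cur.fetchone()      # a one-element tuple holding the friends value
cur.execute('SELECT friends FROM Twitter WHERE name = ? LIMIT 1', ('nobody',))
missing = cur.fetchone()    # None, because no row matched

print(found, found[0], missing)
conn.close()
```

When no row matches, fetchone() returns None, and indexing None with [0] raises a TypeError. That is the failure the crawler's try/except relies on.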

If this retrieval succeeds, we use the SQL UPDATE statement with a WHERE clause to add 1 to the friends column of the row that matches the friend's account. Note that there are two placeholders (question marks) in the SQL, and the second parameter to execute() is a two-element tuple holding the values that are substituted for the question marks.
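
As a short illustration of placeholder binding, the sketch below runs one statement with a single placeholder and one with two, using an in-memory database. The values are bound in order, left to right.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Twitter (name TEXT, retrieved INTEGER, friends INTEGER)')
# One placeholder, so the parameters are a one-element tuple (note the comma).
cur.execute('INSERT INTO Twitter (name, retrieved, friends) VALUES (?, 0, 1)',
            ('cnxorg',))

count = 1
friend = 'cnxorg'
# Two placeholders, filled left to right from a two-element tuple.
cur.execute('UPDATE Twitter SET friends = ? WHERE name = ?', (count + 1, friend))

updated = cur.execute('SELECT name, friends FROM Twitter').fetchall()
print(updated)
conn.close()
```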

If the code in the try block fails, it is probably because no record matched the WHERE name = ? clause of the SELECT statement. So in the except block we use the SQL INSERT statement to add the friend's screen_name to the table, mark it as not yet retrieved (retrieved = 0), and set its friend count to 1.
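
The same update-or-insert decision can also be written with an explicit check for None instead of try/except. This is an alternative sketch, not the book's code; the helper name record_friend is hypothetical.

```python
import sqlite3

def record_friend(cur, friend):
    """Hypothetical helper: bump the count of a known friend, or insert a new row.

    Returns True if the friend was new, False if it was already stored.
    """
    cur.execute('SELECT friends FROM Twitter WHERE name = ? LIMIT 1', (friend,))
    row = cur.fetchone()
    if row is None:
        # Not seen before: store it as not yet retrieved, seen once.
        cur.execute('INSERT INTO Twitter (name, retrieved, friends) VALUES (?, 0, 1)',
                    (friend,))
        return True
    cur.execute('UPDATE Twitter SET friends = ? WHERE name = ?', (row[0] + 1, friend))
    return False

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Twitter (name TEXT, retrieved INTEGER, friends INTEGER)')
results = [record_friend(cur, name) for name in ['kthanos', 'knoop', 'kthanos']]
rows = cur.execute('SELECT name, friends FROM Twitter ORDER BY name').fetchall()
print(results, rows)
conn.close()
```

The explicit check avoids a bare except, which would also hide unrelated errors such as a misspelled column name.
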
The first time the program runs and we enter a Twitter account, the session looks like this:
Enter a Twitter account, or quit: drchuck
Retrieving https://api.twitter.com/1.1/friends ...
New accounts= 20 revisited= 0
Enter a Twitter account, or quit: quit

Because this is the first time we have run the program, the database is empty, so the program creates the database in the file Spider.sqlite3 and adds a table named Twitter to it. Then we retrieve some friends and add them all to the database, since none of them is there yet.

At this point, we might want to write a simple database dumper to look at what is in our Spider.sqlite3 file:

import sqlite3

conn = sqlite3.connect('Spider.sqlite3')
cur = conn.cursor()
cur.execute('SELECT * FROM Twitter')
count = 0
for row in cur:
    print(row)
    count = count + 1
print(count, 'rows.')
cur.close()

This program opens the database, selects all of the columns of all of the rows in the Twitter table, and then loops through the rows and prints each one. If we run this program after the first execution of the Twitter crawler above, its output looks like this:

('opencontent', 0, 1)
('lhawthorn', 0, 1)
('steve_coppin', 0, 1)
('davidkocher', 0, 1)
('hrheingold', 0, 1)
...
20 rows.

We see one row for each screen_name. We have not yet retrieved the friends of any of these accounts (retrieved is 0), and each account has been seen as a friend once (friends is 1).

Now the database reflects the retrieval of the friends of our first Twitter account (drchuck). We can run the program again and, instead of typing an account name, simply press Enter to make it retrieve the friends of the next unprocessed account. The program then runs as follows:

Enter a Twitter account, or quit:
Retrieving https://api.twitter.com/1.1/friends ...
New accounts= 18 revisited= 2
Enter a Twitter account, or quit:
Retrieving https://api.twitter.com/1.1/friends ...
New accounts= 17 revisited= 3
Enter a Twitter account, or quit: quit

Because we pressed Enter (that is, we did not specify a Twitter account), the program executes the following code:

if len(acct) < 1:
    cur.execute('SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1')
    try:
        acct = cur.fetchone()[0]
    except:
        print('No unretrieved Twitter accounts found')
        continue

We use the SQL SELECT statement to retrieve the name of the first (LIMIT 1) account whose retrieved value is still zero. We also use the fetchone()[0] pattern inside a try/except block, either to extract a screen_name from the result or to print an error message and loop back to the prompt.
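
The selection step can be sketched as a small function that returns either an account name or None. The name next_unretrieved is a hypothetical helper introduced for this illustration, and the sample rows are stored in an in-memory database.

```python
import sqlite3

def next_unretrieved(cur):
    """Hypothetical helper: return one account with retrieved = 0, or None."""
    cur.execute('SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1')
    row = cur.fetchone()
    return None if row is None else row[0]

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Twitter (name TEXT, retrieved INTEGER, friends INTEGER)')
cur.executemany('INSERT INTO Twitter VALUES (?, ?, 1)',
                [('opencontent', 1), ('lhawthorn', 0)])

first = next_unretrieved(cur)          # the only row with retrieved = 0
cur.execute('UPDATE Twitter SET retrieved = 1 WHERE name = ?', (first,))
second = next_unretrieved(cur)         # nothing left to visit
print(first, second)
conn.close()
```

Because the query has no ORDER BY clause, SQLite does not promise which unretrieved row comes back when several qualify; the crawler only needs some unvisited account, so that is acceptable here.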

If we successfully retrieve an unprocessed screen_name, we retrieve that account's data with the following code:

url = twurl.augment(TWITTER_URL, {'screen_name': acct, 'count': '20'})
print('Retrieving', url)
connection = urllib.request.urlopen(url)
data = connection.read()
js = json.loads(data)

cur.execute('UPDATE Twitter SET retrieved=1 WHERE name = ?', (acct,))

Once we have retrieved the data successfully, we use the UPDATE statement to set the retrieved column to 1, indicating that we have finished retrieving the friends of this account. This keeps us from fetching the same data over and over and keeps the program moving forward through the network of Twitter friends. If we run the crawler and press Enter twice to retrieve the friends of the next two unvisited accounts, and then run the program that dumps the table, it gives the following output:
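
The effect of the retrieved flag can be seen without touching the network. The sketch below is a simulation under stated assumptions: FAKE_FRIENDS is a made-up dictionary standing in for the Twitter API, the account names a to d are invented, and visit is a hypothetical helper that mirrors the update-or-insert logic of the crawler.

```python
import sqlite3

# Made-up friend lists standing in for the Twitter API (illustration only).
FAKE_FRIENDS = {
    'a': ['b', 'c'],
    'b': ['c', 'd'],
    'c': ['b'],
    'd': [],
}

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Twitter (name TEXT, retrieved INTEGER, friends INTEGER)')

def visit(acct):
    """Mark acct as retrieved and count each of its (fake) friends."""
    cur.execute('UPDATE Twitter SET retrieved = 1 WHERE name = ?', (acct,))
    for friend in FAKE_FRIENDS.get(acct, []):
        cur.execute('SELECT friends FROM Twitter WHERE name = ?', (friend,))
        row = cur.fetchone()
        if row is None:
            cur.execute('INSERT INTO Twitter VALUES (?, 0, 1)', (friend,))
        else:
            cur.execute('UPDATE Twitter SET friends = ? WHERE name = ?',
                        (row[0] + 1, friend))
    conn.commit()

visited = ['a']
visit('a')                      # seed account, as if typed at the prompt
while True:
    cur.execute('SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1')
    row = cur.fetchone()
    if row is None:
        break                   # every stored account has been visited
    visited.append(row[0])
    visit(row[0])

final_rows = cur.execute('SELECT * FROM Twitter ORDER BY name').fetchall()
print(visited)
print(final_rows)
conn.close()
```

Each stored account is visited exactly once, and the loop ends when no row with retrieved = 0 remains. As in the crawler itself, the seed account only gets a row of its own if some other account lists it as a friend.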

('opencontent', 1, 1)
('lhawthorn', 1, 1)
('steve_coppin', 0, 1)
('davidkocher', 0, 1)
('hrheingold', 0, 1)
...
('cnxorg', 0, 2)
('knoop', 0, 1)
('kthanos', 0, 2)
('lecturetools', 0, 1)
...
55 rows.

We can see that we have correctly recorded that we have visited lhawthorn and opencontent. The accounts cnxorg and kthanos already have two followers each. Because we have now retrieved the friends of three people (drchuck, opencontent, and lhawthorn), our table has 55 rows of friends.
Every time we run the program and press Enter, it picks the next unvisited account (for example, the next account will be steve_coppin), retrieves that account's friends, marks the account as retrieved, and for each friend of steve_coppin either adds the friend to the database or, if the friend is already there, updates its friend count.
Because the program's data is stored in a database on disk, the crawl can be suspended and resumed as many times as you like without losing data.
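
Once some data has been collected, the friends column can be read back as a rough popularity ranking. The query below is a sketch, not part of the book's program; it loads a handful of rows modelled on the dump output into an in-memory database and also counts how many accounts are still waiting to be visited.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE Twitter (name TEXT, retrieved INTEGER, friends INTEGER)')
# A few sample rows modelled on the dump output in this section.
cur.executemany('INSERT INTO Twitter VALUES (?, ?, ?)', [
    ('opencontent', 1, 1), ('lhawthorn', 1, 1), ('steve_coppin', 0, 1),
    ('cnxorg', 0, 2), ('knoop', 0, 1), ('kthanos', 0, 2),
])

# Accounts seen most often as a friend come first; ties are broken by name.
popular = cur.execute('''SELECT name, friends FROM Twitter
                         ORDER BY friends DESC, name LIMIT 2''').fetchall()
# How many stored accounts have not been visited yet.
pending = cur.execute(
    'SELECT COUNT(*) FROM Twitter WHERE retrieved = 0').fetchone()[0]
print(popular, pending)
conn.close()
```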

Note: The original text is Dr. Charles Severance's "Python for Informatics". The translator did not rewrite the code in this section. When it was debugged under Python 2.7 it reported an authentication error, which is probably an OAuth authentication issue.
