Mining Twitter Data with Python

Source: Internet
Author: User

Directory
  • 1.Collecting data
    • 1.1 Register Your APP
    • 1.2 Accessing the Data
    • 1.3 Streaming
  • 2.Text pre-processing
    • 2.1 The Anatomy of a Tweet
    • 2.2 How to tokenise a Tweet Text
  • 3.Term Frequencies
    • 3.1 Counting Terms
    • 3.2 Removing Stop-words
    • 3.3 More term filters
  • 4.Rugby and Term co-occurrences
    • 4.1 The Application Domain
    • 4.2 Setting up
    • 4.3 Interesting terms and hashtags
    • 4.4 Term Co-occurrences
  • 5.Data Visualisation Basics
    • 5.1 From Python to JavaScript with Vincent
    • 5.2 Time Series Visualisation
  • 6.Sentiment Analysis Basics
    • 6.1 A Simple approach for sentiment analysis
    • 6.2 Computing Term probabilities
    • 6.3 Computing The Semantic Orientation
    • 6.4 Some Limitations
  • 7.Geolocation and Interactive Maps
    • 7.1 GeoJSON
    • 7.2 From Tweets to GeoJSON
    • 7.3 Interactive Maps with Leaflet.js

Twitter is a popular social network where users can share short SMS-like messages called tweets. Users share thoughts, links and pictures on Twitter, journalists comment on live events, companies promote products and engage with customers. The list of different ways to use Twitter could be really long, and with millions of tweets per day, there's a lot of data to analyse and to play with.

This is the first in a series of articles dedicated to mining data on Twitter using Python. In this first part, we'll see the different options to collect data from Twitter. Once we have built a data set, in the next episodes we'll discuss some interesting data applications.

1. Collecting Data

1.1 Register Your App

In order to have access to Twitter data programmatically, we need to create an app that interacts with the Twitter API.

The first step is the registration of your app. In particular, you need to point your browser to http://apps.twitter.com , log in to Twitter (if you're not already logged in) and register a new application. You can now choose a name and a description for your app (for example "Mining Demo" or similar). You'll receive a consumer key and a consumer secret: these are application settings that should always be kept private. From the configuration page of your app, you can also request an access token and an access token secret. Similarly to the consumer keys, these strings must also be kept private: they provide the application access to Twitter on behalf of your account. The default permissions are read-only, which is all we need in our case, but if you decide to change your permissions to provide writing features in your app, you must negotiate a new access token.
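Since all four strings must stay private, one option is to keep them out of the source code altogether. The following is a minimal sketch, assuming the credentials have been exported as environment variables. The variable names and the helper load_credentials are invented for this illustration; neither Twitter nor Tweepy prescribes them.

```python
import os

# Hypothetical environment variable names, chosen for this example only.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(environ=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    environ = os.environ if environ is None else environ
    # Treat unset and empty variables the same way: both are unusable.
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Failing early with the list of missing names tends to be easier to debug than an authentication error later on. The returned values can be used wherever the article shows hard-coded placeholder strings.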

Important Note: there are rate limits on the use of the Twitter API, as well as limitations in case you want to provide a downloadable data-set, see:

https://dev.twitter.com/overview/terms/agreement-and-policy
https://dev.twitter.com/rest/public/rate-limiting
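To make the idea of a rate limit concrete, here is a rough sketch of how a client could decide whether to pause. It assumes the response headers are available as a plain dictionary and that they carry x-rate-limit-remaining and x-rate-limit-reset (the latter as a Unix timestamp). The function name seconds_to_wait is made up for this example, and missing headers are treated as "no limit reached", which is a choice made here rather than documented behaviour.

```python
import time


def seconds_to_wait(headers, now=None):
    """Return how many seconds to pause before the next request.

    Returns 0.0 while calls remain in the current window, otherwise the
    time left until the window resets (never negative).
    """
    now = time.time() if now is None else now
    # Assumption: absent headers mean we have not hit a limit.
    remaining = int(headers.get("x-rate-limit-remaining", 1))
    reset_at = float(headers.get("x-rate-limit-reset", 0))
    if remaining > 0:
        return 0.0
    return max(0.0, reset_at - now)
```

A caller would pass the result to time.sleep() before retrying. Tweepy can also handle the waiting itself if the API object is created with wait_on_rate_limit=True, which is usually simpler than doing it by hand.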

1.2 Accessing the Data

Twitter provides REST APIs you can use to interact with their service. There is also a bunch of Python-based clients out there that we can use without re-inventing the wheel. In particular, Tweepy is one of the most interesting and straightforward to use, so let's install it:

pip install tweepy==3.5.0

In order to authorise our app to access Twitter on our behalf, we need to use the OAuth interface:

import tweepy
from tweepy import OAuthHandler

consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

The api variable is now our entry point for most of the operations we can perform with Twitter.

For example, we can read our own timeline (i.e. our Twitter homepage) with:

for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    print(status.text)

Tweepy provides the convenient Cursor interface to iterate through different types of objects. In the example above we're using 10 to limit the number of tweets we're reading, but we can of course access more. The status variable is an instance of the Status() class, a nice wrapper to access the data. The JSON response from the Twitter API is available in the attribute _json (with a leading underscore), which is not the raw JSON string, but a dictionary.
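Because _json is an ordinary dictionary, individual fields can be read with normal key access. The sketch below uses a small hand-written dictionary as a stand-in for status._json, so the values are invented and a real tweet carries many more fields. The helper name summarise is also just for illustration.

```python
import json

# Hand-written stand-in for status._json; the values are made up.
sample = {
    "created_at": "Mon Mar 02 10:15:00 +0000 2015",
    "text": "Learning #python with Tweepy",
    "user": {"screen_name": "example_user"},
    "entities": {"hashtags": [{"text": "python"}]},
}


def summarise(tweet):
    """Pick a few fields out of a tweet dictionary."""
    # Hashtags live under entities; fall back to an empty list if absent.
    hashtags = [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])]
    return {
        "author": tweet["user"]["screen_name"],
        "text": tweet["text"],
        "hashtags": hashtags,
    }


summary = summarise(sample)

# json.dumps() produces the string form; json.loads() turns it back
# into an equal dictionary.
raw = json.dumps(sample)
restored = json.loads(raw)
```

The last two lines show the difference the text mentions: raw is a string that can be written to disk, while sample and restored are dictionaries that can be queried directly.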

    • So the home timeline loop above can be re-written to process/store the JSON:
for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    process_or_store(status._json)
    • What if we want a list of all the accounts we follow (our "friends", in Twitter's terms)? There you go:
for friend in tweepy.Cursor(api.friends).items():
    process_or_store(friend._json)
    • And how about a list of all our tweets? Simple:
for tweet in tweepy.Cursor(api.user_timeline).items():
    process_or_store(tweet._json)

In this way we can easily collect tweets (and more) and store them in the original JSON format, which is fairly easy to convert into different data models depending on our storage (many NoSQL technologies provide some bulk import feature).

The function process_or_store() is a place-holder for your custom implementation. In the simplest form, you could just print out the JSON, one tweet per line:

import json

def process_or_store(tweet):
    print(json.dumps(tweet))
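As a variation on the print-based version, the same place-holder could append each tweet to a file instead. This is only a sketch under a few assumptions: Python 3, a default file name (tweets.jsonl) invented for the example, and a one-JSON-document-per-line layout.

```python
import json


def process_or_store(tweet, path="tweets.jsonl"):
    """Append one tweet dictionary to `path` as a single line of JSON."""
    # json.dumps() escapes newlines inside strings, so every tweet
    # occupies exactly one line in the output file.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(tweet) + "\n")
```

Opening the file in append mode means repeated runs keep adding to the same data set. The one-line-per-tweet layout is the kind of input that bulk import tools typically accept.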
1.3 Streaming

In case we want to "keep the connection open" and gather all the upcoming tweets about a particular event, the streaming API is what we need. We need to extend the StreamListener() class to customise the way we process the incoming data. Here is a working example that gathers all the new tweets with the #python hashtag:

from tweepy import Stream
from tweepy.streaming import StreamListener

class MyListener(StreamListener):

    def on_data(self, data):
        try:
            with open('python.json', 'a') as f:
                f.write(data)
                return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    def on_error(self, status):
        print(status)
        return True

twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['#python'])

Depending on the search term, we can gather tons of tweets within a few minutes. This is especially true for live events with world-wide coverage (World Cups, Super Bowls, Academy Awards, you name it), so keep an eye on the JSON file to understand how fast it grows and consider how many tweets you might need for your tests. The above script will save each tweet on a new line, so you can use the command wc -l python.json from a Unix shell to know how many tweets you've gathered.
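The same count can be obtained from Python, which also makes it possible to ignore lines that are not complete tweets. The sketch below assumes one JSON document per line, as described above. Skipping blank or unparseable lines is a defensive choice made for this example, and the function name count_tweets is invented.

```python
import json


def count_tweets(path):
    """Count the lines in `path` that parse as JSON objects."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except ValueError:
                continue  # skip a partially written or corrupt line
            if isinstance(record, dict):
                count += 1
    return count
```

Calling count_tweets('python.json') while the stream is running gives a rough idea of how fast the file grows. The result can be lower than the wc -l figure if the file contains blank or truncated lines.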

You can see a minimal working example of the Twitter Stream APIs in the following Gist:

# config.py
consumer_key = 'your-consumer-key'
consumer_secret = 'your-consumer-secret'
access_token = 'your-access-token'
access_secret = 'your-access-secret'

# twitter_stream_download.py
# To run this code, first edit config.py with your configuration, then:
#
# mkdir data
# python twitter_stream_download.py -q apple -d data
#
# It will produce the list of tweets for the query "apple"
# in the file data/stream_apple.json

import tweepy
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time
import argparse
import string
import config
import json

def get_parser():
    """Get parser for command line arguments."""
    parser = argparse.ArgumentParser(description="Twitter Downloader")
    parser.add_argument("-q",
                        "--query",
                        dest="query",
                        help="Query/Filter",
                        default='-')
    parser.add_argument("-d",
                        "--data-dir",
                        dest="data_dir",
                        help="Output/Data Directory")
    return parser


class MyListener(StreamListener):
    """Custom StreamListener for streaming data."""

    def __init__(self, data_dir, query):
        query_fname = format_filename(query)
        self.outfile = "%s/stream_%s.json" % (data_dir, query_fname)

    def on_data(self, data):
        try:
            with open(self.outfile, 'a') as f:
                f.write(data)
                print(data)
                return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
            time.sleep(5)
        return True

    def on_error(self, status):
        print(status)
        return True


def format_filename(fname):
    """Convert file name into a safe string.

    Arguments:
        fname -- the file name to convert
    Return:
        String -- converted file name
    """
    return ''.join(convert_valid(one_char) for one_char in fname)


def convert_valid(one_char):
    """Convert a character into '_' if invalid.

    Arguments:
        one_char -- the char to convert
    Return:
        Character -- converted char
    """
    valid_chars = "-_.%s%s" % (string.ascii_letters, string.digits)
    if one_char in valid_chars:
        return one_char
    else:
        return '_'

@classmethod
def parse(cls, api, raw):
    status = cls.first_parse(api, raw)
    setattr(status, 'json', json.dumps(raw))
    return status

if __name__ == '__main__':
    parser = get_parser()
    args = parser.parse_args()
    auth = OAuthHandler(config.consumer_key, config.consumer_secret)
    auth.set_access_token(config.access_token, config.access_secret)
    api = tweepy.API(auth)

    twitter_stream = Stream(auth, MyListener(args.data_dir, args.query))
    twitter_stream.filter(track=[args.query])
