How to crawl Twitter data (i)

Source: Internet
Author: User

Scraping Tweets Directly from Twitter's Search Page - Part 1

Published January 8,

Edit: since I wrote this post, Twitter has updated how to get the next list of tweets for your result. Rather than using scroll_cursor, it uses max_position. I've written about this in a bit more detail.

In fairly recent news, Twitter has started indexing its entire history of Tweets, going all the way back to 2006. Hurrah for data scientists! However, even with this news (at time of writing), their search API is still restricted to the past seven days of Tweets. While I doubt this will be the case permanently, as a useful exercise this post presents how we can search for Tweets from Twitter without necessarily using their API. Besides the indexing, there is also the advantage that Twitter is a little more liberal with rate limits, and you don't require any authentication keys.

The post will be split up into two parts: this first part looks at what we can extract from Twitter and how we might start to go about it, and the second is a tutorial on how we can implement this in Java.

Right, to begin, let's say we want to search Twitter for all tweets related to the query "Babylon 5". You can access Twitter's advanced search without being logged in: https://twitter.com/search-advanced

If we take a look at the URL that's constructed when we perform the search, we get:

https://twitter.com/search?q=Babylon%205&src=typd

As we can see, there are two parameters: q (our query, encoded) and src (assumed to be the source of the query, i.e. typed). However, by default, Twitter returns top results rather than all results, so on the displayed page, if you click on "All", the URL changes to:

https://twitter.com/search?f=realtime&q=Babylon%205&src=typd

The difference is the f=realtime parameter, which appears to specify that we receive Tweets in realtime as opposed to a subset of top Tweets. Useful to know, but currently we're only getting the first batch of Tweets back. If we scroll down though, we notice that more Tweets are loaded on the page via AJAX. Logging all XMLHttpRequests in whatever dev tool you choose to use, we can see that every time we reach the bottom of the page, Twitter makes an AJAX call to a URL similar to:

Https://twitter.com/i/search/timeline?f=realtime&q=Babylon%205&src=typd&include_available_features =1&include_entities=1&last_note_ts=85&scroll_cursor= Tweet-553069642609344512-553159310448918528-bd1uo2ffu9qaaaaaaaaetaaaaacaaaasaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa Aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa Aaaaaaaaaaaaaaaaaaaaaaaaaa

On further inspection, we see that it's also a JSON response, which is very useful! Before we look at the response though, let's have a look at that URL and some of its parameters.

First off, it's slightly different to the default search URL. The path is /i/search/timeline as opposed to /search. Secondly, while we notice our familiar parameters q, f, and src from before, there are several additional ones. The most important and essential new one though is scroll_cursor. This is what Twitter uses to paginate the results. If you remove scroll_cursor from the URL, you end up with your first page of results again.
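As a rough illustration of how such a URL could be put together, here is a minimal Python sketch. The function name build_timeline_url is made up for this example, and it only sets the parameters discussed above (q, f, src and, optionally, scroll_cursor); whether Twitter needs the other parameters is not something this post establishes.

```python
from urllib.parse import urlencode, quote

BASE = "https://twitter.com"


def build_timeline_url(query, scroll_cursor=None):
    """Build a timeline URL from the parameters named in the post.

    build_timeline_url is a hypothetical helper, not part of any Twitter library.
    """
    params = {"f": "realtime", "q": query, "src": "typd"}
    if scroll_cursor is not None:
        # Without scroll_cursor, the post says you get the first page of results.
        params["scroll_cursor"] = scroll_cursor
    # quote_via=quote encodes spaces as %20, matching the URLs shown in the post.
    return BASE + "/i/search/timeline?" + urlencode(params, quote_via=quote)


print(build_timeline_url("Babylon 5"))
```

Leaving scroll_cursor out gives the first-page URL; passing the value from a previous response gives the URL for the following page.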

Now let's take a look at the JSON response that Twitter provides:

{
  has_more_items: boolean,
  items_html: "...",
  is_scrolling_request: boolean,
  is_refresh_request: boolean,
  scroll_cursor: "...",
  refresh_cursor: "...",
  focused_refresh_interval: int
}

Again, not all of these fields are important for this post, but the ones that are include: has_more_items, items_html, and scroll_cursor.

has_more_items: a Boolean value that lets you know whether or not there are any further results after this query.

items_html: this is where all the tweets are, in the HTML that Twitter appends to the bottom of its timeline. It requires parsing, but there is a good amount of information in there to be extracted, which we'll look at in a minute.

scroll_cursor: a pagination value that allows us to extract the next page of results.
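To make the three fields concrete, here is a small Python sketch that decodes a response body and keeps only those keys. The sample body is invented for illustration and trimmed to the fields of interest; parse_timeline_response is a hypothetical helper name.

```python
import json


def parse_timeline_response(body):
    """Pull out the three fields this post cares about from a response body."""
    data = json.loads(body)
    return {
        "has_more_items": bool(data.get("has_more_items", False)),
        "items_html": data.get("items_html", ""),
        "scroll_cursor": data.get("scroll_cursor"),
    }


# Hypothetical response body, not captured from Twitter.
sample = '{"has_more_items": true, "items_html": "<li>...</li>", "scroll_cursor": "TWEET-1-2"}'
page = parse_timeline_response(sample)
print(page["has_more_items"], page["scroll_cursor"])
```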

Remember our scroll_cursor parameter from earlier on? Well, for each search request you make to Twitter, the value of this key in the response provides you with the next set of tweets, allowing you to recursively call Twitter until either has_more_items is false, or your previous scroll_cursor equals the last scroll_cursor you had.
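The stopping rule just described can be sketched in Python as below. This is only an illustration under a few assumptions: it uses a loop rather than recursion, fetch_page is a caller-supplied function (hypothetical here) that requests one page for a given cursor and returns the decoded JSON as a dict, and max_pages is a safety cap added for the example. A dictionary of fake pages stands in for real HTTP calls.

```python
def crawl(fetch_page, max_pages=100):
    """Follow scroll_cursor values until there are no more items.

    fetch_page(cursor) is a hypothetical, caller-supplied function that
    returns one decoded JSON response as a dict.
    """
    pages = []
    cursor = None  # no cursor requests the first page
    for _ in range(max_pages):  # safety cap, an addition for this sketch
        response = fetch_page(cursor)
        pages.append(response.get("items_html", ""))
        next_cursor = response.get("scroll_cursor")
        # Stop when nothing further is available, or the cursor stops advancing.
        if not response.get("has_more_items") or next_cursor == cursor:
            break
        cursor = next_cursor
    return pages


# Stand-in for real requests: three fake pages keyed by the cursor that fetches them.
FAKE = {
    None: {"has_more_items": True, "items_html": "page1", "scroll_cursor": "c1"},
    "c1": {"has_more_items": True, "items_html": "page2", "scroll_cursor": "c2"},
    "c2": {"has_more_items": False, "items_html": "page3", "scroll_cursor": "c2"},
}
print(crawl(FAKE.get))
```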

Now that we know how to access Twitter's own search functionality, let's turn our attention to the tweets themselves. As mentioned before, items_html in the response is where all the tweets are. However, it comes as a block of HTML, since Twitter injects this block at the bottom of the page each time the call is made. The HTML inside is a list of li elements, each element being a tweet. I won't post the HTML for one here, as even one tweet has a lot of HTML in it, but if you want to look at it, copy the items_html value (omitting the quotes around the HTML content) and paste it into something like jsbeautifier to see the formatted results for yourself.

If we look through the HTML, aside from the tweet's text, there is actually a lot of useful information encapsulated in this data packet. The most important item is the Tweet ID itself. If you check, it's actually in the root li element. Now, we could stop here, as with this ID you can query Twitter's official API, and if it's a public Tweet, you can get all kinds of information. However, that would defeat the purpose of not using the API, so let's see what we can extract from what we already have.

The table below shows various CSS selector queries that you can use to extract the information.

Embedded Tweet Data

Selector | Value
div.original-tweet[data-tweet-id] | The author's Twitter handle
div.original-tweet[data-name] | The name of the author
div.original-tweet[data-user-id] | The user ID of the author
span._timestamp[data-time] | Timestamp of the post
span._timestamp[data-time-ms] | Timestamp of the post in ms
p.tweet-text | Text of the Tweet
span.ProfileTweet-action--retweet > span.ProfileTweet-actionCount[data-tweet-stat-count] | Number of retweets
span.ProfileTweet-action--favorite > span.ProfileTweet-actionCount[data-tweet-stat-count] | Number of favourites

That's quite a sizeable amount of information in that HTML. From looking through, we can extract a bunch of details about the author, the timestamp of the tweet, the text, and the number of retweets and favourites.
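As a taste of what extraction might look like, here is a small Python sketch that reads a few of the attributes from the table. It uses the standard library's html.parser rather than a CSS selector engine, so it matches on tag and class names by hand. The SAMPLE markup is invented for illustration and is far simpler than a real items_html value; TweetParser is a hypothetical class name.

```python
from html.parser import HTMLParser


class TweetParser(HTMLParser):
    """Collect a few fields from items_html using attributes from the table."""

    def __init__(self):
        super().__init__()
        self.tweets = []
        self._in_text = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "original-tweet" in classes:
            # div.original-tweet starts a new tweet record.
            self.tweets.append({
                "name": attrs.get("data-name"),
                "user_id": attrs.get("data-user-id"),
                "time": None,
                "text": "",
            })
        elif not self.tweets:
            return  # ignore markup that appears before the first tweet
        elif tag == "span" and "_timestamp" in classes:
            self.tweets[-1]["time"] = attrs.get("data-time")
        elif tag == "p" and "tweet-text" in classes:
            self._in_text = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_text = False

    def handle_data(self, data):
        if self._in_text:
            self.tweets[-1]["text"] += data


# Invented, heavily simplified markup; real tweets carry far more HTML.
SAMPLE = (
    '<li><div class="tweet original-tweet" data-name="Jane Doe" data-user-id="42">'
    '<span class="_timestamp" data-time="1420700000"></span>'
    '<p class="tweet-text">Rewatching <b>Babylon 5</b> tonight</p>'
    '</div></li>'
)
parser = TweetParser()
parser.feed(SAMPLE)
parser.close()
print(parser.tweets)
```

The retweet and favourite counts could be picked up the same way, by tracking when the parser is inside the matching action span.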

What have we learned here? Well, to summarize, we know how to construct a Twitter URL query, the response we get from said query, and the information we can extract from said response. The second part of this tutorial (to follow shortly) will introduce some code showing how we can implement the above.
