Quantifying Hacker News with 50 days of data

Quantifying Hacker News

I thought it would be fun to analyze the activity on one of my favorite sources of interesting links and information, Hacker News. My source of data is a script I set up some time in August that downloads HN (the Front page and the New Stories page) every minute. We'll be interested in visualizing the stories as they get upvoted during the day and figuring out which domains and users are most popular, what topics are most popular, and the best time to post a story. I'm making all my data and code (Python data collection scripts + IPython Notebook for analysis) available in case you'd like to carry out a similar analysis.

Data Collection Protocol

I set up a very simple Python script that scrapes the HN front page and the new stories page every minute. A single day of data begins at 4am (PST) and ends at 4am the next day. The .html files are saved compressed as gzipped pickles, and one day occupies roughly 10MB in this format. I had to bring down my machine for a few days a few times, so there are some gaps in the data, and in the end we get data from the period between August and October 30.
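The original collection script is not reproduced here, so the following is only a minimal sketch of the protocol described above, written for current Python 3. The page URLs, the one-file-per-minute layout, the file naming, and every function name (`day_key`, `fetch`, `save_snapshot`, `load_snapshot`, `collect_once`) are assumptions made for illustration. It also assumes the machine's local clock is in the time zone you want the 4am boundary to apply to.

```python
import gzip
import pickle
import urllib.request
from datetime import datetime, timedelta
from pathlib import Path

# Assumed URLs for the two pages; the original script is not shown.
PAGES = {
    "front": "https://news.ycombinator.com/news",
    "new": "https://news.ycombinator.com/newest",
}


def day_key(ts: datetime, start_hour: int = 4) -> str:
    """Map a local timestamp to its collection day (4am to 4am)."""
    return (ts - timedelta(hours=start_hour)).strftime("%Y-%m-%d")


def fetch(url: str) -> str:
    """Download one page as text (needs network access)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def save_snapshot(out_dir: Path, ts: datetime, pages: dict) -> Path:
    """Write one minute's raw HTML pages as a gzipped pickle.

    The directory-per-day, file-per-minute layout is an assumption.
    """
    day_dir = out_dir / day_key(ts)
    day_dir.mkdir(parents=True, exist_ok=True)
    path = day_dir / (ts.strftime("%H%M") + ".pkl.gz")
    with gzip.open(path, "wb") as f:
        pickle.dump({"time": ts.isoformat(), "pages": pages}, f)
    return path


def load_snapshot(path: Path) -> dict:
    """Read back a snapshot written by save_snapshot."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


def collect_once(out_dir: Path) -> Path:
    """Fetch both pages now and store them; call this once per minute."""
    now = datetime.now()
    pages = {name: fetch(url) for name, url in PAGES.items()}
    return save_snapshot(out_dir, now, pages)
```

Running `collect_once` from a once-a-minute scheduler (cron, or a loop with `time.sleep(60)`) would approximate the setup described. Subtracting four hours before taking the date is what makes a snapshot taken at 3:59am land in the previous day's bucket.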

Raw HTML Data Parsing

The parsing Python script uses BeautifulSoup to convert the raw HTML into more structured JSON. This script was by no means simple to write: HN is based on unstructured tables and I had to discover many strange edge cases in its behavior along the way. In the end I ended up with a 100-line ugliest-parsing-function-ever (really, I'm not proud of it), but it works and outputs something like the following for a single story at a specific snapshot:

{'domain': u'play.google.com', 'title': u'Nexus 5', 'url': u'https://play.google.com/store/devices/details?id=nexus_5_black_16gb', 'num_comments': 42, 'rank': 1, 'points': 65, 'user': u'sonier', 'minutes_ago': 39, 'id': u'6648519'}

We get such entries every minute (for the front page and for the new page) and these are again all saved to disk. We are now ready to bring out the IPython Notebook and get to the juicy analysis!
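As a rough illustration of how per-minute entries like the one above could be turned into a per-story history (for example, to plot how a story's points grow during the day), here is a small sketch. The function name `points_over_time` and the `(minute, entries)` input shape are assumptions, not the notebook's actual code. In the sample data, only the first entry's `id`, `points`, and `rank` come from the example above; the other values are made up.

```python
from collections import defaultdict


def points_over_time(snapshots):
    """Group per-minute entries by story ID.

    `snapshots` is an iterable of (minute, entries) pairs, where each
    entry is a dict with at least 'id', 'points' and 'rank' keys.
    Returns {story_id: [(minute, points, rank), ...]} sorted by minute.
    """
    history = defaultdict(list)
    for minute, entries in snapshots:
        for entry in entries:
            history[entry["id"]].append((minute, entry["points"], entry["rank"]))
    for track in history.values():
        track.sort()
    return dict(history)


# Illustrative data: two consecutive one-minute snapshots.
snapshots = [
    (0, [{"id": "6648519", "points": 65, "rank": 1}]),
    (1, [{"id": "6648519", "points": 68, "rank": 1},
         {"id": "0000001", "points": 3, "rank": 30}]),
]
history = points_over_time(snapshots)
```

Each story then has a time series of points and rank that can be plotted or summarized, and the same grouping by `id` also works for fields such as `num_comments`.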

The Analysis

Head over to the IPython Notebook, rendered as HTML, for the detailed analysis.

Note: I had the entire dataset and the .ipynb IPython Notebook source available for download, but recently took it down to save space on my host (sorry).

From: http://karpathy.github.io/2013/11/27/quantifying-hacker-news/
