Visual Analysis of 9/11 News with a Topic Model (Python)

Source: Internet
Author: User
Tags: python, script, idf

Translated by Bole Online (Dongdou). English source: blog.dominodatalab.com. Do not reprint without permission.

This article describes a project that visualizes the topics found in news articles about the 9/11 attacks and their aftermath. I will cover my starting point, the technical details of the implementation, and my thoughts on some of the results.

Introduction

Few events in modern American history are as profound as the 9/11 attacks, and their impact will continue to be felt for years to come. From the day of the attacks to the present, thousands of articles on many different themes have appeared in print. How can we use the tools of data science to explore these topics and track how they change over time?

Inspiration

The inspiration came from a firm called Local Projects, which was commissioned to build an exhibition for the National September 11 Museum in New York. Their exhibit, Timescape, visualizes the themes and articles surrounding the event and projects them onto a wall of the museum. Unfortunately, owing to bureaucratic constraints and visitors' short attention spans, the exhibit can only cycle quickly through a large set of themes. Timescape's design inspired me, but I wanted to build something deeper and more interactive, so that anyone with internet access could explore it at their leisure.

The key question is how to tell the story. Each article tells it from a different perspective, but the articles are connected by words and phrases. "Osama bin Laden", "Guantanamo Bay", "freedom", and many other terms are the building blocks of my model.

Get Data

No source is more appropriate than The New York Times for telling the story of 9/11. They also have a wonderful API that allows articles on a topic to be queried from their database. I built my dataset with this API plus some other Python web-scraping and NLP tools.

The crawl process is as follows:

    1. Call the API to query metadata for the news, including the URL of each article.
    2. Send a GET request to each URL, locate the body text in the HTML, and extract it (see the sketch below).
    3. Clean the article text: remove stop words and punctuation.
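
To make the first two steps concrete, here is a minimal sketch of that loop. It assumes the NYT Article Search API v2; the endpoint, query parameters, and helper names are illustrative rather than taken from the original script.

    import requests
    from bs4 import BeautifulSoup

    # Illustrative sketch of steps 1-2; the endpoint and field names assume the
    # NYT Article Search API v2 and may differ from the original script.
    API_URL = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'

    def fetch_article_urls(query, api_key, page=0):
        params = {'q': query, 'page': page, 'api-key': api_key}
        docs = requests.get(API_URL, params=params).json()['response']['docs']
        return [doc['web_url'] for doc in docs]    # article URLs from the metadata

    def fetch_article_soup(url):
        html = requests.get(url).text
        return BeautifulSoup(html, 'html.parser')  # parsed page, ready for body extraction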

I wrote a Python script to do all of this automatically and was able to build a dataset of thousands of articles. Perhaps the most challenging part of the process was writing the function that extracts the body text from an HTML document. Over the past few decades The New York Times has changed the structure of its HTML several times, so the extraction function relies on a cumbersome set of nested conditionals:

    # s is a BeautifulSoup object containing the HTML of the page
    if s.find('p', {'itemprop': 'articleBody'}) is not None:
        paragraphs = s.findAll('p', {'itemprop': 'articleBody'})
        story = ' '.join([p.text for p in paragraphs])
    elif s.find('nyt_text'):
        story = s.find('nyt_text').text
    elif s.find('div', {'id': 'mod-a-body-first-para'}):
        story = s.find('div', {'id': 'mod-a-body-first-para'}).text
        story += s.find('div', {'id': 'mod-a-body-after-first-para'}).text
    else:
        if s.find('p', {'class': 'story-body-text'}) is not None:
            paragraphs = s.findAll('p', {'class': 'story-body-text'})
            story = ' '.join([p.text for p in paragraphs])
        else:
            story = ''

Document Vectorization

Before we can apply machine learning algorithms, we need to vectorize the documents. Thanks to scikit-learn's TF-IDF vectorizer module, this is easy. Considering single words alone is not enough, because my dataset is full of important multi-word names, so I chose n-grams with n from 1 to 3. Happily, handling multiple n-gram lengths is as simple as handling single keywords: just set the vectorizer's parameters.

    from sklearn.feature_extraction.text import TfidfVectorizer

    vec = TfidfVectorizer(max_features=max_features,
                          ngram_range=(1, 3),
                          max_df=max_df)

In the initial model, I set max_features (the maximum number of words or phrases in the vector model) to 20,000 or 30,000, within my computer's capability. But since I also include 2-grams and 3-grams, these combinations make the number of candidate features explode (and many of them are important), so in my final model I raised that number.
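
For completeness, the vectorizer is then fit on the cleaned article texts to produce the document-term matrix that NMF will factorize. A short sketch; `articles` here stands for the list of cleaned article strings:

    # Fit the TF-IDF vectorizer on the cleaned article texts. The resulting sparse
    # matrix (n_articles x max_features) is the input to NMF in the next section.
    tfidf_matrix = vec.fit_transform(articles)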

Using NMF to Build the Topic Model

Non-negative matrix factorization (NMF) is a linear-algebra optimization algorithm. Its most remarkable property is that it can extract meaningful information about topics without any prior knowledge of what those topics mean. Mathematically, its goal is to decompose an n x m input matrix into two matrices, W and H, where W is an n x t document-topic matrix and H is a t x m topic-word matrix. Notice that the product of W and H has the same shape as the input matrix; in fact, the algorithm constructs W and H so that their product approximates the input. Another advantage is that the user chooses the value of t, the number of topics to generate.

Once again, I handed this important task to scikit-learn, whose NMF module was enough to handle it. If I had devoted more time to the project, I might have found a more efficient NMF implementation, since this is the most complex and time-consuming step in the pipeline. One idea I had during implementation, but did not pursue, is warm starting: letting the user seed rows of the H matrix with specific words, thereby injecting domain knowledge into the topic-forming process. In any case, I only had a few weeks to complete the project, and many other things needed my attention.
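
In scikit-learn the decomposition itself is only a few lines. This is a minimal sketch assuming the `tfidf_matrix` produced in the vectorization step; the variable names are illustrative:

    from sklearn.decomposition import NMF

    # t is the number of topics to extract (chosen by the user)
    nmf = NMF(n_components=t, random_state=42)
    W = nmf.fit_transform(tfidf_matrix)   # n_articles x t : document-topic matrix
    H = nmf.components_                   # t x n_terms    : topic-word matrix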

Parameters of the Topic Model

Because the topic model is the cornerstone of the entire project, the decisions I made while building it have a big impact on the final result. I decided to feed the model only articles from the 18 months following the attacks. During that window the coverage had not yet drifted into unrelated noise, so the topics it produces are genuinely direct consequences of 9/11. In the vectorization phase, the scale of the first few runs was limited by my computer. The results with 20 or 30 topics were good, but I wanted a larger model with richer results.

My final model uses a vocabulary of 100,000 n-grams and about 15,000 articles. I set 200 topics, so the NMF algorithm works with matrices of size 15,000 x 100,000, 15,000 x 200, and 200 x 100,000, gradually adjusting the latter two so that their product approximates the first.

Complete the Model

After the final model matrices were computed, I looked at each topic and inspected its keywords (those with the highest weights in the topic-word matrix). I gave each topic a descriptive name (to be used in the visualization) and decided whether to keep it. Some topics were removed because they had nothing to do with the central story (such as local sports); some were too broad (topics about the stock market or politics in general); some were too specific, probably artifacts of the NMF algorithm (for example, a series of related 3-grams that all came from the same article).
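
Inspecting each topic's keywords can be done directly from the factor matrices. A minimal sketch, assuming the fitted vectorizer `vec` and the topic-word matrix `H` from above (older scikit-learn versions use `get_feature_names()` instead):

    feature_names = vec.get_feature_names_out()   # the n-gram vocabulary
    for topic_idx, topic in enumerate(H):
        top_terms = [feature_names[i] for i in topic.argsort()[::-1][:10]]
        print(f'Topic {topic_idx}: {", ".join(top_terms)}')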

After this process I had 75 clear and relevant topics, each named for its content.

Analysis

Once the topic model has been trained, it is easy to compute the weights of the different topics for a given article (a sketch follows the steps):

    1. Vectorize the article's text with the stored TF-IDF model.
    2. Take the dot product of this vector with the transposed, pruned NMF topic-word matrix: (1 x 100k) · (100k x 75) = (1 x 75).
    3. The 75 components of the resulting vector indicate how relevant the article is to each of the 75 topics.
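
In code, this amounts to one transform and one dot product. A minimal sketch; `vec` is the fitted vectorizer and `H_pruned` stands for the 75 x 100k topic-word matrix kept after the manual review (both names are illustrative):

    article_vector = vec.transform([article_text])   # 1 x 100k sparse TF-IDF vector
    topic_weights = article_vector.dot(H_pruned.T)   # (1 x 100k) . (100k x 75) = 1 x 75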

The harder part was deciding how to turn these weights into a form suitable for telling a story through visualization. If I simply summed the topic weights across all articles in a time period, the resulting distribution would accurately represent how often each topic appears in that period, but its components would be meaningless to a human reader. Alternatively, if I classify each article under topics, I can compute the percentage of articles related to a given topic over a period of time. I chose this method because it illustrates the question better.

Classifying articles into topics is also hard, especially with so many articles and topics. Some articles have high weights under many topics because they are long and contain keywords that appear across different topics. Other articles have low weights for most topics, even though a human reader would readily associate them with certain topics. These differences mean a fixed weight threshold is not a good classification rule: some articles would belong to many topics and some to none. I decided instead to assign each article to its three highest-weighted topics. Although imperfect, this method strikes a good balance and works around several shortcomings of the topic model.
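
A sketch of that rule, assuming a `doc_topic` matrix (n_articles x 75) holding each article's topic weights; the helper function is purely illustrative:

    import numpy as np

    # Assign each article to its three highest-weighted topics.
    top3 = np.argsort(doc_topic, axis=1)[:, -3:]

    # Share of articles in a set of assignments that touch a given topic.
    def topic_share(topic_idx, assignments):
        return np.mean([topic_idx in row for row in assignments])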

Visualization

Although the data acquisition, topic modeling, and analysis phases are all essential to this project, they exist in service of the final visualization. I tried to balance visual appeal with user interaction, so that users can explore and understand topic trends without guidance. I started with stacked blocks, then realized that simple line charts were clearer and sufficient.

I used D3.js for the visualization, which fits the data-driven nature of this project well. The data itself is loaded into the web page from a CSV file containing the topic trend data and two JSON files containing the topic and article metadata. Although I am not a front-end expert, in about a week of study I learned enough D3, HTML, and CSS to build a visualization page I am happy with.

Some interesting topics:
    • Anthrax – after 9/11, anthrax panic gripped the country. Fortunately, most of the panic was over-worry. The chart clearly shows that the anthrax scare of late 2001 was an isolated event with little lasting follow-up.

    • Osama bin Laden, al-Qaeda, Tora Bora – the spike of interest in all three topics occurred after bin Laden was killed in Abbottabad in 2011. This combination of topics is noteworthy because it shows how media attention evolved after 9/11: at first, bin Laden received a great deal of attention; after a while, the Tora Bora topic became prominent, since Tora Bora was bin Laden's suspected hideout and the focus of the U.S. military. When bin Laden escaped the manhunt, attention to both topics declined, while the broader al-Qaeda conversation rose somewhat. The gradual rise of each topic in recent years reflects their relative prominence: even without a large absolute increase, their relative share of attention grows when other topics are quiet.

What have I learned?

Although I approached this project with an understanding of topic models and the various components of data processing, the real significance of the project is the story it tells (once again). The nature of 9/11 is negative, but there are many positive stories as well: heroes who saved many lives, communities coming together, and rebuilding.

Unfortunately, the media landscape reflected in my topic model focuses on negativity, villains, and destruction. Some individual heroes were praised in one or two articles, but none received coverage broad enough to form a topic. On the other hand, villains like Osama bin Laden and Zacarias Moussaoui are mentioned in many articles. Even Richard Reid, a clumsy would-be bomber, has a more enduring media footprint than some of the successful heroes. (A side note: one drawback of a vocabulary-based topic model is that a common name like Reid causes articles about different people to cluster together; in this case, Harry Reid and Richard Reid.)
