This article describes a project that visualizes the topics that emerged from the 9/11 attacks and their aftermath, as reflected in news articles. I will introduce my starting point, walk through the technical details, and reflect on some of the results.
Brief introduction
Few events in modern American history have been as far-reaching as the 9/11 attacks, and their impact will continue to be felt for years to come. From the attacks to the present, thousands of articles on a wide range of related topics have been published. How can we use the tools of data science to explore these topics and track them over time?
Inspiration
The first to tackle this question was a company called Local Projects, which was commissioned to build an exhibit for the National September 11 Memorial & Museum in New York. Their exhibit, Timescape, visualizes the topics and articles surrounding the event and projects them onto a wall of the museum. Unfortunately, owing to institutional constraints and the short attention span of modern visitors, the exhibit can only cycle quickly through a large number of topics. Timescape's design inspired me, but I wanted to try something more in-depth and interactive, so that anyone with internet access can explore it at their leisure.
The key to this problem is how to tell a story. Each article tells the story from a different perspective, but certain words connect them. "Osama bin Laden", "Guantanamo Bay", "Freedom", and many more form the building blocks of my model.
Get Data
No source is better suited to tell the story of 9/11 than The New York Times. The paper also has a wonderful API that lets you query its database for all articles on a given topic. I built my dataset with this API, plus some Python web-scraping and NLP tools.
The crawl process is as follows (a sketch of the first two steps appears after the list):
- Call the API to query article metadata, including the URL of each article.
- Send a GET request to each URL, find the body text in the HTML, and extract it.
- Clean up the article text, removing stop words and punctuation.
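Roughly, the first two steps might look like this with the requests library; the query parameters and API key below are placeholders for illustration, not the exact ones used, and the real script paged through many result pages:

import requests
from bs4 import BeautifulSoup

API_KEY = 'YOUR_API_KEY'  # placeholder; a real NYT API key is required
SEARCH_URL = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'

# Step 1: query article metadata (a single page of results shown here)
params = {'q': 'September 11', 'begin_date': '20010911', 'api-key': API_KEY}
docs = requests.get(SEARCH_URL, params=params).json()['response']['docs']
urls = [doc['web_url'] for doc in docs]

# Step 2: fetch each article page and parse its HTML
for url in urls:
    s = BeautifulSoup(requests.get(url).text, 'html.parser')
    # ... extract the body text from s (see the extraction function below)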
I wrote a Python script that did all of this automatically and built a dataset of thousands of articles. Perhaps the most challenging part of the process was writing the function that extracts the body text from an HTML document. Over the years, The New York Times has changed the structure of its HTML pages several times, so the extraction function relies on a cumbersome chain of nested conditionals:
# s is a BeautifulSoup object containing the HTML of the page
if s.find('p', {'itemprop': 'articleBody'}) is not None:
    paragraphs = s.find_all('p', {'itemprop': 'articleBody'})
    story = ' '.join([p.text for p in paragraphs])
elif s.find('nyt_text'):
    story = s.find('nyt_text').text
elif s.find('div', {'id': 'mod-a-body-first-para'}):
    story = s.find('div', {'id': 'mod-a-body-first-para'}).text
    story += s.find('div', {'id': 'mod-a-body-after-first-para'}).text
else:
    if s.find('p', {'class': 'story-body-text'}) is not None:
        paragraphs = s.find_all('p', {'class': 'story-body-text'})
        story = ' '.join([p.text for p in paragraphs])
    else:
        story = ''
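The cleaning step (removing stop words and punctuation) is not shown in the snippet above; it could look roughly like the following NLTK-based sketch, which is an illustration rather than the exact code used:

import string
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def clean_text(story):
    # drop punctuation, lowercase, and remove stop words
    story = story.translate(str.maketrans('', '', string.punctuation))
    tokens = [w for w in story.lower().split() if w not in stop_words]
    return ' '.join(tokens)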
Vectorizing the Documents
Before we can apply machine learning algorithms, we need to vectorize the documents. Thanks to scikit-learn's TF-IDF vectorizer module, this is easy. Single words alone are not enough, because the dataset is full of important multi-word names. So I chose to use n-grams, with n running from 1 to 3. Happily, handling multiple n-gram lengths is as simple as handling single keywords; it only requires setting the vectorizer's parameters.
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(max_features=max_features,
                      ngram_range=(1, 3),
                      max_df=max_df)
In the initial model, I set max_features (the maximum number of words or phrases in the vector model) to 20,000 or 30,000, within the limits of my computer's processing power. But since I was also including 2-grams and 3-grams, these combinations cause an explosion in the number of candidate features (many of which are important), so I raised that number in my final model.
Using NMF to Build the Topic Model
Non-negative matrix factorization (NMF) is a linear-algebra optimization algorithm. Its most magical property is that it can extract meaningful information about topics without any prior knowledge of what those topics mean. Mathematically, its goal is to decompose an n x m input matrix into two matrices called W and H, where W is an n x t document-topic matrix and H is a t x m topic-word matrix. Notice that the product of W and H has the same shape as the input matrix; in fact, the algorithm constructs W and H so that their product approximates the input matrix. Another advantage is that the user chooses the value of t, the number of topics to generate.
Once again, I handed this heavy lifting to scikit-learn, whose NMF module was sufficient for the task. If I had spent more time on this project, I might have found a more efficient NMF implementation, since this is the most complex and time-consuming step in the pipeline. One idea that came up during implementation was warm starting: letting the user seed rows of the H matrix with certain words, giving the system some domain knowledge as the topics form. In any case, I only had a few weeks to complete the project, and plenty of other things demanded my attention.
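As an illustration, the factorization itself might look roughly like this with scikit-learn; the variable names (documents, vec) are carried over from the earlier snippets as assumptions, and the parameters are simplified:

from sklearn.decomposition import NMF

# tfidf is the (n_documents x n_features) matrix produced by the vectorizer
tfidf = vec.fit_transform(documents)

nmf = NMF(n_components=200)       # t = 200 topics
W = nmf.fit_transform(tfidf)      # (n_documents x 200) document-topic matrix
H = nmf.components_               # (200 x n_features) topic-word matrix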
Parameters for the Topic Model
Because the topic model is the cornerstone of the entire project, the decisions I made while building it had a big impact on the final result. I decided to feed the model articles from the 18 months following the 9/11 attacks, a period when coverage was still dominated by the event, so the topics from this period really are direct consequences of 9/11. As with vectorization, my early runs were constrained by my computer's capacity. The results with 20 or 30 topics were good, but I wanted a larger model that captured more.
My final model used 100,000 n-gram features and about 15,000 articles. With 200 topics, the NMF algorithm had to handle matrices of size 15,000 x 100,000, 15,000 x 200, and 200 x 100,000; the latter two are adjusted iteratively so that their product approximates the first.
Completing the Model
After the final model matrices were computed, I looked at each topic and examined its keywords (those with the highest values in the topic-word matrix). I gave each topic a descriptive name (to be used in the visualization) and decided whether to keep it. Some topics were deleted because they had nothing to do with the central theme (local sports, for instance); some were too broad (the stock market, or politics in general); and some were too specific, most likely artifacts of the NMF algorithm (for example, a series of 3-grams drawn from a single article).
After this process I had 75 clear, relevant topics, each named after its content.
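Pulling the top keywords for a topic is straightforward once H and the vectorizer's vocabulary are available; here is a minimal sketch reusing the names from the snippets above:

import numpy as np

# vocabulary of n-grams, in the same column order as H
feature_names = vec.get_feature_names_out()  # get_feature_names() in older scikit-learn

def top_keywords(topic_index, n_words=10):
    # indices of the n_words largest values in this topic's row of H
    top_idx = np.argsort(H[topic_index])[::-1][:n_words]
    return [feature_names[i] for i in top_idx]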
Analysis
Once the topic model is trained, it is easy to compute the weights of the different topics for any given article (see the sketch after the list):
- Use the stored TF-IDF model to vectorize the text of the article.
- Compute the dot product of this vector with the condensed (75-topic) NMF topic-word matrix: (1 x 100k) * (100k x 75) = (1 x 75).
- The 75 entries of the resulting vector indicate how strongly the article relates to each of the 75 topics.
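A minimal sketch of these three steps, assuming vec is the fitted vectorizer from earlier and H_final is the condensed 75 x 100k topic-word matrix (both names are illustrative):

import numpy as np

def topic_weights(article_text, vec, H_final):
    # step 1: vectorize the article with the stored TF-IDF model
    x = vec.transform([article_text])        # sparse, shape (1, 100k)
    # step 2: project onto the 75 retained topics
    w = x.dot(H_final.T)                     # (1, 100k) * (100k, 75) -> (1, 75)
    # step 3: the 75 entries are the article's topic weights
    return np.asarray(w).ravel()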
The harder part was deciding how to turn these weights into a visual form that tells a story. If I simply summed the topic weights of all articles over a period of time, the resulting distribution should accurately reflect how prevalent each topic was during that period. But the components of such a distribution are meaningless to a human reader. Alternatively, if I classify each article into topics in a binary way, I can compute the percentage of articles related to a given topic over a period of time. I chose this approach because it is easier to interpret.
Classifying articles into topics is also difficult, especially with so many articles and topics. Some articles have high weights across many topics because they are long and contain keywords that appear in many different topics. Other articles have low weights in most topics, even though a human reader can see that they clearly relate to certain topics. These differences mean that a fixed weight threshold is not a good classification rule: some articles would belong to many topics and some to none at all. I decided instead to assign each article to the three topics with the highest weights. Although this method is imperfect, it strikes a good balance against the shortcomings of the topic model.
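As a sketch of this scheme, the top-three assignment and the per-period percentage can be computed with a couple of NumPy operations (the array names here are illustrative):

import numpy as np

# weights: (n_articles, 75) array of topic weights, one row per article
def assign_top_three(weights):
    # column indices of the three largest weights in each row
    return np.argsort(weights, axis=1)[:, ::-1][:, :3]

# share of articles in a time window that were assigned to a given topic
def topic_percentage(assignments, topic_id):
    hits = np.any(assignments == topic_id, axis=1)
    return 100.0 * hits.mean()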
Visualization
Although the data acquisition, topic modeling, and analysis stages are important to this project, they all serve the final visualization. I tried to balance visual appeal with user interaction, letting users explore and understand topic trends without guidance. I started with a stacked chart, but realized that simple line charts were sufficient and clearer.
I used D3.js for the visualization, which suits the data-driven nature of this project. The data itself is loaded into the web page from a CSV file containing the topic trend data and two JSON files containing topic and article metadata. Although I am not an expert front-end developer, I managed to build a satisfying visualization page by picking up enough D3, HTML, and CSS over a week of study.
Some Interesting Topics
Anthrax – after 9/11, panic gripped the country. Fortunately, much of that panic was overblown. The anthrax scare of late 2001 was an isolated event with no lasting aftermath, as the chart clearly shows.
Osama bin Laden, al-Qaeda, Tora Bora – all three topics spiked after Osama bin Laden's death in Abbottabad in 2011. This combination of topics is noteworthy because it shows how media attention evolved after 9/11: at first, bin Laden himself received most of the attention. Soon after, the Tora Bora topic rose to prominence, as bin Laden's suspected hideout and the focus of the U.S. military. When bin Laden escaped the hunt, attention to both topics dropped, while the broader al-Qaeda topic rose somewhat. The gradual rise of each topic in recent years illustrates their continued relevance: even without a significant increase in coverage, their relative share grows when other topics are quiet.
What have I learned?
Although I have presented this project in terms of the topic model and the various components of the data pipeline, its real meaning lies in the story it tells. The nature of 9/11 is negative, but there are also many positive stories: heroes who saved many people, communities coming together, and rebuilding.
Unfortunately, the media environment reflected in my topic model focuses on negativity, villains, and destruction. Of course, some individual heroes were praised in one or two articles, but none received coverage broad enough to form a topic. On the other hand, villains like Osama bin Laden and Zacarias Moussaoui are mentioned in many articles. Even Richard Reid, the clumsy would-be bomber who tried to blow up a plane with explosives in his shoe, has had a more lasting media impact than some genuine heroes. (A side note: one drawback of a word-based topic model is that a common name like Reid can group together articles about different people; in this case, articles about another Reid were mixed in with those about Richard Reid.)