Fake news recognition, from 0 to 95%: machine learning in practice

We developed a fake news detector using machine learning and natural language processing that achieves an accuracy of more than 95% on the validation set. In the real world the accuracy would be lower than 95%, especially as time passes and the way fake news is created changes.

Because of the rapid development of natural language processing and machine learning, I thought we might be able to build a model that could identify fake news and thereby curb the disastrous consequences of a fake news deluge.

It is fair to say that the hardest part of building your own machine learning model is collecting training data. It once took me days and days to collect pictures of all the NBA players in the 2017/2018 season in order to train a face recognition model. This time, I did not know I was about to dive into a months-long, painful process that would reveal just how much genuinely dark and disturbing material is still being disseminated as news and factual information.

Defining fake news

My first hurdle was unexpected. After doing some research on fake news, I quickly found that there are many different categories of misinformation. Some articles are blatantly fabricated; some report real events but interpret them incorrectly; some are pseudoscience; some push a one-sided point of view disguised as news; some are satire; and some consist mostly of tweets and quotes from other people. I searched around and found that some people had tried to sort websites into categories such as "satire", "fake", and "misleading".

I thought this was a good start, so I went ahead, visited these tagged sites, and tried to find some examples. Almost immediately I found a problem: some sites marked as "fake" or "misleading" sometimes published real articles. So I knew there was no shortcut that avoided checking every article.

So I began to ask myself whether my model should take satire and opinion into account, and if so, whether they should be considered false, true, or a category of their own.

Sentiment analysis

After about a week on fake news sites, I began to wonder if I had over-complicated the problem. Maybe I just needed to run a ready-made sentiment analysis model over the articles and see whether a pattern emerged. I decided to build a quick tool that used a web crawler to grab the title, description, author, and content of an article and feed the results into a sentiment analysis model. I used Textbox, a service that is convenient and returns results quickly.
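As an illustration, a minimal sketch of that kind of crawler in Python might look like the following, assuming the requests and beautifulsoup4 packages; the meta-tag lookups and field names are assumptions for illustration, not the tool the author actually built.

```python
import requests
from bs4 import BeautifulSoup

def scrape_article(url):
    """Fetch a page and pull out the title, description, author, and body text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    def meta(name):
        # Look for <meta name="..."> first, then the Open Graph variant.
        tag = (soup.find("meta", attrs={"name": name})
               or soup.find("meta", attrs={"property": "og:" + name}))
        return tag["content"].strip() if tag and tag.has_attr("content") else ""

    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "description": meta("description"),
        "author": meta("author"),
        # Join all paragraph text as a rough stand-in for the article body.
        "content": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
    }
```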

Textbox returns a sentiment score that you can interpret as positive or negative. I then wrote a rough little algorithm that weights the sentiment of the different parts of an article (title, content, author, and so on) and combines them, to see whether a meaningful overall score emerged.
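A rough sketch of that weighting idea is shown below; the weights are illustrative assumptions, and the sentiment_of callback is a placeholder for whatever service returns the per-field score (Textbox, in the author's case).

```python
# Illustrative field weights -- assumptions, not the author's actual values.
WEIGHTS = {"title": 0.3, "description": 0.2, "content": 0.4, "author": 0.1}

def overall_sentiment(article, sentiment_of):
    """Combine per-field sentiment scores (e.g. 0.0 = negative, 1.0 = positive)
    into one weighted global score, skipping fields that are missing."""
    total = 0.0
    weight_used = 0.0
    for field, weight in WEIGHTS.items():
        text = article.get(field, "")
        if text:
            total += weight * sentiment_of(text)  # sentiment_of is a placeholder callback
            weight_used += weight
    return total / weight_used if weight_used else 0.5

# Example usage with the scraper sketched earlier:
#   score = overall_sentiment(scrape_article(url), sentiment_of=my_sentiment_call)
```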

It worked well at first, but by the seventh or eighth article the little algorithm started spouting nonsense: it was far from the fake news detection system I wanted to build.

Failed.

Natural language processing

This is where my friend David Hernandez suggested a different approach: train a model on the actual text. To do this, we needed to provide a large number of sample articles for each category.

Trying to understand the patterns of fake news had already worn me out, so we decided to crawl only sites that were known to be fake, real, or satirical, and see if we could quickly build a dataset.

After a few days of rough crawling, we got a dataset that we thought was large enough to train a model.
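The article does not show the training code (David handled the models), but a generic sketch along these lines illustrates the idea of learning categories from labelled article text. It uses scikit-learn's TF-IDF features with logistic regression; the file name and column names are assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical CSV with one article per row: a "text" column and a
# "label" column with values such as fake / real / satire.
df = pd.read_csv("articles.csv")

X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

model = make_pipeline(
    TfidfVectorizer(max_features=50000, ngram_range=(1, 2)),  # word/bigram features
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```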

The result was nonsense. After digging into the training data, we realized that these sites never fell neatly into the predetermined categories we had imagined. Some mixed fake news with real news, others carried only a handful of blog posts copied from other sites, and in some articles 90% of the text was Trump's tweets. So we realized we had to rebuild the training data.

This is when things got bad.

It was a Saturday, and I started the long process of reading each article before deciding which category it belonged to and clumsily copying and pasting the text into a spreadsheet. There was some dark, disgusting, racist, truly depraved material, and at first I tried to ignore it. But after hundreds of such articles, it began to get to me. By the time my vision was blurring and colors were running together, I had become deeply frustrated. How has human civilization fallen this low? Why can't people think critically? Is there really hope for us? The process lasted several days as I tried to prepare enough data samples for the model.

I also found myself interpreting the fake news: when I came across articles I disagreed with, I got angry, and I had to keep fighting the urge to pick only the articles I thought were right. But then, what is right and what is wrong?

But in the end, I collected enough samples and sent them to David with a lot of confidence.

The next day, he trained the model again while I eagerly awaited the result.

We reached an accuracy of about 70%. At first I thought that was decent, but when I spot-checked the model on arbitrarily chosen articles, I realized it was useless.

Failed.

Fakebox

Back to the drawing board. What had I done wrong? It was David who pointed out that simplifying the problem might be the key to improving accuracy. So I really thought about what problem I was trying to solve. Then it suddenly hit me: maybe I did not need to detect fake news at all; detecting real news would be enough. Real news is easier to classify: it is objective and factual, with little room for interpretation, and there are many credible sources of it.

So I went back to the internet and started collecting training data again. I decided to classify all news into two categories: real and not-real. The not-real class would include satire, one-sided opinion pieces, fake news, and other content that does not follow AP standards.

It took me weeks. I spent hours each day gathering the latest content from sites ranging from The Onion to Reuters. I put thousands of samples of real and not-real content into a huge spreadsheet, adding hundreds more each day. Finally, I thought there were enough samples to try again, so I sent the spreadsheet to David and waited anxiously for the result.
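As a rough illustration of what that two-class spreadsheet might look like and how it could be sanity-checked before training, here is a short sketch; the file name, column names, and label values are assumptions, not the author's actual layout.

```python
import pandas as pd

# Hypothetical spreadsheet: one sample per row, a "text" column and a
# "label" column whose only values are "real" and "not_real".
df = pd.read_excel("news_samples.xlsx")

# Sanity checks before training: only the two expected labels appear,
# and the class balance and dataset size look reasonable.
assert set(df["label"].unique()) <= {"real", "not_real"}
print(df["label"].value_counts())
print("total samples:", len(df))
```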

When I saw the accuracy come back above 95%, I almost jumped out of my seat. It meant we had found a pattern that can distinguish real news from news you should be cautious about.

Success (to a certain extent).

Original: "I trained fake news detection AI with >95% accuracy, and almost went crazy"
