Using fastText for text processing and text prediction

Source: Internet
Author: User

My first exposure to text prediction came from taking part in the big data contest organized by DataFountain and CCF. After comparing several models, I decided to try fastText. Getting started with fastText was painful, because the major blog sites in China have very few posts about it. For one thing, Facebook only open-sourced fastText last year, so few people use it yet; for another, most of the reference material is in English. I chewed through the English documentation for a long time and used a VPN to reach foreign forums before I finally managed a simple start. I spent almost all of the past two days on this, and it left a deep impression, so I decided to write this post. It only covers getting started, and I welcome comments and differing views from readers.
Problem Analysis

360 Search set this problem, under the title "360 Search: after AlphaGo, 'Man vs. Machine' Round 2, a Peak Duel between Machine Writing and Human Writing". It sounds intimidating at first glance, but the task is simply for developers to build a model that identifies whether an article was written by a machine or by a human. The two do differ. In a human-written article, the title and the content are closely related (clickbait aside), the text has a certain logical continuity, and garbled passages rarely appear in the body. Machine-written articles differ from human-written ones in all of these respects.

Perhaps "an article written by a machine" is not very intuitive when described this way. I took a small sample from the dataset; the following article was written by a machine.

Chongqing Yongchuan Fire Warning: Summer heat coming to alert for fire hazards
  Cinema part of the housing structure deformation, two safe evacuation door deformation can not be opened, since the fits, avoid water-based material quicklime, residents and unit power consumption will also increase, resulting in more fire, Zhongqing Yongchuan Fire brigade in this reminds you: high temperature weather to enhance safety awareness, To strengthen fire prevention and control, we should always be vigilant to the following common fire. Rescue fire officers arrived at the scene. First, electrical fire. With the advent of high temperature, air-conditioning, refrigerators and other electrical equipment increased a large number of electrical equipment line overload operation, the damage caused by the insulation of the power supply short-circuit ignition, pictured as the trapped people were successfully rescued. or electrical motor water moisture, so that the insulation strength, the occurrence of short-circuit burned motor ignition. Second, car fire. Rescue firefighters arrived at the scene, the summer is easy to happen car fire, the main reason is: Some car use for too long, has been from the second floor window out loud shouting "help", power line aging prone to short-circuit, some car overload load, resulting in engine temperature rise, coupled with the hot weather, engine ventilation equipment is not good, Thus causing auto spontaneous combustion. And, some owners in order to clean the air inside the car, causing the movie in the cinema to watch the 9 people trapped, choose to place in their car perfume, air freshener, two groups of trapped people to emotional comfort. Reading glasses, lighters and other items, very easy to cause fire. Third, the county's blue digital cinema behind a landslide, battery car fire. Successfully removed the deformed safety evacuation door to a single exit. With the popularization of electric car, battery charging caused a few fires. 
In particular, some users of the private cable connection, not as required to use power strip, Gongshan County Digital cinema behind the landslide. A breach of charge causes a fire. Four, the construction site fire. On the construction site of oxygen cylinders, acetylene bottles, fire-retardant materials, paint thinner and other inflammable and explosive material management is not strict, directly placed in the high temperature exposure, did not take effective occlusion measures, not set up in the ventilation, cool place to save, three groups using broken tools to break the deformation of the security door, so it is easy to fire accident five, dangerous Successfully removed the deformed safety evacuation door to a single exit. Summer ground temperature sometimes up to 40 ℃ above, rescue fire officers and soldiers through metal cutting machine, door breaker, hydraulic demolition tool group and other demolition equipment, in such a hot temperature conditions, chemical dangerous goods in production, pictured trapped people were successfully rescued. Transport, peroxide alkali. Therefore, it is important to keep it carefully and to take the picture as a damaged and serious blue-film digital cinema, using flammable and explosive chemical dangerous goods. Six, the material spontaneous combustion fire. Spontaneous combustion substances In addition to the past we often talk about straw, coal heap, cotton duo, trapped personnel have been all rescued, there are oily fiber, tertiary, ammonium nitrate fertilizer, leading to the cinema in the movie 9 people trapped, fishmeal, agricultural products. These substances are stored, if the accumulation of time is too long, ventilation is not good, itself will change to produce heat, temperature gradually increased. Bogey water-based material quicklime, anhydrous alumina,peroxide, chlorine sulfonic acid, and so on, these substances will release a lot of combustible gas after encountering water or moisture in the air, four.

Look carefully at the article above and the flaws show:

1. Text that needs no emphasis is repeated, for example "bogey water-based material quicklime";

2. The logic does not flow: the article ends with a stray "four" whose meaning is unclear;

3. The article shows obvious traces of patchwork: the numbered points ("First", "Second", "Third", ...) reveal that it was clipped together from many articles, and adjacent passages are only weakly related.


There are two datasets, 1.6 GB and 2 GB respectively: one is the training set (used to train the model) and the other is the test set (used to produce the submitted results).

Data Format

Data one: training set, 500,000 samples (with label answers). The data format is as follows:

Field | Type | Description | Note
Article ID | String | Article ID |
Article title | String | Title of the article | At most 100 characters; desensitized; line breaks removed.
Article content | String | Content of the article | Desensitized; a single long string with line breaks removed.
Label answer | String | Human writing is positive, robot writing is negative | Contains both robot-written and human-written articles. Contestants may train on all of this data or on a subset, but may not add outside data to the training set.

Data two: test set A, 100,000 samples (no label answers). The data format is as follows:

Field | Type | Description | Note
Article ID | String | Article ID |
Article title | String | Title of the article | At most 100 characters; desensitized; line breaks removed.
Article content | String | Content of the article | Desensitized; a single long string with line breaks removed.

Data three: test set B, 300,000 samples (no label answers). The data format is as follows:

Field | Type | Description | Note
Article ID | String | Article ID |
Article title | String | Title of the article | At most 100 characters; desensitized; line breaks removed.
Article content | String | Content of the article | Desensitized; a single long string with line breaks removed.

All three datasets contain both robot-written and human-written articles. A sample consists mainly of an article ID, a title, the content, and a label (positive for human writing, negative for robot writing). You train a model on the training set and then use it to judge whether each article in the test set was written by a human or generated by a machine: if the article was generated by a robot the label is negative, otherwise it is positive. Labels are provided only for the training set; contestants must predict them for the test sets.
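As a rough illustration of the record layout described above, here is a minimal sketch of parsing one line into its fields. The tab delimiter, the `Article` type, and the `parse_record` name are assumptions made for the example; the contest description quoted here does not say how the fields are separated.

```python
from collections import namedtuple

# Field names follow the tables above. The type name is hypothetical.
Article = namedtuple("Article", ["article_id", "title", "content", "label"])


def parse_record(line, delimiter="\t"):
    """Split one line of a data file into an Article.

    Assumption: fields are separated by `delimiter` (a tab here).
    Test-set lines carry no label, so `label` is None for them.
    """
    parts = line.rstrip("\n").split(delimiter)
    if len(parts) == 4:
        return Article(*parts)
    if len(parts) == 3:
        return Article(parts[0], parts[1], parts[2], None)
    raise ValueError("expected 3 or 4 fields, got %d" % len(parts))


train = parse_record("id001\tSome title\tSome body text\tnegative\n")
test = parse_record("id002\tAnother title\tAnother body\n")
print(train.label, test.label)
```

Because the article content has its line breaks removed, one sample per line is a natural reading of the format, but that too is an assumption.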

Data preprocessing

Data preprocessing is critical, and many teams said they spent a great deal of time on it. I took a lazy shortcut: I used jieba to segment the text of the training and test sets into words, and then converted the result to fastText format.

# -*- coding: utf-8 -*-
# author: Linxinzhu
import jieba

seg_list = jieba.cut("This contest really costs time", cut_all=True)
print("Full Mode: " + "/".join(seg_list))  # full mode

seg_list = jieba.cut("Zwq Addicted to wandering QQ space, but also occasionally sultry sister", cut_all=False)
print("Default Mode: " + "/".join(seg_list))  # accurate mode

seg_list = jieba.cut("Test set is great, run once for a long time")  # accurate mode is the default
print(",".join(seg_list))

# search engine mode
seg_list = jieba.cut_for_search("This blog was written on November 17, 2017, Everyone crossing feel useful, you can comment like")
print(",".join(seg_list))

Output results

PS C:\users\linxinzhu\py> python ...
Full Mode: Building prefix dict from C:\Python27\lib\site-packages\jieba\dict.txt ...
Loading model from cache C:\users\linxin~1\appdata\local\temp\jieba.cache
Loading model cost 0.804000139236 seconds.
Prefix dict has been built succesfully.
Default mode:zwq/Addiction/stroll/qq/space/,/also/occasional/provocative sister
test, set good, big, ah,,, run, once, want, long
This article, blog, Yes, in, 2017, year, 11, month, 17, day write,,,, everyone, crossing, feel, useful, words,,, can, comments, likes

It is important to note that:

1. Remember to declare the encoding at the top of the script. Encoding is also quite troublesome in the fastText code later on, and leaving the declaration out leads to surprises.

2. jieba.cut returns a generator rather than a list (jieba.lcut returns a list), so the segments have to be joined into a string before concatenation or output, usually with a separator's join(), for example " ".join(seg_list).
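The conversion to fastText format mentioned in the preprocessing step can be sketched as follows. fastText's supervised mode reads one sample per line, with the label marked by a `__label__` prefix and followed by space-separated tokens. The function name `to_fasttext_line` and the whitespace fallback tokenizer are assumptions for illustration, not the author's code.

```python
def to_fasttext_line(label, text, tokenize=None):
    """Return one training line in fastText's supervised format.

    Each line is "__label__<label>" followed by space-separated tokens.
    """
    # Assumption: plain whitespace splitting stands in for a real segmenter.
    # Pass tokenize=jieba.lcut to segment Chinese text instead.
    tokens = tokenize(text) if tokenize else text.split()
    # Drop empty or whitespace-only tokens so the sample stays on one line.
    tokens = [t.strip() for t in tokens if t.strip()]
    return "__label__%s %s" % (label, " ".join(tokens))


print(to_fasttext_line("negative", "summer heat fire warning"))
# prints: __label__negative summer heat fire warning
```

Writing one such line per training article produces a file that can be handed to fastText for supervised training; test-set articles would be segmented the same way, without the label.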

Symbolic processing

import re

def go_split(s, min_len):
    # build the regular expression: one or more separator characters
    symbol = u',;.. 、。!'
    symbol = u"[" + symbol + u"]+"
    # split the string in one pass
    result = re.split(symbol, s)
    return [x for x in result if len(x) > min_len]

def is_dup(s, min_len):
    # True if any fragment longer than min_len occurs more than once
    result = go_split(s, min_len)
    return len(result) != len(set(result))

def is_neg_symbol(uchar):
    # True if the single character uchar is one of the listed symbols
    neg_symbol = [u'!', u'0', u';', u'?', u',', u'.']
    return uchar in neg_symbol
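To show how the duplicate check relates to the first flaw noted earlier (repeated text), here is a small self-contained variant with a usage example. The names `split_clauses` and `has_repeated_clause` are hypothetical stand-ins for go_split and is_dup; this variant also strips surrounding whitespace and uses ASCII punctuation only, which the listing above does not.

```python
import re


def split_clauses(text, min_len):
    """Split text on runs of punctuation; keep clauses longer than min_len."""
    parts = re.split(u"[,;.!?]+", text)
    return [p.strip() for p in parts if len(p.strip()) > min_len]


def has_repeated_clause(text, min_len):
    """True if any sufficiently long clause occurs more than once."""
    clauses = split_clauses(text, min_len)
    return len(clauses) != len(set(clauses))


spliced = ("the safety door was removed, firefighters arrived, "
           "the safety door was removed.")
fluent = "the weather was hot, firefighters arrived, everyone was rescued."

print(has_repeated_clause(spliced, 5))  # True
print(has_repeated_clause(fluent, 5))   # False
```

The min_len threshold keeps short fragments, which repeat naturally in any text, from triggering the check.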

Special word processing

Some words have a special meaning in certain positions, for example as part of "indeed" or "understanding", but in most cases they have no particular effect on the meaning of the article. For example, "Drink milk this morning" reads much the same with or without such a word.

if (u"," in s0) and not (u", indeed" in s0) and not (u", taxi" in s0) \
        and not (u", cab" in s0) and not (u", true" in s0):
    flag = "negative"
if (u"," in s0) and not (u", understanding" in s0) and not (u", end" in s0) \
        and not (u", none" in s0) and not (u", but" in s0) \
        and not (u", great" in s0):
    flag = "negative"

if (u"." in s0) and not (u". indeed" in s0) and not (u". taxi" in s0) \
        and not (u". cab" in s0) and not (u". true" in s0):
    flag = "negative"
if (u"." in s0) and not (u". understanding" in s0) and not (u". end" in s0) \
        and not (u". none" in s0) and not (u". but" in s0) \
        and not (u". great" in s0):
    flag = "negative"

if (u";" in s0) and not (u"; indeed" in s0) and not (u"; taxi" in s0) \
        and not (u"; cab" in s0) and not (u"; true" in s0):
    flag = "negative"
if (u";" in s0) and not (u"; understanding" in s0) and not (u"; end" in s0) \
        and not (u"; none" in s0) and not (u"; but" in s0) \
        and not (u"; great" in s0):
    flag = "negative"

if (u"?" in s0) and not (u"? indeed" in s0) and not (u"? taxi" in s0) \
        and not (u"? cab" in s0) and not (u"? true" in s0):
    flag = "negative"
if (u"?" in s0) and not (u"? understanding" in s0) and not (u"? end" in s0) \
        and not (u"? none" in s0) and not (u"? but" in s0) \
        and not (
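The repeated conditions above all follow one pattern, which could be expressed as a single table-driven helper. The sketch below is an interpretation, not the author's code: it assumes each rule tests whether a punctuation mark is directly followed by a particle, unless the mark is followed somewhere in the text by a longer word that begins with that particle. The function name `flag_by_particle`, the particle 的, and the exception words 的确 and 的士 are all assumptions inferred from the translated words in the listing.

```python
def flag_by_particle(text, particle, exceptions, marks=u",.;?"):
    """Return "negative" if `particle` directly follows one of `marks`
    and none of the `exceptions` words follows that same mark anywhere
    in `text`; otherwise return None.

    This mirrors the whole-text `in` tests of the listing above.
    """
    for mark in marks:
        if mark + particle in text and not any(mark + w in text for w in exceptions):
            return "negative"
    return None


# Assumed example values: the particle and the exception words are guesses.
exceptions = [u"的确", u"的士"]
spliced = u"天气很热,的消防员到了"   # a clause starts with a bare particle
natural = u"天气很热,的确很热"       # the particle is part of a real word

print(flag_by_particle(spliced, u"的", exceptions))  # negative
print(flag_by_particle(natural, u"的", exceptions))  # None
```

Keeping the marks, particles, and exception words in data rather than in copied if-statements makes it easier to add a rule without introducing a typo in one branch.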
