Spam-screening techniques for classified-information websites

Source: Internet
Author: User


Baidu and Google have been fighting a war of words in recent days. Baidu blamed the unfairness on spam, and Google retorted that countering spam with machines and technical means has long been standard practice, so Baidu's argument is just an excuse. Whoever is right, one fact is indisputable: spam has become a public enemy of the information age.

Across classified-information websites, spam is rampant. Spam badly degrades the user experience: if a site is flooded with junk, users' trust in it drops sharply. Last year the used-car sections of Ganji and 58 were filled with listings for smuggled vehicles from other regions. Today that spam is essentially gone, and the share of spam in their other sections has also fallen considerably. Outside these commercial sites, however, many other classified-information sites still carry a great deal of spam.

Here I would like to share the anti-spam techniques and experience of Liebiao (the "List Network", http://www.liebiao.com/) with the webmasters of the major classified-information sites. The methods are simple, and a webmaster who is familiar with programming can implement them fairly easily. In practice, they have kept spam on Liebiao within an acceptable range.

Method one: Extract contact details from posts and build a contact blacklist

However much spammers change their IP addresses and their content, their contact details stay the same. Based on this, we can build a blacklist database of contact details. The contact methods Chinese users rely on most today are phone numbers, QQ numbers, URLs, and email addresses. Each has a recognizable pattern and is easy to extract with regular expressions.
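As a rough illustration of the extraction step, here is a minimal Python sketch. The patterns and the function name `extract_contacts` are my own assumptions for illustration, not the rules Liebiao actually uses, and real listings would need broader, well-tested patterns.

```python
import re

# Illustrative patterns only (assumptions, not production rules).
PATTERNS = {
    # 11-digit mainland mobile number, not embedded in a longer digit run
    "mobile": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    # landline: area code starting with 0, optional hyphen, 7-8 digit local number
    "landline": re.compile(r"(?<!\d)0\d{2,3}-?\d{7,8}(?!\d)"),
    # a 5-11 digit number shortly after the letters "QQ"
    "qq": re.compile(r"qq\D{0,3}([1-9]\d{4,10})", re.IGNORECASE),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"(?:https?://|www\.)[^\s<>\"']+", re.IGNORECASE),
}


def extract_contacts(text):
    """Return a set of (kind, value) pairs found in the post text."""
    found = set()
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Use the captured group when the pattern has one (the QQ number).
            value = match.group(1) if pattern.groups else match.group(0)
            found.add((kind, value.lower()))
    return found
```

Each extracted pair can then serve as a key in the blacklist database.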

So how is the blacklist built? On Liebiao, a post that receives five bad ratings from users is automatically marked as "bad-rated" and hidden. A post against which a user complaint has been confirmed is also marked as bad-rated. When a post is marked as bad-rated, every contact detail it contains is stored in the blacklist database, or, if it is already there, its frequency field is incremented by 1. The blacklist database therefore records each contact detail together with how often it has appeared and when it last appeared. All of this is done automatically by the machine, except that user complaints require manual review.
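The bookkeeping described above could look roughly like the following sketch. The table layout, the function names, and the shape of the `post` dictionary are hypothetical; only the five-rating limit, the frequency field, and the last-seen time come from the description.

```python
import sqlite3
from datetime import datetime

BAD_RATING_LIMIT = 5  # from the text: five bad ratings hide a post


def open_blacklist(path=":memory:"):
    """Create (if needed) and open a hypothetical blacklist table."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS blacklist (
               contact   TEXT PRIMARY KEY,
               frequency INTEGER NOT NULL,
               last_seen TEXT NOT NULL)"""
    )
    return db


def record_bad_post(db, contacts, now=None):
    """Insert each contact of a bad-rated post, or bump its frequency by 1."""
    stamp = (now or datetime.now()).isoformat()
    for contact in contacts:
        cur = db.execute(
            "UPDATE blacklist SET frequency = frequency + 1, last_seen = ? "
            "WHERE contact = ?",
            (stamp, contact),
        )
        if cur.rowcount == 0:  # contact not seen before
            db.execute("INSERT INTO blacklist VALUES (?, 1, ?)", (contact, stamp))
    db.commit()


def on_bad_rating(db, post):
    """post is a hypothetical dict with 'bad_ratings', 'hidden', 'contacts'."""
    post["bad_ratings"] += 1
    if post["bad_ratings"] >= BAD_RATING_LIMIT and not post["hidden"]:
        post["hidden"] = True
        record_bad_post(db, post["contacts"])
```

A confirmed user complaint would call `record_bad_post` directly after the manual review.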

With the blacklist in place, it can be used to identify spam. On Liebiao, the machine periodically checks the posts that users have published. If a post contains a contact detail that is in the blacklist database, and that contact detail has a frequency greater than 1 and last appeared less than six months ago, the post is deleted automatically. The six-month limit gives the publisher a chance to reform.
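The periodic check can be sketched as follows. The in-memory dictionary stands in for the blacklist database, and the names and the 182-day approximation of "six months" are assumptions of mine.

```python
from datetime import datetime, timedelta

MIN_FREQUENCY = 2              # "frequency greater than 1" in the text
MAX_AGE = timedelta(days=182)  # roughly six months (assumed approximation)


def is_spam(contacts, blacklist, now=None):
    """blacklist maps a contact string to (frequency, last_seen datetime)."""
    now = now or datetime.now()
    for contact in contacts:
        entry = blacklist.get(contact)
        if entry is None:
            continue
        frequency, last_seen = entry
        # Both conditions must hold: a repeat offender, seen recently.
        if frequency >= MIN_FREQUENCY and now - last_seen < MAX_AGE:
            return True
    return False
```

A contact that has been quiet for more than six months no longer triggers deletion, which is what gives the publisher a chance to reform.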

The above is only a simple outline. In practice there are many details to consider, and excessive punishment must be avoided. For example, the data needs preprocessing before contact details are extracted: variants of a digit such as ① and Ⅰ are converted to 1, spaces between digits are removed, and so on. Also, a post that users report as coming from an intermediary should not be marked as bad-rated. Instead it is reclassified as an intermediary post, and its phone number is recorded as an intermediary's number, so that when the intermediary later publishes housing listings the system automatically recognizes them as intermediary posts. If intermediary posts were bad-rated as well, intermediaries could no longer post at all, which would be going too far.
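The digit clean-up can be done largely with Unicode normalization. The sketch below is one possible approach under my own assumptions: NFKC folds full-width and circled digits to ASCII, while Roman-numeral characters are mapped by hand first, because NFKC would turn Ⅰ into the Latin letter "I" rather than the digit 1.

```python
import re
import unicodedata

# Roman-numeral code points Ⅰ..Ⅸ (U+2160..U+2168) mapped to "1".."9".
# Handled before NFKC, which would otherwise produce Latin letters.
ROMAN_DIGITS = {0x2160 + i: str(i + 1) for i in range(9)}


def normalize_digits(text):
    """Fold digit look-alikes to ASCII and drop whitespace between digits."""
    text = text.translate(ROMAN_DIGITS)
    text = unicodedata.normalize("NFKC", text)  # e.g. full-width or circled 1 -> "1"
    return re.sub(r"(?<=\d)\s+(?=\d)", "", text)
```

Running this before the regular expressions means one set of patterns covers the disguised numbers as well.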

Another case to consider: a user publishes a large number of fake enrollment ads in the training category, and his contact details end up in the blacklist. If that user later wants to publish a housing listing, he cannot, because his contact details are blacklisted. A better solution is to add a category field to the blacklist database and to check that field as well when judging spam against the blacklist. This avoids the problem.
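One simple way to picture the category field is to key the blacklist on the pair of category and contact. The class below is a hypothetical in-memory stand-in for the database table.

```python
from collections import defaultdict


class CategoryBlacklist:
    """Blacklist keyed by (category, contact) instead of contact alone."""

    def __init__(self):
        self._frequency = defaultdict(int)

    def add(self, category, contact):
        self._frequency[(category, contact)] += 1

    def frequency(self, category, contact):
        # .get avoids creating an entry for a contact that was never added
        return self._frequency.get((category, contact), 0)
```

A number that was blacklisted for fake training ads then has a frequency of 0 in the housing category and is not blocked there.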

Method two: Identify and delete posts from out-of-town merchants

One characteristic of classified-information sites is that they are local: users come to the site for local rentals, making friends, services, and similar information. A post that contains an out-of-town phone number can therefore be treated as spam. Whether a number is out of town can be determined from a mobile-number home-location database and a landline area-code database. The method does not suit every category. Making friends and missing-person notices, for example, should not be filtered this way. For categories such as used cars and services, however, it can filter out the out-of-town posts entirely.
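A minimal sketch of the out-of-town check follows. The two lookup tables stand in for the home-location and area-code databases. The mobile-prefix rows are made-up placeholders rather than real assignment data, and the category names are assumptions.

```python
AREA_CODES = {"010": "Beijing", "021": "Shanghai"}
# Placeholder rows: a real table maps 7-digit mobile prefixes to cities.
MOBILE_PREFIXES = {"1380013": "Beijing", "1391234": "Shanghai"}
LOCAL_ONLY_CATEGORIES = {"used_cars", "services"}  # hypothetical category keys


def phone_city(number):
    """Return the city a normalized phone number belongs to, or None if unknown."""
    if number.startswith("1") and len(number) == 11:
        return MOBILE_PREFIXES.get(number[:7])
    for length in (3, 4):  # area codes are 3 or 4 digits including the leading 0
        city = AREA_CODES.get(number[:length])
        if city:
            return city
    return None


def is_remote(number, site_city, category):
    """True only for local-only categories and a number known to be elsewhere."""
    if category not in LOCAL_ONLY_CATEGORIES:
        return False
    city = phone_city(number)
    return city is not None and city != site_city
```

Numbers missing from the tables are treated as local here, which errs on the side of not punishing legitimate posts.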

Method three: In certain categories, limit each user to one post per day

Too much duplicate information also hurts the user experience. Here, duplicate information means identical or similar posts published by the same user or merchant (including posters hired by the merchant). The affected categories include life services, business services, training, friends, and vehicles. So how is duplicate posting prevented? On Liebiao, of all the posts a user publishes in these categories within one day, only the most recent one is kept and the rest are deleted.
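The keep-only-the-latest rule can be sketched as below. The post dictionaries and category keys are hypothetical, and the sketch groups posts by user ID only. Catching a merchant's hired posters would require grouping by contact detail instead.

```python
from datetime import datetime

LIMITED_CATEGORIES = {
    "life_services", "business_services", "training", "friends", "vehicles",
}


def posts_to_delete(posts):
    """posts: list of dicts with 'id', 'user', 'category', 'created' (datetime).

    Returns the IDs of every post except the newest one per user,
    per limited category, per calendar day.
    """
    latest = {}
    for post in posts:
        if post["category"] not in LIMITED_CATEGORIES:
            continue
        key = (post["user"], post["category"], post["created"].date())
        if key not in latest or post["created"] > latest[key]["created"]:
            latest[key] = post
    keep = {post["id"] for post in latest.values()}
    return [
        post["id"]
        for post in posts
        if post["category"] in LIMITED_CATEGORIES and post["id"] not in keep
    ]
```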

Method four: Keyword filtering

Finally, don't forget keyword filtering. Harmful and sensitive keywords must be filtered out.
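In its simplest form this is a substring check against a word list, as in the sketch below. The list contents and the function name are placeholders of mine.

```python
BANNED_KEYWORDS = ("smuggled", "fake invoice")  # placeholder entries


def find_banned(text, keywords=BANNED_KEYWORDS):
    """Return the banned keywords that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return [keyword for keyword in keywords if keyword in lowered]
```

A post for which the returned list is non-empty can be rejected or held for review.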

These are several simple and effective anti-spam methods that a classified-information site can adopt. Filtering spam with a Bayesian algorithm as well would make the result even better.

The idea behind Bayesian spam filtering is as follows. Starting from a sample library of posts that people have labeled as spam or non-spam, segment each sample into words and build a word-weight database: a word's weight increases when it appears in spam and decreases when it appears in normal posts. Once the database is built, a new post is segmented, the weights of its words are looked up, and an overall weight is calculated. If that value exceeds a predefined threshold, the post is judged to be spam. As people keep reviewing the judgments and correcting the errors, the word-weight database becomes more accurate, and so does the filter.
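To make the idea concrete, here is a small sketch that uses the standard naive Bayes formulation with add-one smoothing, in which a word's "weight" is its log-odds of appearing in spam versus normal posts. This is my choice of formulation, not necessarily what any particular site runs. The class name is hypothetical, and the posts are assumed to be segmented into words already.

```python
import math
from collections import Counter


class WordWeights:
    """Word-weight table trained from labeled, already-segmented posts."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        """label is 'spam' or 'ham' (normal post)."""
        self.counts[label].update(tokens)
        self.docs[label] += 1

    def weight(self, token):
        """Log-odds of the token: positive leans spam, negative leans normal."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))

        def prob(label):
            total = sum(self.counts[label].values())
            # Add-one smoothing, with one extra slot for unseen words.
            return (self.counts[label][token] + 1) / (total + vocab + 1)

        return math.log(prob("spam") / prob("ham"))

    def score(self, tokens):
        prior = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        return prior + sum(self.weight(token) for token in tokens)

    def is_spam(self, tokens, threshold=0.0):
        return self.score(tokens) > threshold
```

Feeding manually corrected judgments back through `train` is what lets the weights improve over time.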

On a classified-information site, the same word carries a different weight in different categories, because categories differ in how often they use each word. Separate word weights should therefore be kept for each category.

A Bayesian filter is not very hard to implement, and ready-made implementations are available online. The hardest part is segmenting Chinese sentences accurately. That requires a large word-segmentation dictionary, and also several servers with high-performance CPUs to segment and score the tens of thousands of new posts that arrive each day. It is an anti-spam technique with high accuracy and high cost.
