Sound techniques are important for dealing with spam on the Web. In the previous installment of this column, I showed how to use workflow measures to thwart spam robots while causing as little inconvenience as possible to legitimate users. In this article, I discuss other aspects of the spam problem, including the relationships between content sites and how the victims of spam can collaborate to counter the spammers.
Community policing
The most effective way to deal with Web spam is through community action. Spam is an attack at scale, so it urgently demands resistance at scale. If communities help discover and expose the behavior and content patterns of spammers, those patterns can be shared so that spam robots cannot operate easily, or at least not cheaply. Community action is especially useful against forms of spam that the techniques described in the previous installment cannot reach. First, though, let me take a moment to introduce linkback spam.
Linkback spam
The purpose of weblogs and other Web articles is to share insights and discoveries. Sometimes, inspired by a blog entry, people leave a comment directly on the weblog. Sometimes they respond to the entry in entries or articles of their own, and linkback is the umbrella technical term for the methods of notifying a site that another site has linked to it. It is a network signal, something like a "ping", that declares the relationship between one item and another and helps readers find related content. Linkback, however, also gives spammers an opportunity to abuse weblogs by sending so-called "sping" (short for "spam ping"). There are three common types of linkback, and all of them have spam problems.
Refback: When a Web browser user follows a link from one page to another, the request for the second page includes an HTTP header (called the referrer) containing the URL of the first page. This mechanism is known as "refback". The second site may track refbacks, may publish a list of them, and may even follow each refback to the originating site and extract information such as its title, metadata, link text, and other text surrounding the link. Spammers exploit this to get their clients' links embedded on legitimate sites and so boost the clients' search engine profiles: they send requests to the target site with a client link in the referrer field. This is called "referrer spam". The common countermeasure is to maintain a blacklist of referring sites, as sketched below.
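Here is a minimal sketch in Python of what such a blacklist check might look like; the domain names and the is_referrer_spam helper are illustrative assumptions, not part of any particular weblog package.

from urllib.parse import urlparse

REFERRER_BLACKLIST = {
    "example-pills.com",       # hypothetical spam domain
    "cheap-watches.example",   # hypothetical spam domain
}

def is_referrer_spam(referrer_header):
    """Return True if the request's Referer header points at a
    blacklisted domain and should be excluded from refback lists."""
    if not referrer_header:
        return False
    host = urlparse(referrer_header).hostname or ""
    # Match the blacklisted domain itself and any subdomain of it
    return any(host == d or host.endswith("." + d)
               for d in REFERRER_BLACKLIST)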
TrackBack: Refback relies on users actually following links, but some webloggers wanted a more deliberate form of linkback notification, among them Six Apart, one of the main weblog software vendors. TrackBack is a specification covering "discovery" (where and how to send a ping) and a ping process that uses simple HTTP requests. TrackBack's discovery information is Resource Description Framework (RDF) markup embedded within comments in the page. An author who references an entry causes a ping to be sent to the advertised endpoint. TrackBack became popular among webloggers, and spammers follow webloggers. Spammers crawl weblogs looking for TrackBack listeners and send fake notifications promoting their client sites. TrackBack spam has become such a problem that many weblogs have had to disable TrackBack entirely, just as some sites disable the comment forms on weblog entries. Responses to TrackBack spam are similar to the content-analysis countermeasures covered later in this article and to the blacklists used against referrer spam.
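To make the ping process concrete, here is a minimal TrackBack client sketch using only Python's standard library. Per the specification, the ping is an ordinary form-encoded HTTP POST and the response is a small XML document whose error element is 0 on success; the endpoint URL is discovered from the RDF block in the target page, and here it is assumed to be known already.

from urllib.request import urlopen
from urllib.parse import urlencode

def send_trackback(ping_url, entry_url, title, excerpt, blog_name):
    """POST a TrackBack ping and return the raw XML response."""
    data = urlencode({
        "url": entry_url,
        "title": title,
        "excerpt": excerpt,
        "blog_name": blog_name,
    }).encode("utf-8")
    with urlopen(ping_url, data) as resp:   # data present, so this is a POST
        return resp.read().decode("utf-8")

# Hypothetical usage:
# print(send_trackback("http://example.com/trackback/42",
#                      "http://myblog.example/entry/7",
#                      "My response", "I wrote about this...", "My Blog"))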
Pingback: Pingback is very close to TrackBack, except that the notification message takes the form of an XML-RPC request rather than a plain HTTP request. The discovery mechanism is also cleaner, using an HTML link element or an HTTP header rather than embedded RDF. Finally, a pingback receiver follows the ping back to the originating site to verify that it really does contain the claimed link, a step that weeds out a great many spammers. TrackBack implementations can perform such checks too, but in Pingback the check is an integral part of the specification. Apart from this verification, anti-spam measures for Pingback are essentially the same as for TrackBack.
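For comparison with the TrackBack sketch, here is a minimal Pingback client, assuming the XML-RPC endpoint has already been discovered from the target page's X-Pingback HTTP header or its pingback link element.

import xmlrpc.client

def send_pingback(endpoint_url, source_uri, target_uri):
    """Call the standard pingback.ping method. A conforming receiver
    fetches source_uri and verifies that it really links to
    target_uri before registering the pingback."""
    server = xmlrpc.client.ServerProxy(endpoint_url)
    return server.pingback.ping(source_uri, target_uri)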
Content analysis
Whether spam comes from a robot or from a mechanical Turk attack, it has a fundamental weakness. The spammers' main purpose is to raise their clients' rankings in search engines, a practice known as black-hat search engine optimization (SEO), or "spamdexing". The whole exercise is pointless unless the spammers actually get their links placed, which means communities can share information about spam clients' links, including statistics on reported links and link patterns. With rare exceptions, spammers are not really trying to advertise directly to a site's visitors, nor to vandalize other people's property for its own sake. This is one reason so many approaches and strategies for dealing with spam exist.
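A small sketch of the link-pattern idea: extract the URLs from a submitted comment so the linked domains can be checked against, or reported to, a shared list of known spam clients. The regular expression and the helper name are illustrative assumptions.

import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_link_domains(comment_text):
    """Return the set of domains linked from a comment; these links are
    the spammer's real payload and the most useful thing to share."""
    return {urlparse(u).hostname for u in URL_RE.findall(comment_text)}

print(extract_link_domains(
    "Nice post! See http://cheap-pills.example/buy and "
    "https://cheap-pills.example/deal"))
# {'cheap-pills.example'}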
Statistical analysis
The basic idea of content analysis differs from the community approach. It involves automatically examining content, using statistical analysis of the text to determine how closely it resembles anything previously marked as spam. The most common statistical method is Bayesian inference, which is covered in more detail in another IBM developerWorks article (see Resources). Whenever someone marks content as spam or as legitimate (the latter sometimes called "ham"), the statistics are updated. Bayesian inference has a distinguished record in dealing with spam, as well as in areas unrelated to spam, such as customer-affinity engines and product-recommendation engines.
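The following toy sketch shows the idea behind such a Bayesian filter, not a production implementation. Word counts are updated each time a human marks a comment as spam or ham, and new comments are scored from those counts with Laplace smoothing.

import math
from collections import Counter

spam_counts, ham_counts = Counter(), Counter()
n_spam = n_ham = 0

def train(text, is_spam):
    """Update the word counts after a human labels a comment."""
    global n_spam, n_ham
    counts = spam_counts if is_spam else ham_counts
    for word in text.lower().split():
        counts[word] += 1
    if is_spam:
        n_spam += 1
    else:
        n_ham += 1

def spam_score(text):
    """Log-odds that the text is spam; positive values lean spam,
    negative values lean ham."""
    score = math.log((n_spam + 1) / (n_ham + 1))
    vocab = len(set(spam_counts) | set(ham_counts)) + 1
    total_spam = sum(spam_counts.values())
    total_ham = sum(ham_counts.values())
    for word in text.lower().split():
        p_spam = (spam_counts[word] + 1) / (total_spam + vocab)
        p_ham = (ham_counts[word] + 1) / (total_ham + vocab)
        score += math.log(p_spam / p_ham)
    return score

train("cheap pills buy now", True)
train("great post thanks for sharing the code", False)
print(spam_score("buy cheap pills"))   # positive: leans spam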
Bringing in the community
Content analysis is even more effective when it can be performed across multiple target sites. Spammers typically work across a large number of such sites, sending out robots to hit them with spam one by one. If spam is detected on one site, the other sites can then recognize the same posting, much as the shared lists of spam-bot registration e-mail and IP addresses mentioned earlier work. Several commercial (and semi-commercial) services have emerged to support this kind of collaborative content analysis.
Akismet is one of the best-known collaborative content-analysis systems, a commercial service run by a company associated with the WordPress weblog platform. Users submit comments to the service and receive back a flag indicating whether each has been identified as spam. Akismet is a vivid illustration of how thorny the problem is. It is a closed service, meaning that neither users nor observers can see how its spam detection works. The rationale is that spammers cannot game a system they know nothing about, but the downside is that outsiders cannot help improve the service. Worse, when legitimate comments are flagged as spam, there is little basis for trusting the service. Akismet has in fact been accused of wrongly blacklisting people, including some who criticized the company or its founders. Similar services include Mollom, Defensio, and the Project Honey Pot Spam Domains List (PHSDL).
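As a sketch of how a client might query such a service, here is a call to Akismet's comment-check operation (version 1.1 of its REST API); the API key, blog URL, and comment details below are placeholders. The service replies with the literal text "true" (spam) or "false" (ham).

from urllib.request import urlopen
from urllib.parse import urlencode

API_KEY = "your-akismet-key"   # placeholder: issued when you register
ENDPOINT = "https://%s.rest.akismet.com/1.1/comment-check" % API_KEY

def check_comment(blog_url, user_ip, user_agent, content):
    """Return True if Akismet flags the comment as spam."""
    data = urlencode({
        "blog": blog_url,
        "user_ip": user_ip,
        "user_agent": user_agent,
        "comment_type": "comment",
        "comment_content": content,
    }).encode("utf-8")
    with urlopen(ENDPOINT, data) as resp:
        return resp.read().decode("utf-8") == "true"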