One Way for a BSP to Prevent Trackback Spam

Source: Internet
Author: User

Trackback is a good thing, but because it has no mechanism for verifying the sender, it has been plagued by spam. CSDN and DoNews, for example, are both affected. Many people have even been debating whether trackback deserves to die.

That debate is futile. On the technical side, some people have already tried solutions (1, 2, 3, 4), but most of them are probably neither simple nor practical. This article introduces a simple, practical method and describes how it was used to clear the spam from the CSDN blog system.

Solution

An important cause of trackback spam is the abnormal behavior of a large number of webmasters, including many low-quality foreign SEO practitioners. Only by analyzing this behavior can we solve the trackback spam problem.

As we know, trackback spammers have two goals: first, to get users to visit their websites through the links (for example, pornographic sites); second, to get search engines to index their sites and raise their ranking. Their trackbacks therefore almost always carry a URL that points back to their own site. By analyzing these URLs we can find useful patterns, and then clear the trackback spam based on those patterns.

Collect trackback Information

The trackbacks of the CSDN blog are stored in a table named blog_feedback. First, we extract the primary domain name from each trackback address (the displayurl column). You can add a new displayurlroot column to the table to hold it.

UPDATE blog_feedback SET displayurlroot = dbo.geturlroot(displayurl) WHERE feedbacktype = 2

geturlroot is a user-defined function that extracts the primary domain name from the trackback address.

CREATE FUNCTION [dbo].[geturlroot]
(@surl nvarchar(256))
RETURNS nvarchar(256)
AS
BEGIN
    DECLARE @m_f int
    DECLARE @m_len int
    DECLARE @ret nvarchar(256)
    SET @ret = ''

    -- "//" starts at position 6 only in "http://" style URLs
    SELECT @m_f = PATINDEX('%//%', @surl)
    IF (@m_f = 6) -- is http URL format
    BEGIN
        SELECT @m_len = LEN(@surl)
        -- find the first '/' after the "http://" prefix (7 characters)
        SELECT @m_f = PATINDEX('%/%', SUBSTRING(@surl, 8, @m_len - 7))
        IF (@m_f != 0)
        BEGIN
            SELECT @ret = SUBSTRING(@surl, 8, @m_f - 1)
        END
        ELSE -- no '/' found after the host name
        BEGIN
            SELECT @ret = SUBSTRING(@surl, 8, @m_len - 7)
        END
        -- drop the first label of the host, e.g. "www.example.com" -> "example.com"
        SELECT @m_f = PATINDEX('%.%', @ret)
        SELECT @ret = SUBSTRING(@ret, @m_f + 1, LEN(@ret) - @m_f)
    END
    RETURN @ret
END

In this way we obtain a primary domain name for each trackback address. All of the statistical analysis that follows is based on this new column, so we should add an index to it.
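To sanity-check the extraction rule on sample URLs before running the UPDATE, the logic of geturlroot can be sketched in Python. This is only an approximate mirror of the T-SQL above, not part of the original method: the name get_url_root is hypothetical, and the case-insensitive prefix check assumes a case-insensitive database collation.

```python
def get_url_root(url: str) -> str:
    """Approximate Python mirror of the T-SQL geturlroot function."""
    prefix = "http://"
    # The SQL version only accepts URLs whose "//" starts at position 6,
    # which in practice means the "http://" scheme.
    if not url.lower().startswith(prefix):
        return ""
    rest = url[len(prefix):]
    host = rest.split("/", 1)[0]  # everything up to the first '/'
    dot = host.find(".")
    # With no '.' in the host, the SQL SUBSTRING keeps the whole host.
    if dot == -1:
        return host
    # Drop the first label: "www.example.com" -> "example.com"
    return host[dot + 1:]


print(get_url_root("http://spam.blogspot.com/2006/01/post.html"))
print(get_url_root("http://www.csdn.net"))
```

Note that the rule always drops the first label, so a bare two-label host such as http://example.com/ is reduced to "com", and https:// URLs yield an empty string. Both behaviors come from the original function.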

Analyze trackback information

Run the following statement in SQL Query Analyzer:

SELECT displayurlroot, COUNT(id) FROM blog_feedback WHERE displayurlroot IS NOT NULL
GROUP BY displayurlroot HAVING (COUNT(id) > 50) ORDER BY COUNT(id) DESC

This lists every primary domain name that appears in more than 50 trackback addresses. The top five for the CSDN blog are:

Domain Name: alice.it entries: 39337
Domain Name: blogspot.com entries: 14447
Domain Name: editme.com entries: 8439
Domain Name: aol.com entries: 7043
Domain Name: psl.lt entries: 6907

On inspection, every one of these entries is trackback spam.
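The same grouping can be tried offline on an exported list of domain roots. The following is a minimal Python sketch of the GROUP BY / HAVING / ORDER BY step; the function name frequent_roots and the in-memory list are assumptions for illustration, not part of the CSDN system.

```python
from collections import Counter


def frequent_roots(roots, threshold=50):
    """Return (root, count) pairs with count > threshold, most frequent first."""
    # Skip missing values, like "WHERE displayurlroot IS NOT NULL".
    counts = Counter(r for r in roots if r is not None)
    # most_common() orders by count descending, like "ORDER BY COUNT(id) DESC".
    return [(root, n) for root, n in counts.most_common() if n > threshold]


sample = ["alice.it"] * 3 + ["csdn.net"] * 2 + [None, "x.com"]
print(frequent_roots(sample, threshold=1))
```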

Trackback cleanup method

In the CSDN blog, 367 primary domain names each appear in more than 50 trackback addresses. We do not need to check the large number of unknown foreign domains; we only need to screen the well-known and related websites. We found that, for CSDN blog trackbacks, only those from the following three domains are entirely "clean": csdn.net, cnblogs.com, and donews.net. In addition, about half of the trackbacks from msn.com are "clean".

Now we clean up. Execute the following statement:

DELETE FROM blog_feedback WHERE displayurlroot IN
(SELECT displayurlroot FROM blog_feedback WHERE displayurlroot IS NOT NULL AND
displayurlroot NOT IN ('csdn.net', 'donews.net', 'cnblogs.com', 'msn.com')
GROUP BY displayurlroot HAVING (COUNT(id) > 50))

After the statement runs, more than 240,000 spam trackbacks disappear at once, leaving only 16,394 useful ones.
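The cleanup rule (delete every trackback whose domain root is both frequent and not on the whitelist) can be sketched in Python for a dry run on exported rows. The function name remove_bulk_spam and the dictionary row format are assumptions for illustration; the field name displayurlroot follows the table described above.

```python
from collections import Counter


def remove_bulk_spam(rows, whitelist, threshold=50):
    """Keep rows unless their root is frequent (> threshold) and not whitelisted."""
    counts = Counter(
        r["displayurlroot"] for r in rows if r["displayurlroot"] is not None
    )
    spam_roots = {
        root for root, n in counts.items()
        if n > threshold and root not in whitelist
    }
    # Rows with a missing root are never matched, as with SQL's IN on NULL.
    return [r for r in rows if r["displayurlroot"] not in spam_roots]


rows = [{"id": i, "displayurlroot": "alice.it"} for i in range(3)]
rows.append({"id": 10, "displayurlroot": "csdn.net"})
print(remove_bulk_spam(rows, {"csdn.net"}, threshold=2))
```

Running the filter on a copy of the data first makes it easy to review which domain roots would be deleted before executing the DELETE statement.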

Which well-known websites carry a lot of trackback spam?

The following is purely the result of statistical analysis of the CSDN blog. It shows only that some people use these well-known websites to spread spam:

Domain Name: blogspot.com entries: 14447
Domain Name: aol.com entries: 7043
Domain Name: blog.hexun.com entries: 6621
Domain Name: blog.ccidnet.com entries: 4980
Domain Name: netscape.com entries: 1378
Domain Name: baidu.com entries: 804
Domain Name: a8.com entries: 704
Domain Name: blog.sohu.com entries: 344
Domain Name: china.alibaba.com entries: 243

Conclusion

Based on a statistical analysis of the primary domain names of trackback addresses, this article presents a method for removing trackback spam in bulk; it worked well on the CSDN blog. Large BSPs may find it a useful reference.

Update:
As mentioned above, about half of the trackbacks from msn.com are "clean". How do we clean up the rest? I found that 99.9% of normal trackbacks from msn.com have no title, while the spam trackbacks all have titles, which is a useful clue. Run:

DELETE FROM blog_feedback WHERE displayurlroot = 'msn.com' AND title <> '' AND id NOT IN (limited useful IDs)

This clears that spam. After the statement runs, the number of normal trackbacks on the CSDN blog drops to 15,411.
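The msn.com title heuristic can be sketched in Python as well. The function name remove_titled_msn, the row format, and the keep_ids parameter (standing in for the manually identified useful IDs) are assumptions for illustration.

```python
def remove_titled_msn(rows, keep_ids=frozenset()):
    """Drop msn.com rows that have a non-empty title, unless explicitly kept."""
    def is_spam(r):
        return (
            r["displayurlroot"] == "msn.com"
            and r["title"] not in ("", None)  # empty or missing title counts as normal
            and r["id"] not in keep_ids       # manually reviewed exceptions
        )
    return [r for r in rows if not is_spam(r)]


rows = [
    {"id": 1, "displayurlroot": "msn.com", "title": ""},
    {"id": 2, "displayurlroot": "msn.com", "title": "cheap pills"},
]
print(remove_titled_msn(rows))
```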

From: http://blog.csdn.net/zdg once high blog
