Development Process of the online pseudo-original tool www.bolewei.com

Preface:

Baidu Baike defines pseudo-original as follows: "the so-called pseudo-original means reprocessing an article so that search engines treat it as an original article, thereby increasing the weight of the website". For programmers, pseudo-original really just means "replacing a large number of words in the article with synonyms".

SEO people, especially those who do black-hat SEO, certainly know that pseudo-original content is critical to search engine optimization. In English, pseudo-original is called "spinning". Mature pseudo-original tools include TBS and SpinnerChief, and TBS is quite expensive. Mature and reliable pseudo-original tools for Chinese, however, have been rare. I searched Baidu for "Chinese pseudo-original" and tried several of the tools on the first results page; none of them was satisfactory.

So I developed Bole pseudo-original, http://www.bolewei.com/. Interested readers can try it out first. (Note that this is only a test version and is still far from an official release. O(∩_∩)O Haha~)

Issues to consider for a Chinese pseudo-original tool

I began by asking where the current Chinese pseudo-original tools fall short. I came up with the following:

I. Word segmentation

Chinese differs from English: search engine spiders, corpus analysis, and other natural language processing tasks all require word segmentation first. Starting the "pseudo-original" replacement without a segmentation pre-processing step is definitely unreliable. For example, "naive" and "pure" are synonyms. Without word segmentation, "I have a naive cousin" can still be turned into "I have a pure cousin", but turning "today is really hot" into "today is pure hot" is seriously wrong. (In Chinese, the characters for "naive", 天真, presumably straddle the boundary between 今天 "today" and 真 "really" in 今天真热, which is why a raw string replacement matches them.)
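The difference between raw string replacement and token-based replacement can be sketched as follows. This is a minimal illustration, not the tool's actual code: the class and method names are hypothetical, the Chinese example strings are an assumption about what the original example looked like, and the token arrays are segmented by hand where a real program would call a word segmenter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SegmentationDemo
{
    // Naive approach: replace the raw substring wherever it occurs,
    // even when it straddles a word boundary.
    public static string ReplaceRaw(string text, string word, string synonym)
    {
        return text.Replace(word, synonym);
    }

    // Token-aware approach: replace only tokens that equal the word exactly.
    // The tokens are assumed to come from a word segmenter.
    public static string ReplaceTokens(IEnumerable<string> tokens, string word, string synonym)
    {
        return string.Concat(tokens.Select(t => t == word ? synonym : t));
    }
}
```

With these helpers, ReplaceRaw("今天真热", "天真", "纯真") returns the broken "今纯真热", while ReplaceTokens(new[] { "今天", "真", "热" }, "天真", "纯真") leaves the sentence unchanged, and ReplaceTokens(new[] { "我", "有", "一个", "天真", "的", "表妹" }, "天真", "纯真") returns "我有一个纯真的表妹".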

II. Uniform results caused by sequential execution

I tried both the popular Chinese pseudo-original dictionaries and the ready-made Chinese pseudo-original tools available on the Internet. Their dictionary format is as follows:

 

The common algorithm replaces in sequence. So although the synonyms of "Mortal" include "dust Atlas", "dust room", and "Earth", the word will never be replaced with either of the last two.

The ideal is to pick a suitable replacement according to the context. For example, synonyms of "naive" include "innocent", "pure", and "simple", and not every one of them fits in every context. Context awareness, of course, belongs to advanced natural language processing. Put simply, it is fine if the pseudo-original tool reaches a less-than-ideal but acceptable state: when a word has multiple synonyms, replace it with one of them at random.
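Random selection among several synonyms can be sketched like this. It is a minimal example under stated assumptions: the dictionary is held in memory as a map from a word to an array of synonyms, and the names SynonymPicker and Pick are hypothetical, not taken from the tool.

```csharp
using System;
using System.Collections.Generic;

public static class SynonymPicker
{
    // Returns a randomly chosen synonym for the word, or the word itself
    // when the dictionary has no entry (or an empty entry) for it.
    public static string Pick(IDictionary<string, string[]> dictionary, string word, Random random)
    {
        string[] synonyms;
        if (!dictionary.TryGetValue(word, out synonyms) || synonyms.Length == 0)
        {
            return word;
        }
        // Next(n) returns a value in [0, n), so every synonym can be chosen.
        return synonyms[random.Next(synonyms.Length)];
    }
}
```

Given a dictionary that maps "naive" to { "innocent", "pure", "simple" }, repeated calls to Pick(dictionary, "naive", random) can return any of the three words, so the last synonyms in the list are no longer unreachable.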

III. Dictionary fights

I downloaded most of the word lists circulating on the Internet, paid and free, and tried them all. None of them solves the dictionary fight problem.

What is the "dictionary fight problem"? It becomes clear when you look at these two groups of words.

After the replacements are executed sequentially, you will find that the text has gone in a circle and is back where it started... This is obviously not what we want, but it is very common in existing dictionaries and pseudo-original tools.
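The round trip is easy to reproduce. The sketch below uses two hypothetical English entries ("vow" to "swear" and "swear" to "vow") in place of the original word groups, which are not shown here. It contrasts sequential replacement with a single pass that looks up each token of the original text only once. The single-pass variant is one possible way to avoid the problem, not necessarily what the tool does.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryFightDemo
{
    // Sequential execution: each pair rescans the output of the previous
    // pair, so a later entry can undo an earlier one.
    public static string ReplaceSequentially(string text, IList<KeyValuePair<string, string>> pairs)
    {
        foreach (KeyValuePair<string, string> pair in pairs)
        {
            text = text.Replace(pair.Key, pair.Value);
        }
        return text;
    }

    // Single pass: every token of the ORIGINAL text is looked up once,
    // and the replaced output is never scanned again.
    public static string ReplaceSinglePass(string[] tokens, IDictionary<string, string> map)
    {
        string[] output = new string[tokens.Length];
        for (int i = 0; i < tokens.Length; i++)
        {
            string replacement;
            output[i] = map.TryGetValue(tokens[i], out replacement) ? replacement : tokens[i];
        }
        return string.Join(" ", output);
    }
}
```

With the two entries above, ReplaceSequentially("I vow to return", pairs) first produces "I swear to return" and then turns it back into "I vow to return", whereas ReplaceSinglePass("I vow to return".Split(' '), map) returns "I swear to return".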

IV. Poor readability

Why is readability so poor? Because the existing Chinese pseudo-original tools all replace across the entire article. If there are 100,000 synonyms in the dictionary, the program traverses it once and "squeezes out" every string in the original text that can be replaced. The pseudo-original succeeds, but readability comes infinitely close to zero. Let's look at an example:

After pseudo-original:

Can the reader make any sense of the pseudo-original text?

How can we solve this problem? I think we can set a percentage, or a pseudo-original level, for example "mild, moderate, violent, abnormal, and beyond the reach of the soul". When you select "mild", only 10 of every 100 replaceable synonyms are replaced; with "moderate", 50 of every 100 are replaced, and so on. In this way, users can freely trade off between "readability" and "originality".
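One way to turn a strength percentage into a concrete choice is to shuffle the replaceable positions and keep only the requested share. The sketch below is an illustration under stated assumptions: the positions of replaceable tokens are already known, and the names StrengthDemo and ChoosePositions are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class StrengthDemo
{
    // Picks which replaceable token positions to actually replace.
    // strengthPercent is expected to be between 0 and 100.
    public static List<int> ChoosePositions(IList<int> replaceable, int strengthPercent, Random random)
    {
        int quota = replaceable.Count * strengthPercent / 100;

        // Fisher-Yates shuffle on a copy, so the caller's list is untouched.
        List<int> shuffled = new List<int>(replaceable);
        for (int i = shuffled.Count - 1; i > 0; i--)
        {
            int j = random.Next(i + 1);
            int tmp = shuffled[i];
            shuffled[i] = shuffled[j];
            shuffled[j] = tmp;
        }

        // Keep the first `quota` shuffled positions, in document order.
        List<int> chosen = shuffled.GetRange(0, quota);
        chosen.Sort();
        return chosen;
    }
}
```

With 100 replaceable positions, a strength of 10 selects 10 of them and a strength of 50 selects 50, and each run selects a different subset.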

The perfect Chinese pseudo-original tool in my mind

| Function module | Function description | Technical implementation |
| --- | --- | --- |
| Smart word segmentation | Avoids the problem of "today is really hot" becoming "today is pure hot". | Call an existing open-source Chinese word segmentation component. |
| Randomness | For the same article, every pseudo-original run gives a different result. | Introduce random selection. |
| Pseudo-original strength adjustment | Defines the percentage of pseudo-original, allowing users to freely trade off between readability and originality. | Introduce level settings and random selection. |
| No dictionary fights | Avoids or reduces cases where a word is replaced with a synonym and then replaced back again. | Discard sequential execution. |
| Statistics | Counts how many words were replaced by the pseudo-original run. | Insert a tag when replacing. |
| Marking | Marks replaced words for easy review and verification. | Insert a tag when replacing. |

Core Program Implementation

Given the many problems described above, the program is fairly long, close to 10,000 lines. Below I list only a few core code segments for reference and discussion.

1. Word segmentation

Word segmentation is actually very easy, because mature open-source components are available. My favorite is SCWS. Its official website is http://www.ftphp.com/scws/

SCWS was originally a Chinese word segmentation solution for PHP, but it provides an API, so our C# program can call it without difficulty.

// Requires references to System.Web and Newtonsoft.Json.
public static string Segment(string str)
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    try
    {
        System.Net.CookieContainer cookieContainer = new System.Net.CookieContainer();
        // Convert the form data to a byte array. The text is URL-encoded, so ASCII is sufficient.
        byte[] postData = System.Text.Encoding.ASCII.GetBytes("data=" + System.Web.HttpUtility.UrlEncode(str) + "&respond=json&charset=utf8&ignore=yes&duality=no&traditional=no&multi=0");

        // Set the request parameters
        System.Net.HttpWebRequest request = System.Net.WebRequest.Create("http://www.ftphp.com/scws/api.php") as System.Net.HttpWebRequest;
        request.Method = "POST";
        request.KeepAlive = false;
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = cookieContainer;
        request.ContentLength = postData.Length;

        // Submit the request data
        System.IO.Stream outputStream = request.GetRequestStream();
        outputStream.Write(postData, 0, postData.Length);
        outputStream.Close();

        // Read the returned page
        System.Net.HttpWebResponse response = request.GetResponse() as System.Net.HttpWebResponse;
        System.IO.Stream responseStream = response.GetResponseStream();
        System.IO.StreamReader reader = new System.IO.StreamReader(responseStream, System.Text.Encoding.GetEncoding("UTF-8"));
        string val = reader.ReadToEnd();

        // Join the segmented words with spaces
        Newtonsoft.Json.Linq.JObject results = Newtonsoft.Json.Linq.JObject.Parse(val);
        foreach (var item in results["words"].Children())
        {
            Newtonsoft.Json.Linq.JObject word = Newtonsoft.Json.Linq.JObject.Parse(item.ToString());
            sb.Append(word["word"].ToString() + " ");
        }
    }
    catch
    {
        // Segmentation failed: ignore the error and return whatever was collected so far.
    }

    return sb.ToString();
}

 

2. Solving the dictionary fight problem

There is not much to say here. The program needs only one statement.

3. Randomly retrieving x records based on the percentage

The most convenient way to retrieve random records is with a MySQL database, whose built-in rand() function can be used in a SELECT. With SQL Server, the efficiency is lower.

Reference: http://www.rndblog.com/how-to-select-random-rows-in-mysql/
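For illustration, the two query shapes can be built as follows. This is a sketch: the helper class RandomRowsQuery is hypothetical, the ORDER BY NEWID() form is a common SQL Server idiom rather than something this article prescribes, and the table name must be a trusted identifier because it is inserted into the SQL text directly. Both forms sort the whole table, so they can be slow on large tables.

```csharp
using System;

public static class RandomRowsQuery
{
    // MySQL: ORDER BY RAND() shuffles the rows, LIMIT keeps the first `count`.
    public static string ForMySql(string table, int count)
    {
        return string.Format("SELECT * FROM {0} ORDER BY RAND() LIMIT {1}", table, count);
    }

    // SQL Server has no LIMIT clause; a common idiom is TOP with ORDER BY NEWID().
    public static string ForSqlServer(string table, int count)
    {
        return string.Format("SELECT TOP ({1}) * FROM {0} ORDER BY NEWID()", table, count);
    }
}
```

For example, RandomRowsQuery.ForMySql("Words", 10) returns "SELECT * FROM Words ORDER BY RAND() LIMIT 10".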

4. The main synonym replacement function

See the comments in the code.

// Requires: using System; using System.Data;
public string ReplaceArticle(string old, int strength, bool markRed, ref int replacedCount)
{
    string newArticle = old;
    int wordsAmount = 180482; // Total dictionary size

    // Calculate the number of word pairs required from the percentage passed in.
    int kwCount = Convert.ToInt32(strength * 1.0 / 100 * wordsAmount);
    // Fetch kwCount random word pairs.
    var dataSet = SqlHelper.ExecuteDataset(SqlHelper.ConnectionString, System.Data.CommandType.Text, "select * from Words order by rand() limit " + kwCount);

    // One shared generator: a new Random() per iteration can be seeded identically and repeat values.
    Random random = new Random();
    foreach (DataRow r in dataSet.Tables[0].Rows)
    {
        // For a pair a -> b in the dictionary, randomly decide between Replace(a, b) and Replace(b, a).
        // The upper bound of Next is exclusive, so Next(0, 2) yields 0 or 1 (Next(0, 1) would always yield 0).
        int i = random.Next(0, 2);
        newArticle = newArticle.Replace(r[i].ToString(), string.Format("<b>{0}</b>", r[1 - i].ToString()));
    }

    // Count the replaced words by counting the closing tags.
    replacedCount = newArticle.Split(new string[] { "</b>" }, StringSplitOptions.None).Length - 1;
    if (!markRed)
    {
        // If the user does not want the marks, remove the <b> tags.
        newArticle = newArticle.Replace("<b>", string.Empty).Replace("</b>", string.Empty);
    }

    return newArticle;
}

 

Final Program Interface

 

 

Finally

As mentioned above, Chinese pseudo-original tooling is almost a blank field. I hope this article helps you better understand and improve it.

Bole pseudo-original: http://www.bolewei.com/

 
