Short note: the development process of adding a new feature to an open source project

Last Update:2018-08-23 Source: Internet

Author: User

Tags nltk


This article is not a development journal. It is meant to share the problems I ran into during development and how I solved them, so that we can learn from each other. I am an ordinary PHP engineer, and I hope this helps junior developers. The lessons learned are summarized at the end of the article.

Earlier this month, I started a small project on GitHub: chinese-typesetting. It is a Composer package that corrects the typesetting of Chinese text.

chinese-typesetting includes the following features:

Add spaces between Chinese characters and English letters, Greek letters used in math, science, and engineering, or digits;
Limited full-width to half-width conversion (English letters, digits, spaces, and some special characters use half-width forms);
Fix incorrect punctuation;
Remove styles from HTML tags;
Remove empty HTML tags;
Remove paragraph indentation.

This week there was not much business development at the company and no overtime, so I started planning a new feature: correcting the capitalization of English proper nouns.

The data source of English proper nouns

The first question to face is:

Where does the English proper noun data come from?

My first thought was NLTK, Python's natural language processing package. It has a function called pos_tag that identifies and labels the part of speech of each word; words tagged NNP or NNPS are proper nouns. I suspected that NLTK must ship a corresponding set of proper noun data, but with my limited ability I could not find it.

With that route blocked, I searched Google and found that getting the data from an online dictionary was a feasible approach. That is how I eventually found a list of English proper nouns on Wiktionary. I then wrote a small crawler script in Python to fetch the data.

Finally, I sorted and filtered the crawled data.

The filtering rules are as follows:

Use is_numeric() to remove words such as 007;
Use the regular expression '/\W/' to remove words such as ǃXóõ;
Use strlen() to remove single-byte words such as A;
Remove words that conflict with HTML, CSS, and JavaScript reserved words.
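
The filtering rules above can be sketched roughly as follows. This is only an illustration, not the script actually used for the project: the function name filterNouns(), the sample words, and the short reserved-word list are all hypothetical, and comparing against reserved words case-insensitively is an assumption.

```php
<?php
// Hypothetical helper: filterNouns() and the sample data are illustrative only.
function filterNouns(array $words, array $reserved): array
{
    // Assumption: reserved words are compared case-insensitively.
    $reserved = array_map('strtolower', $reserved);

    $kept = array_filter($words, function (string $word) use ($reserved): bool {
        if (is_numeric($word)) {
            return false; // purely numeric entries such as "007"
        }
        if (preg_match('/\W/', $word)) {
            return false; // entries containing non-word characters such as "ǃXóõ"
        }
        if (strlen($word) === 1) {
            return false; // single-byte entries such as "A"
        }
        // Drop entries that collide with a reserved word.
        return !in_array(strtolower($word), $reserved, true);
    });

    $kept = array_unique($kept);
    sort($kept); // sort() also reindexes the array from 0

    return $kept;
}

$crawled  = ['GitHub', '007', 'ǃXóõ', 'A', 'Java', 'Body', 'PHP', 'Java'];
$reserved = ['body', 'class']; // tiny stand-in for the HTML/CSS/JavaScript list

print_r(filterNouns($crawled, $reserved)); // keeps GitHub, Java, PHP
```

Note that '/\W/' without the u modifier treats every byte of a multi-byte character as a non-word character, which is why an entry like ǃXóõ is dropped.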

How to let users customize the proper noun data

The initial code is as follows:

/**
 * Use the correct capitalization for proper nouns.
 * Correct English proper nouns.
 *
 * @param $text
 *
 * @return null|string|string[]
 */
public function properNoun($text)
{
    $dict = include __DIR__ . '/../data/dict.php';
    foreach ($dict as $noun) {
        $text = preg_replace("/\b{$noun}\b/i", $noun, $text);
    }
    return $text;
}

Then I wondered: what if developers who use this method want to extend the dictionary or ignore some of the proper nouns?
So I changed the properNoun() method as follows:

/**
 * Use the correct capitalization for proper nouns.
 * Correct English proper nouns.
 *
 * @param $text
 * @param array $extend
 * @param array $ignore
 *
 * @return null|string|string[]
 */
public function properNoun($text, array $extend = [], array $ignore = [])
{
    $dict = include __DIR__ . '/../data/dict.php';
    if ($extend) {
        $dict = array_merge($dict, $extend);
    }
    if ($ignore) {
        $dict = array_diff($dict, $ignore);
    }
    foreach ($dict as $noun) {
        $text = preg_replace("/\b{$noun}\b/i", $noun, $text);
    }
    return $text;
}
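
To make the effect of $extend and $ignore concrete, here is a minimal standalone sketch. It is not the package's code: properNounSketch() is a hypothetical free function, the dictionary is passed in as a parameter instead of being loaded from data/dict.php, and the call to preg_quote() is an addition so that user-supplied entries are not read as regex syntax.

```php
<?php
// Hypothetical standalone version; $dict stands in for the contents of data/dict.php.
function properNounSketch(string $text, array $dict, array $extend = [], array $ignore = []): string
{
    // Add the caller's extra nouns, then drop the ones the caller wants ignored.
    $dict = array_diff(array_merge($dict, $extend), $ignore);

    foreach ($dict as $noun) {
        // Case-insensitive whole-word match, replaced by the canonical spelling.
        $text = preg_replace('/\b' . preg_quote($noun, '/') . '\b/i', $noun, $text);
    }

    return $text;
}

$dict = ['GitHub', 'PHP'];

echo properNounSketch('i use php, github and laravel', $dict), "\n";
// i use PHP, GitHub and laravel

echo properNounSketch('i use php, github and laravel', $dict, ['Laravel'], ['PHP']), "\n";
// i use php, GitHub and Laravel
```

Because array_diff() compares values case-sensitively, an entry has to be passed in $ignore with exactly the spelling used in the dictionary ('PHP', not 'php').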

How to improve and optimize code logic

While writing this feature, I was also studying and referring to the implementation of some existing open source projects. After seeing a commit in the open source project Auto-correct (PS: that PR was submitted by Overtrue, a well-known expert in the community), I changed the properNoun() method again, as follows:

public function properNoun($text, array $extend = [], array $ignore = [])
{
    $dict = include __DIR__ . '/../data/dict.php';
    if ($extend) {
        $dict = array_merge($dict, $extend);
    }
    if ($ignore) {
        $dict = array_diff($dict, $ignore);
    }
    foreach ($dict as $noun) {
        $text = preg_replace("/(?<!\.|[a-z]){$noun}(?!\.|[a-z])/i", $noun, $text);
    }
    return $text;
}
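
The difference between the \b version and the lookaround version is easiest to see on a domain name. The following is a minimal sketch under stated assumptions: a single hypothetical dictionary entry, GitHub, and two illustrative helpers, withWordBoundary() and withLookaround(), which are not part of the package. preg_quote() is again an addition.

```php
<?php
// Earlier pattern: plain word boundaries on both sides of the noun.
function withWordBoundary(string $text, string $noun): string
{
    return preg_replace('/\b' . preg_quote($noun, '/') . '\b/i', $noun, $text);
}

// Later pattern: the noun must not touch a dot or a letter on either side.
function withLookaround(string $text, string $noun): string
{
    $pattern = '/(?<!\.|[a-z])' . preg_quote($noun, '/') . '(?!\.|[a-z])/i';

    return preg_replace($pattern, $noun, $text);
}

$text = 'I host code on github and github.com';

echo withWordBoundary($text, 'GitHub'), "\n";
// I host code on GitHub and GitHub.com

echo withLookaround($text, 'GitHub'), "\n";
// I host code on GitHub and github.com
```

One trade-off worth knowing: the lookarounds only exclude letters and dots, so a digit next to a noun no longer blocks a match. With a dictionary entry PHP, the text php7 is left alone by the \b pattern but becomes PHP7 with the lookaround pattern.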

How to avoid over-substitution

Just when I thought I was done, I ran the PHPUnit unit tests I had written earlier and they reported an error: with the method above, if the argument is rich text containing HTML tags, then HTML element names, attribute names, and attribute values may be replaced as well.

How can this over-replacement be avoided? In other words:

How do I replace only the text and ignore HTML tags and the content inside the tags?

I tried writing several matching patterns myself, and they all failed. In the end I turned to Google for help. The search keywords matter a lot here: it is best to translate the keywords you want to search for into English, because the results will be much more satisfying. That is how I found the solution: Matching A word/characters Outside of Html Tags.

Following the tips in that article, I changed the properNoun() method as follows:

public function properNoun($text, array $extend = [], array $ignore = [])
{
    $dict = include __DIR__ . '/../data/dict.php';
    if ($extend) {
        $dict = array_merge($dict, $extend);
    }
    if ($ignore) {
        $dict = array_diff($dict, $ignore);
    }
    foreach ($dict as $noun) {
        // Matching proper nouns Outside Of Html Tags
        $text = preg_replace("/(?<!\.|[a-z]){$noun}(?!\.|[a-z])(?!([^<]+)?>)/i", $noun, $text);
    }
    return $text;
}
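
The extra lookahead (?!([^<]+)?>) rejects a match when a > can be reached without first passing a <, which is the case for anything inside a tag. A minimal sketch of its effect, assuming a single hypothetical dictionary entry PHP; the helper replaceNoun() and its $guardTags switch exist only for this comparison and are not part of the package:

```php
<?php
// Hypothetical helper: $guardTags toggles the "outside of HTML tags" lookahead.
function replaceNoun(string $html, string $noun, bool $guardTags = true): string
{
    // Skip a match if a ">" follows before any "<", i.e. the match sits inside a tag.
    $guard = $guardTags ? '(?!([^<]+)?>)' : '';
    $pattern = '/(?<!\.|[a-z])' . preg_quote($noun, '/') . '(?!\.|[a-z])' . $guard . '/i';

    return preg_replace($pattern, $noun, $html);
}

$html = '<p class="php">I love php</p>';

echo replaceNoun($html, 'PHP', false), "\n";
// <p class="PHP">I love PHP</p>

echo replaceNoun($html, 'PHP'), "\n";
// <p class="php">I love PHP</p>
```

The guard is a heuristic rather than an HTML parser. For example, plain text that contains a literal > after the noun, such as php > java, looks like the inside of a tag to this pattern, so the noun there is left unchanged.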

Development summary

Learn how to get past network restrictions so that you can reach the wider internet;
Google, GitHub, and Stack Overflow: these three "power tools" will help you solve most (or all) of the problems you run into during development;
Learn some Google search tips. For example, translate your search keywords into English, and the results will be much more satisfying;
English really is important. At the very least, install a Google Translate extension in your Chrome browser;
PHPUnit is genuinely useful, especially in projects whose functionality changes frequently or that need refactoring;
Don't confine yourself to one programming language. Learn one or more others as a complement, to expand your thinking and broaden your horizons;
Visit high-quality communities such as Laravel China.

Last Words

If there is one more thing to say, it is: please give the project a star, hahaha. Project address: github.com/jxlwqq/chinese-typesetting.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
