Small note: The development process of adding a new feature to an open source project

Source: Internet
Author: User
Tags: nltk

This article is not a development journal. It describes the problems I ran into during development and how I solved them, so that we can exchange ideas and learn from each other. I am an ordinary PHP engineer, and I hope this helps junior developers. The specific lessons are summarized at the end of the article.

Earlier this month, I started a small project on GitHub: Chinese-typesetting. It is a Composer package that corrects the typesetting of Chinese text.

Chinese-typesetting includes the following features:

    • Add spaces between Chinese characters and English letters, Greek letters (as used in math, science and engineering), and digits;
    • Perform a limited full-width to half-width conversion (English letters, digits, spaces, and some special characters use half-width forms);
    • Fix incorrect punctuation;
    • Remove the styles of HTML tags;
    • Remove empty HTML tags;
    • Remove paragraph indentation.

This week my company had little development work and no overtime, so I started planning a new feature: correcting the capitalization of English proper nouns.

The data source for English proper nouns

The first question to answer is:

Where does the English proper noun data come from?

My first thought was NLTK, the Python natural language processing package. It has a function called pos_tag that identifies and labels the part of speech of each word, and words tagged NNP or NNPS are proper nouns. I suspected that the NLTK package must contain a corresponding set of proper noun data, but with my limited ability I could not find it.

After that route led nowhere, I searched Google and found that getting the data from an online dictionary was a feasible approach. That is how I eventually found a list of English proper nouns on Wiktionary. I then wrote a small Python crawler script and scraped the data.

Finally, I sorted and filtered the crawled data.

The filtering rules are as follows:

    • Use is_numeric() to eliminate entries such as 007;
    • Use the regular expression '/\W/' to eliminate entries such as ǃXóõ;
    • Use strlen() to eliminate single-byte entries such as A;
    • Eliminate entries that conflict with HTML, CSS, and JavaScript reserved words.
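
The four rules above could be combined roughly as follows. This is a minimal sketch, not the script the author actually used: filterNouns() and the $reserved list are hypothetical names, and the sample words are invented for illustration.

```php
<?php
// Hypothetical helper: filterNouns() and $reserved are illustrative names,
// not part of the Chinese-typesetting package.
function filterNouns(array $words, array $reserved): array
{
    $reservedLower = array_map('strtolower', $reserved);

    $kept = array_filter($words, function (string $word) use ($reservedLower): bool {
        if (is_numeric($word)) {           // purely numeric entries, e.g. "007"
            return false;
        }
        if (preg_match('/\W/', $word)) {   // any byte outside [A-Za-z0-9_], e.g. "ǃXóõ"
            return false;
        }
        if (strlen($word) < 2) {           // single-byte entries, e.g. "A"
            return false;
        }
        // Entries that collide with HTML, CSS, or JavaScript reserved words
        return !in_array(strtolower($word), $reservedLower, true);
    });

    return array_values($kept);            // reindex after filtering
}

$words = ['007', 'ǃXóõ', 'A', 'Div', 'GitHub', 'Laravel'];
echo implode(', ', filterNouns($words, ['div', 'span', 'function'])), "\n";
// prints: GitHub, Laravel
```

Note that '/\W/' without the u modifier also rejects any entry containing a space or a hyphen, so multi-word proper nouns are dropped by this rule as well.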

How to let users customize the proper noun data

The initial code is as follows:

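/**
 * Use the correct capitalization for proper nouns.
 * Correct English proper nouns.
 *
 * @param $text
 *
 * @return null|string|string[]
 */
public function properNoun($text)
{
    // Load the dictionary of correctly capitalized proper nouns
    $dict = include __DIR__ . '/../data/dict.php';
    foreach ($dict as $noun) {
        // Case-insensitive whole-word match, replaced with the dictionary spelling
        $text = preg_replace("/\b{$noun}\b/i", $noun, $text);
    }
    return $text;
}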

Then I wondered: what if developers who use this method want to extend the dictionary or ignore some of the proper nouns?
So I changed the properNoun() method as follows:

/**
 * Use the correct capitalization for proper nouns.
 * Correct English proper nouns.
 *
 * @param $text
 * @param array $extend Extra proper nouns to add to the dictionary
 * @param array $ignore Proper nouns to leave out of the dictionary
 *
 * @return null|string|string[]
 */
public function properNoun($text, array $extend = [], array $ignore = [])
{
    $dict = include __DIR__ . '/../data/dict.php';
    if ($extend) {
        $dict = array_merge($dict, $extend);
    }
    if ($ignore) {
        $dict = array_diff($dict, $ignore);
    }
    foreach ($dict as $noun) {
        $text = preg_replace("/\b{$noun}\b/i", $noun, $text);
    }
    return $text;
}
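
To see how the two optional arguments behave, here is a standalone sketch. It is not the package's code: properNounSketch() is a hypothetical function that takes the dictionary as a parameter instead of loading data/dict.php, the sample words are made up, and the preg_quote() call is an addition so that user-supplied words cannot break the pattern.

```php
<?php
// Standalone sketch: $dict is passed in rather than loaded from data/dict.php,
// and the sample words are made up for illustration.
function properNounSketch(string $text, array $dict, array $extend = [], array $ignore = []): ?string
{
    // Add the user's words first, then drop the ignored ones
    $dict = array_diff(array_merge($dict, $extend), $ignore);
    foreach ($dict as $noun) {
        // preg_quote() is an addition here (assumption: user words may contain regex characters)
        $text = preg_replace('/\b' . preg_quote($noun, '/') . '\b/i', $noun, $text);
    }
    return $text;
}

$dict = ['GitHub', 'Laravel'];
$text = 'i use github, laravel and phpunit';

echo properNounSketch($text, $dict), "\n";
// prints: i use GitHub, Laravel and phpunit
echo properNounSketch($text, $dict, ['PHPUnit']), "\n";
// prints: i use GitHub, Laravel and PHPUnit
echo properNounSketch($text, $dict, ['PHPUnit'], ['Laravel']), "\n";
// prints: i use GitHub, laravel and PHPUnit
```

Because array_diff() compares values case-sensitively, an entry in $ignore only takes effect when it is spelled exactly as in the dictionary: passing 'laravel' would not remove 'Laravel'.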

How to improve and optimize the code logic

While writing this feature, I also studied the implementation logic of some existing open source projects. After seeing a commit in the open source project Auto-correct (PS: that PR was submitted by Overtrue, a highly respected member of the community), I changed the properNoun() method as follows:

public function properNoun($text, array $extend = [], array $ignore = [])
{
    $dict = include __DIR__ . '/../data/dict.php';
    if ($extend) {
        $dict = array_merge($dict, $extend);
    }
    if ($ignore) {
        $dict = array_diff($dict, $ignore);
    }
    foreach ($dict as $noun) {
        $text = preg_replace("/(?<!\.|[a-z]){$noun}(?!\.|[a-z])/i", $noun, $text);
    }
    return $text;
}
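
The article does not say why the pattern was changed, but one observable difference between the two versions is how they treat a word that sits next to a dot. The sketch below compares them on a few invented sentences; withBoundary() and withLookaround() are hypothetical helper names used only for this comparison.

```php
<?php
// Hypothetical helpers that apply each pattern to a single word.
function withBoundary(string $text, string $noun): ?string
{
    // Earlier version: plain word boundaries
    return preg_replace("/\b{$noun}\b/i", $noun, $text);
}

function withLookaround(string $text, string $noun): ?string
{
    // Later version: the word must not touch a dot or a letter on either side
    return preg_replace("/(?<!\.|[a-z]){$noun}(?!\.|[a-z])/i", $noun, $text);
}

foreach (['see github.com', 'use github today', 'I like github.'] as $sample) {
    printf(
        "%-18s | %-18s | %s\n",
        $sample,
        withBoundary($sample, 'GitHub'),
        withLookaround($sample, 'GitHub')
    );
}
// see github.com     | see GitHub.com     | see github.com
// use github today   | use GitHub today   | use GitHub today
// I like github.     | I like GitHub.     | I like github.
```

The lookaround version leaves domain-like strings such as github.com alone, but it also skips a word that is directly followed by a sentence-ending period.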

How to avoid over-substitution

Just when I thought I was done, I ran the PHPUnit unit tests I had written earlier and they reported an error: with the method above, if the argument is rich text containing HTML tags, then HTML element names, attribute names, and attribute values may be replaced as well.
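
The problem can be reproduced with a single word and a short HTML fragment. This is a sketch under assumptions: replaceEverywhere() is a hypothetical helper that applies the pattern from the method above to one word, and the HTML input is invented.

```php
<?php
// Hypothetical helper: applies the pattern from the method above to one word.
function replaceEverywhere(string $text, string $noun): ?string
{
    return preg_replace("/(?<!\.|[a-z]){$noun}(?!\.|[a-z])/i", $noun, $text);
}

$html = '<a class="github" href="/github/">github</a>';
echo replaceEverywhere($html, 'GitHub'), "\n";
// prints: <a class="GitHub" href="/GitHub/">GitHub</a>
```

Only the link text should have changed. The class name and the URL path were rewritten too, so they may no longer match the stylesheet or the route they pointed to.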

How can this over-replacement be avoided? In other words:

How can we replace only the text, leaving HTML tags and everything inside them untouched?

I tried writing several matching patterns myself and failed. In the end I turned to Google for help. The search keywords matter a lot here: it is best to translate the keywords you want to search for into English, and the results will be much more satisfying. That is how I found the solution: Matching A word/characters Outside of Html Tags.

Following the tips in that article, I changed the properNoun() method as follows:

public function properNoun($text, array $extend = [], array $ignore = [])
{
    $dict = include __DIR__ . '/../data/dict.php';
    if ($extend) {
        $dict = array_merge($dict, $extend);
    }
    if ($ignore) {
        $dict = array_diff($dict, $ignore);
    }
    foreach ($dict as $noun) {
        // Matching proper nouns Outside Of Html Tags
        $text = preg_replace("/(?<!\.|[a-z]){$noun}(?!\.|[a-z])(?!([^<]+)?>)/i", $noun, $text);
    }
    return $text;
}
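
The extra lookahead (?!([^<]+)?>) rejects a match when a ">" can be reached from the end of the word without first crossing a "<", which is what happens when the word sits inside a tag. The sketch below applies the final pattern to one word; replaceOutsideTags() is a hypothetical helper name and both inputs are invented.

```php
<?php
// Hypothetical helper: applies the final pattern from the method above to one word.
function replaceOutsideTags(string $text, string $noun): ?string
{
    return preg_replace("/(?<!\.|[a-z]){$noun}(?!\.|[a-z])(?!([^<]+)?>)/i", $noun, $text);
}

// Inside the tag, a ">" follows with no "<" in between, so those matches are skipped.
echo replaceOutsideTags('<a class="github" href="/github/">github</a>', 'GitHub'), "\n";
// prints: <a class="github" href="/github/">GitHub</a>

// Caveat: a literal ">" in plain text looks like the end of a tag to this pattern.
echo replaceOutsideTags('github > gitlab', 'GitHub'), "\n";
// prints: github > gitlab
```

The second call shows a limit of the approach: the pattern does not parse HTML, so a bare ">" later in the text suppresses the replacement. Input where ">" is written as the entity &gt; is not affected.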

Development summary

    • Learn how to get unrestricted access to the internet;
    • Google, GitHub, and Stack Overflow: these three "magic tools" will help you solve most (or all) of the problems you run into during development;
    • Learn some Google search techniques. For example, translating your search keywords into English gives more satisfying results;
    • English really matters. At the very least, install a Google Translate extension in your Chrome browser;
    • PHPUnit is genuinely useful, especially in projects whose functionality changes often or that need refactoring;
    • Don't confine yourself to one programming language. Learn one or more other languages as a supplement, to expand your thinking and broaden your horizons;
    • Visit high-quality communities such as Laravel China.

Last Words

If there is anything left to say, it is this: please give the project a star, hahaha. Project address: github.com/jxlwqq/chinese-typesetting.
