Teach You How to Build a Keyword Matching Project (Search Engine): Day 20

Source: Internet
Author: User
Tags: database issues


Guest posts: hacker form artifacts, database issues

Object-oriented series: object-oriented cognition (a fresh first look), object-oriented imagination (sleepwalking, part 1), object-oriented cognition (how to find a class)

Server load balancing series: concepts, configuration and implementation (Nginx)

Rant: Some readers have told me that the later articles in this series get uglier and harder to follow, and that they cannot keep up with the pace. Others ask how Shuai can improve so quickly and wonder whether they themselves are just slow. Some read only the text and skip the code because the code is too hard to understand.

I have been thinking about this problem over the past few days, which is why I started the object-oriented articles, hoping to help those who cannot keep up. If readers do not give feedback, I can only continue the course the way I see fit.

 

Day 20

Starting point: Teach You How to Build a Keyword Matching Project (Search Engine): Day 1

Review: Teach You How to Build a Keyword Matching Project (Search Engine): Day 19

Shuai wrote the first version of the word segmentation algorithm, but when he showed it to the boss he was asked to rewrite it.

The reasons were as follows:

1. How can it be tested, and where does the test data come from?

2. Does Splitter do too many things?

3. What should happen when a phrase is repeated in a keyword such as xxl dress?

With these questions in mind, Shuai started refactoring.

First, he noticed that the checks for Chinese versus English text and the length calculation belong together, so he extracted them into a class:

<?php
class UTF8
{
    /**
     * Check whether the string contains UTF-8 encoded Chinese characters.
     * @param string $char
     * @return bool
     */
    public static function is($char)
    {
        // One three-byte UTF-8 sequence whose lead byte is 228-233 (0xE4-0xE9).
        $cjk = "[" . chr(228) . "-" . chr(233) . "]"
             . "[" . chr(128) . "-" . chr(191) . "]"
             . "[" . chr(128) . "-" . chr(191) . "]";

        // Match at the start, at the end, or as a run of two or more characters.
        return (preg_match("/^(" . $cjk . ")/", $char)
            || preg_match("/(" . $cjk . ")$/", $char)
            || preg_match("/(" . $cjk . "){2,}/", $char));
    }

    /**
     * Calculate the number of characters in the string.
     * @param string $char
     * @return float|int
     */
    public static function length($char)
    {
        if (self::is($char)) {
            // Chinese characters take three bytes each in UTF-8.
            return ceil(strlen($char) / 3);
        }
        return strlen($char);
    }

    /**
     * Check whether the word is a phrase (more than one character).
     * @param string $word
     * @return bool
     */
    public static function isPhrase($word)
    {
        return self::length($word) > 1;
    }
}

Shuai then considered that the dictionary may come from multiple sources, such as a set of test data. That also answers the boss's complaint that the code could not be tested. So Shuai extracted each dictionary source into its own class:

<?php
class DBSegmentation
{
    public $cid;

    /**
     * Get the phrases used for word segmentation under a category.
     * @return array
     */
    public function transferDictionary()
    {
        $ret = array();
        // $this->cid is interpolated directly, so it must be a trusted value.
        $sql = "select word from category_linklist where cid='$this->cid'";
        $rows = DB::makeArray($sql);
        foreach ($rows as $strWords) {
            // Each row holds a comma-separated group of words.
            foreach (explode(",", $strWords) as $word) {
                if (UTF8::isPhrase($word)) {
                    $ret[] = $word;
                }
            }
        }
        return $ret;
    }
}

class TestSegmentation
{
    /**
     * Return a fixed dictionary so that the splitter can be tested without a database.
     * @return array
     */
    public function transferDictionary()
    {
        $groups = array(
            "dress,clothes",
            "XXL,xxl,increase,XL",
            "X code,medium code",
            "coat,coat,clothes,coat,coat",
            "female,ladies,girls,female",
        );
        $ret = array();
        foreach ($groups as $strWords) {
            foreach (explode(",", $strWords) as $word) {
                if (UTF8::isPhrase($word)) {
                    $ret[] = $word;
                }
            }
        }
        return $ret;
    }
}
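The two classes above share only a method name. One way to make that contract explicit is an interface, sketched below under stated assumptions: `DictionarySource` and `ArrayDictionary` are hypothetical names that do not appear in the original, and a plain byte-length check stands in for `UTF8::isPhrase()` to keep the example self-contained.

```php
<?php
// Hypothetical interface; the original classes only share a method name.
interface DictionarySource
{
    /** @return string[] phrases usable for word segmentation */
    public function transferDictionary(): array;
}

// Hypothetical in-memory source, convenient for unit tests.
final class ArrayDictionary implements DictionarySource
{
    /** @var string[] comma-separated synonym groups */
    private $groups;

    public function __construct(array $groups)
    {
        $this->groups = $groups;
    }

    public function transferDictionary(): array
    {
        $phrases = [];
        foreach ($this->groups as $group) {
            foreach (explode(',', $group) as $word) {
                $word = trim($word);
                // Stand-in for UTF8::isPhrase(): keep words longer than one byte.
                if (strlen($word) > 1) {
                    $phrases[] = $word;
                }
            }
        }
        return $phrases;
    }
}

$source = new ArrayDictionary(["dress,clothes", "XXL,xxl,XL", "X"]);
print_r($source->transferDictionary()); // dress, clothes, XXL, xxl, XL
```

With an interface like this, a splitter could accept any source by type, and a database-backed or file-backed source could be swapped in without touching the segmentation code.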

With that done, Splitter can focus on word segmentation alone. The code is as follows:

class Splitter
{
    public $keyword;
    private $dictionary = array();

    public function setDictionary($dictionary = array())
    {
        // Sort phrases by character length, shortest first.
        usort($dictionary, function ($a, $b) {
            return (UTF8::length($a) > UTF8::length($b)) ? 1 : -1;
        });
        $this->dictionary = $dictionary;
    }

    public function getDictionary()
    {
        return $this->dictionary;
    }

    /**
     * Split the keyword into phrases or words.
     * @return KeywordEntity $keywordEntity
     */
    public function split()
    {
        $remainKeyword = $this->keyword;
        $keywordEntity = new KeywordEntity($this->keyword);
        foreach ($this->dictionary as $phrase) {
            // preg_quote() keeps regex metacharacters in a phrase from breaking the pattern.
            $pattern = "/" . preg_quote($phrase, "/") . "/";
            $matchTimes = preg_match_all($pattern, $remainKeyword, $matches);
            if ($matchTimes > 0) {
                $keywordEntity->addElement($phrase, $matchTimes);
                // Replace matched phrases with a marker so the remainder can be split on it.
                $remainKeyword = str_replace($phrase, "::", $remainKeyword);
            }
        }
        $remainKeywords = explode("::", $remainKeyword);
        foreach ($remainKeywords as $splitWord) {
            if (!empty($splitWord)) {
                $keywordEntity->addElement($splitWord);
            }
        }
        return $keywordEntity;
    }
}

class KeywordEntity
{
    public $keyword;
    public $elements = array();

    public function __construct($keyword)
    {
        $this->keyword = $keyword;
    }

    public function addElement($word, $times = 1)
    {
        if (isset($this->elements[$word])) {
            $this->elements[$word]->times += $times;
        } else {
            // Key by word so repeats accumulate and calculateWeight() can find the element.
            $this->elements[$word] = new KeywordElement($word, $times);
        }
    }

    /**
     * @desc Calculate the weight of a UTF-8 string
     * @param string $word
     * @return float
     */
    public function calculateWeight($word)
    {
        $element = $this->elements[$word];
        return round(strlen($element->word) * $element->times / strlen($this->keyword), 3);
    }
}

class KeywordElement
{
    public $word;
    public $times;

    public function __construct($word, $times)
    {
        $this->word = $word;
        $this->times = $times;
    }
}

He moved the weight calculation into a dedicated class, KeywordEntity.
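To make the weight formula concrete, here is a standalone sketch of the same arithmetic: the byte length of the word, times its number of occurrences, divided by the byte length of the whole keyword, rounded to three decimals. The function name `phraseWeight` is hypothetical, and the keyword is assumed to be non-empty (an empty keyword would mean division by zero).

```php
<?php
// Standalone version of the weight formula, for illustration only.
// Assumes $keyword is not empty.
function phraseWeight(string $word, int $times, string $keyword): float
{
    // Share of the keyword's bytes covered by all occurrences of $word.
    return round(strlen($word) * $times / strlen($keyword), 3);
}

echo phraseWeight("dress", 1, "xxl dress"), "\n"; // 5 / 9 -> 0.556
echo phraseWeight("xxl", 1, "xxl dress"), "\n";   // 3 / 9 -> 0.333
```

Because the space in "xxl dress" belongs to neither phrase, the two weights do not add up to 1.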

With the code structured this way, Shuai could easily write a test example:

<?php
$segmentation = new TestSegmentation();

$splitter = new Splitter();
$splitter->setDictionary($segmentation->transferDictionary());
$splitter->keyword = "xxl dress";

$keywordEntity = $splitter->split();
var_dump($keywordEntity);
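The boss's third question, about repeated phrases, can be explored with a compact function that follows the same replace-and-split idea as Splitter. This is a hedged sketch rather than the author's implementation: `countPhrases` is a hypothetical name, matching is case-sensitive, and leftover fragments are trimmed, which the class above does not do.

```php
<?php
// Hypothetical standalone function mirroring the replace-and-split idea.
function countPhrases(string $keyword, array $dictionary): array
{
    $counts = [];
    $remaining = $keyword;
    foreach ($dictionary as $phrase) {
        // preg_quote keeps characters such as "+" or "/" in a phrase literal.
        $pattern = '/' . preg_quote($phrase, '/') . '/';
        $times = preg_match_all($pattern, $remaining);
        if ($times > 0) {
            $counts[$phrase] = ($counts[$phrase] ?? 0) + $times;
            // Mark consumed text so the leftovers can be split on the marker.
            $remaining = str_replace($phrase, '::', $remaining);
        }
    }
    // Whatever the dictionary did not cover is kept as individual words.
    foreach (explode('::', $remaining) as $leftover) {
        $leftover = trim($leftover);
        if ($leftover !== '') {
            $counts[$leftover] = ($counts[$leftover] ?? 0) + 1;
        }
    }
    return $counts;
}

print_r(countPhrases("xxl dress xxl", ["xxl", "dress"])); // xxl => 2, dress => 1
print_r(countPhrases("red xxl dress", ["xxl", "dress"])); // xxl => 1, dress => 1, red => 1
```

A repeated phrase is counted once per occurrence, and any text the dictionary does not cover ("red" in the second call) ends up as its own element.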

 

This way, even if the algorithm changes, the code can absorb the change easily.

 

Shuai took away this lesson: when a class seems to be doing too many things, consider the single responsibility principle.

 

Single responsibility principle: A class should have only one reason to change; that is, it should have only one responsibility. Each responsibility is an axis of change. If a class has more than one responsibility, those responsibilities become coupled, which leads to a fragile design: when one responsibility changes, the others may be affected. Coupling multiple responsibilities also hurts reusability. For example, logic should be kept separate from the user interface. [From Baidu Encyclopedia]

 

When the boss asked whether other word segmentation algorithms could be plugged in, Shuai was very happy, because the code was now clean enough to make that easy.

How will Shuai handle third-party word segmentation extensions? To be continued in the next installment: Teach You How to Build a Keyword Matching Project (Search Engine): Day 21

 



