Word Segmentation System Using NLPIR-ICTCLAS2014

Source: Internet
Author: User
0. Preparation before using the NLPIR-ICTCLAS2014 word segmentation system

Download the NLPIR-ICTCLAS2014 package. Direct link:

http://ictclas.nlpir.org/upload/20140618094605_ICTCLAS2014.zip


You may also want your own word list (it is optional; I only needed one because certain words help when analyzing pages).


1. Quickly get what we need from the NLPIR-ICTCLAS2014 download package

First, let's take a look at the structure of the entire folder.


The data folder contains the dictionaries required for word segmentation, and Configure.xml contains the related description information. Doc contains the help files, which describe the basic function interfaces we need. include and lib are what we mainly use. sample holds the sample code, and test contains an example EXE. Use is governed by a license that restricts it in some way I have not yet worked out. If the restriction is time-based, that will have to be checked by testing.


2. Extract what we need from the downloaded package and create a new example.

From the above, we mainly need three folders: data, include, and lib (I plan to build a 32-bit program, so only the DLL and LIB under lib\win32 are needed).


The project directory now looks like this (NLPIR.dll must of course be placed next to the EXE):



The first example is fairly simple and calls several common functions:

    // Test_nlpir.cpp: defines the entry point of the console application.
    //

    #include "stdafx.h"
    #include "iostream"
    #include "string"
    using namespace std;

    #include "NLPIR.h"
    #pragma comment(lib, "NLPIR.lib")

    int _tmain(int argc, _TCHAR* argv[])
    {
        // Initialize the segmenter; it needs the data folder and the license.
        if (!NLPIR_Init())
        {
            printf("init fails\n");
            return -1;
        }

        const char* participle_result;
        const char* sentence = "[full rent for rent] second house, building 4, xuanwumen West Street, full rent [full rent] tianju garden, Media Village, large two residences [full rent] tianju garden, Media Village ju ";

        // Segment the text; the second argument 1 turns on part-of-speech tagging.
        cout << "================= NLPIR_ParagraphProcess =========================" << endl;
        participle_result = NLPIR_ParagraphProcess(sentence, 1);
        cout << participle_result << endl;
        cout << "==================================" << endl;

        // New words found in a file.
        cout << "=============== NLPIR_GetFileNewWords =========================" << endl;
        const char* get_file_new_words = NLPIR_GetFileNewWords("test.txt");
        cout << get_file_new_words << endl;
        cout << "==================================" << endl;

        // Keywords of a string.
        cout << "=================== NLPIR_GetKeyWords =========================" << endl;
        const char* get_key_words = NLPIR_GetKeyWords(sentence);
        cout << get_key_words << endl;
        cout << "==================================" << endl;

        // Keywords of a file.
        cout << "=================== NLPIR_GetFileKeyWords =======================" << endl;
        const char* get_file_key_words = NLPIR_GetFileKeyWords("test.txt");
        cout << get_file_key_words << endl;
        cout << "==================================" << endl;

        // New words found in a string.
        cout << "=================== NLPIR_GetNewWords =========================" << endl;
        const char* get_new_words = NLPIR_GetNewWords(sentence);
        cout << get_new_words << endl;
        cout << "==================================" << endl;

        NLPIR_Exit();
        return 0;
    }

The function names are self-explanatory.


This is the output result:


However, it is obvious that the word segmentation mentioned above has some minor problems. If we feed data to the clustering algorithm, such word segmentation will cause some problems.
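Before the result can go to a clustering algorithm, the tagged string has to be split back into individual tokens. The following is a minimal sketch of that step, not part of the NLPIR API. It assumes the tagged output is a whitespace-separated list of word/tag items; parse_tagged and TaggedWord are hypothetical names introduced here for illustration.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One segmented token: the word and its part-of-speech tag (empty if untagged).
using TaggedWord = std::pair<std::string, std::string>;

// Splits a tagged result such as "word/n word/v" into (word, tag) pairs.
// Assumption: items are separated by whitespace and the tag follows the LAST '/'.
std::vector<TaggedWord> parse_tagged(const std::string& result) {
    std::vector<TaggedWord> tokens;
    std::istringstream in(result);
    std::string item;
    while (in >> item) {
        const std::string::size_type slash = item.rfind('/');
        if (slash == std::string::npos || slash == 0) {
            // No usable tag: keep the raw item as the word.
            tokens.emplace_back(item, "");
        } else {
            tokens.emplace_back(item.substr(0, slash), item.substr(slash + 1));
        }
    }
    return tokens;
}
```

Calling parse_tagged on the string returned by NLPIR_ParagraphProcess would then give a list that is easy to filter (for example by tag) before building feature vectors. Splitting on the last slash keeps a token whose word is itself a slash intact.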


3. Solve the above problems

We have two methods to solve this problem.


A. This may be the simplest and most convenient method.

Apart from some tedious manual work, that is. If you already have the entries you need, you can create a new dictionary file. In this test the file is user_dic.txt, and it contains the phrases that should be kept as separate words:



The following shows how to use the dictionary:

    // Test_nlpir.cpp: defines the entry point of the console application.
    //

    #include "stdafx.h"
    #include "iostream"
    #include "string"
    using namespace std;

    #include "NLPIR.h"
    #pragma comment(lib, "NLPIR.lib")

    int _tmain(int argc, _TCHAR* argv[])
    {
        if (!NLPIR_Init())
        {
            printf("init fails\n");
            return -1;
        }

        const char* participle_result;

        // Import the user dictionary; the return value is the number of entries added.
        unsigned int add_dic_items = NLPIR_ImportUserDict("user_dic.txt");
        printf("%u User-Defined lexical entries added!\n", add_dic_items);

        const char* sentence = "[full rent for rent] second house, building 4, xuanwumen West Street, full rent [full rent] tianju garden, Media Village, large two residences [full rent] tianju garden, Media Village ju ";

        cout << "================= NLPIR_ParagraphProcess =========================" << endl;
        participle_result = NLPIR_ParagraphProcess(sentence, 1);
        cout << participle_result << endl;
        cout << "==================================" << endl;

        cout << "=============== NLPIR_GetFileNewWords =========================" << endl;
        const char* get_file_new_words = NLPIR_GetFileNewWords("test.txt");
        cout << get_file_new_words << endl;
        cout << "==================================" << endl;

        cout << "=================== NLPIR_GetKeyWords =========================" << endl;
        const char* get_key_words = NLPIR_GetKeyWords(sentence);
        cout << get_key_words << endl;
        cout << "==================================" << endl;

        cout << "=================== NLPIR_GetFileKeyWords =======================" << endl;
        const char* get_file_key_words = NLPIR_GetFileKeyWords("test.txt");
        cout << get_file_key_words << endl;
        cout << "==================================" << endl;

        cout << "=================== NLPIR_GetNewWords =========================" << endl;
        const char* get_new_words = NLPIR_GetNewWords(sentence);
        cout << get_new_words << endl;
        cout << "==================================" << endl;

        NLPIR_Exit();
        return 0;
    }

This is the segmentation result after the dictionary is used:


We can see that all the phrases we wanted are now split out as separate words, and no new words are reported.
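If the dictionary entries come from a program rather than from hand editing, the file used in method A could be generated with a small helper. This is only a sketch under assumptions: it writes one entry per line as the word followed by a tab and an optional part-of-speech tag, which should be checked against the format described in the Doc folder. DictEntry and write_user_dict are hypothetical names, not NLPIR functions.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical entry type: a phrase to keep whole plus an optional POS tag.
struct DictEntry {
    std::string word;
    std::string pos;  // may be empty
};

// Writes one entry per line as "word<TAB>pos", or just "word" when pos is empty.
// Assumption: this line layout matches what the dictionary import expects.
// Returns the number of entries written, or -1 if the file cannot be opened.
int write_user_dict(const std::string& path, const std::vector<DictEntry>& entries) {
    std::ofstream out(path.c_str());
    if (!out) {
        return -1;
    }
    int written = 0;
    for (const DictEntry& entry : entries) {
        if (entry.word.empty()) {
            continue;  // skip blank entries
        }
        out << entry.word;
        if (!entry.pos.empty()) {
            out << '\t' << entry.pos;
        }
        out << '\n';
        ++written;
    }
    return written;
}
```

The resulting file would then be passed to NLPIR_ImportUserDict as in the listing above. The file encoding most likely has to match the encoding the segmenter was initialized with, so that is worth verifying when entries contain Chinese text.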

 

B. Run word segmentation over multiple samples (that is, multiple pieces of data), so that the frequency of the words we need increases (a word that appears only once or twice won't be picked up). This way NLPIR_GetNewWords returns some of the words we are after. Once we have them, we can write them to a text file and then add that file to the user dictionary.


Here we only demonstrate the first step: how adding more data expands the NLPIR_GetNewWords result.

It is very simple: change the test string in the first example to the following (the contents of test.txt are changed to the same text):

    const char* sentence =
        "[full rent for rent] second house, building 4, xuanwumen West Street [full rent] tianju garden, Media Village, two largest residences [house owner rent] Wanliu Zhong Road, Kangqiao water-gun one-bedroom [individual rental] On-floor Bridge east Qing shangyuan community opening room 58 Ping 3 months sublease [Zhongguancun baofusi bridge south] Two center master bedroom [full long-term rental] Haidian Anning jiayuan Anning village one-bedroom full rent (Landlord rent directly) [single room rental] near century Golden Source, [personal rental, second bedroom, third bedroom, building 11, yueda Park, [house owner 1 Direct Rent] 1 bedroom, xuante jiayuan, shilibao station, Metro Line 6 (only 1 female) [single room rental] Line 10: One of the second residences of the North shadow huangting sub-district of the West Tucheng peony park [complete rental] Three residences of the smart Xueyuan community of the West second flag [house owner complete rental] Three hardwarehouses of the west xiaokou station of Qinghe Metro full Set of [single room rental] single room in Tsinghua University [one-bedroom apartment for decoration in the south dahezhuang Garden of Haidian bridge, southwest gate of Peking University] su [for help] renting a room in dongli/sily/jiayuan/yard 2 of Nongda South Road clean comfort belt elevator two [for help] rent a West Ticket/Finance Street/two-bedroom apartment at erlong Road [Full rent] a full suite of North Beach sciences, with a 50-meter apartment near the University of Finance and Economics, 3400 yuan for rent Transfer, you need to obtain a sublease within the station. "
        "Rent a three-bedroom apartment opposite to the Hupan subway station in Zhongguancun Zhichun Li, you can also rent a whole person for rent [whole set of rent] The whole set of one-bedroom apartment in the south beach community of Chaoyang District (individual rental) xinlong City Phase II 14 square meters regular second bedroom 1000 yuan individual rent: wenquan Town shangfeng ShangShui community hardcover semi-underground second residence [Changping shahe gaojiao park one district small two residences] [2450 appliances all new] Individual rent, I personally rented a two-bedroom apartment near Shijingshan star anise. I decided to raise the rent by 10% tomorrow when I signed an online contract for the 90-level two-bedroom Stock Room. I had to prepare for the property tax. I had to wait for a rainy day: transgenic Rice has actually spread (12) full purchase Wannian Flower City two-bedroom (19) Ask a primary question: what is the quota of alumni cards? I asked another question (8) about my wife's city ranking: Chengdu's second largest Shanghai Ranking (14) tsinghua University East eight jiayuan 61 square meters from the south to the regular one-bedroom view Figure 2.4 million (1) 110 square meters of new homes for reliable decoration team, design, quotation wudaokou School District room for five years the only two homes for urgent sale 2.6 million parents with the migration can apply for policy housing house south of the Road 60 meters, the west side is close to primary and secondary schools, with 19 floors, will the school district room go up? "
        "Several intermediary agencies call to say house prices are starting to rise [Full rental] line 6, huangqu station, Apple campus, 77, 2, individual rental] Hualian south, wudaokou, Haidian District no intermediary fee near-ground [whole house rent] two-bedroom apartment owner in North China first rent [rent] Zhongguancun Peking University Ximen Single Room [help] seek rent on contemporary urban homes or two homes in Yimei or a three-bedroom apartment [Full rental] wudaokou Dongsheng park one room one hall North transparent all bright solid wood furniture floor appliances ";

This data was also obtained from web pages.


Now let's take a look at the results:



Some common words now also show up in the NLPIR_GetNewWords result.
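The remaining step of method B, collecting the new words from several runs and keeping only the ones that recur before they go into the user dictionary, might look roughly like this. It is a sketch, not the author's code. It assumes the new-word result is a delimiter-separated list and takes the delimiter as a parameter, because the exact output format should be checked against the documentation ('#' in the usage below is an assumption). split_words, add_candidates, and frequent_words are hypothetical helper names.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Splits a delimiter-separated list such as "a#b#c#" into its non-empty items.
std::vector<std::string> split_words(const std::string& list, char delimiter) {
    std::vector<std::string> words;
    std::istringstream in(list);
    std::string word;
    while (std::getline(in, word, delimiter)) {
        if (!word.empty()) {
            words.push_back(word);
        }
    }
    return words;
}

// Adds every candidate in one new-word result to the running occurrence counts.
void add_candidates(const std::string& list, char delimiter,
                    std::map<std::string, int>& counts) {
    for (const std::string& word : split_words(list, delimiter)) {
        ++counts[word];
    }
}

// Returns the candidates seen at least min_count times, in sorted order.
std::vector<std::string> frequent_words(const std::map<std::string, int>& counts,
                                        int min_count) {
    std::vector<std::string> kept;
    for (const auto& item : counts) {
        if (item.second >= min_count) {
            kept.push_back(item.first);
        }
    }
    return kept;
}
```

In use, each string returned by NLPIR_GetNewWords would be passed to add_candidates with the assumed delimiter, for example add_candidates(result, '#', counts), and the words returned by frequent_words would be appended to user_dic.txt. Filtering by count keeps one-off noise out of the dictionary.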
