If you have ever analyzed the words or phrases on a website to improve its search ranking, this SSIS task will interest you. The Term Extraction transformation finds keywords in a text stream and calculates how often they occur. It accepts text input and produces two output columns: one holds the term, and the other holds a score for that term. The score can be the raw number of times the term appears, or a probability computed with a slightly more elaborate formula called TFIDF (term frequency / inverse document frequency), which weights how often a term occurs against how widespread it is across the whole sample. The formula is:
TFIDF (of a term or phrase) = (frequency of term) * log((# rows in sample) / (# rows containing the term or phrase))
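As a quick illustration of the formula (not the component's internal implementation; the choice of natural log and the sample numbers below are assumptions for the example):

```python
import math

def tfidf(term_frequency: int, total_rows: int, rows_with_term: int) -> float:
    """TFIDF of a term: (frequency of term) * log(total rows / rows containing the term)."""
    return term_frequency * math.log(total_rows / rows_with_term)

# A made-up term that occurs 5 times overall, appearing in 3 of 11 rows:
score = tfidf(term_frequency=5, total_rows=11, rows_with_term=3)
print(round(score, 3))  # → 6.496
```

Note that a term appearing in every row scores 0, since log(1) = 0: the formula rewards terms that are frequent but concentrated in few rows.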
The Term Extraction transformation uses an algorithm embedded in the SSIS component; we cannot change its internal logic, but several options are configurable, such as how terms are tokenized and whether to count nouns, noun phrases, or both. We can tune these settings based on the results we observe. The transformation also lets us define exclusion terms: words or phrases that should be left out of the final result. After a few rounds of testing and adjustment, it can produce very useful statistics.
Before using it, consider a scenario. Suppose a column from a customer service system, stored in a data warehouse, holds customers' comments and complaints. Would a keyword summary of that column be useful? It would show at a glance what customers are most dissatisfied with, which makes this a very good fit for term extraction. The input must be a Unicode string (DT_WSTR) or Unicode text (DT_NTEXT); data of any other type must be converted first. The best way to learn is by example, so let's work through one based on customer service notes.
Imagine the notes below were pulled from the customer service system in production; for convenience, we put them in a text file.
- Create a text file with the following content and save it as C:\custsvc.txt:
Ice maker in freezer stopped working model XX-YY3
Door to refrigerator is coming off model XX-1
Ice Maker is making a funny noise XX-YY3
Handle on fridge falling off model XX-Z1
Freezer is not getting cold enough XX-1
Ice Maker grinding sound fridge XX-YY3
Customer asking how to get the ice maker to Work Model XX-YY3
Customer complaining about Dent in Side Panel Model XX-Z1
Dent in model XX-Z1
Customer wants to exchange model XX-Z1 because of dent in door
Handle is wiggling model XX-Z1
-
- Create a package named TermExtractionExample, drag a Data Flow Task onto the control flow, and double-click it to open the data flow design surface.
- Add a Flat File connection manager that points at the file above, rename the output column to custsvcnote, and set OutputColumnWidth to 100. Then add a Flat File Source to the data flow and point it at this connection manager.
-
- The Flat File Source outputs the column as DT_STR, so we need to convert it to DT_WSTR or DT_NTEXT. Add a Data Conversion transformation, connect the Flat File Source to it, set the input column to custsvcnote, the output alias to ConvCustSvcNote, and the data type to DT_WSTR.
-
- Add a Term Extraction transformation, connect the Data Conversion transformation to it, and open its editor. Figure 1 shows the input columns and the two output columns with their default names; the output names can be changed here. Select ConvCustSvcNote, since it is the column converted to a Unicode string. If you select custsvcnote instead, a validation message appears: the input column can only have DT_WSTR or DT_NTEXT as its data type.
Figure 1
-
- Close the editor and ignore the warning about the error output.
-
- Add an OLE DB Destination, connect the Term Extraction transformation to it, and point the destination at the AdventureWorks database.
-
- Click New next to the table list and create the destination table with the following statement:
CREATE TABLE [termresults] ([term] NVARCHAR(128), [score] DOUBLE PRECISION)
-
- Click Mappings to review the column mappings.
- Add a data viewer between the Term Extraction transformation and the OLE DB Destination, then run the package. Execution pauses and the data viewer displays a set of terms with their scores. With the defaults in effect, the score column shows the number of times each term occurs. Open the Advanced tab, shown in Figure 2, to see the advanced settings.
Figure 2
-
- The Advanced tab is divided into four sections.
Term type: controls how the input is tokenized. "Noun" counts single nouns only, "Noun phrase" counts noun phrases only, and "Noun and noun phrase" counts both.
Score type: scores terms with either a raw frequency count or the TFIDF algorithm.
Parameters: the frequency threshold sets how many times a term must occur before it is reported, and the maximum length of term caps how long an extracted term can be (default 12).
Options: a checkbox enables case-sensitive term extraction.
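As a rough sketch of what the Frequency score type computes, here is a toy word counter with a frequency threshold. The real component performs linguistic noun and noun-phrase tagging; this case-insensitive whitespace tokenizer over a few of the sample rows is only an illustration:

```python
from collections import Counter

# A few rows from the sample customer service notes.
rows = [
    "Ice maker in freezer stopped working model XX-YY3",
    "Door to refrigerator is coming off model XX-1",
    "Ice Maker is making a funny noise XX-YY3",
]

# Case-insensitive single-word frequency count, loosely mimicking
# Score type = Frequency with case-sensitive extraction turned off.
counts = Counter(word.lower() for row in rows for word in row.split())

# Frequency threshold: only report terms seen at least this many times.
threshold = 2
terms = {t: n for t, n in counts.items() if n >= threshold}
print(terms)  # → {'ice': 2, 'maker': 2, 'model': 2, 'xx-yy3': 2, 'is': 2}
```

Even this toy version shows why tuning matters: a stop word like "is" survives the threshold, which is exactly the kind of noise the exclusion settings discussed below are for.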
Different configurations produce different results, as Figure 3 shows.
Figure 3
Notice that "Model XX-Z1" is counted three times when the term type is set to "Noun and noun phrase", even though it appears five times in the text; with the term type set to "Noun" it is counted five times. This shows that noun phrases are not always counted the way you might expect.
In the results shown in Figure 3 we can also see that "model", "Model XX-Z1", "Model XX-YY3", "ice maker", "dent", and "customer" appear frequently. From this we might already suspect that models XX-Z1 and XX-YY3 have problems that need investigating.
The results also contain terms we are not interested in, such as "model" and "customer". These noise terms can be filtered out with the settings on the Exclusion tab; the configuration is shown in Figure 4. Save the configuration and rerun the package to see the effect. The following statement creates the termexclusions table used by that tab; the table holds just two rows, "model" and "customer" (for example, INSERT INTO [dbo].[termexclusions] VALUES (N'model'), (N'customer')).
CREATE TABLE [dbo].[termexclusions] (
    [Term] [nvarchar](128) NULL
) ON [PRIMARY]
Figure 4
Figure 5 shows the final contents of the termresults table; the score column holds the number of times each term appears. If TFIDF is selected on the Advanced tab instead, the score is a decimal value such as 0.213.
Figure 5
We can also see that "Model XX-Z1" and "Model XX-YY3" no longer appear in the results, because "model" has been excluded. Another approach is to go the other way: supply a reference list of the terms you care about and count only matches against that list. That is the job of a different transformation, Term Lookup.
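The exclusion-list behavior can be sketched as a simple post-filter. This is only an illustration: among other differences, the real component also drops noun phrases that contain an excluded word, which this sketch does not do. All term names and scores below are made up:

```python
# Scored terms as the term extraction step might emit them (made-up values).
results = {"ice maker": 4, "model": 5, "customer": 3, "dent": 3}

# Exclusion list, playing the role of the rows in the termexclusions table.
exclusions = {"model", "customer"}

# Keep only terms that are not in the exclusion list.
filtered = {term: score for term, score in results.items()
            if term not in exclusions}
print(filtered)  # → {'ice maker': 4, 'dent': 3}
```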