In the previous section, we crawled nearly 70,000 records of second-hand housing data with the crawler tool. This section preprocesses that data, i.e., the so-called ETL (extract-transform-load) step.
I. Necessity of ETL tools
Data cleansing is a prerequisite for data analysis. No matter how sophisticated an algorithm is, a single bad record can throw an exception and stop the whole process, and you cannot expect the core algorithms to handle erroneous or missing data for you. Data cleansing (ETL) is therefore essential. If data analysis is the cooking, cleansing is the indispensable washing of the ingredients beforehand.
An ETL tool can be very simple or very complex. At the simple end it just converts a string to a number or applies a regular expression; at the complex end it requires a full error-logging mechanism, intelligent processing, automatic summarization, and features such as workflows and scripting. I decided on a compromise: provide the functions that are needed most and are adequate for the majority of cases. If you have truly heavy and complex ETL requirements, a professional ETL tool is the only way to go.
II. ETL plug-ins
In this system, ETL steps exist as plug-ins that the system can invoke. The built-in ETL tools cover common scenarios:
- Regular expression extraction
- Data type conversion, including time formats
- Random number and range generation
- Data segmentation, sorting, and filtering
- Noise generation
- String-to-number conversion
- Script-based data processing
- Advanced modules such as word segmentation and text sentiment analysis
All modules support further extension, and new modules can be added for new scenarios. There are two types of modules:
- Unsupervised tools: generate new data directly, with no need for existing data, such as the noise generator
- Supervised tools: produce output based on existing data, such as regular expression extraction
These ETL plug-ins are arranged in a processing queue and applied to the data in sequence, forming a chain of responsibility. This covers the vast majority of ETL needs.
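Conceptually the chain looks like the minimal sketch below. The interface and class names (IETLPlugin, ETLPipeline) are hypothetical and only illustrate the idea; they are not the tool's actual API.

    using System.Collections.Generic;

    // Minimal sketch of the plug-in chain (hypothetical names).
    public interface IETLPlugin
    {
        // Receives the rows produced by the previous plug-in and returns the transformed rows.
        IEnumerable<IDictionary<string, object>> Process(IEnumerable<IDictionary<string, object>> rows);
    }

    public class ETLPipeline
    {
        private readonly List<IETLPlugin> plugins = new List<IETLPlugin>();

        public void Add(IETLPlugin plugin) => plugins.Add(plugin);

        // Plug-ins are applied strictly in the order they were added (chain of responsibility).
        public IEnumerable<IDictionary<string, object>> Run(IEnumerable<IDictionary<string, object>> rows)
        {
            foreach (var plugin in plugins)
                rows = plugin.Process(rows);
            return rows;
        }
    }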
III. Walkthrough
Next, we use the Beijing house price sample data as an example of ETL cleansing. The data was crawled from the web; for details of the crawling process, see the previous section.
1. Data Import and observation
First, click "import" in the Data Manager to import the XML data file, as shown in:
Then you can view the data in the normal list:
The data is quite regular, and we can draw up the following cleaning plan:
- Filter the data to keep only complete records, since some records are incomplete; for example, the area column may be empty.
- The price is given as "xxxxx RMB/square meter"; to compute statistics, the number has to be extracted with a regular expression.
- The number of floors in "attribute 3" can be obtained by splitting on "/": the second item is the floor count, and the last item says whether the building is a slab or a tower (a small sketch after this list shows these extractions).
- The construction year in "attribute 3" can be extracted with a regular expression.
- In the "coordinate" field, the coordinates, administrative district, residential area name, and other information can easily be extracted by splitting the string.
- Others...
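Two of the items above, the price and the floor information, can be illustrated with a short sketch in plain C#. The sample values below are made up for illustration; inside the tool the corresponding plug-ins do the same work.

    using System;
    using System.Text.RegularExpressions;

    class PlanSketch
    {
        static void Main()
        {
            // Hypothetical raw values in the same shape as the crawled columns.
            string rawPrice = "56000 RMB/square meter";
            string rawAttr3 = "middle floor/20 floors in total/tower";

            // Price: pull out the digits and convert them to an int.
            int price = int.Parse(Regex.Match(rawPrice, @"\d+").Value);

            // Attribute 3: split on "/"; the second item is the floor count,
            // the last item says slab or tower.
            string[] parts = rawAttr3.Split('/');
            string totalFloors = parts[1];
            string buildingType = parts[parts.Length - 1];

            Console.WriteLine($"{price} | {totalFloors} | {buildingType}");
        }
    }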
2. Write a data filtering script
First, perform the data filtering: drag the "data filtering" plug-in into the algorithm panel:
For convenience, we only take the first 1000 records as the analysis target.
The key step is to write a user-defined filter script, whose syntax is similar to C#:
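The script used in this example also appears in the saved task configuration shown at the end of this section; it simply drops records whose price column is empty:

    // Returning true filters the record out; otherwise the record is kept.
    return string.IsNullOrEmpty(item['price']);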
Click "Start processing" to obtain the first 1000 records that satisfy the condition.
3. Configure the ETL
Similar to the previous step, drag the "data generation and preprocessing" module to the algorithm manager.
Select "all" as the data source of this module ":
Click the command column to configure the ETL plug-ins:
The settings editor lists all the ETL plug-ins loaded by the system. First, we extract the price:
Select the original column name and enter a new column name. If "add to new column" is not checked, the original data is overwritten. Enter the regular expression \d+ to extract the price and select int as the target type; after conversion, the column's data type becomes int.
Similarly, the construction year can be extracted with the regular expression \d{4}.
Since the extracted construction year is still a string, we add another tool that converts the string to a datetime (format "yyyy") and overwrites the original column.
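In plain C#, these two steps (extracting the four-digit year with \d{4}, then converting the string with the "yyyy" format) amount to roughly the following sketch; the raw value is made up for illustration.

    using System;
    using System.Globalization;
    using System.Text.RegularExpressions;

    class YearSketch
    {
        static void Main()
        {
            // Hypothetical raw value from the "attribute 3" column.
            string rawAttr3 = "middle floor/20 floors in total/built in 2003/tower";

            // Step 1: the regex \d{4} extracts the four-digit construction year as a string.
            string yearText = Regex.Match(rawAttr3, @"\d{4}").Value;

            // Step 2: string-to-datetime conversion using the "yyyy" format.
            DateTime year = DateTime.ParseExact(yearText, "yyyy", CultureInfo.InvariantCulture);

            Console.WriteLine(year.Year);  // 2003
        }
    }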
Next, extract the geographic coordinates: inspecting the values shows that regular expression matching works here as well, so the details are omitted.
Next is the administrative district, which also sits in the "coordinate" attribute. Regular expression matching is inconvenient here, so we use series partitioning instead: the attribute is split on a separator character and the item at a fixed position is taken. Inspection shows the items are comma-separated and the district is the 4th item.
Similarly, the residential area name can be extracted; the only difference from the administrative district is that the matching index is set to 2.
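In plain C#, the series partitioning for these two columns boils down to a split and an index lookup, roughly as sketched below; the coordinate value is a made-up example of the format, with the real values coming from the crawled data.

    using System;

    class PartitionSketch
    {
        static void Main()
        {
            // Hypothetical "coordinate" value: comma-separated items where item 2 is the
            // residential area name and item 3 is the administrative district,
            // matching the indexes used in the ETL configuration.
            string coordinate = "116.417,39.928,some residential area,some district";

            string[] parts = coordinate.Split(',');
            string cellName = parts[2];   // residential area name (index 2)
            string district = parts[3];   // administrative district (index 3)

            Console.WriteLine($"{cellName} | {district}");
        }
    }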
The final ETL process is shown below:
The ETL plug-ins are executed sequentially. If a plug-in depends on the result of another plug-in, it must be placed after the plug-in it depends on.
Click "Start processing" to run the ETL process. All error logs are saved in a dedicated dataset for further analysis and handling. No errors occurred in this run, and the system reports that the ETL process completed successfully. The new columns produced by the ETL are shown below:
To be safe, we save all of the settings used this time as a task named "ETL cleaning task". The next time the same kind of data is encountered, the task can simply be run again without reconfiguring anything.
The XML saved for this task can be seen in the generated system configuration file:
<Doc name = "ETL data cleansing task" Description = "task description" group = ""> <nodes> <children x = "-2.01" Y = "0" z =" 0 "group =" 0 "Key =" data filtering _ 1 "Weight =" 1 "> <data collection =" all "name =" data filtering "newdatasetname =" "scriptcode = ""Return string. isnullorempty (item ['price']); // 1 indicates filtering the data, otherwise, "canremove =" true "Start =" 0 "End =" 1000 "layer =" 0 "/> </children> <children x =" 3 "Y = "0" z = "0" group = "1" Key = "data generation and preprocessing _ 2" Weight = "1"> <data collection = "all" name = "Data generate and pre-process "size =" 1001 "layer =" 0 "> <children type =" regular filter "collumname =" price "newcollumname =" new price "isaddnewcollum =" true "targetdatatype = "int" scriptcode = "\ D +" Index = "0"/> <children type = "regular filter" collumname = "attribute 3" newcollumname = "year" isaddnewcollum =" true "targetdatatype =" string "scriptcode =" \ D {4} "Index =" 0 "/> <children type =" String Conversion time "collumname =" year "newcollumname =" "isaddnewcollum =" false "targetdatatype =" datetime "format =" YYYY "/> <children type =" collumname = "coordinate" newcollumname = "area" isaddnewcollum = "true "targetdatatype =" string "splitchar = ", "Index =" 3 "splitpause =" splitpause "/> <children type =" series partitioning "collumname =" coordinate "newcollumname =" cell name "isaddnewcollum =" true "targetdatatype =" string "splitchar = ", "Index =" 2 "splitpause =" splitpause "/> <children type =" regular filter "collumname =" coordinate "newcollumname =" lag "isaddnewcollum =" true "targetdatatype =" double "scriptcode =" \ D {2 }. \ D + "Index =" 0 "/> <children type =" regular filter "collumname =" coordinate "newcollumname =" LNG "isaddnewcollum =" true "targetdatatype =" double "scriptcode = "\ D {3 }. \ D + "Index =" 0 "/> </data> </children> </nodes> <paths/> </DOC>
IV. Summary
This section described how to use the tool for ETL cleansing. The next section will formally begin the data analysis. Stay tuned.