Python Data Cleansing Series: String Processing in Detail

Source: Internet
Author: User
Preface

Data cleansing is complex and tedious work, and it is also the most important part of the entire data analysis process. It is often said that 80% of the time in an analysis project goes to cleaning data; this sounds strange, but it holds true in practice. Data cleansing has two purposes: the first is to make the data usable, and the second is to make the data better suited to the subsequent analysis. In other words, "dirty" data has to be cleaned, and clean data has to be cleaned too.

In data analysis, and especially in text analysis, string processing takes a lot of effort, so understanding string processing is an important skill for data analysis.

String processing methods

First, let's look at the basic methods, beginning with the string's split method.

str = 'i like apple,i like bananer'
print(str.split(','))

Splitting the string str on commas gives:

['i like apple', 'i like bananer']

print(str.split(' '))

Splitting on spaces gives:

['i', 'like', 'apple,i', 'like', 'bananer']

print(str.index(','))
print(str.find(','))

Both calls return the same position:

12

If the substring is not found, index raises a ValueError, while find returns -1.
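The difference between index and find when nothing is found can be sketched as follows. The helper name safe_index is hypothetical and only serves the illustration; the sample string mirrors the one used above.

```python
text = 'i like apple,i like bananer'


def safe_index(s, sub):
    """Return the position of sub in s, or None if sub is absent."""
    try:
        return s.index(sub)
    except ValueError:
        # index signals "not found" with an exception
        return None


found_with_find = text.find(';')          # find signals "not found" with -1
found_with_index = safe_index(text, ';')  # the ValueError was caught above
comma_position = text.find(',')

print(found_with_find, found_with_index, comma_position)
```

Which of the two to use is a matter of taste: find suits a quick check, while index suits code where a missing substring should be treated as an error.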

print(str.count('i'))

The result is:

4

count returns the number of times the target substring occurs in the string.
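Two details of count are easy to overlook: it is case-sensitive, and it counts non-overlapping occurrences. A minimal sketch, using an assumed variant of the sample string with a capital "I" at the start:

```python
# Assumed variant of the sample string, with an upper-case "I" at the start
text = 'I like apple,i like bananer'

lower_only = text.count('i')             # the capital "I" is not counted
any_case = text.lower().count('i')       # normalize the case first, then count
non_overlapping = 'aaaa'.count('aa')     # matches are not allowed to overlap

print(lower_only, any_case, non_overlapping)
```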

print(str.replace(',', ' ').split(' '))

The result is:

['i', 'like', 'apple', 'i', 'like', 'bananer']

Here replace substitutes a space for the comma, and split then breaks the string on spaces, so that each word is taken out on its own.
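The replace-then-split approach relies on the words being separated by exactly one space. As a hedged sketch, assuming a variant of the sample string that contains a double space, calling split with no argument is a more forgiving choice because it treats any run of whitespace as one separator:

```python
# Assumed variant of the sample string: note the two spaces after "like"
text = 'i like  apple,i like bananer'

# Splitting on a single space leaves an empty string where the spaces doubled up
naive = text.replace(',', ' ').split(' ')

# split() with no argument collapses runs of whitespace
robust = text.replace(',', ' ').split()

print(naive)
print(robust)
```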

Beyond these common methods, the more powerful tool for string processing is the regular expression.

Regular expressions

Before using regular expressions, we need to understand the methods available for working with them.

Let's look at how these methods are used, starting with the difference between the match and search methods.

import re
str = "Cats is smarter than Dogs"
pattern = re.compile(r'(.*) is (.*?) .*')
result = re.match(pattern, str)
# group(0) is the whole match; group(1) onward are the captured groups
for i in range(len(result.groups()) + 1):
    print(result.group(i))

The result is:

Cats is smarter than Dogs
Cats
smarter

With this pattern, the match and search methods return the same result, because the pattern matches from the very start of the string.

Now change the pattern:

pattern = re.compile(r'is (.*?) .*')

match returns None, because it only matches at the beginning of the string, while search scans the whole string and returns:

is smarter than Dogs
smarter
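The contrast between match and search can be checked directly. The following is a minimal sketch using the second pattern; the variable names are chosen for illustration only.

```python
import re

text = "Cats is smarter than Dogs"
pattern = re.compile(r'is (.*?) .*')

# match only succeeds if the pattern matches at position 0
match_result = pattern.match(text)

# search tries every position and returns the first match it finds
search_result = pattern.search(text)

# anchoring the pattern with ^ makes search behave like match here
anchored_result = re.search(r'^is (.*?) .*', text)

print(match_result)
print(search_result.group(0), '|', search_result.group(1))
print(anchored_result)
```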

Next, let's look at how some of the other methods are used.

str = "138-9592-5592 # number"
pattern = re.compile(r'#.*$')
number = re.sub(pattern, '', str)
print(number)

The result is:

138-9592-5592

Here the number is extracted by replacing everything from the # onward with an empty string.

We can also remove the hyphens from the number:

print(re.sub(r'-', '', number))

The result is:

13895925592
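Note that removing the comment with the pattern above leaves the space before the # in place. One possible variation, sketched here under the assumption that only the digits are wanted, strips the leftover whitespace and then removes every non-digit character with \D:

```python
import re

raw = "138-9592-5592 # number"

# Drop the trailing comment, then trim the space that preceded the '#'
without_comment = re.sub(r'#.*$', '', raw).strip()

# \D matches any character that is not a digit
digits_only = re.sub(r'\D', '', without_comment)

print(repr(without_comment))
print(digits_only)
```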

We can also use the findall method to return every substring that matches:

str = "138-9592-5592 # number"
pattern = re.compile(r'5')
print(pattern.findall(str))

The result is:

['5', '5', '5']
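findall becomes more useful with richer patterns. As a small sketch on the same sample string: without groups it returns the matched substrings, with several groups it returns tuples, and finditer (a related function in the re module) gives access to the match positions.

```python
import re

raw = "138-9592-5592 # number"

# No groups: a list of the matched substrings
blocks = re.findall(r'\d+', raw)

# Several groups: a list of tuples, one tuple per match
parts = re.findall(r'(\d+)-(\d+)-(\d+)', raw)

# finditer yields match objects, so positions are available too
positions = [m.start() for m in re.finditer(r'5', raw)]

print(blocks)
print(parts)
print(positions)
```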

Regular expressions cover far more than this. To use them well, we need to understand the rules for matching strings.
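A few of the most common matching rules can be illustrated as follows. This is only a small selection chosen for illustration, not a complete reference, and the sample strings are made up.

```python
import re

# \d matches a digit, \w a word character; + means "one or more"
digits = re.findall(r'\d+', 'a1b22c333')
words = re.findall(r'\w+', 'hi, there')

# [a-z] is a character class; {2,4} means "between 2 and 4 repetitions"
short_suffix = re.fullmatch(r'[a-z]{2,4}', 'com')
long_suffix = re.fullmatch(r'[a-z]{2,4}', 'company')

# ^ anchors at the start of the string, $ at the end
starts_with_cat = re.search(r'^cat', 'cat dog')
ends_with_cat = re.search(r'cat$', 'cat dog')

# .* is greedy (as much as possible), .*? is lazy (as little as possible)
greedy = re.match(r'<.*>', '<a><b>').group()
lazy = re.match(r'<.*?>', '<a><b>').group()

print(digits, words)
print(bool(short_suffix), bool(long_suffix))
print(bool(starts_with_cat), bool(ends_with_cat))
print(greedy, lazy)
```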

Vectorized String Functions

When cleaning up scattered data to be analyzed, it is often necessary to do some string normalization work.

import numpy as np
import pandas as pd
data = pd.Series({'Li': '120@qq.com', 'Wang': '5632@qq.com', 'Chen': '8622@xinlang.com', 'Zhao': np.nan, 'Sun': '5243@gmail.com'})
print(data)

The result is:

Some of the vectorized methods allow a preliminary check of the data; for example, contains tests whether each element contains a keyword.

print(data.str.contains('@'))

The result is:
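Because the Series contains a missing value, it is worth deciding how contains should treat it. The sketch below assumes the same sample data and uses the na parameter of str.contains to count missing entries as False, which makes the result usable as a filter. Without na, the missing entry is typically reported as NaN for object-dtype data.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Li': '120@qq.com', 'Wang': '5632@qq.com',
                  'Chen': '8622@xinlang.com', 'Zhao': np.nan,
                  'Sun': '5243@gmail.com'})

# Treat the missing address as "does not contain '@'"
has_at = data.str.contains('@', na=False)

# A boolean result without missing values can be used to filter rows
qq_mask = data.str.contains('qq', na=False)
qq_users = data[qq_mask]

print(has_at)
print(qq_users)
```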

You can also break each string into its parts to extract the pieces you need.

import re
data = pd.Series({'Li': '120@qq.com', 'Wang': '5632@qq.com', 'Chen': '8622@xinlang.com', 'Zhao': np.nan, 'Sun': '5243@gmail.com'})
pattern = re.compile(r'(\d*)@([a-z]+)\.([a-z]{2,4})')
# in recent pandas versions, str.match returns only True/False for each element
result = data.str.match(pattern)
# the findall method can also be used here; it returns the captured groups
result = data.str.findall(pattern)
print(result)

The result is:

Chen    [(8622, xinlang, com)]
Li            [(120, qq, com)]
Sun       [(5243, gmail, com)]
Wang         [(5632, qq, com)]
Zhao                       NaN
dtype: object

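Now suppose we need the part of each mailbox address before the @. Each element of result is a list of tuples, so we take the first tuple and then its first item:

print(result.str.get(0).str.get(0))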

The result is:

Or suppose we need the domain name that the mailbox belongs to, which is the second item of the tuple:

print(result.str.get(0).str.get(1))

The result is:
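In recent pandas versions, the same pieces can also be pulled out in one step with str.extract, which returns one column per capture group. The sketch below is an alternative to the approach above, not the original author's method; the group names name, domain and suffix are assumptions introduced for readability.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Li': '120@qq.com', 'Wang': '5632@qq.com',
                  'Chen': '8622@xinlang.com', 'Zhao': np.nan,
                  'Sun': '5243@gmail.com'})

# Named groups (?P<...>) become the column names of the resulting DataFrame
parts = data.str.extract(
    r'(?P<name>\d*)@(?P<domain>[a-z]+)\.(?P<suffix>[a-z]{2,4})'
)

names = parts['name']
domains = parts['domain']

print(parts)
```

Rows that are missing or do not match the pattern come back as missing values in every column.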

Of course, the pieces can also be extracted by slicing, but because the parts vary in length, a fixed slice does not extract them accurately.

data = pd.Series({'Li': '120@qq.com', 'Wang': '5632@qq.com', 'Chen': '8622@xinlang.com', 'Zhao': np.nan, 'Sun': '5243@gmail.com'})
print(data.str[:6])

The result is:
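To see why a fixed slice is inaccurate, compare it with splitting on the @ sign. This is a minimal sketch on the same sample data; splitting on '@' is one possible alternative, assuming every valid address contains exactly one @.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Li': '120@qq.com', 'Wang': '5632@qq.com',
                  'Chen': '8622@xinlang.com', 'Zhao': np.nan,
                  'Sun': '5243@gmail.com'})

# A fixed slice cuts every address at the same position
sliced = data.str[:6]

# Splitting on '@' adapts to the length of each name
local_part = data.str.split('@').str.get(0)

print(sliced)
print(local_part)
```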

Finally, let's go over the vectorized string methods.
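As a rough illustration, a handful of commonly used methods of the .str accessor are sketched below. The selection and the sample data (with added whitespace and mixed case, and the example.com replacement) are assumptions made for this example; pandas offers many more such methods.

```python
import numpy as np
import pandas as pd

# Assumed sample with untidy whitespace and mixed case
data = pd.Series({'Li': ' 120@QQ.com ', 'Wang': '5632@qq.com', 'Zhao': np.nan})

# strip removes surrounding whitespace, lower normalizes the case
cleaned = data.str.strip().str.lower()

# len gives the length of each string
lengths = cleaned.str.len()

# replace substitutes text; regex=False treats the pattern literally
masked = cleaned.str.replace('@qq.com', '@example.com', regex=False)

# startswith / endswith test prefixes and suffixes
ends_with_com = cleaned.str.endswith('.com')

print(cleaned)
print(lengths)
print(masked)
print(ends_with_com)
```

Each method is applied element by element, and missing values are passed through rather than raising an error.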
