Pandas Cheats "seventh chapter"

Source: Internet
Author: User

Original: Chapter 7

# usual opening
%matplotlib inline

import pandas as PD
import matplotlib.pyplot as Plt
import NumPy as NP

# make diagram Table bigger and prettier
pd.set_option (' Display.mpl_style ', ' Default ')
plt.rcparams[' figure.figsize '] = (5)
plt.rcparams[' font.family ' = ' sans-serif '

# need to show a lot of columns in Pandas 0.12 
# in Pandas 0.13 you don't need
pd.set_option (' Display.width ', 5000) 
pd.set_option (' Display.max_columns ', 60)

One of the main problems with messy data is: How do you know if it's messy?

Here we will use the NYC 311 service to request the dataset because it is large and somewhat inconvenient.

Requests = Pd.read_csv ('.. /data/311-service-requests.csv ')
7.1 How do I know if it's messy?

Let's look at a few columns here. I know there are some problems with the ZIP code, so let's look at it first.

To see if there is a problem with the column, I usually use. Unique () to view all the values. If it's a column of numbers, I'll draw a histogram to get the feel of the distribution.

When we look at the unique values in the incident zip, it will soon be clear that this is a mess.

Some problems: Some have been parsed into strings, some are floating-point presence Nan part zip code 29616-0759 or 83 some pandas unrecognized n/A values, such as ' N/a ' and ' no clue '

What we can do: normalize N/A and no clue to Nan values see what happened at 83 and decide what to do to convert everything into a string

requests[' Incident Zip '].unique ()
Array ([11432.0, 11378.0, 10032.0, 10023.0, 10027.0, 11372.0, 11419.0, 11417.0, 10011.0, 11225.0, 11218.0, 10003.0, 10029.0, 10466.0, 11219.0, 10025.0, 10310.0, 11236.0, Nan, 10033.0, 11216.0, 10016.0, 10305.0, 10312.0, 1002 6.0, 10309.0, 10036.0, 11433.0, 11235.0, 11213.0, 11379.0, 11101.0, 10014.0, 11231.0, 11234.0, 10457.0, 1045 9.0, 10465.0, 11207.0, 10002.0, 10034.0, 11233.0, 10453.0, 10456.0, 10469.0, 11374.0, 11221.0, 11421.0, 11215.0, 10 007.0, 10019.0, 11205.0, 11418.0, 11369.0, 11249.0, 10005.0, 10009.0, 11211.0, 11412.0, 10458.0, 11229.0, 10 065.0, 10030.0, 11222.0, 10024.0, 10013.0, 11420.0, 11365.0, 10012.0, 11214.0, 11212.0, 10022.0, 11232.0, 11 040.0, 11226.0, 10281.0, 11102.0, 11208.0, 10001.0, 10472.0, 11414.0, 11223.0, 10040.0, 11220.0, 11373.0, 11 203.0, 11691.0, 11356.0, 10017.0, 10452.0, 10280.0, 11217.0, 10031.0, 11201.0, 11358.0, 10128.0, 11423.0, 10039.0, 10010.0, 11209.0, 10021.0, 10037.0, 11413.0, 11375.0, 11238.0, 10473.0, 11103.0, 11354.0, 11361.0, 11106.0, 11385.0, 10463.0, 10467. 0, 11204.0, 11237.0, 11377.0, 11364.0, 11434.0, 11435.0, 11210.0, 11228.0, 11368.0, 11694.0, 10464.0, 11415. 0, 10314.0, 10301.0, 10018.0, 10038.0, 11105.0, 11230.0, 10468.0, 11104.0, 10471.0, 11416.0, 10075.0, 11422. 0, 11355.0, 10028.0, 10462.0, 10306.0, 10461.0, 11224.0, 11429.0, 10035.0, 11366.0, 11362.0, 11206.0, 10460. 0, 10304.0, 11360.0, 11411.0, 10455.0, 10475.0, 10069.0, 10303.0, 10308.0, 10302.0, 11357.0, 10470.0, 11367.0, 1137 0.0, 10454.0, 10451.0, 11436.0, 11426.0, 10153.0, 11004.0, 11428.0, 11427.0, 11001.0, 11363.0, 10004.0, 1047 4.0, 11430.0, 10000.0, 10307.0, 11239.0, 10119.0, 10006.0, 10048.0, 11697.0, 11692.0, 11693.0, 10573.0, 83.0
       , 11559.0, 10020.0, 77056.0, 11776.0, 70711.0, 10282.0, 11109.0, 10044.0, ' 10452 ', ' 11233 ', ' 10468 ', ' 10310 ', ' 11105 ', ' 10462 ', ' 10029 ', ' 10301 ', ' 10457 ', ' 10467 ', ' 10469 ', ' 11225 ', ' 10035 ', ' 10031 ', ' 11226 ', ' 10454 ', ' 11221 ', ' 10025 ', ' 112 29 ', ' 11235 ', ' 11422 ', ' 10472 ', ' 11208 ', ' 11102 ', ' 10032 ', ' 11216 ', ' 10473 ', ' 10463 ', ' 11213 ', ' 10040 ', ' 10302 ', ' 1 ' 1231 ', ' 10470 ', ' 11204 ', ' 11104 ', ' 11212 ', ' 10466 ', ' 11416 ', ' 11214 ', ' 10009 ', ' 11692 ', ' 11385 ', ' 11423 ', ' 1 1201 ', ' 10024 ', ' 11435 ', ' 10312 ', ' 10030 ', ' 11106 ', ' 10033 ', ' 10303 ', ' 11215 ', ' 11222 ', ' 11354 ', ' 10016 ', ' 1 0034 ', ' 11420 ', ' 10304 ', ' 10019 ', ' 11237 ', ' 11249 ', ' 11230 ', ' 11372 ', ' 11207 ', ' 11378 ', ' 11419 ', ' 11361 ', ' 1 0011 ', ' 11357 ', ' 10012 ', ' 11358 ', ' 10003 ', ' 10002 ', ' 11374 ', ' 10007 ', ' 11234 ', ' 10065 ', ' 11369 ', ' 11434 ', ' 11205 ', ' 11206 ', ' 11415 ', ' 11236 ', ' 11218 ', ' 11413 ', ' 10458 ', ' 11101 ', ' 10306 ', ' 11355 ', ' 10023 ', ' 11368 ', ' 10314 ', ' 11421 ', ' 10010 ', ' 10018 ', ' 11223 ', ' 10455 ', ' 11377 ', ' 11433 ', ' 11375 ', ' 10037 ', ' 11209 ', ' 10459 ', ' 10128 ', ' 10014 ', ' 10282 ', ' 11373', ' 10451 ', ' 11238 ', ' 11211 ', ' 10038 ', ' 11694 ', ' 11203 ', ' 11691 ', ' 11232 ', ' 10305 ', ' 10021 ', ' 11228 ', ' 10036 ', ' 10001 ', ' 10017 ', ' 11217 ', ' 11219 ', ' 10308 ', ' 10465 ', ' 11379 ', ' 11414 ', ' 10460 ', ' 11417 ', ' 11220 ', ' 11366 ', ' 10027 ', ' 11370 ', ' 10309 ', ' 11412 ', ' 11356 ', ' 10456 ', ' 11432 ', ' 10022 ', ' 10013 ', ' 11367 ', ' 11040 ', ' 10026 ', ' 104 75 ', ' 11210 ', ' 11364 ', ' 11426 ', ' 10471 ', ' 10119 ', ' 11224 ', ' 11418 ', ' 11429 ', ' 11365 ', ' 10461 ', ' 11239 ', ' 100 39 ', ' 00083 ', ' 11411 ', ' 10075 ', ' 11004 ', ' 11360 ', ' 10453 ', ' 10028 ', ' 11430 ', ' 10307 ', ' 11103 ', ' 10004 ', ' 100 69 ', ' 10005 ', ' 10474 ', ' 11428 ', ' 11436 ', ' 10020 ', ' 11001 ', ' 11362 ', ' 11693 ', ' 10464 ', ' 11427 ', ' 10044 ', ' 113 63 ', ' 10006 ', ' 10000 ', ' 02061 ', ' 77092-2016 ', ' 10280 ', ' 11109 ', ' 14225 ', ' 55164-0737 ', ' 19711 ', ' 07306 ', ' 000000 ', ' NO clue ', ' 90010 ', ' 10281 ', ' 11747 ', ' 23541 ', ' 11776 ', ' 11697 ', ' 11788 ', ' 07604 ', 10112.0, 11788.0, 11563.0 , 11580.0, 7087.0, 11042.0, 7093.0, 11501.0, 92123.0, 0.0, 11575.0, 7109.0, 11797.0, ' 10803 ', ' 11716 ', ' 11722 ', ' 11549-3650 ', ' 10162 ' ', ' 92123 ', ' 23502 ', ' 11518 ', ' 07020 ', ' 08807 ', ' 11577 ', ' 07114 ', ' 11003 ', ' 07201 ', ' 11563 ', ' 61702 ', ' 10103
       ', ' 29616-0759 ', ' 35209-3114 ', ' 11520 ', ' 11735 ', ' 10129 ', ' 11005 ', ' 41042 ', ' 11590 ', 6901.0, 7208.0, 11530.0, 13221.0, 10954.0, 11735.0, 10103.0, 7114.0, 11111.0, 10107.0], Dtype=object)
7.3 Fixing Nan Value and string/floating-point confusion

We can then pass the na_values options to pd.read_csv to clean them up. We can also specify that the type of the incident zip is a string rather than a floating-point.

Na_values = [' NO clue ', ' N/A ', ' 0 ']
requests = Pd.read_csv ('. /data/311-service-requests.csv ', na_values=na_values, dtype={' Incident Zip ': str})
requests[' Incident Zip '].unique ()
Array ([' 11432 ', ' 11378 ', ' 10032 ', ' 10023 ', ' 10027 ', ' 11372 ', ' 11419 ', ' 11417 ', ' 10011 ', ' 11225 ', ' 11218 ', ' 10003 ') ' 10029 ', ' 10466 ', ' 11219 ', ' 10025 ', ' 10310 ', ' 11236 ', Nan, ' 10033 ', ' 11216 ', ' 10016 ', ' 10305 ', ' 10312 ', ' 100 26 ', ' 10309 ', ' 10036 ', ' 11433 ', ' 11235 ', ' 11213 ', ' 11379 ', ' 11101 ', ' 10014 ', ' 11231 ', ' 11234 ', ' 10457 ', ' 104 59 ', ' 10465 ', ' 11207 ', ' 10002 ', ' 10034 ', ' 11233 ', ' 10453 ', ' 10456 ', ' 10469 ', ' 11374 ', ' 11221 ', ' 11421 ', ' 11215 ', ' 1 ' 0007 ', ' 10019 ', ' 11205 ', ' 11418 ', ' 11369 ', ' 11249 ', ' 10005 ', ' 10009 ', ' 11211 ', ' 11412 ', ' 10458 ', ' 11229 ', ' 1 0065 ', ' 10030 ', ' 11222 ', ' 10024 ', ' 10013 ', ' 11420 ', ' 11365 ', ' 10012 ', ' 11214 ', ' 11212 ', ' 10022 ', ' 11232 ', ' 1 1040 ', ' 11226 ', ' 10281 ', ' 11102 ', ' 11208 ', ' 10001 ', ' 10472 ', ' 11414 ', ' 11223 ', ' 10040 ', ' 11220 ', ' 11373 ', ' 1 1203 ', ' 11691 ', ' 11356 ', ' 10017 ', ' 10452 ', ' 10280 ', ' 11217 ', ' 10031 ', ' 11201 ', ' 11358 ', ' 10128 ', ' 11423 ', ' 10039 ', ' 10010 ', ' 11209 ', ' 10021 ', ' 10037 ', ' 11413 ', ' 11375 ', ' 11238 ', ' 10473 ', ' 11103 ', ' 11354 ', ' 11361 ', ' 11106 ', ' 11385 ', ' 10463 ', ' 10467 ', ' 11204 ', ' 11237 ', ' 11377 ', ' 11364 ', ' 11434 ', ' 11435 ', ' 11210 ', ' 11228 ', ' 11368 ', ' 11694 ', ' 10464 ', ' 11415 ', ' 10314 ', ' 10301 ', ' 10018 ', ' 10038 ', ' 11105 ', ' 11230 ', ' 10468 ', ' 11104 ', ' 10471 ', ' 11416 ', ' 10075 ', ' 11422 ', ' 11355 ', ' 10028 ', ' 10462 ', ' 10306 ', ' 10461 ', ' 11224 ', ' 11429 ', ' 10035 ', ' 11366 ', ' 11362 ', ' 11206 ', ' 10460 ', ' 10304 ', ' 11360 ', ' 11411 ', ' 10455 ', ' 10475 ', ' 10069 ', ' 10303 ', ' 10308 ', ' 10302 ', ' 11357 ', ' 10470 ', ' 11367 ', ' 113 70 ', ' 10454 ', ' 10451 ', ' 11436 ', ' 11426 ', ' 10153 ', ' 11004 ', ' 11428 ', ' 11427 ', ' 11001 ', ' 11363 ', ' 10004 ', ' 104 74 ', ' 11430 ', ' 10000 ', ' 10307 ', ' 11239 ', ' 10119 ', ' 10006 ', ' 10048 ', ' 11697 ', ' 11692 ', ' 11693 ', ' 10573 ', ' 000
       83 ', ' 11559 ', ' 10020 ', ' 77056 ', ' 11776 ', ' 70711 ', ' 10282 ', ' 11109 ', ' 10044 ', ' 02061 ', ' 77092-2016 ', ' 14225 ', ' 55164-0737 ', ' 19711 ',' 07306 ', ' 000000 ', ' 90010 ', ' 11747 ', ' 23541 ', ' 11788 ', ' 07604 ', ' 10112 ', ' 11563 ', ' 11580 ', ' 07087 ', ' 11042 ', ' 07093 ', ' 11501 ', ' 92123 ', ' 00000 ', ' 11575 ', ' 07109 ', ' 11797 ', ' 10803 ', ' 11716 ', ' 11722 ', ' 11549-3650 ', ' 10162 ', ' 23502 ', ' 11518 ', ' 07020 ', ' 08807 ', ' 11577 ', ' 07114 ', ' 11003 ', ' 07201 ', ' 61702 ', ' 10103 ', ' 29616-0759 ', ' 3520 9-3114 ', ' 11520 ', ' 11735 '

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.