Original: Chapter 7
# usual opening
%matplotlib inline
import pandas as PD
import matplotlib.pyplot as Plt
import NumPy as NP
# make diagram Table bigger and prettier
pd.set_option (' Display.mpl_style ', ' Default ')
plt.rcparams[' figure.figsize '] = (5)
plt.rcparams[' font.family ' = ' sans-serif '
# need to show a lot of columns in Pandas 0.12
# in Pandas 0.13 you don't need
pd.set_option (' Display.width ', 5000)
pd.set_option (' Display.max_columns ', 60)
One of the main problems with messy data is: How do you know if it's messy?
Here we will use the NYC 311 service to request the dataset because it is large and somewhat inconvenient.
Requests = Pd.read_csv ('.. /data/311-service-requests.csv ')
7.1 How do I know if it's messy?
Let's look at a few columns here. I know there are some problems with the ZIP code, so let's look at it first.
To see if there is a problem with the column, I usually use. Unique () to view all the values. If it's a column of numbers, I'll draw a histogram to get the feel of the distribution.
When we look at the unique values in the incident zip, it will soon be clear that this is a mess.
Some problems: Some have been parsed into strings, some are floating-point presence Nan part zip code 29616-0759 or 83 some pandas unrecognized n/A values, such as ' N/a ' and ' no clue '
What we can do: normalize N/A and no clue to Nan values see what happened at 83 and decide what to do to convert everything into a string
requests[' Incident Zip '].unique ()
Array ([11432.0, 11378.0, 10032.0, 10023.0, 10027.0, 11372.0, 11419.0, 11417.0, 10011.0, 11225.0, 11218.0, 10003.0, 10029.0, 10466.0, 11219.0, 10025.0, 10310.0, 11236.0, Nan, 10033.0, 11216.0, 10016.0, 10305.0, 10312.0, 1002 6.0, 10309.0, 10036.0, 11433.0, 11235.0, 11213.0, 11379.0, 11101.0, 10014.0, 11231.0, 11234.0, 10457.0, 1045 9.0, 10465.0, 11207.0, 10002.0, 10034.0, 11233.0, 10453.0, 10456.0, 10469.0, 11374.0, 11221.0, 11421.0, 11215.0, 10 007.0, 10019.0, 11205.0, 11418.0, 11369.0, 11249.0, 10005.0, 10009.0, 11211.0, 11412.0, 10458.0, 11229.0, 10 065.0, 10030.0, 11222.0, 10024.0, 10013.0, 11420.0, 11365.0, 10012.0, 11214.0, 11212.0, 10022.0, 11232.0, 11 040.0, 11226.0, 10281.0, 11102.0, 11208.0, 10001.0, 10472.0, 11414.0, 11223.0, 10040.0, 11220.0, 11373.0, 11 203.0, 11691.0, 11356.0, 10017.0, 10452.0, 10280.0, 11217.0, 10031.0, 11201.0, 11358.0, 10128.0, 11423.0, 10039.0, 10010.0, 11209.0, 10021.0, 10037.0, 11413.0, 11375.0, 11238.0, 10473.0, 11103.0, 11354.0, 11361.0, 11106.0, 11385.0, 10463.0, 10467. 0, 11204.0, 11237.0, 11377.0, 11364.0, 11434.0, 11435.0, 11210.0, 11228.0, 11368.0, 11694.0, 10464.0, 11415. 0, 10314.0, 10301.0, 10018.0, 10038.0, 11105.0, 11230.0, 10468.0, 11104.0, 10471.0, 11416.0, 10075.0, 11422. 0, 11355.0, 10028.0, 10462.0, 10306.0, 10461.0, 11224.0, 11429.0, 10035.0, 11366.0, 11362.0, 11206.0, 10460. 0, 10304.0, 11360.0, 11411.0, 10455.0, 10475.0, 10069.0, 10303.0, 10308.0, 10302.0, 11357.0, 10470.0, 11367.0, 1137 0.0, 10454.0, 10451.0, 11436.0, 11426.0, 10153.0, 11004.0, 11428.0, 11427.0, 11001.0, 11363.0, 10004.0, 1047 4.0, 11430.0, 10000.0, 10307.0, 11239.0, 10119.0, 10006.0, 10048.0, 11697.0, 11692.0, 11693.0, 10573.0, 83.0
, 11559.0, 10020.0, 77056.0, 11776.0, 70711.0, 10282.0, 11109.0, 10044.0, ' 10452 ', ' 11233 ', ' 10468 ', ' 10310 ', ' 11105 ', ' 10462 ', ' 10029 ', ' 10301 ', ' 10457 ', ' 10467 ', ' 10469 ', ' 11225 ', ' 10035 ', ' 10031 ', ' 11226 ', ' 10454 ', ' 11221 ', ' 10025 ', ' 112 29 ', ' 11235 ', ' 11422 ', ' 10472 ', ' 11208 ', ' 11102 ', ' 10032 ', ' 11216 ', ' 10473 ', ' 10463 ', ' 11213 ', ' 10040 ', ' 10302 ', ' 1 ' 1231 ', ' 10470 ', ' 11204 ', ' 11104 ', ' 11212 ', ' 10466 ', ' 11416 ', ' 11214 ', ' 10009 ', ' 11692 ', ' 11385 ', ' 11423 ', ' 1 1201 ', ' 10024 ', ' 11435 ', ' 10312 ', ' 10030 ', ' 11106 ', ' 10033 ', ' 10303 ', ' 11215 ', ' 11222 ', ' 11354 ', ' 10016 ', ' 1 0034 ', ' 11420 ', ' 10304 ', ' 10019 ', ' 11237 ', ' 11249 ', ' 11230 ', ' 11372 ', ' 11207 ', ' 11378 ', ' 11419 ', ' 11361 ', ' 1 0011 ', ' 11357 ', ' 10012 ', ' 11358 ', ' 10003 ', ' 10002 ', ' 11374 ', ' 10007 ', ' 11234 ', ' 10065 ', ' 11369 ', ' 11434 ', ' 11205 ', ' 11206 ', ' 11415 ', ' 11236 ', ' 11218 ', ' 11413 ', ' 10458 ', ' 11101 ', ' 10306 ', ' 11355 ', ' 10023 ', ' 11368 ', ' 10314 ', ' 11421 ', ' 10010 ', ' 10018 ', ' 11223 ', ' 10455 ', ' 11377 ', ' 11433 ', ' 11375 ', ' 10037 ', ' 11209 ', ' 10459 ', ' 10128 ', ' 10014 ', ' 10282 ', ' 11373', ' 10451 ', ' 11238 ', ' 11211 ', ' 10038 ', ' 11694 ', ' 11203 ', ' 11691 ', ' 11232 ', ' 10305 ', ' 10021 ', ' 11228 ', ' 10036 ', ' 10001 ', ' 10017 ', ' 11217 ', ' 11219 ', ' 10308 ', ' 10465 ', ' 11379 ', ' 11414 ', ' 10460 ', ' 11417 ', ' 11220 ', ' 11366 ', ' 10027 ', ' 11370 ', ' 10309 ', ' 11412 ', ' 11356 ', ' 10456 ', ' 11432 ', ' 10022 ', ' 10013 ', ' 11367 ', ' 11040 ', ' 10026 ', ' 104 75 ', ' 11210 ', ' 11364 ', ' 11426 ', ' 10471 ', ' 10119 ', ' 11224 ', ' 11418 ', ' 11429 ', ' 11365 ', ' 10461 ', ' 11239 ', ' 100 39 ', ' 00083 ', ' 11411 ', ' 10075 ', ' 11004 ', ' 11360 ', ' 10453 ', ' 10028 ', ' 11430 ', ' 10307 ', ' 11103 ', ' 10004 ', ' 100 69 ', ' 10005 ', ' 10474 ', ' 11428 ', ' 11436 ', ' 10020 ', ' 11001 ', ' 11362 ', ' 11693 ', ' 10464 ', ' 11427 ', ' 10044 ', ' 113 63 ', ' 10006 ', ' 10000 ', ' 02061 ', ' 77092-2016 ', ' 10280 ', ' 11109 ', ' 14225 ', ' 55164-0737 ', ' 19711 ', ' 07306 ', ' 000000 ', ' NO clue ', ' 90010 ', ' 10281 ', ' 11747 ', ' 23541 ', ' 11776 ', ' 11697 ', ' 11788 ', ' 07604 ', 10112.0, 11788.0, 11563.0 , 11580.0, 7087.0, 11042.0, 7093.0, 11501.0, 92123.0, 0.0, 11575.0, 7109.0, 11797.0, ' 10803 ', ' 11716 ', ' 11722 ', ' 11549-3650 ', ' 10162 ' ', ' 92123 ', ' 23502 ', ' 11518 ', ' 07020 ', ' 08807 ', ' 11577 ', ' 07114 ', ' 11003 ', ' 07201 ', ' 11563 ', ' 61702 ', ' 10103
', ' 29616-0759 ', ' 35209-3114 ', ' 11520 ', ' 11735 ', ' 10129 ', ' 11005 ', ' 41042 ', ' 11590 ', 6901.0, 7208.0, 11530.0, 13221.0, 10954.0, 11735.0, 10103.0, 7114.0, 11111.0, 10107.0], Dtype=object)
7.3 Fixing Nan Value and string/floating-point confusion
We can then pass the na_values options to pd.read_csv to clean them up. We can also specify that the type of the incident zip is a string rather than a floating-point.
Na_values = [' NO clue ', ' N/A ', ' 0 ']
requests = Pd.read_csv ('. /data/311-service-requests.csv ', na_values=na_values, dtype={' Incident Zip ': str})
requests[' Incident Zip '].unique ()
Array ([' 11432 ', ' 11378 ', ' 10032 ', ' 10023 ', ' 10027 ', ' 11372 ', ' 11419 ', ' 11417 ', ' 10011 ', ' 11225 ', ' 11218 ', ' 10003 ') ' 10029 ', ' 10466 ', ' 11219 ', ' 10025 ', ' 10310 ', ' 11236 ', Nan, ' 10033 ', ' 11216 ', ' 10016 ', ' 10305 ', ' 10312 ', ' 100 26 ', ' 10309 ', ' 10036 ', ' 11433 ', ' 11235 ', ' 11213 ', ' 11379 ', ' 11101 ', ' 10014 ', ' 11231 ', ' 11234 ', ' 10457 ', ' 104 59 ', ' 10465 ', ' 11207 ', ' 10002 ', ' 10034 ', ' 11233 ', ' 10453 ', ' 10456 ', ' 10469 ', ' 11374 ', ' 11221 ', ' 11421 ', ' 11215 ', ' 1 ' 0007 ', ' 10019 ', ' 11205 ', ' 11418 ', ' 11369 ', ' 11249 ', ' 10005 ', ' 10009 ', ' 11211 ', ' 11412 ', ' 10458 ', ' 11229 ', ' 1 0065 ', ' 10030 ', ' 11222 ', ' 10024 ', ' 10013 ', ' 11420 ', ' 11365 ', ' 10012 ', ' 11214 ', ' 11212 ', ' 10022 ', ' 11232 ', ' 1 1040 ', ' 11226 ', ' 10281 ', ' 11102 ', ' 11208 ', ' 10001 ', ' 10472 ', ' 11414 ', ' 11223 ', ' 10040 ', ' 11220 ', ' 11373 ', ' 1 1203 ', ' 11691 ', ' 11356 ', ' 10017 ', ' 10452 ', ' 10280 ', ' 11217 ', ' 10031 ', ' 11201 ', ' 11358 ', ' 10128 ', ' 11423 ', ' 10039 ', ' 10010 ', ' 11209 ', ' 10021 ', ' 10037 ', ' 11413 ', ' 11375 ', ' 11238 ', ' 10473 ', ' 11103 ', ' 11354 ', ' 11361 ', ' 11106 ', ' 11385 ', ' 10463 ', ' 10467 ', ' 11204 ', ' 11237 ', ' 11377 ', ' 11364 ', ' 11434 ', ' 11435 ', ' 11210 ', ' 11228 ', ' 11368 ', ' 11694 ', ' 10464 ', ' 11415 ', ' 10314 ', ' 10301 ', ' 10018 ', ' 10038 ', ' 11105 ', ' 11230 ', ' 10468 ', ' 11104 ', ' 10471 ', ' 11416 ', ' 10075 ', ' 11422 ', ' 11355 ', ' 10028 ', ' 10462 ', ' 10306 ', ' 10461 ', ' 11224 ', ' 11429 ', ' 10035 ', ' 11366 ', ' 11362 ', ' 11206 ', ' 10460 ', ' 10304 ', ' 11360 ', ' 11411 ', ' 10455 ', ' 10475 ', ' 10069 ', ' 10303 ', ' 10308 ', ' 10302 ', ' 11357 ', ' 10470 ', ' 11367 ', ' 113 70 ', ' 10454 ', ' 10451 ', ' 11436 ', ' 11426 ', ' 10153 ', ' 11004 ', ' 11428 ', ' 11427 ', ' 11001 ', ' 11363 ', ' 10004 ', ' 104 74 ', ' 11430 ', ' 10000 ', ' 10307 ', ' 11239 ', ' 10119 ', ' 10006 ', ' 10048 ', ' 11697 ', ' 11692 ', ' 11693 ', ' 10573 ', ' 000
83 ', ' 11559 ', ' 10020 ', ' 77056 ', ' 11776 ', ' 70711 ', ' 10282 ', ' 11109 ', ' 10044 ', ' 02061 ', ' 77092-2016 ', ' 14225 ', ' 55164-0737 ', ' 19711 ',' 07306 ', ' 000000 ', ' 90010 ', ' 11747 ', ' 23541 ', ' 11788 ', ' 07604 ', ' 10112 ', ' 11563 ', ' 11580 ', ' 07087 ', ' 11042 ', ' 07093 ', ' 11501 ', ' 92123 ', ' 00000 ', ' 11575 ', ' 07109 ', ' 11797 ', ' 10803 ', ' 11716 ', ' 11722 ', ' 11549-3650 ', ' 10162 ', ' 23502 ', ' 11518 ', ' 07020 ', ' 08807 ', ' 11577 ', ' 07114 ', ' 11003 ', ' 07201 ', ' 61702 ', ' 10103 ', ' 29616-0759 ', ' 3520 9-3114 ', ' 11520 ', ' 11735 '