Original: Chapter 8
import pandas as pd
8.1 Parsing Unix timestamps
It's not easy to deal with Unix timestamps in pandas; it took me a long time to figure this out. The file we use here is a package popularity file that I found on my system at /var/log/popularity-contest.
Here's an explanation of what this file is.
# Read it, and remove the last row
popcon = pd.read_csv('../data/popularity-contest', sep=' ')[:-1]
popcon.columns = ['atime', 'ctime', 'package-name', 'mru-program', 'tag']
The columns are the access time, the creation time, the package name, the most recently used program, and a tag.
popcon[:5]
  | atime | ctime | package-name | mru-program | tag
0 | 1387295797 | 1367633260 | perl-base | /usr/bin/perl |
1 | 1387295796 | 1354370480 | login | /bin/su |
2 | 1387295743 | 1354341275 | libtalloc2 | /usr/lib/x86_64-linux-gnu/libtalloc.so.2.0.7 |
3 | 1387295743 | 1387224204 | libwbclient0 | /usr/lib/x86_64-linux-gnu/libwbclient.so.0 |
4 | 1387295742 | 1354341253 | libselinux1 | /lib/x86_64-linux-gnu/libselinux.so.1 |
The magical part of timestamp parsing in pandas is that numpy datetimes are already stored as Unix timestamps. So all we need to do is tell pandas that these integers are actually datetimes; it doesn't need to do any real conversion.
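To see what "already stored as Unix timestamps" means, here is a minimal sketch using numpy directly. It assumes the integers count seconds and reuses two values from the sample rows; the variable names are only for illustration.

```python
import numpy as np

# Two Unix timestamps (seconds since 1970-01-01), taken from the sample rows.
seconds = np.array([1387295797, 1367633260], dtype="int64")

# Reinterpret the same 8-byte integers as datetime64[s]; no values are rewritten.
as_datetimes = seconds.view("datetime64[s]")
print(as_datetimes)

# Viewing the datetimes as int64 again gives back the original integers.
round_trip = as_datetimes.view("int64")
print(round_trip)
```

The view shares memory with the original array, which is why nothing has to be recomputed.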
We need to first convert these to integers:
popcon['atime'] = popcon['atime'].astype(int)
popcon['ctime'] = popcon['ctime'].astype(int)
Every numpy array and pandas series has a dtype; this is usually int64, float64, or object. Some of the time types available are datetime64[s], datetime64[ms], and datetime64[us]. There are also timedelta types, similarly.
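A quick way to get a feel for dtypes is to build a few small series and print them. This is only a sketch with made-up values; the exact integer width and datetime resolution can depend on the platform and pandas version, so the comments stay general.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
floats = pd.Series([1.5, 2.5])
times = pd.Series(pd.to_datetime(["2013-12-17", "2013-12-16"]))
# Subtracting one datetime from a datetime series yields timedeltas.
deltas = times - times.iloc[1]

print(ints.dtype)    # an integer dtype
print(floats.dtype)  # a float dtype
print(times.dtype)   # a datetime64 dtype
print(deltas.dtype)  # a timedelta64 dtype
```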
We can use the pd.to_datetime function to convert our integer timestamps into datetimes. This is a cheap operation: the values stay integers under the hood (pandas rescales them to its own resolution), and what really changes is how pandas interprets them.
popcon['atime'] = pd.to_datetime(popcon['atime'], unit='s')
popcon['ctime'] = pd.to_datetime(popcon['ctime'], unit='s')
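The unit argument matters more than it looks. The sketch below uses the first two atime values from the sample rows and shows what happens with and without unit='s'; the variable names are illustrative.

```python
import pandas as pd

# Integer seconds since the epoch, copied from the first two atime values.
raw = pd.Series([1387295797, 1387295796])

# unit='s' tells pandas the integers count seconds.
parsed = pd.to_datetime(raw, unit="s")
print(parsed)

# Without a unit, integers are read as nanoseconds after 1970-01-01,
# so the same numbers land about 1.4 seconds after the epoch.
misread = pd.to_datetime(raw)
print(misread)
```

If the dates come out as 1970, a missing or wrong unit is the first thing to check.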
If we look at the dtype now, it's <M8[ns]. As far as I can tell, M8 is shorthand for datetime64.
popcon['atime'].dtype
dtype('<M8[ns]')
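The shorthand can be checked in numpy itself. A small sketch: the capital M is the type code for datetime64, the leading < only marks little-endian byte order, and a lowercase m means timedelta64 instead.

```python
import numpy as np

# 'M8' is numpy's type code for datetime64.
is_datetime = np.dtype("M8[ns]") == np.dtype("datetime64[ns]")

# Lowercase 'm8' is a different type: timedelta64.
is_timedelta = np.dtype("m8[ns]") == np.dtype("timedelta64[ns]")

print(is_datetime, is_timedelta)
print(np.dtype("M8[ns]").kind, np.dtype("m8[ns]").kind)
```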
So now we have atime and ctime as datetimes.
popcon[:5]
  | atime | ctime | package-name | mru-program | tag
0 | 2013-12-17 15:56:37 | 2013-05-04 02:07:40 | perl-base | /usr/bin/perl |
1 | 2013-12-17 15:56:36 | 2012-12-01 14:01:20 | login | /bin/su |
2 | 2013-12-17 15:55:43 | 2012-12-01 05:54:35 | libtalloc2 | /usr/lib/x86_64-linux-gnu/libtalloc.so.2.0.7 |
3 | 2013-12-17 15:55:43 | 2013-12-16 20:03:24 | libwbclient0 | /usr/lib/x86_64-linux-gnu/libwbclient.so.0 |
4 | 2013-12-17 15:55:42 | 2012-12-01 05:54:13 | libselinux1 | /lib/x86_64-linux-gnu/libselinux.so.1 |
Now let's say we want to see all the packages that are not libraries.
First, I want to get rid of everything with a timestamp of 0. Notice that we can use a plain string in this comparison, even though the column holds timestamps internally: pandas parses the string as a date before comparing.
popcon = popcon[popcon['atime'] > '1970-01-01']
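Here is the same filter on a tiny frame, as a sketch. The row named "never-used" is a made-up package with a timestamp of 0; the other two values come from the sample rows.

```python
import pandas as pd

frame = pd.DataFrame({
    "atime": pd.to_datetime([0, 1387295797, 1387295796], unit="s"),
    "package-name": ["never-used", "perl-base", "login"],  # first name is hypothetical
})

# The string is parsed as a date, so this keeps rows strictly after the epoch.
kept = frame[frame["atime"] > "1970-01-01"]
print(kept)

# Equivalent comparison with an explicit Timestamp.
explicit = frame[frame["atime"] > pd.Timestamp("1970-01-01")]
```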
Now we can use pandas' magical string abilities to look at just the rows where the package name doesn't contain 'lib'.
nonlibraries = popcon[~popcon['package-name'].str.contains('lib')]
# DataFrame.sort was removed from pandas; sort_values is the current method
nonlibraries.sort_values('ctime', ascending=False)[:10]
    | atime | ctime | package-name | mru-program | tag
57  | 2013-12-17 04:55:39 | 2013-12-17 04:55:42 | ddd | /usr/bin/ddd |
450 | 2013-12-16 20:03:20 | 2013-12-16 20:05:13 | nodejs | /usr/bin/npm |
454 | 2013-12-16 20:03:20 | 2013-12-16 20:05:04 | switchboard-plug-keyboard | /usr/lib/plugs/pantheon/keyboard/options.txt |
445 | 2013-12-16 20:03:20 | 2013-12-16 20:05:04 | thunderbird-locale-en | /usr/lib/thunderbird-addons/extensions/langpac ... |
396 | 2013-12-16 20:08:27 | 2013-12-16 20:05:03 | software-center | /usr/sbin/update-software-center |
449 | 2013-12-16 20:03:20 | 2013-12-16 20:05:00 | samba-common-bin | /usr/bin/net.samba3 |
397 | 2013-12-16 20:08:25 | 2013-12-16 20:04:59 | postgresql-client-9.1 | /usr/lib/postgresql/9.1/bin/psql |
398 | 2013-12-16 20:08:23 | 2013-12-16 20:04:58 | postgresql-9.1 | /usr/lib/postgresql/9.1/bin/postmaster |
452 | 2013-12-16 20:03:20 | 2013-12-16 20:04:55 | php5-dev | /usr/include/php5/main/snprintf.h |
440 | 2013-12-16 20:03:20 | 2013-12-16 20:04:54 | php-pear | /usr/share/php/xml/util.php |
OK, cool: it says that I recently installed ddd and postgresql. I remember installing those things.
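The filter-and-sort step can be tried on a toy frame. This sketch reuses four package names and ctime values from the tables in this section, and sorts with sort_values, the method current pandas provides for this.

```python
import pandas as pd

packages = pd.DataFrame({
    "package-name": ["libtalloc2", "ddd", "nodejs", "libselinux1"],
    "ctime": pd.to_datetime(
        ["2012-12-01 05:54:35", "2013-12-17 04:55:42",
         "2013-12-16 20:05:13", "2012-12-01 05:54:13"]
    ),
})

# str.contains returns a boolean Series; ~ inverts it to select non-matches.
is_library = packages["package-name"].str.contains("lib")
non_libraries = packages[~is_library]

# Sort the remaining rows by creation time, newest first.
newest_first = non_libraries.sort_values("ctime", ascending=False)
print(newest_first)
```

Note that str.contains matches 'lib' anywhere in the name, so it also drops packages that merely have 'lib' in the middle; str.startswith('lib') would be a stricter test.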
The whole message here is that if you have a timestamp in seconds, milliseconds, or nanoseconds, then you can "cast" it to datetime64[the-right-thing] and pandas/numpy will take care of the rest.
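As a closing sketch, here is one instant expressed in three units. It assumes the first atime value from this section and simply scales it; all three calls should produce the same timestamp as long as unit matches the data.

```python
import pandas as pd

seconds = 1387295797  # first atime value from this section

from_s = pd.to_datetime(seconds, unit="s")
from_ms = pd.to_datetime(seconds * 1_000, unit="ms")
from_ns = pd.to_datetime(seconds * 1_000_000_000, unit="ns")

print(from_s, from_ms, from_ns)
```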