Example of using Python to read external data files

Source: Internet
Author: User
Tags in python
Whether it is data analysis, data visualization, or data mining, everything is based on data. Using Python for data analysis, the most important step is how to import data to Python before data analysis, data visualization, and data mining can be implemented.

 

In this article, we will give a detailed introduction to how to obtain external data from Python. We will learn the following four aspects of data acquisition:

 

1. Read data from text files, such as txt files and csv files.

 

2. Read workbooks, such as Excel files.

 

3. Read the data files generated by the statistical software, such as the SAS and SPSS data sets.

 

4. Read database data, such as MySQL data and SQL Server Data

 

 

 

I. Reading text file data

 

As we all know, the pandas module in Python is a powerful tool dedicated for data analysis. In "pandas learning for Python data analysis (I)" and "pandas learning for Python data analysis (II) this section describes the pandas module application in detail. Next we will introduce how pandas reads external data.

 

 

 

1. Read txt data

 

In [1]: import pandas as pd

 

In [2]: mydata_txt = pd. read_csv ('C: \ test_code.txt ', sep =' \ t', encoding = 'utf-8 ')

 

 

For Chinese text files, reading fails due to encoding problems, as shown in the figure above. How can we deal with this encoding problem? There are two solutions:

 

1) when the txt or csv data in the original file is not in uft8 format, you need to save it as UTF-8 encoded;

 

2) if the original data file is in uft8 format, set the encoding parameter of the read_csv function to UTF-8 for normal reading.

 

 

 

Save the original data as utf8 data and re-read the txt data.

 

In [3]: mydata_txt = pd. read_csv ('C: \ test.txt ', sep =' \ t', encoding = 'utf-8 ')

 

In [4]: mydata_txt

 

 

The txt text file data enters the Python pocket.

 

 

 

2. Read csv data

 

Csv text files are a very common data storage format, and their storage capacity is much larger than that of Excel workbooks. Let's take a look at how to use Python to read csv data files:

 

In [5]: mydata_csv = pd. read_csv ('C: \ test.csv ', sep =', ', encoding = 'utf-8 ')

 

In [6]: mydata_csv

 

 

If you are good at summing up, you will find that both the txt file and the csv file can be read through the read_csv function in the pandas module. This function has more than 20 parameters, similar to the read. table function in R. To view the specific parameter details, you can view the help documentation: help (pandas. read_csv ).

 

 

 

II. Reading workbooks

 

The workbook mentioned here is an Excel table, which can be an xls workbook or an xlsx Workbook. In daily work, a lot of data is stored in an Excel Worksheet. If we need to use Python to analyze or process it, the first step is to read the Excel data. Let's take a look at how to read an Excel dataset:

 

In [7]: mydata_excel = pd. read_excel ('C: \ test.xlsx', sep = '\ t', encoding = 'utf-8 ')

 

In [8]: mydata_excel

 

 

3. Read data files generated by statistical software

 

When you integrate data sources, you may be troubled by the fact that your computer stores datasets that come with many statistical software, such as the R language dataset, SAS dataset, and SPSS dataset. The problem arises. If you have installed these software on your computer, you can naturally see these datasets and easily convert them into text files or workbooks, if you have not installed a large statistical analysis software such as SAS or SPSS on your computer, how do you view these datasets? Rest assured that Python is omnipotent and can read data sets of many statistical software. The following describes several methods for reading data sets from Python:

 

 

 

1. Read the SAS dataset

 

You can use the read_sas function in the pandas module to read the SAS dataset. Try this function to read the SAS dataset. The following figure shows a dataset opened with SAS. If SAS is not installed on your computer, you can use Python to read the data.

 

 

In [1]: import pandas as pd

 

In [2]: mydata_sas = pd. read_sas ('G: \ class. sas7bdat ', encoding = 'utf8 ')

 

 

2. Read the SPSS dataset

 

Reading the SPSS data is a little more complicated. I tested it many times and checked a lot of information. I did not have to worry about it. I finally got it done. To read the SPSS data file, you must install the savReaderWriter module for your Python. You can download and install the module at: https://pypi.python.org/pypi/savreaderwriter/3.4.2.

 

 

 

Install the savReaderWriter module

 

You can run this command to install the savReaderWriter module: python setup. py install

 

The following figure shows how SPSS data is opened in SPSS:

 

 

In [1]: import savReaderWriter

 

In [2]: mydata_spss = savReaderWriter. SavReader ('employee _ data. Sav ')

 

In [3]: mydata_spss

 

 

3. What should I do?

 

The extract function writes a dataset into a csv format, so that Python can easily read the csv dataset. Everything can be done with flexibility ~

 

 

 

4. Read database data

 

More data is stored in databases such as MySQL, SQL Server, and DB2. In order to connect Python to the database, scientists have specially designed the Python DB API interface. We still use examples to illustrate how Python connects to and operates databases.

 

 

 

1. Connect Python to MySQL

 

The MySQLdb module is an intermediate bridge connecting Python and MySQL. Currently, it can only run in Python2.x, but it does not mean that Python3 cannot connect to the MySQL database. Here we will introduce a very flexible and powerful module, namely the pymysql module. I like it because it can be disguised as the MySQLdb module. For more information, see the following example:

 

In [1]: import pymysql

 

In [2]: pymysql. install_as_MySQLdb () # disguise as the MySQLdb module

 

In [3]: import MySQLdb

 

 

 

Connect Python and MySQL using the Connection function

 

In [4]: conn = MySQLdb. Connection (

 

...: Host = 'localhost ',

 

...: User = 'root ',

 

...: Password = 'Snake ',

 

...: Port = 3306,

 

...: Database = 'test ',

 

...: Charset = 'gbk ')

 

Use the conn cursor method (cursor) to pave the way for subsequent database operations.

 

In [5]: cursor = conn. cursor ()

 

In [6]: SQL = 'select * from memberinfo'

 

Execute SQL statements

 

In [7]: cursor.exe cute (SQL)

 

Out [7]: 4

 

In [8]: data = cursor. fetchall ()

 

In [9]: data

 

 

We found that data stores data sets in the format of tuples. In pandas learning (1) of Python data analysis, constructing a DataFrame data structure can only be built through arrays, data boxes, dictionaries, lists, and other methods. But here is the data in the format of tuples. How should we deal with it? You can use the list function to quickly convert the data in the list format.

 

In [10]: data = list (data)

 

In [11]: data

 

 

The DataFrame function in the pandas module is used to convert the data list above to the data box format of Python:

 

In [14]: import pandas as pd

 

In [15]: mydata = pd. DataFrame (data, columns = ['id', 'name', 'age', 'gender'])

 

In [16]: mydata

 


 

In the end, you must close the cursor and connection after reading the data. If you do not close the data, computer resources will be wasted.

 

In [19]: cursor. close ()

 

In [20]: conn. close ()

 

 

 

2. Connect Python to SQL Server

 

We recommend that you use the pymssql module to connect to the SQL Server database using Python. The syntax of this module is the same as that of pymysql. Here we will not explain the meaning of each step, directly run the code:

 

In [21]: import pymssql

 

In [22]: connect = pymssql. connect (

 

...: Host = '192. 18.1.6 \ SqlR2 ',

 

...: User = 'sa ',

 

...: Password = '1q2w3e4r !! ',

 

...: Database = 'Heinz _ Ana ',

 

...: Charset = 'utf8 ')

 

 

 

In [23]: cursor = connect. cursor ()

 

In [24]: SQL = 'select * from HeinzDB2_10'

 

In [25]: cursor.exe cute (SQL)

 

In [26]: data = cursor. fetchall ()

 

 

 

In [27]: data [0]

 

Out [27]: (67782, '1970-05-01 ', 'stage 2', 2013, 1.0)

 

 

 

In [28]: mydata = pd. DataFrame (list (data), columns = ['consumerid ',

 

...: 'Purdate ',

 

...: 'Phase ',

 

...: 'Changetinratio ',

 

...: 'Totalamt '])

 

 

 

In [29]: mydata. head ()

 

 

This article describes how to use Python to read external data, the following cleaning, processing, analysis, and even mining sections will be performed smoothly. The content of this issue came out a little late, mainly because the work was busy. I will continue to work harder later. Thank you for your support and interaction. In the next section, we will introduce how the caret package implements feature selection in the R language.
Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.