Introduction: Application Background and spit Groove
Continue to renew the blog at least once a month, summarize the good habits of learning experience.
This time the theme is on the combination of R and Excel, also called How to correctly feed the Excel file to R processing
Divided into:
1. xlsx Package Installation and Precautions
2, using VBA to achieve the conversion of XLSX batch CSV
And, this object, for the same as me, those who have been programming from R, have been using Excel for data analysis ... Programming Daniel please Pat
The reason to study this, because recently work on a live, to the original Excel side of the report migrated to the R end, automatically output visual graphics, and create a PDF or PPT.
The job can be divided into four stages:
1) source data collation and Construction & demand analysis
2) According to the requirements, r language data processing and output processing after the Figure + chart
3) Use Markdown or other means to automatically copy the chart into the report.
4) Report The use of human editors to integrate data.
All the requirements in addition to data preparation is not automated, others are automated, can save the province. The combination of R itself and xlsx is not good.
and r reading the XLSX data is the first threshold I've encountered.
This live data is manually downloaded from the company's web-based database and stored in the XLSX (SQL Direct database access is difficult to open). Try to download the CSV format directly from the database side, but there are errors in the format, and the direct CSV format file size is too large (a single file from a few megabytes into a few 10 trillion), so ultimately decided to save in the xlsx format, and then another plan.
I believe that for those who migrated from Excel to r work, the same problem would be encountered:
First, the xlsx package
First try to solve with R package. That is, the xlsx package.
The xlsx package is prone to problems when loading. The basic reason is that the Java environment is not configured, or the environment variable reference fails. Therefore, to configure the Java environment First, load the Rjava package.
Baidu a bit, the online has a lot of solutions. I mainly refer to this post, the operation steps are:
1. Install the latest version of Java. If you are using 64-bit R, please download 64-bit java.
: http://www.java.com/en/download/manual.jsp
To install under C:\Program Files\java, Win8 is especially careful not to install as C:\Program Files (x86). Maybe r is not very good for a folder like x86 when reading the path, I was in x86 for the first time, and reading was a failure.
2, load the environment in R, that is, a line of code, the path to be based on your Java version to make changes.
R
Sys.setenv (java_home='c:\\program files\\java\\jre1.8.0_45\\'
Then load the Rjava package or the XLSX package to succeed.
After the XLSX package has been successfully loaded, the xlsx file can be read directly with READ.XLSX, and it can also specify the rows and segments to read, as well as the table and the number of tables, which can be saved as xlsx files, which is still a powerful package.
But there are two problems with this approach:
1, not all corporate computers are free to configure the Java environment. Many people have restricted access. And some in-house applications are configured in a Java environment. Even if you're looking for it to install Java, some internal applications may fail with a version number compatibility problem and get a little bigger.
2, with the XLSX packet reading data, when the amount of data is smaller than the speed is relatively fast. However, if the xlsx itself is larger, contains more data, read.xlsx efficiency will be very low, as data.table package fread read fast and save memory. But the Fread function does not support the read-in of xlsx ...
(see this post, with tens of thousands of rows of data, thefread only took about 10 seconds, saving at least a few times more than regular read.table or read.csv )
In conclusion, due to the complexity and compatibility of Java environment, as well as the limit of the read speed of the XLSX package itself, the method of reading the XLSX package with the XLSX package is more suitable for:
1, personal computer, how do you want to play it doesn't matter, or tall Linux, Mac environment
2, the data volume is not particularly large, and Excel file is very clean, requires detailed operation
Unfortunately, this method did not suit me, so I had to find another way.
second, using VBA to convert xlsx batch to CSV format
The above attempt has found that xlsx itself is the root cause of this complex problem. In contrast, R supports text formats such as CSV, and has fread the artifact to handle a certain amount of data, or to convert xlsx to CSV format.
Taking this as a way of thinking, after referring to two data, I successfully rewritten a section of VBA, you can select the desired xlsx, and then in its directory to create a new CSV folder, the xlsx batch conversion to CSV format.
The code is as follows:
1 Subgetcsv ()2 'This is online to see the xlsx batch conversion, while rewriting a xlsx batch conversion csv format3 '1) Batch Conversion CSV reference: http://club.excelhome.net/thread-1036776-2-1.html4 '2) Create folder reference: Http://jingyan.baidu.com/article/f54ae2fcdc79bc1e92b8491f.html5 'Setting the screen here does not move, warning ignore6Application.DisplayAlerts =False7Application.ScreenUpdating =False8 DimData asWorkbook9 'here, with GetOpenFileName pop up a multi-select window, select the xlsx file we want to convert to CSV,TenFile = Application.getopenfilename (multiselect:=True) One 'with LBound and UBound . A fori =LBound(file) to UBound(file) -Workbooks.Open filename:=file (i) - Setdata =ActiveWorkbook thePath =data. Path - 'This is set to be saved in the CSV folder below the directory, then you can adjust the - 'refer to the first method inside - on Error Resume Next +Vba. MkDir (Path &"\csv") - withData +. SaveAs Path &"\csv\"&Replace(Data. Name,". xlsx",". csv"), Xlcsv A. CloseTrue at End with - NextI - 'A popup dialog box indicates that the conversion is complete, then go to the CSV in the appropriate place to view it - MsgBox "has been converted"& (I-1) &"a document" -Application.ScreenUpdating =True -Application.DisplayAlerts =True in End Sub
The operation is simple:
Copy the code into the VBA Editor in Excel, and then run the GETCSV macro, jumping out of a window and asking you to select the xlsx file you want to convert. (Multiple options available)
Select later, wait a while, then go back to the xlsx file, there will be more than one CSV folder, which is the text file we are going to import R.
The benefits of this approach are:
1, easy to operate, directly rely on the VBA operation of Excel, do not configure the Java environment, then the cost of communication/change of computer cost is small
2, especially suitable for a certain amount of data, but the data format neat files, such as data from a data side read into. With fread You can also control the read line (skip=nnn), and code writing is neat and convenient. Even if there are some different data, you can use VBA in advance to operate, simple and convenient.
In conclusion, my last use is the second method.
However, the first method, if you have a Java environment or try to configure the better. Because some of the well-known packages have to rely on this Java environment. The most typical is the rwordseg bag (Chinese participle)
Additional information:
When I first approached the task, I really didn't think I was using VBA to solve the problem. There are also a lot of people who will use Python to solve similar problems, but now I can not spare time to learn python.
But the black Cat white cat catches the mouse is the good cat ... After learning R, I was forced to learn to change the VBA code, the heart rubber cork stoppling
For VBA, there are some references such as:
1) VBA action on the folder:
Excel VBA-traverse files, folders, and batch build txt in a folder
This it introduces many applications that use VBA to create folders, rename, and so on. Although the individual feel that using R to do dir () or list.files more intuitive, but can refer to.
How do I create a folder in Excel with VBA?
Two ways to create a folder, I directly used the first type ...
2) VBA Merge worksheet, generate catalog:
tutorials introduced by Excelhome
#一宏微教程 # How do I do with the worksheets in the No. 008 to No. 010 issue--excel file?
Also at the folder level, using R for file operations, you can view the following tutorials:
Using R for File system management
-------------------------adhere to each month to write a study summary, mainly related to R, Welcome to the spit Groove communication. My blog: Http://www.cnblogs.com/weibaar
R Language Learning notes: How to correctly feed Excel files to R processing