R Language Data Operations in Practice at Meituan



I. Introduction


In recent years, distributed data processing technology has advanced rapidly, with tools such as Hive, Spark, Kylin, Impala, and Presto evolving continuously, and a data warehouse or business analysis department has become standard in enterprises and institutions of all kinds. Against this backdrop, the ability to explore and mine the value of data and to run refined data operations has become key to the success of a data team.



As data travels from the back end to the front end, presentation is the last and critical step. Turning data into charts and organizing the right content usually conveys information more quickly and intuitively than bare tables, and thus provides better decision support. Getting from structured data to the final presentation requires a series of exploration and analysis steps to distill product ideas, and this process involves a great deal of secondary data processing.



In these scenarios the R language has unique advantages. Based on the refined data operation practice of Meituan's in-store Food & Beverage Technology Department, this article introduces R's engineering capabilities for data analysis and visualization. We hope it is a useful reference, and we welcome suggestions from peers in the industry.


II. Data Operation Product Categories and the Advantages of R

2.1 Categories of data operation products


In enterprise data operations, considering usage scenarios, product features, implementation roles, and available tools, data operation requirements can be roughly divided into four categories, as shown in the following table:



Table 1. Classification of data operation requirements

    • Analysis reports
      Application scenario: exploring, organizing, and interpreting data without a fixed pattern to produce a one-off analysis report that supports decisions.
      Product features: relies on human interpretation of the data; requirements are divergent.
      Implementation role: data analyst, data engineer.
      Tools: Excel, SQL, R, Tableau, etc.
    • Report-style products
      Application scenario: data assembly and report presentation with a fixed pattern, developed by drag-and-drop or simple code.
      Product features: high development efficiency; low development threshold; limited expressiveness of the reports.
      Implementation role: data analyst.
      Tools: reporting tools.
    • Custom analytical products
      Application scenario: repeatable data analysis products built on fixed-pattern data and analysis methods, providing decision support.
      Product features: high development efficiency; supports deep processing of the data; the development process is reusable and extensible; low development threshold for developers with some programming ability; weak product interactivity.
      Implementation role: data analyst, data engineer.
      Tools: Python, R, Tableau, etc.
    • Custom display products
      Application scenario: highly customized products for fixed-pattern data, meeting personalized presentation needs through enhanced interaction and user experience.
      Product features: rich presentation styles and strong interactivity; suitable only for developers with front-end skills; low development efficiency; weak secondary data processing ability.
      Implementation role: front-end engineer.
      Tools: ECharts, Highcharts, etc.
2.2 R's advantages in data operations


As described in the previous section, refined data operations often require highly customized data processing, visualization, and analysis. These are tasks that Excel, Tableau, and enterprise reporting tools cannot fully cover, but they are precisely where R's strengths lie. In general, R has the following characteristics, which make it the "Swiss Army knife" of data analysis:


    • Free, open source, extensible: as of 2018-08-02, "The CRAN package repository features 12858 available packages." The packages on CRAN cover Bayesian analysis, operations research, finance, genetics, and many other fields, and new packages and updates keep arriving.
    • Programmable: R itself is an interpreted language whose execution is controlled through code, and it can interoperate with Python and Java through packages such as rPython and rJava.
    • Powerful data-handling capabilities:
      • Data source access: through packages such as RMySQL, SparkR, and elastic, data can be fetched from external engines such as MySQL, Spark, and Elasticsearch.
      • Data processing: R has built-in vector, list, matrix, and data.frame types, and secondary data processing can be done with packages such as sqldf, tidyr, dplyr, and reshape2.
      • Data visualization: visualization packages such as ggplot2, plotly, and dygraphs enable highly customized chart rendering.
      • Data analysis and mining: R was created by statisticians as a language for statistical analysis; through your own code or third-party packages it is easy to perform linear regression, analysis of variance, principal component analysis, and other analysis and mining tasks (a short sketch follows after this list).
    • A basic prototype service framework:
      • Web programming framework: for example, developers who are not proficient in front-end or systems development can build their own data applications with the shiny package (see the minimal example after this list).
      • Service capability: for example, the Rserve package provides a C/S-architecture service through which R can communicate with other languages.
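
To make the analysis and the shiny points concrete, here are two minimal, generic sketches written for this article; they use only base R, the built-in iris dataset, and the shiny package, and are not code from the Meituan projects described later.

# Illustrative example: built-in statistical analysis with base R on the iris dataset.
data(iris)

# Linear regression: petal length explained by sepal length.
fit_lm <- lm(Petal.Length ~ Sepal.Length, data = iris)
summary(fit_lm)

# Analysis of variance: does sepal length differ across species?
fit_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(fit_aov)

# Principal component analysis on the four numeric columns.
fit_pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(fit_pca)

And a minimal shiny application, again purely illustrative:

# Illustrative example: a tiny shiny data application.
library(shiny)

ui <- fluidPage(
  selectInput("species", "Species", choices = unique(iris$Species)),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    d <- subset(iris, Species == input$species)
    plot(d$Sepal.Length, d$Petal.Length, xlab = "Sepal length", ylab = "Petal length")
  })
}

shinyApp(ui, server)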


Both Python and R are good choices for data-centric applications, and each has borrowed from the other as it evolved. The closer the work is to statistical research and data analysis, the more people lean toward R; the closer it is to an engineering environment, the more they lean toward Python. Python is an all-around athlete, while R is more of a specialist swordsman in statistical analysis: Python has not built a code base comparable to CRAN, where R holds an absolute lead, and statistics has never been Python's core mission. There are plenty of "Python vs. R" discussions on technology sites, and interested readers can dig in and make their own choice.


III. R's Capabilities for Data Processing, Visualization, and Repeatable Analysis


For analysts with programming skills, or developers with analytical skills, using R in long-running data analysis projects can deliver "develop once, benefit for a long time" as well as "flexible and graphically rich" results. R's data processing, visualization, and repeatable analysis capabilities are described below.


3.1 Data processing


In an enterprise-grade data system, data cleansing, computation, and integration are handled by the data warehouse and tools such as Hive, Spark, and Kylin. In data operation projects, even though R works on result datasets, secondary processing at the query layer is still unavoidable.



At the data query layer, the R ecosystem offers plenty of ready-made components: MySQL tables can be queried through the RMySQL package, and Elasticsearch documents can be searched with the elastic package. For newer technologies such as Kylin, where R ecosystem support has not yet caught up, the query interface can be wrapped in a system language such as Python or Java and called from R through components like rPython or rJava. The data returned by these query components usually arrives as data.frame, list, or similar objects. For example:
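
The sketch below shows what such queries might look like. The hosts, credentials, table, and index names are placeholders rather than the actual Meituan setup, and exact call signatures (especially for elastic) vary across package versions.

# Query a MySQL table into a data.frame (connection details are hypothetical).
library(DBI)
library(RMySQL)

con <- dbConnect(MySQL(), host = "db.example.com", user = "reader",
                 password = "secret", dbname = "ops")
daily_orders <- dbGetQuery(con, "SELECT dt, city, SUM(amount) AS gmv
                                 FROM order_summary GROUP BY dt, city")
dbDisconnect(con)

# Search an Elasticsearch index with the elastic package (index name is hypothetical).
library(elastic)
es   <- connect(host = "es.example.com", port = 9200)
hits <- Search(es, index = "shop_logs", q = "status:error", size = 100)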



In addition, R itself has fairly complete secondary data processing capabilities. For example, data.frame objects can be manipulated with SQL through sqldf, converted between wide and long formats with reshape2, and handled with stringr for all kinds of string processing; other operations such as sorting, grouping, and missing-value imputation are likewise well supported by the language itself and its ecosystem. A few illustrative calls:
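
The snippet below is a small, self-contained illustration of these packages on a made-up data frame; it is not taken from the article's own codebase.

library(sqldf)     # SQL over data.frames
library(reshape2)  # wide/long conversion
library(stringr)   # string handling

# A toy dataset used only for illustration.
sales <- data.frame(city  = c("Beijing", "Beijing", "Shanghai", "Shanghai"),
                    month = c("2018-07", "2018-08", "2018-07", "2018-08"),
                    gmv   = c(120, 150, 200, 210))

# SQL-style aggregation on a data.frame.
sqldf("SELECT city, SUM(gmv) AS total_gmv FROM sales GROUP BY city")

# Long-to-wide conversion: one column per month.
dcast(sales, city ~ month, value.var = "gmv")

# String handling: normalize the month format.
sales$month <- str_replace(sales$month, "-", "/")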


3.2 Data visualization


Data visualization is a key element of data exploration and result presentation. As the official description puts it, "R is a free software environment for statistical computing and graphics", and the graphics (visualization) system is one of R's greatest strengths.



Currently, R has three mainstream visualization systems:


    • Built-in system: the base, grid, and lattice packages ship with R and support relatively simple graphics.
    • ggplot2: developed by RStudio chief scientist Hadley Wickham and backed by a grammar of graphics, ggplot2 supports highly customized visualization by stacking chart layers, a concept that has gradually influenced data visualization solutions including plotly and Alibaba's AntV. As of 2018-08-02, CRAN hosted 40 ggplot2 extension packages (see the sketch after this list for a basic example).
    • htmlwidgets for R: this system started to grow around 2016 with the support of RStudio and provides R interfaces to JavaScript-based visualization libraries. As a bridge between front-end visualization (for front-end engineers) and analytical visualization (for data engineers), htmlwidgets for R combines the advantages of the two fields. As of 2018-08-02, after more than two years of development, CRAN hosted 101 third-party packages built on htmlwidgets.
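
As a simple taste of the second and third systems, the sketch below draws a grouped scatter plot with ggplot2 and then turns it into an interactive htmlwidget with plotly; it is a generic example, not one of the components shown later.

library(ggplot2)
library(plotly)

# A grouped scatter plot built by stacking layers (grammar of graphics).
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point(size = 2) +
  labs(title = "Sepal vs. petal length", x = "Sepal length", y = "Petal length") +
  theme_minimal()

p             # static ggplot2 chart
ggplotly(p)   # the same chart as an interactive htmlwidget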


In day-to-day data operation analysis, routine chart presentation and visual analysis workflows can be packaged into reusable components, which improves development efficiency. Below is a sample of the visualization components accumulated by the data team of Meituan's in-store Food & Beverage Technology Department:




Figure 1. Examples of visualization components


With the visualization component library, a chart can be produced with a single line of code, which greatly improves development efficiency. The code for the last example above, the four-quadrant matrix analysis chart, is as follows:


vis_4quadrant(iris, 'Sepal.Length', 'Petal.Length', label = 'Species', tooltip = 'tooltip', title = '', xtitle = 'Sepal length', ytitle = 'Petal length', pointSize = 1, annotationSize = 1)


For reference, the function declaration of the four-quadrant matrix analysis component is attached below:


vis_4quadrant <- function(df, x, y, label = '', tooltip = '', title = '', xtitle = '', ytitle = '',
                          showLegend = T, jitter = T, centerType = 'mean', pointShape, pointSize = 5,
                          pointColors = collocatColors2, lineSize = 0.4, lineType = 'dashed', lineColor = 'black',
                          annotationFace = 'sans serif', annotationSize = 5, annotationColor = 'black',
                          annotationDeviationRatio, gridAnnotationFace = 'sans serif', gridAnnotationSize = 6,
                          gridAnnotationColor = 'black', gridAnnotationAlpha = 0.6, titleFace = 'sans serif',
                          titleSize, titleColor = 'black', xyTitleFace = 'sans serif', xyTitleSize = 8,
                          xyTitleColor = 'black', gridDesc = c('Zone A', 'Zone B', 'Zone C', 'Zone D'),
                          dataMissingInfo = 'data incomplete', renderType = 'widget') {
  # Draw a grouped scatter plot (four-quadrant matrix analysis)
  #
  # Args:
  #   df: data.frame; required; the data to plot, with at least three columns
  #   x: string; required; column name mapped to the x-axis (a column of df; numeric or date)
  #   y: string; required; column name mapped to the y-axis (a column of df)
  #   label: string; column mapped to the text annotation on each point
  #   tooltip: string; column mapped to the hover information on each point
  #   title: string; chart title
  #   xtitle: string; x-axis title
  #   ytitle: string; y-axis title
  #   showLegend: bool; whether the group legend is shown
  #   jitter: bool; whether points are jittered
  #   centerType: string; type of center point, 'mean' for the average, 'median' for the median
  #   pointShape: integer; point shape
  #   pointSize: numeric; point size
  #   lineSize: numeric; line width
  #   lineType: string; line type
  #   lineColor: string; line color
  #   annotationFace: string; annotation font
  #   annotationSize: numeric; annotation font size
  #   annotationColor: string; annotation font color
  #   annotationDeviationRatio: numeric; upward offset factor of the annotation text
  #   gridAnnotationFace: string; quadrant annotation font
  #   gridAnnotationSize: numeric; quadrant annotation font size
  #   gridAnnotationColor: string; quadrant annotation font color
  #   gridAnnotationAlpha: numeric; quadrant annotation transparency
  #   titleFace: string; title font
  #   titleSize: numeric; title font size
  #   titleColor: string; title font color
  #   xyTitleFace: string; x/y-axis title font
  #   xyTitleSize: numeric; x/y-axis title font size
  #   xyTitleColor: string; x/y-axis title font color
  #   gridDesc: string vector; descriptions of the four quadrants
  #   dataMissingInfo: string; message shown when the data is incomplete
  #   renderType: string; render result type, 'widget' for an htmlwidget, 'html' for HTML content
  #
  # Implementation omitted
}
3.3 Repeatable Data analysis


Data operation analysis is often repetitive and labor-intensive; ultimately it should be distilled into a data analysis framework that can be adapted to specific data and support enterprise decision-making.



RStudio offers a literate-programming report output scheme based on rmarkdown + knitr: developers embed R code in Markdown documents, and the code is executed and rendered into the result (HTML, PDF, or Word). In practice, the analysis eventually settles into a set of report templates; each time a template is fed different data, it produces a new data analysis report. For example:
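
A minimal sketch of such a template follows, using rmarkdown's parameterized reports; the file name, parameter, and dataset are illustrative, not the department's actual templates. Contents of a hypothetical report.Rmd:

    ---
    title: "Weekly data report"
    output: html_document
    params:
      city: "Beijing"
    ---

    ```{r}
    orders <- subset(sales, city == params$city)  # 'sales' is assumed to be loaded beforehand
    plot(orders$dt, orders$gmv, type = "l")
    ```

Rendering the same template with different parameters (or against refreshed data) then yields a fresh report each time:

library(rmarkdown)
render("report.Rmd", params = list(city = "Beijing"),  output_file = "report_beijing.html")
render("report.Rmd", params = list(city = "Shanghai"), output_file = "report_shanghai.html")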



rmarkdown itself has simple page layout capabilities and can be extended with flexdashboard, so this solution not only makes the analysis process repeatable but also allows highly customized presentation of the results, with HTML, CSS, and JavaScript (the front-end "big three") used to polish the presentation and interaction details of the reports. The outcome is fast, efficient production of analysis results with significant savings in manpower.
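
For reference, a skeleton of a flexdashboard-based report might look like the following; the layout and charts are generic placeholders rather than the reports described in this article.

    ---
    title: "Operations dashboard"
    output: flexdashboard::flex_dashboard
    ---

    ```{r setup, include=FALSE}
    library(ggplot2)
    library(dygraphs)
    ```

    Column {data-width=600}
    -----------------------------------------------------------------------

    ### Daily trend

    ```{r}
    dygraph(ldeaths)   # any htmlwidget or ggplot2 chart can go here
    ```

    Column {data-width=400}
    -----------------------------------------------------------------------

    ### Breakdown by species

    ```{r}
    ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot()
    ```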


IV. Service-Oriented Transformation of R

4.1 R service framework


R is both a language and a cross-platform runtime environment with strong data processing, analysis, and visualization capabilities. Besides serving as a personal statistical analysis tool on a Windows/macOS PC, it can also run in a Linux server environment. R can therefore act as the analysis and presentation engine, while peripheral functions such as caching, security checks, and permission control are handled by a system language such as Java, so that an enterprise reporting system or data analysis (mining) framework can be built instead of treating R merely as desktop software.
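
One common way to expose R as such an engine is the Rserve package mentioned earlier. The sketch below starts an Rserve instance and calls it from another R session through the RSclient package (a Java client using the Rserve Java library would play the same role in the architecture below); the host, port, and evaluated expression are illustrative.

# Server side: start an Rserve instance that other processes can connect to.
library(Rserve)
Rserve(args = "--vanilla")   # listens on the default port 6311

# Client side (here another R session via RSclient; host/port are placeholders).
library(RSclient)
conn <- RS.connect(host = "127.0.0.1", port = 6311)
res  <- RS.eval(conn, summary(lm(Petal.Length ~ Sepal.Length, data = iris)))
RS.close(conn)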



A design sketch of such an enterprise reporting system or data analysis (mining) framework is shown below:




Figure 2. R service framework

4.2 foreach + doParallel multi-core parallelism


As an interpreted language created by statisticians, R runs as a single-threaded process on one CPU core and loads all data into memory for processing, so raw computing performance is a weak point compared with system languages such as Java, or with Python. With large datasets, the heavy computation should therefore be pushed down to distributed engines such as Hive and Kylin so that R only has to handle the result sets; in addition, the doParallel + foreach scheme can exploit multiple cores to improve computational efficiency, as in the following example:


library(doParallel)
library(foreach)

registerDoParallel(cores = detectCores())

vis_process1 <- function() {
    # Visualization process 1 ...
}
vis_process2 <- function() {
    # Visualization process 2 ...
}
data_process1 <- function() {
    # Data processing process 1 ...
}
data_process2 <- function() {
    # Data processing process 2 ...
}

processes <- c('vis_process1', 'vis_process2', 'data_process1', 'data_process2')

process_res <- foreach(i = seq_along(processes), .packages = c('magrittr')) %dopar% {
    do.call(processes[i], list())
}

vis_process1_res  <- process_res[[1]]
vis_process2_res  <- process_res[[2]]
data_process1_res <- process_res[[3]]
data_process2_res <- process_res[[4]]
4.3 Rendering performance of graphical data reports


In the data analysis reporting process, R's most important role is that of a graphics engine, so it is necessary to understand its rendering performance. For the mainstream rmarkdown + flexdashboard report rendering scheme, the performance test results are as follows:



System environment:


    • 4-core CPU, 8 GB memory, 2.20 GHz clock speed.
    • Linux kernel 3.10.0-123.el7.x86_64.


Test method:


    • Measure the total time to render a report 100 times, for rendering modes of different complexity and at different levels of concurrency (a sketch of such a timing script follows).
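
The original article does not include the benchmarking script; the sketch below shows one way such a measurement might be run from R, reusing the foreach + doParallel approach from the previous section. The template name and concurrency level are placeholders.

library(foreach)
library(doParallel)

concurrency <- 4                       # parallel renders; vary from 1 to 6 to reproduce the table
registerDoParallel(cores = concurrency)

# Time 100 renders of one (hypothetical) report template.
elapsed <- system.time(
  foreach(i = 1:100, .packages = "rmarkdown") %dopar% {
    int_dir <- tempfile("render_")     # separate intermediates per render to avoid clashes
    dir.create(int_dir)
    render("report_template.Rmd",
           output_file       = sprintf("report_%03d.html", i),
           intermediates_dir = int_dir,
           quiet             = TRUE)
  }
)
print(elapsed["elapsed"])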


Test results:



Table 2. Data analysis report rendering performance test (total time for 100 renders)

Rendering mode                                 | Concurrency 1 | Concurrency 2 | Concurrency 3 | Concurrency 4 | Concurrency 5 | Concurrency 6
rmarkdown + flexdashboard                      | 1m14.087s     | 0m39.192s     | 0m28.299s     | 0m20.795s     | 0m21.471s     | 0m19.755s
rmarkdown + flexdashboard + dygraphs           | 1m48.771s     | 0m52.716s     | 0m39.051s     | 0m30.224s     | 0m28.948s     | 0m27.012s
rmarkdown + flexdashboard + ggplot2            | 2m6.840s      | 1m1.529s      | 0m42.351s     | 0m31.596s     | 0m35.546s     | 0m34.992s
rmarkdown + flexdashboard + ggplot2 + dygraphs | 2m30.586s     | 1m16.696s     | 0m51.277s     | 0m40.651s     | 0m41.406s     | 0m41.288s


According to the test results:


    • The average rendering time of a single report is 0.74 s or more, and grows with the computational complexity of the report (the foreach + doParallel multi-core parallel scheme described in the previous section can speed this up). In our experience, most applications render within seconds.
    • Because each render is single-threaded on one core, throughput stops increasing once the number of concurrent requests exceeds the number of CPU cores. The server configuration should therefore be matched to the actual business scenario, and a cap on concurrent executions should be set when forwarding requests. For internal operational data systems, a single 4-core, 8 GB machine is basically sufficient.
V. R in Practice in Meituan's Data Products


Our in-store food and beverage data team has used R as an auxiliary development language for data products since 2015. As of August 2018, it has been applied successfully to management-oriented weekly data reports, analytical tools for data warehouse governance, data dashboards for internal operations staff and analysts, a brand-merchant business data analysis system for key-account sales, and other projects. R is now the preferred choice for all of the department's custom analytical products.



In addition, we are gradually consolidating our R visualization and analysis components and developing a configuration-driven BI product development framework based on an R engine, in order to further lower the barrier to using R and increase its adoption.



Below is the ETL dependency visualization tool that the in-store dining data team developed with R during data governance:




Figure 3. ETL dependency visualization tool

VI. Conclusion


To sum up, R can serve as a key technical lever in enterprise data operations. As a domain language for statistical analysis, however, its development was driven mainly by statisticians for a long time. With the recent explosion of data and data applications, R has gained more and more industry support: Microsoft acquired Revolution Analytics, a provider of R-based enterprise data solutions, integrated R into SQL Server 2016, and formally integrated an R development environment into Visual Studio starting with Visual Studio 2015 through RTVS (R Tools for Visual Studio). This series of moves reflects Microsoft's focus on R in the data analysis field.



In China, the China R Conference, initiated by the Capital of Statistics community, has been held 11 times since 2008 and has fostered the growth of R users in the country. As of August 2018, there were around 200 R developers at Meituan. Nevertheless, compared with system languages such as Java and Python, R's user base and range of applications are still relatively narrow.



The author's purpose in writing this article is to offer people working in data-related roles a new and, in many scenarios, more advantageous option.


About the author


Shangsan leads the data system and data product team of Meituan's in-store Food & Beverage Technology Department. He joined Meituan in 2015 and has long worked on data platform, data warehouse, and data application development. Since 2013 he has accumulated experience in using R to meet business needs quickly and reduce R&D costs, and he is actively promoting the use of R for development and business analysis within Meituan.


A Small Recruiting Ad


If you are interested in data engineering and in unlocking the value of data to serve the business, please send your resume to [email protected]. We have many unexplored but meaningful areas for you to work on, including data warehousing, data governance, data product development frameworks, data visualization, and innovative sales-facing and merchant-facing data products.



