A brief talk: how to combine Python and R to complete tasks

Source: Internet
Author: User

Overview


Unlike data science competitions, real data science work is often less about developing algorithms than about defining requirements and getting the data under control. How to tie the work to the actual business, so that the data delivers real value, therefore becomes the more meaningful question.


The complete process for a data science project is typically five steps:

Requirement definition => Data acquisition => Data governance => Data analysis => Data visualization


I. Definition of requirements


Requirement definition is the biggest difference between a data science project and a data science competition. In a real project we are often unclear about the objective function, the independent variables and the constraints. We have to analyze the problem systematically through interviews, papers, documents and so on, abstract the practical problem into a quantified one, and determine the independent variables, constraints and objective function. Real requirements also tend to change often and arrive at short notice, so keeping a grip on them is key to everything that follows in the project.
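As a minimal sketch of what quantifying a requirement can look like, the Python snippet below writes a made-up business question (how to split an advertising budget between two channels) as independent variables, a constraint and an objective function, then searches a small grid of integer allocations. The channel names, return formula and budget are all hypothetical and are here only for illustration.

```python
from itertools import product

# Hypothetical problem: split a budget between two ad channels.
BUDGET = 10  # constraint: total spend must not exceed this (arbitrary units)


def objective(search_spend, social_spend):
    """Hypothetical expected return with diminishing returns (square roots)."""
    return 3.0 * search_spend ** 0.5 + 2.0 * social_spend ** 0.5


def feasible(search_spend, social_spend):
    """Constraint: non-negative spend that stays within the budget."""
    return (
        search_spend >= 0
        and social_spend >= 0
        and search_spend + social_spend <= BUDGET
    )


def best_allocation():
    """Brute-force search over integer values of the independent variables."""
    candidates = [
        (s, t) for s, t in product(range(BUDGET + 1), repeat=2) if feasible(s, t)
    ]
    return max(candidates, key=lambda pair: objective(*pair))


if __name__ == "__main__":
    print(best_allocation())
```

The point is less the search itself than the separation: once the objective and the constraints are explicit functions, a change in requirements becomes a change to one of them rather than to the whole analysis.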


II. Data acquisition


The forms of data acquisition mainly include:


    1. calls to existing databases

    2. calls to existing APIs

    3. self-designed crawler


Of these, the hardest is developing crawlers. R has the rvest package, but it pales next to a complete crawler scheduling system such as django-scrapy, so for this step I suggest using Python for crawler development.
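To illustrate the core of what a crawler does, here is a minimal sketch that extracts and normalizes the links on a page using only the Python standard library. It is not Scrapy and does not use any Scrapy API; the class name `LinkExtractor`, the helper `extract_links` and the sample HTML are assumptions made for this example. A real crawler would add fetching, scheduling, deduplication and rate limiting on top.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Parse an HTML string and return the links it contains, in order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    sample = '<p><a href="/page/2">next</a> <a href="https://example.org/x">x</a></p>'
    print(extract_links(sample, "https://example.com/page/1"))
```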


III. Data governance


The first step of data governance is defining the data. With Python's various ORM frameworks and an admin system, the definition and management of a data warehouse can be handled very well. With Airflow we can also monitor the whole ETL process.

So for this step I still recommend Python as the data governance tool.
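The idea of defining the data first and then loading it through a checked ETL step can be sketched with the standard library's `sqlite3` module. This stands in for an ORM model and a scheduled ETL task; it does not use the Django or Airflow APIs. The table `page_visits`, its columns and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical schema for illustration; the article does not specify one.
SCHEMA = """
CREATE TABLE IF NOT EXISTS page_visits (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    visits INTEGER NOT NULL CHECK (visits >= 0)
)
"""


def transform(raw_rows):
    """Clean raw (url, visits) pairs and drop rows that fail validation."""
    for url, visits in raw_rows:
        url = (url or "").strip()
        try:
            count = int(visits)
        except (TypeError, ValueError):
            continue  # visits is not a number
        if url and count >= 0:
            yield url, count


def load(conn, raw_rows):
    """Create the table if needed, insert cleaned rows, return how many were loaded."""
    conn.execute(SCHEMA)
    rows = list(transform(raw_rows))
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO page_visits (url, visits) VALUES (?, ?)", rows
        )
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    raw = [(" /home ", "12"), ("/about", "n/a"), ("", "3"), ("/docs", 7)]
    print(load(conn, raw))
    print(conn.execute("SELECT url, visits FROM page_visits ORDER BY id").fetchall())
```

Returning the number of loaded rows gives a monitoring system a simple figure to track from one run to the next.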


IV. Data analysis


Data analysis starts with exploratory analysis, which is where R is strong: it offers a wide range of powerful data visualization, so we can use it to get a quick picture of the overall characteristics of the data. With data.table and Rcpp we can also quickly improve R's single-machine performance, which spares us the awkwardness of writing Cython wrappers. Python is less flexible than R for exploratory analysis because its operations are more constrained. For matrix multiplication, at least, I find the intuitive %*% easier to accept than np.dot(). So for this step I recommend using R to analyze the data.
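On the matrix multiplication point, it is worth noting that since Python 3.5 the language has a dedicated `@` operator (PEP 465), which NumPy arrays support, so `a @ b` is a closer counterpart to R's `a %*% b` than `np.dot(a, b)`. The sketch below shows the mechanism with a tiny pure-Python `Matrix` class, a hypothetical helper written for this example so that it runs without any dependencies.

```python
class Matrix:
    """Tiny dense matrix backed by a list of row lists."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __matmul__(self, other):
        """Implement `self @ other` as ordinary matrix multiplication."""
        if len(self.rows[0]) != len(other.rows):
            raise ValueError("inner dimensions do not match")
        columns = list(zip(*other.rows))
        return Matrix(
            [
                [sum(a * b for a, b in zip(row, col)) for col in columns]
                for row in self.rows
            ]
        )

    def __eq__(self, other):
        return isinstance(other, Matrix) and self.rows == other.rows

    def __repr__(self):
        return f"Matrix({self.rows})"


if __name__ == "__main__":
    a = Matrix([[1, 2], [3, 4]])
    b = Matrix([[5, 6], [7, 8]])
    print(a @ b)
```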


V. Visualization of data


Data visualization is the world of JavaScript, but thanks to the developers in the R ecosystem who package JS libraries, the vast majority of JS libraries used in the BI field are now wrapped for R, such as ECharts, Highcharts, rCharts and D3. In addition, Shiny greatly simplifies and speeds up building a BI system: we can skip the underlying jQuery, Bootstrap, WebSocket and so on, and build a BI system directly for the business scenario, which clears the way to a quick BI prototype instead of laboriously editing templates in Tornado. Using R for data visualization can clearly cut our development time a great deal. So for this step I also recommend R.


Summary


So when a normal data science project is finished, we need to deliver a crawler management system (django-scrapy), a data warehouse management system (Django), a process monitoring system (Airflow) and a BI analysis system (Shiny). Together these let us monitor and maintain the entire data science project, and along the way we keep iterating on our data products, optimizing processes and refining models, ultimately feeding value back into the business.


In summary, using Python as the foundation of data science with R as the superstructure is a good solution. Of course, all of this assumes that the data developers have strong development skills; otherwise the looseness that Python and R allow will lead to a huge tragedy.

