How to Create A Big Data Modeling Platform with Open Source Components?
Finally, a chance to just post some casual chit-chat... oh wait, this is a technical write-up, and there's an official event going on? Then I'd better keep up! Ahem~
This article is not clickbait: it introduces the technologies behind building a web-based big data modeling platform and how its components relate to each other. This is a real enterprise project that has been commercialized, and as one of its core developers I witnessed the growth of the entire platform.
Due to limited space, the project's actual code will not be reproduced here, but the relationships between components, the business scenarios, and the data processing flow will be explained as clearly as possible, with related big data knowledge interspersed along the way.
Having spent many years on the training side of this field, please forgive me if parts of this read as rambling or shallow. If you are still a learner, or a developer who has just stepped into big data, this article is worth bookmarking.
2. Project background
The project dates back three or four years. While Alibaba's Data Plus platform was still in free trial, the big data modeling platform we built had already been commercialized and was taking orders; we had reached cooperation agreements with Huawei and China Unicom and entered Unicom's WoChuang Space. Here is a picture to get a feel for it first:
At this point a passerby is bound to step in and say: "You only managed to commercialize it because the big vendors hadn't yet built products that could crush you~" To which I can only reply: emmm, you... are right!
In reality, though, many factors determine whether a product can be commercialized. Large vendors have obvious advantages in many respects, but that does not mean other products have no opportunities. Besides technical strength, team size, project funding, product positioning, and the market environment all matter just as much.
When I took over the project it was already half-finished. The so-called big data modeling platform was really a general-purpose product: its value lay in the integration of functions, and the work was essentially standard big data development. The team consisted mainly of developers, plus, of course, data analysts.
The core function of the whole product is to support the complete pipeline of data collection, data source management, data cleaning, statistical analysis, machine learning, and data visualization. The difficulty lies in forming a data flow that is controllable and easy to manage. Even after many years, I still feel that although this project involved no complex scenarios and no fancy data analysis optimizations, it was genuinely valuable to me: it let me truly understand and operate every step of the data analysis process. You could say it unblocked my Ren and Du meridians, as the martial arts saying goes for a breakthrough. Anything you do afterwards is really just the optimization of one link, or a fixed data flow in a specific scenario. Once the general case has been built, is computing some fixed metric or training some model still a problem?
In every interview since, whenever this project comes up, the other side says: this young man landed a good project. Of course the project itself is only one part; my own summary of it mattered just as much. Around graduation I forced myself to understand the project thoroughly, not just the technology but also the product and the design, and I turned it into my master's thesis, after obtaining the software copyright, of course (I later found out the copyright didn't actually seem to matter for a graduation thesis).
That is the end of the background; the main content begins below. If you are sure this is what you were looking for, please like, follow, and show some support after reading. Bookmarks and comments with your thoughts are welcome too.
3. Meet the technology stack
To make it easier to introduce the scenarios and the technologies that handle them, the content is organized by functional module. First, the complete project architecture diagram:
Looking at this diagram now, the architecture is a bit dated, but I think history should be restored truthfully. Big front-end technologies had only just taken off back then, and the project had already been in development for a while before I took over, so this is what it looked like at the time, and it also represents that period of hard work.
Recalling that period, there were really very few big data materials and project cases; most of what circulated was PPT-level bragging, and the core big data technologies inside the big companies were simply out of reach for ordinary developers, so we were essentially feeling our way forward.
1. Functional module framework
Due to the limited length of this article, only part of the big data modeling platform's functions will be introduced. If you are interested in the processing techniques behind particular steps, you can scan the QR code at the bottom of the article to join the WeChat group (an official CSDN group provided for content partners), where there will be regular live interactions with readers.
Since this is an enterprise-level application, a real project inevitably includes a set of permission management features for departments, employees, and so on. This article focuses only on the big data processing flow, so those less relevant parts are omitted.
2. Data source management
For data source management: all data to be analyzed is stored on HDFS. Since the platform mainly serves statistical analysis, all processed data is structured offline data. It can be pulled from a relational database or uploaded by the user, and once ingested it exists in the form of a Hive table. The platform records only the data source's name, its owner, and the information of the corresponding Hive table, and subsequent data flows never modify the original data. The same copy of data may therefore be used in multiple data flows, and an existing Hive table can be declared as multiple data sources; in effect, multiple association relationships are established. Everything displayed to the user in the modeling platform is a source node.
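To make the "record only metadata, never touch the raw data" idea concrete, here is a minimal sketch of what such a data source record might look like as a Hibernate/JPA entity (the platform does use Hibernate for persistence). All names here (`DataSource`, `hiveTable`, `OriginType`, ...) are hypothetical illustrations, not the project's actual model.

```java
import javax.persistence.*;

// Hypothetical metadata record for one data source. The platform stores only
// this descriptor; the raw data stays untouched in the underlying Hive table,
// so several DataSource rows may point at the same physical table.
@Entity
@Table(name = "data_source")
public class DataSource {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // Display name shown to the user on the canvas.
    private String name;

    // Owner (user or department) the source belongs to.
    private String owner;

    // Fully qualified name of the backing Hive table, e.g. "ods.user_upload_123".
    private String hiveTable;

    // How the data arrived: pulled via Sqoop or uploaded as a file.
    @Enumerated(EnumType.STRING)
    private OriginType origin;

    public enum OriginType { SQOOP_IMPORT, FILE_UPLOAD }

    // Getters and setters omitted for brevity.
}
```

Because only this descriptor is stored, declaring one Hive table as multiple data sources is simply a matter of inserting multiple rows that reference the same table.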
The Sqoop component is used to pull from relational databases: the database connection parameters filled in by the user are spliced into a complete Sqoop command, which is then executed on the server. For files uploaded by users, the user must specify column names, column types, and the field and line separators; a Hive table with the corresponding structure is created automatically from that information, so the data can be read correctly after import.
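As an illustration of the "splice a command from user input" step, here is a minimal sketch in Java. The Sqoop flags used are standard ones, but the class, method, and parameter names are my own, and a real implementation would also need input validation and error handling.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical helper that turns user-supplied connection parameters into a
// complete Sqoop import command, and builds the Hive DDL for file uploads.
public class IngestionHelper {

    /** Splices a Sqoop import command; the result lands directly in a Hive table. */
    public static List<String> buildSqoopImport(String jdbcUrl, String user,
                                                String password, String table,
                                                String hiveTable) {
        return List.of(
                "sqoop", "import",
                "--connect", jdbcUrl,      // e.g. jdbc:mysql://host:3306/mydb
                "--username", user,
                "--password", password,
                "--table", table,          // source table in the relational database
                "--hive-import",           // write the imported rows into Hive
                "--hive-table", hiveTable, // target Hive table recorded by the platform
                "--num-mappers", "1");     // single mapper keeps the sketch simple
    }

    /** Runs a spliced command on the server and waits for it to finish. */
    public static int run(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }

    /** Builds CREATE TABLE DDL for a user-uploaded delimited text file. */
    public static String buildCreateTableDdl(String hiveTable, String[] colNames,
                                             String[] colTypes, String fieldSep) {
        StringBuilder ddl = new StringBuilder("CREATE TABLE " + hiveTable + " (");
        for (int i = 0; i < colNames.length; i++) {
            if (i > 0) ddl.append(", ");
            ddl.append(colNames[i]).append(' ').append(colTypes[i]);
        }
        ddl.append(") ROW FORMAT DELIMITED FIELDS TERMINATED BY '")
           .append(fieldSep)
           .append("' STORED AS TEXTFILE");
        return ddl.toString();
    }
}
```

After the DDL runs, a `LOAD DATA INPATH` statement can move the uploaded file into the new table so Hive can read it with the declared separators.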
3. Data processing flow
One of the most basic functions of the modeling platform is letting users define their own data flows, which suits both enterprise use and university teaching. The approach we adopted was to encapsulate common statistical analysis functions and complete machine learning libraries into functional nodes (implemented mainly with HiveQL, Spark MLlib, and RHive). Each node has its own configuration parameters, and all the user needs to do is drag, combine, configure, and run.
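To illustrate what "encapsulating functions into a node" might look like, here is a minimal sketch of a node abstraction, assuming hypothetical names; the real platform hides HiveQL, Spark MLlib, and RHive behind something playing roughly this role.

```java
import java.util.Map;

// Hypothetical abstraction of one draggable node on the canvas. A concrete
// node (filter, join, aggregation, model training, ...) reads the result
// tables of its upstream nodes and materializes a new result table.
public interface FlowNode {

    /** Unique id of this node within one flow, assigned by the canvas. */
    String id();

    /**
     * Executes the node's logic, e.g. by running HiveQL or a Spark MLlib job.
     *
     * @param inputTables result tables of upstream nodes, keyed by node id
     * @param config      the parameters the user filled in on the node's form
     * @return the name of the Hive result table this node produced
     */
    String execute(Map<String, String> inputTables, Map<String, String> config);
}
```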
For the front-end flow design UI component we chose GooFlow. A data flow can be saved and modified; in the database it is really one big JSON document that records the direction of the connecting lines, the node configuration, and so on. When the flow is opened again the canvas is restored from it, and the configuration of every node in the flow is saved along with it.
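As a sketch of how such a "big JSON" might be saved and restored, here is a hypothetical shape serialized with Jackson. The field names (nodes, lines, x/y positions, per-node config) are my guesses at what GooFlow-style canvas data contains, not the actual schema.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.Map;

// Hypothetical persisted form of a flow: the whole canvas is one JSON
// document stored in a single database column.
public class FlowDefinition {
    public String flowId;
    public List<Node> nodes; // everything the user dragged onto the canvas
    public List<Line> lines; // the directed connections between nodes

    public static class Node {
        public String id;
        public String type;                // e.g. "datasource", "filter", "kmeans"
        public int x, y;                   // canvas position, needed to redraw the view
        public Map<String, String> config; // the node's saved parameter form
    }

    public static class Line {
        public String from; // upstream node id
        public String to;   // downstream node id
    }

    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Serializes the canvas state into the JSON string saved to the database. */
    public String toJson() throws Exception {
        return MAPPER.writeValueAsString(this);
    }

    /** Restores the canvas state from the saved JSON when the flow is reopened. */
    public static FlowDefinition fromJson(String json) throws Exception {
        return MAPPER.readValue(json, FlowDefinition.class);
    }
}
```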
Once a flow starts running, each step generates a result table that serves as the data source for the next operation. The final run produces a result table that can be displayed directly as a table, downloaded, or shown through a drag-and-drop visualization component after configuration.
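Building on the `FlowNode` interface sketched above, the "each step feeds the next" mechanism can be expressed as a walk over the nodes in topological order; again, a hedged sketch with hypothetical names rather than the platform's actual scheduler.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FlowRunner {

    /**
     * Runs the nodes in topological order. Each node reads the result tables
     * of its upstream nodes and registers its own result table; the last
     * node's table is what gets displayed, downloaded, or visualized.
     */
    public String run(List<FlowNode> topologicallyOrdered,
                      Map<String, List<String>> upstreamIds,
                      Map<String, Map<String, String>> nodeConfigs) {
        Map<String, String> resultTables = new HashMap<>(); // node id -> result table
        String lastTable = null;
        for (FlowNode node : topologicallyOrdered) {
            Map<String, String> inputs = new HashMap<>();
            for (String up : upstreamIds.getOrDefault(node.id(), List.of())) {
                inputs.put(up, resultTables.get(up));
            }
            lastTable = node.execute(inputs,
                    nodeConfigs.getOrDefault(node.id(), Map.of()));
            resultTables.put(node.id(), lastTable);
        }
        return lastTable; // the final result table of the whole flow
    }
}
```

A HiveQL-backed node would typically materialize its output with a `CREATE TABLE <result> AS SELECT ...` over its input tables, which is one common way to produce the per-step result tables described above.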
Note that GooFlow is a component that requires a license; you can also substitute another component for it. The versions publicly available online are either trial builds or pirated builds carrying a mining program, so if you want to use it, please contact the author.
4. Other functional modules
The remaining functions are more conventional web application development. For example, the visualization part is a wrapper around Echarts' option configuration, allowing users to set up chart effects through the interface. The simplest way to query the data to be displayed from Hive is HiveJDBC.
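Pulling the data behind a chart really is just a HiveJDBC query. Here is a minimal sketch using the standard HiveServer2 driver; the host, port, credentials, and column names are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class ChartDataDao {

    // Standard HiveServer2 JDBC URL; host, port, and database are placeholders.
    private static final String URL = "jdbc:hive2://hive-server:10000/default";

    /** Reads (category, value) pairs from a result table to feed an Echarts option. */
    public static List<String[]> queryChartData(String resultTable) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // the HiveJDBC driver class
        List<String[]> rows = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(URL, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, value FROM " + resultTable)) { // placeholder columns
            while (rs.next()) {
                rows.add(new String[]{rs.getString(1), rs.getString(2)});
            }
        }
        return rows;
    }
}
```

The rows can then be dropped into the `xAxis` and `series` fields of the Echarts option object that the front end assembles from the user's chart configuration.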
The data interaction between front and back ends uses a combination of JSP tags and Ajax, which is rather dated. The persistence layer uses Hibernate; although I personally prefer MyBatis, refactoring it single-handedly was out of the question, so I just got on with it.
Since this is an article about the technology stack, I have kept the text light; I think architecture diagrams and flowcharts show things more clearly and directly. If there is anything you want to discuss, leave a message in the comments~