Data is the core of the system, so in addition to a service-oriented architecture, a data-oriented architecture is worth considering. A data-oriented service architecture needs to support heterogeneous data sources, both dynamic and static data, and both public-cloud and private-cloud deployments, and it should provide a variety of data applications and data products, as shown in the figure:
Generally, so as not to disturb the normal operation of the business systems, data from the different sources is first pooled using acquisition and ingestion technologies, then stored and put through a series of processing steps, and finally turned into data applications and derived data products through a variety of solutions.
From a development point of view, the architecture can be divided into four layers: infrastructure, operational tools, development tools, and solutions. From the point of view of the data itself, it can also be divided into four levels: data sources, dynamic data, static data, and data applications. The two views overlap.
Data source
The data sources determine the breadth of the data, and their volume determines its depth. Even for data applications tied to a specific business domain, the value of the data does not appear out of thin air. Data from the business systems therefore comes first: it is the easiest to obtain, and its direct value is the highest.
Next comes user behavior data. Although user behavior is induced and constrained by the product itself, behavior data still reflects user preferences to some extent. Usability testing in the past even gave rise to usability engineering; today, the user experience is typically verified through user behavior data.
The advent of the Internet of Things (IoT) underscores the importance of sensor data. Sensor data tends to be high-frequency and time-series in nature, so time-series-oriented storage and data migration deserve consideration. Location data can be regarded as a special kind of sensor data: it describes a position in physical space, which makes it very useful, especially for mobile internet applications.
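To make the time-series point concrete, here is a minimal sketch of downsampling high-frequency sensor readings into coarser time buckets. The function name and in-memory representation are illustrative; a real system would use a time-series database that performs this rollup itself.

```python
from collections import defaultdict

def downsample(readings, window_seconds):
    """Average high-frequency sensor readings into coarser time buckets.

    readings: iterable of (unix_timestamp, value) pairs.
    Returns {bucket_start_timestamp: mean value within that bucket}.
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        bucket = ts - (ts % window_seconds)   # align to window start
        buckets[bucket].append(value)
    return {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}
```

Downsampling like this is also how older high-frequency data can be migrated to cheaper storage while keeping its analytical value.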
Social is almost ubiquitous (anything can be social), and social attributes give an application additional value. E-mail may be one of the most ancient internet applications and can be seen as a special kind of social data; it can be collected through the standard POP3/IMAP4 protocols. In-app social data needs to be organized by the application itself, while third-party social platforms generally provide API services, as long as access control is respected.
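Data from a third-party social platform's API typically arrives as JSON to be parsed and aggregated. The payload shape below is hypothetical; every real platform defines its own schema and requires an access token on each request.

```python
import json

# Hypothetical response body; real platforms define their own schemas.
RAW = '{"user": "alice", "posts": [{"id": 1, "likes": 3}, {"id": 2, "likes": 5}]}'

def total_likes(payload_text):
    """Parse one JSON API response and aggregate a single engagement metric."""
    payload = json.loads(payload_text)
    return sum(post["likes"] for post in payload["posts"])
```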
Media data is broad in scope, and targeted collection requires crawler technology; the restrictions digital media place on crawlers are a challenge. In contrast, social media and self-published media that offer generic interfaces are easier to access.
Whether it is a customer's website or a competitor's website, collection again relies on crawler technology, and the result supplements the data of the business systems.
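At the heart of any crawler is parsing fetched pages for content and for further links to follow. A minimal sketch using only the standard library (real crawlers add fetching, politeness delays, and robots.txt handling):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags: the frontier of a crawl."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html_text):
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links
```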
Document data is mostly unstructured, an area where file systems and NoSQL stores shine. Many enterprises still process large volumes of paper documents; as AI technology develops and OCR in particular matures, all of these documents become data resources.
Dynamic Data
The acquisition process for dynamic data is similar to that for static data; the key difference lies in analysis, which for dynamic data happens as events occur. For example, an amusement park can use wristbands to collect visitor information: the wristbands record the visitors' behavior, and the park can use this data to recommend personalized services, making customized service possible during the visit itself. Scenarios built on dynamic data create more business opportunities between the enterprise and the user.
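The amusement-park scenario can be sketched as a tiny count-based recommender over wristband events. The event categories and catalog are invented for illustration; a production recommender would use far richer features.

```python
from collections import Counter

def recommend(events, catalog, top_n=1):
    """Recommend services matching the visitor's most frequent activity.

    events:  list of activity-category strings recorded by the wristband.
    catalog: {service_name: category} mapping of offerable services.
    """
    if not events:
        return []
    favorite, _ = Counter(events).most_common(1)[0]
    return [name for name, cat in catalog.items() if cat == favorite][:top_n]
```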
Dynamic data requires real-time processing, so latency is a key factor to consider: time is where money manifests itself. Reducing multi-tenant resource contention and using cloud services can lower latency, raise performance, and make it possible to process high-volume data in real time.
The data flow resembles a traditional ETL process: data is extracted and then initially converted and cleaned, with the specifics closely tied to the target. Stream processing is the core of dynamic data handling; its output can be further cleaned and stored, or fed directly into analysis methods that connect to the streaming applications behind them.
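The extract-clean-analyze flow can be sketched as a pipeline of Python generators, where an iterable stands in for a message queue such as Kafka. Field names are illustrative; a real pipeline would run on a stream framework like Storm or Spark Streaming.

```python
def extract(raw_records):
    # Extraction: yield records as they arrive from the (simulated) stream.
    yield from raw_records

def clean(records):
    # Initial cleaning: drop malformed records, normalize field types.
    for r in records:
        if r.get("value") is not None:
            yield {"sensor": r.get("sensor", "unknown"), "value": float(r["value"])}

def running_average(records):
    # A simple streaming analysis: running mean per sensor, updated per record.
    totals, counts, latest = {}, {}, {}
    for r in records:
        s = r["sensor"]
        totals[s] = totals.get(s, 0.0) + r["value"]
        counts[s] = counts.get(s, 0) + 1
        latest[s] = totals[s] / counts[s]
    return latest
```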
Data governance is the journey from using fragmented data to using consolidated master data, from little or no organizational and process governance to governance across the business, and from struggling with chaotic master data to an orderly flow of master data. Data governance is critical to ensuring that data is accurate, shared, and protected. Effective data governance ultimately realizes the value of data by improving analytic algorithms, reducing storage and computing costs, mitigating disaster-recovery risks, and improving safety and compliance.
Data security covers several aspects. The security of the data itself mainly means proactively protecting data with cryptographic methods: encryption, data integrity checks, and two-way identity authentication. The security of data protection means actively protecting stored data through means such as disk arrays, backups, and remote disaster recovery. The security of data processing concerns preventing database damage or data loss caused by hardware failure, human error, program defects, viruses, or attackers, and preventing leaks of sensitive or confidential data to readers who are not authorized. The security of data storage concerns whether data remains readable outside the running system.
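One of the simplest cryptographic protections mentioned above, a data-integrity tag, can be sketched with the standard library. An HMAC detects tampering with stored or transmitted data; it provides integrity and authenticity, not confidentiality, so encryption would be applied separately.

```python
import hashlib
import hmac

def sign(message: bytes, key: bytes) -> str:
    """Produce an integrity tag for a record using HMAC-SHA256."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(message: bytes, key: bytes, tag: str) -> bool:
    """Constant-time check that the record has not been tampered with."""
    return hmac.compare_digest(sign(message, key), tag)
```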
Data operation refers to mining and analyzing dynamic data and publishing what is hidden in the mass of data, in a compliant form, for data consumers. The data operation of dynamic data is a very challenging topic.
Static data
Operations on static data look more like batch processing: offline analysis, closer to traditional OLAP, which can deliver higher processing throughput. Data is first gathered from the various sources and then analyzed, so static data processing divides into two stages. For example, a retailer may analyze last month's data to plan this month's business activities, or decide whether to distribute customized coupons to customers based on their purchase behavior.
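The coupon example reduces to a batch aggregation over last month's transactions. The threshold and record shape are hypothetical; at scale this would be a grouped query in a data warehouse or a Spark job rather than in-memory Python.

```python
def coupon_candidates(purchases, min_total=100.0):
    """Offline batch analysis over a month of purchases.

    purchases: list of (customer_id, amount) pairs.
    Returns the sorted customers whose monthly spend crosses the
    (hypothetical) coupon threshold.
    """
    totals = {}
    for customer, amount in purchases:
        totals[customer] = totals.get(customer, 0.0) + amount
    return sorted(c for c, t in totals.items() if t >= min_total)
```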
The analytic computation itself can run on a private cloud or a public cloud. Data of a moderate scale, especially exploratory data analysis, can generally be processed on a private cloud, which may even serve the resulting data applications and products directly. When the data volume and required computing resources reach a certain level, migrating to the public cloud is worth considering. This yields a data-oriented hybrid cloud architecture; to keep migration simple and convenient, the environments must stay consistent, and YARN is a strong choice for resource scheduling. Of course, Mesos also deserves attention.
Static data typically requires mass storage, and given the pressing need for read-oriented performance, NoSQL is a natural choice. Of course, a data warehouse remains a good option for large amounts of structured data.
Data application
Data applications comprise computational frameworks, algorithms, data visualization, and application-specific rendering. Whether it is an enterprise application, a mobile application, or an interactive web application, it can compute results from data. Streaming applications and search applications are closely tied to the computational framework and can be implemented with Storm and Elasticsearch, or with the Spark framework.
Business Intelligence (BI) traditionally means data mining on top of a data warehouse to discover the latent value of data. In a data-oriented architecture, BI's analytic methods can stay the same while only the way they are computed changes, and the analytic methods themselves can also evolve.
A reporting system can be regarded as one of the core visualizations. Traditional reports are built from static data; combining dynamic data with static data yields real-time reports.
Stochastic analysis is a kind of exploratory data analysis, a way of probing and experimenting with the data, using tools such as Hive, Pig, and Spark SQL to clarify the direction of further exploration. Statistical analysis is a more specific form of offline analysis, processing data on the basis of statistical models.
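An exploratory query of the kind one would run in Hive or Spark SQL can be sketched with SQLite; the query shape is the same, only the engine and data scale differ. The table and columns here are invented for illustration.

```python
import sqlite3

def explore(rows):
    """Exploratory aggregation with plain SQL over an in-memory table.

    rows: list of (region, amount) tuples; returns per-region totals.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    cur = con.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")
    return cur.fetchall()
```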
Machine learning (ML) is an interdisciplinary field that simulates or implements human learning behavior to acquire new knowledge or skills. It is at the heart of AI, with many frameworks available, such as Mahout and Spark MLlib.
Deep learning is a newer field within machine learning research that originates from artificial neural networks; a multilayer perceptron with multiple hidden layers is a deep learning structure. Deep learning represents attribute categories or features by combining lower-level features into more abstract higher levels, in order to discover distributed representations of the data. Like machine learning in general, deep learning includes both supervised and unsupervised learning, and the models built under different learning frameworks differ greatly. Personally, I recommend TensorFlow.
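The multilayer perceptron mentioned above can be sketched as a forward pass in plain Python; stacking several such layers (the hidden layers) is what makes the network "deep". This is a toy illustration with sigmoid activations, not a training implementation; a real model would be built in a framework such as TensorFlow.

```python
import math

def layer(inputs, weights, biases):
    """One fully connected layer with a sigmoid nonlinearity.

    weights: one row of input weights per output neuron.
    """
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
        for row, b in zip(weights, biases)
    ]

def mlp_forward(x, layers):
    """Forward pass through a stack of (weights, biases) layers."""
    for weights, biases in layers:
        x = layer(x, weights, biases)
    return x
```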
Data-oriented architecture for the full stack