Big Data graph database: Data sharding and Data graph database

Source: Internet
Author: User

This article is excerpted from Chapter 14 of "Big Data: Architecture and Algorithms".



In a distributed computing environment, the first problem in mining massive data is how to distribute the data evenly across different servers. For non-graph data this problem is usually solved straightforwardly: because the records are independent of one another, the splitting algorithm faces no special constraint beyond balancing the server load as far as possible. Graph data records, however, are strongly coupled. Improper sharding not only causes load imbalance between machines but can also greatly increase network communication between them (see Figure 14-5). Since graph mining algorithms typically run for many iterative rounds, the impact of an unreasonable split is amplified round after round and can seriously slow down the whole system. Reasonable splitting of graph data is therefore critical to the running efficiency of offline graph mining applications, yet it remains a problem that has not been well solved.

What counts as a reasonable, or good, way to split graph data? What is the criterion? As the example above suggests, the two main factors for judging a graph partition are machine load balance and total network communication volume. Considering load balance alone, it is best to spread the graph nodes as evenly as possible over the servers, but this does not keep the total network communication small (see the right-hand cut in Figure 14-5: the load is balanced, but network communication is heavy). Considering network communication alone, the nodes of a densely connected subgraph should be placed on the same machine as far as possible, which effectively reduces network traffic but makes load balancing hard: a large, dense connected subgraph drives up the load on one machine. A reasonable splitting method must therefore strike a stable balance between the two factors in order to achieve the best overall system performance.
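To make these two criteria concrete, here is a minimal Python sketch (the helper name evaluate_partition and the data layout are illustrative assumptions, not from the book) that scores a candidate node-to-machine assignment on both factors at once:

from collections import Counter

def evaluate_partition(edges, assignment, num_machines):
    # Two scores, both lower-is-better: total cut edges (a proxy for
    # network traffic) and load imbalance (max load / ideal load).
    cut_edges = sum(1 for u, v in edges if assignment[u] != assignment[v])
    loads = Counter(assignment.values())
    ideal_load = len(assignment) / num_machines
    imbalance = max(loads.get(m, 0) for m in range(num_machines)) / ideal_load
    return cut_edges, imbalance

# Toy example: a triangle (0, 1, 2) plus a pendant node 3, on two machines.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(evaluate_partition(edges, {0: 0, 1: 0, 2: 1, 3: 1}, 2))  # -> (2, 1.0)

A good splitting algorithm drives both numbers down together; optimizing either one in isolation is easy but, as argued above, not useful.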



The following describes two approaches that cut graph data from different starting points, together with typical splitting algorithms and the corresponding mathematical analysis. One point must be stressed first: when choosing a concrete splitting algorithm, it is not the case that the more complex and effective the algorithm, the more likely a real system is to adopt it. Readers may want to think about why; the answer is given later in this section.


14.3.1 Edge Cut (Edge-Cut)

The problem is now: given a huge graph and P machines, how do we cut the graph into P subgraphs? There are two different ways to approach this cutting problem.

The edge-cut method represents the most common idea: the cutting line may pass only through the edges connecting graph nodes, dividing the complete graph into P subgraphs. Figure 14-6 shows seven nodes distributed to three machines; the left side shows the edge-cut method, where the number on each graph node indicates the machine to which that node is assigned.


After the graph is cut with the edge-cut method, any graph node is assigned to exactly one machine, but the data of a cut edge is stored on both machines, and every cut edge implies remote communication between machines during graph computation. Clearly, the extra storage overhead and the communication overhead of the system both depend on the number of cut edges: the more edges cut, the higher the storage overhead and the higher the communication overhead.

As mentioned above, there are two considerations for judging whether a graph partition is reasonable: machine load balance and inter-machine communication volume. For the edge-cut method, every concrete cutting algorithm therefore pursues the following goal: assign the graph nodes to the machines of the cluster as evenly as possible while minimizing the number of cut edges.


That is, find the assignment with the fewest cut edges under the condition that the nodes are spread over the machines as evenly as possible. Formally, writing $A(v)$ for the machine to which node $v$ is assigned:

$$\min_{A}\ \bigl|\{(u,v)\in E \mid A(u)\neq A(v)\}\bigr| \quad \text{s.t.} \quad \max_{m}\ \bigl|\{v\in V \mid A(v)=m\}\bigr| \le \lambda\,\frac{|V|}{P}$$

Here $|V|/P$ is the average number of nodes per machine when all nodes are spread over the $P$ machines, and $\lambda \ge 1$ is an imbalance factor that controls how uniform the node allocation must be: $\lambda = 1$ demands a perfectly even split, and larger values permit more imbalance.

From this formal description we can see that when $\lambda$ is close to 1, the problem is essentially the classic balanced p-way graph partitioning problem. However, graph partitioning algorithms of this kind have high time complexity and are unsuitable for massive data, so they are rarely used in real large-scale scenarios.

In actual graph computing systems, the common strategy is random node partitioning: a hash function spreads the nodes evenly over the machines of the cluster, without any attention to which edges get cut. Both Pregel and GraphLab adopt this policy. The method is fast, simple, and easy to implement, but Theorem 14.1 shows that it cuts the vast majority of the edges in the graph.

Theorem 14.1: if nodes are assigned to P machines uniformly at random, the expected proportion of cut edges is 1 - 1/P.

According to Theorem 14.1, if the cluster contains 10 machines, the cut-edge proportion is about 90%, that is, 90% of all edges are cut; with 100 machines, 99% of the edges are cut. This splitting method is clearly very inefficient.
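A small simulation makes Theorem 14.1 tangible (an illustrative sketch, not code from the book; the random graph and its sizes are made up):

import random

def random_cut_fraction(num_nodes, edges, P, seed=0):
    # Assign each node to one of P machines uniformly at random,
    # then measure which fraction of edges ends up cut.
    rng = random.Random(seed)
    machine = {v: rng.randrange(P) for v in range(num_nodes)}
    cut = sum(1 for u, v in edges if machine[u] != machine[v])
    return cut / len(edges)

rng = random.Random(42)
edges = [(rng.randrange(2000), rng.randrange(2000)) for _ in range(10_000)]
for P in (10, 100):
    print(P, round(random_cut_fraction(2000, edges, P), 3))
# Prints values close to 0.9 and 0.99, matching 1 - 1/P.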


14.3.2 Vertex Cut (Vertex-Cut)

The vertex-cut method represents a different way of cutting the graph. Unlike the edge-cut method, the cutting line may pass only through graph nodes, never through edges, so a node that is cut appears in multiple subgraphs at the same time. The right side of Figure 14-6 shows the vertex-cut method: the node at the center of the graph is cut into three parts, meaning it appears in all three subgraphs after the cut.

In contrast to the edge-cut method, under the vertex-cut method each edge is assigned to exactly one machine and is never stored twice; instead, the cut nodes are stored repeatedly on multiple machines, which again incurs extra storage overhead. This style of cutting also raises a new problem: graph algorithms continually update node values during iteration, and because a node may be stored on several machines, multiple copies of its value exist. The consistency of node values across these copies must therefore be maintained. A typical solution to this problem is presented in the following section.

Since no edges are cut by the vertex-cut method, does that mean the machines need not communicate with each other? Not at all: communication overhead is still incurred to keep the values of the cut (replicated) nodes consistent. For the vertex-cut method, therefore, the goal of every concrete algorithm is to distribute the edge data evenly across the machines in the cluster while minimizing the number of node replicas.



That is, find the placement with the smallest average number of node replicas under the condition that the edges are spread over the machines as evenly as possible. Formally, writing $A(v)$ for the set of machines holding a replica of node $v$ and $A(e)$ for the machine to which edge $e$ is assigned:

$$\min_{A}\ \frac{1}{|V|}\sum_{v\in V}\bigl|A(v)\bigr| \quad \text{s.t.} \quad \max_{m}\ \bigl|\{e\in E \mid A(e)=m\}\bigr| \le \lambda\,\frac{|E|}{P}$$

Here $|E|/P$ is the average number of edges per machine when all edges are spread over the $P$ machines, and $\lambda \ge 1$ is an imbalance factor that controls how uniform the edge allocation must be: $\lambda = 1$ demands a perfectly even split, and larger values permit more imbalance.

Similarly, because sophisticated graph cutting algorithms have prohibitive time complexity, random edge placement (hashing edges evenly onto machines) is the strategy most commonly used in real systems.
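As an illustration of what random edge placement costs, the sketch below (names and data layout are assumptions for the example) hashes each edge to one machine and reports the average replica count per node, the quantity the objective above minimizes, together with the edge-load imbalance:

import random
from collections import defaultdict

def random_edge_placement(edges, P, seed=0):
    rng = random.Random(seed)
    replicas = defaultdict(set)   # node -> machines holding a copy of it
    loads = [0] * P               # number of edges stored per machine
    for u, v in edges:
        m = rng.randrange(P)      # place the whole edge on one machine
        loads[m] += 1
        replicas[u].add(m)
        replicas[v].add(m)
    avg_replicas = sum(len(s) for s in replicas.values()) / len(replicas)
    imbalance = max(loads) / (len(edges) / P)
    return avg_replicas, imbalance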






In the real world, the degree distribution of most graphs follows a power law. Theory and practice have both shown that for graph data obeying this law, random edge placement (a vertex-cut method) is stronger than random node placement (an edge-cut method), with computing efficiency at least an order of magnitude higher. For graph data in general, then, the vertex-cut method is much better than the edge-cut method.
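This claim can be checked with a rough simulation (everything here, the crude Zipf-style generator included, is an assumption made for illustration, not the book's experiment): on a skewed graph, count the messages each scheme implies per iteration, cut edges for random node placement versus extra node replicas for random edge placement.

import random
from collections import defaultdict

def powerlaw_edges(n, m, alpha=2.0, seed=1):
    # Crude power-law-ish generator: endpoint i is drawn with weight
    # 1/(i+1)^alpha, so a few hub nodes attract most of the edges.
    rng = random.Random(seed)
    weights = [1 / (i + 1) ** alpha for i in range(n)]
    return [tuple(rng.choices(range(n), weights=weights, k=2)) for _ in range(m)]

def messages_per_iteration(edges, P, seed=2):
    rng = random.Random(seed)
    nodes = {u for e in edges for u in e}
    # Edge-cut via node hashing: one message per cut edge.
    home = {v: rng.randrange(P) for v in nodes}
    cut = sum(1 for u, v in edges if home[u] != home[v])
    # Vertex-cut via edge hashing: one sync message per extra node replica.
    replicas = defaultdict(set)
    for u, v in edges:
        m = rng.randrange(P)
        replicas[u].add(m)
        replicas[v].add(m)
    sync = sum(len(s) - 1 for s in replicas.values())
    return cut, sync

print(messages_per_iteration(powerlaw_edges(5000, 50_000), P=10))
# On such a skewed graph the replica-sync count comes out far below the cut-edge count.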


Please think: why aren't the more complex and more effective splitting algorithms the more popular ones?

A: Generally, a graph mining task is divided into two phases.

Phase 1: centralized graph data segmentation and distribution; Phase 2: distributed graph computing.

If a complex graph cutting algorithm is used, load balancing is good and inter-machine traffic is low, so the second phase runs efficiently. But a complex algorithm not only carries a high development cost; the time spent in the first phase is also very high, and may even exceed the efficiency gained in the second phase. The choice of splitting algorithm must therefore weigh overall, end-to-end efficiency.
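This trade-off can be written as a one-line cost model (a back-of-the-envelope formulation, not a formula from the book). A complex partitioner is worth using only when the extra time it spends in phase 1 is recovered over the R iterations of phase 2:

$$ T_{\text{part}}^{\text{complex}} - T_{\text{part}}^{\text{random}} \;<\; R \cdot \bigl( t_{\text{iter}}^{\text{random}} - t_{\text{iter}}^{\text{complex}} \bigr) $$

With the tens of iterations typical of iterative graph algorithms, the right-hand side is often too small to repay an expensive partitioning phase, which is why random hashing wins in practice.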




How to perform table sharding for large databases

<%
' Use ADOX to read the Description property of a field in an Access database
Function OpenConnectionWithString(strMDBPath, strTableName, strColName)
    Dim cat
    Set cat = Server.CreateObject("ADOX.Catalog")
    cat.ActiveConnection = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & strMDBPath
    OpenConnectionWithString = cat.Tables(strTableName).Columns(strColName).Properties("Description").Value
    Set cat = Nothing
End Function

Response.Write OpenConnectionWithString(Server.MapPath("./database name.mdb"), "table name", "target field name")
%>
(Not tested.)

How to analyze and process big data?

From an analysis standpoint, big data is no longer simply a large volume of facts; what matters most is analyzing it, since only analysis extracts the intelligent, in-depth, valuable information it contains. More and more applications involve big data, and its attributes of volume, velocity, and variety make it increasingly complex, so analysis methods are especially important in the big data field: they are the decisive factor in whether the final information is valuable. On this understanding, what are the common methods and theoretical foundations of big data analysis?

1. Visual analysis. Users of big data analysis range from experts to ordinary users, but the most basic requirement of both is visualization, because it presents the characteristics of big data intuitively and is as easy to accept as reading a picture.

2. Data mining algorithms. The theoretical core of big data analysis is data mining algorithms. Different algorithms, built for different data types and formats, bring out the characteristics of the data; it is these statistical methods, recognized by statisticians worldwide, that can go deep into the data and extract recognized value. Mining algorithms must also process big data quickly: if an algorithm took years to reach a conclusion, the value of the data could never be realized.

3. Predictive analytics. One of the chief application areas of big data analysis is prediction: extracting features from big data, building a model scientifically, and then feeding new data through the model to forecast future data.

4. Semantic engines. The diversity of unstructured data poses new challenges for analysis, and a tool system is needed to parse and extract data. A semantic engine must be designed with enough artificial intelligence to extract information from data actively.

5. Data quality and data management. Big data analysis is inseparable from data quality and data management; high-quality data and effective data management guarantee the authenticity and value of analysis results in academic research and commercial applications alike.

These five aspects are the foundation of big data analysis; going deeper, there are many more distinctive, deeper, and more specialized methods.

Big data technologies:

ETL tools extract data from distributed, heterogeneous sources such as relational databases and flat files into a temporary middle layer for cleaning, conversion, and integration, and finally load it into a data warehouse or data mart, forming the basis of online analytical processing and data mining.

Data access: relational databases, NoSQL, SQL, etc.

Infrastructure: cloud storage, distributed file storage, etc.

Data processing: NLP (Natural Language Processing) is the discipline that studies language problems in human-computer interaction. The key is letting the computer "understand" natural language, so the field is also called NLU (Natural Language Understanding) or computational linguistics. It is at once a branch of language information processing and one of the core topics of AI (Artificial Intelligence).

Statistical analysis: hypothesis testing, significance testing, difference analysis, correlation analysis, t-tests, analysis of variance, chi-square analysis, partial correlation analysis, distance analysis, regression analysis, simple regression, multiple regression, stepwise regression, regression prediction and residual analysis, ridge regression, logistic regression, curve estimation, factor analysis, cluster analysis, principal component analysis, fast clustering and hierarchical clustering, discriminant analysis, correspondence analysis, multidimensional correspondence analysis (optimal scaling), and bootstrap techniques.

Data mining: classification, estimation, prediction, affinity grouping (association rules), description and visualization, ...
