A discussion of big data and databases

Source: Internet
Author: User

A few days ago on the Shuimu ("water wood") community I found that there are still real experts around. I read a discussion there about big data and databases and found it quite interesting. Limited by space and layout, I have tidied up part of it here.
First, look at this person's analysis. They know the state of the industry very well; if not a university professor, they must be an industry pioneer.
Big data is a solution, not a model. A solution comes under practical pressure, so it can only resort to all sorts of tricks to "solve" the problem. And since it is a solution, it covers storage, computation, input and output, and so on. On the computation side, to make better use of inexpensive hardware, computational models such as Hadoop/MapReduce and Storm have been put into practice. Storage has changed a great deal as well.

In fact, the storage problem that big data most needs to solve is largely the relationship between I/O and computing tasks. RDBMS designs did take the characteristics of the storage system into account, including the cost of reading and writing data, but those analyses and studies were based on the hardware architecture of the 1970s and 1980s. Today's hardware architecture, network architecture included, is very different from that of the 1970s and 1980s.
For example, in the 1980s nobody could imagine a PC with a 2 TB hard disk. Even in the late 1990s, that much storage had to sit on a disk array; now a single disk drive handles it. However, the mechanical part of the hard disk, the head, has not improved the way capacity has. It is not only about as slow as before (much faster, of course, but still slow), it is also as fragile as before. This has directly called into question the rationale for normal forms in RDBMS: the space saved by eliminating redundancy no longer matters much, while the random reads and writes that normalization causes make the head-speed problem stand out.
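
To make the head-speed argument concrete, here is a rough back-of-envelope sketch in Python. The seek time, transfer rate, and record size are assumed round figures chosen for illustration only, not measurements from the discussion.

```python
# Back-of-envelope model of a spinning disk.
# All three constants are assumed round figures, not measurements.
SEEK_MS = 10.0            # assumed seek + rotational delay per random access
THROUGHPUT_MB_S = 100.0   # assumed sequential transfer rate
RECORD_BYTES = 100        # assumed record size


def random_read_seconds(n_records: int) -> float:
    """Time to fetch n records when every record costs one head movement."""
    return n_records * SEEK_MS / 1000.0


def sequential_read_seconds(n_records: int) -> float:
    """Time to stream n adjacent records after a single initial seek."""
    megabytes = n_records * RECORD_BYTES / 1_000_000
    return SEEK_MS / 1000.0 + megabytes / THROUGHPUT_MB_S


n = 1_000_000
print(f"random reads:    {random_read_seconds(n):.2f} s")
print(f"sequential scan: {sequential_read_seconds(n):.2f} s")
```

Under these assumptions the gap is several orders of magnitude, which is why a layout that trades extra space for fewer head movements can come out ahead.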

Back to the relationship between big data and databases. Databases actually come in many models, and the relational model is just one of them. However, relational algebra, the foundation of the relational model, solves a large number of data storage problems at the mathematical level. For example, the Cartesian product lets independently stored data sources be combined, without limit, into one virtual table, and mapping operations (selection and projection) handle choosing and locating data within a table or virtual table, so that no matter how large or small the stored tables are, storing and searching data is not a problem. Because it has this theoretical support, the relational database came to dominate the field. There are, however, many other ways to store data. Key-value is one, OODB is another, and even storing JSON directly counts as one. These storage models lack the mathematical support that the relational model has, so they have been second-class or third-class citizens from the outset. But second-class citizens are still citizens, however lowly.
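
The relational operations mentioned above can be sketched in a few lines of Python. This is only an illustration: the two tables and the helper names `cartesian`, `select`, and `project` are hypothetical, and a real database would never materialize the full product.

```python
from itertools import product

# Two independently stored "tables" (hypothetical sample data).
orders = [
    {"order_id": 1, "nation_id": 10, "amount": 250},
    {"order_id": 2, "nation_id": 20, "amount": 90},
    {"order_id": 3, "nation_id": 10, "amount": 40},
]
nations = [
    {"nation_id": 10, "name": "China"},
    {"nation_id": 20, "name": "France"},
]


def cartesian(left, right):
    """Cartesian product: pair every left row with every right row."""
    for l_row, r_row in product(left, right):
        combined = {f"l.{k}": v for k, v in l_row.items()}
        combined.update({f"r.{k}": v for k, v in r_row.items()})
        yield combined


def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def project(rows, columns):
    """Projection: keep only the named columns of each row."""
    return [tuple(row[c] for c in columns) for row in rows]


virtual_table = cartesian(orders, nations)  # 3 x 2 = 6 combined rows
matching = select(virtual_table, lambda r: r["l.nation_id"] == r["r.nation_id"])
result = project(matching, ["l.order_id", "r.name", "l.amount"])
print(result)
```

Selection over the Cartesian product is what a join amounts to, which is why the later posts treat joins as the cost that denormalization avoids.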

Another is NewSQL. This kind of database keeps the relational model but relaxes normalization. In other words, data is still stored and queried through the two-dimensional relational model; Cartesian product and projection remain the foundation of the database, and SQL is still easy to implement and use on it. The only difference is that these databases do not recommend normal forms; instead they exploit the "good" side of "redundant" data to make the database more efficient. For example, a column database is a typical NewSQL database. Column databases hold that, for storing and reading data, values within a column are more closely related than values within a row: most of the time a user cares about one column, or a few columns, rather than every column of a given row. Looking at storage, they also found that the data within a column is very similar, so storing it together suits compression algorithms well (which is not to say that a column store simply means compression). Take a list of countries. A traditional RDBMS keeps a table of countries, assigns each one a nation_id, and uses that ID everywhere else instead of the country name. The NewSQL idea is to write the country name directly wherever it is used: there are only so many countries in the world, so even with 1 million records there are only a hundred-odd distinct, meaningful values, and compression handles that without any trouble. The advantage is that a query no longer needs a Cartesian product with (a join against) another table; it reads one place and the data is there. That means far fewer chances for the disk head to reposition.
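
A small Python sketch of why writing the country name into every record costs little once a column is stored contiguously and compressed. The data is synthetic and the country list is hypothetical; `zlib` from the standard library stands in for whatever codec a real column store would use.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the sketch is repeatable
countries = ["China", "France", "Brazil", "Kenya", "Canada"]  # hypothetical values

# Denormalized records: the country name is written out in every row.
rows = [(order_id, rng.choice(countries), rng.randrange(1000))
        for order_id in range(100_000)]

# Column layout: all values of one column are stored next to each other.
country_column = "\n".join(country for _, country, _ in rows).encode("utf-8")
packed = zlib.compress(country_column)

print("distinct values :", len({country for _, country, _ in rows}))
print("raw column bytes:", len(country_column))
print("compressed bytes:", len(packed))

# A query that only needs this column reads only these bytes, with no join.
france_orders = zlib.decompress(packed).decode("utf-8").split("\n").count("France")
print("rows for France :", france_orders)
```

The repeated names shrink to a fraction of their raw size, and the count is answered from one contiguous run of bytes rather than from a lookup against a separate country table.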

Then there is the index. In an RDBMS, an index is used to speed up queries. However, using indexes can slow reading and writing by two or three orders of magnitude. To get around this, some people simply keep extra copies of the data instead of using indexes. In other words, if there are three columns A, B, and C, and both A and B would be indexed, they save one table ordered as A, B, C and another ordered as B, A, C. When a query needs A as the key to fetch B and C, it uses the first table; when it needs B as the key, it uses the other. The method looks stupid, but it works. Of course, the data volume multiplies several times over, which is a problem that has to be considered.
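
A minimal sketch of the "copy instead of index" idea, assuming each copy is kept sorted on its leading column so that a binary search can replace the index. The sample rows and the `lookup` helper are hypothetical.

```python
from bisect import bisect_left

# Hypothetical rows with three columns (a, b, c).
rows = [(3, "x", 30.0), (1, "z", 10.0), (2, "y", 20.0), (1, "w", 15.0)]

# Copy 1 is stored as (a, b, c) and sorted by a;
# copy 2 is stored as (b, a, c) and sorted by b.
by_a = sorted(rows)
by_b = sorted((b, a, c) for a, b, c in rows)


def lookup(sorted_copy, key):
    """Return every row whose leading column equals key, using binary search."""
    i = bisect_left(sorted_copy, (key,))
    matches = []
    while i < len(sorted_copy) and sorted_copy[i][0] == key:
        matches.append(sorted_copy[i])
        i += 1
    return matches


print(lookup(by_a, 1))    # rows found through column a
print(lookup(by_b, "y"))  # rows found through column b
```

Each lookup reads one contiguous run of a single copy. The price is the one named in the text: every extra access path stores the whole data set again.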

The third point is updating data. People used to think that updating a database meant finding the original record and changing the data in place. But it turns out that changing one record costs about the same as writing a large batch of records, and when there are many changes, batch writing has the bigger advantage. So many database systems today are essentially "read-only" (append-only) databases: records can be added but not changed. A change is recorded by adding a new record along with its add time, and it is merged with the original record when the data is read.
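
The append-then-merge scheme can be sketched as follows. This is a toy in-memory version: the `put` and `get` names are hypothetical, and a simple counter stands in for the add time.

```python
from itertools import count

_clock = count(1)  # stand-in for the add time; a real system would use a timestamp
log = []           # append-only: records are added, never modified in place


def put(key, fields):
    """Record a change by appending a new version instead of editing the old one."""
    log.append({"key": key, "added_at": next(_clock), "fields": dict(fields)})


def get(key):
    """Merge every version of a key in add-time order; later values win."""
    merged = {}
    for record in sorted(log, key=lambda r: r["added_at"]):
        if record["key"] == key:
            merged.update(record["fields"])
    return merged


put("user:1", {"name": "Li", "city": "Beijing"})
put("user:2", {"name": "Wang", "city": "Shanghai"})
put("user:1", {"city": "Shenzhen"})  # an "update" is just another append

print(get("user:1"))
print(len(log))
```

Writes stay sequential, and the cost of reconciling versions moves to read time.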

With read-only data, storage can be optimized a great deal. A block of records, for example, can be compressed directly with the LZO algorithm and placed on HDFS. Changing a record inside a compressed block, by contrast, is a bit of a headache.
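
To show why a compressed block is awkward to change, here is a small sketch. Python's standard library has no LZO binding, so `zlib` is swapped in as the codec, and the block is just a bytes object; writing it to HDFS is not modeled. The helper names are hypothetical.

```python
import zlib


def write_block(records):
    """Pack a batch of records into one compressed block."""
    return zlib.compress("\n".join(records).encode("utf-8"))


def read_block(block):
    """Decompress a block back into its list of records."""
    return zlib.decompress(block).decode("utf-8").split("\n")


def rewrite_block(block, index, new_record):
    """Changing one record means decompressing and recompressing the whole block."""
    records = read_block(block)
    records[index] = new_record
    return write_block(records)


block = write_block([f"event-{i},ok" for i in range(1000)])
changed = rewrite_block(block, 500, "event-500,failed")

print(read_block(block)[500])
print(read_block(changed)[500])
```

A single changed record forces a rewrite of the entire block, which is the headache that appending a new version avoids.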

Elsewhere, the post immediately struck a chord. Here is another person's analysis.
1. I very much agree that big data is a platform-level view: it concerns not only data storage and access, but also data import, export, analysis, applications, and so on.
2. The core problem of big data is distribution, including distributed data storage and distributed computing (distributed SQL engines, distributed data mining algorithms, ...). Many MPP databases can be counted as big data as well, such as Teradata, Oracle Exadata, Greenplum, IBM DPF ...
3. More than data redundancy, I think normal forms are mainly about data consistency, which is critical in transactional applications.
4. Many features of relational databases are good, such as normal forms, consistency constraints, indexes, and statistics-based SQL optimizers. It is not that big data platforms do not want them; the constraints of the CAP theorem make them hard to implement on a distributed system, so it is necessary to make trade-offs or to develop different versions for different applications.
Many SQL-on-Hadoop and SQL-on-HDFS projects are developing statistics-based SQL optimizers and adding some relatively simple indexes.

Somehow the discussion then got dragged into RAC and Exadata. See what they say.
Oracle RAC and Oracle Exadata are not the same thing. Exadata storage can be used under a regular Oracle server or under Oracle RAC.
A shared-nothing architecture is not necessarily cheap; Teradata is sold at a very high price. Hadoop-based solutions are cost-effective because:
1. The data redundancy provided by HDFS ensures that one node going down causes neither data loss nor a complete system outage, so these solutions can be deployed on inexpensive hardware.
2. Hadoop/YARN/Tez/Spark provide free, open source distributed computing frameworks, which lowers the cost of developing big data solutions.
Big data is essentially distributed computing, and shared-nothing is the inevitable choice for scalability in distributed computing, because the more that is shared, the weaker the scalability.
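
The point about HDFS redundancy can be illustrated with a toy placement model. This is only a sketch: the five-node cluster and the round-robin placement are assumptions made for illustration, whereas real HDFS placement is rack-aware. The replication factor of 3 is the HDFS default.

```python
REPLICATION = 3  # default HDFS replication factor
nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]  # hypothetical cluster


def place_blocks(n_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    return {
        block: {nodes[(block + r) % len(nodes)] for r in range(replication)}
        for block in range(n_blocks)
    }


def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a live node."""
    failed = set(failed)
    return {block for block, holders in placement.items() if holders - failed}


placement = place_blocks(20, nodes)
for node in nodes:
    print(node, "down ->", len(readable_blocks(placement, [node])), "of 20 blocks readable")
```

In this model, losing any single node leaves every block readable from another replica, which is what allows such clusters to run on inexpensive hardware.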

Finally, nobody forgot to compare PG and MongoDB; this has the makings of a flame war.
MongoDB's memory management is too primitive: if the active data is larger than memory, performance drops sharply. Being schemaless also makes the data larger and takes up more memory.
Both PostgreSQL and MongoDB support spatial indexes, and their capabilities should be similar. PostgreSQL's transaction support beats MongoDB hands down. PostgreSQL's performance has also improved since 9.4.
So in these respects MongoDB is inferior to PG.
