Old Money on Big Data (1): Big Data OLAP and OLTP Analysis

1. First of all, before talking about big data itself, let's analyze OLAP and OLTP.

OLAP: Online Analytical Processing (OLAP) systems are the most important application of data warehouse systems. They are designed to support complex analytical operations, with a focus on decision support for decision makers and senior management.

OLTP: Online Transaction Processing (OLTP) applications store what is called operational data or business data.

So, in terms of positioning, OLAP is used for data analysis (BI-style workloads), while OLTP is suited to transactional data management, such as managing order data in production.

As a simple example, consider a small e-commerce site. It has an order-placement flow, and the orders generated by that flow land in an OLTP database. If the CEO wants to see how the site performed this month, say by aggregating the orders, that query should in principle run against an OLAP database (or data warehouse).

So essentially, OLAP is read-oriented, while OLTP is primarily write-oriented.
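
To make the two workloads concrete, here is a minimal sketch using SQLite, with an invented orders table; it illustrates the query shapes, not any particular production setup:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, amount REAL, created TEXT)")

# OLTP: short, write-heavy transactions -- one order at a time.
conn.execute("INSERT INTO orders (user_id, amount, created) VALUES (?, ?, ?)",
             (42, 19.99, "2015-07-03"))
conn.commit()

# OLAP: read-heavy aggregation over a large slice of history --
# the CEO's "how did this month go?" question.
for row in conn.execute(
        "SELECT substr(created, 1, 7) AS month, COUNT(*), SUM(amount) "
        "FROM orders GROUP BY month"):
    print(row)
```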

Next, let's do a basic breakdown of the common types of analysis:

    1. Ad-hoc query: the user flexibly chooses query conditions according to their own needs, and the system generates the corresponding statistical report based on that choice.
    2. Fixed-field analysis: the query conditions are fixed, so we can provide reports over predefined fields, such as weekly or monthly reports.
    3. Keyword query: for example, if a user's address is "Beijing, Chaoyang District, xxxxx", provide keyword search such as querying by "Beijing".
    4. Statistical query: for example, generating box plots, heat maps, and so on.
The simple takeaway is that, with a reasonable design, an OLTP system will see queries of types 1 and 3, while an OLAP system will see types 1, 2, 3, and 4.
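
As a sketch, the four types might look like this in SQL (the orders and users tables and their columns are invented for illustration):

```python
# 1. Ad-hoc query: conditions chosen freely by the user at query time.
adhoc = ("SELECT region, SUM(amount) FROM orders "
         "WHERE amount > 100 AND region = 'north' GROUP BY region")

# 2. Fixed-field analysis: predefined dimensions, e.g. a monthly report.
fixed = ("SELECT substr(created, 1, 7) AS month, SUM(amount) "
         "FROM orders GROUP BY substr(created, 1, 7)")

# 3. Keyword query: substring or full-text matching, e.g. addresses containing 'Beijing'.
keyword = "SELECT * FROM users WHERE address LIKE '%Beijing%'"

# 4. Statistical query: raw aggregates that feed box plots, heat maps, etc.
stats = ("SELECT region, MIN(amount), MAX(amount), AVG(amount) "
         "FROM orders GROUP BY region")
```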

2. Next, let's analyze the problems with traditional technologies:

As you know, every system is subject to hardware constraints: disk IO, memory, CPU (often overlooked), and network IO. An average SATA hard disk reads and writes at roughly 50-75 MB/s, and an ordinary network runs on gigabit switches, i.e. about 100 MB/s of transfer speed.
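
A quick back-of-envelope calculation shows why these numbers matter; the 500 GB table size below is an arbitrary assumption:

```python
TABLE_GB = 500     # hypothetical table size
DISK_MBPS = 60     # mid-range of the 50-75 MB/s SATA figure above
NET_MBPS = 100     # practical gigabit Ethernet throughput

scan_h = TABLE_GB * 1024 / DISK_MBPS / 3600
ship_h = TABLE_GB * 1024 / NET_MBPS / 3600
print(f"full scan from one SATA disk: ~{scan_h:.1f} hours")    # ~2.4 hours
print(f"shipping it across the network: ~{ship_h:.1f} hours")  # ~1.4 hours
```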

With that, let's analyze the characteristics of databases (this article does not discuss specific database implementations):

    1. The reason a database can answer queries quickly is the presence of indexes (and caches). The index structures of different databases differ slightly, and indexes also need to be maintained.

Combined with the analysis above, you can see that plain query performance is still relatively easy to optimize (together with the database cache). But note that when a query also involves aggregation (GROUP BY, SUM, COUNT), the pressure falls on IO, and the same goes for sorting (machine memory is limited, so large sorts must spill to disk; recall the performance numbers above). So when we need to return a large number of rows (especially with pagination), the database becomes very, very slow; see the sketch after the list below.

    • Many people use the database for data cleansing, but the same IO problem slows that down as well.
    • We must not ignore the CPU either: even when the data is not large, analysis can be slow because CPU computing capacity is limited.
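
Here is a minimal SQLite demonstration of the pagination problem mentioned above (the table and row count are invented, and a server-class database will do better, but the shape of the cost is the same):

```python
import sqlite3, time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
rows = (((i * 37) % 1000 / 10.0,) for i in range(1_000_000))
conn.executemany("INSERT INTO orders (amount) VALUES (?)", rows)

# First page: only 20 rows need to be produced after ordering.
t = time.time()
conn.execute("SELECT * FROM orders ORDER BY amount LIMIT 20").fetchall()
print(f"first page:       {time.time() - t:.2f}s")

# Deep page: the engine still has to order and skip 900,000 rows
# before it can return the 20 requested ones.
t = time.time()
conn.execute(
    "SELECT * FROM orders ORDER BY amount LIMIT 20 OFFSET 900000").fetchall()
print(f"deep OFFSET page: {time.time() - t:.2f}s")
```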

So, putting my analysis together, we can draw a few conclusions:

    • The problems databases run into come down to limited computing resources.
    • By themselves, databases have no way to support keyword search (that is the job of search engines).
    • It is mainly in query + statistics scenarios that databases have problems; in fact, for ad-hoc queries without statistics, hash-based sharding across databases can be very fast (see the sketch after this list).
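
A sketch of what hash-based sharding means here; the shard count and routing function are illustrative assumptions:

```python
import hashlib

N_SHARDS = 4  # hypothetical number of physical databases

def shard_for(user_id: int) -> int:
    """Route a key to one shard; each shard holds ~1/N of the data,
    so a point query without statistics touches only one small database."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

# An ad-hoc lookup on user_id goes straight to one shard and stays fast;
# a GROUP BY across all users would have to fan out to every shard.
print(f"user 42 lives on shard {shard_for(42)}")
```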

3. Are the open-source big data solutions ready?

Next, I will use some of the technologies I use at work to do some analysis, and I hope this gives you some understanding of these solutions.

We will compare them along several dimensions: architecture, efficiency, maturity, learning difficulty, and so on.

Hadoop+Hive+Tez:

    • Architecture: Hive is currently the data warehouse on Hadoop. Its underlying engine is Tez (DAG-style MapReduce), it uses YARN as the resource management platform, it provides a SQL-like interface (HQL), and it uses a relational database as its metadata store.
    • Maturity: Hive has a very wide user base, so overall it is quite mature.
    • Efficiency: Hive combined with ORCFile and compression is reasonably fast overall, but it still cannot meet ad-hoc query requirements that demand results within 3 seconds.
    • Learning difficulty: HQL and Hadoop are not hard to get started with, so the learning curve is relatively gentle.

To summarize:

    1. At present, Hive is suitable for OLAP data-warehouse-style analysis, data cleansing, and other scenarios without strict real-time requirements.
    2. Hive does not support search by keyword, so it cannot serve search scenarios.
    3. Hive's indexes are weak and do not reach database-level performance.
    4. Hive cannot deliver ad-hoc queries that return within 3-5 seconds (even with the HDFS data cached in memory).
    5. It does have INSERT, UPDATE, and other basic transactional operations, so it may become possible to do OLTP on it in the future.
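
For reference, here is roughly what the ORCFile + compression setup mentioned above looks like, sketched with the third-party PyHive client; the host, table, and columns are assumptions, and a reachable HiveServer2 is required:

```python
from pyhive import hive  # pip install pyhive

conn = hive.connect(host="hive-server.example.com", port=10000)
cur = conn.cursor()

# Store the fact table as ORC with compression -- the combination
# credited above for Hive's reasonable scan performance.
cur.execute("""
    CREATE TABLE IF NOT EXISTS orders_orc (
        user_id BIGINT,
        amount  DOUBLE,
        created STRING
    )
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY')
""")

# A warehouse-style aggregation: expect seconds to minutes, not sub-second.
cur.execute("SELECT substr(created, 1, 7), sum(amount) "
            "FROM orders_orc GROUP BY substr(created, 1, 7)")
print(cur.fetchall())
```
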
Spark+Hadoop:
    • Architecture: one of the better pieces of Spark technology is Spark SQL, which makes it possible to manipulate Spark's RDDs using SQL; of course, Spark SQL is ultimately converted by Spark's engine into Spark's own map/reduce-style jobs.
    • Maturity: still under rapid development.
    • Efficiency: overall faster than Hive, but with very large data volumes there is no particularly big advantage. With the data loaded into memory, queries (without statistics) are very fast.
    • Learning curve: you need to learn Hadoop, Spark, and so on, so it is relatively steep.

To summarize:

    1. If the data volume is not especially large and can be fully loaded into memory, it can answer non-statistical queries in seconds.
    2. For statistical analysis on data that cannot be fully loaded into memory, the results will not differ much from the Hive+Tez combination; it will not lead by a particularly large margin.
    3. Suitable for certain ad-hoc query scenarios and OLAP scenarios; it cannot do OLTP.
    4. No indexes and no exact-match search; everything is a full table scan.
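
A minimal Spark SQL sketch of the "load into memory, then query" pattern described above, using PySpark; the path and column names are invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc-demo").getOrCreate()

# Load and pin the dataset in memory; subsequent queries avoid disk IO.
orders = spark.read.parquet("hdfs:///data/orders")  # hypothetical path
orders.cache()
orders.createOrReplaceTempView("orders")

# Non-statistical ad-hoc filter: fast once the data is cached ...
spark.sql("SELECT * FROM orders WHERE user_id = 42").show()

# ... while an aggregation still has to touch every cached row.
spark.sql("SELECT count(*), sum(amount) FROM orders").show()
```
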
Impala+Hadoop:
    • Architecture: Impala currently performs well. It discards the MapReduce design, and combined with the HDFS cache it can achieve quite good performance.
    • Maturity: fairly mature.
    • Efficiency: with Parquet, performance is close to Hive+Tez; because it has no job-startup overhead, it is faster than Hive for certain kinds of analysis.
    • Learning curve: you learn SQL and Impala itself, so the difficulty is moderate.

To summarize:

    • Impala performs well, but for sorting big data you need to limit the number of rows returned, and joins between large tables are also a problem.
    • Suitable for certain ad-hoc query scenarios and OLAP scenarios; it cannot do OLTP.
    • No indexes and no exact-match search; everything is a full table scan.
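
And a corresponding Impala sketch, using the third-party impyla client, showing the "limit the rows when sorting" caveat; the host and table are assumptions:

```python
from impala.dbapi import connect  # pip install impyla

conn = connect(host="impalad.example.com", port=21050)
cur = conn.cursor()

# Sorting big data: cap the rows returned, per the caveat above.
cur.execute("SELECT user_id, amount FROM orders "
            "ORDER BY amount DESC LIMIT 100")
print(cur.fetchall())

# Without indexes, even a point lookup is a full scan under the hood.
cur.execute("SELECT * FROM orders WHERE user_id = 42")
print(cur.fetchall())
```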

So, looking at the whole picture, none of the current open-source big data SQL solutions is perfect; each has flaws of one kind or another. We still need search engines, NoSQL, Redis, and other solutions to cover many scenarios.

We also need a conclusion on performance: for requirements of results within 5 seconds, these big data SQL solutions are basically not the right tool. You need a more expensive database, or you wait for the open-source technology to mature.
