My knowledge and understanding of big data-related technologies

Tags: hadoop, mapreduce


In this post I set down my experience and understanding of big data-related technologies, focusing on the following areas: NoSQL, clustering, data mining, machine learning, cloud computing, big data, Hadoop, and Spark.
It is mainly a clarification of basic concepts, and somewhat scattered, so read with caution.
* 1. NoSQL
My understanding is that NoSQL is primarily used to store unstructured data that fits poorly into a relational database; some data is better kept as plain files instead (video files, for example, are usually stored directly as files).

* * 1.1 NoSQL Categories:
Column store: HBase (an open-source implementation of Google BigTable, stores structured data), Cassandra
Document store: MongoDB, CouchDB, Domino
Key-value store: MemcacheDB, Redis, Berkeley DB (BDB, Oracle)
Graph store: Neo4j

Object store:
XML database:

Of these categories, the first four are the ones most often mentioned.

* * 1.2 Common NoSQL databases
Column-store distributed database: HBase
HBase is the database of the Hadoop ecosystem and an open-source implementation of Google BigTable.
Key-value store: MemcacheDB

The introduction on the wiki:
MemcacheDB is a distributed key-value storage system designed for persistence. It is not a cache solution, but a persistent storage engine for fast and reliable key-value based object storage and retrieval. It conforms to the memcache protocol (not completely), so any memcached client can connect to it. MemcacheDB uses Berkeley DB as its storage backend, so many features, including transactions and replication, are supported.

The difference from memcache/memcached
memcached is mainly used as the in-memory cache in the MySQL + memcached architecture ("memcache" refers to essentially the same thing as memcached).

MemcacheDB uses memcached as its front-end cache and persists what memcached has cached into Berkeley DB. MemcacheDB is a product of Sina.


Distribution models of NoSQL databases:

Sharding model: the data is partitioned across nodes, for example alphabetically by key.

Replication: to prevent data loss and to keep the system available when some nodes fail, the data is copied to different nodes. These nodes may stand in a master-slave or a peer-to-peer relationship.
In the master-slave model, the master node is primarily responsible for updating the data and copying the updates to the slave nodes; reads are served by the corresponding slave nodes.
In the peer-to-peer model, all nodes have the same status and can respond to both updates and reads.

In addition, the sharding model can be combined with the master-slave or the peer-to-peer model. In a scheme that combines sharding with master-slave replication, sharding takes care of the distributed storage of the data, while master-slave replication provides each shard node with a backup, increasing data safety.
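To make this combination concrete, here is a minimal sketch in Python; the hash-based routing rule and the node names are my own assumptions, not taken from any particular NoSQL product. Writes are routed to a shard's master, reads to one of its slaves:

```python
import hashlib

# Hypothetical shard layout: each shard has one master plus replica slaves.
SHARDS = [
    {"master": "node-a1", "slaves": ["node-a2", "node-a3"]},
    {"master": "node-b1", "slaves": ["node-b2", "node-b3"]},
]

def shard_for(key: str) -> dict:
    """Partition the key space by hashing (one possible sharding rule)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

def route_write(key: str) -> str:
    # Updates go to the shard's master, which replicates them to its slaves.
    return shard_for(key)["master"]

def route_read(key: str) -> str:
    # Reads are served from a slave, which holds a backup copy of the shard.
    return shard_for(key)["slaves"][0]

print(route_write("user:42"), route_read("user:42"))
```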


* 2. Distributed Systems & Clusters
* * The difference between distributed systems and clusters
The nodes of a cluster are usually physically close together, while the nodes of a distributed system can be spread across the Internet. In frameworks like Hadoop, however, the distinction between the two concepts is becoming blurred.

* * Cluster
What is the biggest bottleneck of a cluster, and how can it be addressed?
Disk I/O.

* * Distributed System
Distributed operating system: apparently not well known. Is it a different concept from a distributed system? My understanding is that the distributed operating system is a sub-concept of the distributed system.

Distributed file systems
There are also distributed databases, such as HBase.


* 3. Data mining

* * Concept

Data mining is a technique applied to data sets (including big data).

Data mining is one of the steps of Knowledge Discovery in Databases (KDD); it is a product of the rapid growth of huge amounts of potentially useful data. Data mining refers to the process of revealing hidden, previously unknown, and potentially valuable information from large amounts of data in a database.

Data mining is the analysis step of the KDD (Knowledge Discovery in Databases) process. The patterns found may include groups of data records (clusters), uncommon records (anomalies), dependencies between records (association rules), and so on.

The overall goal of data mining is to extract information from a data set and transform it into an understandable structure. The information must be previously unknown.

The frontier of data mining: Big Data mining based on Hadoop

Relationship to data analysis
Data analysis is the broader activity, and data mining is one large class of data analysis methods. Data analysis means finding the information you want in the data, analyzing the data's characteristics, doing some processing on it, and then looking at the problems it reveals.
Relationship to databases
data structures → databases → data warehouses (DW) → data mining → web data mining

* * The relationship between data mining and other concepts such as machine learning and statistics

Data mining and machine learning are very similar concepts.

Discovering patterns (and knowledge) in a big data set draws on overlapping parts of artificial intelligence, machine learning, and database systems.

The vast majority of data mining techniques come from the machine learning field, but machine learning research usually does not take massive amounts of data as its subject. Data mining therefore needs to adapt machine learning algorithms so that their runtime performance and space usage reach a practical level.

For data mining, databases provide the data management techniques, while machine learning and statistics provide the data analysis techniques. Statistics tends to be preoccupied with theoretical elegance and to neglect practical utility, so many techniques proposed by the statistics community are typically studied further in the machine learning world and turned into effective machine learning algorithms before they enter data mining. Statistics thus influences data mining mainly through machine learning, and machine learning and databases are the two major supporting technologies of data mining.

* * Preprocessing
Before a data mining algorithm can be used, a target data set must be assembled, frequently drawn from a data warehouse. The target set is then cleaned: data cleansing removes records that contain missing data and observations that contain noise.
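A small pandas sketch of this cleaning step (the column names, values, and plausibility threshold are invented for illustration):

```python
import pandas as pd

# Hypothetical target data set with missing values and noise.
df = pd.DataFrame({
    "age":    [23, 35, None, 29, 41, 300],   # 300 is an obvious noise value
    "income": [3200, 5400, 4100, None, 6100, 5000],
})

# Step 1: remove records that contain missing data.
df = df.dropna()

# Step 2: remove noisy observations (here: implausible ages).
df = df[df["age"].between(0, 120)]

print(df)
```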

* * Data mining commonly involves six classes of common tasks

Among them: clustering, and classification (for example, mail versus junk mail).

Clustering (cluster analysis) can be formalized as a multi-objective optimization problem.
Data mining and machine learning tend to use the same algorithms, but often with different goals.
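As a concrete instance of the clustering task, here is a minimal k-means sketch in plain Python (the toy 2-D points are invented; a real data mining job would run over far larger data sets):

```python
import random

def kmeans(points, k, iters=20):
    """Minimal k-means: alternately assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                          + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        for i, c in enumerate(clusters):
            if c:  # keep the old centroid if its cluster went empty
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids

points = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
print(kmeans(points, k=2))
```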


* * Association recommendations and related-content recommendations

Association recommendation is what we usually call shopping basket analysis: using rules of the form "users who bought product A also bought product B" to find potential links between goods. Association recommendation is based on analyzing user behavior, while related-content recommendation is based on the intrinsic characteristics of the content itself and has nothing to do with user behavior. The related-content recommendation model is therefore a "cold start" algorithm that needs no historical browsing or access data.


Association recommendations can be implemented in two ways:
association recommendation based on product analysis, and association recommendation based on user analysis.
Association recommendation based on product analysis means finding common ground between products by analyzing the products' characteristics.

Association recommendation based on user analysis works by analyzing users' historical behavior data: one may discover that many users who buy product A also buy product B, and recommendations can then be based on this discovery. This method is association rule mining, one of the classics of data mining; the best-known case is the story of Walmart's beer and diapers.
(Bundled sales.)

Association recommendations based on user behavior analysis:

The implementation principle of association rules is as follows: from all users' shopping data (if the amount of data is too large, pick a time interval, such as a year or a quarter), find the proportion of users who, having bought product A, went on to buy product B.
When this proportion reaches a predetermined level, we consider the two products correlated, so when a user has bought product A but not yet product B, we can recommend product B to that user.

Association rule mining generally adopts the Apriori algorithm, which is based on frequent item sets.
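Below is a minimal sketch of the frequent-item-set idea behind Apriori (the baskets and the support threshold are invented for illustration): item sets that fall below the minimum support are pruned, and larger candidates are built only from the survivors.

```python
from itertools import combinations

transactions = [  # hypothetical shopping baskets
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "bread"},
]
min_support = 0.5  # an item set must appear in at least half the baskets

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support({i}) >= min_support]

# Level k: candidates are unions of frequent smaller sets, pruned by support.
k = 2
while frequent:
    print([set(f) for f in frequent])
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_support]
    k += 1
```

With these toy baskets, {beer, diapers} survives with support 0.75, echoing the beer-and-diapers story above.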


* * Common algorithms
KNN and its applications
KNN (the k-nearest neighbor algorithm) finds, for each sample, the k most similar individuals by computing the distance or similarity between individual samples. KNN is generally used as a classification algorithm: it classifies the overall sample based on a training set, without pre-defined classification rules.
Applied to related-content recommendation, the principle of the algorithm is as follows:
based on the pairwise similarities between items of content, sort all other content by its similarity to a given item in descending order, and take the first k items as the most relevant content to recommend to the user.
Measuring distance and similarity is the basis of KNN, because the similarity or proximity between individuals in KNN is computed by choosing some method from among the distance and similarity measures.
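A minimal KNN sketch (the labeled training points are invented): a sample is classified by majority vote among its k nearest neighbors under Euclidean distance.

```python
from collections import Counter
import math

# Hypothetical labeled training samples: (feature vector, class label).
training = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
            ((8.0, 8.0), "B"), ((7.9, 8.2), "B"), ((8.1, 7.8), "B")]

def knn_classify(x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    neighbors = sorted(training, key=lambda s: math.dist(x, s[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((7.5, 7.5)))  # -> "B"
```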

* * Distance and similarity measures:

In data analysis and data mining, we often need to know the size of the difference between two individuals in order to evaluate their similarity or assign them to categories. The most common uses are in the classification and clustering algorithms of data mining, such as KNN and k-means.

The methods for measuring the difference between two individuals fall mainly into distance measures and similarity measures.
Distance measures: used to measure how far apart individuals are in space; the farther apart, the greater the difference between them. Distance measures include the Euclidean distance (the absolute distance between points in a multidimensional space), the Mahalanobis distance, and so on.

Similarity measures:
these compute the similarity between individuals. In contrast to a distance measure, the smaller the value of a similarity measure, the greater the difference between the individuals.

An example is the cosine similarity.
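For example, a minimal cosine similarity computation (the vectors are chosen arbitrarily); since it measures the angle between two vectors, parallel vectors score 1 and orthogonal vectors score 0:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0: same direction
print(cosine_similarity([1, 0], [0, 1]))        # 0.0: orthogonal
```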

* * The S curve
The S-shaped curve is the most typical type of growth curve; your site's user visits or sales may well be growing in a similar pattern. After discovering this regularity, we can analyze it with statistical methods. For this kind of regular curve, the most common method is regression analysis(?):
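One way to make this concrete (a sketch under my own assumptions: a logistic function stands in for the S curve, and the weekly visit counts are synthetic) is to fit the curve's parameters by least-squares regression with SciPy:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    """A classic S curve: L is the ceiling, k the growth rate, t0 the midpoint."""
    return L / (1 + np.exp(-k * (t - t0)))

# Hypothetical weekly visit counts showing S-shaped growth.
t = np.arange(10)
visits = np.array([12, 25, 60, 130, 250, 390, 470, 495, 500, 502])

(L, k, t0), _ = curve_fit(logistic, t, visits, p0=[500, 1, 5])
print(f"ceiling={L:.0f}, rate={k:.2f}, midpoint={t0:.1f}")
```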


* 4. Machine learning

Machine learning is a relatively recent term; it denotes a process of modeling, optimization, and algorithm design aimed at solving prediction problems.

* * Learning about machine learning from the example of buying mangoes:

From 36Kr: from buying mangoes to machine learning, building rules for sweet and sour mangoes:
if (color is bright yellow and size is big and sold by favorite vendor): mango is sweet.
if (soft): mango is juicy.
etc.

These rules are written by people and then handed to a computer program to check automatically. But could the machine produce these rules itself?

Machine learning algorithms (classification, regression) are an improved version of ordinary algorithms: they make your program "smarter" by allowing it to learn automatically from the data you provide.

With a suitable model and the physical characteristics of a mango (color, size, shape, origin, and so on), you can predict which mangoes are sweet, ripe, or juicy.
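A toy sketch of the machine producing the rules itself (the mango examples are fabricated, and scikit-learn's DecisionTreeClassifier stands in for whatever learner one might choose): instead of hand-writing the `if` rules above, the tree induces them from labeled examples.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Fabricated mangoes: [color_is_bright_yellow, size_is_big, favorite_vendor]
X = [[1, 1, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0], [1, 0, 0], [0, 0, 1]]
y = ["sweet", "sweet", "sour", "sour", "sour", "sour"]

tree = DecisionTreeClassifier().fit(X, y)

# The learned rules replace the hand-written ones.
print(export_text(tree, feature_names=["bright_yellow", "big", "fav_vendor"]))
print(tree.predict([[1, 1, 1]]))  # -> ['sweet']
```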


* * The relationship between machine learning and other concepts
What is the relationship between machine learning and AI?

AI is the broadest field, and machine learning is a recently popular branch of AI.

Machine learning usually analyzes data by statistical means, and statistics plays a very important role in machine learning.
Neural networks are a very widely used method, or modeling approach, within machine learning.

Pattern recognition and data mining are tackled with machine learning models, and of course also with statistical analysis models.

The relationship between machine learning and statistics
Statistical methods can be used for machine learning, for example clustering and Bayesian methods. Of course, machine learning has many other methods as well.

Machine learning is the study of algorithms; it is devoted to algorithms that improve automatically with experience. A neural network is a model, a mathematical model, which can be used for pattern recognition and for machine learning.

Seen from statistics and data science, machine learning is computer science doing statistics. An example of statistical analysis: linear regression.
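A minimal linear regression example (the points are synthetic; NumPy's least-squares polyfit serves as the fitting routine):

```python
import numpy as np

# Synthetic data scattered around the line y = 2x + 1.
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 11.0])

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares fit of a line
print(f"y ~= {slope:.2f} * x + {intercept:.2f}")
```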


Machine learning and pattern recognition

Pattern recognition leans more toward applications, such as targeted research on faces, text, or speech, while machine learning emphasizes methods: rather than targeting one specific problem, it does general algorithmic research on the whole process or some part of it.

Machine learning emphasizes the ability to learn under the guidance of algorithms, for example neural networks (nonlinear regression); the neural network is a very popular model among learning algorithms.


The distinction between data mining and machine learning:
data mining problems generally involve huge data sets, especially problems where computational efficiency matters more than statistical precision, and they are usually approached from a commercial angle; machine learning leans more toward the artificial intelligence side.
Data mining: putting the beer and the diapers on the same shelf.
Li Ka-shing has more than $10 billion while the national average disposable income is 13,279 yuan; from this one can "mine" the conclusion that Li Ka-shing is rich.


Machine learning is a set of methods commonly used for pattern recognition.

* 5. Cloud computing

Cloud computing can be divided into three tiers: IaaS, PaaS, and SaaS. The following uses IBM's SmartCloud as an example.

* * -------------------------- IBM SmartCloud ---------------

Emerging businesses are represented by cloud computing, big data analytics, mobile, social, and security (CAMSS).
SmartCloud is composed of three parts:

IaaS

OpenStack, the open cloud OS (a distributed system?). Can a private cloud also be provided by a cloud vendor?


PaaS

Bluemix makes it quick and easy for organizations and developers to create, deploy, and manage applications (web, mobile, big data, new smart devices, etc.) in the cloud.
Bluemix takes developers and entrepreneurs as a key target, which is rare in IBM's history; cloud computing is a major shift in IBM's transformation.

"Bluemix runs on SoftLayer abroad and will run on OpenStack in China," Hu Shizhong explains, meaning that Bluemix can land on any domestic platform that supports OpenStack.

SaaS

Compose, a company that offers MongoDB, Redis, and other databases as a service (DBaaS). (The copyright issue??)

High availability (HA), failover,
capability vs. availability


* 6. Big Data:

Big data refers to techniques for discovering the huge potential value hidden in very large data sets.
"Big" here means truly massive, not just ordinarily large. The data sets are too large or too complex to be handled by traditional data processing; the term is also used for the predictive analytics that explore the potential value of the data.
Relational database management systems, desktop statistics tools, and visualization packages often have trouble dealing with big data, which may instead require massively parallel software running on thousands of servers and new kinds of processing to support value discovery, decision making, and process optimization.

The big data concept was first presented in 2001, when Gartner defined big data as high volume (amount of data), high velocity (speed of data in and out), and/or high variety (range of data types and sources).

Big data represents collections of information with these 3V characteristics, which require specific techniques and analytical methods to be converted into value.

* * History
Originally, HPCC and the Quantcast File System were the only publicly available platforms that could handle exabytes of data (distributed file-sharing frameworks for data storage and querying). Google later published the MapReduce framework, which uses a similar architecture:
the MapReduce framework provides a parallel processing model, and an associated implementation, for processing massive amounts of data. In MapReduce, a query is split across parallel nodes and executed in parallel (the Map step);
the partial results are then collected and combined (the Reduce step). Hadoop is an open-source implementation of the MapReduce framework (Hadoop MapReduce; Google's MapReduce works over BigTable, while Hadoop MapReduce works over HBase).
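The Map/shuffle/Reduce flow can be simulated in a few lines of plain Python (a word-count toy of my own; a real cluster distributes these steps across nodes):

```python
from collections import defaultdict

documents = ["big data on hadoop", "hadoop mapreduce on hadoop"]

# Map: every input record is turned into (key, value) pairs, in parallel.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: pairs are grouped by key and routed to the reduce nodes.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: each group is combined into a final result.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'big': 1, 'data': 1, 'on': 2, 'hadoop': 3, 'mapreduce': 1}
```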

* * Relationship to HPC and cloud computing
"Big data" essentially means data mining and data analysis over massive data. Academia previously approached such research with HPC technology, while industry now tends to use "cloud computing".


* * A sample question on a specific problem
Describe the main process of big data mining using the big data mining tool you are most familiar with.

* 7. Hadoop

When you install Hadoop, you need to repeat the deployment steps on each of the nodes.


* * Query and analysis tools on top of Hadoop
Hive
Hive is a Hadoop-based data warehousing tool that provides a way to query data stored in Hadoop. Hive's query language, HQL, is very similar to SQL.

A Hive problem
a.txt and b.txt each contain 1 million rows of (ip, username) pairs, such as "127.0.0.1 zhangsan".
Problem: count the distinct IP addresses in the two files (starting with a.txt), and the number of IPs per user.
Describe the key process and write the critical code.

CREATE TABLE a (ip STRING, username STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
LOAD DATA LOCAL INPATH 'a.txt' INTO TABLE a;   -- likewise for table b / b.txt

SELECT COUNT(DISTINCT ip) FROM a;


Pig
Pig is also a query tool similar to Hive, but its learning curve is steeper because it is unlike traditional SQL.

Impala
From Cloudera; similar to Hive.


Mahout (a library for machine learning and data mining)

* * HDFS
The HDFS architecture, and the HDFS read and write process:

One NameNode and multiple DataNodes. A client first asks the NameNode for block locations and then reads from, or writes to, the DataNodes directly; on a write, each block is pipelined to several DataNodes for replication.

* * YARN
YARN replaces the MapReduce resource management found in older versions of Hadoop.
Architecture:

RM (ResourceManager), AM (ApplicationMaster), NM (NodeManager)

Typically there is only one RM node, and there are multiple NM nodes.


* 8. Spark
* * The difference between Spark and MapReduce
Spark sits at a different level: like Hadoop, it is a big data platform or framework.
MapReduce is part of the Hadoop framework; it is a computing framework.

MapReduce is Hadoop's computational model. Spark's computational model is a DAG (directed acyclic graph).

Hadoop is better suited to batch processing, while Spark is better suited to machine learning workloads that require repeated iteration.

Spark can run either in memory or on disk.

Spark is much faster than Hadoop: the official claim is up to 100 times faster (about 10 times faster if everything runs on disk).

Spark provides: Spark SQL, MLlib
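A minimal PySpark sketch (word count again, for comparison with the MapReduce simulation above; the local master setting is only for trying it out on a single machine):

```python
from pyspark.sql import SparkSession

# Run locally; on a cluster the master is set by the deployment instead.
spark = SparkSession.builder.master("local[*]").appName("wc").getOrCreate()

lines = spark.sparkContext.parallelize(["big data on hadoop",
                                        "hadoop mapreduce on hadoop"])

counts = (lines.flatMap(lambda line: line.split())  # map side
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))    # reduce side
counts.cache()  # Spark can keep the RDD in memory for iterative reuse
print(counts.collect())
spark.stop()
```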


Hadoop and Spark are both big data frameworks:
they provide some of the most popular tools used to carry out common big data-related tasks.

They are not mutually exclusive, and they are able to work together.

Spark does not provide its own distributed storage system. For this reason, many big data projects involve installing Spark on top of Hadoop, where Spark's advanced analytics applications can make use of data stored in the Hadoop Distributed File System (HDFS).
