A First Look at Hadoop
Preface
I had always wanted to learn big data technology in school, including Hadoop and machine learning, but I was too lazy to stick with it for long, and since I was preparing for job offers my focus was on C++ (although I didn't learn much C++ either). I planned to pick it up slowly in my junior year when I had spare time. Now that I am interning, I need this knowledge; for me, apart from having only a small bearing on the C++ positions I applied for during campus recruiting, it undoubtedly brings many advantages.
So, for a long time to come, I will have to spend much of my time outside of C++ on big data.
First, then, let's get to know Hadoop, the tool for handling big data.
Introduction to Big Data
Big data refers to a collection of data that cannot be captured, managed, and processed by conventional software tools within a tolerable time frame.
The 4V characteristics: Volume, Velocity, Variety, and Value.
The value of big data is reflected in the following aspects: 1) companies that provide products or services to a large number of consumers can use big data for precision marketing; 2) small but beautiful long-tail enterprises can use big data to transform their services; 3) traditional businesses forced to transform under pressure from the internet need to capitalize on the value of big data and keep up with the times.
What is Hadoop?
Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the underlying details of the distribution, making full use of the power of the cluster for high-speed computation and storage.
The core design of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive amounts of data, and MapReduce provides processing and computation over that data.
The core architecture of Hadoop
Hadoop consists of many elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files on all the storage nodes in the cluster. The layer above HDFS is the MapReduce engine, which is made up of JobTrackers and TaskTrackers. Together with the data warehousing tool Hive and the distributed database HBase, the distributed file system HDFS and the MapReduce processing engine cover essentially all the technical cores of the Hadoop distributed computing platform.
HDFS
To external clients, HDFS looks like a traditional hierarchical file system: files can be created, deleted, moved, renamed, and so on. Its primary purpose is to support streaming access to very large files (petabyte scale). Files stored in HDFS are divided into blocks; the block size (typically 64 MB) and the number of block replicas are determined by the client when the file is created. The architecture of HDFS, however, is built on a specific set of nodes, which is determined by its own characteristics. These nodes consist of a NameNode (only one) and DataNodes.
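As a rough illustration of that last point, here is a minimal sketch of a client choosing the replication factor and block size at file-creation time through the HDFS Java API. The NameNode address and file path are assumptions made for the example, not values from a real cluster.

```java
// Minimal sketch: an HDFS client choosing block size and replication when
// creating a file. The URI and path below are illustrative assumptions.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        Path file = new Path("/demo/data.txt");
        int bufferSize = 4096;
        short replication = 3;              // number of copies of each block
        long blockSize = 64L * 1024 * 1024; // 64 MB, the classic default

        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeUTF("hello hdfs");
        }
    }
}
```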
The NameNode provides metadata services inside HDFS: it manages the file system namespace and controls access by external clients. The NameNode decides how files are mapped to the blocks replicated on DataNodes. It stores the file system's namespace information in a file called FsImage; this file, along with a record of all transactions (the EditLog), is kept on the NameNode's local file system. The FsImage and EditLog files are also replicated to protect against file corruption or loss of the NameNode itself.
DataNodes provide the storage blocks for HDFS. They are usually organized in racks, and a rack connects all of its systems through a switch. DataNodes respond to read and write requests from HDFS clients, and also to commands from the NameNode to create, delete, and replicate blocks. The NameNode relies on regular heartbeat messages from each DataNode. Each message contains a block report, against which the NameNode can validate its block mappings and other file system metadata. If a DataNode fails to send heartbeat messages, the NameNode takes corrective action and re-replicates the blocks that were lost on that node.
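To see the block-to-DataNode mapping that the NameNode maintains, a client can ask for a file's block locations. The sketch below reuses the same hypothetical URI and path as above and is only an illustration of the API, not an excerpt from any particular project.

```java
// Minimal sketch: asking the NameNode (via the FileSystem API) which
// DataNodes hold the blocks of a file. URI and path are assumptions.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocationsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        FileStatus status = fs.getFileStatus(new Path("/demo/data.txt"));
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block reports the DataNodes that hold one of its replicas.
            System.out.printf("offset=%d length=%d hosts=%s%n",
                block.getOffset(), block.getLength(),
                String.join(",", block.getHosts()));
        }
    }
}
```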
MapReduce
The simplest MapReduce application contains at least three parts: a Map function, a Reduce function, and a main function. The main function combines job control with file input/output. For this, Hadoop provides a number of interfaces and abstract classes, along with many tools that help Hadoop application developers with debugging and performance measurement.
MapReduce itself is a software framework for parallel processing of large data sets. Its roots are the map and reduce functions of functional programming. A MapReduce computation is composed of these two kinds of operations, possibly with many instances of each (many Maps and many Reduces). The Map function takes a set of data and converts it into a list of key/value pairs, one pair for each element of the input domain. The Reduce function accepts the list produced by the Map function and then shrinks that list of key/value pairs according to their keys: pairs with the same key are merged together to form a single entry in the result.
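A minimal sketch of these two halves, in the classic word-count shape: the Map function emits a (word, 1) pair for every word it sees, and the Reduce function merges all pairs sharing a key into a single count. Class and field names here are hypothetical.

```java
// Sketch of a Map function and a Reduce function using Hadoop's mapreduce API.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountFunctions {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE); // one key/value pair per word
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get(); // merge pairs with the same key
            context.write(key, new IntWritable(sum));
        }
    }
}
```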
A single primary system that represents the client of a MapReduce application is called the JobTracker. Like the NameNode, it is the only system in the Hadoop cluster that controls MapReduce applications. When an application is submitted, it provides the input and output directories located in HDFS. The JobTracker uses file-block information (physical quantity and location) to decide how to create the dependent TaskTracker tasks. The MapReduce application is copied to every node where an input file block is present, and a unique subordinate task is created for each file block on a particular node. Each TaskTracker reports its status and completion information back to the JobTracker.
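The main (driver) part described above might look like the following sketch: it configures the job, points it at input and output directories in HDFS, and submits it to the cluster. It assumes the hypothetical WordCountFunctions classes from the previous example and illustrative command-line paths.

```java
// Sketch of a MapReduce driver: job configuration, HDFS I/O paths, submission.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountFunctions.TokenMapper.class);
        job.setReducerClass(WordCountFunctions.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output live in HDFS; the paths are taken from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Blocks until the cluster finishes all map and reduce tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```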
What are the advantages of Hadoop?
Hadoop is a software framework that enables distributed processing of large amounts of data, and it does so in a reliable, efficient, and scalable manner.
Hadoop is reliable because it assumes that compute elements and storage will fail; it therefore maintains multiple copies of working data and can redistribute processing around failed nodes.
Hadoop is efficient because it works in parallel, speeding up processing through parallel computation.
Hadoop is also scalable and can handle petabytes of data.
Summarized as follows:
High reliability. Hadoop's ability to store and process data bit by bit has earned people's trust.
High scalability. Hadoop distributes data and computation tasks across clusters of available computers, and these clusters can easily be extended to thousands of nodes.
Efficiency. Hadoop can dynamically move data between nodes and keeps each node dynamically balanced, so processing is very fast.
High fault tolerance. Hadoop can automatically keep multiple copies of data and automatically reassign failed tasks.
Low cost. Compared with all-in-one machines, commercial data warehouses, and data marts such as QlikView and Yonghong Z-Suite, Hadoop is open source, so a project's software costs are greatly reduced.
Savor big data, starting with Hadoop.