Big Data: Day 1

1. Hadoop Ecosystem

1.1 Hadoop v1.0 architecture
    MapReduce (data computation)
    HDFS (data storage)

1.2 Hadoop v2.0 architecture
    MapReduce (data computation; the compute framework Hadoop itself provides), plus other, non-Hadoop compute frameworks
    YARN (manages and allocates cluster resources, both hardware and software)
    HDFS (data storage)

1.3 Hive (MapReduce-based data warehouse)
    HQL is similar to SQL and is typically used for offline data processing (via MapReduce); Hive acts as an HQL -> MR language translator.
    Uses: log analysis, multidimensional data analysis.

1.4 Pig (MapReduce-based data analysis tool)
    Defines a data-flow language, Pig Latin.
    Uses: offline data analysis.

1.5 Mahout (data mining library)
    Provides a library of mathematical and statistical algorithms.

1.6 HBase (distributed database)
    Organized around tables, column families, row keys, and timestamps.

1.7 ZooKeeper (distributed coordination service)
    Uses: unified naming, state synchronization, cluster management, configuration synchronization, etc.

1.8 Sqoop (data synchronization tool)
    Purpose: data transfer between Hadoop and traditional databases.

1.9 Flume (log collection tool)
    Purpose: collecting log files.
    Components: Agent, Collector.

1.10 Oozie (job flow scheduling tool)
    Uses: unified management and scheduling of jobs across different frameworks to improve resource utilization; job status monitoring and alerting.

1.11 Version evolution
    Apache version — the original ecosystem; 2.X.X recommended.
    CDH version — a packaged distribution; CDH 5.0.0 recommended.

2. HDFS — Hadoop Distributed File System

2.1 Advantages
    1. High fault tolerance: data is automatically saved as multiple replicas; when a replica is lost, it is automatically recovered.
    2. Suitable for batch processing: computation moves to the data rather than data to the computation; block locations are exposed to the compute framework.
    3. Suitable for large-scale data processing.
    4. Streaming file access: write once, read many, which guarantees data consistency.
    5. Can be built on cheap commodity machines: fault tolerance and recovery are provided through replication.

2.2 Disadvantages (poor fit for)
    1. Low-latency data access.
    2. Storing large numbers of small files.
    3. Concurrent writes and random file modification.

2.3 Architecture
    1. NameNode — stores metadata.
       Active NameNode: the master (only one active at a time); manages the HDFS namespace, block mapping information, replica policies, etc.
       Standby NameNode: a hot standby; when the active NameNode fails, it quickly takes over as the new active NameNode.
    2. Secondary NameNode — assists the NameNode by periodically merging the edit log into the fsimage (a checkpoint helper, not a hot standby).
    3. DataNode — stores file data; the slave role (there can be many); holds the actual data blocks and performs block reads and writes.
    4. Block
       a. Files are split into fixed-size blocks, 64 MB by default (128 MB in Hadoop 2.x), configurable; a file smaller than one block is still stored as its own block.
       b. Each block normally has three replicas.

2.4 Read and write principles
    1. Reading a file
       1. The client sends a read request to the NameNode (via DistributedFileSystem).
       2. The NameNode checks whether the file exists and returns the result to the client.
       3. If the file exists, the block location information is returned, and the client reads the data from the blocks (via FSDataInputStream).
       4. The client closes the read connection (via FSDataInputStream).
    2. Writing a file
       1. The client sends a write request to the NameNode (via DistributedFileSystem).
       2. The NameNode checks whether the file exists and returns the result to the client.
       3. If the file does not exist, the client sends the file size to the NameNode, which allocates the data blocks; the client then writes through FSDataOutputStream.
       4. When the first DataNode receives the first packet (note: packets travel on the network, not whole blocks), it forwards the packet to the next DataNode while receiving the next packet.
       5. The pipeline continues in the same way; once a packet has been received by every DataNode, an acknowledgment is returned to the client.
       6. The client closes the write connection (via FSDataOutputStream) and reports completion to the NameNode (via DistributedFileSystem).
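To make the read path in 2.4 concrete, here is a minimal sketch using the HDFS Java API. The NameNode URI (hdfs://namenode:8020) and the file path are placeholders for illustration, not values from these notes.

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // For an hdfs:// URI this returns a DistributedFileSystem;
            // "namenode:8020" is a placeholder for your cluster's NameNode.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            // open() asks the NameNode for block locations (steps 1-3); the
            // returned FSDataInputStream then reads from the DataNodes.
            InputStream in = fs.open(new Path("/demo/sample.txt"));
            try {
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in); // step 4: close the read connection
            }
            fs.close();
        }
    }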
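The write path in 2.4 can be sketched the same way. The property names dfs.blocksize and dfs.replication are the standard Hadoop 2.x keys for the per-file block size and replica count described in 2.3; the URI and path are again placeholders.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Per-file block size and replica count (section 2.3), set here
            // to the values the notes mention: 64 MB blocks, 3 replicas.
            conf.set("dfs.blocksize", "67108864");
            conf.set("dfs.replication", "3");
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            // create() registers the file with the NameNode (steps 1-3); the
            // stream then sends packets down the DataNode pipeline (steps 4-5).
            FSDataOutputStream out = fs.create(new Path("/demo/output.txt"));
            try {
                out.write("hello hdfs\n".getBytes("UTF-8"));
            } finally {
                out.close(); // step 6: close and report completion to the NameNode
            }
            fs.close();
        }
    }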
2.5 Access modes
    1. Shell commands
    2. Java API
    3. REST API
    4. Others
    (Java API sketches appear after section 2.4; shell and REST sketches follow section 2.7.)

2.6 Installation process
    1. Prepare the hardware.
    2. Prepare the software (CDH recommended).
    3. Distribute the Hadoop installation package to each node.
    4. Install the JDK.
    5. Edit the /etc/hosts configuration file.
    6. Set up passwordless SSH login.
    7. Edit the configuration files (after editing, distribute them to each node with the scp command).
    8. Start the services.
    9. Validate the installation.

2.7 Common shell commands
    dfs command — file operations
    namenode -format command — formats the NameNode
    dfsadmin command — cluster administration
    fsck command — filesystem health check
    balancer command — rebalances blocks across DataNodes
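The shell commands above can also be driven from Java. A minimal sketch using org.apache.hadoop.fs.FsShell, the class that the dfs command dispatches to; the listed path "/" is just an example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FsShell;
    import org.apache.hadoop.util.ToolRunner;

    public class DfsLs {
        public static void main(String[] args) throws Exception {
            // Equivalent to running "hdfs dfs -ls /" at the command line.
            int exitCode = ToolRunner.run(new Configuration(), new FsShell(),
                    new String[] {"-ls", "/"});
            System.exit(exitCode);
        }
    }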
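For the REST access mode in 2.5, a minimal sketch against the WebHDFS endpoint. The host, the port (50070 is the Hadoop 2.x NameNode HTTP default), and the listed path are assumptions for illustration:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsList {
        public static void main(String[] args) throws Exception {
            // Lists the root directory over WebHDFS; host and port are
            // placeholders for your cluster's NameNode web interface.
            URL url = new URL("http://namenode:50070/webhdfs/v1/?op=LISTSTATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON "FileStatuses" payload
                }
            } finally {
                conn.disconnect();
            }
        }
    }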