As we all know, the big data wave is gradually sweeping across the globe, and Hadoop is the engine driving this storm. There has been a great deal of talk about Hadoop, and interest in using it to handle large datasets appears to be growing.
Today, Microsoft has put Hadoop at the heart of its big data strategy. The reason for Microsoft's move is that it recognizes the potential of Hadoop, which has become the standard for distributed data processing in the big data space. By integrating Hadoop technology, Microsoft gives its customers access to the fast-growing Hadoop ecosystem, and as more and more developers skilled on the Hadoop platform emerge, this benefits Hadoop development as a whole.
The main challenges are as follows:
Data explosion leads to declining insight: Falling hardware costs and increasingly complex data sources have brought massive volumes of data, and enterprises need the right tools to gain clear insight into what lies behind that data.
Mixed structured and unstructured data: Enterprises need to analyze both relational and non-relational data at the same time, and more than 85% of captured data is unstructured.
Bottlenecks in real-time data analysis: New data sources (social media sites such as Twitter, Facebook and LinkedIn) generate massive amounts of data in real time, and this data cannot be analyzed effectively through simple batch processing.
Simplifying deployment and management: Enterprises need a simpler, more streamlined deployment and setup experience. Ideally, they want a small number of installation packages that include the related Hadoop projects, rather than having to pick and choose project by project.
What Is Hadoop
Hadoop is a software framework for distributed, data-intensive processing and analysis built on HDFS (the Hadoop Distributed File System). Hadoop was largely inspired by the MapReduce technique Google described in its 2004 white paper. MapReduce works by breaking a job into hundreds or thousands of smaller tasks and sending them out to a cluster of computers. Each computer returns the results for its own portion of the data, and MapReduce quickly integrates the feedback to form the answer.
MapReduce, the core of Hadoop, is a programming model for processing very large datasets (terabytes of data and beyond), including streaming data generated by web clicks, log files, social networks and so on, and for generating the corresponding programs that execute the processing. Its main ideas are borrowed from functional programming languages, and it also takes characteristics from vector programming languages.
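To make the model concrete, here is a minimal, purely illustrative word-count sketch in C#. It runs entirely in memory and does not use any Hadoop APIs; the class and method names are hypothetical, and the shuffle step that Hadoop performs between the map and reduce phases is simulated with a simple GroupBy.

using System;
using System.Collections.Generic;
using System.Linq;

class WordCountSketch
{
    // Map phase: emit a (word, 1) pair for every word in one line of input.
    static IEnumerable<KeyValuePair<string, int>> Map(string line)
    {
        return line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
                   .Select(word => new KeyValuePair<string, int>(word.ToLowerInvariant(), 1));
    }

    // Reduce phase: sum all counts emitted for the same key (word).
    static KeyValuePair<string, int> Reduce(string key, IEnumerable<int> values)
    {
        return new KeyValuePair<string, int>(key, values.Sum());
    }

    static void Main()
    {
        string[] input = { "the quick brown fox", "the lazy dog", "the fox" };

        // "Shuffle/sort": group the intermediate pairs by key, as the framework would.
        var results = input.SelectMany(Map)
                           .GroupBy(pair => pair.Key, pair => pair.Value)
                           .Select(group => Reduce(group.Key, group));

        foreach (var pair in results)
            Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);
    }
}

In a real Hadoop job, the map calls run in parallel on many nodes, and the framework handles the shuffle and the distribution of the reduce work.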
Windows Azure Hadoop in Practice
The following is a detailed walkthrough of Hadoop on Microsoft Azure. If you also want to try it, visit hadoopazure.com to request an invitation. Microsoft runs Apache Hadoop (0.20.203) on Azure.
Set Up a Cluster
Once you are invited to participate in the beta, you can set up your Hadoop cluster. Go to the hadoopazure site and log on with your Windows Live ID. After logging in, you need to fill in the following information:
Cluster (DNS) name: in the format <name>.cloudapp.net.
Cluster size: select the number of nodes (from 4 to 32) and the associated storage (capacity from 2 TB to 16 TB).
Administrator username and password: set your own username and password; once set, you can connect through Remote Desktop or from Excel.
SQL Azure instance configuration information: an option for configuring the Hive metastore. If you configure this option to give the web portal access to a SQL Azure server instance, you must provide the target database name and logon credentials. The specified login must have the following permissions on the target database: ddl_ddladmin, ddl_datawriter, ddl_datareader.
Click Request Cluster when all the information is complete. The cluster is then allocated and created for you (this takes approximately 5-30 minutes). Once the cluster is allocated, you will see a number of task nodes and a head node called the NameNode.
After the cluster is established, you can click through the Metro-style tiles to see what types of data processing and management tasks can be performed. You can also exchange data with the cluster over the web; when using FTP or ODBC you first need to open the corresponding ports, which are closed by default.
On the Cluster Management page, you can perform basic administrative tasks such as configuring cluster access, importing data and managing the cluster through the interactive console, which supports JavaScript and Hive. In the Tasks section, you can run MapReduce jobs and see the status of running and recently completed MapReduce tasks.
Connecting to Data
There are several ways to upload data to, or access data from, a Hadoop cluster running on Windows Azure, including uploading directly to the cluster or accessing data stored in other locations.
Although FTP allows files of any size to be uploaded in theory, it is best to keep file sizes in the gigabyte range. In addition, if you want a batch job to work with data stored outside of Hadoop, you first need to perform some configuration: set up an external connection by clicking Cluster Management on the home page and configuring which storage location to use, for example storing the results of a Windows Azure DataMarket query in a Windows Azure blob, or using storage on Amazon Web Services (AWS) S3.
To configure the AWS S3 connection, you enter your security keys (public and private) and can then access the S3 data from the Hadoop cluster. To use data from Windows Azure DataMarket, you fill in your username (Windows Live ID), your passkey, the query source (the data source you want to query or import) and the Hive table name. Before entering a query in the cluster, be sure to remove the 100-row query limit generated by the DataMarket tool. To access data stored in Windows Azure blob storage, you enter the storage account name, the URL of the blob storage location and your private passkey value.
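As an illustration of getting data into blob storage in the first place, the following C# sketch uploads a local file to a container using the classic Microsoft.WindowsAzure.StorageClient library from the 1.x-era Azure SDK; the account name, key, container and file names are placeholders, and newer storage client libraries use slightly different method names.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class BlobUploadSketch
{
    static void Main()
    {
        // Placeholder credentials: substitute your own storage account name and key.
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>");

        var blobClient = account.CreateCloudBlobClient();

        // Create (if needed) a container and upload a local text file into it.
        var container = blobClient.GetContainerReference("hadoopdata");
        container.CreateIfNotExist();

        var blob = container.GetBlockBlobReference("davinci.txt");
        blob.UploadFile(@"C:\data\davinci.txt");
    }
}

Data uploaded this way can then be read from the cluster through the blob storage connection configured above.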
Run MapReduce Jobs
After the Hadoop cluster is set up and validated and you have confirmed that the data is available, you can run one or more MapReduce jobs.
If you are not familiar with Hadoop, look at the Samples button on the main page to get a better sense of the whole process. If you are familiar with Hadoop and want to run MapReduce jobs, there are several options. Which one you choose depends on how familiar you are with the Hadoop tools, such as the Hadoop command prompt, and with the languages involved: you can use Java, Pig, JavaScript or C# to run MapReduce tasks on Hadoop on Windows Azure.
Click Samples and then open the WordCount configuration page (Figure 2), which includes the parameters and source data. The source data requires an input file name and file path. When a local file path is selected, the file is stored in the Windows Azure Hadoop cluster itself; the source data can also come from AWS S3, Windows Azure Blob storage, Windows Azure DataMarket or directly from HDFS.
After configuring the parameters, click Execute. Read the detailed instructions before running the sample tasks: some tasks can be created from the main page, while others require an RDP connection to the cluster.
You can monitor the status of a task while it is running and after it completes. The details of a task, including the script, job status, date and time information and so on, can be viewed on the Task History page of the management account.
Use C# to Handle Hadoop Tasks
You can also run Hadoop MapReduce tasks on Windows Azure through C# streaming; the main page has a related example. You first need to upload the required files (davinci.txt, cat.exe and wc.exe) to a storage location such as HDFS, ASV or Amazon S3. You also need the IP address of the Hadoop head node, which you can find by running the following command:
js> #cat apps/dist/conf/core-site.xml
Then fill in the task run parameters:
hadoop jar hadoop-examples-0.20.203.1-SNAPSHOT.jar -files "hdfs:///example/apps/wc.exe,hdfs:///example/apps/cat.exe" -input "/example/data/davinci.txt" -output "/example/data/StreamingOutput/wc.txt" -mapper "cat.exe" -reducer "wc.exe"
In this example, the mapper and reducer are executables that read input line by line from stdin and emit output to stdout. The resulting map/reduce job is submitted to the cluster for execution, as shown in Figure 3.
The workflow is as follows: when a mapper task starts, it launches the mapper executable as a separate process (if there are multiple mappers, each task launches its own process). The mapper task converts its input into lines and feeds them to the process's standard input, then collects the line-oriented results from standard output and converts each line into a key/value pair. When a reduce task receives the output from each mapper, it sorts the input data by the keys of the key/value pairs and groups identical keys together. The reduce() function is then called, iterating over the values associated with each key and producing a list of results (which may be empty).
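As a rough illustration of what such streaming executables might look like, here are two minimal C# console programs, each compiled into its own executable. They are hypothetical stand-ins, not the actual cat.exe and wc.exe samples; they simply show the stdin-to-stdout contract described above.

using System;

// A pass-through mapper: copies each line it reads from stdin to stdout.
class CatMapper
{
    static void Main()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
            Console.WriteLine(line);
    }
}

And the reducer, which counts the lines and words it receives on stdin and writes a single summary line to stdout:

using System;

// A word-count reducer: tallies the lines and words received on stdin.
class WordCountReducer
{
    static void Main()
    {
        int lines = 0, words = 0;
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            lines++;
            words += line.Split(new[] { ' ', '\t' },
                                StringSplitOptions.RemoveEmptyEntries).Length;
        }
        Console.WriteLine("{0}\t{1}", lines, words);
    }
}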
Use HiveQL to Query Hive Tables
Using the interactive web console, you can query Hive tables located on your Hadoop cluster. You need to create a Hive table before querying it, however. When working with the WordCount MapReduce example, the following commands create and verify the Hive table:
hive> LOAD DATA INPATH 'hdfs://lynnlangit.cloudapp.net:9000/user/lynnlangit/davincitop10words.txt' OVERWRITE INTO TABLE wordcounttable;
hive> SHOW TABLES;
hive> DESCRIBE wordcounttable;
hive> SELECT * FROM wordcounttable;
Hive syntax is similar to SQL syntax, and HiveQL provides similar query capabilities. Note that Hadoop is case-sensitive by default.
Other Ways to Connect to the Cluster
The cluster can also be reached over the RDP protocol from the home page, connecting to the cluster's NameNode server via Remote Desktop. To connect via RDP, click the Remote Desktop button on the administration home page, download the RDP connection file, enter the username and password and, if prompted, open the firewall port on the client machine. Once the connection is established, you can manage the cluster with Windows Explorer just as you would manage a local machine.
In the example used in this article, the NameNode server has two processors and 14 GB of memory and runs Windows Server 2008 R2 Enterprise Edition (with SP1) and Apache Hadoop 0.20.203.1. The cluster includes the NameNode and several worker nodes, with a total of eight processors deployed.
The node includes standard Hadoop tools such as the Hadoop command shell (command-line interface), the Hadoop MapReduce job tracker (http://[namenode]:50030) and the Hadoop NameNode HDFS view (http://[namenode]:50070). Over the RDP connection, you can use the Hadoop command shell to run MapReduce tasks or perform other administrative tasks.
Sqoop provides a bridge for transferring data between Hadoop and SQL Server (SQL Server 2008 R2 or higher, or SQL Server Parallel Data Warehouse). The JDBC driver needs to be downloaded and installed on the Sqoop node. Sqoop is based on the SQL Server connector design, allowing you to transfer data between (Linux-based) Hadoop and SQL Server. Sqoop also supports importing and exporting data between SQL Azure and HDFS via FTP.
The Hive ODBC driver (port 10000) allows any Windows application to access and query the Hive data warehouse. It also allows Excel to access Hive, so data can be moved directly from Hive into Excel and PowerPivot.
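For example, a Windows application might query the Hive table created earlier through the ODBC driver. The following C# sketch assumes a Hive ODBC DSN named HadoopOnAzure has already been configured to point at the cluster on port 10000; the DSN name and credentials are placeholders.

using System;
using System.Data.Odbc;

class HiveOdbcQuery
{
    static void Main()
    {
        // Assumes a preconfigured Hive ODBC data source; substitute real credentials.
        string connectionString = "DSN=HadoopOnAzure;UID=<username>;PWD=<password>";

        using (var connection = new OdbcConnection(connectionString))
        {
            connection.Open();

            // Run a HiveQL query against the word-count table built earlier.
            using (var command = new OdbcCommand(
                "SELECT * FROM wordcounttable LIMIT 10", connection))
            using (OdbcDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0}\t{1}", reader[0], reader[1]);
            }
        }
    }
}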
Windows Azure Hadoop Advantages and Limitations
Windows Azure Hadoop has several interesting advantages:
Easy to install and use, with a Metro-style management interface.
A broad choice of languages for MapReduce jobs and data queries: MapReduce jobs support Java, C#, Pig and JavaScript, while queries can use Hive (HiveQL).
Existing Hadoop skills carry over, since the service runs Apache Hadoop version 0.20.203.
A variety of connectivity options, including the ODBC driver (SQL Server/Excel), RDP and other clients, as well as the ability to connect to other cloud storage (Windows Azure Blobs, Windows Azure DataMarket, Amazon Web Services S3).
There are also some current limitations:
Many features will only become known when Windows Azure Hadoop is officially released.
It is currently only a private beta, with relatively few features.
Pricing has not been announced yet.
During the beta period there are restrictions on the size of uploaded files, and it is not yet clear what the official release will offer.