SQL Server and Hadoop: New Territory for Big Data

Source: Internet
Author: User

In the big data arena, Microsoft has not promoted its products and solutions as loudly as other database vendors. Internet giants such as Google and Yahoo have been at the front of the big data challenge, processing enormous volumes of data every day, much of it document-based index files. Of course, big data is not limited to indexes: e-mail messages, documents, web server logs, social networking information, and all the other unstructured data in an enterprise are part of it.

To meet these data challenges, companies such as Autodesk, IBM, Facebook and, of course, Google and Yahoo have without exception deployed the open source Apache Hadoop platform. Microsoft has noticed this trend and added a Hadoop connector to its database platforms. The connector lets businesses move data freely between Hadoop clusters and SQL Server 2008 R2, SQL Server Parallel Data Warehouse, and the next release of SQL Server (code-named "Denali"). Because the connector moves data in both directions, users can take advantage of the powerful storage and data processing capabilities of SQL Server while using Hadoop to manage large volumes of unstructured data.

Traditional Microsoft users, however, are unfamiliar with the SQL Server Hadoop connector and are not used to it: the connector is a command-line tool deployed in a Linux environment. In this article we explain in concrete terms how the SQL Server Hadoop connector works.

Apache Hadoop cluster

Hadoop uses a master/slave architecture deployed on a cluster of Linux hosts. To handle massive amounts of data, a Hadoop environment must contain the following components:

- The master node manages the slave nodes and coordinates processing, managing, and accessing the data files. When an external application sends a job request to the Hadoop environment, the master node also acts as the primary access point.

- The name node runs the NameNode daemon, manages the namespace of the Hadoop Distributed File System (HDFS), and controls access to the data files. It supports operations such as opening, closing, and renaming files, and determines how files are mapped to data blocks. In a small environment, the name node can be deployed on the same server as the master node.

- Each slave node runs a DataNode daemon, which manages the storage of data files and handles read and write requests against them. Slave nodes are built from standard, relatively inexpensive, readily available hardware, so parallel operations can run across thousands of machines.

The following illustration shows how the components of a Hadoop environment relate to one another. Note that the master node runs the JobTracker program and each slave node runs a TaskTracker program. The JobTracker handles requests from client applications and assigns them to different TaskTracker instances. When a TaskTracker receives instructions from the JobTracker, it runs the assigned tasks alongside the DataNode program and handles data movement during each phase of the operation.

You must deploy the SQL Server Hadoop connector within the Hadoop cluster.

MapReduce Framework

As shown in the previous illustration, the master node supports the MapReduce framework, the core technology on which the Hadoop environment relies. In fact, you can think of Hadoop as a MapReduce framework in which the JobTracker and TaskTracker play key roles.

MapReduce breaks large datasets into small, manageable chunks and distributes them across thousands of hosts. It also includes mechanisms for running massively parallel operations, searching petabytes of data, managing complex client requests, and analyzing data in depth. In addition, MapReduce provides load balancing and fault tolerance to ensure that operations complete quickly and accurately.
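The three phases MapReduce runs over a dataset can be illustrated with an ordinary shell pipeline. This is only a single-machine analogy of the model, not the Hadoop framework itself: `tr` plays the map phase, `sort` the shuffle/sort phase, and `uniq -c` the reduce phase.

```shell
# Toy word count mirroring the MapReduce phases:
#   map:          emit one key (a word) per line
#   shuffle/sort: bring identical keys together
#   reduce:       count the occurrences of each key
printf 'big data\nbig cluster\n' | tr ' ' '\n' | sort | uniq -c
# prints counts for big (2), cluster (1), and data (1)
```

In real Hadoop, each phase runs in parallel on many slave nodes and the shuffle moves intermediate data between them over the network, but the data flow is the same.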

MapReduce is tightly knit with the HDFS architecture, which stores each file as a sequence of data blocks. Blocks are replicated across the cluster, and all blocks in a file are the same size except the last. The DataNode program on each slave node creates, deletes, and replicates data blocks under the direction of HDFS. Note, however, that an HDFS file can be written only once.
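From the command line, HDFS is usually driven through the `hadoop fs` utility. The following is a sketch only (the paths and file name are hypothetical, and the commands require a running cluster):

```shell
# Copy a local log file into HDFS; it is split into fixed-size
# blocks that are replicated across the DataNodes.
hadoop fs -put weblogs.txt /user/demo/weblogs.txt

# List and read the file back.
hadoop fs -ls /user/demo
hadoop fs -cat /user/demo/weblogs.txt

# HDFS files are write-once: to change the contents, you delete
# the file and write a new one rather than editing it in place.
hadoop fs -rm /user/demo/weblogs.txt
```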

SQL Server Hadoop connector

Users need to deploy the SQL Server Hadoop connector on the master node of the Hadoop cluster. The master node also needs Sqoop and Microsoft's JDBC driver for SQL Server installed. Sqoop is an open source command-line tool for importing data from a relational database, transforming it with the Hadoop MapReduce framework, and exporting the results back to the database.

Once the SQL Server Hadoop connector is deployed, you can use Sqoop to import and export SQL Server data. Note that Sqoop and the connector operate from a Hadoop-centric point of view: importing data means retrieving it from the SQL Server database and adding it to the Hadoop environment, while exporting data means retrieving it from Hadoop and sending it to a SQL Server database.

Sqoop supports several storage types for imported and exported data:

- Text files: plain text files with comma-separated values;

- Sequence files: binary files containing serialized record data;

- Hive tables: tables in a Hive data warehouse, a data warehouse infrastructure built specifically for Hadoop.
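A hedged sketch of what an import and an export look like with Sqoop and the connector. The server name, database, table names, paths, and credentials here are hypothetical, and the exact options available depend on the Sqoop and connector versions installed:

```shell
# Import a SQL Server table into HDFS as comma-separated text files.
sqoop import \
  --connect 'jdbc:sqlserver://dbserver:1433;databaseName=Sales' \
  --username hadoop_user --password '********' \
  --table Orders \
  --target-dir /user/demo/orders \
  --as-textfile

# Export the results of a Hadoop job back into a SQL Server table.
sqoop export \
  --connect 'jdbc:sqlserver://dbserver:1433;databaseName=Sales' \
  --username hadoop_user --password '********' \
  --table OrderSummary \
  --export-dir /user/demo/order_summary
```

Swapping `--as-textfile` for `--as-sequencefile` imports into sequence files instead, and Sqoop also offers Hive-oriented import options for landing data directly in Hive tables.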

In general, combining SQL Server with the Hadoop environment (MapReduce and HDFS) lets users process large amounts of unstructured data and consolidate it into a structured environment for reporting and BI analysis.

Microsoft's Big Data strategy is just beginning

The SQL Server Hadoop connector is an important step on Microsoft's road to big data. At the same time, because Hadoop, Linux, and Sqoop are open source technologies, it means Microsoft is opening up to the open source world on a large scale. In fact, Microsoft's plans go further: at the end of this year it will launch a Hadoop-like solution delivered as a service on the Windows Azure cloud platform.

Next year, Microsoft plans to launch a similar service for the Windows Server platform. There is no denying that the SQL Server Hadoop connector is significant for Microsoft: it lets users take on big data challenges from within a SQL Server environment, and it promises more to come.

