E-commerce Big Data Learning Notes: Hands-On Practice

Last Update:2016-05-02 Source: Internet

Author: User



1. YARN splits resource management and job scheduling/monitoring into two separate daemons.

Its two central components are the global ResourceManager and a per-application ApplicationMaster.

2. YARN characteristics:

1) scalability; 2) high availability (HA); 3) compatibility (jobs written for Hadoop 1.0 can still run); 4) improved cluster utilization; 5) support for the MapReduce programming paradigm.

3. Hadoop daemons:

1) NameNode: the HDFS master daemon, which holds the file system metadata;

2) Secondary NameNode: an auxiliary background daemon that monitors HDFS state and periodically checkpoints the NameNode's metadata; despite the name, it is not a hot standby for the NameNode;

3) DataNode: writes HDFS data blocks to the local file system; the default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x;

4) ResourceManager: a central service responsible for scheduling, launching each job, and allocating resources;

5) NodeManager: manages a single node of the YARN cluster; it maintains container state (CPU, memory, disk, network) and sends heartbeats to the ResourceManager;

6) ApplicationMaster: responsible for all work within the life cycle of a single job.

4. Common HDFS commands (omitted)

5. Common Hadoop configuration parameters in detail (omitted)

6. Hive's three main user interfaces: the command-line interface (CLI), the client, and the web interface (WUI).

1) The CLI is the most commonly used. Starting it also starts a Hive instance, and prepared scripts can be passed to the CLI for execution.

2) The client is Hive's client program; it connects the user to HiveServer.

3) The WUI is a web tool for accessing Hive through a browser.

7. Hive metadata is typically stored in a relational database, such as MySQL (multi-user) or the embedded Derby database (single user).

8. Hive data is stored in HDFS (for both external and internal tables), and most queries are executed as MapReduce jobs.
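The difference between internal (managed) and external tables in item 8 can be sketched in HiveQL as follows. The table names, columns, and the LOCATION path are hypothetical and chosen only for illustration.

```sql
-- Managed (internal) table: Hive owns the files under its warehouse directory.
CREATE TABLE IF NOT EXISTS huserinfo_managed (
  id   INT,
  name STRING
);

-- External table: Hive records only the metadata; the files stay at LOCATION.
-- The path below is a hypothetical example.
CREATE EXTERNAL TABLE IF NOT EXISTS huserinfo_external (
  id   INT,
  name STRING
)
LOCATION '/data/external/huserinfo';

-- Dropping the managed table removes both its metadata and its data.
DROP TABLE huserinfo_managed;

-- Dropping the external table removes only the metadata; the files remain in HDFS.
DROP TABLE huserinfo_external;
```

An external table is usually the safer choice when other tools also read the same HDFS files.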

9. Hive's common processes and services: run hive --service help to see the services Hive provides.

cli: the command-line interface.

hiveserver: the client interface.

hwi: the Hive web interface.

jar: the Hive equivalent of hadoop jar.

metastore: the service that serves the metadata.

10. Three ways to connect to the metastore: single-user (Derby), multi-user (MySQL), and remote (for example, over Thrift).

11. HiveQL as described in these notes does not support row-level INSERT or UPDATE, because data warehouse content is read often and written rarely. All data is fixed at load time and stored in HDFS. (Hive 0.14 and later add INSERT ... VALUES, UPDATE, and DELETE for transactional tables.)

12. Hive has no data format of its own; the user specifies the format, and defining it requires three attributes:

1) the column delimiter (usually a space or "\t");

2) the line delimiter ("\n");

3) the file format used to read the data (three are built in: TEXTFILE, SEQUENCEFILE, RCFILE).
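As a minimal sketch, the three attributes from item 12 map onto a table definition like this. The table name user_log and its columns are hypothetical.

```sql
CREATE TABLE IF NOT EXISTS user_log (
  user_id INT,
  url     STRING,
  visited STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```

FIELDS TERMINATED BY sets the column delimiter, LINES TERMINATED BY sets the line delimiter, and STORED AS selects the file format.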

13. Hive does not process or even scan the data while loading it, so it builds no index at load time.

14. Hive queries are executed through Hadoop rather than through an execution engine of Hive's own.

15. Hive queries have high latency and usually run offline as batch jobs, but they can process large volumes of data.

16. Hive scales very well; a deployment can grow to thousands of Hadoop nodes.

17. hive --database temp: starts the CLI directly in the temp database.

18. set -v / reset: set (or, with -v, list) or reset configuration variables.

For example: set mapred.reduce.tasks=10;

19. !: executes an external shell command. For example: !ls; lists the files in the current directory.

20. dfs: executes an HDFS command.

For example: dfs -mkdir /user/hadoop/warehouse;
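Items 18 to 20 can be tried together in one Hive CLI session. The sketch below assumes it is typed at the hive> prompt (or saved in a script passed to hive -f); the /user/hadoop directory is taken from the example above and is assumed to exist.

```sql
-- Set a variable for the current session.
SET mapred.reduce.tasks=10;

-- Print the current value of that variable.
SET mapred.reduce.tasks;

-- Run a shell command on the local machine.
!ls;

-- Run HDFS commands without leaving the CLI.
dfs -mkdir /user/hadoop/warehouse;
dfs -ls /user/hadoop;

-- Restore the configuration to its default values.
RESET;
```

Comments are kept on their own lines because the older Hive CLI only skips lines that start with --.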

21. add FILE / list FILE / delete FILE: manage resources in the distributed cache, which makes them available on every machine in the cluster.
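A minimal sketch of the resource commands in item 21. The script clean_tel.py and its path are hypothetical, and the example assumes that mytable has id and tel columns and that python is available on the cluster nodes.

```sql
-- Ship a local script to every node through the distributed cache.
-- The path is a hypothetical example.
ADD FILE /home/hadoopuser/scripts/clean_tel.py;

-- Show the resources registered in this session.
LIST FILES;

-- Use the shipped script to transform rows.
SELECT TRANSFORM (id, tel)
USING 'python clean_tel.py'
AS (id, tel)
FROM mytable;

-- Remove the resource from the session when it is no longer needed.
DELETE FILE /home/hadoopuser/scripts/clean_tel.py;
```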

22. hive -S: silent mode; suppresses informational output.

23. Insert data: INSERT OVERWRITE TABLE huserinfo SELECT * FROM old_userinfo; -- imports data from another table

24. hive -e 'set;' | grep mapred.reduce.tasks: -e executes the quoted statement directly from the shell, without opening the interactive CLI.

25. hive -d sitename=www.baidu.com: defines a variable.
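A variable defined with -d can then be referenced inside a query. The sketch below assumes the CLI was started with hive -d sitename=www.baidu.com; the table visit_log and its site column are hypothetical.

```sql
-- Variables defined with -d live in the hivevar namespace.
SELECT *
FROM visit_log
WHERE site = '${hivevar:sitename}';
```

Hive substitutes the variable's value into the statement text before the query is compiled.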

26. Cross-database query (qualify the table with the database name): SELECT * FROM hive.huserinfo;

27. Given an external SQL script get_order_sum.sql:

hive -f /home/hadoopuser/scripts/get_order_sum.sql runs the script directly.

hive -v -f /home/hadoopuser/scripts/get_order_sum.sql also echoes the SQL in the script.

hive -S -f /home/hadoopuser/scripts/get_order_sum.sql > test.txt runs silently and writes the results to a text file.

28. Common Hive configuration parameters (omitted); they can be set in the .hiverc file.

29. Clearing data: Hive does not support DELETE.

Use TRUNCATE TABLE tablename;

To clear a single partition: TRUNCATE TABLE tablename PARTITION (dt='2015-10-6');

30. Deleted data is moved to the trash directory: .Trash

31. Create an index, much as in SQL:

CREATE INDEX index_name ON TABLE tablename (col_name) AS index_type;
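A fuller sketch of the index life cycle, assuming the huserinfo table has an id column (an assumption) and using the built-in 'COMPACT' index type. Note that index support was removed in Hive 3.0, so this applies only to earlier versions.

```sql
-- Define the index without building it yet.
CREATE INDEX idx_huserinfo_id
ON TABLE huserinfo (id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Build (or refresh) the index data.
ALTER INDEX idx_huserinfo_id ON huserinfo REBUILD;

-- List the indexes defined on the table.
SHOW INDEX ON huserinfo;

-- Remove the index.
DROP INDEX IF EXISTS idx_huserinfo_id ON huserinfo;
```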

32. Role management: permissions are best controlled through roles. (Omitted)

33. DESC: shows basic information about a database or table.

34. Four ways to import data into Hive:

1) From the local file system: LOAD DATA LOCAL INPATH '1234.txt' INTO TABLE mytable;

2) From HDFS: LOAD DATA INPATH '/home/hadoopuser/1234.txt' INTO TABLE mytable;

3) From a query, into partitions:

INSERT OVERWRITE TABLE mytable
PARTITION (age)
SELECT id, name, tel, age
FROM oldtable;

4) By creating the table from a query:

CREATE TABLE mytable
AS
SELECT id, name, tel
FROM oldtable;

35. How do you create dynamic partitions when creating a Hive table?
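The notes leave this question unanswered. One common approach, sketched here under the assumption that oldtable has id, name, tel, and age columns, is to declare the partition column on the table and let Hive derive the partition values from the last column of the SELECT.

```sql
-- Allow partitions to be created from the query result.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column is declared separately from the regular columns.
CREATE TABLE IF NOT EXISTS mytable (
  id   INT,
  name STRING,
  tel  STRING
)
PARTITIONED BY (age INT);

-- The last selected column (age) supplies the partition value for each row.
INSERT OVERWRITE TABLE mytable
PARTITION (age)
SELECT id, name, tel, age
FROM oldtable;
```

In nonstrict mode every partition column may be dynamic; in the default strict mode at least one partition column must be given a static value.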

36. Hive queries accept a regular expression in place of column names, which is useful when you do not know the exact column names.

SELECT `regular_expression` FROM mytable;
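A short sketch of regular-expression column selection, assuming mytable has the columns id, name, tel, and age. On Hive 0.13.0 and later the quoted-identifier setting shown first is needed for backquoted names to be treated as regular expressions.

```sql
-- Treat backquoted names as regular expressions (Hive 0.13.0 and later).
SET hive.support.quoted.identifiers=none;

-- Select every column except age.
SELECT `(age)?+.+` FROM mytable;

-- Select every column whose name starts with "t".
SELECT `t.*` FROM mytable;
```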

39. Sorting data

ORDER BY: global ordering.

SORT BY: sorts only within each reducer.

DISTRIBUTE BY: sends rows with the same value of the specified columns to the same reducer.

CLUSTER BY = DISTRIBUTE BY + SORT BY (on the same columns).
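The four clauses can be compared side by side. This sketch assumes mytable has id, name, and age columns.

```sql
-- Total order over the whole result; the final sort runs in a single reducer.
SELECT id, name, age FROM mytable ORDER BY age;

-- Each reducer's output is sorted, but there is no order across reducers.
SELECT id, name, age FROM mytable SORT BY age;

-- Rows with the same age go to the same reducer and are sorted by id within it.
SELECT id, name, age FROM mytable DISTRIBUTE BY age SORT BY id;

-- Shorthand for DISTRIBUTE BY age SORT BY age.
SELECT id, name, age FROM mytable CLUSTER BY age;
```

On large tables ORDER BY can become a bottleneck, since the final sort runs in a single reducer.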

40. A practical way to write a multi-table join:

SELECT * FROM table1 t1, table2 t2, table3 t3 WHERE t1.id = t2.id AND t2.id = t3.id;

41. A semi-join (LEFT SEMI JOIN) is generally more efficient than an ordinary inner join.
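A minimal LEFT SEMI JOIN sketch, reusing table1 and table2 from item 40 and assuming table1 has a name column. It returns the rows of table1 that have at least one match in table2, without duplicating them when there are several matches.

```sql
-- Equivalent in effect to: WHERE t1.id IN (SELECT id FROM table2)
SELECT t1.id, t1.name
FROM table1 t1
LEFT SEMI JOIN table2 t2
  ON t1.id = t2.id;
```

Columns of the right-hand table may appear only in the ON clause, not in the SELECT list or WHERE clause.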

42. CTE (common table expression)

WITH q1 AS (SELECT * FROM src WHERE key > 50)

The query that follows can use q1 directly, like a temporary table.
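A complete statement built around the WITH clause above might look like this. It assumes src has key and value columns; the aggregation is only an illustration.

```sql
WITH q1 AS (
  SELECT key, value
  FROM src
  WHERE key > 50
)
SELECT key, count(*) AS cnt
FROM q1
GROUP BY key;
```

The CTE exists only for the duration of the statement it belongs to.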

43. What are UDF, UDAF, and UDTF?

UDF: user-defined function (a custom function). UDAF: user-defined aggregate function. UDTF: user-defined table-generating function.
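The three kinds of function differ in how many rows go in and come out. The sketch below illustrates each shape with a built-in function rather than a custom one; it assumes mytable has name, age, and tel columns and, for the last query, that tel may hold several comma-separated numbers (an assumption made only for the example).

```sql
-- UDF shape: one row in, one value out.
SELECT upper(name) FROM mytable;

-- UDAF shape: many rows in, one value out per group.
SELECT age, count(*) FROM mytable GROUP BY age;

-- UDTF shape: one row in, zero or more rows out.
SELECT explode(split(tel, ',')) AS single_tel FROM mytable;
```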

44. What is an analytic function?
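The notes do not answer this question. As a hedged illustration: an analytic (window) function computes a value for each row over a set of related rows, without collapsing them the way GROUP BY does. The sketch assumes mytable has id, name, and age columns and a Hive version with windowing support (0.11 or later).

```sql
SELECT id,
       name,
       age,
       rank() OVER (PARTITION BY age ORDER BY id) AS rank_in_age,
       count(*) OVER (PARTITION BY age) AS people_with_same_age
FROM mytable;
```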

45. Three storage formats in Hive:

TEXTFILE: uncompressed by default, so disk overhead is large;

SEQUENCEFILE: easy to use, splittable, compressible, and readable concurrently, but takes more space;

RCFILE: compresses well, but loading data is expensive.
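The storage format is chosen per table with STORED AS. In this sketch the table mytable_rc is hypothetical and mytable is assumed to be an existing text table with id, name, and tel columns.

```sql
-- Create a table stored in RCFile format.
CREATE TABLE IF NOT EXISTS mytable_rc (
  id   INT,
  name STRING,
  tel  STRING
)
STORED AS RCFILE;

-- LOAD DATA only copies or moves files, so plain text cannot be loaded
-- directly into an RCFile table; rewrite it with INSERT ... SELECT instead.
INSERT OVERWRITE TABLE mytable_rc
SELECT id, name, tel
FROM mytable;
```

The INSERT ... SELECT step runs a job that rewrites the data, which is the load-time cost mentioned for RCFILE above.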

46. How a HiveQL join is executed:

47. Hive's execution life cycle

48. Example: Order Port module

49. HiveQL in practice

Save the following script as a .sh file and call it directly.

50. HiveQL in practice, part 2


