A Knowledge Map for Big Data Engineers

Source: Internet
Author: User
Tags: hadoop ecosystem

What do you need to know to do big data work in an enterprise? I think there are two aspects to consider: technology and business. On the technology side, the work mainly involves probability and mathematical statistics, computer systems, algorithms, and programming. The business side differs from company to company. Big data engineers need to learn to apply data mining methods to practical problems with the help of computer systems and programming tools. In this way they can mine massive data sets to drive business growth and create more value for the enterprise in a fiercely competitive market.

The business varies from company to company, but the technical points are largely the same, so here I briefly summarize the technical knowledge that big data engineers need to master. It mainly covers the fundamentals of databases, data warehouses, programming, distributed systems, the Hadoop ecosystem, data mining, and machine learning. Of course, what I list here is the combined skill set of a whole team; each person plays a different role in the team. Comments are welcome.

| Topic | Content | Key points | Reference |
|---|---|---|---|
| DB/OLTP & DW/OLAP | Database/OLTP basics | Relational model, SQL, index/secondary index, inner join/left join/right join/full join, transaction/ACID | Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. |
| | Database internals & implementation | Architecture, memory management, storage/B+ tree, query parsing/optimization/execution, hash join/sort-merge join | |
| | Distributed and parallel databases | Sharding, database proxy | |
| | Data warehouse/OLAP | Materialized views, ETL, column-oriented storage, reporting, BI tools | |
| Basic programming | Programming languages | Java, Python (Pandas/NumPy/SciPy/scikit-learn), SQL, functional programming, R/SAS/SPSS | Wes McKinney. Python for Data Analysis: Agile Tools for Real World Data. |
| | OS | Linux | |
| | DB & DW systems | MySQL/Hive/Impala | |
| | Text formats and processing | JSON/XML, regex | |
| | Tools | Git/SVN, Maven | |
| Distributed system & Hadoop ecosystem & NoSQL | Distributed system principles and theory | CAP theorem, RPC (Protocol Buffers/Thrift/Avro), ZooKeeper, metadata management (HCatalog) | |
| | Distributed storage & computing framework & resource management | Hadoop/HDFS/MapReduce/YARN | Tom White. Hadoop: The Definitive Guide. Donald Miner, Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. |
| | SQL on Hadoop: data (log) acquisition/integration/fusion, normalization, feature extraction | Sqoop, Flume/Scribe/Chukwa, SerDe | Edward Capriolo, Dean Wampler, Jason Rutherglen. Programming Hive. |
| | Query & in-database analytics | Hive, Impala, UDF/UDAF | |
| | Large-scale data mining & machine learning frameworks | Spark/MLbase, MR/Mahout | |
| | Stream processing | Storm | |
| | NoSQL | HBase/Cassandra (column-oriented databases) | Lars George. HBase: The Definitive Guide. |
| | | MongoDB (document database) | |
| | | Neo4j (graph database) | |
| | | Redis (cache) | |
| Data mining & machine learning | DM & ML basics | Numerical/categorical variables, training/test data, overfitting, bias/variance, precision/recall, tagging | |
| | Statistics | Descriptive statistics (mean, median/range/standard deviation/variance/histogram), probability distributions (Normal/Gaussian, Poisson), covariance, correlation coefficient, distance and similarity computation, Bayes' theorem, Monte Carlo method, hypothesis testing | |
| | Supervised learning | Classifier, boosting, prediction, regression analysis | Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. |
| | Unsupervised learning | Clustering, deep learning | |
| | Collaborative filtering | Item-based CF, user-based CF | |
| | Algorithms: classifiers | Decision trees, KNN (k-nearest neighbors), SVM (support vector machines), SVD (singular value decomposition), naive Bayes classifiers, neural networks | |
| | Algorithms: regression | Linear regression, logistic regression, ranking, perceptron | |
| | Algorithms: clustering | Hierarchical clustering, k-means clustering, spectral clustering | |
| | Algorithms: dimensionality reduction | PCA (principal component analysis), LDA (linear discriminant analysis), MDS (multidimensional scaling) | |
| | Text mining & information retrieval | Corpus, term-document matrix, term frequency & weighting, association rules, market basket analysis, vocabulary mapping, sentiment analysis, tagging, PageRank, VSM (vector space model), inverted index | Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. |
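
To make one of the entries above concrete, here is a minimal sketch of the precision/recall item listed under the data mining and machine learning basics. It is only an illustration and not part of the original knowledge map: the function name `precision_recall` and the toy labels are assumptions chosen for this example, and in practice a library such as scikit-learn would normally be used instead.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one positive class.

    precision = TP / (TP + FP): of everything predicted positive, how much was right.
    recall    = TP / (TP + FN): of everything actually positive, how much was found.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy data there are 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3 and recall is 1/2. The gap between the two numbers is why the map lists them as a pair rather than relying on accuracy alone.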

Original article address: Big Data Engineering Personnel knowledge graph. Thank you for sharing it.
