Topic |
Content |
Key points |
Reference |
DB/OLTP & DW/OLAP |
Database/OLTP basics |
The relational model, SQL, index/secondary index, inner join/left join/right join/full join, transaction/ACID |
Ramakrishnan, Raghu, and Johannes Gehrke. Database Management Systems. |
Database internals & implementation |
Architecture, memory management, storage/B+ tree, query parsing/optimization/execution, hash join/sort-merge join |
Distributed and parallel database |
Sharding, database proxy |
Data warehouse/OLAP |
Materialized views, ETL, column-oriented storage, reporting, BI tools |
Basic programming |
Programming language |
Java, Python (pandas/NumPy/SciPy/scikit-learn), SQL, functional programming, R/SAS/SPSS |
Wes McKinney. Python for Data Analysis: Agile Tools for Real World Data. |
OS |
Linux |
DB & DW system |
MySQL/Hive/Impala |
Text formats and processing |
JSON/XML, regex |
Tools |
Git/SVN, Maven |
Distributed system & Hadoop ecosystem & NoSQL |
Distributed system principles and theory |
CAP theorem, RPC (Protocol Buffers/Thrift/Avro), ZooKeeper, metadata management (HCatalog) |
Distributed storage & computing framework & resource management |
Hadoop/HDFS/MapReduce/YARN |
Tom White. Hadoop: The Definitive Guide. Donald Miner, Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. |
SQL on Hadoop |
Data (log) acquisition/integration/fusion, normalization, feature extraction |
Sqoop, Flume/Scribe/Chukwa, SerDe |
Edward Capriolo, Dean Wampler, Jason Rutherglen. Programming Hive. |
Query & In-database analytics |
Hive, Impala, UDF/UDAF |
Large-scale data mining & machine learning frameworks |
Spark/MLbase, MR/Mahout |
Stream processing |
Storm |
NoSQL |
HBase/Cassandra (column-oriented database) |
Lars George. HBase: The Definitive Guide. |
MongoDB (document database) |
Neo4j (graph database) |
Redis (cache) |
Data mining & Machine learning |
DM & ML basics |
Numerical/categorical variables, training/test data, overfitting, bias/variance, precision/recall, tagging |
Statistics |
Descriptive statistics (mean/median/range/standard deviation/variance/histogram), probability distributions (Normal/Gaussian, Poisson), covariance, correlation coefficient, distance and similarity computation, Bayes' theorem, Monte Carlo method, hypothesis testing |
Supervised learning |
Classifier, boosting, prediction, regression analysis |
Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. |
Unsupervised learning |
Clustering, deep learning |
Collaborative filtering |
Item-based CF, user-based CF |
Algorithm |
Classifier |
Decision trees, KNN (k-nearest neighbors), SVM (support vector machines), SVD (singular value decomposition), naïve Bayes classifiers, neural networks |
Regression |
Linear regression, logistic regression, ranking, perceptron |
Clustering |
Hierarchical clustering, k-means clustering, spectral clustering |
Dimensionality reduction |
PCA (principal component analysis), LDA (linear discriminant analysis), MDS (multidimensional scaling) |
Text mining & Information retrieval |
Corpus, term-document matrix, term frequency & weighting, association rules, market basket analysis, vocabulary mapping, sentiment analysis, tagging, PageRank, VSM (vector space model), inverted index |
Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce. |