Spark Large-Scale Project in Practice: E-Commerce User Behavior Analysis Big Data Platform

Last Update:2016-04-12 Source: Internet

Author: User

Tags map class shuffle


This course walks through a big data statistical analysis platform for an Internet e-commerce company, built with Java, Spark, and related technologies. The platform performs complex analysis of user behavior on an e-commerce website: visit behavior, page-jump behavior, shopping behavior, ad-click behavior, and so on. Product managers (PMs), data analysts, and management use the resulting statistics to evaluate existing products, to keep improving product design based on how users actually behave, and to adjust company strategy and business. The end goal is to use big data technology to help improve the company's performance, revenue, and market share.

1. Course development environment

Development tool: Eclipse
Linux: CentOS 6.4
Spark: 1.5.1
Hadoop: hadoop-2.5.0-cdh5.3.6
Hive: hive-0.13.1-cdh5.3.6
ZooKeeper: zookeeper-3.4.5-cdh5.3.6
Kafka: 2.9.2-0.8.1
Other tools: flume-ng-1.5.0-cdh5.3.6, SecureCRT, WinSCP, VirtualBox, etc.

2. Course content

The project is built on Spark, currently the most popular technology in the big data field, which gives it a forward-looking, cutting-edge character that ordinary projects cannot match. It uses the three most commonly used frameworks in the Spark ecosystem (Spark Core, Spark SQL, and Spark Streaming) to develop both offline and real-time computing modules. Four business modules are implemented: user visit session analysis, page single-hop conversion rate statistics, offline statistics of popular products, and real-time statistics of ad click traffic.

All business modules are extracted directly from real enterprise projects with no reduction in business complexity. They have only been integrated, technically and functionally, to some extent so that they better fit the needs of a hands-on big data course. In authenticity, business complexity, and practical focus, the project is beyond comparison with the few-hour, demo-level big data projects on the market. Through this integration and adaptation of real business modules, the project covers almost all functional points, knowledge points, and performance optimization points of Spark Core, Spark SQL, and Spark Streaming. With just one project, you can get a thorough grasp of how Spark meets all kinds of business requirements in real projects.

The project focuses on knowledge and techniques accumulated in real enterprise projects: performance tuning, troubleshooting, and data skew solutions. Almost all of this material is unique to this course and is hard-won experience that no other video course or book covers. At the same time, every business module follows the development process of an enterprise big data project from start to finish (requirements analysis, solution design, data design, coding, testing, and performance tuning), faithfully reproducing how real big data projects are developed. The overall commercial value of the project is stated to be well over a million dollars.

After completing this course, you can significantly improve your Spark skills, hands-on development ability, project experience, and performance tuning and troubleshooting experience. If you have already taken the course "Spark from Beginner to Proficient (Scala programming, hands-on cases, advanced features, Spark kernel source analysis, advanced Hadoop)", then after finishing this course you can reach the level of roughly 2-3 years of Spark big data development experience and formally enter the ranks of senior Spark development engineers. When changing jobs or interviewing, strong Spark skills plus experience with a complex Spark big data project are enough to handle interviews at any company in the country, including the demanding interviews at top Internet companies such as BAT (Baidu, Alibaba, Tencent), so that what you learn puts you in control of your own career.

Note in particular that this course requires a foundation in Java and Hadoop. If you lack it, please study the relevant material first.

This course requires a solid Spark foundation. If you do not have one, it is recommended that you first take the North Wind Network (ibeifeng.com) course "Spark from Beginner to Proficient (Scala programming, hands-on cases, advanced features, Spark kernel source analysis, advanced Hadoop)" (http://www.ibeifeng.com/goods-560.html).

Note 1: On the relationship between "Spark from Beginner to Proficient" and this course: if you have truly mastered the first course, you can reach the level of 1-2 years of Spark development experience. If you complete the first course and then this project course, and master both, you can reach the level of 2-3 years of Spark development experience and become a senior Spark development engineer.

Note 2: Because students' technical backgrounds vary, this project requires only a J2SE foundation, that is, basic Java programming. It does not require any Java frameworks, does not use any JavaScript, and does not involve integrating third-party technologies. The main purpose is to lower the learning threshold. This course does not cover development of the Java EE layer, but it does explain how Spark works together with Java EE to form the architecture of an interactive big data platform. So the only requirements are basic Java programming and a solid Spark foundation.

Note 3: On the choice of development language: this course uses Java rather than Scala, because Java has clear advantages when developing large, complex big data business systems or platforms. In a truly large and complex project, the Spark application may need to manage a large number of components, which may call for the Spring framework; it may need complex database operations, which call for an ORM-style framework such as MyBatis; or it may need to integrate with Redis, Kafka, or ZooKeeper through their Java client APIs. In the author's view, Scala does not meet these requirements well, and using it is likely to produce a multi-language project with significantly reduced maintainability and extensibility. (Note that, to reduce the learning difficulty and keep the focus on Spark, this course does not use any of the technologies above, only plain Java programming and Spark. That does not mean you will not need them in real work.)

The most important features of this course include:

1. The only high-end big data project course on the web: according to the author, there is no other high-end big data project course on the market and no other hands-on Spark big data project course. This is presented as the only enterprise-grade, large-scale Spark big data project course available online.

2. Enterprise big data project architecture: a configuration management component, a JDBC helper component (with a built-in database connection pool), domain and DAO models, and so on, following the architecture of a formal large-scale big data project.

3. Interactive big data analysis platform architecture: the prototype of this project is not an ordinary big data project that schedules offline statistical jobs, but an interactive big data analysis platform composed of Spark and a Java EE system. Spark development in the course is explained in the context of this architecture.

4. A faithful reproduction of the complete enterprise big data project development process: each business module is explained through data analysis, requirements analysis, solution design, database design, coding, functional testing, performance tuning, troubleshooting, and resolving data skew (after going live), so that students learn the process and gain the experience of developing real big data projects.

5. Comprehensive technical coverage: one project course covers at least 90% of the beginner, intermediate, and advanced technical points of Spark Core, Spark SQL, and Spark Streaming. Working through the project exercises students' hands-on Spark project skills and lets them master both the technology and the project.

6. Real-world performance tuning and troubleshooting experience: drawing on the project's actual functional modules and business scenarios, and on the instructor's experience developing Spark jobs that process data on the order of a billion records or more, the course explains a large number of advanced performance tuning techniques and the experience of troubleshooting errors and failures in production, helping students master the sophisticated Spark techniques used in real enterprise projects.

7. Data skew solutions: the course explains a complete set of data skew solutions accumulated in real projects, including how to identify, diagnose, and locate data skew, and 7 solutions for different types of skew. It aims to help students solve the hardest data skew problems in enterprise projects and become core technical staff in their companies. The author describes this material as unique on the web.

8. Highly complex business functions: the four functional modules are all extracted from real enterprise projects, then integrated and improved so that they contain more technical points, and more comprehensive ones, than the original projects. All module requirements are complex, real enterprise-level requirements, well beyond the demo-level big data projects on the market. After completing the course, students gain genuine enterprise-level project experience.

9. A large number of advanced techniques: custom accumulators, a time-proportional random sampling algorithm, secondary sort, grouped top-N, page split generation and page flow matching, Hive and MySQL heterogeneous data sources, RDD-to-DataFrame conversion, registering and using temporary tables, a custom UDAF aggregate function (group_concat_distinct), custom functions such as get_json_object, Spark SQL built-in functions (if, case when, etc.), window functions (row_number), a dynamic blacklist mechanism, transform, updateStateByKey, combining transform with Spark SQL, sliding windows (window), high-performance database writes, and more.

10. Industry experience woven in throughout: the instructor shares extensive experience and observations from the big data industry to broaden students' understanding of it.

11. Source code included: the full business-grade source code of the large Spark big data project is provided, with a stated value in the millions. With a little adaptation and secondary development, it can even be used directly for user behavior analysis in your own company.

12. Live diagrams and written notes: all complex business processes, architecture principles, Spark principles, business requirements analysis, and technical implementation details are explained with diagrams drawn in Excel or with detailed written notes, so that the theory is analyzed thoroughly and visually and is easier to understand, remember, and review.
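
To make the secondary-sort item in point 9 more concrete, here is a minimal plain-Java sketch of a custom sort key of the kind the Top 10 popular categories module calls for. The class name `CategorySortKey`, its fields, the helper `topCategoryIds`, and the priority order (clicks, then orders, then payments) are illustrative assumptions, not the course's actual source code. It deliberately avoids the Spark API so it can be run on its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sort key: names and the click -> order -> pay priority are assumptions.
final class CategorySortKey implements Comparable<CategorySortKey> {
    final long categoryId;
    final long clickCount;
    final long orderCount;
    final long payCount;

    CategorySortKey(long categoryId, long clickCount, long orderCount, long payCount) {
        this.categoryId = categoryId;
        this.clickCount = clickCount;
        this.orderCount = orderCount;
        this.payCount = payCount;
    }

    // Compare by clicks first; fall back to orders, then payments, only on ties.
    @Override
    public int compareTo(CategorySortKey other) {
        int byClick = Long.compare(clickCount, other.clickCount);
        if (byClick != 0) {
            return byClick;
        }
        int byOrder = Long.compare(orderCount, other.orderCount);
        if (byOrder != 0) {
            return byOrder;
        }
        return Long.compare(payCount, other.payCount);
    }

    // Returns the ids of the n largest keys, largest first.
    static List<Long> topCategoryIds(List<CategorySortKey> keys, int n) {
        List<CategorySortKey> sorted = new ArrayList<>(keys);
        sorted.sort(Collections.reverseOrder());
        List<Long> ids = new ArrayList<>();
        for (CategorySortKey key : sorted.subList(0, Math.min(n, sorted.size()))) {
            ids.add(key.categoryId);
        }
        return ids;
    }
}
```

In a Spark job, a key like this would typically also need to be serializable so that it can be shipped between executors and used with a sort-by-key operation.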

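The proportional random extraction of sessions by time period, also listed in point 9, can be sketched roughly as follows. This is a plain-Java illustration under stated assumptions: the class `HourlySessionSampler`, the method `sampleIndexes`, and the use of integer floor division for each hour's quota are hypothetical choices, not the course's implementation. Each hour receives a share of the total sample proportional to its session count, and distinct random indexes are then drawn within that hour.

```java
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: names and rounding policy are assumptions for illustration.
final class HourlySessionSampler {

    // For each hour, picks distinct random indexes into that hour's session list.
    // The quota for an hour is floor(hourCount * extractTotal / totalCount).
    static Map<String, Set<Integer>> sampleIndexes(
            Map<String, Integer> sessionsPerHour, int extractTotal, Random random) {
        long total = 0;
        for (int count : sessionsPerHour.values()) {
            total += count;
        }
        Map<String, Set<Integer>> result = new TreeMap<>();
        if (total == 0) {
            return result;
        }
        for (Map.Entry<String, Integer> entry : sessionsPerHour.entrySet()) {
            int count = entry.getValue();
            // Integer arithmetic avoids floating-point rounding surprises.
            int quota = (int) Math.min(count, (long) count * extractTotal / total);
            Set<Integer> indexes = new TreeSet<>();
            while (indexes.size() < quota) {
                indexes.add(random.nextInt(count)); // duplicates are ignored by the set
            }
            result.put(entry.getKey(), indexes);
        }
        return result;
    }
}
```

Because each quota is rounded down, the total number of sampled sessions can be slightly smaller than `extractTotal`; a real job would have to decide how to distribute the remainder.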
Part 1: Big data cluster setup

Lecture 1: Course introduction
Lecture 2: Course environment setup: CentOS 6.4 cluster setup
Lecture 3: Course environment setup: hadoop-2.5.0-cdh5.3.6 cluster setup
Lecture 4: Course environment setup: hive-0.13.1-cdh5.3.6 installation
Lecture 5: Course environment setup: zookeeper-3.4.5-cdh5.3.6 cluster setup
Lecture 6: Course environment setup: kafka_2.9.2-0.8.1 cluster setup
Lecture 7: Course environment setup: flume-ng-1.5.0-cdh5.3.6 installation
Lecture 8: Course environment setup: introduction to the offline log collection flow
Lecture 9: Course environment setup: introduction to the real-time data collection flow
Lecture 10: Course environment setup: Spark 1.5.1 client installation and YARN-based submission modes

Part 2: User visit session analysis

Lecture 11: User visit session analysis: module introduction
Lecture 12: User visit session analysis: basic data structures and big data platform architecture
Lecture 13: User visit session analysis: requirements analysis
Lecture 14: User visit session analysis: technical solution design
Lecture 15: User visit session analysis: data table design
Lecture 16: User visit session analysis: Eclipse project setup and tools
Lecture 17: User visit session analysis: developing the configuration management component
Lecture 18: User visit session analysis: JDBC principles and a create/read/update/delete demo
Lecture 19: User visit session analysis: database connection pool principles
Lecture 20: User visit session analysis: the singleton design pattern
Lecture 21: User visit session analysis: inner classes and anonymous inner classes
Lecture 22: User visit session analysis: developing the JDBC helper component (part 1)
Lecture 23: User visit session analysis: developing the JDBC helper component (part 2)
Lecture 24: User visit session analysis: the JavaBean concept
Lecture 25: User visit session analysis: the DAO pattern and TaskDAO development
Lecture 26: User visit session analysis: the factory pattern and DAOFactory development
Lecture 27: User visit session analysis: the JSON data format and fastjson
Lecture 28: User visit session analysis: building the Spark context and generating mock data
Lecture 29: User visit session analysis: aggregating data at session granularity
Lecture 30: User visit session analysis: filtering session-granularity aggregated data by the filter parameters
Lecture 31: User visit session analysis: session aggregation statistics, custom Accumulator
Lecture 32: User visit session analysis: session aggregation statistics, refactoring approach and refactoring the session aggregation
Lecture 33: User visit session analysis: session aggregation statistics, refactoring the filter step to collect statistics
Lecture 34: User visit session analysis: session aggregation statistics, computing the results and writing them to MySQL
Lecture 35: User visit session analysis: session aggregation statistics, local testing
Lecture 36: User visit session analysis: session aggregation statistics, custom Accumulator in Scala
Lecture 37: User visit session analysis: session random sampling, analysis of the implementation approach
Lecture 38: User visit session analysis: session random sampling, computing the number of sessions per hour
Lecture 39: User visit session analysis: session random sampling, implementing the time-proportional random sampling algorithm
Lecture 40: User visit session analysis: session random sampling, extracting sessions by random index
Lecture 41: User visit session analysis: session random sampling, fetching the details of the sampled sessions
Lecture 42: User visit session analysis: session random sampling, local testing
Lecture 43: User visit session analysis: Top 10 popular categories, requirements review and implementation approach
Lecture 44: User visit session analysis: Top 10 popular categories, getting all categories visited by the sessions
Lecture 45: User visit session analysis: Top 10 popular categories, computing click, order, and payment counts per category
Lecture 46: User visit session analysis: Top 10 popular categories, joining categories with their click, order, and payment counts
Lecture 47: User visit session analysis: Top 10 popular categories, custom secondary sort key
Lecture 48: User visit session analysis: Top 10 popular categories, performing the secondary sort
Lecture 49: User visit session analysis: Top 10 popular categories, getting the top 10 categories and writing them to MySQL
Lecture 50: User visit session analysis: Top 10 popular categories, local testing
Lecture 51: User visit session analysis: Top 10 popular categories, secondary sort in Scala
Lecture 52: User visit session analysis: Top 10 active sessions, development preparation and generating the top 10 category RDD
Lecture 53: User visit session analysis: Top 10 active sessions, computing each session's click count for the top 10 categories
Lecture 54: User visit session analysis: Top 10 active sessions, using the grouped top-N algorithm to get the top 10 active sessions
Lecture 55: User visit session analysis: Top 10 active sessions, local testing and stage summary

Part 3: Enterprise-grade performance tuning, troubleshooting experience, and data skew solutions

Lecture 56: User visit session analysis: performance tuning, allocating more resources in real projects
Lecture 57: User visit session analysis: performance tuning, adjusting parallelism in real projects
Lecture 58: User visit session analysis: performance tuning, refactoring the RDD architecture and persisting RDDs in real projects
Lecture 59: User visit session analysis: performance tuning, broadcasting large variables in real projects
Lecture 60: User visit session analysis: performance tuning, using Kryo serialization in real projects
Lecture 61: User visit session analysis: performance tuning, using fastutil to optimize data formats in real projects
Lecture 62: User visit session analysis: performance tuning, adjusting the data locality wait time in real projects
Lecture 63: User visit session analysis: JVM tuning, overview of principles and lowering the memory fraction used by cache operations
Lecture 64: User visit session analysis: JVM tuning, executor memory and connection wait time
Lecture 65: User visit session analysis: shuffle tuning, overview of principles
Lecture 66: User visit session analysis: shuffle tuning, merging map-side output files
Lecture 67: User visit session analysis: shuffle tuning, map-side memory buffer and reduce-side memory fraction
Lecture 68: User visit session analysis: shuffle tuning, HashShuffleManager and SortShuffleManager
Lecture 69: User visit session analysis: operator tuning, using mapPartitions to improve the performance of map-type operations
Lecture 70: User visit session analysis: operator tuning, using coalesce after filter to reduce the number of partitions
Lecture 71: User visit session analysis: operator tuning, using foreachPartition to optimize database writes
Lecture 72: User visit session analysis: operator tuning, using repartition to fix low parallelism in Spark SQL
Lecture 73: User visit session analysis: operator tuning, introduction to reduceByKey local aggregation
Lecture 74: User visit session analysis: troubleshooting, controlling the shuffle reduce-side buffer size to avoid OOM
Lecture 75: User visit session analysis: troubleshooting, resolving shuffle file fetch failures caused by JVM GC
Lecture 76: User visit session analysis: troubleshooting, resolving application failures caused by insufficient YARN queue resources
Lecture 77: User visit session analysis: troubleshooting, resolving errors caused by serialization
Lecture 78: User visit session analysis: troubleshooting, resolving problems caused by operator functions returning null
Lecture 79: User visit session analysis: troubleshooting, resolving the network card traffic surge caused by yarn-client mode
Lecture 80: User visit session analysis: troubleshooting, resolving JVM stack memory overflow in yarn-cluster mode
Lecture 81: User visit session analysis: troubleshooting, incorrect persistence usage and using checkpoint
Lecture 82: User visit session analysis: data skew solutions, principles and symptom analysis
Lecture 83: User visit session analysis: data skew solutions, aggregating source data and filtering out the keys that cause skew
Lecture 84: User visit session analysis: data skew solutions, increasing the reduce parallelism of shuffle operations
Lecture 85: User visit session analysis: data skew solutions, using random keys for two-stage aggregation
Lecture 86: User visit session analysis: data skew solutions, converting a reduce join into a map join
Lecture 87: User visit session analysis: data skew solutions, sampling the skewed keys and joining them separately
Lecture 88: User visit session analysis: data skew solutions, using random numbers and table expansion for joins

Part 4: Page single-hop conversion rate statistics

Lecture 89: Page single-hop conversion rate: module introduction
Lecture 90: Page single-hop conversion rate: requirements analysis, technical design, and data table design
Lecture 91: Page single-hop conversion rate: writing the basic code
Lecture 92: Page single-hop conversion rate: implementing page split generation and the page flow matching algorithm
Lecture 93: Page single-hop conversion rate: computing the PV of the page flow's start page
Lecture 94: Page single-hop conversion rate: computing the page split conversion rates
Lecture 95: Page single-hop conversion rate: writing the page split conversion rates to MySQL
Lecture 96: Page single-hop conversion rate: local testing
Lecture 97: Page single-hop conversion rate: production environment testing
Lecture 98: User visit session analysis: production environment testing

Part 5: Popular product statistics by region

Lecture 99: Popular product statistics by region: module introduction
Lecture 100: Popular product statistics by region: requirements analysis, technical design, and data design
Lecture 101: Popular product statistics by region: querying click behavior data for a user-specified date range
Lecture 102: Popular product statistics by region: heterogeneous data sources, querying city data from MySQL
Lecture 103: Popular product statistics by region: joining city information, converting the RDD to a DataFrame, and registering a temporary table
Lecture 104: Popular product statistics by region: developing the custom UDAF aggregate function group_concat_distinct()
Lecture 105: Popular product statistics by region: querying the click count of each product in each region and joining the city list
Lecture 106: Popular product statistics by region: joining product information and tagging the business type with the custom get_json_object function and the built-in if function
Lecture 106: Popular product statistics by region: using a window function to compute the top 3 popular products in each region
Lecture 107: Popular product statistics by region: using the built-in case when function to label each region's level
Lecture 108: Popular product statistics by region: writing the result data to MySQL
Lecture 109: Popular product statistics by region: Spark SQL data skew solutions
Lecture 110: Popular product statistics by region: production environment testing

Part 6: Real-time ad click traffic statistics

Lecture 111: Real-time ad click traffic statistics: requirements analysis, technical design, and data design
Lecture 112: Real-time ad click traffic statistics: for the dynamic blacklist, computing each user's daily click count on each ad in real time
Lecture 113: Real-time ad click traffic statistics: writing real-time results to MySQL in a high-performance way
Lecture 114: Real-time ad click traffic statistics: filtering each batch for blacklisted users to generate the dynamic blacklist
Lecture 115: Real-time ad click traffic statistics: filtering click behavior based on the dynamic blacklist
Lecture 116: Real-time ad click traffic statistics: computing the daily ad click count for each city in each province
Lecture 117: Real-time ad click traffic statistics: computing the daily top 3 popular ads in each province
Lecture 118: Real-time ad click traffic statistics: computing each ad's click trend over a sliding window of the last hour
Lecture 119: Real-time ad click traffic statistics: high availability (HA) for the real-time computing program
Lecture 120: Real-time ad click traffic statistics: performance tuning for the real-time computing program
Lecture 121: Real-time ad click traffic statistics: production environment testing
Lecture 122: Course summary: what have you learned?

Goals

Goal 1: Master building a big data cluster environment.
Goal 2: Master building the architecture of an enterprise-grade big data project.
Goal 3: Master the architecture of an interactive big data analysis system based on J2EE and Spark.
Goal 4: Master the development process of enterprise big data projects.
Goal 5: Apply more than 90% of the technical and knowledge points of Spark Core, Spark SQL, and Spark Streaming in the project, so that both the technology and the project are mastered.
Goal 6: Use advanced Spark techniques to develop a variety of complex big data statistics and analysis requirements and functions.
Goal 7: Master enterprise-grade performance tuning solutions, troubleshooting of production failures, and data skew solutions.

Highlights

Highlight 1: Described as the only high-end Spark big data project course on the web.
Highlight 2: The big data project architecture is built to enterprise standards.
Highlight 3: Spark development is explained on top of an interactive big data analysis platform architecture based on Java EE and Spark.
Highlight 4: The course follows a real enterprise-level big data project development process of nearly 10 steps.
Highlight 5: Broad technical coverage: one course covers up to 90% of the technical points of Spark Core, Spark SQL, and Spark Streaming.
Highlight 6: Real enterprise-grade performance tuning solutions, experience in troubleshooting production failures, and data skew solutions.
Highlight 7: Highly complex business functions, all based on real enterprise-level business requirements.
Highlight 8: A large number of Spark techniques that the author describes as unavailable elsewhere on the web.
Highlight 9: The instructor's extensive industry experience, observations, and reflections are shared throughout.
Highlight 10: A complete set of business-grade source code that can be put to use with a little adaptation, with a stated commercial value of over a million.

3. Who this course is for

This course is for students with a solid Spark foundation and basic Java programming skills (Java EE is not required).

4. How to study this course

Some advice on how to study this course follows.

4.1. Suggested schedule

This course has more than 120 lectures. If you have enough time, a pace of 2-3 lectures per day is recommended. If you have plenty of time, it is suggested that you watch the videos on key theoretical topics twice.

4.2. Study requirements

Take notes as you watch; it helps to keep a text editor open on your computer alongside the video. Think through and understand all of the theoretical analysis and explanation; if something is unclear, watch it a second time. Type all of the code yourself while following the video, then put the video aside and type it again on your own, until you can write out the whole project unaided.

4.3. Instructor recommendations

1. After watching a video, put it aside and independently rewrite the examples from the lesson to check whether you have understood them. If you get something wrong, go back and rewatch the video. Repeat this until you truly understand and master the material.
2. For the hands-on cases, be sure to do them yourself; do not settle for just listening.
3. While watching a video, keep pen and paper at hand and take notes. This is a very good study habit.
4. Do not rely too heavily on the videos. Learn to read the API documentation and to search (for example with Baidu), learn to think for yourself, and learn to generalize from one case to others.
5. Finally, I wish you success in your studies!
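
As an illustration of the page single-hop conversion rate module in the syllabus above (lectures 92 to 94), the following plain-Java sketch generates page splits from each session's ordered page visits, keeps the splits that match a target page flow, and divides each split count by the count of the previous step. The class `PageConversion`, the method `conversionRates`, and the `"a_b"` split format are assumptions made for this example; the course's Spark implementation may differ.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: names and the "a_b" split encoding are illustrative assumptions.
final class PageConversion {

    // sessions: page ids of each session, already sorted by visit time.
    // targetFlow: the page flow to measure, e.g. [1, 2, 3].
    static Map<String, Double> conversionRates(List<List<Long>> sessions, List<Long> targetFlow) {
        Map<String, Double> rates = new LinkedHashMap<>();
        if (targetFlow.size() < 2) {
            return rates;
        }
        // Splits we care about: consecutive pairs of the target flow.
        Set<String> targetSplits = new HashSet<>();
        for (int i = 1; i < targetFlow.size(); i++) {
            targetSplits.add(targetFlow.get(i - 1) + "_" + targetFlow.get(i));
        }
        Map<String, Long> splitCounts = new HashMap<>();
        long startPagePv = 0;
        for (List<Long> pages : sessions) {
            for (int i = 0; i < pages.size(); i++) {
                if (pages.get(i).equals(targetFlow.get(0))) {
                    startPagePv++; // page views of the flow's start page
                }
                if (i > 0) {
                    String split = pages.get(i - 1) + "_" + pages.get(i);
                    if (targetSplits.contains(split)) {
                        splitCounts.merge(split, 1L, Long::sum);
                    }
                }
            }
        }
        // Rate of the first split = its count / start page PV;
        // rate of each later split = its count / count of the previous split.
        long previous = startPagePv;
        for (int i = 1; i < targetFlow.size(); i++) {
            String split = targetFlow.get(i - 1) + "_" + targetFlow.get(i);
            long count = splitCounts.getOrDefault(split, 0L);
            rates.put(split, previous == 0 ? 0.0 : (double) count / previous);
            previous = count;
        }
        return rates;
    }
}
```

For example, with four sessions that all visit page 1, three that continue to page 2, and two of those that continue to page 3, the sketch yields a rate of 0.75 for `1_2` and about 0.667 for `2_3`.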

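The data skew solution that uses random keys for two-stage aggregation (lecture 85 in the syllabus) can be illustrated without Spark as well. In this hedged sketch, each key first gets a random prefix so that a hot key is spread over several partial aggregates, and a second pass strips the prefix and combines the partial results. The class `SaltedAggregation`, the method `twoStageCount`, and the `prefix_key` encoding are hypothetical; in a real Spark job the two passes would correspond to two shuffle-based aggregations.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper: names and the "salt_key" encoding are illustrative assumptions.
final class SaltedAggregation {

    // Counts occurrences of each key in two stages. saltBuckets must be positive.
    static Map<String, Long> twoStageCount(List<String> keys, int saltBuckets, Random random) {
        // Stage 1: aggregate on salted keys, so one hot key becomes up to
        // saltBuckets smaller groups.
        Map<String, Long> partial = new HashMap<>();
        for (String key : keys) {
            String salted = random.nextInt(saltBuckets) + "_" + key;
            partial.merge(salted, 1L, Long::sum);
        }
        // Stage 2: strip the numeric prefix and combine the partial counts.
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> entry : partial.entrySet()) {
            String salted = entry.getKey();
            String original = salted.substring(salted.indexOf('_') + 1);
            result.merge(original, entry.getValue(), Long::sum);
        }
        return result;
    }
}
```

The final counts are the same as a direct aggregation; the benefit in a distributed setting is that the work for a hot key in the first stage is shared across several tasks instead of landing on one.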

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
