Project Flow
1. Data generation
Data is generated by the JS SDK and the Java SDK.
How does the data reach the nginx server? The SDK splices the collected information into a URI and sends an HTTP request to the nginx server; nginx receives the request and records the information in its log, whose format you define yourself. Pay attention to high availability (HA) and load balancing, depending on the actual business scenario: if you only compute statistics such as PV, losing a little data does not matter and HA is optional, but if you are collecting back-end user order information, no data may be lost and HA must be configured.
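The URI splicing described above can be sketched as follows. This is a minimal illustration, not the SDK's actual code: the endpoint `http://nginx.example.com/log.gif` and the parameter names (`en`, `u_ud`, `p_url`) are hypothetical, and the sketch only builds the URL without sending the request.

```python
from urllib.parse import urlencode, urlsplit, parse_qsl


def build_tracking_url(base_url, params):
    """Splice event parameters into the query string of the collection URL."""
    # urlencode percent-encodes each value, so characters such as '&', '?'
    # and '/' inside a value cannot break the query string.
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and parameter names, for illustration only.
event = {
    "en": "e_pv",
    "u_ud": "user-123",
    "p_url": "http://shop.example.com/item?id=7",
}
url = build_tracking_url("http://nginx.example.com/log.gif", event)
print(url)

# The server side can recover the same parameters from the request URI.
recovered = dict(parse_qsl(urlsplit(url).query))
```

An SDK would then issue an HTTP GET for this URL; nginx only needs to log the request URI for the parameters to be recoverable later.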
2. Data collection
Use Flume to collect the logs into HDFS (decide whether the Flume setup needs to be highly available and whether to use aggregation nodes); the target directory is generated dynamically from the time.
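The time-based directory layout can be illustrated with a small sketch. In Flume itself this is normally done with time escape sequences such as `%Y/%m/%d` in the HDFS sink path; the function below only mimics that mapping. The root `/logs`, the day-level granularity, and the use of UTC are assumptions.

```python
from datetime import datetime, timezone


def hdfs_dir_for(epoch_ms, root="/logs"):
    """Map an event timestamp (milliseconds since the epoch) to a daily directory."""
    # UTC is assumed here; a real deployment would use its own time zone.
    dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return f"{root}/{dt:%Y/%m/%d}"


print(hdfs_dir_for(1449410609283))
```

Partitioning by day keeps each later cleaning or analysis job limited to the directories of the time range it needs.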
3. MR data cleaning
Remove records without a timestamp (we partition and analyze by time, so records without a time are meaningless);
Remove records whose field count is not 4 (such records are treated as crawler traffic);
IP parsing into region information (using the Chunzhen IP database and the Taobao IP lookup service);
UserAgent parsing: obtain the name and version number of the browser and of the operating system;
LogParser parsing: extract the timestamp and convert it to milliseconds, process the URI parameter list, and store the result in a map as <K,V> pairs;
Combine all of the parsing results above, format the data, and store it in HDFS.
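The filtering and LogParser steps above can be sketched for a single record. The log layout is an assumption: four fields (IP, timestamp in seconds with a millisecond fraction, host, request URI) joined by the literal separator `^A`. IP and UserAgent parsing are left out because they depend on external databases.

```python
from urllib.parse import urlsplit, parse_qsl

SEP = "^A"  # assumed field separator; use whatever your nginx log format defines


def clean(line):
    """Return a cleaned record as a dict, or None if the line must be dropped."""
    fields = line.rstrip("\n").split(SEP)
    if len(fields) != 4:
        return None  # wrong field count: treated as crawler traffic
    ip, ts, host, uri = fields
    try:
        # Seconds with a fractional part -> integer milliseconds.
        ts_ms = round(float(ts) * 1000)
    except (ValueError, OverflowError):
        return None  # missing or unusable timestamp
    # Store the URI parameter list as key/value pairs.
    params = dict(parse_qsl(urlsplit(uri).query))
    return {"ip": ip, "ts_ms": ts_ms, "host": host, "params": params}


sample = "192.168.1.1^A1449410609.283^Anginx.example.com^A/log.gif?en=e_pv&u_ud=user-123"
print(clean(sample))
```

In an MR job this logic would sit in the mapper, which writes out only the records for which the function returns a value.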
5. Data application
Display, machine learning, data mining, decision support, etc.
Supplement:
In an interview you will certainly be asked about the business logic: which modules you worked on, which (and how many) indicators you built in each module, how each was implemented, and what problems you ran into. (Don't bring up simple problems such as null pointers; that only shows that your Java fundamentals are weak.)
Hive (important): many ETL engineers and data warehouse engineers are being recruited now. Many companies used Oracle in the past, which is a paid product; now they rebuild the data warehouse from scratch on Hive.
If you really do ETL, the work is not difficult, so the room for growth is limited, but you will come into contact with more of the business.
If you want to develop further in a company, get involved in the business. You may not understand it at first, but once you do, your voice in the company will carry more and more weight.
A very important idea of big data: divide and conquer.
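As a toy illustration of divide and conquer, the sketch below counts page views by splitting the records into chunks, counting each chunk independently, and merging the partial results. It runs in one process; in MapReduce the same split, count, and merge steps are spread across machines. The page list and chunk size are made up.

```python
from collections import Counter


def split_chunks(records, size):
    """Divide: cut the record list into chunks of at most `size` items."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_chunk(chunk):
    """Conquer: count page views within one chunk, independently of the others."""
    return Counter(chunk)


def merge(partials):
    """Combine: add the per-chunk counts together."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


pages = ["/home", "/item", "/home", "/cart", "/home", "/item"]
partials = [count_chunk(c) for c in split_chunks(pages, 2)]
total = merge(partials)
print(total)
```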