Research and Application of Internet Access Data Preprocessing Based on Hadoop
Donghua University Huang Hang Fai
This paper studies preprocessing techniques for Internet access log data in a Hadoop environment, together with their application. It first outlines the background and significance of the topic, introduces the main content of the paper, and reviews related research in China and abroad. It then gives a brief overview of Web log mining, with emphasis on Web log preprocessing, and summarizes each preprocessing step. Next, it introduces Hadoop, a popular distributed platform for big data processing, examines current single-machine data processing techniques, and ports them to the Hadoop environment. On this basis, it proposes a session recognition algorithm based on union-find (disjoint sets) that relies on the user's account information recorded in the logged cookies, which provides more accurate user data for later data mining. Finally, building on user identification, it applies natural language processing to the search records in users' browsing logs and extracts each user's search keywords together with the categories of those keywords. These keywords make it possible to summarize a user's points of interest over a given period. Building on existing research, the paper makes the following contributions. First, it addresses a key problem in Internet Web log mining, namely the preprocessing of Web log data, and, in view of the limitations of current single-machine processing of big data, successfully ports this preprocessing to the distributed processing platform Hadoop.
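The abstract does not spell out the union-find algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: each log record carries a cookie identifier and the account identifiers seen with it, and any identifiers that appear together in a record are taken to belong to the same user. The field names `cookie_id` and `accounts`, and the function `identify_users`, are hypothetical and not taken from the paper.

```python
from collections import defaultdict


class DisjointSet:
    """Union-find with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen items as their own singleton set.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        # Walk up to the root of x's set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def identify_users(records):
    """Group identifiers into users (hypothetical record layout).

    Identifiers that co-occur in one record are merged into one set,
    so two cookies that share an account end up as the same user.
    """
    ds = DisjointSet()
    for rec in records:
        cookie = ("cookie", rec["cookie_id"])
        ds.find(cookie)  # make sure cookies without accounts are kept
        for account in rec.get("accounts", []):
            ds.union(cookie, ("account", account))
    groups = defaultdict(set)
    for item in list(ds.parent):
        groups[ds.find(item)].add(item)
    return list(groups.values())


# Illustrative records only; not data from the paper.
sample_records = [
    {"cookie_id": "c1", "accounts": ["alice@mail"]},
    {"cookie_id": "c2", "accounts": ["alice@mail", "alice_im"]},
    {"cookie_id": "c3", "accounts": []},
]
users = identify_users(sample_records)
```

In this sample, cookies `c1` and `c2` share an account and so fall into one group, while `c3` stays on its own. In a Hadoop setting the merge step would need to be distributed across map and reduce phases, which this single-process sketch does not attempt.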
Second, it proposes a session recognition algorithm based on the link relationships between the pages a user browses, identifies users from their account information, and applies natural language processing to users' search records to extract their search keywords and the categories of those keywords. This prepares the ground for later mining of users' interests, hobbies, and behavioral habits. The work in this paper lays a foundation for future research on Web log preprocessing.
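The abstract gives no details of the link-based session recognition, so the sketch below uses the widely known referrer heuristic as a stand-in, not the paper's own algorithm. It assumes each request of a single user is a `(timestamp_seconds, url, referrer)` tuple; a request joins the most recent session that already contains its referrer and is still within a timeout, and otherwise opens a new session. The function name `split_sessions` and the 30-minute default timeout are assumptions.

```python
def split_sessions(requests, timeout=1800):
    """Split one user's requests into sessions using page links.

    requests: iterable of (timestamp_seconds, url, referrer) tuples,
              where referrer is None when the log has none.
    Returns a list of sessions, each a list of URLs in visit order.
    """
    sessions = []  # each entry: {"pages": [...], "seen": set(), "last": ts}
    for ts, url, referrer in sorted(requests, key=lambda r: r[0]):
        target = None
        # Prefer the most recently opened session that links to this page.
        for session in reversed(sessions):
            linked = referrer in session["seen"]
            fresh = ts - session["last"] <= timeout
            if linked and fresh:
                target = session
                break
        if target is None:
            # No session contains the referrer: start a new one.
            target = {"pages": [], "seen": set(), "last": ts}
            sessions.append(target)
        target["pages"].append(url)
        target["seen"].add(url)
        target["last"] = ts
    return [s["pages"] for s in sessions]


# Illustrative requests only; not data from the paper.
sample_requests = [
    (0, "/home", None),
    (30, "/news", "/home"),
    (60, "/shop", None),
    (90, "/news/1", "/news"),
    (100, "/shop/cart", "/shop"),
    (5000, "/news/2", "/news"),
]
sessions = split_sessions(sample_requests)
# sessions == [["/home", "/news", "/news/1"],
#              ["/shop", "/shop/cart"],
#              ["/news/2"]]
```

The last request has a known referrer but arrives after the timeout, so it opens a third session. Interleaved browsing in two tabs (`/news` and `/shop`) is separated because each request follows its own referrer chain rather than strict time order.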