Research on a Distributed Data Stream Clustering Algorithm Based on Hadoop MapReduce

Keywords: data mining, Hadoop MapReduce, clustering, distributed clustering, data stream clustering

Cai Binlei, Ningjiadong, Zhu Shiwei, Guo Qin

As data streams grow in scale, existing grid-based clustering algorithms become ineffective for data stream clustering: they cannot discover arbitrarily shaped clusters in real time, and they cannot delete noise points from the stream promptly. This paper proposes Pgdc-stream, a distributed data stream clustering algorithm based on grid density for the Hadoop platform. It supports parallel clustering analysis of data streams within the Hadoop MapReduce framework and discovers arbitrarily shaped clusters in the stream in real time. A detection cycle and a density threshold function are defined so that noise points are deleted from the stream promptly. After an initial grid-density-based clustering of the data stream, the algorithm applies a noise-handling strategy based on the density threshold function to detect and delete noise points periodically, and it periodically adjusts the generated clusters using a parallel analysis model built on the Hadoop MapReduce framework. Experimental results show that Pgdc-stream outperforms CluStream in clustering quality, scalability, and real-time performance on large-scale data streams.
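The abstract does not give the algorithm's details, so the following is only a minimal single-process sketch of the general idea of grid-density clustering expressed in a map/reduce style. It is not the authors' Pgdc-stream implementation. The function names, the fixed `cell_size`, and the constant `threshold` (standing in for the paper's density threshold function, whose form is not stated here) are all assumptions made for illustration. Each call to `cluster_batch` plays the role of one detection cycle: points are mapped to grid cells, per-cell densities are reduced, sparse cells are dropped as noise, and adjacent dense cells are joined so that clusters can take arbitrary shapes.

```python
from collections import Counter, deque
from itertools import product


def map_point(point, cell_size):
    """Map step: emit (grid cell, 1) for a single point."""
    cell = tuple(int(x // cell_size) for x in point)
    return cell, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each grid cell."""
    density = Counter()
    for cell, count in pairs:
        density[cell] += count
    return density


def prune_noise(density, threshold):
    """Keep only cells whose density reaches the threshold.

    Assumption: a constant threshold stands in for the paper's
    density threshold function, which the abstract does not define.
    """
    return {cell: d for cell, d in density.items() if d >= threshold}


def connect_cells(dense):
    """Label groups of adjacent dense cells with a breadth-first search."""
    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            # Visit every neighbouring cell, including diagonal ones.
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbor = tuple(c + o for c, o in zip(cell, offset))
                if neighbor in dense and neighbor not in labels:
                    labels[neighbor] = next_label
                    queue.append(neighbor)
        next_label += 1
    return labels


def cluster_batch(points, cell_size=1.0, threshold=2):
    """One hypothetical detection cycle over a batch of stream points."""
    density = reduce_counts(map_point(p, cell_size) for p in points)
    return connect_cells(prune_noise(density, threshold))


if __name__ == "__main__":
    batch = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.5, 0.5),
             (5.0, 5.0), (9.1, 9.1), (9.2, 9.3)]
    print(cluster_batch(batch))
```

In this sketch the isolated point at (5.0, 5.0) falls in a cell with density 1 and is discarded as noise, while the two neighbouring dense cells near the origin are merged into one cluster. In an actual Hadoop job the map and reduce steps would run as distributed tasks rather than as local function calls.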

