Research on a Distributed Data Stream Clustering Algorithm Based on Hadoop MapReduce
Cai Binlei, Ningjiadong, Zhu Shiwei, Guo Qin
As data streams continue to grow in scale, existing grid-based clustering algorithms cluster them ineffectively: they cannot find arbitrarily shaped clusters in real time, and they cannot delete noise points from the stream promptly. This paper proposes Pgdc-stream, a distributed data stream clustering algorithm based on grid density for the Hadoop platform. It supports parallel clustering analysis of data streams within the Hadoop MapReduce framework and discovers arbitrarily shaped clusters in the stream in real time. The algorithm defines a detection cycle and a density threshold function so that noise points are deleted promptly. After an initial grid-density-based clustering of the data stream, it applies a noise-handling strategy based on the density threshold function, periodically detecting and deleting noise points, and periodically adjusts the generated clusters using a parallel analysis model built on the Hadoop MapReduce framework. Experimental results show that Pgdc-stream outperforms CluStream in clustering quality, scalability, and real-time performance on large-scale data streams.
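The abstract does not give the algorithm's formulas, so the following is only a minimal single-machine sketch of the general idea it describes: map each point to a grid cell, reduce each cell to a density, drop cells below a density threshold as noise, and join adjacent surviving cells into arbitrarily shaped clusters. Everything in it is an assumption made for illustration, not the authors' method: the function names, the exponential decay of older points, the constant `threshold` standing in for the paper's density threshold function, and the rule that cells touching on a side or corner belong to the same cluster. The map and reduce steps are plain Python functions, not Hadoop jobs.

```python
import math
from collections import defaultdict, deque
from itertools import product


def map_point(point, timestamp, cell_size):
    """Map step (hypothetical): emit (grid cell, timestamp) for one point."""
    cell = tuple(math.floor(x / cell_size) for x in point)
    return cell, timestamp


def reduce_cell(timestamps, now, decay):
    """Reduce step (hypothetical): decayed density of one cell at time `now`.

    Each point contributes decay ** age, so older points count less.
    The decay model is an assumption; the abstract does not specify one.
    """
    return sum(decay ** (now - t) for t in timestamps)


def grid_densities(stream, now, cell_size=1.0, decay=0.98):
    """Group (timestamp, point) pairs by cell, then reduce each group."""
    groups = defaultdict(list)
    for timestamp, point in stream:
        cell, ts = map_point(point, timestamp, cell_size)
        groups[cell].append(ts)
    return {cell: reduce_cell(ts, now, decay) for cell, ts in groups.items()}


def remove_noise(densities, threshold):
    """Keep only cells whose density reaches the (assumed constant) threshold."""
    return {cell: d for cell, d in densities.items() if d >= threshold}


def connect_cells(cells):
    """Join adjacent cells (sides or corners touching) into clusters via BFS."""
    remaining = set(cells)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster, queue = {seed}, deque([seed])
        while queue:
            cell = queue.popleft()
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbor = tuple(c + o for c, o in zip(cell, offset))
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    cluster.add(neighbor)
                    queue.append(neighbor)
        clusters.append(cluster)
    return clusters


# Toy stream of (timestamp, 2-D point) pairs; decay=1.0 disables aging
# so the densities are plain point counts.
stream = [
    (1, (0.2, 0.3)), (2, (0.7, 0.1)),   # two points in cell (0, 0)
    (3, (1.4, 0.6)), (4, (1.1, 0.2)),   # two points in cell (1, 0)
    (5, (5.5, 5.5)), (6, (5.1, 5.9)),   # two points in cell (5, 5)
    (7, (9.0, 0.5)),                    # a lone point in cell (9, 0)
]
densities = grid_densities(stream, now=7, cell_size=1.0, decay=1.0)
dense = remove_noise(densities, threshold=1.5)
clusters = connect_cells(dense)
```

In this toy run the lone point's cell falls below the threshold and is discarded, the two neighboring cells merge into one cluster, and the distant cell forms a second cluster. In a periodic scheme like the one the abstract describes, `remove_noise` and `connect_cells` would be rerun once per detection cycle over the current densities.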