Paper Reading | Clustrophile 2: Guided Visual Clustering Analysis

Source: Internet
Author: User


The left sidebar lets you import data or open previously saved results. The right side shows a log of all actions, so you can easily return to a previous state. The upper part of the main area is the data view, and the lower part is the cluster view.

INTRODUCTION

Data clustering is a very effective tool for processing unlabeled, high-dimensional data. However, it is difficult to determine the best clustering method and parameters, so help from a visual system is needed. Clustrophile 2 is a new interactive tool for guided cluster analysis. It guides users through clustering-based exploratory analysis, adapts to user feedback to improve the clustering results, and helps users quickly reason about differences between clusters. To this end, Clustrophile 2 offers a novel feature, the Clustering Tour, which helps users select clustering parameters and assess how well the results match their current analysis goals and expectations. The authors evaluated the system through a user study with 12 data scientists. The results show that Clustrophile 2 improves the speed and effectiveness of exploratory clustering analysis for both experts and non-experts.

DESIGN CRITERIA

The paper summarizes nine design criteria for Clustrophile 2:

    • Show variation within clusters (quickly display the clustering result).
    • Allow quick iteration over parameters (update parameters in real time).
    • Represent clustering instances compactly (multi-view display).
    • Facilitate interpretable naming (naming and describing clusters).
    • Support analysis of large datasets (large data support).
    • Support reasoning about clusters and clustering instances (inference about and evaluation of clustering results).
    • Promote multiscale exploration (explore clusterings at multiple scales).
    • Keep a stateful representation of the current analysis (save the current exploration state).
    • Guide users in clustering analysis (user guidance).

The authors list the following contributions:

    • Building on the original Clustrophile system, add a rich set of clustering algorithms, parameters, evaluation metrics, and visualization tools.

    • Develop the Clustering Tour, an integrated feature that guides users through cluster analysis.

    • Define a more reasonable measure of clustering quality that takes into account user feedback, interpretability, and so on.

USER INTERFACE and INTERACTIONS

The visualization system has three main parts: the cluster views, the recommendation assistant interface, and the Clustering Tour.

Visualization views

In the cluster view, the scatter plot shows the clustering results projected onto a 2D plane after dimensionality reduction, where the distance between points encodes their similarity. In the heatmap on the right, each column represents a cluster, each row represents a feature, and the color shade encodes the relative magnitude of the values.
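
To make the heatmap encoding concrete, here is a minimal sketch of how such a cluster-by-feature matrix could be computed. It assumes (this is not stated in the source) that each cell is the mean of a feature within a cluster, min-max scaled per feature so that shades are comparable within a row. The function name `cluster_feature_heatmap` and the toy data are hypothetical.

```python
from statistics import mean

def cluster_feature_heatmap(rows, labels):
    """Return {cluster: [relative value per feature]}, scaled to [0, 1] per feature."""
    n_features = len(rows[0])
    clusters = sorted(set(labels))
    # Mean of each feature within each cluster.
    means = {
        c: [mean(r[j] for r, l in zip(rows, labels) if l == c)
            for j in range(n_features)]
        for c in clusters
    }
    # Min-max scale each feature across clusters (assumed normalization).
    heat = {c: [] for c in clusters}
    for j in range(n_features):
        column = [means[c][j] for c in clusters]
        lo, hi = min(column), max(column)
        span = hi - lo
        for c in clusters:
            # A feature that does not vary across clusters gets a neutral shade.
            heat[c].append(0.5 if span == 0 else (means[c][j] - lo) / span)
    return heat

# Toy data: 4 points, 3 features, 2 clusters.
rows = [[1.0, 10.0, 5.0], [3.0, 10.0, 5.0], [7.0, 20.0, 1.0], [9.0, 40.0, 1.0]]
labels = [0, 0, 1, 1]
heat = cluster_feature_heatmap(rows, labels)
for cluster, shades in heat.items():
    print(cluster, shades)
```

Reading the result column by column, cluster 1 is relatively high on the first two features and cluster 0 on the third, which is the kind of contrast the heatmap is meant to expose at a glance.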

In addition, the data table view lets users inspect and filter the data.

Choosing Parameters and Guiding Users Towards a Better Clustering

Clustrophile 2 provides recommendations for tuning the algorithms, along with a comprehensive set of measures of clustering quality:

    • Clustrophile 2 supports auto-tuning, removes features with low variance, and supports custom sampling.

    • It recommends a suitable clustering algorithm based on the characteristics of the data, and a suitable number of clusters based on the hierarchical clustering dendrogram.

    • It compares different projection methods and recommends the projection (dimensionality reduction) algorithm that best preserves the compactness and separation of the clusters.
    • Clustering results are measured quantitatively in terms of the skewness of the distribution, the density of sub-clusters, the robustness of the algorithm to noise, and the monotonicity of the cost function.
    • By training a decision tree on the clustering results, users can infer the main characteristics of the data points in each cluster.
    • It supports analyzing outliers in the cluster distributions, removing them, and re-clustering.
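
Two of the ideas in the list above can be sketched in a few lines: dropping low-variance features, and scoring a clustering by how compact and well separated its clusters are. The sketch below uses the silhouette coefficient as a stand-in for "compactness and separation"; the source does not say which metric the tool uses, so treat that choice, the function names, and the variance threshold as assumptions. It expects at least two clusters.

```python
from math import dist
from statistics import mean, pvariance

def drop_low_variance(rows, threshold):
    """Keep only feature columns whose population variance exceeds threshold."""
    keep = [j for j in range(len(rows[0]))
            if pvariance([r[j] for r in rows]) > threshold]
    return [[r[j] for j in keep] for r in rows], keep

def silhouette(rows, labels):
    """Mean silhouette coefficient; values near 1 mean compact, separated clusters."""
    clusters = set(labels)
    scores = []
    for i, (p, own) in enumerate(zip(rows, labels)):
        # Distances from point i to every other point, grouped by cluster.
        by_cluster = {c: [] for c in clusters}
        for k, (q, l) in enumerate(zip(rows, labels)):
            if k != i:
                by_cluster[l].append(dist(p, q))
        if not by_cluster[own]:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = mean(by_cluster[own])  # mean distance within the own cluster
        b = min(mean(d) for c, d in by_cluster.items() if c != own and d)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return mean(scores)

# Toy data: the third feature is constant and carries no information.
rows = [[0.0, 1.0, 5.0], [0.0, 1.2, 5.0], [10.0, 1.1, 5.0], [10.0, 0.9, 5.0]]
reduced, kept = drop_low_variance(rows, threshold=0.001)
good = silhouette(reduced, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette(reduced, [0, 1, 0, 1])   # mixes distant points
print(kept, round(good, 2), round(bad, 2))
```

A score like this could be used to rank alternative parameter settings or projections, which is roughly the role the quantitative measures play in the recommendations described above.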

Clustering Tour

By iteratively changing the clustering parameters, the user can dynamically explore the space of possible clustering solutions until a satisfactory one is found. However, even with guidance on parameter selection, the space of possible parameter combinations and clustering solutions is too large to explore fully by hand. Some parameter choices affect the clustering results to a large extent, while others have minimal effect. With this in mind, the authors introduce the Clustering Tour feature to help users quickly explore the space of possible clustering results. Its interface contains (a) a list of previously explored solutions, (b, c) scatter plot and heatmap visualizations of the current solution, (d) a mode selection in which the user can constrain how the parameters are updated, and (e) a set of buttons with which the user gives feedback by liking or rejecting a solution.

The whole process is similar to simulated annealing. First, the system finds clustering results that differ substantially from one another, based on the similarity between clusterings. Users browse these solutions in turn. If the user likes a solution, the system makes minor parameter changes based on it, which is equivalent to descending to a child node. If the user is not satisfied, the system returns to the parent node. This continues until a stopping condition is reached, such as a limit on the exploration time or on the number of solutions.
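
The like/reject loop described above can be sketched as a depth-first walk over a tree of parameter settings. This is a simplified illustration, not the paper's algorithm: the diverse starting solutions are passed in rather than computed from clustering similarity, the human feedback is replaced by a hypothetical `likes` function, and the names `clustering_tour` and `neighbors` are made up for the example.

```python
def clustering_tour(diverse_starts, neighbors, likes, max_solutions=10):
    """Explore parameter settings as a tree driven by like/reject feedback.

    diverse_starts -- very different parameter settings to visit first
    neighbors      -- function(params) -> list of slightly changed settings
    likes          -- function(params) -> True (like) or False (reject)
    max_solutions  -- stopping condition: number of solutions shown
    """
    shown, liked, seen = [], [], set()

    def explore(params):
        if len(shown) >= max_solutions or params in seen:
            return
        seen.add(params)
        shown.append(params)              # present this solution to the user
        if not likes(params):
            return                        # rejected: go back to the parent node
        liked.append(params)
        for child in neighbors(params):   # liked: try small changes (child nodes)
            explore(child)

    for start in diverse_starts:
        explore(start)
    return shown, liked

def neighbors(params):
    # Small change: vary the number of clusters by one, keeping k >= 2.
    algo, k = params
    return [(algo, k + 1), (algo, k - 1)] if k > 2 else [(algo, k + 1)]

def likes(params):
    # Stand-in for the user's feedback buttons (hypothetical preference).
    algo, k = params
    return algo == "kmeans" and 3 <= k <= 4

starts = [("agglomerative", 8), ("kmeans", 3)]
shown, liked = clustering_tour(starts, neighbors, likes, max_solutions=10)
print(shown)
print(liked)
```

In this toy run the first, very different solution is rejected immediately, while the second is liked and then refined with small changes to k until the variations stop being liked.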

USER STUDY
    • Participants: 12 data science practitioners.

    • Goals: 1) understand how data scientists with and without prior knowledge of the data domain interact with the system; 2) understand how they find a satisfactory solution in an open-ended analysis task.

    • Data: a dataset of Parkinson's disease patients with 8652 rows and 37 features.
    • Task: identify different types of Parkinson's disease. Participants were asked to identify a clustering instance they were satisfied with, assign a name and description to each cluster, and finally explain verbally the significance of their result.
    • Participants were divided into three categories: hackers, script writers, and application users. Each category had 4 people, 2 of whom had medical knowledge.

The experimental results show that Clustrophile 2 supports different types of data analysts. Three of the 12 users (two of whom were application users) tended to use the Clustering Tour for their analysis. Participants kept iterating over the clustering parameters and the selected features until they realized that they could only find clusterings based on the affected side or the severity of the disease. These clusters were easy to interpret from the heatmap visualization, which evidently provides very effective information.

CONCLUSION
    • Parameter and algorithm selection are very important.

    • The Clustering Tour improves user autonomy and creativity.

    • User feedback on results accelerates the exploration process.

    • Managing and caching data and analysis states facilitates the user's exploration.

In addition, there are some points that could be improved:

    • Add research on the interpretability of clustering.

    • Add more precomputation and recommendations.

    • Add support for arbitrary clustering algorithms through code interfaces, so that users can extend the framework.
