The left sidebar is used to import data or to open previously saved results. The right side shows a log of all operations, so you can easily return to an earlier state. In the main area, the upper part shows the data and the lower half shows the cluster view.
INTRODUCTION
Data clustering is an effective tool for processing unlabeled, high-dimensional data. However, it is difficult to determine the best clustering method and parameters, so support from a visual system is needed. Clustrophile 2 is a new interactive tool for guided cluster analysis: it guides users through clustering-based exploratory analysis, adapts to user feedback to improve the clustering, and helps users quickly reason about differences between clusters. To this end, Clustrophile 2 offers a novel feature, the Clustering Tour, which helps users select clustering parameters and assess how well the results match the current analysis goals and user expectations. The authors evaluated the system in a user study with 12 data scientists. The results show that Clustrophile 2 improves the speed and effectiveness of exploratory clustering analysis for both experts and non-experts.
DESIGN CRITERIA
The paper summarizes nine design criteria for Clustrophile 2:
- Show variation within clusters (display the clustering result at a glance)
- Allow quick iteration over parameters (update results in real time as parameters change)
- Represent clustering instances compactly (multi-view display)
- Facilitate interpretable naming (rename and annotate clusters)
- Support analysis of large datasets
- Support reasoning about clusters and clustering instances (inference and evaluation of clustering results)
- Promote multiscale exploration
- Keep a stateful representation of the current analysis (save the current exploration state)
- Guide users in clustering analysis
The authors list the following contributions:
- Building on the original Clustrophile system, they add a richer set of clustering algorithms, parameters, evaluation metrics, and visualization tools.
- They develop an integrated feature, the Clustering Tour, that guides users through cluster analysis.
- They define a more reasonable measure of clustering quality that takes user feedback, interpretability, and similar factors into account.
USER INTERFACE AND INTERACTIONS
The visualization system consists of three main parts: the cluster views, the recommendation assistant, and the Clustering Tour.
Visualization views
In the cluster view, the scatter plot shows the clustering result projected onto a 2D plane after dimensionality reduction, where the distance between points encodes their similarity. In the heatmap on the right, each column represents a cluster, each row represents a feature, and the color shade encodes the relative magnitude of the values.
In addition, the data table lets users inspect and filter the data.
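The heatmap in the cluster view can be thought of as a features-by-clusters matrix of aggregated values. The following is a minimal sketch, assuming each cell holds the mean of a feature within a cluster, rescaled to [0, 1] per feature. The aggregation and scaling actually used by Clustrophile 2 are not stated here, and `cluster_heatmap` is a hypothetical helper name.

```python
from collections import defaultdict


def cluster_heatmap(rows, labels):
    """Return {feature_index: {cluster_label: relative mean in [0, 1]}}.

    rows   -- list of equal-length numeric feature vectors
    labels -- cluster label for each row (same length as rows)
    Assumption: cells are per-cluster means, min-max scaled within each feature.
    """
    n_features = len(rows[0])
    members = defaultdict(list)
    for row, label in zip(rows, labels):
        members[label].append(row)

    heatmap = {}
    for j in range(n_features):
        # Mean of feature j inside every cluster.
        means = {
            label: sum(r[j] for r in group) / len(group)
            for label, group in members.items()
        }
        lo, hi = min(means.values()), max(means.values())
        span = hi - lo
        # Rescale so shades are comparable across clusters for this feature.
        heatmap[j] = {
            label: (m - lo) / span if span else 0.0
            for label, m in means.items()
        }
    return heatmap


# Toy data: two features, two clusters.
rows = [[1.0, 10.0], [3.0, 10.0], [5.0, 20.0], [7.0, 40.0]]
labels = [0, 0, 1, 1]
heatmap = cluster_heatmap(rows, labels)
```

Each inner dictionary corresponds to one heatmap row (a feature), and each key in it to one column (a cluster).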
Choosing Parameters and Guiding Users Towards a Better Clustering
Clustrophile 2 recommends parameter settings for tuning the clustering algorithm and provides a comprehensive set of measures of clustering quality:
- Clustrophile 2 supports automatic tuning, removes features with low variance, and supports custom sampling.
- It recommends a suitable clustering algorithm based on the characteristics of the data, and an appropriate number of clusters based on the dendrogram of a hierarchical clustering.
- By comparing different projection methods, the system recommends the projection (dimensionality reduction) algorithm that best preserves the compactness and separation of the clusters.
- Clustering results are measured quantitatively by the skewness of the distribution, the density of sub-clusters, the robustness of the algorithm to noise, and the monotonicity of the cost function.
- By training a decision tree on the clustering result, the system infers the main features that characterize the data points in each cluster.
- Outliers in the cluster distribution can be analyzed, removed, and the data re-clustered.
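The decision-tree idea in the list above can be illustrated with a depth-one tree (a decision stump). This is a minimal sketch and not the system's implementation: it scans every feature and candidate threshold and reports the split that best separates the cluster labels, scored by Gini impurity. The function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of cluster labels (0.0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best stump.

    A row goes to the left branch when row[feature_index] <= threshold.
    Returns None if no feature has at least two distinct values.
    """
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2
            left = [lab for r, lab in zip(rows, labels) if r[j] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best


# Toy data: feature 1 separates the two clusters, feature 0 does not.
rows = [[0.2, 1.0], [0.9, 1.2], [0.3, 5.0], [0.8, 5.5]]
labels = ["A", "A", "B", "B"]
feature, threshold, impurity = best_split(rows, labels)
```

The selected feature is the one that best distinguishes the clusters, which gives a reading such as "cluster A has low values of feature 1". A full decision tree would repeat this split recursively on each branch.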
Clustering Tour
By iteratively changing the clustering parameters, the user can dynamically explore the space of possible clustering solutions until a satisfactory one is found. However, even with guidance on parameter selection, the space of possible parameter combinations and clustering solutions is too large to explore fully by hand. Some parameter choices strongly affect the clustering result, while others have minimal effect. With this in mind, the authors introduce the Clustering Tour, a feature that helps users quickly explore the space of possible clustering results. The Clustering Tour interface contains (a) a list of previously explored solutions, (b, c) scatter plot and heatmap visualizations of the current solution, (d) a mode selection with which the user can constrain how parameters are updated, and (e) a set of buttons with which the user gives feedback by liking or rejecting the solution.
The whole process is similar to simulated annealing. First, the system proposes clustering results that differ strongly from one another, judged by the similarity between clusterings. The user browses these solutions in turn. If the user likes a solution, the system makes minor parameter changes based on it, which is equivalent to descending into a child node of a search tree. If the user is not satisfied, the search returns to the parent node. This continues until a stopping condition is reached, such as a limit on exploration time or on the number of solutions.
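The like/reject loop described above can be sketched as a walk over a tree of parameter settings. This is a simplified illustration under several assumptions, not the authors' algorithm: large jumps are drawn at random instead of being chosen by clustering dissimilarity, the only parameters are a hypothetical algorithm name and a cluster count `k`, user feedback is simulated by a function, and a `patience` counter (an added assumption) decides when to abandon a branch.

```python
import random

ALGORITHMS = ["kmeans", "agglomerative", "spectral"]  # hypothetical choices


def big_jump(rng):
    """Sample a fresh, possibly very different parameter set."""
    return {"algorithm": rng.choice(ALGORITHMS), "k": rng.randint(2, 10)}


def small_step(params, rng):
    """Perturb a liked solution slightly: shift k by one, clamped to [2, 10]."""
    k = min(10, max(2, params["k"] + rng.choice([-1, 1])))
    return {"algorithm": params["algorithm"], "k": k}


def clustering_tour(feedback, max_solutions=20, patience=2, seed=0):
    """Explore parameter settings guided by like/reject feedback.

    feedback(params) -> True (like) or False (reject).
    Returns (visited, liked), two lists of parameter dictionaries.
    """
    rng = random.Random(seed)
    path = []      # chain of liked solutions leading to the current node
    rejects = 0    # consecutive rejected proposals at the current node
    visited, liked = [], []
    while len(visited) < max_solutions:  # stopping condition: solution budget
        # Refine around the current liked solution, or jump if there is none.
        candidate = small_step(path[-1], rng) if path else big_jump(rng)
        visited.append(candidate)
        if feedback(candidate):
            liked.append(candidate)
            path.append(candidate)       # descend into the child node
            rejects = 0
        else:
            rejects += 1                 # stay at the parent node
            if rejects >= patience and path:
                path.pop()               # give up on this branch
                rejects = 0
    return visited, liked


# Simulated user who is satisfied with three to five clusters.
visited, liked = clustering_tour(lambda p: 3 <= p["k"] <= 5, max_solutions=15)
```

In the real tool the feedback would come from the like and reject buttons, and the stopping condition could equally be a time limit.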
USER STUDY
- Participants: 12 data science practitioners
- Goals: 1) understand how data scientists draw on prior knowledge of the data domain; 2) understand how they arrive at a satisfactory solution in an open-ended analysis task
- Data: a dataset of subjects with Parkinson's disease, with 8652 rows and 37 features
- Task: identify different types of Parkinson's disease. Participants were asked to find a clustering instance they were satisfied with, assign a name and description to each cluster, and finally explain verbally the significance of their result.
The participants were divided into three categories: hackers, script writers, and application users. Each category had 4 people, 2 of whom had medical knowledge.
The experimental results indicate that Clustrophile 2 supports different types of data analysts. Three of the 12 users (two of whom belonged to the application user category) tended to use the Clustering Tour for their analysis. Participants kept iterating over the clustering parameters and the selected features until they realized that the only clusterings they could find were based on the affected side of the body or on the severity of the disease. These clusters were easy to interpret from the heatmap visualization, which evidently conveyed very useful information.
CONCLUSION
- The choice of parameters and algorithm is very important.
- The Clustering Tour improves user autonomy and creativity.
- User feedback on results accelerates the exploration process.
- Managing and caching data and analysis state facilitates the user's exploration.
In addition, there are some points that can be improved:
- Add research on interpretable clustering.
- Add more precomputation and recommendations.
- Support arbitrary clustering algorithms through a programming interface, so that users can extend the framework.