The Chinese restaurant process (CRP) is one interpretation of the Dirichlet process (DP) in which the number of clusters does not have to be determined in advance.
Suppose a Chinese restaurant has an unlimited number of tables. The first customer arrives and sits at the first table. The second customer can either sit at the first table or choose a new one. In general, suppose the (n+1)-th customer arrives when K tables are already occupied, with n_1, n_2, ..., n_K customers seated at them respectively. The (n+1)-th customer then sits at table i with probability n_i/(\alpha+n), where n_i is the number of customers at table i, and picks a new table with probability \alpha/(\alpha+n). After n customers have sat down, the CRP has divided them into K groups, that is, K clusters. It can be proved that the CRP is a DP.
Note that there is one restriction: each table serves only one dish, that is, everyone at the same table eats the same dish.
It can be seen that the more customers a table already has, the more likely the next customer is to choose it, because the probability of choosing a table is proportional to the number of customers seated there.
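The seating rule above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names seating_probabilities and sample_crp and the seed parameter are hypothetical and not taken from any library, and \alpha is assumed to be strictly positive.

```python
import random
from typing import List, Optional


def seating_probabilities(table_sizes: List[int], alpha: float) -> List[float]:
    """Probabilities for the next customer: one entry per occupied table,
    plus a final entry for opening a new table."""
    n = sum(table_sizes)  # customers already seated
    probs = [n_i / (alpha + n) for n_i in table_sizes]
    probs.append(alpha / (alpha + n))  # new table
    return probs


def sample_crp(n_customers: int, alpha: float,
               seed: Optional[int] = None) -> List[int]:
    """Seat n_customers one by one and return the resulting table sizes.

    Assumes alpha > 0 (hypothetical helper, not a library function).
    """
    rng = random.Random(seed)
    tables: List[int] = []  # tables[i] = number of customers at table i
    for _ in range(n_customers):
        # Unnormalized weights: n_i for each occupied table, alpha for a new one.
        weights = tables + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if choice == len(tables):
            tables.append(1)      # open a new table
        else:
            tables[choice] += 1   # join an existing table
    return tables


if __name__ == "__main__":
    # Two tables with 3 and 1 customers, alpha = 1: n = 4, so alpha + n = 5.
    print(seating_probabilities([3, 1], alpha=1.0))
    # Number of clusters is not fixed in advance; it emerges from the draws.
    print(sample_crp(100, alpha=1.0, seed=0))
```

The length of the list returned by sample_crp is the number of clusters K, and its entries are the cluster sizes n_1, ..., n_K. Larger values of \alpha make new tables more likely and therefore tend to produce more clusters.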
Chinese Restaurant Process (CRP)