Research direction, hotspots and understanding of big data research in data mining

Source: Internet
Author: User

  The top conference in the field of data mining is KDD (ACM SIGKDD Conference on Knowledge Discovery and Data Mining). Judging by how widely peers recognize each venue, the top-ranked conferences are KDD, ICDE, CIKM, ICDM, and SDM, and the top journals are ACM TKDD, IEEE TKDE, ACM TODS, ACM TOIS, DMKD, VLDB Journal, and others. The full names of the conferences and journals are as follows:

Conferences

ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)

International Conference on Data Engineering (ICDE)

International Conference on Information and Knowledge Management (CIKM)

IEEE International Conference on Data Mining (ICDM)

SIAM International Conference on Data Mining (SDM)

Journals

ACM Transactions on Knowledge Discovery from Data (TKDD)

IEEE Transactions on Knowledge and Data Engineering (TKDE)

ACM Transactions on Database Systems (TODS)

ACM Transactions on Information Systems (TOIS)

Data Mining and Knowledge Discovery (DMKD)

Having read the latest (2014 and 2015) conference papers over the past few days, let's start with what the field of data mining is working on and where the research hotspots are.

The field of data mining mainly covers the following areas: basic theory (rule and pattern mining, classification, clustering, topic learning, spatio-temporal data mining, and machine learning methods, whether supervised, unsupervised, or semi-supervised); social network analysis and large-scale graph mining (graph pattern mining, community discovery, network clustering coefficient estimation, network relationship mining, network user behavior analysis, information diffusion in networks, social network applications, and social recommendation of information, friends, and so on); big data mining (parallel algorithms, distributed extensions, multi-source heterogeneous data fusion mining, and so on); and data mining applications (medicine, education, finance, and so on). The current research hotspots are big data mining, social network analysis, and large-scale graph mining.

Now, what is the essential difference between big data mining and traditional methods? Big data mining can be divided into three parts: algorithm scalability, distributed framework development, and multi-source data fusion analysis. In the Big Data sessions of several KDD conferences, almost every paper mentions the scalability of its algorithm. The essential difference between today's big data mining and traditional algorithms is therefore scalability: an algorithm under study must not only handle small datasets but also remain applicable as the data grows.

As I understand it, scaling an algorithm has two aspects: scale-up (vertical scaling) and scale-out (horizontal scaling). Vertical scaling works at the lowest level of the algorithm, through good data structure design or concurrent design. Scale-out mainly refers to a distributed implementation of the algorithm, either a purpose-built distributed algorithm or one built on an existing distributed framework.

"Big data" as used here means different data volumes in different mining fields (text, graph structure, machine learning, images). For text, millions of samples may be "big data"; for machine learning, thousands of samples with dozens or hundreds of dimensions (MB/GB) are "big data"; for large-scale graph mining, a graph with billion-level edge counts (GB) is also "big data"; for image data, millions of images (TB) can certainly be called "big data".

So, does making an algorithm scalable require parallel and distributed programming techniques? The answer is: generally yes, but not absolutely. If the algorithm is pushed to its limit, a single computer can also handle "big data" problems, as in "TurboGraph: A Fast Parallel Graph Engine Handling Billion-scale Graphs in a Single PC". That paper does the work of a computer cluster on a single machine by using multi-core thread parallelism.
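
Both scaling directions discussed above rest on the same pattern: split the data into partitions, process each partition independently, then merge the partial results. The following is a minimal single-machine sketch of that pattern in Python, not the method of any paper named here. The function names (`count_items`, `partition`, `parallel_item_support`) and the toy transactions are hypothetical, chosen only for illustration. It uses threads for simplicity; for CPU-bound work in CPython one would typically swap in `ProcessPoolExecutor`, and a distributed framework would run the same two steps across machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(transactions):
    """Count item occurrences in one partition (the 'map' step)."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))  # count each item once per transaction
    return counts


def partition(data, n_parts):
    """Split data into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_item_support(transactions, n_workers=4):
    """Partition, count each chunk concurrently, then merge (the 'reduce' step)."""
    chunks = partition(transactions, n_workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(count_items, chunks):
            total.update(partial)
    return total


# Hypothetical toy dataset: each inner list is one transaction.
transactions = [
    ["milk", "bread"],
    ["milk", "eggs"],
    ["bread", "eggs", "milk"],
    ["eggs"],
    ["bread"],
]
support = parallel_item_support(transactions, n_workers=2)
print(dict(support))
```

Because the merge step only adds counters, the result does not depend on how many partitions are used, which is what lets the same algorithm move from one thread to many workers without changing its output.
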
Some papers run their experiments in MATLAB ("Comparing Apples to Oranges: A Scalable Solution with Heterogeneous Hashing", "Fast Flux Discriminant for Large-Scale Sparse Nonlinear Classification", "Online Chinese Restaurant Process"), some on a Hadoop cluster, some as distributed programs written in C or Java, and some as multi-threaded parallel implementations on multi-core CPUs. Clearly, how the algorithm is implemented matters less than whether the algorithm is scalable.

Multi-source data fusion and mining can also count as big data mining. The datasets may not be very large, but fusing several kinds of data makes it possible to find things that could not be found before, or to do well at tasks that previously gave poor results. For example, the heterogeneous hashing paper uses two heterogeneous datasets (text and images) for relation-aware analysis. Another notable example is the Microsoft Research Asia KDD paper "U-Air: When Urban Air Quality Inference Meets Big Data", which fuses 5 datasets (meteorological data, air quality data, POI data, road network data, and trajectory data), analyzes them with traditional data mining methods, obtains good results, and has been applied commercially.

Note: in my view, such algorithms should also take scalability into account and be checked for whether they still predict efficiently as the dataset grows.
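
To make the idea of multi-source fusion concrete, here is a minimal sketch of the simplest form it can take: joining records from several sources on a shared key to build one feature row per entity. This is an illustrative assumption, not the method used in the papers named above. The function `fuse_sources`, the source names, the grid-cell ids, and all values are hypothetical.

```python
def fuse_sources(sources, defaults):
    """Merge per-region records from several sources into one feature dict per region.

    sources:  {source_name: {region_id: {feature: value}}}
    defaults: {source_name: {feature: fallback}} used when a region is missing
    """
    region_ids = set()
    for records in sources.values():
        region_ids.update(records)

    fused = {}
    for region in sorted(region_ids):
        row = {}
        for name, records in sources.items():
            features = records.get(region, defaults[name])
            for feature, value in features.items():
                row[f"{name}.{feature}"] = value  # prefix avoids name clashes
        fused[region] = row
    return fused


# Hypothetical toy data keyed by grid-cell id; the values are made up.
sources = {
    "weather": {"g1": {"humidity": 0.62}, "g2": {"humidity": 0.55}},
    "poi": {"g1": {"factories": 3}},
    "roads": {"g1": {"highway_km": 1.2}, "g2": {"highway_km": 0.0}},
}
defaults = {
    "weather": {"humidity": 0.0},
    "poi": {"factories": 0},
    "roads": {"highway_km": 0.0},
}
fused = fuse_sources(sources, defaults)
print(fused["g2"])
```

The fused rows can then be fed to any traditional mining method. How missing regions are filled (here a fixed default per source) is a design choice that a real study would need to justify.
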

Summary: big data research is mostly research on theory and algorithms. Data mining is by nature about processing data, so under the right conditions (when the dataset is large or growing) any research topic in data mining may run into "big data" problems. What really needs to be done, then, is to pick a problem, apply a traditional mining method, and test whether the traditional algorithm is still feasible on a large-scale dataset. If it is not, propose an improved version of the algorithm or implement a new, scalable one. This is the process of big data research (which also includes heterogeneous data fusion analysis).
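
The testing step in this process can be sketched as a small experiment: run the traditional algorithm and a candidate replacement on datasets of increasing size and estimate how the cost grows. The example below is a hedged illustration in Python. The task (counting duplicate pairs), the function names, and the synthetic data generator are all hypothetical. It counts basic operations instead of measuring wall-clock time so that the result is reproducible.

```python
import math


def naive_duplicate_pairs(values):
    """Compare every pair; returns (duplicate_pair_count, comparisons_made)."""
    pairs = comparisons = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            comparisons += 1
            if values[i] == values[j]:
                pairs += 1
    return pairs, comparisons


def hashed_duplicate_pairs(values):
    """Group equal values in a dict; returns (duplicate_pair_count, operations_made)."""
    seen = {}
    pairs = operations = 0
    for v in values:
        operations += 1
        pairs += seen.get(v, 0)  # v pairs with every earlier equal value
        seen[v] = seen.get(v, 0) + 1
    return pairs, operations


def growth_exponent(algorithm, make_data, sizes):
    """Least-squares slope of log(cost) against log(n): ~1 linear, ~2 quadratic."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(algorithm(make_data(n))[1]) for n in sizes]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def make_data(n):
    """Hypothetical synthetic dataset: n values cycling through 10 distinct keys."""
    return [i % 10 for i in range(n)]


sizes = [100, 200, 400, 800]
naive_exp = growth_exponent(naive_duplicate_pairs, make_data, sizes)
hashed_exp = growth_exponent(hashed_duplicate_pairs, make_data, sizes)
print(f"naive exponent ~ {naive_exp:.2f}, hashed exponent ~ {hashed_exp:.2f}")
```

An exponent near 2 signals that the traditional approach will stop being feasible as the data grows, while an exponent near 1 suggests the alternative scales. Both versions return the same pair count, so the improvement changes cost, not results.
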
