Learning to Hash and the Application of Hashing in Big Data Retrieval and Mining


http://cs.nju.edu.cn/lwj/conf/CIKM14Hash.htm

Learning to Hash with its Application to Big Data Retrieval and Mining

 

Overview

Nearest neighbor (NN) search plays a fundamental role in machine learning and related areas, such as information retrieval and data mining. Hence, there has been increasing interest in NN search over massive (large-scale) data sets in this big data era. In many real applications, it is not necessary for an algorithm to return the exact nearest neighbors for every possible query. Hence, in recent years, approximate nearest neighbor (ANN) search algorithms, which improve speed and reduce memory usage, have received more and more attention from researchers.

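To make the cost concrete, exact NN search is a linear scan over the whole data set, so the per-query cost grows with the number of points. A minimal sketch (illustrative Python, not from the tutorial; function and variable names are our own):

```python
import numpy as np

def exact_nn(query, data):
    # Brute-force exact NN: compute the distance to every point,
    # O(n * d) per query. This linear dependence on n is what becomes
    # prohibitive on massive data sets, motivating approximate (ANN) methods.
    dists = np.linalg.norm(data - query, axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 32))             # 1000 points in 32 dimensions
query = data[42] + 0.01 * rng.standard_normal(32)  # a slightly perturbed copy of point 42
```

ANN methods such as hashing accept a small chance of missing the true nearest neighbor in exchange for avoiding this full scan.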

Due to its low storage cost and fast query speed, hashing has been widely adopted for ANN search in large-scale datasets. The essential idea of hashing is to map the data points from the original feature space into binary codes in the hashcode space while preserving the similarities between pairs of data points. The advantage of the binary-code representation over the original feature-vector representation is twofold. First, each dimension of a binary code can be stored using only 1 bit, whereas several bytes are typically required for one dimension of the original feature vector, leading to a dramatic reduction in storage cost. Second, with binary codes, all the data points within a specific Hamming distance of a given query can be retrieved in constant or sub-linear time, regardless of the total size of the dataset. Hence, hashing has become one of the most effective methods for big data retrieval and mining.

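The mapping from feature vectors to binary codes can be illustrated with the classic random-hyperplane scheme, a data-independent LSH baseline rather than one of the learned methods this tutorial covers (a sketch under that assumption; all names are illustrative):

```python
import numpy as np

def random_hyperplane_hash(X, n_bits=16, seed=0):
    # Each bit is the sign of the projection onto a random hyperplane.
    # Data-independent baseline: learned hashing methods would instead
    # optimize these projections to preserve pairwise similarities.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_bits))
    return (X @ W > 0).astype(np.uint8)

def hamming(a, b):
    # Hamming distance: number of bit positions where two codes differ.
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 64))   # 5 points with 64-dimensional features
codes = random_hyperplane_hash(X)  # 5 binary codes of 16 bits each
```

With 16 bits per point, each code needs only 2 bytes versus 256 bytes for 64 single-precision dimensions, and candidate neighbors can be found by examining codes within a small Hamming radius of the query's code.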

To obtain effective hash codes, most methods adopt machine learning techniques for hash function learning. Hence, learning to hash, which aims to design effective machine learning methods for hashing, has recently become a very hot research topic with wide applications in many big data areas. This tutorial will provide a systematic introduction to learning to hash, including the motivation, models, learning algorithms, and applications. First, we will introduce the challenges of retrieval and mining with big data, which motivate the adoption of hashing. Second, we will give comprehensive coverage of the foundations and recent developments of learning to hash, including unsupervised hashing, supervised hashing, multimodal hashing, etc. Third, quantization methods, which many hashing methods use to turn real values into binary codes, will be presented. Fourth, a large variety of applications of hashing will also be introduced, including image retrieval, cross-modal retrieval, recommender systems, and so on.

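As a minimal illustration of the quantization step, the simplest scheme thresholds each real-valued projected dimension at its mean to produce one bit per dimension; refinements such as double-bit quantization assign two bits per dimension instead. A sketch (illustrative only; the toy values are our own):

```python
import numpy as np

def sign_quantize(Y):
    # Simplest quantization: threshold each real-valued projected
    # dimension at its column mean, yielding one bit per dimension.
    return (Y > Y.mean(axis=0)).astype(np.uint8)

# Toy projected values: 3 data points, 2 projected dimensions.
Y = np.array([[ 0.1, -1.0],
              [ 1.2,  0.5],
              [-0.4,  0.2]])
B = sign_quantize(Y)  # binary codes, one row per data point
```

Values above the per-dimension mean become 1 and the rest become 0, so nearby real-valued projections tend to share bits.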


 

Slides & Outline

TBD (To Be Determined)

 

Presenter
  Wu-Jun Li

Dr. Wu-Jun Li is currently an associate professor in the Department of Computer Science and Technology at Nanjing University, P. R. China. From 2010 to 2013, he was a faculty member of the Department of Computer Science and Engineering at Shanghai Jiao Tong University, P. R. China. He received his PhD degree from the Department of Computer Science and Engineering at Hong Kong University of Science and Technology in 2010. Before that, he received his M.Eng. degree and B.Sc. degree from the Department of Computer Science and Technology, Nanjing University, in 2006 and 2003, respectively. His main research interests include machine learning and pattern recognition, especially statistical relational learning and big data machine learning (big learning). In these areas he has published more than 30 peer-reviewed papers, most in prestigious journals such as TKDE and top conferences such as AAAI, AISTATS, CVPR, ICML, IJCAI, NIPS, and SIGIR. He has served as a PC member of ICML'14, IJCAI'13/'11, NIPS'14, SDM'14, UAI'14, etc.

