This paper briefly reviews the key technologies and the current state of research in content-based video analysis and retrieval, and outlines the research hotspots in this area and the work that remains to be done.
I. The problem:
The advent of the Internet has brought great convenience to people, especially once resource sharing became possible, but how can this vast pool of resources actually be put to use? Since the 1990s, with the rapid progress of multimedia and network technology, people have been moving quickly into an information society. Modern technology can collect and produce enormous amounts of multimedia data of every kind, giving rise to new forms of service and information exchange such as digital libraries, digital museums, digital cinema, video telephony, interactive TV, video conferencing, video-on-demand (VOD), distance education and telemedicine. Among all this multimedia information, the largest and most important component is video: roughly 70% of the information humans receive comes through vision, and video carries far more content than speech or plain data. With video information expanding so rapidly, the pressing problem is the efficient retrieval and browsing of massive amounts of video, that is, how people can quickly and effectively look through large volumes of video and find what interests them.
The traditional scheme for video information retrieval relies on text identifiers. Concretely, each video frame is annotated with a number or a textual comment, and queries are then run against these annotations, so that a query for a frame image becomes a query over comments. Although simple, this method cannot fully meet the needs of video data retrieval. First, the volume of video data is huge, so adding annotations manually is extremely labor-intensive and inefficient. Second, video content is rich and hard to express with text labels. Third, a text description is a specific abstraction, and a specific label only fits a specific query [1]. Finally, text labels are added by an observer and are therefore subjective: different observers may describe the same content differently. An objective and comprehensive video retrieval method is therefore needed, and content-based video retrieval (CBVR) has emerged to fill this need. It searches large-scale video databases according to the content and context of the video itself, provides algorithms that automatically extract and describe video features and content without human involvement, differs fundamentally from traditional keyword-based retrieval, and draws on image understanding, pattern recognition, computer vision and other technologies.
In recent years, with the extensive application of multimedia information in entertainment, commerce, production, medicine, security, national defense, the military and other fields, content-based video retrieval has become a hot research topic. Research on efficient classification, processing and indexing of video data, on building and refining mechanisms for fast browsing and retrieval of video information, and on developing powerful, easy-to-use video browsing and retrieval systems has great theoretical value and great application potential.
II. Solution approaches:
Video labeling: Video labeling manually assigns subjective attributes to a piece of video, which is then retrieved through text. The technique is quite mature but has inherent shortcomings: first, it is done by hand, so the workload is very high and the efficiency low; second, some video and perceptual features are difficult to describe in words; third, it is highly subjective and there is no unified standard, so different people understand the same video differently and inevitably produce different labels.
Video summarization: A video summary extracts meaningful parts from the original video in an automatic or semi-automatic way and merges them into a compact, video-based overview that conveys the semantic content of the full video. Video summarization has seen some development and offers ideas for content-based video retrieval, but it is still some distance from true content-based retrieval.
Video content retrieval in the uncompressed domain: Retrieval in the uncompressed domain analyzes the video and extracts features on the basis of its low-level characteristics, uses the essential features of the video as the basis for retrieval, and thereby achieves fully automatic retrieval. Considerable research results exist, but because all of its algorithms must operate on fully decompressed video, and video data is both voluminous and computationally heavy to process, practical implementations have not been ideal.
Video content retrieval in the compressed domain: Retrieval in the compressed domain analyzes the video stream and extracts low-level features without decompressing it, or with only partial decompression, and finally realizes automatic retrieval using the essential features of the video. Because the analysis requires little or no decompression, its advantages are, first, a greatly reduced data volume and, second, a reduced amount of computation, which together greatly improve system efficiency.
III. Research status abroad:
1. QBIC is a content-based retrieval system developed by the IBM Research Center. It was the first full-featured video database system and is a typical representative of content-based retrieval systems, with far-reaching influence on the development of video databases. QBIC supports query by example and query by user sketch, extracts color, texture and shape features as well as shot and object motion information, uses an R-tree as its high-dimensional index structure, and combines these with keywords to retrieve from large image and video databases.
2. The Informedia Digital Video Library project at Carnegie Mellon University (CMU) is a major effort on digital video processing and management and a pioneering, relatively complete prototype for content-based video analysis. The system was among the first to apply digital audio processing and text processing to content-based video analysis: it obtains video semantics through speech recognition and text recognition, uses them to assist video segmentation, extracts meaningful video clips to generate video summaries, and supports automatic, comprehensive video information queries for content-based browsing, retrieval and related services.
3. VideoQ is a fully automatic, object-oriented, content-based video query system, a prototype developed by the Image and Advanced Television Laboratory at Columbia University. It extends traditional keyword search and topic browsing with a new query technique based on rich visual features and spatio-temporal relations that lets users query objects in video; the aim is to exploit all potential visual cues in a video for object-oriented, content-based querying. VideoQ currently supports a large video database and is a Web-oriented video search system.
4. VisualSEEk is a visual-feature query system and WebSEEk is a WWW-oriented text/image/video query system, both developed at Columbia University. Their main features are retrieval based on the spatial relations of image regions and on visual features extracted in the compressed domain; the visual features are color sets and wavelet-based texture features, and a binary-tree index algorithm is used to speed up retrieval. The systems offer a set of powerful modules: content-based image retrieval, query refinement based on user relevance feedback, automatic extraction of visual information, thumbnail representation of the videos/images in query results, topic-based browsing of images/videos, text-based search, manipulation of query results, and so on.
5. CVEPS is a software prototype for video retrieval and editing developed at Columbia University; it supports automatic video segmentation, video retrieval based on key frames and objects, and editing of compressed video.
6. JACOB is a video database query system developed at the University of Palermo in Italy. It segments video data into shots with a shot extractor and selects several representative frames from each shot. These representative frames are described in terms of color and texture, and the motion features of the associated short sequences are then computed to give a dynamic description. When a query, or a query by example, is submitted to the system, the query module interprets it, arranges the matching parameters, and returns the most similar shots. The user can browse these results, adjust the parameters if necessary, and query again iteratively.
7. VISION is a digital video library prototype developed at the University of Kansas. It combines video processing with speech recognition and uses a two-stage algorithm based on video and audio content to automatically divide video into a large number of logically self-contained video clips. A caption decoder and a word indexer were added to the system to extract the text information with which the video clips are indexed.
8. The Goalgle Soccer Video Search Engine is a soccer video analysis system developed at the University of Amsterdam. The system is a Web application with a tree-structured framework; users can easily find events such as goals, yellow cards, red cards and substitutions, or search for particular players.
9. The University of Rochester's sports video analysis system performs object and event detection in sports video and produces video summaries of highlight footage; the system was used to process soccer match video from the 2004 Olympic Games and deliver it to users' mobile phones.
IV. Domestic research status:
1. TV-FI (Tsinghua Video Find It) is a video program management system developed at Tsinghua University; its features include video data import and content-based browsing and retrieval.
2. iVideo is a video retrieval system developed by the digital technology research institute of the Chinese Academy of Sciences. Built on the Java EE platform, it provides video analysis, content management, and Web-based retrieval and browsing.
3. Videowser is a prototype system developed by the research group led by Professor Hu Xiaofeng and colleagues at the National University of Defense Technology. The group's work focuses on video structure analysis: shot segmentation, key-frame extraction and shot clustering, and more recently audio feature extraction and retrieval. Together with the Multimedia Research Center and the Department of Systems Engineering, the group has also developed the news program browsing and retrieval system News Video CAR and a multimedia information query and retrieval system.
4. The research group of Academician Pan Yunhe at Zhejiang University focuses on video retrieval and video similarity measurement. It proposed a video similarity measure based on shot centroid feature vectors, providing a way to retrieve video from the features of its image sequence. The group has also tried extracting information from the closed captions in the video stream for video retrieval.
5. The research group led by Professor Gao at Peking University mainly studies face detection and tracking against complex backgrounds. They designed and implemented a face detection and tracking system based on eigen-subfaces: coarse detection is first performed by template matching with a gray-level face template, and recognition accuracy is then improved by collecting an effective set of negative (non-face) samples. The group is currently conducting lip-reading (speech-reading) research that integrates audio features with image sequence features.
6. The iFind information retrieval system was developed by the team led by Dr. Hongjiang Zhang at Microsoft Research Asia, and its results are among the most prominent.
V. Key Technologies
The key technologies fall into several parts: shot segmentation, key-frame extraction and feature extraction, and feature-based video indexing and storage organization.
Shot segmentation:
The main idea of shot segmentation is to compare the difference between the feature values of two consecutive frames against a given threshold. If the difference is greater than the threshold, the two frames have changed substantially and can be considered to belong to different shots, so a shot boundary is placed between them; if the difference is less than the threshold, the two frames can be considered part of the same shot, and the comparison continues with the next pair of frames [1][7].
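As a concrete illustration of this thresholding idea, the sketch below compares consecutive frames by histogram difference using OpenCV. The video file name, histogram size, distance measure (Bhattacharyya) and threshold value are illustrative assumptions rather than values prescribed by any particular system.

```python
# Minimal sketch of hard-cut detection by thresholding frame-to-frame
# histogram differences. File name and threshold are illustrative.
import cv2

def detect_cuts(video_path, threshold=0.4, bins=64):
    """Return frame indices where a hard cut (shot boundary) is likely."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [bins], [0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Bhattacharyya distance: near 0 for similar frames, larger for dissimilar ones.
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if d > threshold:          # difference exceeds the threshold -> new shot starts here
                cuts.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts

# cuts = detect_cuts("news.mp4")   # hypothetical file name
```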
Feature Extraction:
Video features include text features, sound features, and image features.
From a content-based perspective, text features are textual information extracted from the video content itself, mainly the output of automatic speech recognition (ASR) and video optical character recognition (VOCR). The text obtained this way can be indexed and used for feature extraction just like traditional text. Basic sound features include global and local spectral information, from which properties such as loudness, pitch, brightness and bandwidth can be derived, as well as classification information such as silence, speech, music, car noise or explosions; on this basis, sound-based retrieval or filtering can be performed. Because images are an indispensable element of video and image retrieval has been studied for a long time, image features have been investigated most extensively. For a shot, one or several key frames are typically selected according to some criterion, and image features are then extracted from those key frames. Common image features include color, texture and shape, which are the features most commonly used in current content-based image and video retrieval. In recent years, semantic concepts have become a research focus. Semantic concept features describe video at the semantic level; they are modeled and extracted automatically by machine learning methods using text, sound and image features. Semantic concept features allow people to search naturally at the semantic level and also help make browsing more effective.
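As a small, hedged example of the image-feature side of this description, the following sketch computes a global HSV colour histogram from a key frame; the bin counts and the "middle frame of the shot" key-frame criterion are simple illustrative choices, not the only ones used in practice.

```python
# Minimal sketch: extract a global HSV colour histogram from a key frame.
import cv2
import numpy as np

def color_feature(frame_bgr, bins=(8, 4, 4)):
    """Return an L1-normalised HSV histogram as a 1-D feature vector."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-9)

def middle_keyframe(frames):
    """Pick the middle frame of a shot as its key frame (simplest possible criterion)."""
    return frames[len(frames) // 2]
```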
Automatic retrieval:
In automatic retrieval, obtaining a valid query from the user is the first problem, even though it is often simply ignored. Most content-based video retrieval systems assume the user's query is a sample image; when text features are available, the user can also enter text, while systems that accept a video clip as input are rare. These input methods are not always realistic or effective, because the user may not be able to find a suitable sample image, and text features are not always present in a content-based video retrieval system [7]. Given the user's query and the extracted features, the most commonly used retrieval methods are text retrieval (over text features and semantic concept features), similarity retrieval (over sound, image and semantic concept features) and retrieval based on machine learning (over sound, image and semantic concept features).
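The similarity-retrieval case can be illustrated with a minimal sketch: rank the feature vectors of the library's key frames against the query feature by cosine similarity. The feature matrix is assumed to have been computed beforehand (for example, the colour histograms sketched above).

```python
# Minimal sketch of similarity retrieval by cosine similarity over
# precomputed key-frame feature vectors.
import numpy as np

def cosine_rank(query, library, top_k=10):
    """library: (N, D) array of features; query: (D,) array. Returns indices and scores."""
    q = query / (np.linalg.norm(query) + 1e-9)
    lib = library / (np.linalg.norm(library, axis=1, keepdims=True) + 1e-9)
    scores = lib @ q                      # cosine similarity of each item to the query
    order = np.argsort(-scores)[:top_k]   # best matches first
    return order, scores[order]
```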
High-dimensional indexing technology:
The experimental data sets of many retrieval algorithms contain only hundreds or thousands of items, so even sequential scanning gives response times the user does not notice. For large media libraries, however, indexing is indispensable, so new index structures and algorithms that support fast retrieval need to be studied. The usual approach at present is to reduce the dimensionality first and then apply a suitable multi-dimensional index structure. Although some progress has been made, effective high-dimensional indexing methods still need to be explored to support queries over multiple features, heterogeneous features, weights and primary-key attributes [2].
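A minimal sketch of the "reduce dimensionality first, then index" strategy mentioned above: project the features onto their top principal components with a plain SVD, then run an exact nearest-neighbour search in the reduced space. A production system would replace the exact search with a proper multi-dimensional index; the reduced dimension of 32 is an illustrative choice.

```python
# Minimal sketch: PCA-style dimensionality reduction followed by an
# exact nearest-neighbour search in the reduced feature space.
import numpy as np

def pca_reduce(features, dim=32):
    """features: (N, D). Returns the (N, dim) projection plus the mean and basis."""
    mean = features.mean(axis=0)
    centered = features - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:dim]                      # top principal directions
    return centered @ basis.T, mean, basis

def query_index(reduced, mean, basis, query, top_k=10):
    q = (query - mean) @ basis.T          # project the query into the same reduced space
    dists = np.linalg.norm(reduced - q, axis=1)
    return np.argsort(dists)[:top_k]
```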
VI. Outlook
The International Organization for Standardization has been working on content-based coding, which ties coding closely to content-based retrieval applications. MPEG-4 already takes some characteristics of content-based retrieval into account, and the MPEG standards body is committed to developing and refining the multimedia content description standard MPEG-7, whose goal is a standardized framework for describing multimedia content so that it can be represented and retrieved efficiently. MPEG-7 defines a range of description methods and tools covering audiovisual content from different perspectives. Overall, researchers have approached CBVR systems from different technical angles and obtained corresponding results. Most research has followed the lines of computer vision, pattern recognition and database indexing, and some progress has been made on the core technologies of content-based video retrieval, such as relevance feedback and semantic feature extraction. But these results are still far from meeting the needs of practical applications, so much work remains to be done for a long time to come:
(1) Selecting more effective video features. Existing features such as color and texture do not represent video content effectively. To obtain better shot- and scene-level video features, user feedback can be combined with automatic machine learning during feature selection.
(2) Multi-feature fusion retrieval. Most current research focuses on visual media, especially images and video, but the information environment we live in is all-encompassing: multimedia information also includes audio as well as graphics, animation and other media. As informatization deepens, these media data will keep growing and will inevitably face the same retrieval problem. Content-based retrieval is needed for digital audio, speech and music, and for synthetic media such as animation and VRML data. While studying the retrieval of individual media, the correlation and complementarity of multiple media should be exploited to improve retrieval effectiveness; a simple score-fusion sketch is given after this list.
(3) Video relevance feedback. An important characteristic of a CBVR system is the interactivity of the information acquisition process, and intelligent query interfaces are a trend for future development. The query interface should provide rich interaction so that, in the course of active interaction, the user's perception of media semantics can be expressed, query parameters and their combinations can be adjusted, and a satisfactory result finally obtained. The research mainly concerns how to transform the user's query expression into feature vectors on which retrieval can be performed, and how to capture the user's perception of content during interaction in order to select suitable retrieval features [6]. A relevance-feedback sketch is given after this list.
(4) Shot detection. After years of development, content-based video retrieval has made progress in shot detection and many different algorithms have been proposed, but imperfections remain, especially in detecting gradual transitions: because gradual transitions come in many complex forms, much work remains before they can be detected completely and accurately. A sketch of the classic twin-comparison approach is given after this list.
(5) Human-computer interaction. The ultimate purpose of a video retrieval system is to provide a convenient retrieval platform, so a user-friendly human-computer interaction platform is essential. Diverse input methods, flexible interaction and an effective feedback mechanism must all be considered; how well a retrieval system handles human-computer interaction is an important aspect of its overall performance, and much work remains to be done in this area.
(6) Performance evaluation metrics. At present there is no uniform standard for evaluating the performance of video retrieval systems, and such an evaluation should take into account all the capabilities a system has or should have. For a content-based video retrieval system, not only the search function matters; browsing, organization, data mining and other capabilities are also important, so the evaluation must be comprehensive. This topic is itself becoming a research hotspot, and there is much work worth doing.
(7) Retrieval in the compressed domain. Video compression technology is developing rapidly; compression technology, with HDTV as its current flagship, is tightly coupled with the market and its influence keeps growing. After video data is compressed, most redundant information is removed and what remains is exactly the information that reflects the characteristics of the video. Research in this area can proceed in two directions: mining the video content analysis that existing compression algorithms can already support, so that most video analysis can be done directly on the compressed stream; and developing new compression algorithms for video retrieval applications, so that compressed video can directly reflect video content characteristics and semantic patterns. A small compressed-domain example is given after this list.
(8) Web-based retrieval. The rapid development and wide application of the network not only promote the application of visual information retrieval but also pose new challenges to it. Web-based text retrieval is essentially mature, with Baidu and Google as its typical representatives, but text is only a small part of Web resources; the most meaningful and most expressive resources on the network are visual information such as images and video. Retrieval technology for visual information, however, is still very immature, and there are no mature products yet.
(9) Semantic-based retrieval. Today's visual retrieval systems mostly describe image content with text or low-level image features, and traditional low-level-feature description models generally describe an image in the form of statistics; in fact, these statistics differ greatly from how people understand image content. First, people's understanding of images does not rely only on statistics, since people can also learn; second, image content is inherently fuzzy and cannot be expressed with a simple feature vector; finally, the understanding of video information rests on human knowledge, which low-level features cannot capture. How to describe the content of visual information so that it matches human understanding of images as closely as possible is therefore both the key to and the difficulty of image retrieval. From the perspective of human cognition, information is described and understood mainly at the semantic level, so how to incorporate semantic features into retrieval systems to improve their performance is attracting more and more attention.
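For item (2), a minimal sketch of late fusion across features or media: each modality contributes its own similarity scores, which are normalised and combined by a weighted sum. The weights are illustrative, not prescribed values.

```python
# Minimal sketch of weighted late fusion of similarity scores from
# several features or modalities.
import numpy as np

def fuse_scores(score_lists, weights):
    """score_lists: list of (N,) similarity arrays, one per feature/modality."""
    fused = np.zeros_like(score_lists[0], dtype=float)
    for scores, w in zip(score_lists, weights):
        lo, hi = scores.min(), scores.max()
        s = (scores - lo) / (hi - lo + 1e-9)   # normalise each modality to [0, 1]
        fused += w * s
    return fused

# e.g. fused = fuse_scores([color_scores, texture_scores, audio_scores],
#                          weights=[0.5, 0.3, 0.2])   # illustrative weights
```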
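For item (3), a minimal sketch of relevance feedback in the query-by-example setting: a Rocchio-style update that moves the query feature vector towards examples the user marked relevant and away from those marked non-relevant. This is a generic textbook formulation, not a method prescribed by the systems cited above; the alpha/beta/gamma weights are conventional illustrative values.

```python
# Minimal sketch of Rocchio-style relevance feedback on feature vectors.
import numpy as np

def rocchio_update(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """query: (D,); relevant, nonrelevant: (k, D) arrays of user-marked examples."""
    new_q = alpha * query
    if len(relevant):
        new_q += beta * relevant.mean(axis=0)       # move towards relevant examples
    if len(nonrelevant):
        new_q -= gamma * nonrelevant.mean(axis=0)   # move away from non-relevant ones
    return new_q
```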
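For item (4), a deliberately simplified sketch of the classic twin-comparison idea for gradual transitions: a high threshold detects hard cuts, a lower threshold marks the possible start of a gradual change, and the accumulated difference decides whether a transition really occurred. The input is assumed to be a precomputed list of frame-to-frame histogram differences, and both thresholds are illustrative.

```python
# Minimal, simplified twin-comparison sketch for cut and gradual-transition
# detection over precomputed frame-to-frame differences.
def twin_comparison(diffs, t_high=0.5, t_low=0.15):
    cuts, graduals = [], []
    start, acc = None, 0.0
    for i, d in enumerate(diffs):
        if d >= t_high:                      # abrupt change -> hard cut
            cuts.append(i)
            start, acc = None, 0.0
        elif d >= t_low:                     # possible gradual transition in progress
            if start is None:
                start, acc = i, 0.0
            acc += d
        else:                                # change has settled down
            if start is not None and acc >= t_high:
                graduals.append((start, i))  # accumulated change was large enough
            start, acc = None, 0.0
    return cuts, graduals
```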
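For item (7), a small example of one well-known compressed-domain trick: in 8x8 block DCT coding (as in MPEG I-frames), the DC coefficient is proportional to the block mean, so a low-resolution "DC image" can be built and compared across frames without full decompression. The input is assumed to be an already dequantised array of DCT blocks, with the standard JPEG/MPEG orthonormal DCT scaling.

```python
# Minimal sketch: build a "DC image" thumbnail from 8x8 DCT blocks and
# compare two frames without full decompression.
import numpy as np

def dc_image(dct_blocks):
    """dct_blocks: (rows, cols, 8, 8) dequantised DCT coefficients.
    Returns a (rows, cols) thumbnail whose pixels are the block means
    (DC term of an 8x8 JPEG/MPEG-style DCT = 8 * block mean)."""
    return dct_blocks[:, :, 0, 0] / 8.0

def dc_frame_difference(dct_a, dct_b):
    """Mean absolute difference between the DC images of two frames."""
    return float(np.mean(np.abs(dc_image(dct_a) - dc_image(dct_b))))
```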