Devi Parikh, research scientist at Facebook AI Research (FAIR), in an exclusive conversation with Shannon Technology




Devi Parikh, research scientist at Facebook AI Research (FAIR), is the winner of the 2017 IJCAI Computers and Thought Award (one of IJCAI's two most important awards, sometimes described as the "Fields Medal" of the international AI field) and was named to Forbes' 2017 list of 20 leading women in AI research. Her work centers on computer vision and pattern recognition, spanning computer vision, language and vision, commonsense reasoning, artificial intelligence, human-machine collaboration, contextual reasoning, and pattern recognition.



Since 2008, Devi Parikh has published numerous papers at the three top computer vision conferences (ICCV, CVPR, ECCV). The Visual Question Answering (VQA) dataset project she led has attracted wide attention, and the VQA Challenge and VQA Workshop she organized at CVPR 2016 greatly advanced research on machine understanding of images; this work contributed to her receiving the 2016 National Science Foundation CAREER Award. Her recent research focuses on the intersection of vision, natural language processing, and reasoning, with the goal of building more intelligent systems through human-machine interaction.



Shannon Technology: You and your team developed the Visual Question Answering dataset (VQA; Antol et al., ICCV 2015; Agrawal et al., IJCV 2017), which has greatly contributed to the development of this field. The dataset touches on computer vision, natural language processing, commonsense reasoning, and many other areas. How do you assess the impact the VQA dataset has had so far? Has it achieved the purpose you had in mind when you built it? How do you expect the VQA dataset (and its later versions) to affect the field in the coming years?



Devi and Aishwarya:



Impact of the VQA dataset:



Our work on VQA received extensive attention shortly after its release: it has been cited by more than 800 papers, and the VQA papers (Antol et al., ICCV 2015; Agrawal et al., IJCV 2017) won the Best Poster Award at the "Object Cognition in Dialogue" workshop at ICCV 2015.



To assess progress on VQA, we organized the first VQA Challenge and the first VQA Workshop at the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), using the first version of the VQA dataset (Antol et al., ICCV 2015; Agrawal et al., IJCV 2017). Both were popular: about 30 teams from 8 countries, spanning academia and industry, participated in the challenge. Over the course of the challenge, accuracy on VQA rose from 58.5% to 67%, an improvement of 8.5 percentage points.



Figure 1. Sample questions and answers from the VQA dataset.



The VQA v1 dataset and the VQA Challenge not only drove improvements to the original solutions but also spawned a new batch of models and datasets. Examples include models that use spatial attention to focus on the image regions relevant to the question (Stacked Attention Networks, Yang et al., CVPR 2016); models that attend jointly, in a hierarchical way, to both the image and the question (Hierarchical Question-Image Co-Attention, Lu et al., NIPS 2016); models that dynamically compose specialized modules, each dedicated to a sub-task such as color classification (Neural Module Networks, Andreas et al., CVPR 2016); and models that use bilinear pooling to combine visual and linguistic features into richer representations (Multimodal Compact Bilinear Pooling, Fukui et al., EMNLP 2016).
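To make the attention-based line of work above concrete, here is a minimal sketch (not the authors' code) of a single question-guided spatial attention step in the spirit of Stacked Attention Networks; the class name, layer sizes, and feature dimensions are illustrative assumptions.

```python
# A minimal sketch of question-guided spatial attention over image regions,
# in the spirit of Stacked Attention Networks (Yang et al., CVPR 2016).
# Layer sizes, the class name, and feature dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, img_feats, q_feat):
        # img_feats: (batch, num_regions, img_dim) CNN region features
        # q_feat:    (batch, q_dim) question embedding, e.g. from an LSTM
        joint = torch.tanh(self.img_proj(img_feats) + self.q_proj(q_feat).unsqueeze(1))
        attn = F.softmax(self.score(joint).squeeze(-1), dim=1)  # (batch, num_regions)
        attended = (attn.unsqueeze(-1) * img_feats).sum(dim=1)  # weighted region sum
        return attended, attn

# Stacking two such layers, feeding the attended feature back into the question
# representation, lets the model "look at the image twice", as in SAN.
```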



In addition, VQA has spawned a number of new datasets and accompanying models, including a diagnostic dataset focused on visual reasoning and linguistic compositionality (CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, Johnson et al., CVPR 2017); C-VQA, a compositional split of the VQA v1.0 dataset for studying compositionality in language (Agrawal et al., arXiv 2017); and a VQA dataset that requires models to overcome prior linguistic knowledge and actually look at the image to answer the question (Agrawal et al., CVPR 2018).



In short, our work on VQA has created a new multidisciplinary sub-field of artificial intelligence. In fact, shortly after the dataset was released, VQA became a new option in the list of subtopics to choose from when submitting papers to some of the major AI conferences.



Whether the original goals of developing VQA have been realized:



Although the VQA community has made great strides in improving model performance (prediction accuracy on the VQA v2 dataset rose from 54% to 72% in three years), we still have a long way to go before the VQA task is fully solved. Existing VQA models still lack many necessary abilities, such as visual grounding, compositionality, and commonsense reasoning, and these abilities are at the core of solving VQA.



When we developed the dataset, we expected generalization to be a challenge, because it is hard to expect a model trained on the training set to extend well to the test set: at test time, the model may face any open-ended question about an image, quite likely one unlike anything it saw during training. We expected researchers to try to use external knowledge to handle such questions, but there has been little work in that direction so far. We have seen some initial progress in that area (e.g., Narasimhan et al., ECCV 2018; Wang et al., PAMI 2017) and hope to see more in the future.



Expected future impact of the VQA dataset:



We hope the VQA dataset will have both a direct and an indirect impact on the field. The direct impact is that we expect more new models and techniques in the coming years that further improve prediction accuracy on the VQA v1 and v2 datasets. The indirect impact is that we hope more new datasets and tasks will be developed, such as CLEVR (Johnson et al., CVPR 2017), compositional VQA (Agrawal et al., arXiv 2017), VQA that requires overcoming prior linguistic knowledge (Agrawal et al., CVPR 2018), image-grounded dialogue (Das et al., CVPR 2017), and Embodied Question Answering (Das et al., CVPR 2018), which requires an embodied agent. These are either built directly on top of the VQA dataset or constructed to address the limitations of existing VQA systems. We therefore expect the VQA dataset (and its variants) to further push the capabilities of AI systems toward systems that understand images and language, generate natural language, take actions, and reason.



Shannon Technology: Recently, your team released the second version of VQA (Goyal et al., CVPR 2017), which contains pairs of similar images that have different answers to the same question. Such a dataset is more challenging, and in general, building more challenging datasets forces models to encode more useful information. However, constructing such a dataset consumes a lot of human labor. Is it possible to generate confusing or adversarial examples in an automated way, and thereby push the predictive power of models to a new level?



Figure 2. Example questions from the VQA v2 dataset: each question is paired with two similar but different images that require different answers. From Goyal et al., CVPR 2017.



Devi, Yash, and Jiasen: Building large datasets is indeed labor-intensive. There is some existing work that automatically generates new question-answer pairs from existing annotations. For example, Mahendru et al. (EMNLP 2017) use a template-based approach to generate new questions and answers about basic everyday concepts from the premises of questions in the VQA training set. They found that adding these simple new question-answer pairs to the VQA training data can improve model performance, especially with respect to language compositionality.
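As a rough illustration of the template-based idea, here is a minimal sketch; the premise extraction and the two templates are illustrative assumptions, not the exact rules from Mahendru et al. (EMNLP 2017).

```python
# A minimal sketch of template-based premise QA generation: take a simple
# (object, attribute) premise implied by an existing training question and
# turn it into new binary question-answer pairs. Templates are illustrative.

def premise_to_qa(obj, attribute=None):
    qa_pairs = [(f"Is there a {obj} in the image?", "yes")]
    if attribute is not None:
        qa_pairs.append((f"Is the {obj} {attribute}?", "yes"))
    return qa_pairs

# Example: "What is the brown dog chasing?" presupposes a dog that is brown.
print(premise_to_qa("dog", "brown"))
# [('Is there a dog in the image?', 'yes'), ('Is the dog brown?', 'yes')]
```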



Within data augmentation, generating questions that are relevant to a given image is also an important topic. Unlike the template-based approach above, this line of work produces more natural questions. However, these models are far from mature and cannot provide answers to the questions they generate, so automatically generating accurate questions and answers for images is still very difficult. Semi-supervised learning and adversarial example generation may offer some useful ideas for this problem.



It is worth noting that one of the early image question-answering datasets was the Toronto COCO-QA dataset developed by Mengye Ren and others in 2015. They used natural language processing tools to automatically convert image captions into question-answer pairs. While such questions and answers often contain strange artifacts, converting the annotations of one task (in this case, captions) into those of another related task (in this case, question answering) is an excellent approach.
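For illustration, here is a toy, rule-based sketch of the caption-to-question idea; the real COCO-QA pipeline used a syntactic parser and several question types, whereas this version matches only one hypothetical sentence pattern.

```python
# A toy, rule-based sketch of converting an image caption into a QA pair,
# loosely in the spirit of the automatic pipeline behind Toronto COCO-QA
# (Ren et al., 2015). Only one "<subject> is <verb>ing <prep> the <place>"
# pattern is handled here, purely for illustration.
import re

def caption_to_qa(caption):
    # e.g. "A cat is sitting on the sofa." -> ("What is sitting on the sofa?", "cat")
    m = re.match(r"an? (\w+) is (\w+ing) (on|in|under|near) the (\w+)",
                 caption.lower().rstrip("."))
    if m is None:
        return None
    subject, verb, prep, location = m.groups()
    return (f"What is {verb} {prep} the {location}?", subject)

print(caption_to_qa("A cat is sitting on the sofa."))
# ('What is sitting on the sofa?', 'cat')
```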



Shannon Technology: In addition to the VQA task, you developed an image-based dialogue dataset, the Visual Dialog dataset (Das et al., CVPR 2017, Spotlight). To collect it, you paired two participants on Amazon Mechanical Turk (a widely used crowdsourcing platform for data annotation): one participant sees the image and its caption, while the other sees only the caption and must ask the first participant questions in order to better imagine the scene in the image. This dataset gives us a clear picture of which information in an image people consider most worth asking about. Do you think that pre-training a model to guess which questions people might ask, so that it has a more human-like attention mechanism, could improve its question-answering ability?



Figure 3. The image-based dialogue task: a chatbot must hold a conversation with a person about the content of an image. Example from Das et al., CVPR 2017.



Devi and Abhishek: There are some regularities in the questions in these dialogues: conversations always begin with the most salient objects and their attributes (such as people, animals, and large objects) and end with questions about the environment (for example, "What else is in the image?" or "What's the weather like?"). If we can train models to ask questions that distinguish between similar images, and to provide answers that let the questioner guess which image is being discussed, we can build better visual dialogue models. Das & Kottur et al., ICCV 2017 presents some work along these lines.



Shannon Technology: Compositionality is a classic problem in natural language processing. You and your colleagues have studied evaluating and improving the compositionality of VQA systems (Agrawal et al., 2017). A promising direction is combining symbolic methods with deep learning (for example, Lu et al., CVPR 2018, Spotlight). Can you talk about why neural networks generally fail to generalize systematically, and how we might address this problem?



Figure 4. Examples from the compositional VQA dataset (C-VQA). The word combinations that appear in the test set were never seen by the model during training, although each individual word does appear in the training set. From Agrawal et al., 2017.



Devi and Jiasen: We think one reason is that these models lack common sense: knowledge of how the world works, of what is predictable and what is not. Such knowledge is key to how humans can learn from a few examples and still make reasonable decisions in unexpected situations. Today's neural networks are closer to pattern-matching algorithms: they are good at extracting complex correlations between inputs and outputs from the training data, but in a sense that is all they can do. Methods for incorporating external knowledge into neural networks are still scarce.



Shannon Technology: Your work has gone beyond combining vision and language and expanded into integrating more modalities. In your recent paper "Embodied Question Answering" (Das et al., CVPR 2018), you introduced a task that involves active perception, language understanding, goal-driven navigation, commonsense reasoning, and grounding language in actions. This is a very attractive direction: it is more realistic and more closely related to robotics. One challenge in this setting is adapting quickly to new environments. Do you think a model trained in 3D indoor environments (such as the model in your paper) can quickly adapt to other settings, such as outdoor environments? Do we need to build meta-learning capabilities into the model to achieve rapid adaptation?



Figure 5. In the Embodied QA task, an agent answers questions by exploring its surrounding 3D environment. To accomplish this, it must combine natural language processing, visual reasoning, and goal-driven navigation. From Das et al., CVPR 2018.



Devi and Abhishek: The current models cannot be extended to outdoor environments. What these systems learn is closely tied to the specific distribution of images and environments they see during training. So while some generalization to new indoor environments is possible, they simply have not seen enough examples of outdoor environments during training. For example, indoors, wall structure and depth give clues about which paths are traversable; outdoors, the type of surface (for example, road versus lawn) may be more relevant to whether the agent can pass, and depth less so.



Even indoors, generalizing from synthetic 3D rooms to more realistic environments is not a completely solved problem. Meta-learning methods would certainly help models extend better to new tasks and environments. We are also thinking about building modular systems that separate perception from navigation, so that in a new environment one only needs to relearn the perception module, mapping the visual input of the new (for example, more realistic) environment into the feature space the planning module is already familiar with.
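Here is a minimal sketch of that modular idea: a perception module maps raw observations into a shared feature space, a planner acts on that space, and only the perception module is retrained when the visual domain changes. The class names, layer sizes, and training setup are illustrative assumptions, not the EmbodiedQA codebase.

```python
# A minimal sketch of separating perception from planning: only the perception
# module is fine-tuned when moving to a new visual domain, while the planner
# keeps acting on the shared feature space it already understands.
import torch
import torch.nn as nn

class PerceptionModule(nn.Module):
    """Encodes environment-specific observations into a shared feature space."""
    def __init__(self, obs_channels=3, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(obs_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, obs):
        return self.encoder(obs)

class NavigationPlanner(nn.Module):
    """Chooses actions from shared features; stays fixed across environments."""
    def __init__(self, feat_dim=128, num_actions=4):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, feat):
        return self.policy(feat)

perception, planner = PerceptionModule(), NavigationPlanner()

# Adapting to a new (e.g. more photorealistic) environment: freeze the planner
# and fine-tune only the perception module.
for param in planner.parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(perception.parameters(), lr=1e-4)
```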



Shannon Technology: You have a series of papers studying the premises of questions in the VQA task (Ray et al., EMNLP 2016; Mahendru et al., EMNLP 2017), and your research found that forcing the VQA model to judge during training whether a question's premise holds improves the model's ability to generalize compositionally. There seems to be a common trend in NLP of using auxiliary tasks to improve performance on a main task, but not every auxiliary task necessarily helps. Could you tell us how to find useful auxiliary tasks?



Figure 6. VQA questions often contain implicit premises that hint at information in the image, so Mahendru et al. constructed the Question Relevance Prediction and Explanation (QRPE) dataset. The examples shown here, from Mahendru et al., EMNLP 2017, illustrate false premises detected by their model.



Devi and Viraj: In the 2017 paper from our lab by Mahendru et al., the goal is to make the VQA model smarter about answering irrelevant or previously unseen questions by reasoning about whether a question's premise holds. We had a hunch that expanding the dataset in this way might help the model disentangle objects from their attributes, which is the essence of the compositionality problem, and experiments later confirmed this. More broadly, we have now seen many examples of this kind of cross-task transfer learning: the decaNLP challenge, which spans multiple tasks such as question answering, machine translation, and goal-oriented dialogue; or training a model on 3D reconstruction from RGB, semantic segmentation, and depth estimation to build a powerful vision system for embodied agents (Das et al., 2018). It of course also includes the widely used recipe of pre-training on ImageNet and then fine-tuning on a specific task. All of this shows that representations learned under multi-task training can transfer very effectively, even across quite different tasks. But it has to be admitted that finding meaningful auxiliary tasks is still more of an art than a science.
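As a concrete illustration of training with an auxiliary task, here is a minimal sketch that adds a premise/relevance head next to the main answer head and sums the two losses; the shared fused feature, head sizes, random tensors, and the 0.5 weighting are illustrative assumptions.

```python
# A minimal sketch of a main VQA objective plus an auxiliary objective: a shared
# fused feature feeds both an answer classifier (main task) and a premise /
# relevance classifier (auxiliary task), and the two losses are summed.
import torch
import torch.nn as nn

class MultiTaskVQAHead(nn.Module):
    def __init__(self, joint_dim=512, num_answers=3000):
        super().__init__()
        self.answer_head = nn.Linear(joint_dim, num_answers)  # main task
        self.relevance_head = nn.Linear(joint_dim, 2)         # auxiliary task

    def forward(self, joint_feat):
        return self.answer_head(joint_feat), self.relevance_head(joint_feat)

model = MultiTaskVQAHead()
criterion = nn.CrossEntropyLoss()

joint_feat = torch.randn(8, 512)              # stand-in for fused image+question features
answer_labels = torch.randint(0, 3000, (8,))
relevance_labels = torch.randint(0, 2, (8,))  # does the question's premise hold?

answer_logits, relevance_logits = model(joint_feat)
loss = criterion(answer_logits, answer_labels) \
       + 0.5 * criterion(relevance_logits, relevance_labels)
loss.backward()
```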



Shannon Technology: In recent years, the interpretability of deep learning models has received a lot of attention. You also have several papers interpreting visual question answering models, such as understanding which parts of the input the model attends to when answering, or comparing the model's attention to human attention (Das et al., EMNLP 2016; Goyal et al., ICML Workshop on Visualization for Deep Learning, Best Student Paper). Do you think improving the interpretability of deep neural networks can help us develop better deep learning models? If so, in what ways?



Figure 7. Explaining a model's prediction by identifying which parts of the input question the model attends to when answering (the heat map highlights the importance of each word in the question). For example, "whole" is the most critical word for the model to answer "no". From Goyal et al., ICML Workshop on Visualization for Deep Learning.



Devi and Abhishek: A passage from our Grad-CAM paper (Selvaraju et al., ICCV 2017) answers this question:



Broadly speaking, transparency/interpretability is useful at three different stages of AI evolution. First, when AI is significantly weaker than humans and not yet reliably deployable at scale (for example, visual question answering), the goal of transparency and interpretability is to identify why models fail, helping researchers focus on the most promising research directions. Second, when AI is on par with humans and can be deployed at scale (for example, image classification for specific categories, trained on enough data), the goal is to build users' trust in the model. Third, when AI is significantly stronger than humans (for example, at chess or Go), the goal of interpretability is machine teaching: having the machine teach people how to make better decisions.



Interpretability does help us improve deep neural network models. Some preliminary evidence we have found: if a VQA model is constrained to look for answers in the image regions that humans consider relevant to the question, the model is better grounded at test time and generalizes better to settings with different answer prior distributions (i.e., the VQA-CP datasets).



Interpretability can also often reveal the biases a model has learned, which lets system designers use better training data or take the necessary steps to correct the bias. One such experiment is reported in Section 6.3 of our Grad-CAM paper (Selvaraju et al., ICCV 2017). This shows that interpretability can help detect and eliminate biases in datasets, which matters not only for generalization: as more and more algorithms are applied to real-world problems, interpretability is also important for producing fair and ethical outcomes.
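For readers who want to try the kind of visualization discussed above, here is a minimal Grad-CAM sketch in the spirit of Selvaraju et al. (ICCV 2017): feature maps from the last convolutional block are weighted by the gradients of the predicted class score and combined into a coarse heat map. The choice of model, layer, and input below is illustrative; in practice you would load pretrained weights and a real preprocessed image.

```python
# A minimal Grad-CAM sketch: weight the last convolutional feature maps by the
# gradients of the target class score and combine them into a heat map.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50()  # illustrative; load pretrained weights in practice
model.eval()
activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["feat"] = output          # (1, C, H, W) feature maps

def bwd_hook(module, grad_input, grad_output):
    gradients["feat"] = grad_output[0]    # gradients w.r.t. the feature maps

model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

image = torch.randn(1, 3, 224, 224)       # stand-in for a preprocessed image
logits = model(image)
target_class = logits.argmax(dim=1).item()
logits[0, target_class].backward()        # backprop the chosen class score

weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)   # GAP over gradients
cam = F.relu((weights * activations["feat"]).sum(dim=1))     # weighted sum of maps
cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                    mode="bilinear", align_corners=False)
heatmap = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```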



Shannon Technology: In the past, you have done a lot of influential work, and published many widely cited papers. Can you share some advice with students who have just entered the NLP field and tell them how to develop good taste in research topics?



Devi: I'll pass on advice I heard from Jitendra Malik, professor of electrical engineering and computer science at UC Berkeley. We can think about research topics along two dimensions: importance and tractability. Some problems are solvable but not important, and some are important but nearly impossible to make progress on given where the field currently stands. Try to identify problems that are important and that you can (at least partly) solve. Of course, that is easier said than done, and there are other factors to consider beyond these two. For example, I have always been driven by curiosity to study questions I find interesting, but that may just be a useful first-order approximation of the two factors above.



References:



Antol S, Agrawal A, Lu J, et al. VQA: Visual Question Answering[C]. Proceedings of the IEEE International Conference on Computer Vision. 2015: 2425-2433.

Yang Z, He X, Gao J, et al. Stacked Attention Networks for Image Question Answering[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 21-29.

Lu J, Yang J, Batra D, et al. Hierarchical Question-Image Co-Attention for Visual Question Answering[C]. Advances in Neural Information Processing Systems. 2016: 289-297.

Andreas J, Rohrbach M, Darrell T, et al. Neural Module Networks[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 39-48.

Fukui A, Park D H, Yang D, et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding[J]. arXiv preprint arXiv:1606.01847, 2016.

Johnson J, Hariharan B, van der Maaten L, et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning[C]. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017: 1988-1997.

Vo M, Yumer E, Sunkavalli K, et al. Automatic Adaptation of Person Association for Multiview Tracking in Group Activities[J]. arXiv preprint arXiv:1805.08717, 2018.

Agrawal A, Kembhavi A, Batra D, et al. C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset[J]. arXiv preprint arXiv:1704.08243, 2017.

Agrawal A, Batra D, Parikh D, et al. Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4971-4980.

Das A, Kottur S, Gupta K, et al. Visual Dialog[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1080-1089.

Das A, Datta S, Gkioxari G, et al. Embodied Question Answering[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.

Goyal Y, Khot T, Summers-Stay D, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

Mahendru A, Prabhu V, Mohapatra A, et al. The Promise of Premise: Harnessing Question Premises in Visual Question Answering[C]. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2017: 926-935.

Ren M, Kiros R, Zemel R. Image Question Answering: A Visual Semantic Embedding Model and a New Dataset[J]. Advances in Neural Information Processing Systems, 2015.

Fang H S, Lu G, Fang X, et al. Weakly and Semi Supervised Human Body Part Parsing via Pose-Guided Knowledge Transfer[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 70-78.

Ray A, Christie G, Bansal M, et al. Question Relevance in VQA: Identifying Non-Visual and False-Premise Questions[J]. arXiv preprint arXiv:1606.06622, 2016.

Das A, Agrawal H, Zitnick L, et al. Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?[J]. Computer Vision and Image Understanding, 2017, 163: 90-100.

Goyal Y, Mohapatra A, Parikh D, et al. Towards Transparent AI Systems: Interpreting Visual Question Answering Models[J]. arXiv preprint arXiv:1608.08974, 2016.

Selvaraju R R, Cogswell M, Das A, et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization[C]. Proceedings of the International Conference on Computer Vision (ICCV). 2017: 618-626.


