The picture above is funny.
But for me it's also one of those examples that make me sad about the outlook for AI and for Computer Vision. What would it take for a computer to understand this image as you or I do? I challenge you to think explicitly of all the pieces of knowledge that have to fall in place for it to make sense. Here's my short attempt:
- You recognize it's an image of a bunch of people and you understand they are in a hallway.
- You recognize that there are 3 mirrors in the scene, so some of those people are "fake" replicas from different viewpoints.
- You recognize Barack Obama from just a few pixels. It helps that he is in his suit and that he is surrounded by other people in suits.
- You recognize that there's a person standing on a scale, even though the scale occupies only very few white pixels that blend with the background. But you've used the person's pose and knowledge of how people interact with objects to figure it out.
- You recognize that Obama has his foot positioned just slightly on top of the scale. Notice the language I'm using: it is in terms of the 3D structure of the scene, not the position of the leg in the 2D coordinate system of the image.
- You know how physics works: Obama is leaning in on the scale, which applies a force on it. A scale measures the force that is applied to it; that's how it works, so it will over-estimate the weight of the person standing on it.
- The person measuring his weight is not aware of Obama doing this. You derive this because you know his pose, you understand that the field of view of a person is finite, and you understand that he is not very likely to sense the slight push of Obama's foot.
- You understand that people are self-conscious about their weight. You also understand that he is reading off the scale measurement, and that shortly the over-estimated weight will confuse him because it will probably be much higher than what he expects. In other words, you reason about the implications of the events that are about to unfold seconds after this photo was taken, and especially about the thoughts and how they will develop inside people's heads. You also reason about what pieces of information are available to people.
- There are people in the back who find the person's imminent confusion funny. In other words, you are reasoning about the state of mind of people, and their view of the state of mind of another person. That's getting frighteningly meta.
- Finally, the fact that the perpetrator here is the president makes it maybe even a little funnier. You understand what actions are more or less likely to be undertaken by different people based on their status and identity.
I could go on, but the point here is that you've used a HUGE amount of information in that half second when you look at the picture and laugh. Information about the 3D structure of the scene, confounding visual elements like mirrors, identities of people, affordances and how people interact with objects, physics (how a particular instrument works, leaning and what that does), people, their tendency to be insecure about weight. You've reasoned about the situation from the point of view of the person on the scale: what he is aware of, what his intents are and what information is available to him. And you've reasoned about people reasoning about people. You've also thought about the dynamics of the scene and made guesses about how the situation will unfold in the next few seconds visually, how it will unfold in the thoughts of the people involved, and you've reasoned about how likely or unlikely it is for people of a particular identity/status to carry out some action. Somehow all these things come together to "make sense" of the scene.
It is mind-boggling that all of the above inferences unfold from a brief glance at a 2D array of R,G,B values. The core issue is that the pixel values are just the tip of a huge iceberg, and deriving the entire shape and size of the iceberg from prior knowledge is the most difficult task ahead of us. How can we even begin to go about writing an algorithm that can reason about the scene like I did? Forget for a moment the inference algorithm that is capable of putting all of this together; how do we even begin to gather data that can support these inferences (for example, how a scale works)? How do we go about even giving the computer a chance?
Now consider that the state-of-the-art techniques in Computer Vision are tested on things like ImageNet (the task of assigning 1-of-k labels to entire images), or the Pascal VOC detection challenge (+ include bounding boxes). There is also quite a bit of work on pose estimation, action recognition, etc., but it's all specific, disconnected, and only half works. I hate to say it, but the state of CV and AI is pathetic when we consider the task ahead, and when we think about how we can ever go from here to there. The road ahead is long, uncertain and unclear.
I've seen some arguments that all we need is lots more data from images, video, maybe text, and to run some clever learning algorithm: maybe a better objective function, run SGD, maybe anneal the step size, use Adagrad, or slap an L1 here and there, and everything will just pop out. If we only had a few more tricks up our sleeves! But to me, examples like this illustrate that we are missing many crucial pieces of the puzzle, and that a central problem will be as much about obtaining the right training data in the right form to support these inferences as it will be about making them.
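To make that list of "tricks" concrete, here is a minimal sketch of what they look like in code: per-example SGD with Adagrad step scaling and an L1 penalty slapped on, fitted to a tiny made-up regression problem. Everything in it is an assumption for illustration (the toy data, the hyperparameters, and the hypothetical names `adagrad_l1_sgd` and `make_toy_data`); it is not taken from any particular vision system.

```python
import math
import random


def adagrad_l1_sgd(data, dim, steps=3000, lr=0.5, l1=0.01, eps=1e-8, seed=0):
    """Fit linear weights with per-example SGD, Adagrad scaling and an L1 subgradient."""
    rng = random.Random(seed)
    w = [0.0] * dim
    g2 = [0.0] * dim  # running sum of squared gradients, one entry per weight
    for _ in range(steps):
        x, y = rng.choice(data)  # sample one training example
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i in range(dim):
            sign = (w[i] > 0) - (w[i] < 0)  # subgradient of |w_i|
            grad = err * x[i] + l1 * sign
            g2[i] += grad * grad
            # Adagrad: the effective step shrinks as squared gradients accumulate
            w[i] -= lr * grad / (math.sqrt(g2[i]) + eps)
    return w


def make_toy_data(n=200, seed=1):
    """Hypothetical toy data: y depends on the first two features only."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        y = 2.0 * x[0] - 1.0 * x[1]  # the third feature is irrelevant
        data.append((x, y))
    return data


weights = adagrad_l1_sgd(make_toy_data(), dim=3)
print(weights)
```

On a problem like this the weights should land near 2, -1 and 0. Nothing in the loop knows anything about scales, mirrors or presidents; whatever it learns has to already be present in the `(x, y)` pairs it is fed.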
Thinking about the complexity and scale of the problem further, a seemingly inescapable conclusion for me is that we may also need embodiment, and that the only way to build computers that can interpret scenes like we do is to allow them to get exposed to all the years of (structured, temporally coherent) experience we have, the ability to interact with the world, and some magical active learning/inference architecture that I can barely even imagine when I think backwards about what it should be capable of.
In any case, we are very, very far and this depresses me. What is the way forward? :( Maybe I should just do a startup. I have a really cool idea for a mobile local social iPhone app.
From: http://karpathy.github.io/2012/10/22/state-of-computer-vision/
The state of Computer Vision and AI: we are really, really far away.