In early 2016, while I was in a paper-reading phase, I came across the term Artificial Intelligence for the first time; back then I only thought the phrase was hard to spell and sounded impressively cool. By 2017, artificial intelligence had become the media's buzzword of the year and had even driven up a batch of AI concept stocks, such as iFLYTEK, whose P/E ratio currently stands at 365.29 (as of 2017/12/29 15:00:00, Beijing time). Over the April and May holidays of 2017, I listened to half of Andrew Ng's Machine Learning course and gave up halfway. Having twice missed a good chance to learn about a new technology, perhaps even a new technological revolution, I've decided to spend the first day of 2018 going through Fei-Fei Li's TED talk on Computer Vision; it has also been a long time since I last practiced English dictation.
Let me show you something…
some pictures are shown, and a girl is describing what the pictures show…
the boy is …
those are the …
that’s a big airplane …
This is a three-year-old child describing what she sees in a series of photos. She may still have a lot to learn about this world, but she's already an expert at one very important task: to make sense of what she sees. Our society is more technologically advanced than ever. We send people to the moon, we make phones that talk to us or customize radio stations that play only the music we like. Yet our most advanced machines and computers still struggle at this task. So I'm here today to give you a progress report on the latest advances in our research in computer vision, one of the most frontier and potentially revolutionary technologies in computer science. Yes, we have prototype cars that can drive by themselves, but without smart vision, they cannot really tell the difference between a crumpled paper bag on the road, which can be run over, and a rock that size, which should be avoided. We have made fabulous megapixel cameras, but we have not delivered sight to the blind. Drones can fly over massive land, but don't have enough vision technology to help us track the changes of the rainforests. Security cameras are everywhere, but they do not alert us when a child is drowning in a swimming pool. Photos and videos are becoming an integral part of global life. They're being generated at a pace that's far beyond what any human, or teams of humans, could hope to view. And you and I are contributing to that at this TED. Yet our most advanced software is still struggling at understanding and managing this enormous content. So in other words, collectively as a society, we're very much blind, because our smartest machines are still blind.
"Why is this so hard?" you may ask. Cameras can take pictures like this one by converting light into a two-dimensional array of numbers known as pixels, but these are just lifeless numbers. They do not carry meaning in themselves. Just as to hear is not the same as to listen, to take pictures is not the same as to see, and by seeing, we really mean understanding.
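(A quick aside from me: to see just how "lifeless" those numbers are, here is a minimal Python sketch that loads a picture and prints the raw pixel array a camera produces. It assumes Pillow and NumPy are installed, and "photo.jpg" is a hypothetical file, not anything from the talk.)

```python
# A minimal sketch of "a picture is just a 2D array of numbers".
# Assumes Pillow and NumPy are installed; "photo.jpg" is a hypothetical file.
from PIL import Image
import numpy as np

img = Image.open("photo.jpg").convert("L")  # open and convert to grayscale
pixels = np.array(img)                      # 2D array: height x width

print(pixels.shape)    # e.g. (480, 640)
print(pixels[0, :10])  # first ten numbers of the top row, each in 0..255
# These integers are the "lifeless numbers": nothing in them says
# "cat" or "airplane". Seeing means recovering meaning from them.
```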
In fact, it took Mother Nature 540 million years of hard work to do this task, and much of that effort went into developing the visual processing apparatus of our brains, not the eyes themselves. So vision begins with the eyes, but it truly takes place in the brain.
So for fifteen years now, starting from my Ph.D. at Caltech and then leading Stanford's Vision Lab, I've been working with my mentors, collaborators and students to teach computers to see. Our research field is called computer vision and machine learning. It's part of the general field of artificial intelligence.
So ultimately, we want to teach the machines to see just like we do: naming objects, identifying people, inferring 3D geometry of things, understanding relations, emotions, actions and intentions. You and I weave together entire stories of people, places and things the moment we lay our gaze on them. The first step towards this goal is to teach a computer to see objects, the building block of the visual world. In the simplest terms, imagine this teaching process as showing the computer some training images of a particular object, let's say cats, and designing a model that learns from these training images. How hard can this be? (I've appended a toy sketch of this teaching process below.) The answer will be shown tomorrow, haha.
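(My own toy sketch of the teaching process described above: show the computer labeled training images and fit a model on them. This is just an illustration, not what the speaker's lab actually does; the folders cats/ and not_cats/, the file mystery.jpg, and the 32x32 grayscale features are my own assumptions, and it needs Pillow, NumPy and scikit-learn.)

```python
# A toy version of "show the computer training images of cats and design a
# model that learns from them". Hypothetical folders: cats/ and not_cats/.
import glob
import numpy as np
from PIL import Image
from sklearn.linear_model import LogisticRegression

def load_images(pattern, label):
    """Load images matching a glob pattern as flat 32x32 grayscale vectors."""
    X, y = [], []
    for path in glob.glob(pattern):
        img = Image.open(path).convert("L").resize((32, 32))
        X.append(np.asarray(img, dtype=np.float32).ravel() / 255.0)
        y.append(label)
    return X, y

X_cat, y_cat = load_images("cats/*.jpg", 1)          # training images of cats
X_other, y_other = load_images("not_cats/*.jpg", 0)  # counter-examples

X = np.array(X_cat + X_other)
y = np.array(y_cat + y_other)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)  # the "learning from training images" step

# Ask the model about a new photo it has never seen (hypothetical file).
new = Image.open("mystery.jpg").convert("L").resize((32, 32))
features = np.asarray(new, dtype=np.float32).ravel() / 255.0
print("cat" if model.predict(features.reshape(1, -1))[0] == 1 else "not a cat")
```

Of course, raw pixels fed to a linear model are nowhere near enough for real seeing, which is exactly the point of the question "how hard can this be?"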