What is in this picture?
When light from this scene reaches your eye, your brain immediately recognizes a bunch of things: several deer, a car, leafy bushes, and grass. The grass and bushes are a similar color but aren’t the same thing. Your brain might even give you a feeling of what the flowers on the bush smell like.
Evidently, your brain is doing some significant processing, leaning on built-in mechanisms and visual information to give you a rich experience of the scene before you. A computer has nothing like this, and one of the great projects of AI research is to fill that gap.
In this quiz, we will take a closer look at computer vision.
Being on the winning end of half a billion years of animal-vision evolution, most of us don't appreciate what’s so difficult about enabling computers to “see.” It’s not even clear what it means for a computer to see at all.
One of the things we’ll have to get used to is that computers can only think in numbers. Whether it’s a photograph, or a song, or the Iliad, to the computer they’re all represented by lists of numbers.
When light enters the computer’s camera, it hits the pixels and the light intensity is measured across the entire scene. Saved in memory, the autumnal scene from the last pane looks more like this:
…a sight less scenic.
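To make this concrete, here's a minimal sketch of what an image looks like once it's in a computer's memory: nothing but a grid of numbers. The 4×4 patch and its intensity values below are made up for illustration.

```python
import numpy as np

# A made-up 4x4 grayscale patch: each entry is one pixel's measured
# light intensity, from 0 (black) up to 255 (white).
patch = np.array([
    [ 12,  40,  41,  10],
    [ 38, 201, 198,  35],
    [ 40, 204, 199,  37],
    [ 11,  39,  42,   9],
], dtype=np.uint8)

print(patch.shape)   # (4, 4) -- sixteen numbers, nothing more
print(patch.max())   # the single brightest pixel's intensity
```

Whether the scene holds deer, a car, or bushes, this grid of intensities is all the computer gets to work with.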
Before we carry on with computer vision, it’s important to build empathy for our AI brethren by laying bare the superpowered goggles of evolution.
To start, let’s look at a small region of a photograph (a tiny fraction of its pixels) that's been blown up so we can see the individual pixels. Using nothing but your standard visual system, do your best to identify what’s in the following snippet of the photograph.
You’re right, without color this is a little too tough. After all, most image files on a computer store color information.
So here’s the photo in all its color pixel glory.
Can you figure out what it is?
Here’s the picture that the snippet is from:
Even if you got it right, you might be surprised at how different the same set of pixels looks now. Not only can you see that it’s an SUV driving across a bridge, the photo probably even seems to display the car at a higher resolution than before.
In fact, the SUV has the same number of pixels.
With the full image, your vision system is clued in by the context of the scene to realize that the little patches are most likely cars. After that, all bets are off as to which missing details it will automatically fill in for you.
In the previous question, had the patch of pixels that turned out to be a car been placed in an image of a nebula or the reflecting surface of a lake, you may have interpreted the patch as background stars or a wispy cloud.
This suggests that your brain is doing more than matching visual information to some internal visual dictionary. Every image you see is intermingled with your previous experiences of seeing. This has been, until recently, one of the biggest hurdles to computer vision: how do you encode this kind of experience for a computer?
The fact that what we see is influenced by context is unmasked by so-called "optical illusions." These incongruous experiences provide a way to start peeling back the layers of processing within the human vision system.
Which circle is shaded a darker gray?
Equally impressive are the things that the vision system chooses to simplify or transform.
In the animation below, a ring of magenta spots blink one at a time so that it looks like a hole is moving in a circle. Stare at the black cross and keep track of how the magenta spots appear over time.
What color does the empty spot turn?
Now that we have a sense of the head start we have from our vision system and the difficulty of seeing in pixels, let’s turn to how we might process pixels to identify an object in an image, with an eye toward programming a computer to perform this task.
To start, let’s think about what happens to a scene as the amount of ambient light changes. The number grids above correspond to the pixel values for a black-and-white photo of an apple. Higher values are pixels that are lighter, and lower values are darker.
The number grids below show the same patch of pixels after a change to the apple or its environment. Which of the following could be its pixels after a change in the level of light in the room?
The effect of ambient light highlights an important principle — the information about an object’s identity isn’t encoded in the absolute value of its pixel intensities. As we vary the level of light in a room, all of the pixel values will shift up and down, so we should be careful when we use rules that depend on the absolute intensities.
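A minimal sketch of this effect, assuming a uniform brightness change simply adds the same constant to every pixel (the 3×3 apple patch values are made up):

```python
import numpy as np

# Made-up 3x3 patch from a grayscale photo of an apple.
apple = np.array([
    [ 60,  62,  61],
    [ 63, 120, 119],
    [ 61, 118, 121],
])

# Turning the room lights up shifts every pixel by the same amount:
# the absolute values all change...
brighter = apple + 30
assert np.array_equal(brighter - apple, np.full((3, 3), 30))

# ...but the differences BETWEEN neighboring pixels don't, so the
# apple's structure is carried by relative, not absolute, intensities.
assert np.array_equal(np.diff(brighter, axis=1), np.diff(apple, axis=1))
```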
For this reason, it’s common to “center” images before processing them by calculating the average pixel intensity and subtracting it from every pixel. Once this is done, the image only contains information about relative changes in the image.
This allows our rules to ignore uninformative effects like overall brightness and focus on the patterns in the pixels.
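Centering can be sketched in a few lines of NumPy. The patch values here are made up, and we assume the lighting change is a uniform shift:

```python
import numpy as np

# Made-up grayscale patch, and the same patch under brighter light.
apple = np.array([[ 60.,  62.,  61.],
                  [ 63., 120., 119.],
                  [ 61., 118., 121.]])
brighter = apple + 30.0   # same scene, more ambient light

def center(img):
    """Subtract the mean intensity so only relative changes remain."""
    return img - img.mean()

# After centering, the two lighting conditions are indistinguishable.
assert np.allclose(center(apple), center(brighter))
```

The centered image always has mean zero, which is what puts photos taken under different lighting onto equal footing.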
We subtracted the same number from every pixel in the image to center it. Though this helps to put photos taken in different conditions onto equal footing, it doesn't get us any closer to understanding the image.
But we can use the same sort of pixel arithmetic to identify simple patterns in images. We just have to add and subtract pixels that are in the same neighborhood.
To start building up objects, we might want to find their edges, for example by finding all the places in an image where there's a vertical boundary between light and dark regions.
A vertical filter performs some pixel arithmetic on an image and creates a new image that has light pixels where the old image had vertical boundaries, and dark pixels where there were no boundaries.
To make a vertical filter, what should you replace a pixel in the old image with?
The idea that information resides in the comparison of neighborhoods of pixels is the basis for the modern theory of computer vision. We will see later in the course that this idea arises organically within neural networks. They even "learn" filters (like the vertical filter in the last problem) that optimize information processing for a particular dataset of images.
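As a sketch of one simple choice of vertical filter (others exist): replace each pixel with the difference between its right and left neighbors, so the output lights up along vertical light/dark boundaries. The test patch here is made up.

```python
import numpy as np

def vertical_edges(img):
    """Replace each pixel with |right neighbor - left neighbor|:
    large values mark vertical light/dark boundaries."""
    out = np.zeros_like(img, dtype=float)
    out[:, 1:-1] = img[:, 2:] - img[:, :-2]
    return np.abs(out)

# A patch with one sharp vertical boundary: dark half, light half.
img = np.array([[10, 10, 10, 200, 200, 200]] * 4, dtype=float)

edges = vertical_edges(img)
assert edges[0, 1] == 0      # flat dark region: no response
assert edges[0, 4] == 0      # flat light region: no response
assert edges[0, 2] == 190    # boundary columns light up
assert edges[0, 3] == 190
```

Because the filter only ever compares neighbors, a uniform brightness shift (adding a constant to every pixel) leaves its output unchanged, in line with the centering discussion above.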
With basic filters, information in images can then be further integrated to find higher-order structures — for example, by combining the vertical edge detector we just explored with a horizontal edge detector, we might be able to find corners in the image. Eventually, a higher-order rule might infer that the image likely contains something with corners, like a laptop or a map.
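A rough sketch of that combination, using simple neighbor-difference filters and a made-up image of a bright square: where the vertical and horizontal responses are both strong at the same pixel, we flag a corner-like point. (Real corner detectors are more sophisticated; this just illustrates the idea of stacking filters.)

```python
import numpy as np

def vertical_edges(img):
    # Responds to left-right intensity changes (vertical boundaries).
    out = np.zeros_like(img, dtype=float)
    out[:, 1:-1] = np.abs(img[:, 2:] - img[:, :-2])
    return out

def horizontal_edges(img):
    # Responds to up-down intensity changes (horizontal boundaries).
    out = np.zeros_like(img, dtype=float)
    out[1:-1, :] = np.abs(img[2:, :] - img[:-2, :])
    return out

# A bright square on a dark background: its corners are exactly where
# both filters fire at once.
img = np.zeros((8, 8))
img[2:6, 2:6] = 200.0

corners = vertical_edges(img) * horizontal_edges(img)

assert corners[2, 2] > 0     # a corner of the square
assert corners[2, 4] == 0    # middle of the top edge: no vertical response
assert corners[4, 4] == 0    # flat interior: neither filter responds
```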
Computer vision has advanced to the point that it's competitive with human vision on some tasks. Today, computers tag photos on social media, scan X-rays for signs of cancer, and guide cars through obstacle courses. There’s still a long way to go, but in principle, everything a computer needs to identify the whole picture is already there in the pixels.