How Do Computers See Things in Images?
Posted on September 10th, 2023 by Max
Have you ever wondered how computers can “see” things in pictures like our eyes do? They do this using convolutional neural networks (CNNs), a type of neural network specialized in getting data from images.
But first, let’s back up–how do we even see information in images ourselves, as humans? It’s important to break down concepts to their core components, to visualize them.
In the real world, if we look at an image of a dog standing on grass, what reaches us is photons. On a computer screen, LEDs emit light (photons) so that we can see the contents of the screen; with a physical photo you hold in your hand, the matter the photo is made of (protons, neutrons, and electrons) reflects light carrying information about what colors are on the photo.
When photons containing information about the image collide with the retina in our eyes, they are absorbed. Our eyes are specifically built to transform these signals into electrical charges, and then these electrical charges are sent to the brain. Finally, we’re at the last step: the neural networks in our brains process this information and procedurally notice the features of the image.
“Image features” are edges, curves, colors, blurriness, and many other attributes of the image, and they essentially represent what ideas the image contains. By analyzing the edges, curves, and colors, A.K.A. the “features”, our smart brain identifies what objects are in the image very rapidly.
Our brains are absolute powerhouses. But still, after receiving the “information” of a photo, what do we… do with it? How can you quantitatively apply it? In machine learning, there are three main ways to format conclusions from images: detection, single-label classification, and multi-label classification. I’ll describe them here!
Let’s start with our brain. Given an image, our brain is able to identify 1) which objects are in the image and 2) what part of the image each object is located in. In other words, it’s finding 1) the “what” and 2) the “where”. In the machine learning world, finding the “what” and the “where” is a process called detection. Given an image and a long list of potential objects (such as dog, cat, lion, and panther), a detector will tell you which objects from this list are in the image, and where they are.
An easier (yet still very useful) process is classification. It answers the question: given an image and a list of potential objects that could be in the image, which objects are present? If the picture is of a single object (i.e., we’re just trying to identify one object), we use single-label classification. If we want to identify multiple objects in the image, we can employ multi-label classification to find out which of them are present.
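To make the three output formats concrete, here is a small Python sketch. Everything in it is made up for illustration: the scores, the 0.5 threshold, the bounding-box numbers, and the helper names single_label and multi_label are my own assumptions, not the output of a real model.

```python
# Hypothetical confidence scores a model might assign to each candidate label.
scores = {"dog": 0.92, "cat": 0.07, "lion": 0.64, "panther": 0.03}

def single_label(scores):
    """Single-label classification: return the one most likely label."""
    return max(scores, key=scores.get)

def multi_label(scores, threshold=0.5):
    """Multi-label classification: return every label at or above the threshold."""
    return [label for label, score in scores.items() if score >= threshold]

# Detection adds the "where": each found object comes with a bounding box,
# written here as (left, top, right, bottom) in pixels. Values are made up.
detections = [
    {"label": "dog", "box": (40, 60, 180, 220)},
    {"label": "lion", "box": (200, 30, 320, 210)},
]

print(single_label(scores))  # dog
print(multi_label(scores))   # ['dog', 'lion']
```

Single-label classification gives back one answer, multi-label gives back a list, and detection gives back a list where every entry also says where the object is.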
Great, we’ve now defined the most common ways of formatting data from images. Wonderful. One problem: we’ve described the types of output we want (detection, single-label classification, or multi-label classification), but it must be very difficult to make a computer give us these outputs, right? How do you analyze an image to get to these outputs? The way to get them is with neural networks. Now we can start talking about those.
Say, given a black-and-white image with a white number drawn on it, you wanted to identify which number was drawn: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. This problem is super common for classification; it’s the “Hello, World” of machine learning: the MNIST image classification problem.
First, let’s clarify how a computer can use virtual images: Virtual images on computers are way more straightforward than images in real life. Luckily, we don’t need to bother dealing with photons or other physical matter like our eyes do since these images are stored in digital files.
Digital images are formatted like a table of numbers–rows and columns of numbers form a grid. Each number specifies how white the pixel is, on a scale of 0 to 1. A pixel with a value of 0 means it’s black, a value of 1 means it’s pure white, and an in-between value like 0.5 means the pixel is slightly white but not fully, i.e., it’s gray.
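Here is what that table of numbers could look like in Python. This is a hand-drawn 5 × 5 example of my own (real MNIST pictures are bigger), using a plain list of lists as the grid.

```python
# A tiny 5x5 grayscale "image" drawn by hand: 0.0 is black, 1.0 is white.
# The bright pixels roughly trace the digit 7.
image = [
    [0.0, 1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
]

height = len(image)     # number of rows
width = len(image[0])   # number of columns
print(height, width)    # 5 5
print(image[0][1])      # 1.0 -> row 0, column 1 is pure white
print(image[2][2])      # 0.5 -> row 2, column 2 is gray
```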
So, now to briefly introduce neural networks: the meat and potatoes of this blog post. Neural networks are a tool humans discovered that happens to work really well for seemingly qualitative tasks, such as discerning what’s in an image or generating words for a conversation.
When you first learn about them, neural networks might seem arbitrarily defined, and we still have trouble explaining everything about them because they only recently became a major focus in engineering (they have been studied since the 1940s, but they didn’t attract widespread attention from researchers until around 2006!).
Neural networks have one big thing for certain going for them, though–they’re backed by evolution. The structure of the virtual neural networks that computer programmers use is inspired by the neural networks in our brains.
Neural networks are composed of 1) neurons and 2) connections between neurons. Bear with me here. First: a neuron holds a number. That’s its primary purpose, and if there’s one thing you should take away from this post, it’s that a neuron holds a number. Engineers usually force this number to be between 0.0 and 1.0 (though neural networks don’t need to work this way). Neat, right?
Next: neurons are divided into groups, which we call layers. Neurons have connections from one layer to the next. The final layer represents what a neural network thinks–for example, for the number-image-classification problem, it tells us what number the model thinks the image looks like.
The model will process an image like this: in this gif, neurons with a white color represent neurons that have high values (close to 1.0). Darker neurons represent ones that aren’t activated–their value is closer to 0. Passing in a “7” will activate certain neurons in the network, and lead to the model making a guess.
If you’re confused by how one neuron connects to another, I’ll get to that soon. But you can think of the order of neurons as moving from left to right. The driving idea behind neuron connections is: the values of neurons on the left determine the values of neurons on the right.
The above picture shows an example configuration of a neural network (this is just an example; this network is very small and wouldn’t generally be good for image analysis). It has 4 neurons in its input layer, which means that you’ll input 4 numbers to use the neural network.
You might have the question: “But how does the neural network pick which neurons to activate to get the desired output? If you had a large neural network, and you passed in a 7, how does it know a “7” looks like a 7?” This is a really important question, and I’ll answer it soon. To briefly acknowledge this: After we build a neural network, it’s customizable. The “learning” in machine learning means that 1) we build a neural network, and then 2) we customize it to be good at making predictions. The neural network in the GIF I showed you way back is able to discern what a “7” looks like because we customized it to be able to make good predictions.
To be able to plug in images, let’s make a neural network that can take in a picture. This means we’ll need a large number of input neurons, so that we can input all of the pixels in the image.
An MNIST picture is 28 pixels wide and 28 pixels high. If we want a neural network to analyze every single pixel of the image, we’ll need to be able to input 28 × 28 = 784 numbers into the neural network. Whew. That’s a lot, right? It’s much more than just the 4 input neurons in the previous picture.
Look at this gif: every single pixel of the image is inputted into the network. To clarify, the “784” around the first column of nodes signifies that there are 784 nodes in the input layer. That’s too many to draw, so this visual uses a “...” to convey that there are more neurons not shown.
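If you want to see how a 28 × 28 grid turns into 784 inputs, here is a rough Python sketch. The all-black placeholder image and the helper name flatten are my own stand-ins, not real MNIST data or a library function.

```python
# A stand-in for an MNIST picture: 28 rows of 28 pixel values (all black here).
image = [[0.0 for _ in range(28)] for _ in range(28)]
image[3][10] = 1.0  # light up one pixel so we can track where it ends up

def flatten(image):
    """Lay the rows end to end so each pixel feeds one input neuron."""
    return [pixel for row in image for pixel in row]

inputs = flatten(image)
print(len(inputs))          # 784
print(inputs[3 * 28 + 10])  # 1.0 -> row 3, column 10 lands at index 94
```

Each entry of inputs is the value of one input neuron, so the picture’s rows simply get lined up one after another.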
Almost at the end now. The “learning” in machine learning is based on the connections between neurons. To determine the value of a new neuron, we take a weighted sum of all the neurons on the left that it’s connected to. This idea may seem overwhelming at first, but, try not to stress out, it’s just addition and multiplication.
Neural networks are customizable. Every connection between two neurons has a weight, and this weight is what we can customize. A weight is a number, and the number that we pick determines how we sum up leftward neurons to calculate the following rightward one.
Look at the above picture to understand weighted sums. However, the weighted sum isn’t the value of the new node–we’re not done yet! (Almost there!) Remember when I said earlier that neurons contain a number between 0.0 and 1.0? The weighted sum in the picture above is outside of those bounds (1.1 is greater than 1.0), so we need to alter the weighted sum somewhat to be able to fit it in the next neuron.
We’ll apply a “squeeze” function to the weighted sum: take any number between negative infinity and positive infinity, and squeeze it between 0 and 1. A very common function for that is the sigmoid function. This function maps very negative inputs close to 0, and very positive inputs close to 1.
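Here is a minimal Python sketch of computing one neuron’s value: a weighted sum followed by the sigmoid squeeze. The input values and weights are numbers I picked so that the weighted sum comes out to 1.1; they are an assumption for illustration and not necessarily the numbers in the picture.

```python
import math

def weighted_sum(inputs, weights):
    """Multiply each left-hand neuron by its connection weight, then add up."""
    return sum(x * w for x, w in zip(inputs, weights))

def sigmoid(z):
    """Squeeze any real number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

inputs = [0.5, 1.0, 0.2]   # values of three left-hand neurons (made up)
weights = [1.0, 0.5, 0.5]  # one weight per connection (made up)

z = weighted_sum(inputs, weights)
print(round(z, 2))             # 1.1
print(round(sigmoid(z), 3))    # 0.75
print(round(sigmoid(-10), 4))  # 0.0 -> very negative inputs end up near 0
print(round(sigmoid(10), 4))   # 1.0 -> very positive inputs end up near 1
```

So a weighted sum of 1.1 becomes a neuron value of about 0.75, which fits between 0 and 1.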
Don’t worry about doing the math yourself; we can thank Python libraries for doing it for us. The key idea to remember is that a weight is a number that describes the connection from one neuron to the next. Again, all of the weights in a neural network are customizable, which means that changing a weight will change how good the model is at predicting things.
Changing the weights is where learning comes in. Say you make a neural network that can take in images and make a guess for if the image is a 0, 1, 2, …, 7, 8, or 9. If you randomized the model’s weights, the model would most likely make trash predictions. It would be randomly guessing, and not really detecting features in the image. However, this raises the question… if your model is big enough, surely there is some optimal configuration of weights that makes the model able to predict well? The question is, how do we find what weights are “good”?
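To get a feel for what “randomized weights” means, here is a simplified Python sketch of my own. It assumes a single layer going straight from 784 inputs to 10 outputs (a real digit classifier would usually have more layers), and the pixel values are random noise rather than a real picture.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights):
    """Each output neuron is the squeezed weighted sum of all the inputs."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, row))) for row in weights]

random.seed(0)  # fixed seed so the run is repeatable

# A made-up flattened 28x28 image and an untrained 784 -> 10 layer.
pixels = [random.random() for _ in range(784)]
weights = [[random.uniform(-0.1, 0.1) for _ in range(784)] for _ in range(10)]

outputs = layer_forward(pixels, weights)

# The model's "guess" is the digit whose output neuron has the highest value.
guess = max(range(10), key=lambda digit: outputs[digit])
print(guess)  # some digit from 0 to 9, decided only by the random weights
```

The network happily produces a guess, but since nothing has been learned yet, that guess tells us nothing about the image.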
To “find out what weights are good”, machine learning engineers use a process called stochastic gradient descent. It just rolls off the tongue. It’s essentially applied math, based on vector calculus. If we know what values our model should output given a certain input (e.g., the output neuron labeled “7” should be highest when a picture of a “7” is inputted), we can apply a mathematical equation to see which changes in model weights will bring us closer to reaching that particular desired output. That’s what training is: training consists of slow, incremental steps towards changes that make our model perform better on a given training example. The hope is that if we train on enough diverse examples, the model will have learned how to notice the important general patterns across all possible inputs.
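Here is a heavily simplified Python sketch of that idea, using a single sigmoid neuron and a single training example. The squared-error measure, the learning rate of 0.5, and all the numbers are my own assumptions for illustration; real stochastic gradient descent cycles through many randomly chosen examples and updates every weight in a much larger network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(inputs, weights):
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)))

def train_step(inputs, target, weights, learning_rate=0.5):
    """Nudge each weight a little in the direction that lowers the squared error."""
    output = predict(inputs, weights)
    # Chain rule: d(error)/d(weight_i)
    #   = 2 * (output - target) * output * (1 - output) * input_i
    shared = 2 * (output - target) * output * (1 - output)
    return [w - learning_rate * shared * x for w, x in zip(weights, inputs)]

inputs = [0.5, 1.0, 0.2]    # made-up training example
target = 1.0                # the value we want this neuron to output
weights = [-0.3, 0.1, 0.4]  # arbitrary starting weights

before = (predict(inputs, weights) - target) ** 2
for _ in range(100):
    weights = train_step(inputs, target, weights)
after = (predict(inputs, weights) - target) ** 2

print(before > after)  # True: the error shrank after training
```

Each call to train_step is one of those slow, incremental steps: the weights move only slightly, but after many steps the neuron’s output gets much closer to the target.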
The machine learning workflow comes down to two steps: 1) we build a model that has the potential to work really well, and then 2) we use stochastic gradient descent to find the weights that make the model notice the patterns we want it to.
There is the question–why do neural networks work so well? As I said earlier, they're so new to the field that researchers haven’t identified the “theory” of machine learning yet, but we at least know it has to do with features. I said before that features are attributes of the image: corners, straight lines, and curved, closed loops are all examples of features in images. Many features (for example, color gradients) are easy to see at a glance with a human brain, but there are many features that the human brain hasn't evolved to see, because they weren’t useful for our brains to learn over the course of natural selection. However, features that are “difficult for humans to see” are extremely useful if you’re not a human.
A tremendous strength of neural networks is that they can be trained to see all features–both features that our human brains can see easily and the features that our brains can’t. Again, through stochastic gradient descent, a mathy process where a model gets better at aligning with the training data we feed it, neural networks learn what features in the training data are important (we should remember these) and what features in the training data aren’t important (it’s safe to ignore these).
Essentially, neural networks are good at discovering patterns. That’s their whole point. This is also coincidentally something the human brain prides itself on.
Slow, incremental training is why machine learning can be so expensive. When you hear about OpenAI using loads of power on GPUs to “train ChatGPT”, they’re “finding out what weights are good” for ChatGPT to talk the most coherently. OpenAI, the company that made ChatGPT, has simply built a neural network massive and intricate enough to have the potential to talk like a human–the challenge now is achieving that potential.
Okay! Those are the fundamentals of training a neural network. I hope this helps your understanding of them. If you have any questions, or if I got something wrong, feel free to message me.
I’d also like to say thanks to the 3Blue1Brown YouTube channel. They have amazing videos for visualizing neural networks, and the gifs in this post are from those.
If you’d like to see a neural network in action, check out maxwild.tech/models. There, I posted a model I trained myself with PyTorch to detect alphanumeric characters in images (so not just numbers)! To train this, I used a convolutional neural network, a specialized type of network that’s really good at noticing patterns in images and sound. Maybe I’ll make a blog post in the future explaining how that works. Training this model took about an hour and a half, and required a GPU. That’s why I used Google Colab, where you can buy subscriptions to access their high-power resources. I’ll upload the notebook I used to train the model to my GitHub soon, so watch out for that.