This image has been making its rounds on the internet lately:
(click here for a bigger image of this... thing)
Apparently it was posted on Reddit with the vague title of "This image was generated by a computer on its own (from a friend working on AI)". The picture was then reposted everywhere, with even vaguer captions like "this is what it looks like when a computer dreams".
What's going on with these weird images?
The gist is that these images are the result of running an image recognition algorithm backwards. So instead of looking at a picture of a squirrel and outputting "yep that's a squirrel", the algorithm is given "show me a squirrel" and it outputs... that monstrosity. The other images are generated through a similar process.
This really caught my attention, because I had done the exact same thing (on a much smaller scale) back in college. So I figured it might be interesting to revisit that project and try to explain what's going on with that hideous squirrel monster and the pretty pictures, hopefully in a way that even non-programmers will be able to understand.
Let's start with a really simple problem. Say we have a 5x7 grid:
Clicking in that grid fills in a square, and you can do that to draw numbers. Here are three examples of how you might write a two:
Even in this small grid, there are a bunch of different ways to draw each digit. So the question becomes: how can a computer look at which cells are filled and tell you which number is drawn?
One solution to that problem is to use a neural network. That sounds complicated, like something from a Terminator movie, but the logic behind it is pretty simple.
Let's focus on just one of those squares in the 5x7 grid. To sound smart, let's call that square an input neuron. You can think about this neuron exactly the way you think about the neurons in a real brain: it has an input (you clicking on it), a value (in this case, either on or off), and it can "fire" by passing a signal to another neuron based on that value.
Wait, what other neurons? Well, in the simplest case, you can imagine our input neuron being linked to 10 other neurons, one for each possible digit (0-9).
We can call these 10 neurons output neurons. When the user clicks our particular input neuron (which is just one of the cells in the whole 5x7 grid), we can "trigger" the output neurons to increase their value. In other words, given that a particular cell is filled in, we can ask questions like "how sure are you that the digit is a 0? Or a 1? Or a 2?" Our input neuron then "answers" those questions by "triggering" the corresponding output node.
So our "network" consists of two "layers": our 35 input neurons (one for each cell) and our 10 output neurons (one for each possible digit). Each input node feeds into the output nodes. We repeat those questions for every single input node (each cell in the grid). Then to figure out what the digit is, we simply see which output node has the highest value!
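Here's roughly what that two-layer setup could look like in Python. This is only a minimal sketch, not the code from my old project: the names `classify` and `weights`, and the made-up weights where the top row of the grid votes for 7, are assumptions for illustration.

```python
ROWS, COLS, DIGITS = 7, 5, 10  # a grid 5 cells wide and 7 tall, and the digits 0-9

def classify(cells, weights):
    """Return (best_digit, totals) for a flat list of 35 on/off cells.

    weights[i][d] is how strongly input neuron i "triggers" output neuron d.
    """
    totals = [0.0] * DIGITS
    for i, filled in enumerate(cells):
        if filled:  # only filled-in cells fire
            for d in range(DIGITS):
                totals[d] += weights[i][d]
    # The answer is whichever output neuron ended up with the highest value.
    best = max(range(DIGITS), key=lambda d: totals[d])
    return best, totals

# Hypothetical weights: every cell in the top row votes for "this is a 7".
weights = [[0.0] * DIGITS for _ in range(ROWS * COLS)]
for i in range(COLS):
    weights[i][7] = 1.0

# Fill in only the top row of the grid.
cells = [1] * COLS + [0] * (ROWS * COLS - COLS)
print(classify(cells, weights)[0])  # 7
```

Everything interesting lives in `weights`. The code that adds things up and picks a winner never changes.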
But how does each input neuron know which output neurons to trigger? That is the question. And the answer is: we have to train our network. We do that by going through a bunch of example digits and telling our network what that digit is. So we draw a two:
And we tell every cell that's filled in to increase the value that it passes to the "this digit is a 2" output node. Then we draw a 7:
And we tell all of those cells to increase the value that they pass to the "this digit is a 7" output node. We repeat that process a bunch of times for each digit, each way you can draw each digit, and hopefully by the end our network will be "smart" enough to recognize new combinations it hasn't seen before.
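That training loop can be sketched as simple counting. Again, this is a hedged illustration rather than my original code: `train` is a made-up name, and the example uses a tiny pretend grid of 3 cells instead of 35 so the numbers are easy to follow.

```python
DIGITS = 10

def train(examples, n_cells):
    """Count, for every cell, how often it was filled in for each digit.

    examples is a list of (cells, digit) pairs; cells is a flat list of 0/1 values.
    """
    weights = [[0] * DIGITS for _ in range(n_cells)]
    for cells, digit in examples:
        for i, filled in enumerate(cells):
            if filled:
                weights[i][digit] += 1  # strengthen the link from cell i to this digit
    return weights

# Hypothetical 3-cell "grid": cell 0 is used by both 2 and 7,
# cell 1 only by the 2, and cell 2 only by the 7s.
examples = [([1, 1, 0], 2), ([1, 0, 1], 7), ([1, 0, 1], 7)]
weights = train(examples, n_cells=3)
print(weights[0][2], weights[0][7])  # 1 2
print(weights[1][2], weights[1][7])  # 1 0
```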
You might notice that both 2 and 7 share some of the same cells, and that's okay! Since we're asking every cell in the grid what it thinks the digit is (based only on whether that particular cell is filled in), what we end up with are answers like this:
Here we can see the overlap, where those cells can't be sure whether it's a 2 or a 7. Those input neurons would trigger the output neurons for both 2 and 7. But we also see cells that make it more likely to be one digit or the other, and those input neurons will be weighted more heavily towards a particular output neuron.
Of course, this image is an oversimplification: our real network has 10 digits (not just 2 and 7), and each cell has a percentage associated with each of those digits. So a cell wouldn't say "it's either a 2 or a 7", it would say something like "I'm 40% sure it's a 2, and 60% sure it's a 7". By adding up the percentages of each digit of each cell (which we do by passing those percentages to each output neuron), we get an idea of the "big picture" of which digit was actually drawn.
In other words: to figure out what the digit is, our network looks at every filled-in cell. Each filled-in cell then recalls what percentage of our "training digits" was a 0 (or a 1 or a 2 or...) when that cell was filled in, and adds that percentage to our output neurons. The network then looks at the total value of each output neuron, and that total is the "confidence" the network has that the digit is a 0 (or a 1 or a 2 or...). The digit with the highest percentage is our answer.
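The percentage bookkeeping can be sketched in a few lines of Python. This is a minimal sketch under my own assumptions: the names `to_percentages` and `confidence` and the two example cells are invented for illustration, with counts chosen to match the "40% a 2, 60% a 7" example.

```python
DIGITS = 10

def to_percentages(counts):
    """Turn one cell's per-digit training counts into fractions that sum to 1."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

def confidence(cells, counts_per_cell):
    """Add up the percentages of every filled-in cell, one total per digit."""
    totals = [0.0] * DIGITS
    for filled, counts in zip(cells, counts_per_cell):
        if filled:
            for d, pct in enumerate(to_percentages(counts)):
                totals[d] += pct
    return totals

# Hypothetical counts: one cell was filled in for 2 training twos and
# 3 training sevens; the other was only ever filled in for sevens.
shared_cell = [0, 0, 2, 0, 0, 0, 0, 3, 0, 0]
only_seven = [0, 0, 0, 0, 0, 0, 0, 4, 0, 0]

print(to_percentages(shared_cell)[2], to_percentages(shared_cell)[7])  # 0.4 0.6
totals = confidence([1, 1], [shared_cell, only_seven])
print(max(range(DIGITS), key=lambda d: totals[d]))  # 7
```

The shared cell can't decide on its own, but once the second cell adds its vote, the total for 7 clearly wins.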
I'm sure there are some people reading this (well...) and grumbling "that's not really a neural network..." because what we've described so far could probably be accomplished with something simpler- why not just ask the cells directly, instead of going through an output layer? If we did that, we wouldn't require any fancy-sounding neural network jargon.
But what we've built so far can be expanded into a "real" neural network by adding in another "layer" of nodes. Huh?
Think about it this way: so far we've been looking at one cell at a time, but how would we capture relationships between cells? Instead of having "cell X" and "cell Y" separately feed into the output layer, how could we have "the fact that both cell X and cell Y are filled in" feed into the output layer?
We'd do that by not having our input neurons feed directly into our output neurons. We'd add another "layer" of neurons in-between the input and output layers that could capture more complicated interactions between the input neurons. We'd go from this:
(This is simply what we've described so far. I'm only showing 2 of the input cells, but we really have 35 (5x7) of them.)
This example image just shows two input neurons feeding into a single hidden neuron that then feeds into our output neurons, but "real" neural networks can become very complicated.
Now instead of feeding directly into the output neurons, our input neurons feed into a hidden layer of neurons. Those "hidden" neurons can then feed into the output neurons using the same logic as our input neurons previously used: each hidden neuron has an input, a weight, and an output. Hidden neurons can even feed into other hidden neurons!
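To make "the fact that both cell X and cell Y are filled in" concrete, here is a sketch of a single hidden neuron. The threshold rule and the specific numbers (weights of 1 and a threshold of 1.5) are my own assumptions for illustration; real networks learn these values during training and usually use smoother firing rules.

```python
def neuron(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of the inputs exceeds the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

def both_filled(cell_x, cell_y):
    # Hypothetical hidden neuron: with weights of 1 and a threshold of 1.5,
    # a single filled cell (sum 1) is not enough, but both together (sum 2) are.
    return neuron([cell_x, cell_y], [1.0, 1.0], threshold=1.5)

for x in (0, 1):
    for y in (0, 1):
        print(x, y, both_filled(x, y))
```

Only the last line of output ends in a 1: the hidden neuron stays quiet unless both cells are filled in, which is something neither input neuron could express on its own.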
Hidden neurons can capture very complicated relationships between input neurons and output neurons, and can look more like this:
Or even this:
Building these complicated networks is a field of study in itself, but if you've read this far you know the basics: a neural network is just a set of input neurons, connected through a hidden layer of neurons, to a set of output neurons that give us our answers.
Remember our example problem with the grid of cells and the number guessing? That's the project I did back in college. Finding it took some digging around on an old hard drive, but the final product looked like this:
Users can draw a number in the grid, then push the "Evaluate!" button, and the program uses a simple neural network to identify that number.
But that's not the interesting part.
The interesting part is that once I finished this project, I got the bright idea to reverse it. Instead of drawing a number and having the network tell me what that number is, I wanted to give the program a number and have the network draw what it thought that number looked like.
The way I did that was pretty simple, but you can think about it this way: instead of using the cells in the grid as input neurons and each digit 0-9 as an output neuron, you can use each digit as an input neuron and each cell as an output neuron!
Then instead of asking "if this particular cell is filled in, how sure are you that the digit is a 0, 1, 2...?", you can ask "if the digit is a 7, how sure are you that this particular cell is filled in?" You then fill in any cell that it's "pretty sure" (above 50% for example) should be filled in.
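One way to sketch that reversed question in Python is shown below. This is not how my original program did it; it is a simplified illustration that goes straight back to the training examples. The name `draw_digit`, the 3-cell pretend grid, and the default 50% threshold are all assumptions.

```python
def draw_digit(examples, digit, threshold=0.5):
    """Fill in each cell that was on in more than `threshold` of the examples of `digit`."""
    matching = [cells for cells, label in examples if label == digit]
    if not matching:
        return []  # the network never saw this digit, so it has nothing to draw
    drawn = []
    for i in range(len(matching[0])):
        # "If the digit is a 7, how sure are you that cell i is filled in?"
        fraction = sum(cells[i] for cells in matching) / len(matching)
        drawn.append(1 if fraction > threshold else 0)
    return drawn

# Hypothetical 3-cell "grid" with three training sevens and one training one.
examples = [([1, 1, 0], 7), ([1, 0, 0], 7), ([1, 1, 1], 7), ([0, 0, 1], 1)]
print(draw_digit(examples, 7))  # [1, 1, 0]
```

Cell 0 was filled in for every 7 and cell 1 for two out of three, so both get drawn. Cell 2 only showed up in one of the three sevens, so it stays empty.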
What you get can maybe be seen as what the network "thinks" a particular digit looks like. Here's what it thinks a 7 looks like:
And you can tell that maybe during training, the network saw a bunch of big sevens, so it knows to fill those cells in, but maybe it also saw some smaller sevens, so it combines them into a weird recursive 7 monster.
Here it is drawing a 4:
Think of a baby (human) living in a house that has a little white dog. The baby might start out believing that every dog is little and white- this is a problem of under-generalization. It might also think that other little white animals are also dogs, even if they're actually rabbits- this is a problem of over-generalization. But as the baby grows up, it will establish an idea of a "generic" dog that acts as a template for determining whether other animals are dogs.
You might think of the generated 4 as the neural network's idea of a "generic 4", which it uses as a "template" to categorize new input digits.
One thing neural networks are used for is image recognition- similar to our little digit recognizer, but with many more input neurons and a more complicated hidden layer. The input neurons might be the actual pixels from the image, but they also might include other information, such as "how round is this" or "what is the background and foreground" or "where are the edges". The output neurons might tell you who is in a picture (like with Facebook) or what images are similar (like with Google reverse image search). In any case, the idea is the same as our above example: you've got some input neurons, a hidden layer, and an output layer.
So what happens if we reverse one of these big neural networks, like we reversed our little digit recognition neural network?
You might start out with an image recognition neural network that normally takes images as input and outputs a label for that image, and reverse it so you can give it a label and it generates an image for that label- exactly like we reversed our digit recognizer to draw digits! And the results are pretty trippy:
You might then say "show me what you think a squirrel looks like", and your reversed network might generate... well, this:
If you look at this picture, you'll notice that whatever this is has a lot of eyes. And that's very disconcerting at first, but think about it from the neural network's perspective: during training, the network was shown a bunch of pictures of squirrels- and those pictures probably all contained eyes. The neural network doesn't know what those eyes actually are, but it knows that pictures that contain eyes (along with brown body sections) are more likely to contain squirrels than pictures that contain no eyes. So if you ask for a squirrel, you'll get a bunch of eyes!
The process that the original article discusses is just a little bit more complicated than that. Instead of simply reversing a neural network and asking it to generate a picture of a squirrel, they first asked the non-reversed neural network to identify the features of the image, and then they passed those features into the reversed neural network to "enhance" the original image. Basically, instead of passing the reversed network a label, they first asked the original network to come up with the labels itself.
For example, the researchers can feed a picture of a normal squirrel into the original non-reversed image identification neural network, and the neural network might then say "that's a squirrel". They can then go to the reversed neural network and say "now make this input image look more like whatever you've identified". In the process of making the image look more like a squirrel, the reversed neural network might make the image look like other things as well: maybe the squirrel now has three eyes (because things with eyes might be squirrels), and maybe one of those eyes also looks a little bit like a dog's nose. Then that new image is fed back into the original non-reversed neural network, which is asked to identify the features in that image, and that generates new labels: maybe it thinks it might be a squirrel, or it might be a dog as well. Those labels can then be fed into the reversed neural network to add even more stuff to the image: maybe now the squirrel eye/dog nose is given ears, and maybe those ears also look a little bit like a hand.
That process (which they're calling inceptionism) is repeated a bunch of times, until you've got a squirrel monster that seems to contain dog heads and wine glasses and some kids on a bike and maybe a turtle.
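The "identify, then enhance, then repeat" loop can be sketched with a toy example. To be clear about the assumptions: this is not the researchers' code, and it swaps their big image network for a tiny made-up scorer with 2 labels and 4 "pixels". The names `top_label` and `enhance`, the weights, and the step size are all hypothetical.

```python
def top_label(image, weights):
    """Score every label for this image and return (winning_label, scores)."""
    scores = [sum(w * p for w, p in zip(row, image)) for row in weights]
    return max(range(len(scores)), key=lambda k: scores[k]), scores

def enhance(image, weights, steps=10, rate=0.1):
    """Repeatedly ask "what do you see?" and nudge the image to look more like it."""
    image = list(image)
    for _ in range(steps):
        label, _ = top_label(image, weights)
        # For this simple scorer, raising a pixel with a positive weight (or lowering
        # one with a negative weight) raises the winning label's score.
        image = [min(1.0, max(0.0, p + rate * w))
                 for p, w in zip(image, weights[label])]
    return image

# Hypothetical scorer: label 0 likes pixel 0 and dislikes pixel 2,
# label 1 likes pixel 1 and dislikes pixel 3.
weights = [[1, 0, -1, 0], [0, 1, 0, -1]]
start = [0.6, 0.5, 0.5, 0.5]  # almost-uniform gray, with a faint hint of label 0
result = enhance(start, weights)

print(top_label(start, weights)[1][0] < top_label(result, weights)[1][0])  # True
```

The faint hint in the starting image is enough: the scorer "sees" label 0, the image gets pushed towards label 0, and the next pass is even more sure. That runaway feedback is the same reason the squirrel keeps growing eyes.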
If you start the inceptionism process with a completely random picture of static and ask the computer to label that image with whatever it "thinks" is there, what you end up with are images generated completely by what the network "sees" and then enhances in that static:
This neural network was trained to recognize places, so it was more likely to recognize landscapes, horizons, and buildings in an image. It then enhanced those images, became even more likely to recognize those features, and then repeated that process until these images came out.
This looks cool, but it goes a little bit further than "computers on drugs". This is interesting to AI researchers because it allows them to better understand what's going on inside complicated neural networks. Remember those hidden layers? Many processes for creating and fine-tuning neural networks involve a lot of computer generated neurons, so it can be very difficult for a human to look at a network and reason about what exactly is going on. The inceptionism process helps with that.
For example, if we ask it to show us a picture of a dumbbell, it might give us pictures like this:
Notice how these dumbbells seem to have arms growing from them. That's because most of your training images of dumbbells probably also contained an arm. This tells you that you should train your network on some pictures of dumbbells without arms, so the network's idea of a "generic dumbbell" becomes more specific.
It can also be difficult to look at a neural network and pick out its individual components, like "this is the part of the network that identifies eyes" or "this is the part that identifies edges". By isolating individual parts of a neural network (a group of connected neurons, or a layer) and using the inceptionism process on just that part, we can more intuitively see what's going on.
For example, we might take out 4 different sections of the network, isolate each of those sections, and then feed the same image into each and see what comes out the other side of the inceptionism process. That allows us to see the difference between each part of the network in a more intuitive way.
If you look closely, you'll see that one of the generated images contains a lot of spirals- that means that the part of the network that generated that particular image is good at detecting spirals. Another of the generated images contains a lot of vertical edges- that means the part of the network that generated that image is good at detecting vertical edges. We might then be able to take that part of the network and plug it into another network where the ability to recognize vertical edges might be useful!
This turned out much longer than I planned, but I was really excited by the original article and how similar it sounded to my old college project. And I kept seeing that squirrel picture all over the internet, so I thought it might be interesting to try to explain what's going on in a deeper way than "this is your computer on drugs", haha.
I'd be curious to hear what you think- did any of this make sense?
And if you just want to check out cool pictures, here is the original gallery!