Transcript
JD (Johannes Dienst): Hello, Johannes.
JH (Johannes Haux): Hi, nice to be back.
JD: So when we're talking about generative AI, there are a lot of applications right now, and I think we have to narrow it down a little bit. So I will ask a specific question about one thing I'm currently excited about, and that is image-to-image translation. What is image-to-image translation?
JH: First of all, it's not text generation, and I'm always happy when we're not talking about that. Yet another GPT podcast.
No, image to image is the research field or application field where we take an image, we do some magic with it and then something comes out of it. But it is again an image.
JD: Okay, that's really abstract. So do we have something we can really relate that to?
JH: One thing that people like to do is, for example, do Snapchat filters, right? Take my face and make it look like a goofy little dog or something like that. And that would be an image to image task.
And just in general, if we're talking about deep-learning-only solutions, one of the earliest things I came in contact with was something called style transfer, where you would take an image and then take another image, like a painting by Van Gogh or something, and then say: hey, take my image and make it look like it was painted by Van Gogh. And then you use a learned model, a neural network, to generate an image that looks like that.
JD: Could I also take my picture? Picture of myself?
JH: You can. We could take a picture from Van Gogh and a picture from you and then say, make the painting from Van Gogh look like it was a photo taken of Johannes Dienst. I don't know what the outcome would be, but yeah, interesting question.
JD: I also saw a few startups providing something like headshots, which sounds a little bit brutal, but actually headshots in English are a shot of yourself where you upload five to ten pictures and then they generate a model for you and you can generate a bunch of headshots for your social media, for example.
JH: Yeah, that's a pretty cool application. Makes total sense. If I don't have to buy like a suit to get a nice photo of me and just have to take a photo of my face and get the suit for free, that sounds like a cheaper solution. So image to image translation is the technique.
JD: And I remember Adobe made a lot of buzz a few weeks ago where they said, yeah, you can now do image editing by just selecting an area and remove something or add something. Is that also image to image translation?
JH: Yeah. You put an image in, you do something and an image comes out. That is also image to image. I think one of the main buzzwords, and something that Adobe has been doing for quite some time now, is something called inpainting, where you would, for example, remove an area of an image.
Like, for example, you take a picture of yourself and in the background there is some industrial chimney or something you don't want to be there. Then you mark it as something you want to have inpainted. And then you have a model that can take the context of the image, or maybe also some conditioning, like a word you say or a text where you describe what you want to have filled in there, and then fill this area with the respective content. And that's been done for quite a while, but most recently it has again gained popularity due to models like Stable Diffusion, because there you have a lot of control over how the inpainting actually happens and what you see in those areas.
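The mark-and-fill idea can be sketched in a few lines. This is only an illustration under stated assumptions: the hypothetical function `inpaint_with_mean` stands in for a learned model and simply fills the marked area with the mean of the unmarked pixels, whereas a real inpainting model would predict plausible content from the image context and any conditioning.

```python
import numpy as np

def inpaint_with_mean(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Fill masked pixels with the mean of the unmasked pixels.

    Placeholder for a learned inpainting model (assumption for illustration).
    """
    image = image.astype(np.float64)
    fill_value = image[~mask].mean()  # "context" = everything outside the mask
    result = image.copy()
    result[mask] = fill_value         # only the marked area is changed
    return result

# A 3x3 grayscale image with one unwanted bright pixel in the middle.
image = np.array([[10.0, 10.0, 10.0],
                  [10.0, 200.0, 10.0],
                  [10.0, 10.0, 10.0]])

# The mask marks the area to be inpainted.
mask = np.zeros_like(image, dtype=bool)
mask[1, 1] = True

restored = inpaint_with_mean(image, mask)
print(restored)
```

The part that carries over to real systems is the interface: an image plus a mask go in, and only the masked region is replaced.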
JD: You mentioned Midjourney, so we will come to that later. Actually, I decided that just right now.
There's another thing I really think has a practical application. I think it's called super resolution or upscaling. I know it as upscaling, and research came up with the term super resolution. So I take a small picture and upscale it to 4K, for example.
JH: Also, again, really useful. And yeah, if you're on the internet, you have images you want to send around and you want to display them in a way that is still appealing to the eye, but you don't want to send around megabytes of TIFF data or whatever. So the goal is to take little information and imagine new information that still makes sense in the context of the little information I have.
So if I have, for example, an image of a squirrel, you can imagine the squirrel itself is like filling out the picture, but there's some background and the squirrel has nice fur and nice patterns on the fur. Now, if I have a small image of that and I scale this up without any super resolution, everything would get mashed up and blurry probably depending on the algorithm you use. And you would directly notice that something's off there.
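To make the "scaling up without any super resolution" case concrete, here is a minimal sketch using nearest-neighbor upscaling, chosen only because it is the simplest scheme. The function name `upscale_nearest` and the tiny test image are assumptions for illustration. Each pixel is simply repeated, so the result is larger but contains no new detail.

```python
import numpy as np

def upscale_nearest(image: np.ndarray, factor: int) -> np.ndarray:
    """Upscale an (H, W) or (H, W, C) image by repeating each pixel `factor` times."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

# A tiny 2x2 grayscale "image".
small = np.array([[0, 255],
                  [255, 0]], dtype=np.uint8)

big = upscale_nearest(small, 4)
print(big.shape)                # (8, 8)
print(np.unique(big).tolist())  # [0, 255]
```

The output has 16 times as many pixels but exactly the same set of values: blocks instead of new texture. Smoother interpolation schemes trade the blocks for blur, but they also add no new information.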
And there have been many, many ways devised to tackle this, like taking a look at textures, at frequencies that you can observe in the Fourier spectrum of the image, and then trying to scale that up without losing this kind of frequency information. But that you always have to handcraft: which frequencies are important? The fur of the squirrel is very high frequency information, but the background is blurred, so there's only low frequency information. So I don't want this to get noisy while scaling.
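The distinction between high frequency "fur" and a low frequency "blurred background" can be checked numerically with a 2D Fourier transform. The following is a rough sketch: the helper `high_frequency_fraction`, the cutoff of 0.25 cycles per pixel and the two synthetic patterns are assumptions made for illustration, not a method named in the conversation.

```python
import numpy as np

def high_frequency_fraction(image: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy (mean removed) above `cutoff` cycles per pixel."""
    spectrum = np.fft.fft2(image - image.mean())
    energy = np.abs(spectrum) ** 2
    fy = np.fft.fftfreq(image.shape[0])[:, None]  # vertical frequencies
    fx = np.fft.fftfreq(image.shape[1])[None, :]  # horizontal frequencies
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = energy.sum()
    return float(energy[radius > cutoff].sum() / total) if total > 0 else 0.0

size = 64
x = np.arange(size)

# "Fur": stripes that alternate every pixel, the highest representable frequency.
fur = np.tile((x % 2).astype(float), (size, 1))

# "Blurred background": a single slow sine wave across the whole image.
background = np.tile(np.sin(2 * np.pi * x / size), (size, 1))

print(high_frequency_fraction(fur))         # close to 1
print(high_frequency_fraction(background))  # close to 0
```

A handcrafted upscaler would have to decide per region how much of this high frequency energy to preserve or synthesize, which is the weighting that a learned model picks up from data.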
And there again, in recent years, this problem has, from what I gather, basically been solved using neural networks, because there you can do this weighing of which information you need to add in which area by learning the task.
JD: Okay.
So is this also what... when we come back to Midjourney, you usually get these four pictures, right? I use it in Discord, actually. And then you say, yeah, enlarge this. Is this also what Midjourney is doing, or are they just generating with the same prompt in a higher resolution?
JH: Wow, that's a good question. I couldn't tell you. I would guess, to be honest, that they use the same prompt and other hyperparameters and basically only increase the size of the noise field they start from.
JD: A very specific question. I have to apologize for that.
JH: Yeah, no, no, don't worry about it. It's an interesting question. I'd like to know that.
JD: So when I researched super resolution, there was one specific problem they mentioned. Maybe you can shed a little bit of light on that. They mentioned that what you usually have are residuals. Did I pronounce that correctly? Residuals.
JH: Residuals.
JD: So how do they occur? Is that a problem with deep learning techniques or is this for the classical approaches?
JH: I mean, residuals, I'm not entirely sure what's meant by that in this specific context. But in general, residuals are basically the error you get when you compare to some ground truth.
So if I, for example, do a linear fit through a bunch of data points, then the distances from the linear fit to those points, those are the residuals. And in the case of super resolution, what you would do is take the ground truth image and subtract the super resolution version, and then you get the distance for each pixel, and those would then be the residuals you get out of that. So it's a way to measure the quality, I would say. It's not a good way. We can dive into that as well, if you like. But it's a way.
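Both uses of the word "residuals" can be shown in a short sketch. The data points and the two tiny "images" below are made up for illustration. The first part fits a line with `np.polyfit` and measures how far each point lies from it. The second subtracts a super-resolved image from its ground truth pixel by pixel.

```python
import numpy as np

# Residuals of a linear fit: observed value minus fitted value.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept
residuals = y - predicted
print(residuals)

# Residuals in super resolution: ground truth minus super-resolved image.
ground_truth = np.array([[10.0, 20.0],
                         [30.0, 40.0]])
super_resolved = np.array([[12.0, 20.0],
                           [27.0, 40.0]])

pixel_residuals = ground_truth - super_resolved
print(pixel_residuals)  # [[-2.  0.] [ 3.  0.]]
```

For a least-squares line with an intercept, the residuals sum to approximately zero, so they are usually squared or taken in absolute value before being aggregated into a quality measure.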
JD: Sure. If you can explain it, then go on.
JH: So I mean, what I described is basically calculating the L2 distance between all the image pixels. So RGB values, basically: you have three values per pixel, and you can interpret this as a vector, and then you compare vectors, take vector distances, get a magnitude, and that's a scalar value, and then you can sum over those, and then you have a loss value to optimize for, for example.
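The computation described here can be written down directly. This is a minimal sketch: the function names `l2_distance_loss` and `mse_loss` are hypothetical. The first follows the description literally (one RGB difference vector per pixel, its magnitude, summed over pixels). The second is the mean of squared differences, a common variant in practice that is added here as an assumption for comparison.

```python
import numpy as np

def l2_distance_loss(ground_truth: np.ndarray, prediction: np.ndarray) -> float:
    """Sum over pixels of the Euclidean distance between RGB vectors."""
    diff = ground_truth.astype(np.float64) - prediction.astype(np.float64)
    per_pixel = np.sqrt((diff ** 2).sum(axis=-1))  # magnitude of each RGB difference
    return float(per_pixel.sum())

def mse_loss(ground_truth: np.ndarray, prediction: np.ndarray) -> float:
    """Mean of squared differences over all pixels and channels."""
    diff = ground_truth.astype(np.float64) - prediction.astype(np.float64)
    return float((diff ** 2).mean())

# A 1x2 RGB image: the first pixel is off by (3, 4, 0), the second is exact.
ground_truth = np.array([[[0, 0, 0], [255, 255, 255]]], dtype=np.uint8)
prediction = np.array([[[3, 4, 0], [255, 255, 255]]], dtype=np.uint8)

print(l2_distance_loss(ground_truth, prediction))  # 5.0
print(mse_loss(ground_truth, prediction))          # 25 / 6
```

The images are cast to floating point before subtracting, because subtracting `uint8` arrays directly would wrap around instead of producing negative differences.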
And the problem you have with that is that what you learn as your network is to be very smushy in your image generation. So this encourages you to make blurry images, because you always want to learn the distribution over the entire data set. So the easiest way to start out is to just generate the mean of all the data. Then you're on average pretty good, right?
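A toy example shows why "generate the mean" is attractive under an L2 objective. Assume, purely for illustration, that an input is equally compatible with two sharp one-dimensional signals whose edges sit at different positions. Predicting either sharp signal is penalized heavily whenever the other one was the true target, while their blurry average is never far from either.

```python
import numpy as np

def expected_squared_error(prediction: np.ndarray, targets: list) -> float:
    """Average squared L2 error of one prediction over equally likely targets."""
    return float(np.mean([np.sum((prediction - t) ** 2) for t in targets]))

# Two equally plausible sharp targets: an edge at position 2 or at position 4.
sharp_a = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
sharp_b = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0])
targets = [sharp_a, sharp_b]

# The mean of the targets has a soft ramp instead of a crisp edge.
blurry_mean = (sharp_a + sharp_b) / 2

print(expected_squared_error(sharp_a, targets))      # 1.0
print(expected_squared_error(sharp_b, targets))      # 1.0
print(expected_squared_error(blurry_mean, targets))  # 0.5
```

The blurry prediction has half the expected loss of either sharp one, so a model trained only on this signal is pushed toward averaged, soft outputs.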
And then you start learning to generate the images based on the inputs better and better. And the L2 loss, because of its quadratic nature, makes it so that images are more blurry. But there are better ways to get more realistic-looking, crisper images, for example using perceptual losses, where you compare not RGB values but other features that you can generate and then also differentiate. That's always important in deep learning.
So that's a huge field of science. And in the end, what you will probably always do is a combination of something like perceptual losses, an L2 loss and a GAN loss, which is yet another field we might talk about, and which then helps you to get better image generations.
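One common way to write such a combination is as a weighted sum. The weights $\lambda_{2}$, $\lambda_{\text{perc}}$ and $\lambda_{\text{GAN}}$ are symbols introduced here for illustration. The conversation gives no values for them, and in practice they are tuned per task.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{2}\,\mathcal{L}_{2}
  + \lambda_{\text{perc}}\,\mathcal{L}_{\text{perc}}
  + \lambda_{\text{GAN}}\,\mathcal{L}_{\text{GAN}}
```

Here $\mathcal{L}_{2}$ is the pixel-wise loss discussed above, $\mathcal{L}_{\text{perc}}$ compares feature representations instead of RGB values, and $\mathcal{L}_{\text{GAN}}$ is the adversarial term.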
JD: So this would explain why, in a lot of inpainting results or demos I saw, they say, okay, generate me a background, and the background got blurrier and blurrier the more they generated. Is that the effect you're describing?
JH: I mean, I would have to know what the model was trained on, but basically, yes. If you only trained your model on an L2 signal, where you, you know, subtract pixel values and then take the mean of the squared differences, that would result in more, what did I say? Yeah, blurred image generations.
JD: Okay. Thanks for the detailed explanation, Johannes.
So we will split the episode up here and talk about generative adversarial networks in the next episode, and also NeRFs, neural radiance fields.