J&J Talk AI Episode 05: Vision Transformer Architectures



Transcript

Welcome back to J&J Talk AI.

This time, in episode 5, we're talking about vision transformer architectures. Johannes nearly got ahead of himself.

Let's talk about vision transformer architectures because they sound really futuristic.

JH: Yeah, I mean, I wouldn't probably call it futuristic.

JD: I think it's just a fancy name.

JH: Yeah, true, you're getting at the name. I mean, transformers, of course. It's good that they didn't call them... what are the other ones called?

JD: Not the Autobots, but the Perceptrons.

JH: Perceptrons, yeah.

JD: That's another architecture.

JH: Oh, man. Oh, man. Now it's on the internet. We don't know too much about the transformer franchise.

No, but what we do know about is the transformer architectures, right? So in the last episode, we talked about how convolutional networks aggregate information and that way put information into context. Transformers on the other hand, do this a little differently, right?

Let's put it differently: transformers expect a sequence of some sort as input, right? The origin of transformer architectures is in language modeling, especially translation tasks.
So I have a sequence of words and I want to turn it into a different sequence of words, French to English or whatever you want to do. So you basically need the context, because language is always contextual; the meaning is all in the context.

JD: Exactly.

JH: Language is also really ambiguous. A word can mean various things depending on the context it's spoken in. And sometimes even the context of the other words is not enough. Sometimes if I say something to my mother, it means something different than if I say it to a friend, but that's a whole different topic.

But transformers put this sequence into context by using a process called attention or depending on which part of the transformer we're talking about, it's also sometimes called self-attention, which allows, as you say, to put each word into context with all the other words in a learned process. And this way I can be very good at producing sequences in turn that, for example, work well as a translation into a different language.

Now why is that also interesting for computer vision? An image itself is not a sequence. And that's true, but I can take the inputs I give to a transformer and I can append more information to it. I can take each sequence element, I can say this comes from the top left corner, this comes from the top right corner, bottom left, bottom right. So you slice the image.

JD: We slice the image. So you get basically words.

JH: Exactly. I take the image, turn it into words, into small patches, and I supply the information about where each patch is coming from. And then the putting-into-context process can respect this information and decide whether it is relevant for the question I'm asking or the task I'm trying to solve, or maybe even discard it, which is also definitely possible depending on the task.

And the fact that this works is really interesting and was very surprising to me, at least, because from an intuitive point of view, taking the image and treating it as a set of pixels from which I aggregate information into some abstract low-dimensional representation is far more intuitive to me than treating the image as a bunch of patches that I put into context with each other.
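To make the patching idea concrete, here is a minimal sketch in NumPy of how an image could be turned into a sequence of patch "words" with position information attached. The function name, patch size, embedding dimension, and random projection are illustrative assumptions, not the exact vision transformer implementation.

```python
# Sketch: slice an image into patches ("words") and attach position info.
import numpy as np

def image_to_patch_tokens(image, patch_size=16, embed_dim=64, rng=None):
    """Turn an image into a sequence of patch embeddings plus positions."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w, c = image.shape
    patches, positions = [], []
    for row, top in enumerate(range(0, h - patch_size + 1, patch_size)):
        for col, left in enumerate(range(0, w - patch_size + 1, patch_size)):
            patch = image[top:top + patch_size, left:left + patch_size, :]
            patches.append(patch.reshape(-1))   # flatten each patch
            positions.append((row, col))        # where it came from
    patches = np.stack(patches)                 # (num_patches, patch_size*patch_size*c)

    # Project each flattened patch into an embedding vector
    # (in a real model this projection is learned, not random).
    projection = rng.normal(size=(patches.shape[1], embed_dim))
    tokens = patches @ projection               # (num_patches, embed_dim)

    # Add a positional embedding so the model knows which corner
    # each "word" came from (also learned in a real model).
    pos_embedding = rng.normal(size=(len(positions), embed_dim))
    return tokens + pos_embedding, positions

# Example: a 224x224 RGB image becomes a sequence of 14x14 = 196 tokens.
tokens, positions = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 64)
```

In an actual vision transformer, both the projection and the positional embeddings are learned during training rather than drawn at random as in this sketch.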

JD: Yeah, makes sense.

So that's how we humans would do it, like convolutional neural networks: just okay, there's something, and then okay, there's a cow, for example, or there's a human.

JH: Yeah, to be honest, I really cannot comment on this because I'm not a neurologist. I don't really know.

JD: Yeah, okay. I obviously also don't know. But that's how I perceive it.

JH: Yeah, exactly. I would probably perceive it quite similarly.

I think the interesting point is that we usually combine the two approaches, right? We have this kind of pixel-based information aggregation and then put the aggregated information into context with other aggregated information. That's what we were getting at at the beginning of our conversation: that, for example, these patches are turned into vectors using convolutional neural networks. Or I could even go so far as to take a convolutional neural network, apply it to the entire image for a few layers, get a lower-dimensional image of vectors, and then treat those vectors as input to a transformer network, which puts this information into context to, for example, solve a classification problem or a segmentation problem or whatever problem we want to solve.
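As a rough sketch of this hybrid idea, here is what such a combination could look like in PyTorch: a couple of convolutional layers aggregate pixel information, and the resulting feature-map locations are fed as a token sequence into a standard transformer encoder. The class name, layer sizes, and mean-pooling choice are assumptions for illustration, not a specific published architecture.

```python
import torch
import torch.nn as nn

class HybridClassifier(nn.Module):
    def __init__(self, num_classes=10, d_model=128):
        super().__init__()
        # Convolutional "front end": aggregates local pixel information
        # and shrinks the spatial resolution.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer encoder: puts the aggregated feature vectors
        # into context with each other via self-attention.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, images):                        # images: (B, 3, H, W)
        features = self.backbone(images)              # (B, d_model, H/4, W/4)
        tokens = features.flatten(2).transpose(1, 2)  # (B, num_tokens, d_model)
        tokens = self.encoder(tokens)                 # contextualised tokens
        return self.head(tokens.mean(dim=1))          # pool and classify

logits = HybridClassifier()(torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
```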

But that's where the strength of the combination of the two approaches is. I can aggregate information and I can put it into context more reasonably. And to extend this a little bit more into the philosophical, that's also where I see the parallels to how we work as brain and eye interface, right?

The eye in this case would be the sensor that takes this bunch of photons and colors and whatever and turns it into a nerve signal. And then the brain takes the signals that come in over time and puts them into context, or takes the signals that come from various receptors on the eye and puts them into context. So that's kind of a high-level, probably wrong interpretation of how our brain works. But yeah... some biologists might correct us.

JD: Yeah, yeah.

JH: If you want to write us emails, please be gentle. But that's my intuitive understanding of why this works.

JD: Yeah, that's an intuitive understanding. And as far as my biological studies go, I would say that's maybe correct.

Yeah, you mentioned something which is maybe hard to understand: the attention, or self-attention, function. What is attention, actually?

JH: Yeah, that's a really good question. It's also the core of what makes transformers work, right? I won't try to explain exactly how it's done in the transformer architecture, but rather give a simpler example, I think.

So for example, if you want to attend to a sequence or a number of values, you would do this through simple matrix multiplication, where the matrix then is the set of attention weights. It's a learned set of weights that says, okay, if my input looks like this, the important value is this. So if I have a sequence of 10 examples or 10 values, the attention mechanism would then decide which of those 10 values is the most important at this point in time or in this given context or whatever.

So the attention mechanism is basically a filter that gives you the most relevant information right now. And in the transformer, this is then combined with a little more complex mathematics, but in the end, it's still the same.

I can ask the question over the sequence of given inputs, how do I want to attend to those values which are important to put into context with each other? And that's where self-attention comes in. I can then say, given this set of inputs, if I only have this context, which of those inputs is the most relevant? And the more general formulation of attention is given two sequences, which values would I want to put into context with each other? So that's a very rough explanation of what attention is.

But the gist of it is you have a sequence, you want to find out which value is the most important, and that's what the attention mechanism solves for you. And it also takes into account the context of everything. Exactly. Yeah.
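As a toy illustration of that weighted-combination view, here is a small NumPy sketch in which every element of a sequence computes softmax weights over all the other elements and mixes them accordingly. Real transformers additionally use learned query, key, and value projections, which are omitted here for simplicity; the function name and sizes are assumptions.

```python
import numpy as np

def self_attention(tokens):
    """Each token asks: which of the other tokens matter most for me?"""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)                  # pairwise similarity
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    weights = e / e.sum(axis=-1, keepdims=True)              # attention weights
    return weights @ tokens, weights                         # context-mixed tokens

tokens = np.random.default_rng(0).normal(size=(10, 8))  # a sequence of 10 values
mixed, weights = self_attention(tokens)
# weights[i] sums to 1 and says how strongly token i attends to every other token.
print(weights[0].round(2))
```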

JD: Okay, so we do this for context because that's my next question.

So now we have CNNs. They work well, as you described. And now we're doing vision transformer architectures because transformers work well in language domains, and they also work well in computer vision domains.

But why would we actually do the vision transformer architectures when CNNs just work fine?

JH: I think the why is because it works, right? It was an interesting thing to look at, and that's why the scientific community did take a look at it and figured out, hey, this actually works quite well, and for certain scenarios it works even better than CNNs.

I don't really have better reasoning right now, right? It's a scientific process, and you can interpret it or find reasons why you would do this, maybe because our brain is structured this way and that, like we did.

But the real reason we do it is because it works.

JD: What wonderful last words from you, Johannes, to close episode number five, and also this season.

Thanks for all the explanations, Johannes, and I hope we see you next season.

JH: I would hope so, too. It was really fun. Thanks, Johannes.

