AI & Chemistry

AlphaFold explained in the simplest way

Shoeblack.AI 2024. 11. 25. 23:33

AlphaFold is an AI that predicts protein structures. The work behind it was just recognized with the Nobel Prize in Chemistry, so you have probably heard the name. But few people have studied how it works, because it seems very difficult. These days, when people talk about AI, most of them just mean ChatGPT. Many people simply think of AI as a kind of website or program that talks to you.

 

First of all, what is AI?

AI is a set of parameters inside a model. The parameter values are the main body of AI. Let's take ChatGPT as an example. When you type sentences into ChatGPT, each sentence is turned into numbers. Then the calculation is done by multiplying and adding the parameters of the model with those input numbers. The model then converts the final calculated numbers back into a sentence. This is how ChatGPT processes a prompt and provides an answer. Other AI models work in the same basic way. In a QSAR model, you input a chemical structure. The model first translates the structure into numbers, and the final value it calculates is a prediction of the experimental value of that chemical. When you prompt an image generation model, it turns the sentence into numbers, multiplies and adds them with the parameters, and calculates little by little the noise that needs to be removed from a noise image. So it starts with a meaningless image that is just noise. Step by step, the calculation with the model's parameters removes noise until the final image, one that fits the intent of the prompt, is obtained.
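To make "multiplying and adding" concrete, here is a minimal sketch in Python. It is only a toy: the `encode` function, the weights, and the bias are made-up values for illustration, not anything taken from ChatGPT or another real model.

```python
# Toy illustration only: a "model" that is nothing but a few numbers.
# The weights and bias are made-up values, not from any real model.

def encode(text):
    """Turn text into numbers: here, character codes scaled by 1/255."""
    return [ord(ch) / 255 for ch in text]

def predict(inputs, weights, bias):
    """Multiply each input by its parameter, add everything up, add the bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

weights = [0.5, -1.0, 2.0]  # hypothetical parameters
bias = 0.1                  # hypothetical parameter

x = encode("abc")           # three characters become three numbers
y = predict(x, weights, bias)
print("input numbers:", x)
print("model output:", y)
```

A real model repeats this kind of multiply-and-add step with millions or billions of parameters, but the principle is the same.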
All models, whether they are image models, chemical models, or language models, have parameter values. These are the numerical values inside the model that are used to make calculations. When training starts, the parameter values are random numbers. As the model is trained repeatedly with data, the parameter values are updated. Training is stopped when the model's output is good enough, and the parameter set at that point becomes the final model. That is why the parameter values obtained through learning can be called the main body of the model. The user can only see the inputs and outputs. However, the values entered by the user are turned into the final answer through many calculations with the parameters inside the model, so the parameters are essential for deriving the final result from the values we enter. Models that perform operations using parameters derived from data are called artificial intelligence. The algorithm that updates the parameters from the data is called a machine learning algorithm. So when we say a model learns from data, we mean that it uses the data to update the parameters inside it. AI is a whole set of parameter values updated based on data!
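Here is a small sketch of what "updating parameters from data" can look like. It assumes the simplest possible model, y = w * x + b, and uses plain gradient descent as the machine learning algorithm. The data, learning rate, and number of steps are arbitrary choices for illustration.

```python
import random

random.seed(0)

# Training data generated from the rule y = 2x + 1.
# The model is never told this rule; it only sees the data points.
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]

# The parameters start as random numbers.
w = random.uniform(-1, 1)
b = random.uniform(-1, 1)

learning_rate = 0.05  # arbitrary choice for this toy problem
for step in range(2000):
    # Gradient of the mean squared error with respect to w and b.
    grad_w = 0.0
    grad_b = 0.0
    for x, y_true in data:
        error = (w * x + b) - y_true
        grad_w += 2 * error * x / len(data)
        grad_b += 2 * error / len(data)
    # Update the parameters a little in the direction that reduces the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print("learned parameters:", w, b)
```

After training, w and b end up very close to 2 and 1. The two learned numbers are the whole model.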

 

How does AlphaFold work?

I mentioned that a language model turns sentences into numbers. To do that, the language model first divides sentences into tokens. Tokens are similar to words, but they are not exactly words. For example, in the sentence 'I am running to school,' the word 'running' can be split into 'run' and 'ning'. The ending marks a grammatical form rather than a new meaning, so 'running' can be composed of two tokens. Because of cases like this, a token is not exactly a word. In any case, the pieces of a sentence are called tokens. In an image model, the input image is sometimes divided into patches, for example a 3 x 3 grid, before it is fed into the model. Each image patch is an image token. In other words, information can be sliced into smaller pieces, each of which is called a token. AlphaFold predicts a three-dimensional structure from protein sequence information. In AlphaFold, tokens are defined for the protein sequence and structure and used to make the final prediction.
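As a rough sketch of how a sentence becomes tokens and then numbers, here is a toy tokenizer in Python. The vocabulary is a made-up, six-entry example and the greedy longest-match rule is a simplification chosen for illustration. Real language models learn much larger vocabularies from data and may split words differently.

```python
# Hypothetical tiny vocabulary, for illustration only.
VOCAB = {"i", "am", "run", "ning", "to", "school"}

def tokenize_word(word, vocab):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):  # try the longest piece first
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No known piece starts here: keep the single character as a token.
            tokens.append(word[start])
            start += 1
    return tokens

def tokenize(sentence, vocab):
    tokens = []
    for word in sentence.lower().split():
        tokens.extend(tokenize_word(word, vocab))
    return tokens

tokens = tokenize("I am running to school", VOCAB)

# Each token gets an integer ID; these IDs are the numbers the model sees.
token_to_id = {tok: i for i, tok in enumerate(sorted(VOCAB))}
token_ids = [token_to_id[tok] for tok in tokens]

print(tokens)
print(token_ids)
```

With this vocabulary, 'running' comes out as the two tokens 'run' and 'ning', while the other words stay whole.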
 
If you look at the paper, three additional kinds of information are gathered based on the protein sequence: protein information, gene information, and protein structures (conformers) generated from the input sequence. All three are converted into tokens along with the protein sequence itself, and the values are then entered into the model. To understand why, we need to understand how protein structures were predicted before AlphaFold.
There were already computational methods for predicting protein structure. One of them is called homology modeling. Here's a simple example. There are two proteins, protein A and protein B. Protein A's 3D structure has been determined experimentally, but protein B's 3D structure is not yet known. However, proteins A and B have very similar sequences. So shouldn't protein B be structurally similar to protein A? This is the basic idea behind homology modeling: using the three-dimensional structure of protein A as a template, we can estimate the three-dimensional structure of protein B by overlaying the sequence of protein B on top of it. The same idea applies to AlphaFold. A protein sequence is one-dimensional information, and by itself it is not enough to build an accurate three-dimensional structure. However, if we start from the 3D structures of other proteins with similar sequences and modify them, it may be possible to predict the structure quite accurately. So AlphaFold does a lot of searching to collect all the relevant three-dimensional information from the available resources.
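The first step of this idea, finding a known protein whose sequence is similar to the target, can be sketched in a few lines of Python. The sequences and protein names below are made up, and the comparison assumes the sequences are already aligned and have the same length. Real template searches use proper sequence alignment against large databases.

```python
def sequence_identity(seq_a, seq_b):
    """Fraction of positions with the same amino acid.

    Assumes the two sequences are already aligned and equally long.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("this toy comparison needs sequences of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)

# Hypothetical proteins whose 3D structures are already known.
known_structures = {
    "protein_A": "MKTAYIAKQR",
    "protein_C": "GSHMLEDPVA",
}

# Hypothetical protein B, whose structure we want to predict.
target_sequence = "MKTAYLAKQR"

# Pick the known protein with the most similar sequence as the template.
best_template = max(
    known_structures,
    key=lambda name: sequence_identity(target_sequence, known_structures[name]),
)
best_identity = sequence_identity(target_sequence, known_structures[best_template])

print(best_template, best_identity)
```

Here protein_A differs from the target at only one position out of ten, so it is chosen as the template.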
 
In AlphaFold3, the final protein structure is generated by a diffusion model. The similar protein structures found by the template search and the genetic search are eventually fed into the diffusion model, which generates the final 3D structure of the target protein. The image generation model described above is also a diffusion model. It works by removing noise little by little to produce the desired image. In AlphaFold, protein structure information is represented as tokens. Therefore, it works by removing noise little by little from meaningless noise tokens until tokens of the actual three-dimensional structure are generated.
It was actually hard to find an exact description of the model in the main text of the paper. The supplementary information of the article discloses the details of the tokens used in the model and how the model generates the 3D structure.
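The "remove noise little by little" loop can be sketched as follows. This is a heavily simplified toy and not AlphaFold's method: the coordinates are made up, and `predict_noise` cheats by comparing against a known answer, whereas a real diffusion model estimates the noise with its trained parameters.

```python
import random

random.seed(0)

# Hypothetical 3D coordinates of three atoms. This stands in for what a
# trained network would have learned; a real model does not store the answer.
target = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]

def predict_noise(coords):
    """Stand-in denoiser: returns how far each atom is from the target.

    A real diffusion model would estimate this with its learned parameters.
    """
    return [
        tuple(c - t for c, t in zip(atom, ref))
        for atom, ref in zip(coords, target)
    ]

# Start from meaningless noise: random positions for every atom.
coords = [tuple(random.gauss(0, 5) for _ in range(3)) for _ in target]

step_size = 0.2  # remove 20% of the estimated noise per step (arbitrary)
for step in range(50):
    noise = predict_noise(coords)
    coords = [
        tuple(c - step_size * n for c, n in zip(atom, atom_noise))
        for atom, atom_noise in zip(coords, noise)
    ]

print(coords)
```

After enough small steps, the random starting positions have moved to the target structure.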
https://www.nature.com/articles/s41586-024-07487-w
 
According to AlphaFold3's GitHub repository, the final output is a cif file. CIF (Crystallographic Information File) is one of the file formats for representing crystal structures. Proteins are usually crystallized to determine their structure experimentally, so cif is one of the file formats used to represent protein structures. PDB (Protein Data Bank) is another popular format. I have actually used PDB files for most protein structure information, and I have only used cif files when dealing with inorganic compounds. Language models extract features from the tokens in a sentence to generate answers. In the same way, AlphaFold converts various information derived from the protein sequence into tokens and extracts features from them to predict 3D structures. If you want to learn more about diffusion models, I recommend the following lecture :)
https://www.deeplearning.ai/short-courses/how-diffusion-models-work/
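Coming back to the cif output mentioned above, here is a rough sketch of how atom coordinates are laid out in the mmCIF flavor of the format and how they could be read. The sample text is a made-up two-atom fragment, and the parser only handles this simple whitespace-separated case. Real files contain many more columns and quoting rules, so a dedicated parsing library is the safer choice in practice.

```python
# Made-up two-atom fragment in mmCIF style, for illustration only.
SAMPLE_CIF = """\
data_example
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1 N MET 27.340 24.430 2.614
ATOM 2 CA MET 26.266 25.413 2.842
#
"""

def read_atom_site(cif_text):
    """Read the _atom_site table into a list of dicts (simplified parser)."""
    columns = []
    rows = []
    in_atom_site = False
    for line in cif_text.splitlines():
        line = line.strip()
        if line.startswith("_atom_site."):
            # Column header, e.g. "_atom_site.Cartn_x" -> "Cartn_x".
            columns.append(line.split(".", 1)[1])
            in_atom_site = True
        elif in_atom_site:
            # A blank line, "#", or a new section ends the table.
            if not line or line.startswith(("#", "loop_", "_", "data_")):
                break
            rows.append(dict(zip(columns, line.split())))
    return rows

atoms = read_atom_site(SAMPLE_CIF)
coords = [
    (float(a["Cartn_x"]), float(a["Cartn_y"]), float(a["Cartn_z"]))
    for a in atoms
]
print(coords)
```

Each row is one atom, and the Cartn_x, Cartn_y, and Cartn_z columns hold its position in three-dimensional space.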

 
