
Using AlphaFold2 Colab: Step-by-Step Protein Structure Prediction

AlphaFold2 represents a groundbreaking advance in protein structure prediction. It is considered one of the most significant contributions of artificial intelligence (AI) to science and one of the most important scientific discoveries of the 21st century (1). This assessment is not exaggerated: understanding the three-dimensional structures of proteins has been one of the greatest challenges in biology and has occupied scientists for decades (2). In 2020, AlphaFold2 (3) won the CASP14 competition by predicting protein structures with an error smaller than the average diameter of a carbon atom (mean <1 Å). This was the best result achieved up to that time and far outperformed its competitors. The accuracy of the results generated great excitement because it indicated that, for the first time, computational predictions were approaching the accuracy of experimental data (4).

AlphaFold2 uses a protein sequence as input to create multiple sequence alignments (MSA) from various protein databases, identifying regions prone to mutations and detecting correlations between them. It also identifies proteins with similar structures to construct the initial representation of the target sequence (template). These two strategies had been used before and were adopted by other algorithms in CASP14. However, AlphaFold2's groundbreaking success lies in its neural network architectures, particularly two core neural network modules called the "evoformer" and the "structure module" (3,5). Evoformer extracts information from multiple sequence alignments and templates, facilitating forward and backward information flow within the network.

AlphaFold2's performance in protein structure prediction, together with its predicted structures for over 200 million proteins, is transforming structural biology and profoundly affecting the areas of biology and medicine that rely on protein structural knowledge. AlphaFold2 and its predicted structures give researchers new opportunities to solve problems previously considered extremely complex. The tool contributes significantly to fields such as structural biology, drug discovery, protein design, target prediction, protein function prediction, protein-protein interactions, and the elucidation of mechanisms of biological action (1).

To enable researchers without access to high-performance hardware such as GPUs and TPUs to use AlphaFold2, independent solutions based on Google Colaboratory have been developed. Google Colaboratory is a Jupyter Notebook environment hosted by Google; it is free to use for logged-in users and provides access to powerful GPU resources. Accordingly, Tunyasuvunakool and colleagues developed a specialized Jupyter Notebook to run AlphaFold2 on Google Colaboratory (6).


Figure 1. Google Colab — AlphaFold2 Homepage.

In our example application, we will use the human FTO protein, whose experimental structure is available in the Protein Data Bank (PDB), and the FOXO6 sequence, a protein whose structure has not been experimentally determined (not present in the PDB).


Step 1: First, the sequences of these two proteins are obtained from the UniProt database. The sequence with UniProt accession number Q9C0B1 will be used for the FTO protein, and the sequence with accession number A8MYZ6 will be used for the FOXO6 protein.

Figure 2. Retrieval of FOXO6 and FTO protein sequences for Homo sapiens from the UniProt database.

Step 2: The sequences are then downloaded in FASTA format. To do this, select FASTA in the Download tab and click download.

Figure 3. UniProt FASTA Format Sequence Download Tab.
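Steps 1 and 2 can also be scripted. The sketch below is a minimal illustration, not part of the original workflow: it assumes the public UniProt REST endpoint follows the pattern `https://rest.uniprot.org/uniprotkb/<accession>.fasta`, and the function names `fasta_url` and `fetch_fasta` are our own.

```python
from urllib.request import urlopen

# Assumption: URL pattern of the public UniProt REST service for FASTA records.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fasta_url(accession: str) -> str:
    """Build the download URL for one UniProt accession number."""
    return UNIPROT_FASTA_URL.format(accession=accession.strip().upper())


def fetch_fasta(accession: str, timeout: float = 30.0) -> str:
    """Download the FASTA record as text (needs network access)."""
    with urlopen(fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For the two proteins used here, `fetch_fasta("Q9C0B1")` would return the FTO record and `fetch_fasta("A8MYZ6")` the FOXO6 record, provided the endpoint behaves as assumed.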

Step 3: Next, open the AlphaFold2 page on Google Colab. Delete the default sequence in the query_sequence field and paste only the sequence part of the FTO protein (without the FASTA header line). Optionally, enter a JobName; leave num_relax at “0” and template_mode at “none”. Then run the cell with the button on its left.

Figure 4. Entering the sequence of the FTO protein into the AlphaFold2 tool.
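Because only the sequence part of the FASTA file goes into query_sequence, it can help to strip the header line programmatically. The following is a small sketch under our own assumptions: the helper name `sequence_from_fasta` is hypothetical, only the first record of the file is used, and only the 20 standard amino acid letters are accepted.

```python
# Assumption: the query should contain only the 20 standard amino acid letters.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def sequence_from_fasta(fasta_text: str) -> str:
    """Return the residue string of the first FASTA record, without its header."""
    residues = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # a second header starts another record; stop here
            seen_header = True
            continue
        residues.append(line.upper())
    sequence = "".join(residues)
    unexpected = sorted(set(sequence) - VALID_RESIDUES)
    if unexpected:
        raise ValueError(f"Unexpected characters in sequence: {unexpected}")
    return sequence
```

The returned string is a single line with no header and no line breaks, which is the form expected in the query_sequence field.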

Step 4: A green check mark in place of the button means that the cell ran successfully. Run the remaining cells one by one; all of them can be run with their default settings. (Because the protein sequences are long, the run can take 30 to 45 minutes.)


Step 5: If everything runs smoothly up to the Package and download results cell, run that last cell to download the model structures and plots of the FTO protein as a “.zip” file.
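Once the “.zip” file is on disk, its contents can be inspected before opening individual files. The sketch below uses only the Python standard library; the function name `list_results` is ours, and no particular file naming inside the archive is assumed.

```python
import zipfile
from collections import defaultdict
from pathlib import Path


def list_results(zip_path, extract_to=None):
    """Group the files in a results archive by extension; optionally extract them."""
    by_extension = defaultdict(list)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            by_extension[Path(name).suffix.lower()].append(name)
        if extract_to is not None:
            archive.extractall(extract_to)
    return {ext: sorted(names) for ext, names in by_extension.items()}
```

Looking up the `".pdb"` key of the returned dictionary gives the model structure files, and `".png"` gives the plots, if the archive contains files with those extensions.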


Step 6: Once all cells have run for FTO, repeat the procedure for the FOXO6 protein. Paste the sequence from its FASTA file into the query_sequence field and run every cell again up to the last one. When all cells have finished, download the FOXO6 structures from the Package and download results cell.

Figure 5. Entering the sequence of the FOXO6 protein into the AlphaFold2 tool.

Step 7: The “pLDDT” and “coverage” results of the 5 model structures downloaded for each of the two proteins can now be examined. AlphaFold generates a model confidence score (pLDDT) between 0 and 100 for each amino acid. Regions with a pLDDT below 50 may be structurally disordered. The coverage graph shows the number of aligned sequences on the Y axis and the amino acid positions of the query protein on the X axis. What matters here is that each position of the query sequence is covered by a large number of similar sequences.
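The per-residue pLDDT values can also be read from the model files themselves. AlphaFold-style PDB files store the pLDDT in the B-factor column; the sketch below assumes the downloaded models follow that convention and the standard fixed-column PDB layout. The function names `plddt_per_residue` and `fraction_below` are ours.

```python
def plddt_per_residue(pdb_text: str) -> dict:
    """Map residue number to pLDDT, read from the B-factor column of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed PDB columns: atom name 13-16, residue number 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue_number = int(line[22:26])
            scores[residue_number] = float(line[60:66])
    return scores


def fraction_below(scores: dict, threshold: float = 50.0) -> float:
    """Fraction of residues whose pLDDT is under the threshold."""
    if not scores:
        return 0.0
    low = sum(1 for value in scores.values() if value < threshold)
    return low / len(scores)
```

Calling `fraction_below(plddt_per_residue(open(path).read()))` on a model file, where `path` is a placeholder for one of the downloaded PDB files, gives a rough measure of how much of the protein falls in the possibly disordered range.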

Figure 6. pLDDT Graph of FOXO6 Protein.

For example, in this pLDDT graph of FOXO6, the confidence score of the region between approximately residues 90 and 130 is above 80, meaning that this region is well predicted. Other regions can be examined by applying the same logic (Figure 6).
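Reading well-predicted stretches off the graph can be complemented by a simple scan over the per-residue scores. The sketch below assumes the scores are available as a list ordered by residue position; the threshold of 80 follows the text, while the minimum stretch length of 5 and the function name `confident_regions` are our own choices.

```python
def confident_regions(scores, threshold=80.0, min_length=5):
    """Return (start, end) residue ranges, 1-based and inclusive, with pLDDT >= threshold."""
    regions = []
    start = None
    for position, value in enumerate(scores, start=1):
        if value >= threshold:
            if start is None:
                start = position  # a new confident stretch begins
        elif start is not None:
            if position - start >= min_length:
                regions.append((start, position - 1))
            start = None
    # Close a stretch that runs to the end of the sequence.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions
```

For a profile like the one described for FOXO6, such a scan would be expected to report a range close to residues 90 to 130, but the exact boundaries depend on the actual scores.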


Figure 7. Coverage Graph of FOXO6 Protein.

For example, this coverage plot for FOXO6 shows that, consistent with the pLDDT plot, the well-predicted region is covered by approximately 11,700 of the 12,000 sequences used. The region between positions 100 and 200 is highly conserved and is likely to be functionally important. Coverage drops sharply after position 200, so the regions beyond it may be flexible or disordered (Figure 7).
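The quantity on the Y axis of the coverage plot can be illustrated with a few lines of code. This is a simplified sketch, not the notebook's own implementation: it assumes the alignment is given as equal-length rows aligned to the query, with `-` or `.` marking gaps, and the function name `coverage_per_position` is ours.

```python
def coverage_per_position(alignment, gap_chars="-."):
    """Count, for each query position, how many aligned sequences are not gaps."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("All aligned sequences must have the same length")
    return [
        sum(1 for row in alignment if row[column] not in gap_chars)
        for column in range(width)
    ]
```

A position covered by nearly all sequences gets a count close to the total number of rows, while a position that most sequences do not reach gets a low count, which is what a sharp drop in the plot reflects.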

Figure 8. pLDDT Graph of FTO protein.

Step 8: In this pLDDT graph of the FTO protein, the confidence score of most of the sequence is above 80, meaning that these regions are well predicted. Other regions can be examined by applying the same logic. For example, the two ends of the protein (the N and C termini) were not modeled well, because their pLDDT values are below 50. In general, however, more regions were well modeled in the FTO protein than in the FOXO6 protein (Figure 8). In addition, no major differences are observed between the models (rank_1, rank_2, rank_3, rank_4 and rank_5), which shows that the predicted structures are consistent.
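The agreement between the ranked models can be made more concrete by comparing their pLDDT profiles position by position. The sketch below assumes each model's scores are available as a list of equal length; the function names `plddt_spread` and `mean_plddt` are ours, and the toy input in the usage note is not real data.

```python
def plddt_spread(models):
    """Per-residue gap between the highest and lowest pLDDT across models."""
    tracks = list(models.values())
    if not tracks:
        return []
    length = len(tracks[0])
    if any(len(track) != length for track in tracks):
        raise ValueError("All models must cover the same residues")
    return [max(column) - min(column) for column in zip(*tracks)]


def mean_plddt(models):
    """Average pLDDT of each model, keyed by model name."""
    return {
        name: sum(track) / len(track)
        for name, track in models.items()
        if track
    }
```

With a toy input such as `{"rank_1": [90.0, 80.0, 40.0], "rank_2": [88.0, 82.0, 55.0]}`, the spread is small at the first two positions and larger at the third; small spreads along most of the sequence correspond to the consistency described above.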

Figure 9. Coverage Graph of FTO protein.

Similarly, the coverage graph shows that the well-predicted region is covered by approximately 750 to 780 of the roughly 1,100 sequences used. In addition, the FTO protein is generally well conserved, but the end regions and some internal regions are more diverse (Figure 9).


Comparing the two proteins used in the example, the FTO protein, whose three-dimensional structure has been determined experimentally, was modeled better than FOXO6. The protein structures predicted by AlphaFold2 can be examined with various molecular visualization tools (such as PyMOL and Chimera) and, if they are judged to be well modeled, can be used in methods such as molecular docking and molecular dynamics simulations.



In conclusion, the high accuracy achieved by AlphaFold2 in protein structure prediction has brought revolutionary progress to structural biology and opened new horizons in biomedical research. This AI-based tool helps scientists understand the three-dimensional structure of proteins and offers great opportunities in areas such as drug discovery, protein engineering, and the elucidation of disease mechanisms. In addition, platforms such as Google Colaboratory make this technology accessible to a wide range of researchers and so enable its wider use.



REFERENCES

1. Yang, Z., Zeng, X., Zhao, Y., & Chen, R. (2023). AlphaFold2 and its applications in the fields of biology and medicine. Signal transduction and targeted therapy, 8(1), 115. https://doi.org/10.1038/s41392-023-01381-z

2. Dill, K. A., & MacCallum, J. L. (2012). The protein-folding problem, 50 years on. Science (New York, N.Y.), 338(6110), 1042–1046. https://doi.org/10.1126/science.1219021

3. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Tunyasuvunakool, K., et al. (2020). AlphaFold 2. Fourteenth Critical Assessment of Techniques for Protein Structure Prediction. London: DeepMind.

4. Xu, T., Xu, Q., & Li, J. (2023). Toward the appropriate interpretation of Alphafold2. Frontiers in artificial intelligence, 6, 1149748. https://doi.org/10.3389/frai.2023.1149748

5. Skolnick, J., Gao, M., Zhou, H., & Singh, S. (2021). AlphaFold 2: Why It Works and Its Implications for Understanding the Relationships of Protein Sequence, Structure, and Function. Journal of chemical information and modeling, 61(10), 4827–4831. https://doi.org/10.1021/acs.jcim.1c01114

6. Tunyasuvunakool, K., Adler, J., Wu, Z., Green, T., Zielinski, M., Žídek, A., Bridgland, A., Cowie, A., Meyer, C., Laydon, A., Velankar, S., Kleywegt, G. J., Bateman, A., Evans, R., Pritzel, A., Figurnov, M., Ronneberger, O., Bates, R., Kohl, S. A. A., Potapenko, A., … Hassabis, D. (2021). Highly accurate protein structure prediction for the human proteome. Nature, 596(7873), 590–596. https://doi.org/10.1038/s41586-021-03828-1
