Proteins are the engines driving life's chemistry, carrying out nearly all of the chemical activity within the body. As such, the engineering of novel proteins is increasingly viewed as the most promising avenue for addressing many of the environmental and health challenges humanity faces today. These challenges range from creating antibodies capable of neutralizing pathogens and eliminating cancer cells to devising enzymes that can break down a vast array of pollutants, often transforming them into reusable substances.

The Weizmann Institute has made significant strides in advancing our understanding of protein structure, function, and interactions, and of the roles proteins play in cellular processes and human diseases. A milestone contribution came from Prof. Ada Yonath, a Weizmann Institute scientist and Nobel Laureate in Chemistry (2009). Using X-ray crystallography, she pioneered the mapping of the structure of the ribosome, the cell's protein factory. Her groundbreaking work has had far-reaching implications, enhancing our comprehension of protein synthesis in cells and aiding the development of new antibiotics.

As we continue to seek innovative solutions, AI has become an instrumental ally in the global endeavor to develop novel, beneficial proteins. Perhaps the most significant achievement to date in harnessing AI to aid scientific research has been the development of AI tools that can predict a protein's structure from the sequence of its constituent amino acids. However, this somewhat unexpected success underscores the enormity of the challenges that the field continues to face.

The primary challenge lies in the unfathomable number of potential proteins. Most functionally interesting proteins comprise a chain of approximately 300 amino acids. Even when the scope is limited to shorter proteins of just 100 amino acids, each position in the chain can hold any of the 20 standard amino acids, so the number of possible sequences is 20^100, a staggering 10^130 or so (while the entire observable universe is estimated to hold only about 10^80 particles). Each sequence determines the structure of a potential protein, and each structure in turn shapes the protein's function. Our understanding of the relationships between sequence, structure, and function is still rudimentary. Given this vast universe of possibilities, the probability of randomly identifying a sequence that produces a functional protein is practically zero.
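
The size of this sequence space can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on stated assumptions: every position in the chain is chosen independently from a fixed alphabet, and the function name and variables are hypothetical. With the 20 standard amino acids, a 100-residue chain has 20^100 possible sequences, about 10^130; counting 21 amino acids (for example, including selenocysteine) gives about 10^132.

```python
import math


def sequence_space_log10(length: int, alphabet_size: int = 20) -> float:
    """Return log10 of the number of distinct sequences of the given length.

    Assumes each position is chosen independently from `alphabet_size` options.
    """
    return length * math.log10(alphabet_size)


# Exact count for 100 residues over 20 amino acids (Python integers are arbitrary precision).
exact_count = 20 ** 100

log_20 = sequence_space_log10(100, 20)  # 20 standard amino acids (assumption)
log_21 = sequence_space_log10(100, 21)  # 21 amino acids, e.g. with selenocysteine (assumption)
particles_log10 = 80                    # particle estimate quoted in the text

print(f"20^100 is about 10^{log_20:.1f}")
print(f"21^100 is about 10^{log_21:.1f}")
print(f"Sequences per particle (20 amino acids): about 10^{log_20 - particles_log10:.1f}")
```

Working in logarithms keeps the comparison readable: under these assumptions, the sequence space for even a short protein exceeds the quoted particle count by roughly 50 orders of magnitude.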

Considering these odds, it's perhaps unsurprising that attempts to design novel proteins from scratch, even with the use of advanced AI tools, have hit a wall. However, the scientific community has begun to develop successful AI methods for creating novel proteins based on known structures. Advancing these methods and applying them to novel datasets can dramatically accelerate scientific progress in this field.

The recent success of AI in predicting protein folding was facilitated by decades of high-quality data collected by scientists studying the relationship between protein sequences and structures. Mirroring this success in other aspects of novel protein design necessitates the establishment of similarly high-quality, annotated training datasets. Overcoming this hurdle could catalyze a revolution in protein design, making the process faster, cheaper, and easier, and better equipping humanity to face future challenges.