
How ChatGPT technology can help seed companies develop new resistant varieties

A researcher at Wageningen University is developing a tool that will compare the mechanisms by which the genomes of related species express, or fail to express, characteristics that are key for breeders.

6/6/2024

Cabbage crops.

Only a few years ago, artificial intelligence seemed to belong to science fiction movies. Today it is part of everyday life, in tools as simple as a virtual advisor and in more complex ones that write a text from a brief description of its topic, drawing on the prior knowledge to which they have had access. Artificial intelligence is here to stay, and not only for such mundane applications or for darker uses such as faking a voice or an image. It also has applications in science and research that can lead to significant advances in knowledge and in its practical application.

One example is plant genetic research, where artificial intelligence may help answer questions that are still enigmas for researchers. Michael Schön, a researcher at Wageningen University in the Netherlands, is designing a tool based on artificial intelligence that could accelerate and simplify the future development of new varieties resistant to drought or diseases.

Proteins are the basic elements of cells, and the genetic information for producing them is stored in genes and copied into RNA. RNA that carries the information needed to produce a protein is called coding RNA. There are also genes that produce RNA lacking this information (non-coding RNA) but performing important tasks in the development of organisms: activating or deactivating genes, changing the characteristics of a plant, and determining whether it reaches maturity.

On this premise, the tool that Schön is developing will compare non-coding RNAs across plant genomes, which could reveal, among other things, why a species that belongs to a certain family has different characteristics. The researcher starts from a limitation in this field: existing research focuses mainly on coding genes. This makes it difficult, for example, to compare the non-coding genes of Arabidopsis thaliana (thale cress), a model organism whose non-coding RNA Schön already knows from previous research, with those of other crops in the same family, the Brassicaceae, also known as the mustard or cruciferous family, such as broccoli or cauliflower. Therefore, the non-coding genome of each crop has to be unraveled separately, which departs from the usual practice.

“There are more than 200 genomic sequences available for plants in the mustard family. Each genome is stored as a large text file consisting of millions of letters representing the bases of a DNA molecule (A [adenine], C [cytosine], T [thymine], and G [guanine]). Because the non-coding bits are not properly cataloged in these genomes, it is impossible to compare all the non-coding genes scattered within this mountain of data. We need new strategies and tools for that. I am trying to develop them,” explains Schön. Thanks to the Veni grant that his project has received, he aims to identify non-coding RNAs using knowledge from related species.

The first step is to know which parts of the genome to search. Schön intends to solve this challenge with a tool he has called GeneSketch, based on the Minimizer Sketch method. The method rests on the idea that it is only necessary to look at a small part of the DNA (a sketch) instead of the complete sequence, comparing a few thousand characters per genome rather than millions. It has already been used to build a family tree of primate evolution, including humans and their close relatives, and it produced a very precise tree from sketches of less than 1% of the respective genomes. That makes it a very effective way of estimating DNA similarities, and it should be useful for comparing the genomes of different species within the crucifers.
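As a rough illustration of the sketching idea, the short Python example below slides a window over the k-mers of a sequence (short substrings of length k), keeps the k-mer with the smallest hash value in each window, and then compares two sequences by the overlap of the resulting sets. It is a minimal sketch of a generic minimizer scheme, not GeneSketch itself: the function names, the values of k and w, the choice of hash function, and the toy sequences are all assumptions made for this example.

```python
import hashlib


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_sketch(seq, k=5, w=4):
    """Return the set of minimizers: the lowest-hash k-mer of every
    window of w consecutive k-mers. Sequences that are too short for
    one full window yield an empty sketch."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        sketch.add(min(window, key=kmer_hash))
    return sketch


def jaccard(a, b):
    """Jaccard similarity of two sketches (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy sequences (hypothetical, not real genome data).
seq_a = "ACGTTGCATGCCGATAGCTAGGCTA"
seq_b = "ACGTTGCATGCCGATAGCTAGGCTT"

sketch_a = minimizer_sketch(seq_a)
sketch_b = minimizer_sketch(seq_b)
print("k-mers in seq_a:", len(seq_a) - 5 + 1, "| sketch size:", len(sketch_a))
print("estimated similarity:", jaccard(sketch_a, sketch_b))
```

Because neighboring windows often share the same minimizer, the sketch is typically much smaller than the full list of k-mers, which is what makes it practical to compare many large genomes.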

Once you know where to look, you need to be able to understand the resulting data. To do this, GeneSketch uses the same technology that ChatGPT is based on: transformers. “You can ask a transformer to complete a missing word in a sentence, for example. Initially, the transformer gives you a random word because it has never seen words before. But if you train it with millions of example sentences, it slowly learns to guess the right words by paying attention to patterns in the text. After training, a large language model like ChatGPT becomes very good at certain tasks, such as answering questions or translating from one language to another. A transformer can be trained to learn not only human languages, but also the language of DNA, which has its own distinctive patterns. I am working on a model to detect patterns in the DNA of many different species and translate those patterns into a language that we, as humans, can understand,” explains the researcher.

Schön will therefore train GeneSketch to focus on how non-coding genes change across species, although there is one limitation: reliability. “The transformer is a relatively new technology and it makes mistakes. ChatGPT, for example, was trained on many different text sources, but if you ask it about a topic it never saw during training, it needs to make something up. You hope it will come up with something reasonable based on the patterns it has seen, but this is never a guarantee. You obviously want to avoid meaningless results. The more a transformer is trained, the fewer errors it produces, but training can cost a lot of time and money. Is it better to train the model completely from scratch or build it from existing models? I am trying both approaches,” he says.
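The "complete the missing word" task that Schön describes can be pictured for DNA with a small Python example. The code below only prepares one training example: it splits a sequence into short "words" and hides one of them, which a model would then be asked to guess. It does not implement a transformer, and the article does not say how GeneSketch turns DNA into words, so the non-overlapping 3-letter words, the `[MASK]` placeholder, and the function names are assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # placeholder token (an assumption for this example)


def tokenize(seq, k=3):
    """Split a DNA string into non-overlapping words of k letters.
    Trailing letters that do not fill a whole word are dropped."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def make_masked_example(tokens, rng):
    """Hide one randomly chosen word; return the masked sentence,
    the hidden position, and the word the model should recover."""
    pos = rng.randrange(len(tokens))
    masked = list(tokens)
    target = masked[pos]
    masked[pos] = MASK
    return masked, pos, target


# Toy sequence (hypothetical, not real genome data).
tokens = tokenize("ATGCGTACGTTA")
masked, pos, target = make_masked_example(tokens, random.Random(0))
print("input :", " ".join(masked))
print("answer:", target, "at position", pos)
```

During training, a model would see many such examples and be rewarded whenever its guess for the hidden word matches the answer, which is how it gradually picks up the patterns in the sequences.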

The project began in October 2023 and is expected to deliver its first results at the end of this year, when Schön hopes to have a prototype of the tool ready and to be able to create non-coding gene annotations for all crucifer crops.

If it succeeds, the tool would represent a great advance not only for research but also for breeders and seed companies, who would be able to quickly understand the DNA of a crop and its wild relatives. Plant breeders could then improve the traits of new varieties by looking at those that their wild relatives have developed over the centuries to become more resistant to adverse conditions such as drought or diseases.

Infoagro Editor: Lydia Medero


© Copyright Infoagro Systems, S.L.

Infoagro.com