The Simplex Things In Life: Utilizing Artificial Intelligence Models to Better Understand Autism

Autism Spectrum Disorder, or ASD, is nothing if not unique.

The way ASD manifests itself in people is unique; although it most often presents as some form of variable impairment in social interaction and communication, each individual has behaviors and habits that are as unique to them as snowflakes are to one another.

ASD has also proven itself to be a uniquely challenging disorder to study. In the past decade, de novo (new) mutations have been identified as key contributors to causality of ASD. However, the majority of these identified de novo mutations are located in protein-coding genes, which comprise only 1–2% of the entire human genome.

Until now, most research has focused on identifying mutations located in the 20,000 identified genes in the protein-coding region, which would seem like a promising approach. Genes are the genetic blueprints for creating proteins, which control and perform crucial tasks in our bodies, such as fighting off infections, carrying signals between our organs, tissues, and cells as chemical messengers, and regulating our blood sugar levels. It seems like basic math: Genes + Mutations = Mutated Proteins. Mutated Proteins = Disrupted Protein Function.

However, it has been observed that all the known genes that are ASD-associated can explain only a minor fraction of new autism cases, and it is estimated that known de novo mutations in the protein-coding region contribute to not more than 30% of cases for individuals who have no family history of autism (better known as simplex ASD). This provides evidence to suggest mutations contributing to autism must additionally occur elsewhere in the genome.

The other 98% of the human genome, the noncoding region, was long thought of as “junk DNA” because it appeared to have no real purpose or function. Research has just begun to lift the shroud surrounding this genetic mystery. Unfortunately, the sheer vastness of that 98% of the 3.2 billion base pairs that comprise the human genome makes detecting noncoding mutations an exceptionally difficult task, made even more challenging by the fact that a single individual could carry dozens of mutations, many of which would be entirely unique to that individual.

Due to these challenges, the traditional study approach, which involves identifying common mutations throughout affected populations, is impractical in this scenario. Until recently, scouring the noncoding region of the genome for functional mutations (and working out which of those functional mutations contribute to the phenotype of the disorder) would have involved a painstaking, seemingly never-ending slog of millions of lab experiments, which sounds about as unfathomable as it does unsavory. Thus, the pursuit to uncover which noncoding mutations might cause ASD is by no means a small feat.

Keep It Simplex

In a recent study published in Nature Genetics, a research team at Princeton University set out, new method in hand, to tackle this massive undertaking. The team applied an artificial intelligence (AI) technique called deep learning to a collection of data known as the Simons Simplex Collection (SSC).

This data set comprised whole-genome sequencing data for 1,790 ‘quartet’ families: families of four that have one child with autism, as well as two parents and a single sibling, all of whom are unaffected by the disorder. Each of these quartets had no prior family history of autism, which implicates non-inherited mutations as the cause of the affected child’s autism in each family. This, combined with the built-in controls of the matched, unaffected siblings, made this particular set of data ideal for this large-scale study.

In deep-learning models, an algorithm performs consecutive layers of analysis in order to understand and detect patterns. The deep-learning framework for this experiment was designed to identify biologically relevant sections of DNA and RNA that could predict the functional, as well as pathogenic, impact of de novo mutations in the genomes available from the SSC. Deep-learning models depend entirely on having large amounts of data to learn from and, luckily, there is no danger of a data shortage in this particular case.

For the DNA level of deep learning, the team trained the model on cell-type-specific chromatin profiles from the ENCODE and Roadmap Epigenomics projects, covering 2,002 transcriptional regulatory features, including histone modifications and transcription factor binding. For the RNA level of learning, the deep-learning model was trained on the biochemical profiles of 232 post-transcriptional regulatory features, including RNA-binding protein (RBP) profiles, which were gathered from cross-linking immunoprecipitation (CLIP) data.

Using these models, the algorithm made its way through 7,097 whole genomes from the SSC, predicted the functional impact of every possible mutation in the context of the 1,000 base pairs around it, and provided a biochemical interpretation of that impact. To match up the biochemical outcomes to phenotypic impact, the team trained a regularized linear model using a set of mutations that have already been identified in human disease from the Human Gene Mutation Database (HGMD), as well as a collection of rare variants detected in healthy individuals in the 1000 Genomes populations.
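To make the idea of scoring a mutation "in the context of the 1,000 base pairs around it" more concrete, here is a minimal sketch of how a reference window and a mutated window might be prepared for a sequence model. It is not the study's code: the function names, the one-hot encoding, and the randomly generated toy genome are all assumptions made for illustration.

```python
# Sketch: build reference and mutated 1,000-base windows around a variant.
# All names are hypothetical; the toy genome is random, not real data.
import random

WINDOW = 1000  # context length mentioned in the article


def window_around(sequence, pos, alt, window=WINDOW):
    """Return (ref_window, alt_window) for the 0-based position `pos`."""
    half = window // 2
    start = pos - half
    end = start + window
    if start < 0 or end > len(sequence):
        raise ValueError("variant is too close to the sequence edge")
    ref_window = sequence[start:end]
    # Swap in the alternate base; the variant sits at index `half`.
    alt_window = ref_window[:half] + alt + ref_window[half + 1:]
    return ref_window, alt_window


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    order = "ACGT"
    return [[1 if base == b else 0 for b in order] for base in seq]


# Toy example: a random 5,000-base sequence with a variant in the middle.
rng = random.Random(0)
toy_genome = "".join(rng.choice("ACGT") for _ in range(5000))
pos = 2500
ref_base = toy_genome[pos]
alt_base = "A" if ref_base != "A" else "C"
ref_win, alt_win = window_around(toy_genome, pos, alt_base)
```

A trained model would then score both windows, and the difference between the two sets of predictions would serve as the estimated effect of the mutation.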

The linear model was then able to generate a predicted disease impact score (DIS) for every autism-related mutation individually, based on what the model learned about transcriptional and post-transcriptional regulatory effects. The DIS estimates the likelihood of any given mutation to contribute to a disorder or disease (in this case, autism) and serves to help prioritize and rank them based on this likelihood.
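As a rough illustration of how a regularized linear model can turn predicted regulatory effects into a single score, the sketch below fits an L2-regularized logistic regression by gradient descent. The choice of logistic regression with an L2 penalty is an assumption (the article says only "regularized linear model"), and the two-feature data set is invented purely for demonstration.

```python
# Sketch: an L2-regularized logistic regression that maps per-mutation
# feature vectors to a score between 0 and 1. The data are toy values.
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, l2=0.1, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent with an L2 penalty."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        for j in range(d):
            # The l2 * w[j] term shrinks each weight toward zero.
            w[j] -= lr * (grad_w[j] / n + l2 * w[j])
        b -= lr * grad_b / n
    return w, b


def impact_score(x, w, b):
    """Score one mutation's feature vector with the fitted model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy predicted-effect features; label 1 = disease-associated, 0 = benign.
X = [[0.9, 0.8], [0.7, 0.9], [0.8, 0.6], [0.1, 0.2], [0.2, 0.1], [0.0, 0.3]]
y = [1, 1, 1, 0, 0, 0]
w, b = train(X, y)
high = impact_score([0.85, 0.75], w, b)
low = impact_score([0.05, 0.15], w, b)
```

In this toy setting, a mutation with large predicted effects receives a higher score than one with small predicted effects, which is the kind of ranking the DIS is meant to provide.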

Through analysis with these models, the research team was able to determine that there was a significantly higher functional impact of noncoding de novo mutations in the children with autism, when compared to their unaffected siblings.
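One simple way to ask whether affected children carry higher-impact mutations than their matched siblings is a paired comparison within families. The sketch below uses an exact one-sided sign-flip (permutation) test. This is an illustrative stand-in, not necessarily the statistical test the researchers used, and the scores are made-up values.

```python
# Sketch: exact one-sided sign-flip test on paired family scores.
# The scores are invented illustration values, not study data.
from itertools import product


def sign_flip_p_value(proband_scores, sibling_scores):
    """P(mean paired difference >= observed) under random sign flips."""
    diffs = [p - s for p, s in zip(proband_scores, sibling_scores)]
    observed = sum(diffs) / len(diffs)
    count = total = 0
    # Under the null, each family's difference is equally likely to be
    # positive or negative, so enumerate every sign assignment.
    for signs in product((1, -1), repeat=len(diffs)):
        mean = sum(sg * d for sg, d in zip(signs, diffs)) / len(diffs)
        # Small tolerance guards against floating-point ties.
        if mean >= observed - 1e-12:
            count += 1
        total += 1
    return observed, count / total


# One hypothetical score per child, paired by family.
probands = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59]
siblings = [0.51, 0.50, 0.60, 0.45, 0.52, 0.61]
observed_diff, p_value = sign_flip_p_value(probands, siblings)
```

Pairing by family mirrors the built-in sibling controls of the SSC. Exhaustive enumeration is only practical for a handful of families, so a real analysis at this scale would sample permutations or use a rank-based test instead.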

The Gene-Tissue Connection

It is widely recognized that one of autism’s key attributes is altered development in the brain. However, a comprehensive gene-tissue association had never been established for the de novo noncoding variants. To address this, the research team methodically tested the variant effects of autism-specific mutations for the 53 cell types and tissues defined in the Genotype-Tissue Expression (GTEx) project.

The results of this testing revealed that the noncoding mutation variants have an effect on gene expression in the brain, as well as on some genes that have already been linked to autism, particularly those involved in brain synapse transmission and regulation of chromatin.

The observation that some of the noncoding variants affected similar genes and functions as previously identified coding variants suggests that mutations of both the coding and noncoding variety affect pathways and processes that have some overlap. This, in turn, implies a convergence in the genetic landscape, and emphasizes the possibilities of combining the mutations to pinpoint specific ASD-associated genes.

Validation Station

The research team’s deep-learning analysis revealed prime candidates for noncoding, disease-associated mutations that have the potential to influence ASD through gene expression regulation. To functionally validate the predicted high-impact mutations and provide additional evidence for their potential causality, the researchers used a cell-based luciferase reporter assay system to study the allele-specific effects in the lab.


The team inserted some of the predicted high-impact mutations into Promega’s pGL4.23 Firefly Luciferase Vector, which was then transfected into human neuroblastoma BE(2)-C cells, along with our NanoLuc® Luciferase Genetic Reporter. Following transfection, the team utilized Promega’s Nano-Glo® Dual-Luciferase® Assay System to detect luminescence, and thus the changes in gene expression. The resulting changes affirmed that the predictions made by the deep-learning model translated to quantitative allele-specific effects on gene expression.
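The arithmetic behind a dual-reporter readout can be sketched in a few lines. The sketch assumes that the firefly signal reports on the inserted sequence and that the NanoLuc signal serves as the co-transfected control used for normalization; the function names and luminescence values are hypothetical placeholders.

```python
# Sketch: normalize reporter signal and compare alleles.
# Luminescence readings below are invented placeholders, not study data.
from statistics import mean


def normalized_ratio(firefly, nanoluc):
    """Firefly (experimental) signal divided by NanoLuc (control) signal."""
    if nanoluc <= 0:
        raise ValueError("control signal must be positive")
    return firefly / nanoluc


def allele_fold_change(ref_wells, alt_wells):
    """Mean normalized alt-allele signal relative to the ref allele."""
    ref = mean(normalized_ratio(f, n) for f, n in ref_wells)
    alt = mean(normalized_ratio(f, n) for f, n in alt_wells)
    return alt / ref


# Each tuple is (firefly reading, NanoLuc reading) for one replicate well.
ref_wells = [(1000.0, 500.0), (1200.0, 600.0), (900.0, 450.0)]
alt_wells = [(1500.0, 500.0), (1800.0, 600.0), (1350.0, 450.0)]
fold = allele_fold_change(ref_wells, alt_wells)
```

Dividing by the control signal corrects for well-to-well differences in transfection efficiency and cell number, so a fold change away from 1 points to an allele-specific effect on expression.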

Great Expectations

The successful application of AI modeling to the world of genomics reveals the previously unknown significance of mutations in the noncoding region and hints at the roles that both transcriptional and post-transcriptional mechanisms might play, not only in the causality of ASD, but in other complex diseases as well, such as heart disease and cancer.

Although the study did not ultimately reveal the precise causes of autism, it did the next best thing by providing thousands of possible ASD contributors. This will help narrow the focus of study for future researchers, which in and of itself feels like a huge step forward in cracking the autism code.

Natalie is a Science Writer at Promega. She earned her B.S. in Microbiology from the University of Wisconsin-Madison, and her Associate's Degree in Science from Cottey College. In her spare time, she can be found playing volleyball, making music, chipping away at her never-ending stack of craft projects, and volunteering with animals.
