The number of molecular biology datasets available is growing exponentially every month. Multi-omics covers all the layers of the molecular biology stack: the genome, epigenome, transcriptome, proteome, and metabolome. In this episode, we're going to learn how each layer of the stack works, then look at three different real-world use cases for cancer patients (diagnostic, prognostic, and predictive) using open-source Python code on GitHub. Then we'll look at how a generative adversarial network can be used to generate synthetic genomic data to battle imbalanced classes. Enjoy!
This guide will demonstrate how AlphaCare can generate synthetic genomic data from a dataset of 5,000 lab rats using deep learning.
This is the third piece in our series on AlphaCare. Before we go any further, I suggest you read the first two. You can find them below:
The term "genomic data" describes an organism's genome and DNA information. They are utilized in bioinformatics for gathering, storing, and processing the genomes of living species. In general, analyzing genomic data requires a lot of storage space and specialized tools.
Because real genomic data is highly challenging to obtain, generating synthetic data helps refine the predictions of deep learning models. This type of data is primarily used to predict patient drug responses and helps doctors recommend optimal prescriptions.
This guide will also cover the most valuable type of biomedical data: multi-omics. To clarify, multi-omics is a new approach in which datasets from multiple omic groups are combined during analysis.
We will also study how deep learning models can be trained on multi-omics data to optimize patient longevity. We'll tie the whole process together using RapidAPI, which makes it easy to find drug metadata.
Take cryptocurrency as an example: its entire mechanism rejected the legacy financial system's fundamental premise that only government-issued paper money has value. An example of a P2P transaction can be seen below:
```py
from web3 import Web3
import os

# setup the infura eth node (plug in your project id)
w3 = Web3(Web3.HTTPProvider('https://mainnet.infura.io/v3/your-project-id'))

# setup addresses and get private key for address1
address1 = Web3.toChecksumAddress('your_public_address')
address2 = Web3.toChecksumAddress('receiver_public_address')
private_key = os.getenv('PRIVATE_KEY')

# in this case, the nonce is the number of transactions on account 1
nonce = w3.eth.getTransactionCount(address1)

# setup the transaction
tx = {
    'nonce': nonce,
    'to': address2,
    'value': w3.toWei(0.001, 'ether'),
    'gas': 21000,
    'gasPrice': w3.toWei(40, 'gwei'),
}

# sign the transaction with the private key from address1
signed_tx = w3.eth.account.signTransaction(tx, private_key)

# send the raw transaction to the eth node
tx_hash = w3.eth.sendRawTransaction(signed_tx.rawTransaction)
```
Likewise, the longevity movement rejects the fundamental idea of the legacy medical system that reversing the aging process is impossible.
The longevity movement is taking the entire biomedical tech stack in a new direction: instead of going in for tests once sickness symptoms appear, we can continuously monitor multiple analytes, collectively known as multi-omics data.
Thanks to hardware and software that are both more powerful and more accessible, medical researchers can now investigate human health at the tiniest scales, down to the level of individual molecules. A simple example of working with omics data in Python can be seen below:
```py
def nucleotide_frequency(seq):
    """Count nucleotides in a given sequence. Return a dictionary."""
    tmpFreqDict = {"A": 0, "C": 0, "G": 0, "T": 0}
    for nuc in seq:
        tmpFreqDict[nuc] += 1
    return tmpFreqDict
```
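Calling it on a short sequence returns the count of each base:

```py
>>> nucleotide_frequency("ATGCGCTA")
{'A': 2, 'C': 2, 'G': 2, 'T': 2}
```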
Microscopy, sequencing techniques, and raw computing power have generated petabytes of this data. Omics is a molecular biology term for the study of a group of molecules. The first such discipline to emerge was genomics: the study of the entire genome, i.e., the DNA.
DNA contains information about your physical abilities, dietary requirements, intelligence, and personality. As we age or are exposed to environmental factors such as stress, specific genes are turned on or off, changing how the DNA is expressed without altering the sequence itself. These are known as epigenetic changes, and their study is called epigenomics.
DNA stores the information but doesn't act on it directly. When making proteins, for instance, DNA is transcribed into the messenger molecule RNA, which retrieves the information and carries it to the appropriate location. The study of the sum of all of these RNA transcripts is called transcriptomics.
Tens of thousands of proteins are produced due to these processes, each of which is in charge of hundreds of vital bodily functions. The study of how these proteins are produced, degraded, and expressed is called proteomics.
All possible proteins encoded in an amino acid sequence can be computed, an example of which can be seen below:
```py
def proteins_from_rf(aa_seq):
    """Compute all possible proteins in an amino acid
    seq and return a list of possible proteins."""
    current_prot = []
    proteins = []
    for aa in aa_seq:
        if aa == "_":
            # STOP accumulating amino acids if _ - STOP was found
            if current_prot:
                for p in current_prot:
                    proteins.append(p)
                current_prot = []
        else:
            # START accumulating amino acids if M - START was found
            if aa == "M":
                current_prot.append("")
            for i in range(len(current_prot)):
                current_prot[i] += aa
    return proteins
```
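For example, a sequence with two START-to-STOP reading frames yields two candidate proteins:

```py
>>> proteins_from_rf("MAG_MKV_")
['MAG', 'MKV']
```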
Proteins eventually break down into a group of molecules known as metabolites, including lipids and carbohydrates. These are the final downstream products of the gene expression process; they represent the current state of the biological system, and their study is called metabolomics.
Modern medicine's challenge is integrating all of this data into a comprehensive picture of health, a process known as multi-omics analysis. In computer science terms, the genome is analogous to a hard disk that stores data. The epigenome is the disc reader, the transcriptome serves as an interpreter, the metabolome serves as a process monitor, and the proteome functions as the applications themselves.
Aging is the gradual loss of information; the hard drive wears out, and the genome becomes scratched. However, by examining all the omics layers, we can better understand the information flow and eventually discover efficient ways to preserve and restore this information, reversing the aging process.
Biomarkers play a critical role in patient planning, preventative measures, and clinical decisions, and they can be classified as diagnostic, prognostic, or predictive. So let's look at three different real-world applications to understand why we need to investigate DNA.
Traditionally, a method called single-cell RNA-Seq provided an excellent overview of the different types of immune cells that exist and how many are present in a given sample.
To measure both the RNA and the proteome from the same cell, a method known as cellular indexing of transcriptomes and epitopes by sequencing, or CITE-seq, was developed more recently. In addition to gene expression, it generates a sequencing readout for protein expression, providing us with a higher-resolution overview.
Let's use the first CITE-seq dataset, published in 2017. It measured the transcriptome of 8,000 cells as well as the expression of 13 proteins.
There are two CSV files in this folder: one for gene expression and one for protein expression. After concatenating the two kinds of information, each row represents a cell, and each column represents either a gene or a protein.
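As a minimal sketch of this loading step with pandas (the filenames below are placeholders; the actual files in the dataset are named differently):

```py
import pandas as pd

# load the two modalities (placeholder filenames for the CITE-seq CSVs)
rna = pd.read_csv("cite_seq_rna.csv", index_col=0)          # cells x genes
protein = pd.read_csv("cite_seq_protein.csv", index_col=0)  # cells x 13 proteins

# align on the shared cell barcodes and concatenate column-wise
cells = rna.index.intersection(protein.index)
combined = pd.concat([rna.loc[cells], protein.loc[cells]], axis=1)
print(combined.shape)  # (n_cells, n_genes + n_proteins)
```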
Since the dimensions of these two data types differ, we will first encode each independently before concatenating the results. The concatenated outputs are then passed to a shared encoder, after which a decoder attempts to reconstruct the original input. This is done with an autoencoder, a deep learning model we can build in PyTorch, and the method is regarded as unsupervised.
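Here is a minimal PyTorch sketch of that architecture. The layer sizes are illustrative, and the 2,000-gene input width is an assumption (only the 13-protein panel comes from the dataset):

```py
import torch
import torch.nn as nn

class MultimodalAutoencoder(nn.Module):
    """Encode RNA and protein separately, merge them, then reconstruct both."""
    def __init__(self, rna_dim, protein_dim, latent_dim=32):
        super().__init__()
        # modality-specific encoders
        self.rna_enc = nn.Sequential(nn.Linear(rna_dim, 256), nn.ReLU())
        self.prot_enc = nn.Sequential(nn.Linear(protein_dim, 16), nn.ReLU())
        # shared encoder over the concatenated embeddings
        self.shared_enc = nn.Linear(256 + 16, latent_dim)
        # decoder reconstructs the concatenated input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, rna_dim + protein_dim),
        )

    def forward(self, rna, protein):
        z = self.shared_enc(
            torch.cat([self.rna_enc(rna), self.prot_enc(protein)], dim=1))
        return self.decoder(z), z

model = MultimodalAutoencoder(rna_dim=2000, protein_dim=13)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# one training step (random batches stand in for real cell data)
rna_batch, protein_batch = torch.randn(64, 2000), torch.randn(64, 13)
recon, latent = model(rna_batch, protein_batch)
loss = loss_fn(recon, torch.cat([rna_batch, protein_batch], dim=1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```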
After training, we can extract the compressed representation and display it as a two-dimensional map using a method known as t-SNE. The cells' locations and relationships then become quite evident. This helps doctors understand the effectiveness of a therapy.
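Projecting the latent codes down to two dimensions takes only a few lines with scikit-learn's t-SNE; the random matrix here is a stand-in for the latent codes extracted from the trained encoder:

```py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

latent = np.random.rand(500, 32)  # stand-in for (n_cells, latent_dim) codes

# project to 2-D; perplexity is a tunable neighborhood-size parameter
embedding = TSNE(n_components=2, perplexity=30).fit_transform(latent)

plt.scatter(embedding[:, 0], embedding[:, 1], s=2)
plt.title("t-SNE map of CITE-seq cells")
plt.show()
```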
Next, let's discuss the second use case, "Diagnosis." We are trying to determine if a patient has cancer or not. Individual data sets with varying degrees of precision are frequently used to diagnose patients as either healthy or ill. However, we might theoretically improve accuracy if we combined data sets.
When we look at this single-cell omics dataset, we can see that it contains data on gene expression, DNA methylation, and open chromatin regions for a group of cells, combining epigenomic and transcriptomic data. Because these modalities follow different distributions, such as binary, categorical, and continuous, we can normalize them by modeling them under a joint probability distribution.
Then, for dimensionality reduction, we can use a technique called UMAP. This creates pairwise connections between data points, which, when visualized, gives us a clear display of whether cells are sick or healthy.
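A sketch of that step with the umap-learn package; the random matrix and labels are stand-ins for the normalized multi-omics matrix and the healthy/sick annotations:

```py
import numpy as np
import matplotlib.pyplot as plt
import umap

X = np.random.rand(500, 100)           # stand-in for the normalized multi-omics matrix
labels = np.random.randint(0, 2, 500)  # stand-in for healthy (0) / sick (1) labels

# build pairwise neighbor connections and embed them in 2-D
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
embedding = reducer.fit_transform(X)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="coolwarm", s=2)
plt.title("UMAP of single-cell multi-omics data")
plt.show()
```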
"Predictive Analytics" is the third use case. DeepMind released AlphaFold 2 two years ago, the most high-tech protein folding algorithm ever developed, including the world's largest protein database. This database can be used by researchers to predict how proteins fold, which can help them better design drugs to target specific diseases.
RapidAPI can be used to locate a database with more information for predicting a particular medicine's effectiveness. For instance, the myHealthbox API accepts a drug name like aspirin and returns a ton of metadata about that drug in JSON format, including the active components and the dosage amount. We could then include this information in our app.
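A sketch of such a lookup with the requests library. RapidAPI calls are authenticated with the X-RapidAPI-Key and X-RapidAPI-Host headers; the endpoint path and query parameter below are assumptions and should be checked against the myHealthbox listing on RapidAPI:

```py
import os
import requests

# hypothetical endpoint; verify the real path on the myHealthbox RapidAPI page
url = "https://myhealthbox.p.rapidapi.com/search"

headers = {
    "X-RapidAPI-Key": os.getenv("RAPIDAPI_KEY"),  # your personal RapidAPI key
    "X-RapidAPI-Host": "myhealthbox.p.rapidapi.com",
}

response = requests.get(url, headers=headers, params={"query": "aspirin"})
drug_info = response.json()  # JSON metadata: active ingredients, dosage, etc.
print(drug_info)
```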
Combining this protein dataset with another omics dataset is one area I'd love to see explored for better drug design, but that can wait for future guides. For now, let's produce some synthetic genomic data. In general, little public genomic data is available for training, and what exists is frequently heavily skewed.
We'll need a lot of data to train deep neural networks to make predictions, such as estimating the likelihood of developing a genetic disorder. Utilizing a generative adversarial network to produce synthetic data is one workaround.
Starting with a dataset of 4,000 rats, we can feed it into a GAN. The generator produces new examples from random noise, and the discriminator classifies each example as authentic or fake. Both are then updated through backpropagation, and over time the generator becomes adept at producing realistic samples.
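A condensed PyTorch sketch of one training iteration of that loop; the 1,000-SNP genotype width, network sizes, and random "real" batch are illustrative stand-ins for the actual rat data:

```py
import torch
import torch.nn as nn

n_snps, noise_dim = 1000, 100  # illustrative genotype width and noise size

# generator: random noise -> synthetic genotype vector
G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                  nn.Linear(256, n_snps), nn.Sigmoid())
# discriminator: genotype vector -> probability it is real
D = nn.Sequential(nn.Linear(n_snps, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randint(0, 2, (64, n_snps)).float()  # stand-in for real genotypes

# discriminator step: push real toward 1, fake toward 0
fake = G(torch.randn(64, noise_dim)).detach()
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# generator step: try to make the discriminator label fakes as real
fake = G(torch.randn(64, noise_dim))
g_loss = bce(D(fake), torch.ones(64, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```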
Once we have produced enough synthetic genotypes, we can train a model on them that is quite effective at predicting the likelihood of genetic disease in a rat.
This guide boils down to five points to remember. In computer terms, the genome is the hard disk of the molecular biology stack, which stores the information. The epigenome is the disc reader, and the transcriptome serves as an interpreter. The metabolome is the process monitor, and the proteome is the collection of all applications.
The more data we integrate, the better we can make accurate predictions and eventually reverse the aging process.