Leveraging Multiomics Data & RapidAPI for Cancer Diagnosis

Tue Jul 26 2022

9 min read

This guide demonstrates how AlphaCare can generate synthetic genomic data from a data set of 5,000 lab rats using deep learning.

This is the third piece in the AlphaCare series. Before we go any further, I suggest you read the first two, which you can find below:

Genomic Data

The term "genomic data" describes an organism's genome and DNA information. They are utilized in bioinformatics for gathering, storing, and processing the genomes of living species. In general, analyzing genomic data requires a lot of storage space and specialized tools.

Because real genomic data is challenging to obtain, generating synthetic data helps refine the predictions of deep learning models. This type of data is primarily used to predict patient drug responses, helping doctors recommend optimal prescriptions.

Multi-omic Data

This guide covers one of the most valuable types of biomedical data: multi-omics. Multi-omics is a new approach in which data sets from various omic groups are combined during analysis.

We will also study how deep learning models can be trained on multi-omics data to optimize patient longevity. The whole process is integrated using RapidAPI, which helps us find drug metadata easily.

Longevity Movement

Consider cryptocurrency: its entire mechanism rejected the legacy financial system's fundamental principle that only government-issued paper money has value. An example of a P2P transaction in this context:

```py
from web3 import Web3
import os

# set up the Infura eth node (plug in your project id)
w3 = Web3(Web3.HTTPProvider('https://mainnet.infura.io/v3/your-project-id'))
# set up addresses and get the private key for address1
address1 = Web3.toChecksumAddress('your_public_address')
address2 = Web3.toChecksumAddress('receiver_public_address')
private_key = os.getenv('PRIVATE_KEY')
# in this case, the nonce is the number of transactions on account 1
nonce = w3.eth.getTransactionCount(address1)
# set up the transaction
tx = {
    'nonce': nonce,
    'to': address2,
    'value': w3.toWei(0.001, 'ether'),
    'gas': 21000,
    'gasPrice': w3.toWei(40, 'gwei'),
}
# sign the transaction with the private key from address1
signed_tx = w3.eth.account.signTransaction(tx, private_key)
# send the raw transaction to the eth node
tx_hash = w3.eth.sendRawTransaction(signed_tx.rawTransaction)
```

Likewise, the longevity movement rejects the legacy medical system's fundamental idea that reversing the aging process is impossible.

This movement is pushing the entire biomedical tech stack in a new direction. Instead of going in for tests only once we show symptoms of sickness, we can constantly monitor multiple analytes, collectively called multi-omics data.

Due to recent advancements, medical researchers can now investigate human health at the tiniest scales, down to the level of individual molecules, thanks to hardware and software that are both more powerful and more accessible. A simple example of working with genomic sequence data in Python:

```py
def nucleotide_frequency(seq):
    """
    Count nucleotides in a given sequence.
    Return a dictionary.
    """
    tmp_freq_dict = {"A": 0, "C": 0, "G": 0, "T": 0}
    for nuc in seq:
        tmp_freq_dict[nuc] += 1
    return tmp_freq_dict
```
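
For example, calling nucleotide_frequency("ATGCGCTA") returns {"A": 2, "C": 2, "G": 2, "T": 2}.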

Genomic Study

Modern sequencing techniques and computing power have generated petabytes of data. Omics is a molecular-biology term for the study of a group of molecules. The first discipline to emerge was genomics: the study of the entire genome, i.e., all of an organism's DNA.

DNA contains information about your physical abilities, dietary requirements, intelligence, and personality. As we age or are exposed to environmental factors such as stress, specific genes are turned on or off without the underlying DNA sequence changing. These are known as epigenetic changes, and their study is known as epigenomics.

DNA stores the information but doesn't act on it directly. When making proteins, for instance, DNA is transcribed into the messenger molecule RNA, which carries the information to where it is needed. The study of the sum of all these RNA transcripts is called transcriptomics.

Tens of thousands of proteins are produced due to these processes, each of which is in charge of hundreds of vital bodily functions. The study of how these proteins are produced, degraded, and expressed is called proteomics.

For example, we can compute all the possible proteins encoded in an amino acid sequence by scanning for start (M) and stop (_) markers:

```py
def proteins_from_rf(aa_seq):
    """Compute all possible proteins in an amino acid
    sequence and return a list of possible proteins."""
    current_prot = []
    proteins = []
    for aa in aa_seq:
        if aa == "_":
            # "_" (STOP) was found: flush every accumulated protein
            if current_prot:
                for p in current_prot:
                    proteins.append(p)
                current_prot = []
        else:
            # "M" (START) was found: open a new accumulator
            if aa == "M":
                current_prot.append("")
            # extend every open accumulator with the current amino acid
            for i in range(len(current_prot)):
                current_prot[i] += aa
    return proteins
```
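
For example, proteins_from_rf("MAMAPRTEINSTRING_") returns ['MAMAPRTEINSTRING', 'MAPRTEINSTRING'], since the second M opens a second, overlapping protein.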

Proteins break down into a group of molecules known as metabolites, including lipids and carbohydrates.

Metabolites are the final downstream result of the gene transcription process. They represent the current state of the biological system, and their study is called metabolomics.

Multi-Omics Analysis in CS Terms

Modern medicine's challenge is integrating all of this data into a comprehensive picture of health, a process known as multi-omics analysis. In computer science terms, the genome is analogous to a hard disk that stores data. The epigenome is the disc reader, the transcriptome serves as an interpreter, the metabolome serves as a process monitor, and the proteome functions as the applications themselves.

Aging is the gradual loss of information; the hard drive wears out, and the genome becomes scratched. However, by examining all the omics layers, we can better understand the information flow and eventually discover efficient ways to preserve and restore this information, reversing the aging process.

Biomarkers play a critical role in patient planning, preventative measures, and clinical decisions, and they can be classified as diagnostic, prognostic, or predictive. So let's look at three different real-world applications to understand why we need to investigate DNA.

Prognosis – First Use Case

Traditionally, a method called single-cell RNA-Seq provided an excellent overview of the different types of immune cells that exist and how many are present in a given sample.

More recently, a method known as cellular indexing of transcriptomes and epitopes by sequencing, or CITE-seq, was developed to measure both the RNA and the proteome from the same cell. In addition to gene expression, it generates a sequencing readout for protein expression, providing a higher-resolution overview.

Data Set

Let's use the first CITE-seq data set, published in 2017. It measured the transcriptome of 8,000 cells as well as the expression of 13 proteins.

There are two CSV files in this folder: one for gene expression and one for protein expression. After concatenating both categories of information, each row represents a cell, and each column represents either a gene or a protein.

Since the dimensions of these two data types differ, we first encode each independently and then concatenate the results. The concatenated outputs are passed to a shared encoder, after which a decoder attempts to reconstruct the input. This whole pipeline is an autoencoder, a deep learning model we can build in PyTorch, and the method is unsupervised.
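
A minimal PyTorch sketch of such a two-modality autoencoder (the layer sizes, latent dimension, and 2,000-gene input width are illustrative assumptions; only the 13 proteins come from the data set):

```py
import torch
import torch.nn as nn

class MultimodalAutoencoder(nn.Module):
    """Each modality gets its own encoder; the codes are concatenated,
    compressed by a shared encoder, and reconstructed by a decoder."""

    def __init__(self, n_genes=2000, n_proteins=13, latent_dim=32):
        super().__init__()
        self.gene_encoder = nn.Sequential(nn.Linear(n_genes, 128), nn.ReLU())
        self.protein_encoder = nn.Sequential(nn.Linear(n_proteins, 16), nn.ReLU())
        self.shared_encoder = nn.Linear(128 + 16, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_genes + n_proteins),
        )

    def forward(self, genes, proteins):
        # encode each modality separately, then compress jointly
        joint = torch.cat([self.gene_encoder(genes), self.protein_encoder(proteins)], dim=1)
        z = self.shared_encoder(joint)
        return self.decoder(z), z

model = MultimodalAutoencoder()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# one training step on a random batch (replace with the two CITE-seq CSVs)
genes = torch.randn(64, 2000)
proteins = torch.randn(64, 13)
recon, z = model(genes, proteins)
loss = loss_fn(recon, torch.cat([genes, proteins], dim=1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```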

After training, we can extract the compressed representation and display it as a two-dimensional map using a method known as t-SNE. The cells' locations and relationships then become quite evident, which helps doctors understand the effectiveness of a therapy.
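
For example, assuming z holds the latent codes produced by the model above, scikit-learn can project them to 2-D:

```py
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# z: (n_cells, latent_dim) latent codes from the trained autoencoder
embedding = TSNE(n_components=2, perplexity=30).fit_transform(z.detach().numpy())
plt.scatter(embedding[:, 0], embedding[:, 1], s=2)
plt.title("t-SNE of the CITE-seq latent space")
plt.show()
```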

Diagnosis – Second Use Case

Next, let's discuss the second use case: diagnosis. Here we are trying to determine whether a patient has cancer. Individual data sets with varying degrees of precision are frequently used to diagnose patients as either healthy or ill, but we could theoretically improve accuracy by combining data sets.

Data Set

Looking at this single-cell omics data set, we can see that it contains gene expression, DNA methylation, and open chromatin regions for a group of cells, combining epigenomic and transcriptomic data. Because the features follow different distributions, such as binary, categorical, and continuous, we can put them on a common footing by modeling them with a joint probability distribution.

Then, for dimensionality reduction, we can use a technique called UMAP. It builds pairwise connections between data points, which, when visualized, gives us a clear display of which cells are sick and which are healthy.
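
A short sketch with the umap-learn package (the parameter values are typical defaults, not tuned for this data set, and X is the hypothetical joint feature matrix):

```py
import umap

# X: the joint (cells x features) matrix after normalization
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
embedding = reducer.fit_transform(X)  # (n_cells, 2) coordinates to plot
```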

Predictive Analytics – Third Use Case

"Predictive Analytics" is the third use case. DeepMind released AlphaFold 2 two years ago, the most high-tech protein folding algorithm ever developed, including the world's largest protein database. This database can be used by researchers to predict how proteins fold, which can help them better design drugs to target specific diseases.

RapidAPI

RapidAPI can be used to locate a database containing more information for predicting a particular medicine's effectiveness. For instance, the myHealthbox API accepts a drug parameter such as aspirin and returns a ton of metadata about that drug in JSON format, including the active components and the dosage amount. We could then include this information in our app.
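
A sketch of calling such an API through RapidAPI with requests (the host, path, and query parameter below are assumptions for illustration; check the myHealthbox listing on RapidAPI for the real endpoint):

```py
import requests

# assumed endpoint; RapidAPI exposes each API under its own *.p.rapidapi.com host
url = "https://myhealthbox.p.rapidapi.com/search"
headers = {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",            # your personal RapidAPI key
    "X-RapidAPI-Host": "myhealthbox.p.rapidapi.com",  # assumed host
}
response = requests.get(url, headers=headers, params={"query": "aspirin"})
print(response.json())  # drug metadata: active components, dosage, etc.
```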

Combining this protein data set with another omics data set is one area I'd love to see explored for better drug design, but that can wait for future guides. For now, let's produce some synthetic genetic data. Generally, little public genomics data is available for training, and what exists is frequently heavily skewed.

We'll need a lot of data to train deep neural networks to make predictions, such as estimating the likelihood of developing a genetic disorder. One workaround is to use a generative adversarial network (GAN) to produce synthetic data.

Starting with our data set of 5,000 rats, we can feed the genotypes into a GAN. The generator produces new examples from random noise, and the discriminator classifies each example as authentic or fake. Both are then updated through backpropagation, and over time the generator becomes adept at producing realistic samples.
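
A toy PyTorch sketch of this training loop (the genotype width, network sizes, and batch size are illustrative assumptions):

```py
import torch
import torch.nn as nn

n_snps = 1000  # assumed number of genetic variants per rat
generator = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, n_snps), nn.Sigmoid()
)
discriminator = nn.Sequential(
    nn.Linear(n_snps, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid()
)
loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real = torch.rand(64, n_snps)            # stand-in for a batch of real genotypes
fake = generator(torch.randn(64, 100))   # generator samples from random noise

# 1) update the discriminator: push real toward 1, generated toward 0
d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
         loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
d_opt.zero_grad()
d_loss.backward()
d_opt.step()

# 2) update the generator: try to make the discriminator output 1 on fakes
g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
g_opt.zero_grad()
g_loss.backward()
g_opt.step()
```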

Once we have produced a sufficient number of synthetic genotypes, we can train a model on them that is quite effective at predicting the likelihood of genetic disease in a rat.

Wrap Up

There are five points to remember from this guide, one per layer of the molecular biology stack. In computer terms, the genome is the hard disk: it stores the information. The epigenome is the disc reader, the transcriptome serves as an interpreter, the metabolome is the process monitor, and the proteome is the collection of all applications.

The more data we integrate, the more accurate our predictions become, bringing us closer to eventually reversing the aging process.