Cardiac Video Data Classification using DeepMind’s Perceiver

Tue Jul 26 2022

This guide describes the most recent version of AlphaCare, which uses DeepMind's perceiver IO neural network to automatically diagnose cardiomyopathy, a form of heart disease.

To explore this, we will use a newly released echocardiogram video dataset from anonymized patients at Stanford. Alongside it, we will use RapidAPI, which provides us with accurate text descriptions for each image.

This guide is the second part of the AlphaCare series; the first part discusses convolutional networks.


This guide explains how DeepMind's perceiver model works, why it has improved our accuracy scores, and what we discovered after examining the dataset. We want AlphaCare to be a general-purpose tool that automates disease diagnosis.

A doctor should be able to drag and drop any kind of dataset into the app: text, photos, audio, video, point clouds, even memes. Once the data is displayed, they can ask questions such as "What is the likelihood of cardiomyopathy?"

The model should be able to diagnose the data instantly, whether that means reading an echocardiogram video or locating the cancerous parts of the brain in a neuroimaging point cloud. To accomplish this, we must use the most general-purpose deep learning model available.


Although this may sound like science fiction, we have already begun to see this kind of potential. Consider OpenAI's newest Codex software. Almost two years ago, OpenAI launched GPT-3, a neural network that learns to predict the next likely word in a sequence of words.

This network, which performs a succession of matrix operations such as multiplication and addition, is known as a transformer because it transforms an input sequence into an output sequence.

It was trained using text data from large eBook libraries, Reddit, and Wikipedia. Codex, a model recently made available by OpenAI, uses GPT-3 as its base model and was trained on millions of lines of publicly accessible GitHub code. So, rather than just predicting the next likely English word, it can also predict the next likely Python statement.

With the beta version of GitHub's Copilot software, you write a plain-English description and the code is generated from it automatically. Although this system still has flaws, it foreshadows future developments.

Language is the most potent tool humans have ever created. It enables us to converse, tell stories, and cooperate. Our society is built on it, and it sets us apart from all other species. Now, let's talk about DeepMind's perceiver architecture.

Transformer for Image classification

Transformer networks are currently state-of-the-art for both language and vision applications. According to a recent explanation by Andrej Karpathy (Director of AI at Tesla), Tesla's self-driving software uses transformers for image classification. This is a mission-critical system on which people's lives depend, so it has to work reliably.

Types of Neural Networks

To better understand why transformers perform so well, we need to examine the two most common forms of neural networks: recurrent networks and convolutional networks.

Recurrent networks process input data sequentially: each word is fed in one after the other. The recurrent part involves feeding the model not just a new data point but also the learned representation from the previous time step. Because of this, recurrent nets can't be trained in parallel, yet they are well suited to modelling language because words are sequential.

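# Pseudocode: RNN and FeedForwardNN stand in for any recurrent cell and output layer.
rnn = RNN()
ff = FeedForwardNN()
hidden_state = [0.0, 0.0, 0.0, 0.0]

for word in input_words:
    # Each step consumes the new word plus the state carried over from the previous step.
    output, hidden_state = rnn(word, hidden_state)

# The output of the last step is passed to a feed-forward layer to make the prediction.
prediction = ff(output)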
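
To make the pseudocode above concrete, here is a minimal sketch of a recurrent step in NumPy. The function names, the tanh update rule, and the randomly initialised weights are assumptions chosen for illustration; this is an untrained toy, not the model used in AlphaCare.

```python
import numpy as np

def rnn_step(x, h, W_xh, W_hh, b_h):
    """One recurrent step: combine the new input with the previous hidden state."""
    return np.tanh(x @ W_xh + h @ W_hh + b_h)

def run_rnn(inputs, hidden_size, seed=0):
    """Process a sequence one element at a time and return the final hidden state."""
    rng = np.random.default_rng(seed)
    input_size = inputs.shape[1]
    # Randomly initialised (untrained) weights, for illustration only.
    W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))
    W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    b_h = np.zeros(hidden_size)

    h = np.zeros(hidden_size)
    for x in inputs:
        # Step t needs the hidden state from step t-1, so the loop cannot run in parallel.
        h = rnn_step(x, h, W_xh, W_hh, b_h)
    return h

# Five "words", each represented by a 3-dimensional vector (toy data).
sequence = np.random.default_rng(1).normal(size=(5, 3))
final_state = run_rnn(sequence, hidden_size=4)
```

The loop is the important part: every iteration depends on the result of the previous one, which is why recurrent nets are hard to parallelise.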

Convolutional networks are excellent for image classification, but when applied to text, they require far too much memory to be trained on words.

This is because the number of relationships between pairs of words becomes too large to store once the text passes a certain length.

Transformers improve on both architectures. Unlike recurrent networks, they avoid recurrence entirely and can process an input as a whole, such as an entire sentence at once. Unlike convolutional networks, they can scale when learning relationships between pairs of inputs. This is made possible by two features: attention and positional embeddings.

Types of Embeddings

Attention is a set of nested matrix operations that learns which parts of the input data are the most relevant to the prediction.

Positional embeddings, meanwhile, encode the positions of the input data points relative to each other.

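#### NumPy version ####
import numpy as np

def positional_encoding(max_position, d_model, min_freq=1e-4):
    # One row per position, one column per embedding dimension.
    position = np.arange(max_position)
    # Each pair of dimensions shares a frequency, decaying from 1 towards min_freq.
    freqs = min_freq ** (2 * (np.arange(d_model) // 2) / d_model)
    pos_enc = position.reshape(-1, 1) * freqs.reshape(1, -1)
    # Even columns get the cosine, odd columns the sine of the same angle.
    pos_enc[:, ::2] = np.cos(pos_enc[:, ::2])
    pos_enc[:, 1::2] = np.sin(pos_enc[:, 1::2])
    return pos_enc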


The perceiver is a particular kind of transformer that can handle a variety of input data formats, including both text and image data, without the compute cost growing quadratically with the size of the input.

It does this by introducing two concepts: shared weights across layers and the cross-attention mechanism.

To handle input data, transformers typically use a self-attention layer. However, when given text input, space complexity skyrockets as the number of words grows. This is because self-attention generates a query, a key, and a value for every word and compares every query with every key.

To produce a score, a query and a key are multiplied. The scores are normalized into probabilities (typically with a softmax), which decide how much attention the network should give to particular words. The output is then calculated by multiplying the values by these probabilities.
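
The query, key, and value steps can be sketched in a few lines of NumPy. This is a minimal single-head version with randomly initialised projection matrices; the names `self_attention`, `W_q`, `W_k`, and `W_v` and the toy sizes are assumptions for illustration, not part of any specific library.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over x of shape (n, d)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    # One score per pair of positions, so the score matrix has shape (n, n).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax turns each row of scores into probabilities that sum to 1.
    weights = softmax(scores, axis=-1)
    # Each output row is a probability-weighted sum of the value vectors.
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 6, 8  # 6 tokens with 8 features each (toy sizes)
x = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
output, weights = self_attention(x, W_q, W_k, W_v)
```

Note that `weights` has one entry for every pair of tokens, which is where the memory cost of self-attention comes from.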

For images, the input sequence would consist of every pixel in the image, so the space complexity grows quickly with image size. The perceiver reduces this space complexity by placing a cross-attention layer between the input sequence and a stack of self-attention blocks, so that the self-attention blocks work on a smaller, compressed representation.
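
A quick back-of-the-envelope calculation shows why this matters. The sketch below counts attention scores for a flattened image, first with plain self-attention and then with cross-attention onto a small array of learned vectors. The image size and the size of that small array (`num_latents`) are illustrative assumptions, not values from the AlphaCare setup.

```python
def score_matrix_entries(num_queries, num_keys):
    """Number of attention scores: one per (query, key) pair."""
    return num_queries * num_keys

num_pixels = 224 * 224  # a 224 x 224 image flattened to one element per pixel
num_latents = 512       # assumed size of the compressed representation

# Self-attention compares every pixel with every other pixel.
self_attn = score_matrix_entries(num_pixels, num_pixels)
# Cross-attention compares each latent vector with every pixel.
cross_attn = score_matrix_entries(num_latents, num_pixels)

print(f"self-attention scores:  {self_attn:,}")
print(f"cross-attention scores: {cross_attn:,}")
print(f"reduction factor:       {self_attn // cross_attn}x")
```

The first count grows with the square of the number of pixels, while the second grows only linearly with it.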

Cross-attention carries out the same matrix multiplication between queries and keys, but it feeds in data computed in the previous step. It is a recurring step that takes in the input data, and because the data has already been compressed into a more manageable representation, the space complexity doesn't increase significantly. In effect, the attention weight matrices are shared across steps.

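from typing import Optional

import torch
from torch import nn

# MultiHeadAttention and FeedForward are assumed to be defined elsewhere in the
# perceiver implementation this excerpt comes from; they are not shown here.
class CrossAttention(nn.Module):
    """Cross-attention module."""
    def __init__(
        self,
        kv_dim: int,
        q_dim: int,
        widening_factor: int = 1,
        num_heads: int = 1,
        head_dim: Optional[int] = None,
        use_query_residual: bool = True,
        dropout: float = 0.0,
        attention_dropout: float = 0.0
    ):
        super().__init__()
        self.use_query_residual = use_query_residual
        # Keys/values and queries are normalised separately.
        self.kv_layer_norm = nn.LayerNorm(kv_dim)
        self.q_layer_norm = nn.LayerNorm(q_dim)
        self.attention = MultiHeadAttention(
            ...  # constructor arguments are cut off in the original excerpt
        )
        self.dropout = nn.Dropout(dropout)
        self.mlp = FeedForward(q_dim, widening_factor, dropout)

    def forward(
        self,
        inputs_kv: torch.Tensor,
        inputs_q: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None
    ):
        attention = self.attention(
            ...  # call arguments and the rest of forward are cut off in the original excerpt
        )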
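
Because the excerpt above is incomplete, the following self-contained NumPy sketch may help with the shape bookkeeping. It is a simplified single-head cross-attention with a residual connection on the query side, applied repeatedly with the same weights. The names (`cross_attention`, `latents`) and all sizes are assumptions for illustration and are not taken from the excerpt.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, inputs, W_q, W_k, W_v):
    """Queries come from the latents; keys and values come from the inputs."""
    q = latents @ W_q                        # (m, d)
    k = inputs @ W_k                         # (n, d)
    v = inputs @ W_v                         # (n, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (m, n) instead of (n, n)
    weights = softmax(scores, axis=-1)
    # Residual connection: the attended values are added to the queries.
    return latents + weights @ v, weights

rng = np.random.default_rng(0)
n, c, m, d = 1000, 3, 16, 32  # 1000 input elements, 3 channels, 16 latents of width 32
inputs = rng.normal(size=(n, c))
latents = rng.normal(size=(m, d))
W_q = rng.normal(scale=0.1, size=(d, d))
W_k = rng.normal(scale=0.1, size=(c, d))
W_v = rng.normal(scale=0.1, size=(c, d))

# The same weight matrices are reused on every pass over the input.
for _ in range(3):
    latents, weights = cross_attention(latents, inputs, W_q, W_k, W_v)
```

The score matrix here is 16 x 1000 rather than 1000 x 1000, and the latents keep the same shape after every pass, so the step can be repeated without the memory cost growing.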

To ensure that the structure of the output space is captured, the query includes positional information about the sequence. The authors accomplish this with Fourier features, which encode position using sines and cosines at a range of frequencies.
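
As a rough illustration of frequency-based position encoding, the sketch below maps normalised positions to sine and cosine features. The function name, the linear spacing of the frequency bands, and the parameter values are assumptions for this example; the exact scheme in a given perceiver implementation may differ.

```python
import numpy as np

def fourier_features(positions, num_bands, max_freq):
    """Encode positions in [-1, 1] as sines and cosines at several frequencies."""
    positions = np.asarray(positions, dtype=float).reshape(-1, 1)  # (n, 1)
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)            # (num_bands,)
    angles = np.pi * positions * freqs                             # (n, num_bands)
    # Keep the raw position alongside the sine and cosine features.
    return np.concatenate([positions, np.sin(angles), np.cos(angles)], axis=-1)

# Positions of 8 video frames, normalised to the range [-1, 1].
frame_positions = np.linspace(-1.0, 1.0, 8)
features = fourier_features(frame_positions, num_bands=4, max_freq=8.0)
```

Each position ends up with `2 * num_bands + 1` features, which can be concatenated with the input data before it is fed to the attention layers.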

Fourier Transform

In recent literature, the Fourier transform has frequently improved the accuracy of predictive models, whether used as embeddings or within the model itself. This hints at an intriguing phenomenon in biological neural networks as well as in the nature of intelligence itself.

Let's take a look at our dataset now. The EchoNet dataset is the most recent dataset released by Stanford and includes cardiac ultrasound videos of about 10,000 patient hearts. It is labeled with human expert annotations, which enables supervised learning. The measurements include ejection fraction, left ventricular volume, and expert tracings of the left ventricle.

The dataset's authors built a 3D convolutional network to classify the videos. We are interested in predicting the ejection fraction, which is derived from the end-systolic and end-diastolic volumes. Both the dataset and their 3D ConvNet, built with PyTorch, can be downloaded freely.
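
For reference, ejection fraction is the share of blood in the left ventricle that is pumped out in one beat: (end-diastolic volume - end-systolic volume) / end-diastolic volume, expressed as a percentage. A small helper makes this explicit; the function name and the example volumes are illustrative and are not taken from the dataset.

```python
def ejection_fraction(end_diastolic_volume, end_systolic_volume):
    """Percentage of blood pumped out of the left ventricle in one beat."""
    if end_diastolic_volume <= 0:
        raise ValueError("end-diastolic volume must be positive")
    stroke_volume = end_diastolic_volume - end_systolic_volume
    return 100.0 * stroke_volume / end_diastolic_volume

# Illustrative volumes in millilitres, not taken from the dataset.
ef = ejection_fraction(end_diastolic_volume=120.0, end_systolic_volume=50.0)
```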

After installing our Python dependencies with pip, we can get the code for the PyTorch implementation of the perceiver and swap out their model for the perceiver. After that, we can check our accuracy on the test videos. It appears the perceiver outperformed EchoNet.
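
Since ejection fraction is a continuous value, one reasonable way to compare the two models on the test videos is with regression metrics. The sketch below computes mean absolute error and R squared; the choice of metrics, the function name, and the numbers are placeholders assumed for illustration, not results from our experiment.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Mean absolute error and R^2 between labelled and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = float(np.mean(np.abs(y_true - y_pred)))
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {"mae": mae, "r2": 1.0 - ss_res / ss_tot}

# Placeholder values only: expert-labelled EF versus model predictions, in percent.
labels = [55.0, 60.0, 35.0, 70.0]
predictions = [53.0, 62.0, 40.0, 66.0]
report = regression_report(labels, predictions)
```

Running the same function on the predictions of both models gives a like-for-like comparison on the held-out test split.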


Using a single general-purpose model, AlphaCare can make diagnostic predictions from two very distinct types of data: video and time series. Let's add more patient metadata using RapidAPI, the biggest API hub in the world.

We can log in, look for the medical question answering API on the dashboard, and then ask about any number of heart disease subtypes, such as arrhythmia or cardiomyopathy. It returns an explanation of the symptoms in plain English together with the key diagnostic information.

We can add that to our script using the Python snippet RapidAPI provides. When our model detects and classifies a specific type of disease, we can also display related information about it, making the result easier to interpret.
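
The snippet generated by the RapidAPI dashboard will differ per API, but the general shape is a request carrying the `X-RapidAPI-Key` and `X-RapidAPI-Host` headers. The sketch below only builds such a request with the standard library and does not send it. The host, path, and query parameter are hypothetical placeholders; copy the real values from the dashboard.

```python
import urllib.parse
import urllib.request

def build_rapidapi_request(host, path, api_key, params):
    """Build (but do not send) a GET request with the RapidAPI auth headers."""
    url = f"https://{host}{path}?{urllib.parse.urlencode(params)}"
    headers = {"X-RapidAPI-Key": api_key, "X-RapidAPI-Host": host}
    return urllib.request.Request(url, headers=headers, method="GET")

request = build_rapidapi_request(
    host="example-medical-qa.p.rapidapi.com",  # hypothetical host
    path="/answer",                            # hypothetical endpoint
    api_key="YOUR_RAPIDAPI_KEY",               # replace with your own key
    params={"question": "What are the symptoms of cardiomyopathy?"},
)
# To send it, pass the request to urllib.request.urlopen(request).
```

The returned text can then be shown next to the model's prediction for the detected disease.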

Wrap up

Key points to remember from this guide:

  • Transformer networks are state-of-the-art models for language and vision tasks, but they must be designed separately for each data type.
  • The perceiver is a more general type of transformer that does not require much handcrafting.
  • The perceiver accomplishes this by using a cross-attention mechanism and Fourier-based positional encodings.