
Perceiver for Cardiac Video Data Classification (AlphaCare: Episode 2)


DeepMind recently released a new type of Transformer called the Perceiver IO, which achieved state-of-the-art accuracy across multiple data types (text, images, point clouds, and more). In this episode of the AlphaCare series, I'll explain how the Perceiver works and how we used it to improve accuracy scores on cardiac video data. The EchoNet dataset was recently made public by Stanford University, and it contains 10K anonymized heart videos from patients. We'll also discuss why Transformer networks work so well, and how, by using 2 key features (cross attention and positional embeddings), the Perceiver improved on earlier Transformer variants. Get hype!

About the instructor

Siraj Raval

I'm a technologist on a mission to spread data literacy. Artificial Intelligence, Mathematics, Science, Technology, I simplify these topics to help you understand how they work. Using this knowledge you can build wealth and live a happier, more meaningful life. I live to serve this community. We are the fastest growing AI community in the world!


Related guide

This guide describes the most recent version of AlphaCare, which employs DeepMind's Perceiver IO neural network to automatically diagnose cardiomyopathy, a disease of the heart muscle.

To explore this, we will use newly released echocardiogram video data from anonymized patients at Stanford. Alongside it, we will use RapidAPI, which provides us with text descriptions of the diagnosed condition.

This guide is the second part of the AlphaCare series; the first part covers convolutional networks. You can find more details at the link below:

Convolutional Networks: A Predictor of Heart Disease

Introduction

This guide explains how DeepMind's Perceiver model works, why it has increased our accuracy scores, and what we discovered after examining the dataset. We want AlphaCare to be a general-purpose tool that automates disease diagnosis.

A doctor should be able to drag and drop any kind of dataset into the app, whether text, photos, audio, video, point clouds, or even memes, and then ask questions such as "What is the likelihood of cardiomyopathy?"

Cardiomyopathy Video

The model should be able to produce a diagnosis instantly, whether by looking at an echocardiogram video or by locating the cancerous regions of the brain in a neuroimaging point cloud. To accomplish this, we must employ the most general-purpose deep learning model available.

Transformer

Although this may sound like science fiction, we have already begun to see this kind of potential. Consider OpenAI's newest Codex software. GPT-3, a neural network that learns to predict the next likely word in a sequence of words, was launched by OpenAI almost two years ago.

This network, which performs a succession of matrix operations such as multiplication and addition, is known as a transformer because it transforms an input sequence into an output sequence.


It was trained using text data from large eBook libraries, Reddit, and Wikipedia. OpenAI's recently released Codex model was trained on millions of lines of publicly accessible GitHub code, using GPT-3 as its base model. So, rather than just predicting the next likely English word, it can also predict the next likely Python statement.

Dataset

With the beta version of GitHub's Copilot software, you write a plain-English text description and the code is generated automatically from it. Although the system still has flaws, it foreshadows future developments.

GitHub Copilot

Language is the most potent tool ever created by humans: it enables us to converse, tell stories, and cooperate. Our society is built on it, and it sets us apart from all other species. Now, let's talk about DeepMind's Perceiver architecture.

Transformer for Image classification

Transformer networks are currently state-of-the-art for both language and vision applications. According to a recent explanation by Andrej Karpathy (Director of AI at Tesla), Tesla's self-driving software uses transformers for image classification. This is a mission-critical system on which people's lives depend, so it must work reliably.

Image Classification

Types of Neural Network

To better understand why transformers perform so well, we need to examine the two most common forms of neural networks: recurrent networks and convolutional networks.

Neural Network Types

Recurrent networks process input data sequentially: each word is fed in one after the other. The recurrent part is that the model receives not just a new data point but also the learned representation from the previous time step. Because of this dependency, recurrent nets can't be trained in parallel across time steps, yet they are a natural fit for modelling language because words are sequential.

Recurrent Networks

py
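import numpy as np

rng = np.random.default_rng(0)
w_in, w_hid, w_out = (rng.normal(size=(4, 4)) for _ in range(3))  # toy weights

def rnn(word, hidden_state):  # minimal stand-in for a recurrent cell
    hidden_state = np.tanh(word @ w_in + hidden_state @ w_hid)
    return hidden_state, hidden_state  # (output, new hidden state)

hidden_state = np.zeros(4)
for word in rng.normal(size=(6, 4)):  # six stand-in word vectors
    output, hidden_state = rnn(word, hidden_state)  # depends on the previous step
prediction = output @ w_out  # feed-forward head reduced to one matrix multiply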

Convolutional networks are excellent for image classification, but when applied to text, they require far too much memory to be trained on long sequences of words.

This is because the number of relationships between pairs of words becomes too large to store once the text reaches a certain length.

Convolution networks

Transformers improve on both architectures. They avoid recurrence entirely, and unlike convolutional networks, they can process an input as a whole, such as an entire sentence at once, and scale to learning relationships between pairs of inputs. This is made possible by two features: attention and positional embeddings.

Relationship between Words

Types of Embeddings

Attention embeddings are a set of nested matrix operations that learn which parts of the input data are most relevant to the prediction.

Attention Embeddings

Positional embeddings, in turn, encode the positions of input data points relative to each other.

py
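#### Numpy version ####
import numpy as np

def positional_encoding(max_position, d_model, min_freq=1e-4):
    # One row per position, one column per embedding dimension.
    position = np.arange(max_position)
    # Each pair of dimensions shares a frequency, decaying from 1 toward min_freq.
    freqs = min_freq ** (2 * (np.arange(d_model) // 2) / d_model)
    pos_enc = position.reshape(-1, 1) * freqs.reshape(1, -1)
    pos_enc[:, ::2] = np.cos(pos_enc[:, ::2])    # even dimensions: cosine
    pos_enc[:, 1::2] = np.sin(pos_enc[:, 1::2])  # odd dimensions: sine
    return pos_enc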

Perceiver

The Perceiver is a particular kind of transformer that can handle a variety of input data formats, including both text and image data, without compute and memory costs that grow quadratically with the input size.


It does this by introducing two concepts: weights shared across layers and a cross-attention mechanism.

Cross Attention & Shared Weights

To handle input data, transformers typically use a layer known as self-attention. However, its space complexity grows quadratically with the length of the input, so it skyrockets as a text gets longer. This is because self-attention generates a query, a key, and a value for every input element and compares every query with every key.

Self-Attention Layer
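To make that growth concrete, here is a rough back-of-the-envelope sketch. Self-attention stores one score for every pair of input elements, so the score matrix grows with the square of the sequence length. The function name and the 4-byte (float32) score size are assumptions chosen for illustration, not part of any Perceiver implementation.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_score: int = 4) -> int:
    """Memory for one seq_len x seq_len attention score matrix (float32 assumed)."""
    return seq_len * seq_len * bytes_per_score


# Illustrative sequence lengths: each 4x increase in length costs 16x the memory.
for seq_len in (512, 2048, 8192):
    megabytes = attention_matrix_bytes(seq_len) / 1e6
    print(f"{seq_len:>5} inputs -> {megabytes:.1f} MB per attention matrix")
```

This counts a single attention matrix only; a real model keeps one per head and per layer, so the actual footprint is larger.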

To produce a score, a query and a key are multiplied. The score decides how much attention the network should give to a particular word. The scores are normalized into probabilities with a softmax, and the output is then calculated by multiplying the values by those probabilities.

Score Calculation
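The score calculation can be sketched in a few lines of NumPy. This is a minimal single-head version under stated assumptions: the weight matrices are random stand-ins rather than learned parameters, and the division by the square root of the key dimension follows the standard scaled dot-product formulation, which the text above does not mention explicitly.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (n, n): every query against every key
    weights = softmax(scores, axis=-1)       # each row becomes a probability distribution
    return weights @ v, weights              # output is a weighted sum of the values


rng = np.random.default_rng(0)
n_words, d_model = 5, 8                      # toy sizes
x = rng.normal(size=(n_words, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)
```

Note that `weights` has one row and one column per word, which is where the quadratic cost comes from.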

For images, the input sequence would consist of every pixel in the image, so with larger images the space complexity would increase as well. The Perceiver lowers this complexity by placing a cross-attention layer between the input sequence and a stack of self-attention blocks: the cross-attention layer maps the large input onto a much smaller latent array, and the self-attention blocks operate only on that latent array.

Cross Attention
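A minimal NumPy sketch of the idea, assuming a small latent array whose queries attend over a much larger input: the score matrix has shape (latents, inputs) instead of (inputs, inputs). All sizes and weights here are hypothetical, and the residual connections, layer norms, MLP, and latent self-attention blocks of the real architecture are left out for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(latents, inputs, w_q, w_k, w_v):
    """Queries come from the latents; keys and values come from the inputs."""
    q = latents @ w_q                        # (num_latents, d)
    k = inputs @ w_k                         # (num_inputs, d)
    v = inputs @ w_v                         # (num_inputs, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (num_latents, num_inputs)
    return softmax(scores) @ v, scores.shape


rng = np.random.default_rng(0)
num_inputs, num_latents, d = 32 * 32, 16, 8  # e.g. a 32x32 image flattened to pixels
inputs = rng.normal(size=(num_inputs, d))
latents = rng.normal(size=(num_latents, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

# The same weight matrices are reused on every pass (weight sharing).
for _ in range(3):
    latents, score_shape = cross_attention(latents, inputs, w_q, w_k, w_v)

print("cross-attention scores:", score_shape[0] * score_shape[1])
print("full self-attention scores:", num_inputs * num_inputs)
```

With these toy sizes, cross attention needs 16 x 1024 scores per pass, while self-attention over the raw pixels would need 1024 x 1024.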

Cross attention carries out the same matrix multiplication between queries and keys, but the queries come from the latent representation computed in the previous step, while the keys and values come from the input data. The step is repeated, and because the data has already been compressed into a more manageable representation, the space complexity doesn't increase significantly. The attention weight matrices are shared across these repeated steps.

py
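from typing import Optional

import torch
from torch import nn

# MultiHeadAttention and FeedForward are assumed to be defined elsewhere
# in the same Perceiver implementation; they are not shown in this guide.


class CrossAttention(nn.Module):
    """Cross-attention module."""

    def __init__(
        self,
        *,
        kv_dim: int,
        q_dim: int,
        widening_factor: int = 1,
        num_heads: int = 1,
        head_dim: Optional[int] = None,
        use_query_residual: bool = True,
        dropout: float = 0.0,
        attention_dropout: float = 0.0
    ):
        super().__init__()
        self.use_query_residual = use_query_residual
        self.kv_layer_norm = nn.LayerNorm(kv_dim)
        self.q_layer_norm = nn.LayerNorm(q_dim)
        self.attention = MultiHeadAttention(
            kv_dim=kv_dim,
            q_dim=q_dim,
            head_dim=head_dim,
            num_heads=num_heads,
            dropout=attention_dropout
        )
        self.dropout = nn.Dropout(dropout)
        self.mlp = FeedForward(q_dim, widening_factor, dropout)

    def forward(
        self,
        inputs_kv: torch.Tensor,
        inputs_q: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None
    ):
        # Normalize keys/values and queries with the layer norms from __init__.
        attention = self.attention(
            inputs_kv=self.kv_layer_norm(inputs_kv),
            inputs_q=self.q_layer_norm(inputs_q),
            attention_mask=attention_mask
        )
        # Assumed completion: apply dropout, the optional query residual,
        # and the MLP created in __init__.
        attention = self.dropout(attention)
        x = inputs_q + attention if self.use_query_residual else attention
        return x + self.mlp(x)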

To ensure that the structure of the output space is captured, the query includes positional information for each element of the sequence. The Perceiver authors used Fourier features to accomplish this, which encode position using sine and cosine functions at different frequencies.

Fourier Transform
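As a rough sketch of how such an encoding can be built, the function below maps positions in [-1, 1] to sine and cosine features over a bank of evenly spaced frequencies, in the spirit of the Perceiver's Fourier feature encoding. The function name and the parameter values (`num_bands=6`, `max_freq=10.0`, 112 positions) are hypothetical choices for illustration.

```python
import numpy as np


def fourier_position_features(num_positions: int, num_bands: int, max_freq: float) -> np.ndarray:
    """Encode positions in [-1, 1] as [position, sin bands, cos bands]."""
    positions = np.linspace(-1.0, 1.0, num_positions)     # (n,)
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)   # (bands,), evenly spaced
    angles = np.pi * positions[:, None] * freqs[None, :]  # (n, bands)
    return np.concatenate(
        [positions[:, None], np.sin(angles), np.cos(angles)], axis=-1
    )


features = fourier_position_features(num_positions=112, num_bands=6, max_freq=10.0)
print(features.shape)  # one raw position plus 6 sine and 6 cosine features per row
```

This covers a single axis. For video, each of the time, height, and width axes would be encoded the same way and the results concatenated.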

Fourier features have frequently improved the accuracy of predictive models in recent literature, whether used as embeddings or within the model itself. This hints at an intriguing connection to biological neural networks and to the nature of intelligence itself.

Let's take a look at our dataset now. The EchoNet dataset is the most recent dataset released by Stanford and includes cardiac ultrasound videos of 10,000 patients' hearts. It was labeled with annotations from human experts, which enables supervised learning. The measurements include ejection fraction, left ventricular volume, and expert tracings of the left ventricle.

EchoNet Dataset

Stanford's accompanying 3D convolutional network was created to classify videos. We are interested in predicting the ejection fraction, which is derived from the end-systolic and end-diastolic volumes. We can freely download the dataset as well as their 3D ConvNet, which is built with PyTorch.
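For reference, ejection fraction is conventionally computed from the left ventricle's end-diastolic volume (EDV) and end-systolic volume (ESV) as (EDV - ESV) / EDV, expressed as a percentage. A small sketch follows; the function name and the example volumes are made up for illustration and are not taken from the dataset.

```python
def ejection_fraction(end_diastolic_volume_ml: float, end_systolic_volume_ml: float) -> float:
    """Percentage of blood ejected from the left ventricle per beat."""
    if end_diastolic_volume_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    stroke_volume = end_diastolic_volume_ml - end_systolic_volume_ml
    return 100.0 * stroke_volume / end_diastolic_volume_ml


# Illustrative volumes in millilitres, not real patient data.
print(f"{ejection_fraction(120.0, 50.0):.1f}%")
```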

After installing our Python dependencies via pip, we can get the code for a PyTorch implementation of the Perceiver and swap out their model for it. After that, we can check our accuracy on the test videos. It appears that the Perceiver outperformed the EchoNet baseline.
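The guide does not say which metric was used for the comparison. Since ejection fraction is a continuous value, one common way to compare two models on held-out videos is mean absolute error (MAE) and root-mean-square error (RMSE). The sketch below assumes that setup; the function name and the numbers are hypothetical.

```python
import math


def regression_report(predicted, actual):
    """Mean absolute error and root-mean-square error for paired predictions."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}


# Made-up ejection fraction values (percent), purely for illustration.
report = regression_report(predicted=[55.0, 62.0, 40.0], actual=[57.0, 60.0, 43.0])
print(report)
```

Running the same report for both models on the same test videos gives a like-for-like comparison; lower values are better for both metrics.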

RapidAPI

Using a single general-purpose model, AlphaCare can now make diagnostic predictions from two very distinct types of data: video and time series data. Let's add more patient metadata using RapidAPI, the biggest API hub in the world.

We can log in, look for the medical question answering API on the dashboard, and then ask about any number of heart disease subtypes, such as arrhythmia or cardiomyopathy. It returns an explanation of the symptoms in plain English together with the crucial diagnostic data.

Medical Question Answering API

We can add that to our script using the Python snippet it provides. When our model detects and classifies a specific type of disease, we can also display related information about it, making the result easier to interpret.
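The exact snippet depends on the API you pick, so the following is only a sketch of the general shape of such a call using the standard library. RapidAPI requests are authenticated with the `X-RapidAPI-Key` and `X-RapidAPI-Host` headers; the host name, the `/ask` path, and the `question` parameter below are hypothetical placeholders and should be replaced with the values shown on the API's RapidAPI page.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values: copy the real host and path from the API's RapidAPI page.
API_HOST = "medical-question-answering.example.p.rapidapi.com"
API_PATH = "/ask"


def build_request(question: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying the RapidAPI auth headers."""
    query = urllib.parse.urlencode({"question": question})
    url = f"https://{API_HOST}{API_PATH}?{query}"
    headers = {"X-RapidAPI-Key": api_key, "X-RapidAPI-Host": API_HOST}
    return urllib.request.Request(url, headers=headers, method="GET")


def ask(question: str, api_key: str) -> dict:
    """Send the request and decode the JSON response (response format assumed)."""
    with urllib.request.urlopen(build_request(question, api_key), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("What is cardiomyopathy?", api_key="YOUR_RAPIDAPI_KEY")
print(request.full_url)
```

Once the model has classified a video, the predicted disease name can be passed to `ask` and the returned text displayed next to the prediction.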

Wrap up

Key things to remember from this guide:

Transformer networks are state-of-the-art models for language and vision tasks, but they must be designed separately for each data type.
The Perceiver is a more general type of transformer that does not require much handcrafting.
The Perceiver accomplishes this by using a new type of cross-attention mechanism and positional encodings based on Fourier features.

Guide author

Mashhood A.
