flowchart LR
    start["🎯 Start Here"] --> part1["Part 1<br/>Single Neuron<br/>────<br/>Activation<br/>Functions"]
    part1 --> part2["Part 2<br/>Layers<br/>────<br/>Vectorization<br/>& Depth"]
    part2 --> part3["Part 3<br/>Attention<br/>────<br/>Context-Aware<br/>Processing"]
    part3 --> part4["Part 4<br/>Transformers<br/>────<br/>Complete<br/>Architecture"]
    part4 --> end_goal["🎓 Goal Achieved<br/>────<br/>Understand GPT,<br/>BERT, GFMs"]

    style start fill:#e1f5ff
    style part1 fill:#e8f0ff
    style part2 fill:#f0e1ff
    style part3 fill:#ffe8e1
    style part4 fill:#ffe1e1
    style end_goal fill:#e1ffe1
Introduction
This guide builds your understanding of neural networks from the ground up, starting with how individual neurons process data, scaling to vectorized layer operations, and culminating in transformer architectures with attention mechanisms.
What you’ll learn:
- How neurons transform data with activation functions
- How vectorization makes processing efficient
- Why attention is revolutionary for sequence modeling
- How transformers combine these ideas into powerful models
For excellent visual explanations of neural networks, check out 3Blue1Brown’s Neural Networks series:
- But what is a neural network? - Foundation concepts
- Gradient descent, how neural networks learn - Training process
Part 1: The Individual Neuron
What Does a Neuron Do?
At its core, a neuron is a simple computational unit that:
- Takes multiple inputs
- Combines them with learned weights
- Adds a bias term
- Passes the result through an activation function
flowchart LR
    subgraph inputs["Inputs"]
        x1["x₁ = 1.5"]
        x2["x₂ = 2.0"]
        x3["x₃ = -0.5"]
    end
    subgraph weights["× Weights"]
        w1["w₁ = 0.5"]
        w2["w₂ = -1.0"]
        w3["w₃ = 0.3"]
    end
    subgraph computation["Weighted Sum"]
        sum["Σ(xᵢ × wᵢ) + b"]
        bias["+ bias (0.5)"]
    end
    subgraph activation["Activation"]
        z["z = -0.9"]
        act["f(z)"]
        output["Output"]
    end

    x1 --> w1
    x2 --> w2
    x3 --> w3
    w1 --> sum
    w2 --> sum
    w3 --> sum
    bias --> sum
    sum --> z
    z --> act
    act --> output

    style inputs fill:#e1f5ff
    style weights fill:#fff3e1
    style computation fill:#f0e1ff
    style activation fill:#e1ffe1
Key insight: A neuron is just a weighted sum followed by a non-linear function. That’s it!
Let’s see this in action with code:
import numpy as np
import matplotlib.pyplot as plt
# A single neuron processes inputs
def single_neuron(inputs, weights, bias, activation_fn):
    """
    Simulate a single neuron's computation.

    Args:
        inputs: array of input values
        weights: array of weights (same length as inputs)
        bias: single bias value
        activation_fn: function to apply to weighted sum

    Returns:
        The neuron's output after activation
    """
    # Step 1: Weighted sum
    z = np.dot(inputs, weights) + bias

    # Step 2: Apply activation function
    output = activation_fn(z)

    return output, z

# Example inputs and parameters
inputs = np.array([1.5, 2.0, -0.5])
weights = np.array([0.5, -1.0, 0.3])
bias = 0.5

# Without activation (just linear combination)
output_linear, z = single_neuron(inputs, weights, bias, lambda x: x)

print(f"Input values: {inputs}")
print(f"Weights: {weights}")
print(f"Bias: {bias}")
print(f"\nWeighted sum (z): {z:.3f}")
print(f"Linear output: {output_linear:.3f}")
Input values: [ 1.5 2. -0.5]
Weights: [ 0.5 -1. 0.3]
Bias: 0.5
Weighted sum (z): -0.900
Linear output: -0.900
What to notice: The neuron computes a weighted sum of its inputs plus a bias. Without an activation function, this is just a linear transformation—not very powerful for learning complex patterns.
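In fact, without activations, depth buys nothing: composing two linear layers is still a single linear layer. Here is a small sketch verifying that (the weight matrices are arbitrary made-up values, not taken from the example above):

import numpy as np

# Two linear "layers" with no activation: y = W2 @ (W1 @ x + b1) + b2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = np.array([1.5, 2.0, -0.5])
two_linear_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping collapses into ONE linear layer: W = W2 @ W1, b = W2 @ b1 + b2
one_linear_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(two_linear_layers, one_linear_layer))  # True

This is why the activation functions in the next section matter so much.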
Activation Functions: Adding Non-Linearity
Activation functions introduce non-linearity, which is crucial for neural networks to learn complex patterns. Let’s explore the most common ones:
# Define common activation functions
def relu(x):
    """ReLU: Rectified Linear Unit"""
    return np.maximum(0, x)

def sigmoid(x):
    """Sigmoid: Squashes values to (0, 1)"""
    return 1 / (1 + np.exp(-x))

def tanh(x):
    """Tanh: Squashes values to (-1, 1)"""
    return np.tanh(x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: Like ReLU but allows small negative values"""
    return np.where(x > 0, x, alpha * x)

# Visualize these functions
x = np.linspace(-5, 5, 100)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# ReLU
axes[0, 0].plot(x, relu(x), 'b-', linewidth=2)
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_title('ReLU: max(0, x)', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Input (x)')
axes[0, 0].set_ylabel('Output')
axes[0, 0].axhline(y=0, color='k', linewidth=0.5)
axes[0, 0].axvline(x=0, color='k', linewidth=0.5)

# Sigmoid
axes[0, 1].plot(x, sigmoid(x), 'g-', linewidth=2)
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_title('Sigmoid: 1/(1+e^(-x))', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Input (x)')
axes[0, 1].set_ylabel('Output')
axes[0, 1].axhline(y=0.5, color='r', linewidth=0.5, linestyle='--', alpha=0.5)

# Tanh
axes[1, 0].plot(x, tanh(x), 'r-', linewidth=2)
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_title('Tanh: (e^x - e^(-x))/(e^x + e^(-x))', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Input (x)')
axes[1, 0].set_ylabel('Output')
axes[1, 0].axhline(y=0, color='k', linewidth=0.5)

# Leaky ReLU
axes[1, 1].plot(x, leaky_relu(x), 'm-', linewidth=2)
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_title('Leaky ReLU: max(0.01x, x)', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Input (x)')
axes[1, 1].set_ylabel('Output')
axes[1, 1].axhline(y=0, color='k', linewidth=0.5)
axes[1, 1].axvline(x=0, color='k', linewidth=0.5)

plt.tight_layout()
plt.show()
What to notice:
- ReLU is the most popular: simple, fast, and effective. It “turns off” negative values completely.
- Sigmoid squashes values between 0 and 1, useful for probabilities.
- Tanh centers outputs around 0, often better than sigmoid for hidden layers.
- Leaky ReLU prevents “dead neurons” by allowing small negative values.
Seeing the Effect of Activation Functions
Let’s see how different activation functions transform our neuron’s output:
# Use the same inputs from before
inputs = np.array([1.5, 2.0, -0.5])
weights = np.array([0.5, -1.0, 0.3])
bias = 0.5

# Compute weighted sum
z = np.dot(inputs, weights) + bias
print(f"Weighted sum (z): {z:.3f}\n")

# Apply different activations
activations = {
    'Linear (no activation)': lambda x: x,
    'ReLU': relu,
    'Sigmoid': sigmoid,
    'Tanh': tanh,
    'Leaky ReLU': leaky_relu
}

for name, fn in activations.items():
    output = fn(z)
    print(f"{name:25s}: {output:.3f}")
Weighted sum (z): -0.900
Linear (no activation) : -0.900
ReLU : 0.000
Sigmoid : 0.289
Tanh : -0.716
Leaky ReLU : -0.009
Why this matters: The activation function dramatically changes the neuron’s output. This non-linear transformation is what allows neural networks to learn complex, non-linear patterns in data.
Part 2: From Single Neurons to Layers
The Power of Vectorization
Rather than computing neurons one at a time, we can process an entire layer simultaneously using matrix operations. This is both computationally efficient and conceptually elegant.
flowchart TB
    subgraph single["Single Neuron (Sequential)"]
        direction LR
        i1["Input<br/>Vector<br/>(3 dims)"] --> n1["Neuron 1"]
        i2["Input<br/>Vector<br/>(3 dims)"] --> n2["Neuron 2"]
        i3["Input<br/>Vector<br/>(3 dims)"] --> n3["Neuron 3"]
        n1 --> o1["Output 1"]
        n2 --> o2["Output 2"]
        n3 --> o3["Output 3"]
    end
    subgraph vectorized["Vectorized Layer (Parallel)"]
        direction LR
        iv["Input<br/>Vector<br/>(3 dims)"] --> matrix["Weight Matrix<br/>(3 × 5)<br/>+ Bias<br/>+ Activation"]
        matrix --> ov["Output<br/>Vector<br/>(5 dims)"]
    end

    single -.->|"Matrix Operation"| vectorized

    style single fill:#ffe1e1
    style vectorized fill:#e1ffe1
    style matrix fill:#fff3e1
Key insight: Instead of looping through neurons, one matrix multiplication processes all neurons simultaneously!
For understanding how layers work together, see 3Blue1Brown’s Neural network intuitions (timestamp 7:00 onwards).
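To convince yourself the matrix form really is the same computation, here is a quick check with arbitrary random numbers (separate from the layer class below): looping over neurons one at a time and a single matrix multiplication produce identical outputs.

import numpy as np

rng = np.random.default_rng(42)
inputs = rng.normal(size=3)        # one sample with 3 features
weights = rng.normal(size=(3, 5))  # column j holds neuron j's weights
biases = rng.normal(size=5)

# Loop version: one neuron at a time
loop_outputs = np.array([np.dot(inputs, weights[:, j]) + biases[j] for j in range(5)])

# Vectorized version: one matrix multiplication for all 5 neurons
vectorized_outputs = inputs @ weights + biases

print(np.allclose(loop_outputs, vectorized_outputs))  # True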
# A layer is just multiple neurons working in parallel
class NeuralLayer:
    """A single layer of neurons with vectorized operations."""

    def __init__(self, n_inputs, n_neurons):
        """
        Initialize a layer.

        Args:
            n_inputs: number of input features
            n_neurons: number of neurons in this layer
        """
        # Each neuron has n_inputs weights
        # Shape: (n_inputs, n_neurons)
        self.weights = np.random.randn(n_inputs, n_neurons) * 0.1

        # Each neuron has one bias
        # Shape: (n_neurons,)
        self.biases = np.zeros(n_neurons)

    def forward(self, inputs, activation_fn=relu):
        """
        Forward pass through the layer.

        Args:
            inputs: input array of shape (n_samples, n_inputs)
            activation_fn: activation function to apply

        Returns:
            outputs after activation, shape (n_samples, n_neurons)
        """
        # Matrix multiplication: (n_samples, n_inputs) @ (n_inputs, n_neurons)
        # Result: (n_samples, n_neurons)
        z = np.dot(inputs, self.weights) + self.biases

        # Apply activation function element-wise
        return activation_fn(z)

# Create a layer with 3 inputs and 5 neurons
layer = NeuralLayer(n_inputs=3, n_neurons=5)

# Process a batch of 4 samples
batch_inputs = np.random.randn(4, 3)
outputs = layer.forward(batch_inputs, activation_fn=relu)

print(f"Input shape: {batch_inputs.shape}")
print(f"Weight matrix shape: {layer.weights.shape}")
print(f"Output shape: {outputs.shape}\n")

print("Sample input:")
print(batch_inputs[0])
print("\nCorresponding output:")
print(outputs[0])
Input shape: (4, 3)
Weight matrix shape: (3, 5)
Output shape: (4, 5)
Sample input:
[-0.09917586 1.846637 -1.07008477]
Corresponding output:
[0. 0.14226798 0.04979847 0. 0.07559149]
What to notice:
- The weight matrix has shape (n_inputs, n_neurons) - each column represents one neuron’s weights.
- One matrix multiplication processes all neurons and all samples simultaneously.
- Output shape is (n_samples, n_neurons) - each sample gets transformed into a vector of neuron activations.
Building a Multi-Layer Network
Now let’s stack multiple layers to create a deep neural network:
flowchart LR
    subgraph input["Input Layer"]
        i1["x₁"]
        i2["x₂"]
        i3["x₃"]
    end
    subgraph hidden1["Hidden Layer 1<br/>(8 neurons)"]
        h11["🔵"]
        h12["🔵"]
        h13["🔵"]
        h14["🔵"]
        h15["🔵"]
        h16["🔵"]
        h17["🔵"]
        h18["🔵"]
    end
    subgraph hidden2["Hidden Layer 2<br/>(5 neurons)"]
        h21["🟢"]
        h22["🟢"]
        h23["🟢"]
        h24["🟢"]
        h25["🟢"]
    end
    subgraph output["Output Layer<br/>(2 neurons)"]
        o1["🔴"]
        o2["🔴"]
    end

    i1 --> h11 & h12 & h13 & h14 & h15 & h16 & h17 & h18
    i2 --> h11 & h12 & h13 & h14 & h15 & h16 & h17 & h18
    i3 --> h11 & h12 & h13 & h14 & h15 & h16 & h17 & h18
    h11 & h12 & h13 & h14 & h15 & h16 & h17 & h18 --> h21 & h22 & h23 & h24 & h25
    h21 & h22 & h23 & h24 & h25 --> o1 & o2

    style input fill:#e1f5ff
    style hidden1 fill:#e8e1ff
    style hidden2 fill:#e1ffe8
    style output fill:#ffe1e1
Key insight: Data flows forward through the network, with each layer creating progressively more abstract representations.
Now let’s implement this:
class SimpleNeuralNetwork:
    """A simple feedforward neural network."""

    def __init__(self, layer_sizes):
        """
        Args:
            layer_sizes: list of layer sizes, e.g., [3, 8, 5, 2]
                means 3 inputs, two hidden layers (8 and 5 neurons),
                and 2 output neurons
        """
        self.layers = []
        for i in range(len(layer_sizes) - 1):
            layer = NeuralLayer(layer_sizes[i], layer_sizes[i + 1])
            self.layers.append(layer)

    def forward(self, inputs):
        """Forward pass through all layers."""
        activation = inputs

        print("Forward pass through network:")
        print(f"Input shape: {activation.shape}")

        for i, layer in enumerate(self.layers):
            # Use ReLU for hidden layers, linear for output
            if i < len(self.layers) - 1:
                activation = layer.forward(activation, activation_fn=relu)
            else:
                activation = layer.forward(activation, activation_fn=lambda x: x)

            print(f"After layer {i+1}: {activation.shape}")

        return activation

# Create a network: 3 inputs -> 8 hidden -> 5 hidden -> 2 outputs
network = SimpleNeuralNetwork([3, 8, 5, 2])

# Process a batch of data
batch = np.random.randn(4, 3)
output = network.forward(batch)

print(f"\nFinal output:\n{output}")
Forward pass through network:
Input shape: (4, 3)
After layer 1: (4, 8)
After layer 2: (4, 5)
After layer 3: (4, 2)
Final output:
[[ 0.00230489 0.00411068]
[-0.00171528 0.00183165]
[ 0.01574223 0.01413083]
[ 0.00275095 0.00187454]]
What to notice:
- Each layer transforms the data: (batch, n_in) -> (batch, n_out)
- The output of one layer becomes the input to the next
- This creates a series of increasingly abstract representations
Visualizing the Transformation
Let’s see how data is transformed as it flows through the network:
# Create a simpler network for visualization
viz_network = SimpleNeuralNetwork([2, 4, 3, 2])

# Create some structured input data
theta = np.linspace(0, 2*np.pi, 100)
inputs = np.column_stack([np.cos(theta), np.sin(theta)])

# Track activations at each layer
activations = [inputs]
current = inputs

for i, layer in enumerate(viz_network.layers):
    if i < len(viz_network.layers) - 1:
        current = layer.forward(current, activation_fn=relu)
    else:
        current = layer.forward(current, activation_fn=lambda x: x)
    activations.append(current)

# Visualize the transformations
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for i, (ax, act) in enumerate(zip(axes, activations)):
    if i == 0:
        title = "Input Space"
    elif i == len(activations) - 1:
        title = "Output Space"
    else:
        title = f"Hidden Layer {i}"

    # Plot first two dimensions
    ax.scatter(act[:, 0], act[:, 1], c=theta, cmap='viridis', s=20)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.set_xlabel('Dimension 1')
    ax.set_ylabel('Dimension 2')
    ax.grid(True, alpha=0.3)
    ax.axis('equal')

plt.tight_layout()
plt.show()
What to notice: Each layer progressively transforms the data, creating new representations. The network learns to map inputs to outputs through these transformations.
Part 3: Attention Mechanisms and Transformers
The Limitation of Basic Neural Networks
Traditional feedforward networks process each input independently. But what if relationships between inputs matter? This is where attention mechanisms come in.
For an excellent visual explanation of attention and transformers, watch:
- Attention in transformers, visually explained by 3Blue1Brown
- Visualizing Attention, a Transformer’s Heart (full explanation)
What is Attention?
Attention allows the network to focus on relevant parts of the input when processing each element. Think of reading a sentence: to understand “it,” you need to look back and find what “it” refers to.
flowchart TB
    subgraph input["Input Sequence"]
        t1["Token 1"]
        t2["Token 2"]
        t3["Token 3"]
        t4["Token 4"]
    end
    subgraph qkv["Linear Projections"]
        direction LR
        q["Query (Q)<br/>What I'm looking for"]
        k["Key (K)<br/>What I offer"]
        v["Value (V)<br/>What I return"]
    end
    subgraph attention["Attention Computation"]
        similarity["Compute Similarity<br/>Q · Kᵀ"]
        weights["Softmax<br/>(Attention Weights)"]
        output["Weighted Sum<br/>Σ(weights × V)"]
    end
    subgraph result["Context-Aware Output"]
        o1["Output 1<br/>(informed by all tokens)"]
        o2["Output 2<br/>(informed by all tokens)"]
        o3["Output 3<br/>(informed by all tokens)"]
        o4["Output 4<br/>(informed by all tokens)"]
    end

    t1 & t2 & t3 & t4 --> qkv
    qkv --> similarity
    similarity --> weights
    weights --> output
    output --> o1 & o2 & o3 & o4

    style input fill:#e1f5ff
    style qkv fill:#fff3e1
    style attention fill:#f0e1ff
    style result fill:#e1ffe1
Key insight: Each token can “attend to” (look at) all other tokens, deciding which are most relevant for its own representation.
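In symbols, this is scaled dot-product attention, as introduced in “Attention Is All You Need”: Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V, where d_k is the dimension of the key vectors and the softmax turns similarity scores into weights that sum to 1. The simplified code below follows this recipe directly, using the raw token vectors as Q, K, and V (a full transformer would first pass them through learned projection matrices).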
def simple_attention(query, keys, values):
    """
    Simplified attention mechanism.

    Args:
        query: what we're looking for (n_queries, d_k)
        keys: what we're comparing against (n_keys, d_k)
        values: what we return (n_keys, d_v)

    Returns:
        weighted combination of values
    """
    # Step 1: Compute similarity scores
    # How much does each key match the query?
    scores = np.dot(query, keys.T)  # (n_queries, n_keys)

    # Step 2: Convert to attention weights (softmax)
    # Scale by sqrt of dimension for stability
    d_k = keys.shape[1]
    scores = scores / np.sqrt(d_k)
    attention_weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

    # Step 3: Weighted sum of values
    output = np.dot(attention_weights, values)  # (n_queries, d_v)

    return output, attention_weights

# Example: A simple sequence
# Let's say we have 4 tokens (words), each represented by a 3D vector
sequence = np.array([
    [1.0, 0.0, 0.5],  # Token 0
    [0.5, 1.0, 0.2],  # Token 1
    [0.2, 0.3, 1.0],  # Token 2
    [1.0, 0.5, 0.3],  # Token 3
])

# Let's see what Token 2 attends to
query = sequence[2:3]  # Query is Token 2
keys = sequence        # Keys are all tokens
values = sequence      # Values are all tokens (simplified)

output, attention_weights = simple_attention(query, keys, values)

print("Attention weights for Token 2:")
for i, weight in enumerate(attention_weights[0]):
    print(f"  Token {i}: {weight:.3f}")

print(f"\nOriginal Token 2: {sequence[2]}")
print(f"After attention: {output[0]}")
Attention weights for Token 2:
Token 0: 0.238
Token 1: 0.225
Token 2: 0.305
Token 3: 0.231
Original Token 2: [0.2 0.3 1. ]
After attention: [0.64324521 0.43223908 0.53893462]
What to notice:
- The attention weights sum to 1.0
- Token 2 attends most to itself, but also considers other tokens
- The output is a weighted combination - it’s been “informed” by the context
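One practical aside: simple_attention exponentiates the scaled scores directly, which can overflow for large score magnitudes. A numerically stable softmax (a standard variant, not what the code above does) subtracts the row maximum before exponentiating and yields mathematically identical weights:

import numpy as np

def stable_softmax(scores):
    """Softmax along the last axis, shifted by the row max to avoid overflow."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

# Scores this large would overflow np.exp directly, but work fine here
print(stable_softmax(np.array([[1000.0, 1001.0, 1002.0]])))  # ~[[0.090, 0.245, 0.665]]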
Visualizing Attention Patterns
# Compute full attention matrix (all tokens attending to all tokens)
all_queries = sequence
all_keys = sequence
all_values = sequence

outputs, full_attention = simple_attention(all_queries, all_keys, all_values)

# Visualize the attention matrix
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Attention matrix
im = ax1.imshow(full_attention, cmap='viridis', aspect='auto')
ax1.set_title('Attention Matrix\n(Which tokens attend to which)',
              fontsize=12, fontweight='bold')
ax1.set_xlabel('Key Token')
ax1.set_ylabel('Query Token')
ax1.set_xticks(range(4))
ax1.set_yticks(range(4))
plt.colorbar(im, ax=ax1, label='Attention Weight')

# Add values to cells
for i in range(4):
    for j in range(4):
        text = ax1.text(j, i, f'{full_attention[i, j]:.2f}',
                        ha="center", va="center", color="white", fontweight='bold')

# Show transformation
ax2.bar(range(4), sequence[:, 0], alpha=0.5, label='Original (dim 0)')
ax2.bar(range(4), outputs[:, 0], alpha=0.5, label='After attention (dim 0)')
ax2.set_title('Token Representations\n(First dimension)',
              fontsize=12, fontweight='bold')
ax2.set_xlabel('Token')
ax2.set_ylabel('Value')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
What to notice: The attention matrix shows which tokens influence each other. Diagonal values are often high (tokens attend to themselves), but off-diagonal values capture contextual relationships.
Multi-Head Attention
Transformers use multiple attention heads to capture different types of relationships simultaneously.
flowchart TB
    subgraph input_seq["Input Sequence (d_model = 512)"]
        seq["Token 1, Token 2, ..., Token n"]
    end
    subgraph linear_proj["Linear Projections"]
        wq["W_Q"]
        wk["W_K"]
        wv["W_V"]
    end
    subgraph heads["Split into Multiple Heads (e.g., 8 heads × 64 dims)"]
        direction LR
        head1["Head 1<br/>🔵<br/>Attends to<br/>Syntax"]
        head2["Head 2<br/>🟢<br/>Attends to<br/>Semantics"]
        head3["Head 3<br/>🟡<br/>Attends to<br/>Position"]
        dots["..."]
        head8["Head 8<br/>🔴<br/>Attends to<br/>Context"]
    end
    subgraph attention_ops["Parallel Attention Operations"]
        att1["Attention<br/>Computation"]
        att2["Attention<br/>Computation"]
        att3["Attention<br/>Computation"]
        att4["Attention<br/>Computation"]
    end
    subgraph concat["Concatenate Heads"]
        combined["Combined Output<br/>(8 heads × 64 = 512 dims)"]
    end
    subgraph final["Final Projection"]
        wo["W_O<br/>(512 × 512)"]
        output["Context-Aware<br/>Representations"]
    end

    seq --> linear_proj
    linear_proj --> heads
    head1 --> att1
    head2 --> att2
    head3 --> att3
    head8 --> att4
    att1 & att2 & att3 & att4 --> combined
    combined --> wo
    wo --> output

    style input_seq fill:#e1f5ff
    style linear_proj fill:#fff3e1
    style heads fill:#f0e1ff
    style attention_ops fill:#ffe1f0
    style concat fill:#e1ffe8
    style final fill:#e1ffe1
Key insight: Multiple heads can learn different attention patterns - some might focus on nearby words, others on distant relationships, enabling richer representations.
class MultiHeadAttention:
    """Multi-head attention mechanism."""

    def __init__(self, d_model, n_heads):
        """
        Args:
            d_model: dimension of the model (e.g., 512)
            n_heads: number of attention heads (e.g., 8)
        """
        self.n_heads = n_heads
        self.d_model = d_model
        self.d_k = d_model // n_heads  # dimension per head

        # Linear projections for Q, K, V
        self.W_q = np.random.randn(d_model, d_model) * 0.1
        self.W_k = np.random.randn(d_model, d_model) * 0.1
        self.W_v = np.random.randn(d_model, d_model) * 0.1
        self.W_o = np.random.randn(d_model, d_model) * 0.1

    def split_heads(self, x):
        """Split the last dimension into (n_heads, d_k)."""
        batch_size, seq_len, d_model = x.shape
        # Reshape to (batch_size, seq_len, n_heads, d_k)
        x = x.reshape(batch_size, seq_len, self.n_heads, self.d_k)
        # Transpose to (batch_size, n_heads, seq_len, d_k)
        return x.transpose(0, 2, 1, 3)

    def attention(self, q, k, v):
        """Scaled dot-product attention."""
        d_k = q.shape[-1]
        scores = np.matmul(q, k.transpose(0, 1, 3, 2)) / np.sqrt(d_k)
        attention_weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        output = np.matmul(attention_weights, v)
        return output, attention_weights

    def forward(self, x):
        """
        Forward pass.

        Args:
            x: input tensor of shape (batch_size, seq_len, d_model)

        Returns:
            output of shape (batch_size, seq_len, d_model)
        """
        batch_size, seq_len, _ = x.shape

        # Linear projections
        q = np.dot(x, self.W_q)
        k = np.dot(x, self.W_k)
        v = np.dot(x, self.W_v)

        # Split into multiple heads
        q = self.split_heads(q.reshape(batch_size, seq_len, self.d_model))
        k = self.split_heads(k.reshape(batch_size, seq_len, self.d_model))
        v = self.split_heads(v.reshape(batch_size, seq_len, self.d_model))

        # Apply attention
        attended, attention_weights = self.attention(q, k, v)

        # Concatenate heads
        attended = attended.transpose(0, 2, 1, 3)
        concatenated = attended.reshape(batch_size, seq_len, self.d_model)

        # Final linear projection
        output = np.dot(concatenated, self.W_o)

        return output, attention_weights

# Create multi-head attention
d_model = 8
n_heads = 2
mha = MultiHeadAttention(d_model, n_heads)

# Process a sequence
batch_size = 1
seq_len = 4
x = np.random.randn(batch_size, seq_len, d_model)

output, attention_weights = mha.forward(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
print(f"\nNumber of heads: {n_heads}")
print(f"Each head attends to sequence of length: {seq_len}")
Input shape: (1, 4, 8)
Output shape: (1, 4, 8)
Attention weights shape: (1, 2, 4, 4)
Number of heads: 2
Each head attends to sequence of length: 4
What to notice:
- Each head learns different attention patterns
- Multiple heads capture multiple types of relationships simultaneously
- This is much more powerful than single-head attention
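You can inspect the per-head patterns directly by slicing the returned weights, whose shape is (batch, n_heads, seq_len, seq_len). A quick sketch, assuming mha, attention_weights, and n_heads from the cell above are still in scope:

# Each head has its own (seq_len × seq_len) attention matrix
for h in range(n_heads):
    print(f"Head {h} attention weights (rows = queries, columns = keys):")
    print(np.round(attention_weights[0, h], 3))
    print()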
How Attention Changes Encodings
Let’s see how attention modifies representations compared to a simple feedforward layer:
flowchart TB
    subgraph comparison["Processing Paradigms"]
        direction LR
        subgraph ff["Feedforward Network"]
            direction TB
            ff_input["Token 1 | Token 2 | Token 3 | Token 4"]
            ff_process["↓ Independent Processing ↓"]
            ff_layer["Layer applies SAME transformation<br/>to each position separately"]
            ff_output["Output 1 | Output 2 | Output 3 | Output 4"]
            ff_note["❌ No communication between positions"]
            ff_input --> ff_process
            ff_process --> ff_layer
            ff_layer --> ff_output
            ff_output --> ff_note
        end
        subgraph attn["Attention Network"]
            direction TB
            attn_input["Token 1 | Token 2 | Token 3 | Token 4"]
            attn_process["↓ Context-Aware Processing ↓"]
            attn_layer["Each position looks at ALL positions<br/>Weighted combination based on relevance"]
            attn_output["Output 1 | Output 2 | Output 3 | Output 4"]
            attn_note["✅ Each output informed by full context"]
            attn_input --> attn_process
            attn_process --> attn_layer
            attn_layer --> attn_output
            attn_output --> attn_note
        end
    end
    subgraph examples["Real-World Example"]
        direction LR
        sentence["'The animal didn't cross the street because it was too tired'"]
        ff_ex["Feedforward: 'it' processed in isolation<br/>❌ Can't determine if 'it' = animal or street"]
        attn_ex["Attention: 'it' attends to all words<br/>✅ Learns 'it' → 'animal' (via context)"]
        sentence --> ff_ex
        sentence --> attn_ex
    end

    style ff fill:#ffe1e1
    style attn fill:#e1ffe1
    style ff_note fill:#ffcccc
    style attn_note fill:#ccffcc
    style examples fill:#fff9e1
Why this is revolutionary: Attention allows the network to dynamically route information based on context, rather than applying fixed transformations. This is essential for understanding language, time series, and sequential data where relationships between elements matter.
# Create sample sequence data
np.random.seed(42)
seq_length = 6
d_model = 8

# Input sequence
input_seq = np.random.randn(1, seq_length, d_model)

# Option 1: Process with feedforward layer (no attention)
ff_layer = NeuralLayer(d_model, d_model)
ff_output = ff_layer.forward(input_seq.reshape(-1, d_model), activation_fn=relu)
ff_output = ff_output.reshape(1, seq_length, d_model)

# Option 2: Process with attention
mha = MultiHeadAttention(d_model, n_heads=2)
attn_output, attn_weights = mha.forward(input_seq)

# Visualize the difference
fig, axes = plt.subplots(1, 3, figsize=(16, 4))

# Input
im0 = axes[0].imshow(input_seq[0].T, cmap='RdBu', aspect='auto', vmin=-2, vmax=2)
axes[0].set_title('Input Sequence', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Position in Sequence')
axes[0].set_ylabel('Feature Dimension')
plt.colorbar(im0, ax=axes[0])

# Feedforward output
im1 = axes[1].imshow(ff_output[0].T, cmap='RdBu', aspect='auto', vmin=-2, vmax=2)
axes[1].set_title('After Feedforward Layer\n(No attention)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Position in Sequence')
axes[1].set_ylabel('Feature Dimension')
plt.colorbar(im1, ax=axes[1])

# Attention output
im2 = axes[2].imshow(attn_output[0].T, cmap='RdBu', aspect='auto', vmin=-2, vmax=2)
axes[2].set_title('After Multi-Head Attention\n(Context-aware)', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Position in Sequence')
axes[2].set_ylabel('Feature Dimension')
plt.colorbar(im2, ax=axes[2])

plt.tight_layout()
plt.show()

# Show how representations relate to each other
print("\nCosine similarity between position encodings:\n")

print("Feedforward (independent processing):")
ff_norm = ff_output[0] / np.linalg.norm(ff_output[0], axis=1, keepdims=True)
ff_similarity = np.dot(ff_norm, ff_norm.T)
print(f"Average off-diagonal similarity: {(ff_similarity.sum() - seq_length) / (seq_length * (seq_length - 1)):.3f}")

print("\nAttention (context-aware processing):")
attn_norm = attn_output[0] / np.linalg.norm(attn_output[0], axis=1, keepdims=True)
attn_similarity = np.dot(attn_norm, attn_norm.T)
print(f"Average off-diagonal similarity: {(attn_similarity.sum() - seq_length) / (seq_length * (seq_length - 1)):.3f}")
Cosine similarity between position encodings:
Feedforward (independent processing):
Average off-diagonal similarity: 0.421
Attention (context-aware processing):
Average off-diagonal similarity: 0.998
Why this matters:
- Feedforward layers process each position independently - no position “knows” about others
- Attention layers mix information across positions - each position is informed by context
- This context-awareness is crucial for sequential data like language, time series, or video
Part 4: Putting It All Together - The Transformer Block
A complete transformer block combines attention with feedforward layers:
flowchart TB
    input["Input<br/>(Sequence of Embeddings)"]

    subgraph mha_block["Multi-Head Attention Block"]
        mha["Multi-Head<br/>Attention"]
        add1["Add"]
        norm1["Layer Norm"]
    end
    subgraph ff_block["Feedforward Block"]
        ff1["Linear<br/>(expand)"]
        relu["ReLU"]
        ff2["Linear<br/>(project)"]
        add2["Add"]
        norm2["Layer Norm"]
    end

    output["Output<br/>(Enriched Representations)"]

    input --> mha
    input -.->|"Residual<br/>Connection"| add1
    mha --> add1
    add1 --> norm1
    norm1 --> ff1
    ff1 --> relu
    relu --> ff2
    norm1 -.->|"Residual<br/>Connection"| add2
    ff2 --> add2
    add2 --> norm2
    norm2 --> output

    style input fill:#e1f5ff
    style mha_block fill:#f0e1ff
    style ff_block fill:#ffe8e1
    style output fill:#e1ffe1
    style mha fill:#d8b3ff
    style add1 fill:#ffd8b3
    style add2 fill:#ffd8b3
Key components:
- Multi-Head Attention: Mix information across positions (context)
- Residual Connection: Add input back to help gradient flow
- Layer Normalization: Stabilize training
- Feedforward Network: Transform features independently
- Another Residual Connection: More gradient flow
This pattern repeats for each transformer layer in a model!
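The implementation below keeps the residual additions but, to stay focused on the data flow, skips the “Norm” half of Add & Norm. For reference, layer normalization standardizes each position’s feature vector to zero mean and unit variance and then applies a learned scale and shift. A minimal sketch (not wired into the class that follows):

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize over the feature dimension (last axis); gamma and beta are learned per-feature in practice."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta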
class TransformerBlock:
    """A single transformer block with attention and feedforward."""

    def __init__(self, d_model, n_heads, d_ff):
        """
        Args:
            d_model: model dimension
            n_heads: number of attention heads
            d_ff: feedforward network dimension
        """
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.ff1 = NeuralLayer(d_model, d_ff)
        self.ff2 = NeuralLayer(d_ff, d_model)

    def forward(self, x):
        """
        Forward pass through transformer block.

        Args:
            x: input of shape (batch_size, seq_len, d_model)

        Returns:
            output of same shape as input
        """
        batch_size, seq_len, d_model = x.shape

        # Step 1: Multi-head attention
        attn_out, attn_weights = self.attention.forward(x)

        # Step 2: Add & Norm (simplified - just add)
        x = x + attn_out

        # Step 3: Feedforward network
        # Reshape for layer processing
        x_flat = x.reshape(-1, d_model)
        ff_out = self.ff1.forward(x_flat, activation_fn=relu)
        ff_out = self.ff2.forward(ff_out, activation_fn=lambda x: x)
        ff_out = ff_out.reshape(batch_size, seq_len, d_model)

        # Step 4: Add & Norm (simplified - just add)
        output = x + ff_out

        return output, attn_weights

# Create and test a transformer block
transformer = TransformerBlock(d_model=8, n_heads=2, d_ff=16)

# Process a sequence
input_seq = np.random.randn(1, 6, 8)
output, attention = transformer.forward(input_seq)

print(f"Input shape: {input_seq.shape}")
print(f"Output shape: {output.shape}")
print("\nTransformer block completed:")
print("  1. Multi-head attention (context mixing)")
print("  2. Residual connection")
print("  3. Feedforward network (feature transformation)")
print("  4. Residual connection")
Input shape: (1, 6, 8)
Output shape: (1, 6, 8)
Transformer block completed:
1. Multi-head attention (context mixing)
2. Residual connection
3. Feedforward network (feature transformation)
4. Residual connection
What to notice: The transformer block combines three key ideas:
- Attention: Mix information across sequence positions (capture context)
- Feedforward: Transform features independently at each position (extract patterns)
- Residual connections: Add the input back to help gradients flow
Summary: The Neural Network Hierarchy
Let’s recap the progressive complexity:
flowchart TD
    subgraph level1["Level 1: Single Neuron"]
        n1["Weighted Sum<br/>+ Activation<br/>────<br/>z = Σ(wᵢxᵢ) + b<br/>output = f(z)"]
    end
    subgraph level2["Level 2: Neural Layer"]
        layer["Vectorized Operations<br/>────<br/>Multiple neurons in parallel<br/>Matrix: (n_in, n_out)"]
    end
    subgraph level3["Level 3: Deep Network"]
        deep["Stacked Layers<br/>────<br/>Input → Hidden₁ → Hidden₂ → Output<br/>Hierarchical features"]
    end
    subgraph level4a["Level 4a: Attention Mechanism"]
        attention["Query-Key-Value<br/>────<br/>Context-aware weighting<br/>Tokens attend to each other"]
    end
    subgraph level4b["Level 4b: Multi-Head Attention"]
        multihead["Parallel Attention Heads<br/>────<br/>Multiple relationship types<br/>8 heads × 64 dims = 512 dims"]
    end
    subgraph level5["Level 5: Transformer Block"]
        transformer["Complete Architecture<br/>────<br/>Multi-Head Attention<br/>+ Residual Connection<br/>+ Feedforward Network<br/>+ Residual Connection"]
    end
    subgraph level6["Level 6: Full Transformer"]
        full["Stack of N Blocks<br/>────<br/>Block₁ → Block₂ → ... → Blockₙ<br/>+ Positional Encoding<br/>+ Output Head"]
    end

    level1 -.->|"Parallelize"| level2
    level2 -.->|"Stack"| level3
    level3 -.->|"Add Context<br/>Awareness"| level4a
    level4a -.->|"Multiple<br/>Heads"| level4b
    level4b -.->|"+ Feedforward<br/>+ Residuals"| level5
    level5 -.->|"Stack<br/>Layers"| level6

    capability1["Independent<br/>Processing"] -.-> level1
    capability2["Batch<br/>Processing"] -.-> level2
    capability3["Hierarchical<br/>Features"] -.-> level3
    capability4["Sequential<br/>Dependencies"] -.-> level4a
    capability5["Rich<br/>Relationships"] -.-> level4b
    capability6["Stable Deep<br/>Learning"] -.-> level5
    capability7["Complex<br/>Understanding"] -.-> level6

    style level1 fill:#e1f5ff
    style level2 fill:#e8f0ff
    style level3 fill:#e8e5ff
    style level4a fill:#f0e1ff
    style level4b fill:#f8e1ff
    style level5 fill:#ffe1f0
    style level6 fill:#ffe1e1
    style capability1 fill:#fff9e1
    style capability2 fill:#fff9e1
    style capability3 fill:#fff9e1
    style capability4 fill:#fff9e1
    style capability5 fill:#fff9e1
    style capability6 fill:#fff9e1
    style capability7 fill:#fff9e1
Level 1: Single Neuron
- Computes weighted sum of inputs
- Applies activation function (e.g., ReLU)
- Transforms: scalar inputs → scalar output
- Key insight: Non-linearity enables learning complex patterns
Level 2: Layer of Neurons
- Multiple neurons computed in parallel (vectorized)
- Matrix multiplication: efficient batch processing
- Transforms: vector input → vector output
- Key insight: Multiple features extracted simultaneously
Level 3: Multi-Layer Network
- Stack layers to create deep representations
- Each layer builds on previous abstractions
- Transforms: input space → hidden spaces → output space
- Key insight: Depth creates hierarchical features
Level 4: Attention & Transformers
- Attention: dynamically weight inputs based on context
- Multi-head: capture multiple relationship types
- Transformer: attention + feedforward with residual connections
- Key insight: Context-aware processing for sequential data
Traditional Neural Networks: Process each input independently
- Good for: images, tabular data where position doesn’t matter
Transformers with Attention: Process inputs in context of each other
- Good for: language, time series, any sequential data
- Revolution: Enables models like GPT, BERT, and modern GFMs
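As a closing sketch (a toy construction, not a real GPT or BERT), stacking the TransformerBlock from Part 4 a few times already gives the skeleton of a full transformer; a production model would add token embeddings, positional encodings, and an output head:

# Toy "full transformer": a stack of the TransformerBlock defined in Part 4
blocks = [TransformerBlock(d_model=8, n_heads=2, d_ff=16) for _ in range(3)]

x = np.random.randn(1, 6, 8)  # (batch, seq_len, d_model)
for block in blocks:
    x, _ = block.forward(x)   # each block preserves the shape (1, 6, 8)

print(x.shape)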
Interactive Exploration
Experiment with the code above:
- Activation functions: Change the activation in single_neuron() and observe output changes
- Layer sizes: Modify layer_sizes in SimpleNeuralNetwork - what happens with very wide or very deep networks?
- Attention heads: Increase n_heads in MultiHeadAttention - do patterns change?
- Sequence length: Use longer sequences in attention examples - observe attention patterns
These experiments will deepen your intuition about how neural networks transform data.
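If you want a starting point for the sequence-length experiment, the snippet below (reusing simple_attention from Part 3) produces a larger attention matrix to inspect:

# A longer random sequence: 10 tokens, 3 features each
long_seq = np.random.randn(10, 3)
_, long_attention = simple_attention(long_seq, long_seq, long_seq)
print(long_attention.shape)        # (10, 10)
print(long_attention.sum(axis=1))  # each row of weights still sums to 1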
Further Resources
Video Explanations (3Blue1Brown)
- Neural Networks - Core concepts
- Gradient Descent - How networks learn
- Backpropagation - How gradients flow
- Attention & Transformers - Modern architecture
Key Papers
- “Attention Is All You Need” (Vaswani et al., 2017) - The original transformer paper
- “Deep Residual Learning” (He et al., 2015) - Residual connections
- “Understanding Deep Learning Requires Rethinking Generalization” (Zhang et al., 2017)
Next Steps in This Course
- Week 2: Spatial-temporal attention for geospatial data
- Week 3: Vision Transformers adapted for satellite imagery
- Week 4: Pretraining strategies (masked autoencoders)
This explainer is designed to build progressive understanding. Each section assumes you’ve understood the previous ones. Take time to run the code, observe the outputs, and experiment!