Deep learning for traffic forecasting: Graph and RNN approaches
Traffic congestion has become one of the most pressing challenges in modern urban environments, costing billions in lost productivity and environmental damage. Traditional statistical methods for traffic forecasting often struggle to capture the complex spatial and temporal dependencies inherent in traffic networks. Deep learning changes the picture: by combining graph neural networks with recurrent neural networks, it can predict traffic conditions far more accurately than classical baselines.
This article explores how cutting-edge techniques like the diffusion convolutional recurrent neural network are transforming data-driven traffic forecasting and enabling smarter cities.

1. Understanding the traffic forecasting challenge
Traffic forecasting involves predicting future traffic conditions based on historical and real-time data collected from various sensors across road networks. Unlike simple time series prediction, traffic forecasting must account for both temporal patterns (how traffic evolves over time) and spatial dependencies (how traffic at one location affects nearby locations).
The complexity of spatial-temporal data
Traffic networks exhibit unique characteristics that make forecasting particularly challenging:
- Spatial dependencies: Traffic conditions at one intersection are influenced by upstream and downstream locations. A bottleneck at one point can cascade through the network, affecting multiple roads.
- Temporal dynamics: Traffic patterns vary throughout the day, week, and season, with rush hours, weekends, and holidays showing distinct behaviors.
- Non-Euclidean structure: Unlike images or regular grids, road networks form irregular graph structures where traditional convolutional neural networks cannot be directly applied.
Consider a highway system where an accident on the main route causes ripple effects on alternative routes miles away. This spatial-temporal interaction requires models that can capture both the network topology and temporal evolution simultaneously.
Traditional approaches and their limitations
Classical methods like ARIMA (AutoRegressive Integrated Moving Average) and Kalman filters have been the workhorses of traffic forecasting for decades. These statistical models work well for simple scenarios but struggle with:
- Non-linear relationships in traffic flow
- Complex interactions between multiple road segments
- Sudden disruptions like accidents or special events
- Large-scale networks with thousands of sensors
Machine learning methods like support vector regression improved upon statistical approaches but still treated each location independently, missing crucial spatial correlations. The breakthrough came with deep learning, which could automatically learn hierarchical representations of spatial-temporal patterns.
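For context, such a per-sensor baseline takes only a few lines; the sketch below fits an ARIMA model with statsmodels on a synthetic speed series (the model order and the data are placeholders). Note that the model sees a single sensor in isolation, which is exactly the spatial blindness described above:
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic stand-in for one sensor's 5-minute speed readings (km/h)
speeds = 60 + 5 * np.sin(np.linspace(0, 20 * np.pi, 2000)) + np.random.randn(2000)

# Fit an ARIMA(2, 0, 1) model and forecast one hour (12 five-minute steps) ahead
model = ARIMA(speeds[:1800], order=(2, 0, 1)).fit()
forecast = model.forecast(steps=12)
print(forecast[:3])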
2. Graph neural networks for spatial modeling
Graph neural networks have emerged as a powerful framework for processing data on irregular structures like traffic networks. Unlike traditional neural networks that operate on fixed-size grids, GNNs can handle the variable topology of road networks naturally.
Representing traffic networks as graphs
In traffic forecasting, we model the road network as a graph \( G = (V, E) \) where:
- Vertices \( V \) represent sensors or road segments
- Edges \( E \) represent connections between locations (physical roads or proximity relationships)
- Each node \( v_i \) has feature vectors representing traffic measurements (speed, flow, occupancy)
The adjacency matrix \( A \) encodes the network structure, where \( A_{ij} = 1 \) if locations \( i \) and \( j \) are connected. This matrix can be weighted to reflect distance, travel time, or correlation strength between locations.
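As a toy illustration, a hypothetical four-sensor corridor might be encoded like this (the weights are made up for the example):
import numpy as np

# Hypothetical 4-sensor corridor: 0 -> 1 -> 2 -> 3, with a side road 1 -> 3.
# Entries are weights in [0, 1], e.g. derived from inverse travel time.
A = np.array([
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.3],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.0],
])
# Node features: one row per sensor, columns = (speed, flow, occupancy)
X = np.array([[62.0, 310, 0.12],
              [48.0, 420, 0.25],
              [30.0, 510, 0.41],
              [55.0, 280, 0.15]])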
Graph convolution operations
The key innovation in GNNs is the graph convolution operation, which aggregates information from neighboring nodes similar to how CNNs aggregate information from nearby pixels. For a node \( i \), the convolution can be expressed as:
$$ h_i^{(l+1)} = \sigma\left(\sum_{j \in N(i)} \frac{1}{c_{ij}} W^{(l)} h_j^{(l)}\right) $$
where \( h_i^{(l)} \) is the hidden state of node \( i \) at layer \( l \), \( N(i) \) represents the neighbors of node \( i \), \( c_{ij} \) is a normalization constant, \( W^{(l)} \) is a learnable weight matrix, and \( \sigma \) is an activation function.
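A single layer of this operation is only a few lines of code. The sketch below uses row normalization for the constants \( c_{ij} \) and ReLU for \( \sigma \), and adds self-loops so each node retains its own signal (all illustrative choices):
import torch

def graph_conv_step(H, A, W):
    """One graph convolution: aggregate neighbor features with row
    normalization (playing the role of 1/c_ij), apply a shared weight
    matrix, then a nonlinearity."""
    A_norm = A / (A.sum(dim=1, keepdim=True) + 1e-6)
    return torch.relu(A_norm @ H @ W)

H = torch.randn(4, 3)                  # 4 nodes, 3 features each
A = torch.rand(4, 4) + torch.eye(4)    # placeholder adjacency with self-loops
W = torch.randn(3, 8)                  # learnable in practice (nn.Parameter)
print(graph_conv_step(H, A, W).shape)  # torch.Size([4, 8])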
Spectral graph convolutions
Spectral approaches leverage graph signal processing theory to define convolutions in the spectral domain using the graph Laplacian. The graph Laplacian is defined as \( L = D - A \), where \( D \) is the degree matrix. Spectral convolutions can be expressed as:
$$ g_\theta \star x = U g_\theta U^T x $$
where \( U \) is the matrix of eigenvectors of \( L \), and \( g_\theta \) is a learnable filter in the spectral domain.
For traffic networks, spectral methods can capture global patterns but are computationally expensive for large graphs. This led to the development of localized approximations based on Chebyshev polynomials.
Diffusion convolution
The diffusion convolutional layer, a key component in the diffusion convolutional recurrent neural network, uses a different approach inspired by diffusion processes. It models how information spreads through the graph over multiple hops:
$$ H = \sum_{k=0}^{K} P^k X W_k $$
where \( P \) is the transition matrix (often derived from the adjacency matrix), \( X \) is the input signal, \( K \) is the number of diffusion steps, and \( W_k \) are learnable parameters. This captures how traffic conditions diffuse through the network, similar to how congestion propagates from one road segment to neighboring segments.
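The sum is cheap to compute by repeatedly applying \( P \) to the signal rather than forming matrix powers explicitly. A minimal numpy sketch (row normalization for \( P \) is an assumption here; the bidirectional version appears in the DCRNN implementation in section 4):
import numpy as np

def diffusion_conv(X, A, weights):
    """H = sum_{k=0..K} P^k X W_k, with P the row-normalized transition
    matrix. X: (num_nodes, in_dim); weights: K+1 arrays of (in_dim, out_dim)."""
    P = A / (A.sum(axis=1, keepdims=True) + 1e-6)
    H, X_k = 0.0, X
    for W_k in weights:
        H = H + X_k @ W_k  # contribution of the k-hop diffused signal
        X_k = P @ X_k      # diffuse one more hop for the next term
    return H

X = np.random.randn(4, 3)
A = np.random.rand(4, 4)
Ws = [np.random.randn(3, 8) for _ in range(3)]  # K = 2 diffusion steps
print(diffusion_conv(X, A, Ws).shape)           # (4, 8)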
3. Recurrent neural networks for temporal modeling
While graph neural networks excel at capturing spatial dependencies, recurrent neural networks are the go-to architecture for modeling temporal sequences. In traffic forecasting, we need to capture how traffic evolves over time, including short-term fluctuations and longer-term trends.
The power of recurrence
Recurrent neural networks maintain a hidden state that gets updated at each time step, allowing them to remember past information. The basic RNN update is:
$$ h_t = \sigma(W_{hh} h_{t-1} + W_{xh} x_t + b) $$
where \( h_t \) is the hidden state at time \( t \), \( x_t \) is the input, and \( W_{hh}, W_{xh} \) are weight matrices.
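The update is a one-liner in code. The sketch below uses tanh for the activation \( \sigma \), the conventional choice for vanilla RNNs:
import torch

def rnn_step(h_prev, x_t, W_hh, W_xh, b):
    """One vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)."""
    return torch.tanh(h_prev @ W_hh + x_t @ W_xh + b)

hidden_dim, input_dim = 16, 2
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1
W_xh = torch.randn(input_dim, hidden_dim) * 0.1
b = torch.zeros(hidden_dim)

h = torch.zeros(1, hidden_dim)
for x_t in torch.randn(12, 1, input_dim):  # 12 time steps of sensor readings
    h = rnn_step(h, x_t, W_hh, W_xh, b)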
However, basic RNNs suffer from vanishing gradients when dealing with long sequences, making it difficult to capture long-term dependencies like weekly patterns in traffic.
LSTM and GRU cells
Long Short-Term Memory networks and Gated Recurrent Units address the vanishing gradient problem through gating mechanisms that control information flow. An LSTM cell maintains a cell state \( c_t \) and uses three gates:
- Forget gate: decides what information to discard from the cell state
- Input gate: decides what new information to store
- Output gate: decides what to output based on the cell state
The LSTM equations are:
$$ \begin{align}
f_t &= \sigma\big(W_f \cdot [h_{t-1}, x_t] + b_f\big) \\
i_t &= \sigma\big(W_i \cdot [h_{t-1}, x_t] + b_i\big) \\
\tilde{c}_t &= \tanh\big(W_c \cdot [h_{t-1}, x_t] + b_c\big) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\big(W_o \cdot [h_{t-1}, x_t] + b_o\big) \\
h_t &= o_t \odot \tanh(c_t)
\end{align}$$
where \( \odot \) denotes element-wise multiplication.
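In practice one rarely writes these equations by hand; PyTorch's nn.LSTMCell implements the same gating logic (with the weights stored separately rather than concatenated). A minimal usage sketch:
import torch
import torch.nn as nn

# One LSTM cell: input_size features per step, hidden_size-dim states
cell = nn.LSTMCell(input_size=2, hidden_size=16)

h = torch.zeros(8, 16)  # hidden state, batch of 8 sequences
c = torch.zeros(8, 16)  # cell state
for x_t in torch.randn(12, 8, 2):  # 12 time steps of (speed, flow) readings
    h, c = cell(x_t, (h, c))
print(h.shape)  # torch.Size([8, 16])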
Sequence-to-sequence architectures
For traffic forecasting, we typically use an encoder-decoder architecture:
- Encoder: processes historical traffic data (e.g., past hour) and compresses it into a context vector
- Decoder: generates future predictions (e.g., next 30 minutes) based on the context
This sequence-to-sequence approach allows the model to learn complex temporal patterns and make multi-step ahead predictions.
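Stripped of the graph component, the pattern looks like the sketch below: a plain GRU encoder-decoder that treats each sensor independently (the dimensions and horizon are illustrative; the DCRNN in section 4 replaces the GRU's internal transformations with graph convolutions):
import torch
import torch.nn as nn

class Seq2SeqForecaster(nn.Module):
    """Minimal GRU encoder-decoder sketch (per-node, no graph structure)."""
    def __init__(self, input_dim=1, hidden_dim=64, horizon=6):
        super().__init__()
        self.encoder = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, input_dim)
        self.horizon = horizon

    def forward(self, x):
        # x: (batch, seq_len, input_dim), e.g. the past hour of speeds
        _, h = self.encoder(x)         # h is the context: (1, batch, hidden)
        step = x[:, -1:, :]            # seed the decoder with the last observation
        outputs = []
        for _ in range(self.horizon):  # multi-step prediction, one step at a time
            out, h = self.decoder(step, h)
            step = self.proj(out)      # (batch, 1, input_dim)
            outputs.append(step)
        return torch.cat(outputs, dim=1)  # (batch, horizon, input_dim)

model = Seq2SeqForecaster()
preds = model(torch.randn(32, 12, 1))  # past 12 steps -> next 6
print(preds.shape)                     # torch.Size([32, 6, 1])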
4. Diffusion convolutional recurrent neural network
The diffusion convolutional recurrent neural network represents a breakthrough in data-driven traffic forecasting by seamlessly integrating spatial and temporal modeling. This architecture combines diffusion convolution for capturing spatial dependencies with recurrent units for temporal dynamics.
Architecture overview
The DCRNN replaces the matrix multiplications in recurrent cells with diffusion convolution operations. Instead of simple linear transformations, each gate in the recurrent cell performs graph convolutions that aggregate information from neighboring nodes.
For a GRU-based DCRNN, the update equations become:
$$ \begin{align}
r_t &= \sigma(\Theta_r \star_G [X_t, H_{t-1}] + b_r) \\
u_t &= \sigma(\Theta_u \star_G [X_t, H_{t-1}] + b_u) \\
C_t &= \tanh(\Theta_c \star_G [X_t, (r_t \odot H_{t-1})] + b_c) \\
H_t &= u_t \odot H_{t-1} + (1 - u_t) \odot C_t
\end{align} $$
where \( \star_G \) denotes the diffusion convolution operation, \( X_t \) is the input at time \( t \), \( H_t \) is the hidden state, and \( r_t, u_t \) are the reset and update gates respectively.
Bidirectional diffusion
A key innovation in DCRNN is bidirectional diffusion, which captures both forward and backward information propagation in the graph. Traffic flow can propagate both downstream (following traffic direction) and upstream (congestion backing up), so the model uses two separate diffusion processes:
$$ H = \sum_{k=0}^{K} (P_f^k X W_{k,f} + P_b^k X W_{k,b}) $$
where \( P_f \) is the forward transition matrix and \( P_b \) is the backward transition matrix. This allows the model to capture how traffic conditions spread in both directions through the network.
Scheduled sampling for training
Training sequence-to-sequence models can be challenging due to exposure bias: the model only sees ground truth during training but must use its own predictions during inference. DCRNN employs scheduled sampling, gradually transitioning from feeding ground truth to feeding the model's own predictions during training. The following implementation sketches the full model, including scheduled sampling inside the decoder loop:
import numpy as np
import torch
import torch.nn as nn

class DCRNN(nn.Module):
    def __init__(self, num_nodes, input_dim, hidden_dim, output_dim, num_layers, max_diffusion_step):
        super(DCRNN, self).__init__()
        self.num_nodes = num_nodes
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.max_diffusion_step = max_diffusion_step
        # Encoder layers
        self.encoder_cells = nn.ModuleList([
            DCGRUCell(num_nodes, input_dim if i == 0 else hidden_dim,
                      hidden_dim, max_diffusion_step)
            for i in range(num_layers)
        ])
        # Decoder layers
        self.decoder_cells = nn.ModuleList([
            DCGRUCell(num_nodes, output_dim if i == 0 else hidden_dim,
                      hidden_dim, max_diffusion_step)
            for i in range(num_layers)
        ])
        # Output projection
        self.projection = nn.Linear(hidden_dim, output_dim)

    def forward(self, inputs, adj_matrix, targets=None, teacher_forcing_ratio=0.5):
        """
        Args:
            inputs: (batch_size, seq_len, num_nodes, input_dim)
            adj_matrix: (num_nodes, num_nodes)
            targets: (batch_size, horizon, num_nodes, output_dim)
            teacher_forcing_ratio: probability of using ground truth
        """
        batch_size, seq_len, num_nodes, _ = inputs.shape
        # Encoding
        encoder_hidden_states = [None] * self.num_layers
        for t in range(seq_len):
            for layer in range(self.num_layers):
                inputs_layer = inputs[:, t] if layer == 0 else encoder_hidden_states[layer - 1]
                encoder_hidden_states[layer] = self.encoder_cells[layer](
                    inputs_layer, encoder_hidden_states[layer], adj_matrix
                )
        # Decoding with scheduled sampling
        outputs = []
        decoder_hidden_states = list(encoder_hidden_states)
        # Use the last input as the first decoder input (assumes output_dim == 1)
        decoder_input = inputs[:, -1, :, :1]
        horizon = targets.shape[1] if targets is not None else 12
        for t in range(horizon):
            for layer in range(self.num_layers):
                inputs_layer = decoder_input if layer == 0 else decoder_hidden_states[layer - 1]
                decoder_hidden_states[layer] = self.decoder_cells[layer](
                    inputs_layer, decoder_hidden_states[layer], adj_matrix
                )
            # Project to output
            output = self.projection(decoder_hidden_states[-1])
            outputs.append(output)
            # Scheduled sampling: feed ground truth with some probability during training
            if targets is not None and np.random.random() < teacher_forcing_ratio:
                decoder_input = targets[:, t]
            else:
                decoder_input = output
        return torch.stack(outputs, dim=1)

class DCGRUCell(nn.Module):
    def __init__(self, num_nodes, input_dim, hidden_dim, max_diffusion_step):
        super(DCGRUCell, self).__init__()
        self.num_nodes = num_nodes
        self.hidden_dim = hidden_dim
        self.max_diffusion_step = max_diffusion_step
        # Parameters for reset and update gates
        self.graph_conv_gate = DiffusionGraphConv(
            input_dim + hidden_dim, hidden_dim * 2, max_diffusion_step
        )
        # Parameters for candidate activation
        self.graph_conv_candidate = DiffusionGraphConv(
            input_dim + hidden_dim, hidden_dim, max_diffusion_step
        )

    def forward(self, inputs, hidden_state, adj_matrix):
        if hidden_state is None:
            hidden_state = torch.zeros(
                inputs.shape[0], self.num_nodes, self.hidden_dim
            ).to(inputs.device)
        # Concatenate input and hidden state
        combined = torch.cat([inputs, hidden_state], dim=-1)
        # Compute reset and update gates
        combined_conv = self.graph_conv_gate(combined, adj_matrix)
        r, u = torch.split(combined_conv, self.hidden_dim, dim=-1)
        r = torch.sigmoid(r)
        u = torch.sigmoid(u)
        # Compute candidate activation
        combined_candidate = torch.cat([inputs, r * hidden_state], dim=-1)
        c = torch.tanh(self.graph_conv_candidate(combined_candidate, adj_matrix))
        # Compute new hidden state
        new_hidden_state = u * hidden_state + (1 - u) * c
        return new_hidden_state

class DiffusionGraphConv(nn.Module):
    def __init__(self, input_dim, output_dim, max_diffusion_step):
        super(DiffusionGraphConv, self).__init__()
        self.max_diffusion_step = max_diffusion_step
        # Weights for forward and backward diffusion
        self.weight_forward = nn.Parameter(
            torch.FloatTensor(input_dim * (max_diffusion_step + 1), output_dim)
        )
        self.weight_backward = nn.Parameter(
            torch.FloatTensor(input_dim * (max_diffusion_step + 1), output_dim)
        )
        self.bias = nn.Parameter(torch.FloatTensor(output_dim))
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.xavier_uniform_(self.weight_forward)
        nn.init.xavier_uniform_(self.weight_backward)
        nn.init.zeros_(self.bias)

    def forward(self, inputs, adj_matrix):
        batch_size, num_nodes, input_dim = inputs.shape
        # Transition matrices: row-normalize A for forward diffusion,
        # and row-normalize A^T for backward diffusion
        adj_forward = adj_matrix / (adj_matrix.sum(dim=1, keepdim=True) + 1e-6)
        adj_t = adj_matrix.t()
        adj_backward = adj_t / (adj_t.sum(dim=1, keepdim=True) + 1e-6)
        # Diffusion process: collect the 0..K hop signals
        supports_forward = [inputs]
        supports_backward = [inputs]
        for k in range(self.max_diffusion_step):
            supports_forward.append(
                torch.matmul(adj_forward, supports_forward[-1])
            )
            supports_backward.append(
                torch.matmul(adj_backward, supports_backward[-1])
            )
        # Concatenate all diffusion steps along the feature dimension
        x_forward = torch.cat(supports_forward, dim=-1)
        x_backward = torch.cat(supports_backward, dim=-1)
        # Apply weights
        output_forward = torch.matmul(x_forward, self.weight_forward)
        output_backward = torch.matmul(x_backward, self.weight_backward)
        return output_forward + output_backward + self.bias
This implementation demonstrates the core components of DCRNN, including the diffusion graph convolution and the recurrent cell structure that integrates spatial and temporal processing.
5. Spatio-temporal graph convolutional networks
Building on similar principles, spatio-temporal graph convolutional networks offer another powerful framework for traffic forecasting. The key difference from DCRNN is that STGCN uses purely convolutional operations for both spatial and temporal dimensions, avoiding recurrence entirely.
Temporal gated convolution
Instead of RNNs, STGCN employs temporal gated convolutions inspired by the gated linear units used in sequence modeling. For a temporal sequence of graph signals, the temporal convolution applies 1D convolutions along the time axis:
$$ \Gamma = P \odot \sigma(Q) $$
where \( P \) and \( Q \) are obtained by convolving the input with different filters, and \( \sigma\) is typically a sigmoid function. This gating mechanism allows the model to control information flow without the computational overhead of recurrence.
ST-Conv blocks
The building block of STGCN is the ST-Conv block, which sandwiches a spatial graph convolution between two temporal convolutions:
- Temporal convolution layer: captures temporal dependencies
- Spatial graph convolution layer: captures spatial dependencies
- Temporal convolution layer: further processes temporal information
This “sandwich” structure effectively captures spatio-temporal correlations. The output of an ST-Conv block for a graph signal \( X \in \mathbb{R}^{N \times T \times C} \) (N nodes, T time steps, C channels) can be expressed as:
$$ H^{(l)} = \text{TConv}_2(\text{GConv}(\text{TConv}_1(H^{(l-1)}))) $$
Advantages over recurrent approaches
STGCN offers several benefits compared to recurrent architectures:
- Parallel computation: All temporal convolutions can be computed in parallel, unlike RNNs which must process sequentially
- Stable training: No vanishing or exploding gradient problems associated with long sequences
- Faster inference: No need to maintain and update hidden states
- Flexible receptive field: Stacking multiple ST-Conv blocks increases the receptive field in both space and time
For real-time traffic forecasting applications where latency is critical, STGCN’s computational efficiency makes it particularly attractive.
Implementation example
Here’s a simplified implementation of an STGCN layer:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalConvLayer(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super(TemporalConvLayer, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=(1, kernel_size),
                              padding=(0, (kernel_size - 1) // 2))
        self.gate_conv = nn.Conv2d(in_channels, out_channels,
                                   kernel_size=(1, kernel_size),
                                   padding=(0, (kernel_size - 1) // 2))

    def forward(self, x):
        # x shape: (batch, channels, nodes, time)
        P = self.conv(x)
        Q = self.gate_conv(x)
        return P * torch.sigmoid(Q)  # gated linear unit: P * sigmoid(Q)

class SpatialGraphConvLayer(nn.Module):
    def __init__(self, in_channels, out_channels, num_nodes):
        super(SpatialGraphConvLayer, self).__init__()
        self.theta = nn.Parameter(torch.FloatTensor(in_channels, out_channels))
        nn.init.xavier_uniform_(self.theta)

    def forward(self, x, adj_matrix):
        # x shape: (batch, channels, nodes, time)
        batch_size, in_channels, num_nodes, time_steps = x.shape
        # Reshape for matrix multiplication
        x_reshaped = x.permute(0, 3, 2, 1).reshape(-1, num_nodes, in_channels)
        # Symmetrically normalized adjacency: D^{-1/2} A D^{-1/2}
        degree = torch.sum(adj_matrix, dim=1)
        D_sqrt_inv = torch.diag(torch.pow(degree + 1e-6, -0.5))
        A_norm = torch.matmul(torch.matmul(D_sqrt_inv, adj_matrix), D_sqrt_inv)
        # Apply graph convolution: A_norm @ X @ Theta
        output = torch.matmul(torch.matmul(A_norm, x_reshaped), self.theta)
        # Reshape back to (batch, channels, nodes, time)
        output = output.reshape(batch_size, time_steps, num_nodes, -1)
        output = output.permute(0, 3, 2, 1)
        return output

class STConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels, num_nodes, kernel_size=3):
        super(STConvBlock, self).__init__()
        self.temporal1 = TemporalConvLayer(in_channels, out_channels, kernel_size)
        self.spatial = SpatialGraphConvLayer(out_channels, out_channels, num_nodes)
        self.temporal2 = TemporalConvLayer(out_channels, out_channels, kernel_size)
        self.batch_norm = nn.BatchNorm2d(out_channels)

    def forward(self, x, adj_matrix):
        # x shape: (batch, channels, nodes, time)
        t1 = self.temporal1(x)
        s = self.spatial(t1, adj_matrix)
        t2 = self.temporal2(s)
        out = self.batch_norm(t2)
        return F.relu(out)

class STGCN(nn.Module):
    def __init__(self, num_nodes, in_channels, hidden_channels, out_channels,
                 num_blocks=2, kernel_size=3):
        super(STGCN, self).__init__()
        self.st_blocks = nn.ModuleList([
            STConvBlock(in_channels if i == 0 else hidden_channels,
                        hidden_channels, num_nodes, kernel_size)
            for i in range(num_blocks)
        ])
        self.output_layer = nn.Conv2d(hidden_channels, out_channels,
                                      kernel_size=(1, 1))

    def forward(self, x, adj_matrix):
        # x shape: (batch, channels, nodes, time)
        for block in self.st_blocks:
            x = block(x, adj_matrix)
        # Output projection
        out = self.output_layer(x)
        return out

# Example usage
num_nodes = 207
input_dim = 1  # e.g., traffic speed
hidden_dim = 64
output_dim = 1
sequence_length = 12

model = STGCN(num_nodes, input_dim, hidden_dim, output_dim, num_blocks=2)
adj = torch.rand(num_nodes, num_nodes)  # placeholder adjacency matrix
x = torch.randn(32, input_dim, num_nodes, sequence_length)  # batch of sequences
output = model(x, adj)
print(f"Output shape: {output.shape}")
6. Practical considerations and applications
Implementing deep learning models for traffic forecasting involves several practical considerations beyond just model architecture. Success in real-world deployments requires attention to data preprocessing, graph construction, training strategies, and evaluation metrics.
Data preprocessing and feature engineering
Traffic data often comes from loop detectors, GPS traces, or traffic cameras, each with unique characteristics and noise patterns. Essential preprocessing steps include the following (the core steps are sketched in code after the list):
- Missing value imputation: Sensors frequently fail or report erroneous readings. Common strategies include linear interpolation, seasonal decomposition, or using neighboring sensors to fill gaps.
- Normalization: Traffic speeds and flows vary significantly across different road types. Z-score normalization per sensor ensures stable training: \( x_{norm} = \frac{x - \mu}{\sigma} \)
- Temporal features: Adding time-of-day, day-of-week, and holiday indicators as additional input features helps capture periodic patterns
- Weather integration: External factors like rain, snow, or special events significantly impact traffic and can be incorporated as auxiliary features
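A minimal preprocessing sketch with pandas, covering imputation, per-sensor z-score normalization, and temporal features (the synthetic data and 5-minute frequency are placeholders):
import numpy as np
import pandas as pd

# Hypothetical readings: rows = 5-minute timestamps, columns = sensors; NaN = sensor failure
df = pd.DataFrame(np.random.rand(288, 4) * 60,
                  index=pd.date_range("2024-01-01", periods=288, freq="5min"))
df.iloc[10:14, 2] = np.nan

# Missing value imputation by linear interpolation along the time axis
df = df.interpolate(method="linear", limit_direction="both")

# Z-score normalization per sensor (keep mu/sigma to invert predictions later)
mu, sigma = df.mean(), df.std()
df_norm = (df - mu) / sigma

# Temporal features: time of day and day of week as extra inputs
time_of_day = df.index.hour * 60 + df.index.minute
day_of_week = df.index.dayofweek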
Graph construction strategies
The choice of graph structure significantly impacts model performance. Several approaches exist:
- Distance-based: Connect nodes within a threshold distance (e.g., 10 km radius)
- Road network connectivity: Use actual road connections from OpenStreetMap or similar sources
- Correlation-based: Compute pairwise correlations from historical data and threshold to create edges
- Adaptive graphs: Learn the graph structure jointly with the forecasting task
For large networks with thousands of nodes, sparse graphs are essential for computational efficiency. Adaptive thresholding or k-nearest neighbors approaches help maintain sparsity while preserving important connections.
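As an illustration of distance-based construction with sparsification, the sketch below builds a weighted adjacency from pairwise road distances with a Gaussian kernel and keeps only each node's k nearest neighbors (the bandwidth choice and k are illustrative):
import numpy as np

def gaussian_kernel_adjacency(dist, k=3):
    """Weighted adjacency from pairwise road distances:
    A_ij = exp(-dist_ij^2 / sigma^2), sparsified by keeping each node's
    k largest weights."""
    sigma = dist.std()  # a common bandwidth heuristic
    A = np.exp(-np.square(dist / sigma))
    # Keep only the k largest weights per row to maintain sparsity
    keep = np.argsort(A, axis=1)[:, -k:]
    mask = np.zeros_like(A, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return np.where(mask, A, 0.0)

dist = np.random.rand(10, 10) * 5.0  # placeholder pairwise distances in km
A = gaussian_kernel_adjacency(dist)
print((A > 0).sum(axis=1))  # each node keeps at most k edges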
Training strategies
Training deep spatial-temporal models requires careful tuning:
- Loss functions: Mean Absolute Error (MAE) is commonly used as it’s robust to outliers, but weighted losses can emphasize certain time periods or locations; a masked variant that ignores missing readings is sketched after this list
- Curriculum learning: Start with short prediction horizons and gradually increase to longer horizons
- Multi-task learning: Training to predict multiple outputs (speed, flow, occupancy) simultaneously can improve generalization
- Data augmentation: Temporal shifting, adding noise, or creating synthetic scenarios helps prevent overfitting
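To make the loss-function point concrete, below is a sketch of a masked MAE, a common convention when missing sensor readings are encoded as zeros (the null_value convention is an assumption; adapt it to your dataset):
import torch

def masked_mae_loss(pred, target, null_value=0.0):
    """MAE that ignores entries where the ground truth equals null_value,
    a common convention when missing readings are encoded as zeros."""
    mask = (target != null_value).float()
    mask = mask / (mask.mean() + 1e-8)  # re-weight so the loss scale stays stable
    loss = torch.abs(pred - target) * mask
    return loss.mean()

pred = torch.randn(32, 12, 207, 1)
target = torch.randn(32, 12, 207, 1)
print(masked_mae_loss(pred, target))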
Evaluation metrics
Traffic forecasting models are typically evaluated using:
- Mean Absolute Error (MAE): \( \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i| \)
- Root Mean Square Error (RMSE): \( \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \)
- Mean Absolute Percentage Error (MAPE): \( \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \)
It’s important to report metrics across different prediction horizons (e.g., 15, 30, 60 minutes ahead) since accuracy typically degrades for longer horizons.
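These metrics are simple to compute directly; here is a small numpy sketch that reports each metric per horizon (the array shapes and 5-minute resolution are assumptions for illustration):
import numpy as np

def evaluate(y_true, y_pred):
    """MAE, RMSE, and MAPE per prediction horizon (axis 1 = horizon)."""
    err = y_true - y_pred
    mae = np.abs(err).mean(axis=(0, 2))
    rmse = np.sqrt(np.square(err).mean(axis=(0, 2)))
    mape = 100 * np.abs(err / np.maximum(np.abs(y_true), 1e-6)).mean(axis=(0, 2))
    return mae, rmse, mape

y_true = np.random.rand(100, 12, 207) * 60  # (samples, horizon, sensors)
y_pred = y_true + np.random.randn(100, 12, 207)
mae, rmse, mape = evaluate(y_true, y_pred)
print(f"15-min-ahead MAE: {mae[2]:.2f}")    # horizon index 2 = 3 x 5 minutes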
Real-world applications
Deep learning traffic forecasting has been successfully deployed in various scenarios:
- Intelligent transportation systems: Real-time routing and signal control based on predicted traffic conditions
- Navigation apps: Companies like Google Maps and Waze use these techniques to provide accurate travel time estimates
- City planning: Urban planners use traffic forecasts to evaluate infrastructure projects and policy changes
- Ride-sharing optimization: Platforms like Uber and Lyft predict demand and supply to optimize driver positioning
One compelling case involves a major metropolitan area that deployed DCRNN-based forecasting to manage highway traffic during peak hours. By predicting congestion 30 minutes in advance, traffic management systems could proactively adjust ramp metering rates, reducing average travel times by 12% during rush hour.
7. Conclusion
The fusion of graph neural networks and recurrent neural networks has revolutionized traffic forecasting, enabling accurate predictions that account for complex spatial-temporal dependencies. The diffusion convolutional recurrent neural network and spatio-temporal graph convolutional networks represent significant advances in data-driven traffic forecasting, offering powerful tools for understanding and managing urban mobility.
These deep learning approaches continue to evolve, with ongoing research exploring attention mechanisms, transfer learning across cities, and integration with reinforcement learning for traffic control. As cities grow smarter and sensor networks become more pervasive, the importance of accurate traffic forecasting will only increase, making these techniques essential tools for building sustainable, efficient urban transportation systems.