
Regularization techniques improve the stability and generalization of parameter estimation by adding penalty terms to the objective function. When fitting battery models, regularization helps address common challenges: noisy data, correlated parameters, and limited experimental conditions that leave some parameters poorly constrained.

Why Regularization?

Standard least-squares fitting minimizes the error between model predictions and data. However, this can lead to problems:
  • Overfitting: The optimizer finds parameter values that match noise in the training data, leading to poor predictions on new data
  • Ill-conditioning: When parameters are correlated (e.g., electrode thickness and diffusivity both affect time constants), small data perturbations cause large parameter swings
  • Non-identifiability: Some parameters may not be uniquely determined by the available data
Regularization addresses these issues by penalizing extreme parameter values, effectively encoding prior knowledge that parameters should stay within reasonable ranges.

Ridge Regression

Ridge regression adds an L2 penalty (sum of squared parameter values) to the least-squares objective. This shrinks parameter estimates toward zero, reducing variance at the cost of introducing some bias.

Problem Formulation

The ridge regression objective is:

$$
x_{\text{RR}}^* = \arg\min_x \sum_i^N r_i(x)^2 + \lambda \sum_j^M x_j^2
$$

with residuals:

$$
r(x) = Ax - b
$$

where:
  • x ∈ ℝ^M – vector of parameters to estimate
  • A ∈ ℝ^{N×M} – design matrix (model predictions as a function of parameters)
  • b ∈ ℝ^N – observed data
  • λ ∈ [0, +∞) – regularization strength
The first term measures data fidelity (how well the model fits the data), while the second term penalizes large parameter values. The hyperparameter λ controls the tradeoff: larger λ means stronger regularization and more shrinkage toward zero.
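For a linear model, this objective has a closed-form solution. The following NumPy sketch (illustrative only, not part of ionworkspipeline) shows the solution and how regularization stabilizes an ill-conditioned fit with correlated columns:

```python
import numpy as np

def ridge_fit(A, b, lam):
    """Closed-form ridge solution: x* = (A^T A + lam*I)^{-1} A^T b."""
    M = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(M), A.T @ b)

# Toy problem with two nearly collinear columns (ill-conditioned)
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2))
A[:, 1] = A[:, 0] + 1e-3 * rng.normal(size=50)  # near-duplicate column
b = A @ np.array([1.0, 1.0]) + 0.01 * rng.normal(size=50)

x_ols = ridge_fit(A, b, lam=0.0)    # unregularized: sensitive to noise
x_ridge = ridge_fit(A, b, lam=1.0)  # shrunk toward zero, more stable
```

Because the ridge minimizer's norm is non-increasing in λ, the regularized estimate is never larger (in the L2 sense) than the unregularized one.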

Normalization Requirement

For the L2 penalty to treat all parameters equally, both the residuals and parameters must be on comparable scales. This is typically achieved by Z-scoring (standardizing to zero mean and unit variance):

$$
\hat{A} = \frac{A - \text{mean}(A, \text{axis}=0)}{\text{std}(A, \text{axis}=0)}
$$

$$
\hat{b} = \frac{b - \text{mean}(b)}{\text{std}(b)}
$$

Without normalization, parameters with larger natural scales would be penalized more heavily, distorting the regularization.
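A minimal Z-scoring helper in NumPy (a generic sketch, not an ionworkspipeline API):

```python
import numpy as np

def z_score(X, axis=0):
    """Standardize to zero mean and unit variance along the given axis."""
    return (X - X.mean(axis=axis)) / X.std(axis=axis)

# Columns with very different natural scales...
A = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])
A_hat = z_score(A)  # ...are now directly comparable
```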

Hyperparameter Optimization

The regularization strength λ is a hyperparameter that must be chosen carefully. Too little regularization leaves the model prone to overfitting; too much regularization forces parameters away from their data-driven values, introducing bias. The goal is to find the λ that best balances these competing effects.

Bias-Variance Tradeoff

Regularization introduces a fundamental tradeoff between bias and variance:
  • Bias: Regularization shrinks parameters toward the prior, pulling estimates away from the “true” values. This is the cost of regularization.
  • Variance: Without regularization, estimates are highly sensitive to noise in the training data. Regularization reduces this sensitivity.
The optimal λ minimizes the total error (bias² + variance) on unseen data:

(Figure: bias-variance tradeoff)

| λ value | Training error | Validation error | Issue |
| --- | --- | --- | --- |
| Too small (λ → 0) | Low | High | Overfitting |
| Too large | High | High | Underfitting |
| Optimal (λ*) | Moderate | Low | Best generalization |

Optimization Procedure

1. Fit on training data: for a fixed value of λ, determine x*_RR using the training data.
2. Evaluate validation error: compute the prediction error on the validation set.
3. Repeat for multiple λ values: iterate steps 1-2 for several λ values.
4. Select optimal λ: choose λ* that minimizes validation error, then refit on the combined training and validation data.
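For the linear ridge case, this procedure can be sketched in a few lines of NumPy (an illustration of the workflow, not an ionworkspipeline API):

```python
import numpy as np

def ridge_fit(A, b, lam):
    """Closed-form ridge solution for a linear model."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

rng = np.random.default_rng(1)
A = rng.normal(size=(80, 5))
x_true = rng.normal(size=5)
b = A @ x_true + 0.5 * rng.normal(size=80)

# Split into training and validation sets
A_train, b_train = A[:60], b[:60]
A_val, b_val = A[60:], b[60:]

# Steps 1-3: fit on training data for each candidate lambda,
# then evaluate the validation error
lambdas = np.logspace(-3, 3, 13)
val_errors = []
for lam in lambdas:
    x = ridge_fit(A_train, b_train, lam)
    val_errors.append(np.mean((A_val @ x - b_val) ** 2))

# Step 4: pick the lambda with the lowest validation error,
# then refit on the combined training + validation data
lam_star = float(lambdas[int(np.argmin(val_errors))])
x_star = ridge_fit(A, b, lam_star)
```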

Maximum A Posteriori (MAP) Estimation

While ridge regression shrinks parameters toward zero, we often have better prior knowledge, such as literature values or physical constraints. MAP estimation with Gaussian priors generalizes ridge regression by shrinking parameters toward specified prior means rather than zero. From a Bayesian perspective, MAP estimation finds the parameter values that maximize the posterior probability given the data. With Gaussian priors and Gaussian measurement noise, this is equivalent to minimizing:

$$
x_{\text{MAP}}^* = \arg\min_x \sum_i^N \left(\frac{\hat{y}_i(x) - y_i}{\sigma_{y,i}}\right)^2 + \sum_j^M \left(\frac{x_j - \mu_j}{\sigma_{x,j}}\right)^2
$$

where:
  • ŷ_i(x) – model prediction at data point i
  • y_i – observed data at point i
  • σ_{y,i} – measurement uncertainty (standard deviation)
  • μ_j – prior mean for parameter j (e.g., literature value)
  • σ_{x,j} – prior uncertainty for parameter j
The first term is the normalized data misfit (chi-squared statistic). The second term penalizes deviations from prior expectations, weighted by prior uncertainty. Parameters with tight priors (small σ_{x,j}) are constrained more strongly.

Connection to Ridge Regression

MAP estimation is mathematically equivalent to ridge regression when parameters are centered at the prior mean and scaled by the prior standard deviation. Adding a regularization hyperparameter λ gives:

$$
x_{\text{MAP,RR}}^* = \arg\min_x \sum_i^N \left(\frac{\hat{y}_i(x) - y_i}{\sigma_{y,i}}\right)^2 + \lambda \sum_j^M \left(\frac{x_j - \mu_j}{\sigma_{x,j}}\right)^2
$$

When λ = 1, this is standard MAP estimation. When λ < 1, the data is weighted more heavily relative to the priors. When λ > 1, the priors dominate.
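Because both terms are sums of squares, a standard least-squares solver can minimize the MAP objective directly: stack the whitened data residuals with the whitened prior residuals. A SciPy sketch with a hypothetical two-parameter exponential model (all names and values illustrative):

```python
import numpy as np
from scipy.optimize import least_squares

# Toy nonlinear model: y = x0 * exp(-x1 * t)
t = np.linspace(0, 5, 40)
x_true = np.array([2.0, 0.7])
rng = np.random.default_rng(2)
sigma_y = 0.05  # measurement uncertainty
y = x_true[0] * np.exp(-x_true[1] * t) + sigma_y * rng.normal(size=t.size)

mu = np.array([1.8, 0.8])        # prior means (e.g., literature values)
sigma_x = np.array([0.5, 0.3])   # prior standard deviations

def residuals(x):
    # Stacking whitened data residuals with whitened prior residuals
    # makes the sum of squares equal the MAP objective above.
    r_data = (x[0] * np.exp(-x[1] * t) - y) / sigma_y
    r_prior = (x - mu) / sigma_x
    return np.concatenate([r_data, r_prior])

x_map = least_squares(residuals, x0=mu).x
```

With 40 well-resolved data points, the data term dominates and the MAP estimate lands close to the true parameters while the prior keeps the fit stable.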

Efficient Nonlinear Regularization

For linear models, ridge regression has an analytic solution. Nonlinear models (like battery electrochemical models) require iterative optimization, and finding the optimal λ through cross-validation would require repeated refitting, which is computationally expensive. An efficient alternative leverages two key assumptions:
  1. All parameters have priors: Every parameter has a specified prior distribution, eliminating identifiability issues where multiple parameter combinations give equivalent fits.
  2. Local quadratic approximation: Near the optimum x*_MAP, the objective function is approximately quadratic. This is valid when optimization has converged to a well-defined minimum.
Under these assumptions, the Hessian at the optimum characterizes the local curvature, and the optimal λ* can be determined efficiently from a single optimization run plus validation error evaluation, without repeatedly refitting the full model.
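To illustrate the idea (this is a generic sketch under the quadratic approximation, not the ionworkspipeline implementation): once a single fit supplies the Gauss-Newton Hessian of the data term, the regularized optimum for any λ reduces to a linear solve, so the λ sweep needs no further model evaluations. All values below are illustrative stand-ins.

```python
import numpy as np

# Pieces assumed available from one converged fit:
J = np.array([[1.0, 0.2],
              [0.1, 0.9],
              [0.5, 0.5]])            # Jacobian of whitened data residuals at the optimum
x_fit = np.array([1.1, 0.6])          # data-only (unregularized) minimizer
mu = np.array([1.0, 0.8])             # prior means
sigma_x = np.array([0.5, 0.4])        # prior standard deviations

H_data = J.T @ J                      # Gauss-Newton Hessian of the data term
H_prior = np.diag(1.0 / sigma_x**2)   # Hessian of the Gaussian prior term

def x_of_lambda(lam):
    """Minimizer of the local quadratic model:
    (x - x_fit)^T H_data (x - x_fit) + lam * (x - mu)^T H_prior (x - mu)."""
    return np.linalg.solve(H_data + lam * H_prior,
                           H_data @ x_fit + lam * H_prior @ mu)

# Sweep lambda with cheap linear algebra only -- no model refits
lambdas = np.logspace(-2, 2, 9)
candidates = [x_of_lambda(lam) for lam in lambdas]
```

At λ = 0 the sweep recovers the data-only fit, and as λ grows the solution moves toward the prior mean, matching the limiting behavior of the full objective.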

Practical Usage

To use regularization in ionworkspipeline, attach Gaussian priors to your parameters. The prior mean represents your best estimate before seeing data, and the prior standard deviation encodes your uncertainty.
import ionworkspipeline as iwp

# Define parameters with priors
parameters = {
    "Positive particle diffusivity [m2.s-1]": iwp.Parameter(
        "D_pos",
        initial_value=1e-14,
        bounds=(1e-16, 1e-12),
        prior=iwp.priors.Gaussian(mean=1e-14, std=5e-15)
    ),
    "Negative particle diffusivity [m2.s-1]": iwp.Parameter(
        "D_neg",
        initial_value=3e-14,
        bounds=(1e-16, 1e-12),
        prior=iwp.priors.Gaussian(mean=3e-14, std=1e-14)
    ),
}

# Run with regularization (`objective` and `fixed_parameters` are assumed
# to be defined elsewhere in your workflow)
datafit = iwp.DataFit(
    objectives=objective,
    parameters=parameters,
    optimizer=iwp.optimizers.ScipyLeastSquares(),
)
result = datafit.run(fixed_parameters)

Choosing Priors

Good priors come from:
  • Literature values: Published measurements for similar materials
  • Physical constraints: Known bounds from theory (e.g., diffusivity must be positive)
  • Previous experiments: Results from related cells or conditions
  • Order-of-magnitude estimates: Even rough estimates help stabilize fitting
The prior standard deviation should reflect genuine uncertainty. A narrow prior (small σ) strongly constrains the parameter; a wide prior (large σ) allows the data to dominate.
When uncertain about prior strength, start with wide priors (large σ) and tighten them only if fitting becomes unstable. Overly tight priors can prevent the optimizer from finding good solutions.
The regularization options and usage examples here are not exhaustive. See the API reference for full details on priors, constraints, and penalties.