8.8.1. Tutorial 1: Writing your own tangent-linear and adjoint code

In this tutorial, we essentially follow Appendix A of Marotzke et al. (1999). The difference is that we do it in Python. The paper does not give the tangent linear code, but we provide one here, since it is helpful when writing the adjoint code. The tape (storage) for the adjoint is mimicked using a Python list treated as a stack.

We start by importing the only library we need, which is numpy.

[1]:
import numpy as np

8.8.1.1. Forward model

The forward model (not to be confused with the tangent linear model) performs non-linear quadratic operations on the input \(\mathbf{x}_0\), which is copied to \(x\). This is followed by a “convective adjustment”. \(\mathbf{x}_0\) is our control variable, and for the given choice of \(\mathbf{x}_0 = (1,3,3)\), the cost function \(fc\) evaluates to 40.25.

We call the function forward_without_tape because it does not store the interim vectors to tape, which would later aid the adjoint calculation. A separate function forward does that, and is illustrated further below.

[2]:
def forward_without_tape(x0 = np.array([1, 3, 3])):

    x = np.zeros(3)

    for i in range(3):
        x[i] = x0[i]

    y = x[0]**2

    for i in range(3):

        x[i] = y + x[i]**2

    # Convective adjustment

    for i in range(2):

        if (x[i] < x[i+1]):
            x[i] = 0.5*(x[i] + x[i+1])
            x[i+1] = x[i]

    # Cost function

    fc = (x[0]-5.5)**2 + 2.*x[1] + 3.*x[2]

    return fc

print(f"Cost function for given x0: {forward_without_tape()}")
Cost function for given x0: 40.25

8.8.1.1.1. Getting the finite difference gradient from the Forward model

We can use finite differences with the central differencing scheme to get the gradient evaluated at the given value of \(\mathbf{x}_0\). For an \(N\)-dimensional control vector, the forward model needs to be called \(2N\) times to get an approximate gradient. This is because we can only evaluate directional derivatives, which we must do \(N\) times to obtain the gradient, and each directional derivative evaluation in turn requires 2 forward model evaluations.

Mathematically, this can be formulated as

\[\left (\nabla_{\mathbf{x}_0} fc(\mathbf{x}_0), \hat{\mathbf{e}} \right) = \frac{fc\left(\mathbf{x}_0+\epsilon \hat{\mathbf{e}}\right) - fc\left(\mathbf{x}_0-\epsilon \hat{\mathbf{e}}\right)}{2\epsilon} + \mathcal{O}(\epsilon^2)\]

Here \((\cdot,\cdot)\) denotes the standard inner product of two discrete vectors in \(\mathbb{R}^N\).

The left side is the directional derivative of \(fc\) at \(\mathbf{x}_0\) in the direction \(\hat{\mathbf{e}}\), while the right side is the central differences approximation of the directional derivative.

The choice of \(\epsilon\) can be critical; however, we do not discuss that here and simply choose \(\epsilon = 0.001\). In general, one would perform a convergence analysis over a range of values, e.g. \(\epsilon = \{0.1, 0.01, 0.001, \ldots\}\).

To get the first component of the gradient at \(\mathbf{x}_0\), we choose \(\epsilon = 0.001\) and \(\hat{\mathbf{e}} = \hat{\mathbf{e}}_1 = (1,0,0)\); for the second and third components we use \(\hat{\mathbf{e}}_2 = (0,1,0)\) and \(\hat{\mathbf{e}}_3 = (0,0,1)\), respectively.

[3]:
eps = 0.001
x0 = np.array([1,3,3])
e1 = np.array([1,0,0])
e2 = np.array([0,1,0])
e3 = np.array([0,0,1])
[4]:
g1 = (forward_without_tape(x0 = x0 + eps*e1) - forward_without_tape(x0 = x0 - eps*e1))/(2*eps)

print(f"First component of gradient for given x0: {g1:.2f}")
First component of gradient for given x0: 15.50
[5]:
g2 = (forward_without_tape(x0 = x0 + eps*e2) - forward_without_tape(x0 = x0 - eps*e2))/(2*eps)

print(f"Second component of gradient for given x0: {g2:.2f}")
Second component of gradient for given x0: 10.50
[6]:
g3 = (forward_without_tape(x0 = x0 + eps*e3) - forward_without_tape(x0 = x0 - eps*e3))/(2*eps)

print(f"Third component of gradient for given x0: {g3:.2f}")
Third component of gradient for given x0: 15.00
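
As a sanity check on the choice of \(\epsilon\), one could sweep a few values and confirm that the gradient estimate is insensitive to it. The sketch below (values of \(\epsilon\) chosen purely for illustration) reuses forward_without_tape for the first gradient component.

for eps_test in [0.1, 0.01, 0.001, 0.0001]:
    g1_test = (forward_without_tape(x0 = x0 + eps_test*e1)
               - forward_without_tape(x0 = x0 - eps_test*e1))/(2*eps_test)
    print(f"eps = {eps_test:g}: first gradient component = {g1_test:.6f}")
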
[7]:
from IPython.display import Markdown as md
md("Therefore the approximated finite differences gradient at $\mathbf{x}_0 = (1,3,3)$ is given by $\mathbf{g} = (%.2f, %.2f, %.2f)$"%(g1,g2, g3))
[7]:

Therefore the approximated finite differences gradient at \(\mathbf{x}_0 = (1,3,3)\) is given by \(\mathbf{g} = (15.50, 10.50, 15.00)\)

8.8.1.2. Tangent linear mode

We now illustrate how to write the tangent linear code for the forward model shown above. Mathematically, a tangent linear model is a linearization of a non-linear forward model.

If the forward model is \(\mathbf{x}_{\text{new}} = A(\mathbf{x})\), where \(A\) can be non-linear, the linearized statement reads \(\mathbf{dx}_{\text{new}} = \mathbf{B}\,\mathbf{dx}\), where \(\mathbf{B}_{ij} = \frac{\partial [A(\mathbf{x})]_i}{\partial x_j}\) is the Jacobian of the non-linear operator \(A\).

An important subtlety here is that \(\mathbf{B}\) depends on \(\mathbf{x}\), not on \(\mathbf{x}_{\text{new}}\). From a coding perspective, this means that a linearized statement is always placed before the original non-linear one, because the linearized statement generally needs the old value of the variable; if the non-linear statement were executed first, it would already have updated \(\mathbf{x}\) to \(\mathbf{x}_{\text{new}}\) (see the one-line sketch below).
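
As a one-line sketch (with illustrative numbers, not part of the model below), consider the statement x = x**2, whose linearization is xd = 2*x*xd. The linearized statement must be executed first so that it sees the old value of x.

x, xd = 3.0, 1.0   # a value and its perturbation (illustrative numbers)

xd = 2.0*x*xd      # linearized statement first: uses the old x, gives 6.0
x  = x**2          # nonlinear statement second: x becomes 9.0

# If x = x**2 were executed first, 2*x*xd would be evaluated at the new
# value 9.0 and wrongly give 18.0.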

One would expect a linearized model to be computationally cheaper than its non-linear counterpart. However, the computations in the linearized model require variables from the original non-linear model, so the non-linear forward model must be evaluated alongside it. This makes the cost approximately twice that of the original non-linear model.

This can be seen clearly below: for each statement in the non-linear forward model, the tangent linear model forward_d has 2 statements. It is helpful to view each line of code as a general non-linear operation \(\mathbf{x}_{\text{new}} = A(\mathbf{x})\), which we linearize to \(\mathbf{dx}_{\text{new}} = \mathbf{B}\,\mathbf{dx}\). The Jacobian matrices \(\mathbf{B}\), which are functions of \(\mathbf{x}\), are given in the comments above the respective code statements.

[8]:
def forward_d(x0 = np.array([1, 3, 3]), x0d = [1, 0, 0]):

    x = np.zeros(3)
    xd = np.zeros(3)



    for i in range(3):

        # [xd[i]]  = [0. 1.] [xd[i]]
        # [x0d[i]] = [0. 1.] [x0d[i]]

        xd[i] = x0d[i]
        x[i] = x0[i]

    # [yd]    = [0. 2*x[0]] [yd]
    # [xd[0]] = [0.     1.] [xd[0]]

    yd = 2*x[0]*xd[0]
    y = x[0]**2

    for i in range(3):

        # [xd[i]]  = [2*x[i]  1.] [xd[i]]
        # [yd]     = [0.      1.] [yd]

        xd[i] = yd + 2*x[i]*xd[i]
        x[i] = y + x[i]**2

    # Convective adjustment

    for i in range(2):

        if (x[i] < x[i+1]):

            # [xd[i]]   = [0.5 0.5] [xd[i]]
            # [xd[i+1]] = [0.   1.] [xd[i+1]]

            xd[i] = 0.5*(xd[i] + xd[i+1])
            x[i] = 0.5*(x[i] + x[i+1])

            # [xd[i]]   = [1. 0.] [xd[i]]
            # [xd[i+1]] = [1. 0.] [xd[i+1]]

            xd[i+1] = xd[i]
            x[i+1] = x[i]

    # [xd[0]] = [1.           0. 0. 0.] [xd[0]]
    # [xd[1]] = [0.           1. 0. 0.] [xd[1]]
    # [xd[2]] = [0.           0. 1. 0.] [xd[2]]
    # [fcd]   = [2*(x[0]-5.5) 2. 3. 0.] [fcd]

    fcd = 2*(x[0]-5.5)*xd[0] + 2.*xd[1] + 3.*xd[2]
    fc = (x[0]-5.5)**2 + 2.*x[1] + 3.*x[2]


    return fcd

The tangent linear model, for a given \(\mathbf{dx}_0\) (represented by \(x0d\) in the code above), returns a scalar value \(fcd\), which is the directional derivative in the direction \(\mathbf{dx}_0\). To get the gradient, we have to evaluate the directional derivative in each of the basis directions \(\hat{\mathbf{e}}_1, \hat{\mathbf{e}}_2, \hat{\mathbf{e}}_3\), just as we did for the finite differences gradient.

[9]:
(g1, g2, g3) = forward_d(x0d = [1, 0, 0]), forward_d(x0d = [0, 1, 0]), forward_d(x0d = [0, 0, 1])
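
As an optional cross-check against the finite differences above, the tangent linear directional derivative in an arbitrary direction should agree with its central-difference counterpart; the direction v below is chosen purely for illustration.

v = np.array([1.0, -2.0, 0.5])   # an arbitrary (illustrative) direction

tlm_dd = forward_d(x0d = v)
fd_dd  = (forward_without_tape(x0 = x0 + eps*v)
          - forward_without_tape(x0 = x0 - eps*v))/(2*eps)

print(f"Tangent linear: {tlm_dd:.4f}, finite differences: {fd_dd:.4f}")
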
[10]:
from IPython.display import Markdown as md
md("Therefore the tangent linear evaluated gradient at $\mathbf{x}_0 = (1,3,3)$ is given by $\mathbf{g} = (%.2f, %.2f, %.2f)$"%(g1,g2, g3))
[10]:

Therefore the tangent linear evaluated gradient at \(\mathbf{x}_0 = (1,3,3)\) is given by \(\mathbf{g} = (15.50, 10.50, 15.00)\)

8.8.1.3. Adjoint mode

We now illustrate how to write the adjoint code for the forward model shown above. To better understand this, we consider a non-linear forward model \(\mathbf{y} = A(\mathbf{x})\), where the non-linear operator \(A\) can be viewed as a composition of functions, \(A = A_3 \circ A_2 \circ A_1\). This composition makes it easier to see how the adjoint reverses the flow of the computation.

\[\therefore \mathbf{y} = A_3(A_2(A_1(\mathbf{x})))\]

One can even interpret \(A_1, A_2, A_3\) as being individual lines of code. We also define the interim variables as \(\mathbf{x}_1 = A_1(\mathbf{x}), \mathbf{x}_2 = A_2(A_1(\mathbf{x})) = A_2(\mathbf{x}_1), \mathbf{y} = A_3(A_2(A_1(\mathbf{x}))) = A_3(\mathbf{x}_2)\).

Let us assume that the cost function \(\mathcal{J}\) is a function of \(\mathbf{y}\), i.e. \(\mathcal{J} = \mathcal{J}(\mathbf{y})\). We wish to evaluate the sensitivity of \(\mathcal{J}\) to \(\mathbf{x}\), i.e. \(\nabla_{\mathbf{x}} \mathcal{J}\), which can be evaluated using the chain rule. In the expression below, both the LHS and the RHS are row vectors.

\[\mathbf{g}^T = \nabla_{\mathbf{x}} \mathcal{J}^T = \frac{d\mathcal{J}}{d\mathcal{J}}\nabla_{\mathbf{y}} \mathcal{J}^T \:\mathbf{B}_3(\mathbf{x}_2) \:\mathbf{B}_2(\mathbf{x}_1) \:\mathbf{B}_1(\mathbf{x}) = 1 \times \nabla_{\mathbf{y}} \mathcal{J}^T \:\mathbf{B}_3(\mathbf{x}_2) \:\mathbf{B}_2(\mathbf{x}_1) \:\mathbf{B}_1(\mathbf{x})\]

Here \(\mathbf{B}_i\) is the Jacobian matrix of \(A_i\). There are two ways to calculate this sensitivity:

  1. Take the inner product of both the LHS and the RHS with the unit basis vectors \(\hat{\mathbf{e}}_i\). Do this \(N\) times to get all \(N\) components of the gradient.

\[\mathbf{g}_i = \nabla_{\mathbf{x}} \mathcal{J}^T \hat{\mathbf{e}}_i = 1 \times \nabla_{\mathbf{y}} \mathcal{J}^T \:\mathbf{B}_3(\mathbf{x}_2) \:\mathbf{B}_2(\mathbf{x}_1) \:\mathbf{B}_1(\mathbf{x}) \hat{\mathbf{e}}_i\]

This is exactly what the tangent linear model does. If you go back to the discussion of the tangent linear model, you see that we substitute \(\hat{\mathbf{e}}_i\) for \(\mathbf{dx}\) and evaluate the directional derivative \(N\) times to assemble the gradient.

  2. Transpose both the LHS and the RHS.

\[\mathbf{g} = \nabla_{\mathbf{x}} \mathcal{J} = \mathbf{B}_1^T(\mathbf{x}) \:\mathbf{B}_2^T(\mathbf{x}_1) \:\mathbf{B}_3^T(\mathbf{x}_2) \:\nabla_{\mathbf{y}} \mathcal{J} \times 1\]

We see the advantage clearly: if we multiply starting from the right, we only ever perform matrix-vector products, and we obtain the entire gradient in a single pass. This is what the adjoint model does.
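
The point can be illustrated numerically with random matrices standing in for the Jacobians and the cost gradient (a hypothetical sketch, not the model of this tutorial): both orderings give the same gradient, but the transposed chain needs only matrix-vector products and a single pass.

rng = np.random.default_rng(0)
N = 4
B1, B2, B3 = rng.random((N, N)), rng.random((N, N)), rng.random((N, N))
gy = rng.random(N)                     # stands in for grad_y J

# Tangent-linear style: apply the chain to each basis vector e_i (N passes)
g_tlm = np.array([gy @ (B3 @ (B2 @ (B1 @ e))) for e in np.eye(N)])

# Adjoint style: transpose the chain and do matrix-vector products (1 pass)
g_adj = B1.T @ (B2.T @ (B3.T @ gy))

print(np.allclose(g_tlm, g_adj))       # expect True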

The reverse nature of the adjoint model is now apparent. In the non-linear forward model and the tangent linear model, \(A_1, \mathbf{B}_1\) act first, followed by \(A_2, \mathbf{B}_2\), and finally \(A_3, \mathbf{B}_3\). In the adjoint model, \(\mathbf{B}_3^T\) acts first, followed by \(\mathbf{B}_2^T\) and finally \(\mathbf{B}_1^T\). Furthermore, the adjoint calculation requires \(\mathbf{y}\) first, even though it is computed last in the forward model, followed by \(\mathbf{x}_2\), \(\mathbf{x}_1\), and \(\mathbf{x}\).

Therefore, we need to run the non-linear forward model and store (push to a stack named tape) \(\mathbf{x}, \mathbf{x}_1, \mathbf{x}_2\), and \(\mathbf{y}\), and then retrieve them (pop from the stack) in reverse order during the adjoint computation.

All of this generalizes to \(A = A_M \circ A_{M-1} \circ \ldots \circ A_1\). If we view \(\mathbf{x}_i = A_i(\mathbf{x}_{i-1})\) as one line of code, then \(d\mathbf{x}_i = \mathbf{B}_i(\mathbf{x}_{i-1})\,d\mathbf{x}_{i-1}\) is the linearization of that line, and \(\nabla_{\mathbf{x}_{i-1}} \mathcal{J} = \mathbf{B}_i^T \nabla_{\mathbf{x}_{i}} \mathcal{J}\) is its adjoint. By definition, within the code \(xib = \nabla_{\mathbf{x}_{i}} \mathcal{J}\); in particular, the adjoint of the cost function itself (\(fcb\) in the code below) is \(\frac{d\mathcal{J}}{d\mathcal{J}} = 1\).
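
A minimal sketch of this generic pattern, with two hypothetical maps \(A_1, A_2\) and their Jacobians (none of these are part of the tutorial's model), shows the forward sweep pushing each input state to the tape and the adjoint sweep popping them in reverse order while applying the transposed Jacobians:

A1 = lambda x: np.array([x[0]*x[1], x[1]**2])
B1 = lambda x: np.array([[x[1], x[0]], [0.0, 2*x[1]]])   # Jacobian of A1
A2 = lambda x: np.array([x[0]**2, x[0] + x[1]])
B2 = lambda x: np.array([[2*x[0], 0.0], [1.0, 1.0]])     # Jacobian of A2

dJ = lambda y: np.array([1.0, 2.0])   # gradient of J(y) = y[0] + 2*y[1]

x = np.array([1.0, 2.0])

# Forward sweep: push each input state before it is consumed
tape = []
tape.append(np.copy(x));  x1 = A1(x)
tape.append(np.copy(x1)); y  = A2(x1)

# Adjoint (reverse) sweep: pop states in LIFO order and apply B_i^T
g = dJ(y)                    # grad_y J
g = B2(tape.pop()).T @ g     # grad_{x1} J
g = B1(tape.pop()).T @ g     # grad_x J
print(g)                     # gradient of J with respect to x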

Now we see why writing the tangent linear code was useful: the Jacobian matrices computed and shown in its comments just need to be transposed and reused in the adjoint code.

8.8.1.3.1. Forward mode with tape used for adjoint

We now define the forward model with tape functionality included, which is needed to compute the adjoint. The tape is essentially a stack, i.e. a last-in, first-out (LIFO) data structure, that stores the vectors needed during the adjoint computation, whose flow is reversed compared to the forward and tangent linear models.
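
A Python list provides exactly this behaviour: append pushes an item onto the tape and pop retrieves the most recently stored item first. A tiny illustration:

tape = []
tape.append("first stored")    # push
tape.append("second stored")   # push
print(tape.pop())              # -> "second stored" (last in, first out)
print(tape.pop())              # -> "first stored"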

[11]:
def forward(x0 = np.array([1, 3, 3])):

    x = np.zeros(3)
    tape = []
    for i in range(3):
        x[i] = x0[i]

    y = x[0]**2

    tape.append(np.copy(x))

    for i in range(3):

        x[i] = y + x[i]**2

    # Convective adjustment

    for i in range(2):

        tape.append(np.copy(x))

        if (x[i] < x[i+1]):
            x[i] = 0.5*(x[i] + x[i+1])
            x[i+1] = x[i]

    tape.append(np.copy(x))

    fc = (x[0]-5.5)**2 + 2.*x[1] + 3.*x[2]

    return fc, tape

print(f"Cost function for given x0: {forward()[0]}")
print(f"Tape for given x0: {forward()[1]}")
Cost function for given x0: 40.25
Tape for given x0: [array([1., 3., 3.]), array([ 2., 10., 10.]), array([ 6.,  6., 10.]), array([6., 8., 8.])]

8.8.1.3.2. Code for adjoint mode

In this case, all the adjoint variables are suffixed with \(b\). The comments show the adjoint matrices used for the operations; they are exactly the transposes of the Jacobian matrices shown in the tangent linear code comments. We pop the required value of \(\mathbf{x}\) from the tape whenever it is needed.

There is an important subtlety here as well: sometimes the adjoint statements must not be executed in the same order as the rows of the matrix shown in the comments. See the comment marked LOOK HERE! below to understand why.

[12]:
_, tape = forward(x0 = np.array([1, 3, 3]))

def forward_b(x0 = np.array([1, 3, 3]), tape = tape):

    xb = np.zeros(3)
    x0b = np.zeros(3)
    yb = 0.0

    fcb = 1.0

    # [xb[0]] = [1. 0. 0. 2*(x[0]-5.5)] [xb[0]]
    # [xb[1]] = [0. 1. 0. 2.          ] [xb[1]]
    # [xb[2]] = [0. 0. 1. 3.          ] [xb[2]]
    # [fcb]   = [0. 0. 0. 0.          ] [fcb]

    x = tape.pop()

    xb[0] = xb[0] +  2*(x[0]-5.5) * fcb
    xb[1] = xb[1] +  2.           * fcb
    xb[2] = xb[2] +  3.           * fcb

    for i in range(1,-1,-1):

        x = tape.pop()

        if (x[i] < x[i+1]):

            # [xb[i]]   = [1. 1.] [xb[i]]
            # [xb[i+1]] = [0. 0.] [xb[i+1]]

            xb[i] = xb[i] + xb[i+1]
            xb[i+1] = 0.

            # [xb[i]]   = [0.5 0.] [xb[i]]
            # [xb[i+1]] = [0.5 1.] [xb[i+1]]
            # LOOK HERE!
            # We first compute xb[i+1] even though it is the second entry in the matrix.
            # This is because it depends on old value of xb[i].
            # If we compute xb[i] first, we lose that old value which is needed for xb[i+1].

            xb[i+1] = 0.5*xb[i] + xb[i+1]
            xb[i]   = 0.5*xb[i]

    x = tape.pop()

    for i in range(2,-1,-1):

        # [xb[i]]  = [2*x[i] 0.] [xb[i]]
        # [yb]     = [1.      1.] [yb]

        yb = yb + xb[i]
        xb[i] = 2*x[i]*xb[i]

    # [yb]    = [0.     0.] [yb]
    # [xb[0]] = [2*x[0] 1.] [xb[0]]

    xb[0] = xb[0] + 2*x[0]*yb
    yb    = 0.0

    for i in range(2,-1,-1):

        # [xb[i]]  = [0. 0.] [xb[i]]
        # [x0b[i]] = [1. 1.] [x0b[i]]

        x0b[i] = x0b[i] + xb[i]
        xb[i]  = 0.0

    return x0b

As mentioned above, we first run the non-linear forward code to store the required variables to tape, and then use this tape for the adjoint calculation.

[13]:
_, tape = forward(x0 = np.array([1, 3, 3]))
(g1, g2, g3) = forward_b(x0 = np.array([1, 3, 3]), tape = tape)
[14]:
from IPython.display import Markdown as md
md("Therefore the adjoint evaluated gradient at $\mathbf{x}_0 = (1,3,3)$ is given by $\mathbf{g} = (%.2f, %.2f, %.2f)$"%(g1,g2, g3))
[14]:

Therefore the adjoint evaluated gradient at \(\mathbf{x}_0 = (1,3,3)\) is given by \(\mathbf{g} = (15.50, 10.50, 15.00)\)

We see that we get the same gradient using finite differences, the tangent linear model, and the adjoint model. The adjoint model is computationally far more efficient when the size of the control vector is large, since it delivers the entire gradient from a single forward and adjoint sweep instead of \(N\) tangent linear or \(2N\) finite-difference model evaluations.
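
Finally, a standard consistency ("dot product") test between the tangent linear and adjoint codes: for any direction \(\mathbf{dx}_0\), the tangent linear output should equal the inner product of the adjoint gradient with that direction. The direction v below is purely illustrative.

v = np.array([0.3, -1.2, 2.0])   # an arbitrary (illustrative) direction

_, tape = forward(x0 = np.array([1, 3, 3]))
g_adj = forward_b(x0 = np.array([1, 3, 3]), tape = tape)

print(np.allclose(forward_d(x0d = v), np.dot(g_adj, v)))   # expect True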

Acknowledgments

Thanks to An T. Nguyen for fruitful discussions regarding this code.