Machine Precision Effects on Finite Difference Methods
Adventures in Computational Mathematics
Approximating Derivatives with Finite Difference Methods
Suppose we have a real-valued function $f$ of a single variable $x$, which we can evaluate for any value of $x$. Suppose we want to compute the derivative of $f$ at a point $x_0$, but we don't know a general formula for $f'$. We can approximate the derivative with the following finite difference method:
$$f'(x_0) \approx \frac{f(x_0 + h) - f(x_0)}{h}, \tag{1}$$
where $h$ is a small step size.
If we ignore the effects of floating point precision, we would choose $h$ to be as small as possible. Indeed, analytically we have
$$f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}.$$
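Before worrying about floating point, here is formula (1) written as a small Julia helper. The name forward_diff is my own and is not used elsewhere in this post; it is just a concrete reference for the formula.
forward_diff(f, x, h) = (f(x + h) - f(x)) / h   # one-sided finite difference approximation of f'(x)
# example call with a deliberately crude step size:
@show forward_diff(sin, 1.0, 0.1)
@show cos(1.0)   # exact derivative of sin at x = 1, for comparison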
Floating Point Issues
Let's try it for $f(x) = e^{2x}$ at $x_0 = 1$, taking $h$ as small as we possibly can. In 64-bit floating point precision, the smallest positive number we can represent is 5e-324.
f(x) = exp(2*x)
x0 = 1.0
h = 5e-324
f_derivative_approximation = (f(x0+h) - f(x0))/h
@show f_derivative_approximation
f_derivative_approximation = 0.0
What gives? We know that $f'(x) = 2e^{2x}$, so we should have $f'(1) = 2e^2 \approx 14.78$, but we got $0.0$. Well, let's see what happens when we add $x_0$ and $h$.
x0 = 1.0
h = 5e-324
@show x0 + h
x0 + h = 1.0
We can't represent the number $1 + 5 \times 10^{-324}$ in 64-bit floating point representation. Instead, the value gets rounded off to the nearest representable number, which is just $1$. So in our computation we have $x_0 + h = x_0$, so that $f(x_0 + h) - f(x_0) = 0$, and therefore we get an approximation of $0$. In fact, we would get $0$ for any function $f$.
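The same thing happens no matter which function we differentiate. For example, using sin purely as an illustration:
g(x) = sin(x)
x0 = 1.0
h = 5e-324
@show (g(x0 + h) - g(x0)) / h   # 0.0 again, because x0 + h rounds back to x0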
To fix this, let's try the smallest value of $h$ for which $1 + h \neq 1$ in the computer. The function eps(T) gives the difference between 1 and the next largest representable number of type T.
h = eps(Float64)
@show h
@show 1+h
h = 2.220446049250313e-16
1 + h = 1.0000000000000002
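As an aside, eps is defined for the other standard floating point types as well, and it matches the gap between 1 and the next representable number. A quick check (not part of the original derivation):
@show eps(Float32)                          # 2^-23, about 1.19e-7
@show nextfloat(1.0) - 1.0                  # the gap above 1.0 for Float64
@show nextfloat(1.0) - 1.0 == eps(Float64)  # true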
We now have $x_0 + h \neq x_0$. Does this fix our issue and give us an accurate approximation of $f'(x_0)$? Not quite.
f(x) = exp(2*x)
x0 = 1.0
h = eps(Float64)
f_derivative_approximation = (f(x0+h) - f(x0))/h
f_derivative_analytic = 2*exp(2*x0)
@show h
@show 1+h
@show f_derivative_approximation
@show f_derivative_analytic
h = 2.220446049250313e-16
1 + h = 1.0000000000000002
f_derivative_approximation = 12.0
f_derivative_analytic = 14.7781121978613
Choosing an Ideal $h$
We are in the ballpark of the analytic value $f'(x_0) = 2e^2 \approx 14.78$, but our approximation of $12.0$ is off by quite a bit (we don't even have 2 digits of accuracy). What went wrong? Well, just as $1 + 5 \times 10^{-324}$ is rounded off to $1$, each of the floating point operations involved in computing (f(x0+h) - f(x0))/h might have some small floating point errors. But even a small floating point error can result in a large loss of relative precision.
We saw above that machine precision is about $2.2 \times 10^{-16}$. Roughly speaking, this means we have about 16 digits of precision in any number we represent. Consequently, any number which we represent may actually be off by a relative amount of around $10^{-16}$. We can observe this by increasing the precision of our floating point numbers:
using Quadmath
@show Float64(1.1)
@show Float128(Float64(1.1))
Float64(1.1) = 1.1
Float128(Float64(1.1)) = 1.10000000000000008881784197001252323e+00
We see that the 64-bit version of $1.1$ actually stores a number closer to $1.1000000000000000888$. This may not seem like a very big deal, since the error is very small. But one case where the roundoff error becomes a big deal is the subtraction of nearly equal numbers. Consider the two nearly equal numbers $a = 3.3000000000000003$ and $b = 3.3$. Analytically, we should have $a - b = 3 \times 10^{-16}$. But let's try performing that computation with 64-bit floating point numbers:
a = 3.3000000000000003
b = 3.3
@show a-b
a - b = 4.440892098500626e-16
We expected $3 \times 10^{-16}$, but we got roughly $4.4 \times 10^{-16}$. Although the error is small compared to $a$ and $b$, it is significant compared to the analytic value of $a - b$. We went from having 16 digits of precision in $a$ and $b$ to having not even one digit of precision in $a - b$. This may seem like an extreme example, but we can see similar effects for something less extreme:
a = 3.30003
b = 3.3
@show a-b
a - b = 3.0000000000196536e-5
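To see where those trailing digits come from, we can inspect the values actually stored for these two literals by converting them to higher precision, just as we did with 1.1 above (this reuses the Quadmath package loaded earlier; neither literal is exactly representable in 64-bit floating point):
using Quadmath   # already loaded above
@show Float128(3.30003)
@show Float128(3.3)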
Here $a$ and $b$ are no longer quite so close, and we have about 11 digits of precision in $a - b$. There is still a noticeable loss of precision, although it is not as catastrophic as before. In our finite difference computation, we run the risk of catastrophic loss of precision when we perform the subtraction $f(x_0 + h) - f(x_0)$. Because we take $h$ to be small, these two numbers may be very close. Let's say the error in the subtraction is roughly $\epsilon$. Then when we try to perform the finite difference approximation (1), what we are actually computing is more like
$$\frac{f(x_0 + h) - f(x_0) + \epsilon}{h}.$$
The error in our approximation is
$$\left| \frac{f(x_0 + h) - f(x_0) + \epsilon}{h} - f'(x_0) \right| \approx \frac{\epsilon}{h} + O(h), \tag{2}$$
where the $\epsilon / h$ term is due to floating point errors, and the $O(h)$ term is the truncation error in our approximation formula.
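For reference, the $O(h)$ term in (2) comes from a standard Taylor expansion:
$$f(x_0 + h) = f(x_0) + h f'(x_0) + \frac{h^2}{2} f''(\xi) \quad \Longrightarrow \quad \frac{f(x_0 + h) - f(x_0)}{h} = f'(x_0) + \frac{h}{2} f''(\xi)$$
for some $\xi$ between $x_0$ and $x_0 + h$, so the truncation error is proportional to $h$ as long as $f''$ is bounded near $x_0$.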
On the one hand, we want to make $h$ small so that the $O(h)$ truncation error is small. But on the other hand, the smaller we make $h$, the larger we make $\epsilon / h$. To minimize the error in our computation of $f'(x_0)$, we need to choose an $h$ which makes both terms together as small as possible.
We don't know the coefficients hidden in the $O(h)$ term. Assuming they are of order one, we are trying to solve
$$\min_{h} \left( \frac{\epsilon}{h} + h \right). \tag{3}$$
We can take the derivative with respect to $h$ and set the result equal to zero to find a critical point:
$$-\frac{\epsilon}{h^2} + 1 = 0 \quad \Longrightarrow \quad h = \sqrt{\epsilon}.$$
Let's assume $\epsilon$ is comparable to the machine precision, as returned by eps(Float64). That is, $\epsilon \approx 10^{-16}$, giving us
$$h = \sqrt{\epsilon} \approx 10^{-8},$$
with a predicted error of roughly $\epsilon / h + h \approx 2 \times 10^{-8}$.
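As a quick numerical sanity check of that value (not part of the derivation above):
@show sqrt(eps(Float64))   # about 1.49e-8, i.e. on the order of 1e-8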
Checking Results
These are all rough calculations, so let's check them with a numerical experiment. Let's compute $f'(x_0)$ using our finite difference approximation with several values of $h$, and check the relative error in the computed value compared to the analytic solution.
using Plots
pgfplotsx()
using LaTeXStrings
f(x) = exp(2*x)
fprime_finite_diff(x, h) = (f(x+h) - f(x)) / h
x0 = 1.0
fprime_analytic = 2*exp(2*x0)
h_vals = [10.0 ^ i for i in -15:0.5:-1]
fprime_finite_diff_vals = [fprime_finite_diff(x0, h) for h in h_vals]
fprime_finite_diff_errors = abs.(fprime_finite_diff_vals .- fprime_analytic)
fprime_finite_diff_relative_errors = fprime_finite_diff_errors ./ fprime_analytic
pl = plot(h_vals, fprime_finite_diff_relative_errors,
xscale=:log10, yscale=:log10, label="Relative Error",
xlabel=L"h", ylabel="Relative Error",
title = raw"\bf Finite Difference Approximation of exp(2x)",
lw=2.5, tickfontsize=18, guidefontsize=24, legendfontsize=18,
ylims=(1e-15, 1e0), xlims=(1e-15, 1e0),
yticks=[1e-15, 1e-12, 1e-9, 1e-6, 1e-3, 1e0],
xticks=[1e-15, 1e-12, 1e-9, 1e-6, 1e-3, 1e0],
titlefontsize=24)
vline!(pl, [1e-8], label=raw"h=1e-8")
hline!(pl, [1e-8], label="Error=1e-8")

Indeed, we see that the error is minimized around $h = 10^{-8}$, and that the relative error in our approximation at $h = 10^{-8}$ is roughly $10^{-8}$, as we predicted. This is a big improvement over our result when choosing $h$ to be machine precision, for which the relative error was over $10^{-1}$. In other words, we went from over 10% error to only about 0.000001% error.
In conclusion, by analyzing the effects of floating point errors in our computation, we have successfully obtained a much more accurate finite difference approximation of $f'(x_0)$ than the one we obtained by simply taking $h$ to be as small as possible.
A Simpler Optimization Method
I used calculus to find a critical point of (3). If we wanted to be less fancy about finding the minimum, we could have simply plotted each of the terms in that equation on a log-log scale and found the point where they intersect.
g1(h) = 1e-16 / h
g2(h) = h
h_vals = [10.0 ^ i for i in -15:0.5:-1]
g1_vals = [g1(h) for h in h_vals]
g2_vals = [g2(h) for h in h_vals]
pl = plot(h_vals, g1_vals,
xscale=:log10, yscale=:log10,
xlabel=L"h",
title = raw"\bf Roundoff vs Approximation Errors",
label=raw"\epsilon / h (Roundoff Error)",
lw=2.5, tickfontsize=18, guidefontsize=24, legendfontsize=18,
ylims=(1e-15, 1e0), xlims=(1e-15, 1e0),
yticks=[1e-15, 1e-12, 1e-9, 1e-6, 1e-3, 1e0],
xticks=[1e-15, 1e-12, 1e-9, 1e-6, 1e-3, 1e0],
titlefontsize=24)
plot!(pl, h_vals, g2_vals, label="h (Approximation Error)")

We can see that the lines intersect near $h = 10^{-8}$, and that the roundoff and approximation terms both have a value of around $10^{-8}$ there (so together they contribute an error of about $2 \times 10^{-8}$). This is consistent with our original prediction.
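If we want a number rather than an eyeballed crossing, we can also locate the intersection directly from the sampled values computed above:
i_cross = argmin(abs.(log10.(g1_vals ./ g2_vals)))   # index where the two curves are closest on a log scale
@show h_vals[i_cross]   # 1.0e-8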
Higher Accuracy With Second Order Central Difference
What if we decide that a relative precision of $10^{-8}$ is not enough? We can use a higher-order method to obtain a more accurate result, but then our ideal value of $h$ will change. Let's consider the second-order central difference method
$$f'(x_0) \approx \frac{f(x_0 + h) - f(x_0 - h)}{2h},$$
whose truncation error is $O(h^2)$ rather than $O(h)$.
Accounting for roundoff errors and ignoring constant factors as we did before, the error is now roughly
$$\frac{\epsilon}{h} + h^2,$$
which is minimized at $h = \epsilon^{1/3} \approx 10^{-16/3}$, with a corresponding error of roughly $\epsilon^{2/3} \approx 10^{-32/3} \approx 10^{-10.7}$.
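Under these rough assumptions (constants of order one, with $\epsilon$ taken to be either eps(Float64) or the rounder value $10^{-16}$), the predicted optimum is easy to compute directly as a back-of-the-envelope check before running the experiment:
h_opt   = cbrt(eps(Float64))    # about 6.1e-6; with eps = 1e-16 this is 10^(-16/3)
err_opt = eps(Float64)^(2/3)    # about 3.7e-11; with eps = 1e-16 this is 10^(-32/3), i.e. about 10^(-10.7)
@show h_opt
@show err_opt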
f(x) = exp(2*x)
fprime_central_diff(x, h) = (f(x+h) - f(x-h)) / (2*h)
x0 = 1.0
fprime_analytic = 2*exp(2*x0)
h_vals = [10.0 ^ i for i in -15:0.5:-1]
fprime_central_diff_vals = [fprime_central_diff(x0, h) for h in h_vals]
fprime_central_diff_errors = abs.(fprime_central_diff_vals .- fprime_analytic)
fprime_central_diff_relative_errors = fprime_central_diff_errors ./ fprime_analytic
pl2 = plot(h_vals, fprime_central_diff_relative_errors,
xscale=:log10, yscale=:log10, label="Relative Error",
xlabel=L"h", ylabel="Relative Error",
title = raw"\bf Central Difference Approximation of exp(2x)",
lw=2.5, tickfontsize=18, guidefontsize=24, legendfontsize=18,
ylims=(1e-15, 1e0), xlims=(1e-15, 1e0),
yticks=[1e-15, 1e-12, 1e-9, 1e-6, 1e-3, 1e0],
xticks=[1e-15, 1e-12, 1e-9, 1e-6, 1e-3, 1e0],
titlefontsize=24)
vline!(pl2, [10.0 ^ (-16/3)], label="h = eps^(1/3)")
hline!(pl2, [10.0 ^ (-10.7)], label="Error = 1e-10.7")

We see that our predictions were correct, and we can now approximate $f'(x_0)$ to a relative precision of about $10^{-10.7}$, which is almost three more digits of precision than what we could do with the first-order method.
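If you prefer to read the optimum off the data rather than the plot, it can be pulled directly out of the arrays computed for this experiment:
i_best = argmin(fprime_central_diff_relative_errors)
@show h_vals[i_best]                                # best step size observed
@show fprime_central_diff_relative_errors[i_best]   # smallest relative error observed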