Exponential Backoff and Jitter

table of contents
"Take that, Exponential Backoff!!"
"What, what technique is that!?"
looks like a story that's hard to understand is about to unfold... but
thank you for your hard work.
This is Matsuyama, the Living Corpse, from the Systems Development Department.
I thought I'd update my blog for the first time in a while, but
I didn't have the energy to write any sample code and make something that works, so
this time I'll write about a little algorithm.
for example
When an error is returned from the server during communication between the client and the server,
the possible behaviors on the client side are:
① Display an error message and cancel the process
② Execute the same communication again (retry)
③ The process becomes impossible
③ is just a bug, so it's out of the question, but I think most cases will be ① or ②.
In particular, in the case of a flow where processing cannot be interrupted, I think you will hope for recovery by ②.
retry
Let's think about retries for a moment.
If we simply
retry because the response was an error If there is a problem on the server side for some reason,
all users will keep retrying, which could
the server load rises sharply (a situation no different from a DoS attack) and the server eventually goes down
.
I have bitter memories of this actually happening to me...
The algorithm used to distribute such high loads
is the "exponential backoff" mentioned at the beginning.
Exponential Backoff
Literally,
exponential
backoff
is an algorithm that exponentially backs off (delays) the retry interval.
If an error occurs, the retry
1 second, 2 seconds, 4 seconds, 8 seconds, etc.
, reducing the overall number of retries and achieving efficient retries.
AWS also explains it.
AWS explanation
jitter
However, a simple exponential function can result in many simultaneous requests
(because users who access at the same time will have the same retry interval)
. This is why "jitter" is used in conjunction with exponential.
Literally
, jitter = time lag
.
By using a random value to give a range to the retry interval, you can spread out simultaneous retries.
A detailed summary of the verification results is available in a post on the AWS Solution Architect blog
summary
In reality, you will need to consider things like the maximum number of retries and timeouts, but
this time we will only introduce exponential backoff and jitter.
In the case of temporary errors or failures, retrying can often resolve the issue.
If it will take some time to recover,
that by performing efficient retries as introduced here
on the system.
By the way,
I used the opening line as a little joke to grab the attention of the audience, because I had heard someone say, "Exponential backoff sounds like a special move."
It's simple, but that's all for now
16