A Story About Exponential Backoff and Jitter
“Take this: Exponential Backoff!!”
“What kind of technique is that!?”
A story that sounds like it could start that way is about to unfold...
Thank you for your hard work.
This is Matsuyama, also known as the Living Dead, from the System Development Department.
It's been a while since my last blog update. I didn't have the energy to write sample code and get something working, so this time I thought I'd write a little about algorithms instead.
For example, if an error is returned from the server during client ↔︎ server communication, the possible behaviors on the client side are:
① Display an error message and cancel the process
② Send the same request again (retry)
③ Freeze, unable to proceed
③ is out of the question since it's simply a bug, so in practice it comes down to ① or ②. Especially in a flow that can't be interrupted partway through, you'd expect to recover via ②.
Retry
Let's think about retries for a moment. The naive approach: the response was an error, so just send the same request again. With this approach, the cycle of request → response (error) → retry repeats endlessly.
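As a minimal sketch of that naive pattern (send_request here is a hypothetical stand-in for whatever call your client actually makes, with a simulated failure rate):

```python
import random

def send_request():
    """Hypothetical stand-in for the real HTTP call; fails most of the time."""
    if random.random() < 0.9:
        raise ConnectionError("server returned an error")

# Naive retry: on any error, immediately send the request again.
while True:
    try:
        send_request()
        break  # success, stop retrying
    except ConnectionError:
        continue  # retry immediately with no wait -- this is the problem
```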
If a failure occurs on the server side for some reason, every user keeps retrying (from the server's point of view, this is no different from a DoS attack), the load spikes, and the server eventually goes down. It can lead to the worst possible situation.
I have bitter memories of this actually happening...
The algorithm used to spread out that kind of load is the "Exponential Backoff" mentioned at the beginning.
Exponential Backoff
Breaking the name down:
・Exponential = growing by powers
・Backoff = backing off (pulling back)
In other words, it's an algorithm that backs off (delays) the retry interval exponentially.
When an error occurs, you retry after waiting 1 second, then 2 seconds, 4 seconds, 8 seconds, and so on. This reduces the total number of retries and makes retrying more efficient.
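As a rough sketch, assuming a base wait of 1 second and simple doubling, the wait before each retry attempt can be computed like this:

```python
BASE_SECONDS = 1.0  # wait before the first retry (assumed value)

def backoff_delay(attempt: int) -> float:
    """Exponential backoff: 1s, 2s, 4s, 8s, ... for attempt = 0, 1, 2, 3."""
    return BASE_SECONDS * (2 ** attempt)

for attempt in range(4):
    print(backoff_delay(attempt))  # 1.0, 2.0, 4.0, 8.0
```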
This is also explained in AWS's documentation:
AWS explanation
Jitter
However, with plain exponential backoff, many requests can still arrive at the same moment
(users who hit the error at the same time end up with identical retry intervals).
That's why "jitter" is used in combination with exponential backoff.
Translated literally,
・Jitter = a slight shift (fluctuation) in timing
By widening the retry interval with a random value, simultaneous retries can be spread apart.
Detailed verification results are summarized in an article on the AWS Solution Architect blog.
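One widely used way to combine the two is the "full jitter" strategy discussed in that article: wait a random amount of time between zero and the exponential backoff value. A minimal sketch, where the 1-second base and 60-second cap are assumed values:

```python
import random

BASE_SECONDS = 1.0  # assumed base wait
CAP_SECONDS = 60.0  # assumed cap so the wait doesn't grow forever

def full_jitter_delay(attempt: int) -> float:
    """Full jitter: a random wait in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(CAP_SECONDS, BASE_SECONDS * (2 ** attempt)))
```

Because each client draws its own random value, clients that failed at the same instant no longer come back at the same instant.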
Summary
In reality you also need to consider things like an upper limit on the number of retries and timeouts, but this time I've limited myself to introducing exponential backoff and jitter.
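For completeness, here's a sketch of what the whole thing might look like with a retry cap folded in (max_retries, the exception type, and send_request are all illustrative assumptions):

```python
import random
import time

def send_request():
    """Hypothetical stand-in for the call being retried."""
    ...

def call_with_retries(max_retries: int = 5, base: float = 1.0, cap: float = 60.0):
    """Retry with exponential backoff + full jitter, giving up after max_retries."""
    for attempt in range(max_retries):
        try:
            return send_request()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # full-jitter wait before the next attempt
            time.sleep(random.uniform(0.0, min(cap, base * (2 ** attempt))))
```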
For temporary errors or outages, a retry will often resolve the problem on its own. If recovery takes a while, though, keep in mind that efficient retries like the ones introduced here can improve the situation without putting unnecessary load on the server.
By the way, the exchange at the beginning came from the thought that "exponential backoff" sounds like the name of a special move.
It's a simple story, but that's all for today.