I’ve been looking at the way Mercury handles retries when sending mail. The current procedure involves setting a “Basic minimum period between queue retries”, and using a “progressive backoff algorithm” to calculate job retries.
Assuming you set the minimum period to N minutes, the algorithm makes ten tries, one every N minutes, then ten more, one every 1.5N minutes, and continues to increase the retry interval to 2N, 2.5N, 3N, 3.5N, 4N, 5N and 6N, making ten retries at each value. It then increases to 7N and stays there until the maximum number of retries is reached.
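To make that schedule concrete, here is a rough Python sketch of the backoff as described above. It is not Mercury's actual code: the names (`retry_interval`, `elapsed_minutes`, the constants) are made up for illustration, and it assumes the "ten tries" at each step are counted as retries numbered from 1.

```python
# Multipliers of N as listed in the post; each is used for ten consecutive retries.
STEP_MULTIPLIERS = [1, 1.5, 2, 2.5, 3, 3.5, 4, 5, 6]
FINAL_MULTIPLIER = 7      # used for every retry after the steps above run out
RETRIES_PER_STEP = 10


def retry_interval(retry_number, n_minutes):
    """Minutes to wait before the given retry (1-based), for a base period of N minutes."""
    if retry_number < 1:
        raise ValueError("retry_number starts at 1")
    step = (retry_number - 1) // RETRIES_PER_STEP
    if step < len(STEP_MULTIPLIERS):
        return STEP_MULTIPLIERS[step] * n_minutes
    return FINAL_MULTIPLIER * n_minutes


def elapsed_minutes(retries, n_minutes):
    """Total minutes between the first rejection and the given retry."""
    return sum(retry_interval(k, n_minutes) for k in range(1, retries + 1))
```

With N=15, `retry_interval(11, 15)` gives 22.5 minutes and `retry_interval(91, 15)` gives 105 minutes, so the interval never drops below N no matter how recently the message was queued.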
This worked well in the days when you were retrying because of actual failures, but it does not work as well when sending mail to a server that uses greylisting. If you set your N too low, you end up hammering a receiving server that is actually down. If you set your N too high, your email to recipients that use greylisting gets delayed, and you also run the risk of getting it rejected: your first delivery attempt gets refused, and by the time you make your second attempt, the receiving server has already deleted your first attempt from its greylist and you are rejected again. Don’t laugh – this happened with one of our recipients, and we ended up on their blacklist for repeated failed attempts.
Even with a reasonable N, your email can get delayed quite a bit: we have a recipient that has two servers in their MX record. Our first try hits their server #1 and gets rejected by their greylisting. Our second try hits their server #2 and gets rejected again (yes, I know their two servers should share data, but they don’t). The third try hits either server #1 or #2 and is accepted, for a total delay of 30 minutes (we use N=15) – all the while we are on the phone with them telling them their data will arrive “any minute now”.
To see how other servers handle this, I configured my Mercury server to reject all inbound email containing a specific nonsense subject. I then sent myself email with that subject from a Gmail account and from a Microsoft Exchange account, sat back and watched the Mercury logs.
Exchange has the simplest algorithm. It retried after 1 minute, then waited another 2 minutes to try a second time, then waited 6 minutes for a third retry, and another 20 minutes for a fourth one. After that, it kept trying once every 60 minutes and gave up after a total of about 72 hours and 77 retries. Notice that it starts very aggressively, and then backs off rather quickly to a reasonable 1 hour between retries.
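For comparison, the Exchange behaviour I saw can be written down as a short fixed list followed by a constant interval. The sketch below only reproduces the intervals from my logs; it is not taken from any Exchange documentation, and the names are invented for the example.

```python
# Intervals (in minutes) seen in the Mercury logs for the first four retries.
OBSERVED_INITIAL_INTERVALS = [1, 2, 6, 20]
STEADY_INTERVAL = 60  # every later retry came about an hour after the previous one


def exchange_like_interval(retry_number):
    """Minutes to wait before the given retry (1-based), following the observed pattern."""
    if retry_number < 1:
        raise ValueError("retry_number starts at 1")
    if retry_number <= len(OBSERVED_INITIAL_INTERVALS):
        return OBSERVED_INITIAL_INTERVALS[retry_number - 1]
    return STEADY_INTERVAL


def total_minutes(retries):
    """Minutes elapsed from the first rejection up to the given retry."""
    return sum(exchange_like_interval(k) for k in range(1, retries + 1))
```

Four retries have gone out within 29 minutes of the first rejection, which is what matters for greylisting. `total_minutes(77)` comes to 4409 minutes, or about 73.5 hours, in the same ballpark as the roughly 72 hours I observed.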
Gmail sent the first retry 7 minutes after the initial rejection, the second after an additional 21 minutes, and the third one 27 minutes after that. It then started sending at what seem like random intervals selected with an ever-increasing mean value. The times between tries are all over the place, but if you use exponential smoothing to look at them, you see a clearly increasing trend. The first couple of these retry intervals were around one hour, while the final ones were around 6 hours. It also gave up after 72 hours, but made only 27 retries in all. I have the data if anyone is interested.
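In case the smoothing step is unfamiliar, this is the kind of calculation I mean: simple exponential smoothing, where each smoothed value is a weighted mix of the new interval and the previous smoothed value. The function name and the smoothing factor `alpha=0.5` are just choices for this example, and the input here is only the first three intervals quoted above, not the full data set.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing: s[0] = x[0], s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not values:
        return []
    smoothed = [float(values[0])]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# First three Gmail retry intervals in minutes, as reported in the post.
first_intervals = [7, 21, 27]
trend = exponential_smoothing(first_intervals, alpha=0.5)
```

A smaller `alpha` irons out more of the scatter, at the cost of reacting more slowly to the underlying increase.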
I think the algorithm that Mercury uses needs some tweaking. I like Exchange's approach: simple and reasonable. Gmail's use of random intervals seems overly complicated, although it also starts with short intervals and quickly increases to longer ones.