We have a webpage that queries an item from an API gateway which in turn calls a service that calls another service and so on.
Webpage --> API Gateway --> service#1 --> service#2 --> data store (RDBMS, S3, Azure Blob)
We want to make the operation resilient so we added a retry mechanism at every layer.
Webpage --retry--> API Gateway --retry--> service#1 --retry--> service#2 --retry--> data store.
This, however, could cause a cascading failure: if the data store doesn't respond in time, every layer could time out and retry. Because retries multiply across layers, if each of the four callers is configured to retry once, the data store could be queried up to 2^4 = 16 times in the worst case instead of once.
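To illustrate the amplification, here is a minimal Python sketch that counts data store calls in the worst case, where every attempt fails. The function name and parameters are hypothetical, and it assumes each caller's timeout is long enough for the layers below it to finish their own retries.

```python
def count_data_store_calls(layers: int, retries_per_layer: int) -> int:
    """Count data store hits when every attempt fails.

    `layers` is the number of callers in the chain
    (webpage, API Gateway, service#1, service#2 -> 4).
    """
    calls = 0

    def call(depth: int) -> bool:
        nonlocal calls
        if depth == layers:
            # We reached the data store; record the hit and fail.
            calls += 1
            return False
        # Each caller makes one initial attempt plus its retries.
        for _ in range(1 + retries_per_layer):
            if call(depth + 1):
                return True
        return False

    call(0)
    return calls


if __name__ == "__main__":
    print(count_data_store_calls(layers=4, retries_per_layer=1))
    print(count_data_store_calls(layers=4, retries_per_layer=0))
```

The count grows as (1 + retries) ** layers, so even a single retry per layer adds up quickly.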
My proposed solution
Only retry at the API Gateway or webpage so that:
If the data store fails to respond, then the services will time out as well, which will cause the API Gateway to retry.
If the data store returns an error, then service#2 will return an HTTP 500, which will cause service#1 to return an HTTP 500, thus causing the API Gateway to retry.
API Gateway <--500-- service#1 <--500-- service#2 <--error-- data store.
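A rough sketch of what I mean, in Python. All names here (`UpstreamError`, `FakeDataStore`, `service`, `gateway`) are hypothetical stand-ins and not part of any real framework: the inner services make a single attempt and let the failure propagate, and only the gateway retries.

```python
import time
from typing import Callable


class UpstreamError(Exception):
    """Hypothetical stand-in for an HTTP 500 or a timeout from the next hop."""


class FakeDataStore:
    """Fake data store that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise UpstreamError("data store error")
        return "item"


def service(next_hop: Callable[[], str]) -> Callable[[], str]:
    """Inner layer: one attempt only, any failure propagates to the caller."""
    def handler() -> str:
        return next_hop()
    return handler


def gateway(next_hop: Callable[[], str], max_attempts: int = 2,
            base_delay: float = 0.0) -> Callable[[], str]:
    """Edge layer: the only place where retries happen."""
    def handler() -> str:
        for attempt in range(1, max_attempts + 1):
            try:
                return next_hop()
            except UpstreamError:
                if attempt == max_attempts:
                    raise  # out of attempts, surface the error
                # Exponential backoff between attempts (no delay by default).
                time.sleep(base_delay * 2 ** (attempt - 1))
        raise AssertionError("unreachable")
    return handler


if __name__ == "__main__":
    store = FakeDataStore(failures=1)
    handle = gateway(service(service(store)))
    print(handle(), store.calls)
```

With this layout the data store sees at most `max_attempts` calls per request, regardless of how many services sit in between.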
This solution seems reasonable, but does it mean that I should disable or shorten the retry mechanism of the AWS SDK, Azure SDK, and RDBMS client libraries?
- Inform the calling layer whether the failure can be retried. This, however, is complicated to implement because I somehow need to know whether the error returned by AWS, Azure, etc. is transient.
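One way this could look, as a sketch only: the layer closest to the data store classifies the error and signals retryability through the HTTP status, for example 503 for transient failures and 500 for everything else, and the layers above forward the status unchanged. The error codes below are made-up placeholders, not real AWS or Azure codes; the hard part (mapping each SDK's actual errors onto such a set) is exactly what I am unsure about.

```python
from typing import Optional

# Hypothetical placeholder codes; a real mapping would have to be built
# from each SDK's documented error codes.
TRANSIENT_CODES = {"TIMEOUT", "THROTTLED", "UNAVAILABLE"}


def status_for(error_code: Optional[str]) -> int:
    """Translate a data store outcome into an HTTP status (service#2)."""
    if error_code is None:
        return 200
    # 503 signals "try again later"; 500 signals a non-retryable failure.
    return 503 if error_code in TRANSIENT_CODES else 500


def forward(status: int) -> int:
    """Intermediate layers (service#1) pass the status through unchanged."""
    return status


def should_retry(status: int) -> bool:
    """The retrying layer (API Gateway) retries only on 503."""
    return status == 503


if __name__ == "__main__":
    for code in (None, "TIMEOUT", "CONSTRAINT_VIOLATION"):
        status = forward(status_for(code))
        print(code, status, should_retry(status))
```

The key design point is that intermediate layers must not collapse every failure into a generic 500, otherwise the retryability information is lost before it reaches the gateway.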