Amazon API Gateway 504 : Execution failed due to a network error communicating with endpoint

An Amazon API Gateway 504 response with the message “Execution failed due to a network error communicating with endpoint” is usually caused by a networking problem between API Gateway and the integration endpoint. It is typically seen with VPC Link private integrations and HTTP integrations.

COMMON REASONS for Amazon API Gateway 504 : Execution failed due to a network error communicating with endpoint

a) The endpoint is not reachable from API Gateway.

b) The endpoint was unhealthy when the request was made, so API Gateway could not establish the connection (TCP handshake).

c) API Gateway sent the request, but the endpoint rejected the connection because of a TCP-level issue.
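
To tell these cases apart, it can help to attempt a plain TCP handshake against the endpoint from a host that has a similar network path (for example, an instance in the same VPC). The sketch below is only an illustration: the function name check_tcp and the result labels are made up for this example, and the mapping of results to causes is a rule of thumb, not a guarantee. A timeout usually points to an unreachable endpoint (routing, security groups, or ACLs dropping packets), while a refusal usually means nothing is accepting connections on that port.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Attempt a TCP handshake and return a short label for the outcome."""
    try:
        # create_connection performs the full TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except ConnectionRefusedError:
        # The host answered with a reset: no listener, or the connection was rejected.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer within the timeout: packets are probably being dropped.
        return "timeout"
    except OSError as err:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {err}"
```

For example, check_tcp("internal-target.example.com", 443) could be run several times around the time the 504 errors occur; the host name here is a placeholder.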

General advice for 5XX Errors

If the 5xx errors “Execution failed due to a network error communicating with endpoint” are intermittent and the error rate is low, the usual way to handle them is to implement retries with exponential backoff.

As the AWS documentation states [1]:

Numerous components on a network, such as DNS servers, switches, load balancers, and others can generate errors anywhere in the life of a given request. The usual technique for dealing with these error responses in a networked environment is to implement retries in the client application. This technique increases the reliability of the application and reduces operational costs for the developer.

However, if you still receive these errors consistently and at a high rate even after implementing retries with exponential backoff, there may be an underlying issue that needs investigation. The following troubleshooting steps can help find the root cause.
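
As a rough illustration of retries with exponential backoff, here is a minimal Python sketch. All names in it (backoff_delays, call_with_retries, get_status) and the default delays are assumptions made for this example, not part of any AWS SDK. It retries only on 5xx status codes and adds random jitter to each wait; a real client would probably also want to retry on connection errors such as urllib.error.URLError.

```python
import random
import time
import urllib.error
import urllib.request


def backoff_delays(max_attempts, base_delay=0.5, max_delay=8.0):
    """Return the capped exponential delay (in seconds) before each retry."""
    return [min(max_delay, base_delay * (2 ** i)) for i in range(max_attempts - 1)]


def call_with_retries(send, max_attempts=4, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-5xx status or attempts run out.

    send is any zero-argument callable that returns an HTTP status code.
    """
    delays = backoff_delays(max_attempts, base_delay, max_delay)
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status < 500:
            return status
        if attempt < len(delays):
            # Jitter: wait a random time up to the capped exponential delay.
            sleep(random.uniform(0, delays[attempt]))
    return status


def get_status(url, timeout=30):
    """Return the HTTP status of a GET request, including error statuses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code
```

A call could then look like call_with_retries(lambda: get_status(api_url)), where api_url is the invoke URL of your own API.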

TROUBLESHOOTING

Before troubleshooting, it is really important to enable CloudWatch Logs for API Gateway.
Enable CloudWatch Logs: https://cloudnamaste.com/api-gateway-cloudwatch-logs/

I can think of the following request routes (more combinations exist as well):
Client --> API Gateway --> NLB (VPC Link) --> HTTP/S Target
Client --> API Gateway --> Public ALB --> HTTP/S Target
Client --> API Gateway --> Public HTTP/S Target

  • Hit the Network Load Balancer, Application Load Balancer, or HTTP/S endpoint directly and see whether a similar error is returned. If the issue is intermittent, you may need to make multiple requests to reproduce it. This helps narrow down the source and shows which components to focus on during troubleshooting. If the error also occurs without API Gateway in the path, API Gateway can be ruled out as the source.
  • Note the timestamp of the error and check whether any Network Load Balancer, Application Load Balancer, or HTTP target hosts were unhealthy at that time.
  • Check the Network Load Balancer TCP_Target_Reset_Count metric. A spike in this metric indicates that the target instances may not be closing connections to the Network Load Balancer properly. When the Network Load Balancer tries to send a request over one of these improperly closed connections, a “Connection is Closed” error occurs.
  • Network Load Balancer has an idle timeout of 350 seconds. Check the keep-alive timeout configured on your target host server and make sure it is greater than the Network Load Balancer idle timeout, to avoid connection timeout issues.
  • Make sure the Network Load Balancer meets one of the following conditions, so that requests are always served by a healthy target with low latency. This can help prevent connection timeouts between API Gateway and the backend:
      • Only Availability Zones that contain active targets are enabled on the Network Load Balancer.
      • Cross-zone load balancing is enabled on the Network Load Balancer. This reduces the need to maintain equivalent numbers of instances in each enabled Availability Zone and improves your application's ability to handle the loss of one or more instances.
      • No Availability Zone enabled on the Network Load Balancer is without a healthy target.
  • If the error is “connection reset by peer”, the TCP connection was reset by one of the parties (client or server) during its lifetime. Possible reasons include an idle client or server, a connection timeout, or the client or server failing to receive SYN/ACK packets during the connect or close stage of the TCP session.
  • If the above steps do not narrow down the cause, run a packet capture on the target to check whether the requests reached the instance and whether any underlying issues occurred while they were being handled.
  • Enable VPC Flow Logs to capture information about the IP traffic going to and from the network interfaces of the NLB/ALB. This helps show whether requests were accepted or rejected by security groups or network ACLs.
  • If the Network Load Balancer has a TLS listener, enable access logs on it to capture detailed information about the TLS requests it receives and to troubleshoot issues.
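
The TCP_Target_Reset_Count check from the list above can also be scripted. The sketch below only builds the arguments for the CloudWatch GetMetricStatistics API around the timestamp of an error; the function name reset_count_query, the 15-minute window, and the load balancer value are assumptions for illustration. The load balancer value must be the CloudWatch dimension value of your own NLB, which has the form net/name/id.

```python
from datetime import datetime, timedelta, timezone


def reset_count_query(load_balancer, error_time, window_minutes=15):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    load_balancer is the NLB's CloudWatch dimension value, for example
    "net/my-nlb/0123456789abcdef" (a made-up placeholder).
    """
    window = timedelta(minutes=window_minutes)
    return {
        "Namespace": "AWS/NetworkELB",
        "MetricName": "TCP_Target_Reset_Count",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        # Look at the period surrounding the time the 504 was observed.
        "StartTime": error_time - window,
        "EndTime": error_time + window,
        "Period": 60,  # one data point per minute
        "Statistics": ["Sum"],
    }
```

With boto3 installed and credentials configured, the result could be passed on as boto3.client("cloudwatch").get_metric_statistics(**query), and the returned "Datapoints" inspected for a spike around the error time.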

If you still cannot narrow down the cause after the above steps, you can reach out to the API Gateway Premium Support team for investigation. To do so, open a case from the Support console and provide the following details:

1. Complete API Gateway execution logs of a few failed requests (or a few Extended Request Ids, which end in =). These can be found in the execution logs. An example is shown below:
(b822cacf-####-####-9d8d-cb96737e4eb8) Extended Request Id: ZSUx#####wMFRGw=
2. VPC Flow Logs
3. Access logs (if the Network Load Balancer has a TLS listener)
4. TCP dumps captured from the target instances
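
If the execution logs have been exported as text, the request IDs can be collected with a small script. This is a sketch that assumes the log lines look like the example above; the pattern and the function name extract_request_ids are illustrative and may need adjusting for your log format.

```python
import re

# Matches "(request-id) Extended Request Id: value" as in the example line.
PATTERN = re.compile(
    r"\((?P<request_id>[^)]+)\)\s+Extended Request Id:\s*(?P<extended_id>\S+)"
)


def extract_request_ids(log_lines):
    """Return (request_id, extended_request_id) pairs found in the log lines."""
    pairs = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            pairs.append((match.group("request_id"), match.group("extended_id")))
    return pairs
```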

If you have any questions about the articles on this website, please feel free to ask in our CloudNamaste Discord channel: https://cloudnamaste.com/join-cloudnamaste-discord-community/

Join here directly : https://discord.gg/TEbhdutUDQ

Looking forward to connecting with everyone!

Happy Troubleshooting !

References:
[1] https://docs.aws.amazon.com/general/latest/gr/api-retries.html