How to Connect an AWS API Gateway to a Private VPC Using an ALB

AWS and API Gateway

The AWS ecosystem provides developers and system administrators with many tools to expose their applications to their clients or to other systems. One of these tools is the API Gateway, which lets you create an HTTP or WebSocket API in a few clicks or using an industry standard like OpenAPI.

Up until last year, API Gateway could only reach resources in a VPC if they were publicly available. This made it difficult for security-conscious companies to use it as the frontend to their applications, as their load balancers (or any other private target, really) had to be exposed publicly, with extra measures added to reject calls not coming from the API Gateway itself.

With the addition of AWS PrivateLink support to API Gateway, it is now possible to front a private VPC with API Gateway… but only using Network Load Balancers (NLBs for short). The main consequences of this are:

  • Inability to route requests based on options like the ones supported by an Application Load Balancer (ALB for short).
  • Inability to add a Security Group to the NLB.

AWS Load Balancers and their IPs

AWS published in one of its blog series a way to link an NLB to an ALB to get all the benefits of a layer 7 load balancer while still using a layer 4 one. This setup can be used not only with the API Gateway but also in legacy systems that require a static IP to connect to, something that can’t be done with an ALB. The solution requires a periodic lookup of the ALB’s DNS name to get its IPs, an S3 bucket to store the result and a Lambda function to keep the list up to date and to update the Target Group of the NLB with the delta of IPs detected.

To see why the blog goes to such lengths as constantly querying the ALB’s DNS name, it’s necessary to understand how this load balancer scales.

An Application Load Balancer is a serverless service. It provides the user with a domain name that hides the complexity of its implementation. Internally, under heavy traffic, new servers are launched to keep up with the load. The service fetches new IPs from an internal pool and assigns them to the servers. These IPs are then registered with the DNS servers. Once the load comes back to normal, the extra servers are shut down, their IPs returned to the pool and the DNS servers updated accordingly. This is why AWS does not provide a direct way to get the IPs used by the ALB: it prevents users from referencing IPs that could change.

This scaling behaviour doesn’t apply to the Network Load Balancer though. On the surface, an NLB is similar to an ALB in that only a domain name is provided after creation. But, as explained in the documentation, NLB IPs are static and there is only one IP per availability zone the load balancer has been deployed in. So no servers are created or shut down, and no IPs are fetched from or returned to an internal pool dynamically.
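A quick way to observe this difference is to resolve each load balancer’s DNS name a few times and compare the results. The sketch below is only an illustration and not part of the solution: `resolve_ips` and `ip_delta` are hypothetical helper names, and the resolver is injectable so the logic can be exercised without network access.

```python
import socket


def resolve_ips(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses a DNS name resolves to."""
    # Each record is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is an (ip, port) pair.
    records = resolver(hostname, None, family=socket.AF_INET,
                       type=socket.SOCK_STREAM)
    return sorted({record[4][0] for record in records})


def ip_delta(previous, current):
    """Return (added, removed) between two lookups of the same name."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    return added, removed
```

Calling `resolve_ips` on an ALB’s DNS name at different times may return different sets while the load balancer scales, whereas an NLB should keep returning the same addresses, one per availability zone.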

Another Solution to the Problem

I came across the blog post referenced earlier while researching how to implement something similar for an API in the project I was working on. I didn’t like the DNS lookup part, as a lookup only returns 8 random IPs per invocation, so I decided to try a different solution. The following diagram describes it:

Steps involved:

  1. Create the Network Load Balancer.
  2. Create the Application Load Balancer.
  3. Add a Security Group to the Application Load Balancer.
  4. Assign a Target Group to the Network Load Balancer.
  5. Create a Lambda to update the Target Group of the Network Load Balancer with the IPs of the Application Load Balancer.
  6. Create a CloudWatch Event to trigger the Lambda periodically.
  7. Create a Lambda to update the Security Group of the Application Load Balancer with the IPs of the Network Load Balancer once.
  8. Create the API Gateway.
  9. Create a VPC Link associated with the Network Load Balancer.
  10. Create a Target Group for the Application Load Balancer with the required target (individual IPs, EC2 instances or Auto Scaling Groups).

The rest of the article describes how to use CloudFormation to create the solution from the diagram. There are some other steps involved in the process, like the creation of the IAM policies and roles that allow all the different services to talk to each other. I’m not going to describe them here, so as not to make the post too long and because there are plenty of other posts that describe this same process. At the end of the article, though, there is a link to a git repository that includes the Python code and CloudFormation template that generate the whole stack, for anyone interested in reusing it, learning how it works or suggesting modifications/fixes. Here I’m just going to focus on the not so straightforward parts.

Most of the steps described can be done through the console for testing purposes. Once in a production environment though, it’s always better to automate the whole process as much as possible to ensure consistency across environments and to avoid manual errors.

Update the Security Group of the Application Load Balancer

Unfortunately, the same way AWS doesn’t make it easy to find the IPs of a created load balancer through the console or the CLI, CloudFormation doesn’t export them either for reuse in a template.

But we can overcome this problem by making use of Custom Resources. A Custom Resource is a resource that CloudFormation does not support by default; its behaviour is defined by the user through a Lambda with bespoke code. The result of the Lambda can then be used like any other CloudFormation resource, allowing us to use intrinsic functions to feed other resources: in our case, to feed the SecurityGroupIngress of the ALB with the NLB’s IPs. The following snippet shows Python code capable of doing this.

import boto3
import cfnresponse


DEFAULT_LISTENER_PORT = 80


def lambda_handler(event, context):
   success = True
   data = {}
   try:
       if event['RequestType'] == 'Create' or event['RequestType'] == 'Update':
           print(event['RequestType'] + " resource using NLB private IPs")
           nlb_description = event['ResourceProperties']['nlbDescription']
           listener_port = event['ResourceProperties'].get('listenerPort') or DEFAULT_LISTENER_PORT
           print("Resource properties: listenerPort={}, nlbDescription={}".format(listener_port, nlb_description))
           client = boto3.client('ec2')
           nlb_nis = client.describe_network_interfaces(Filters=[
               {
                   'Name': 'description',
                   'Values': ['*' + nlb_description + '*']
               }
           ], MaxResults=100)
           data = {
               'privateIps': [{
                   'IpProtocol': 'tcp',
                   'CidrIp': ni['PrivateIpAddress'] + '/32',
                   'FromPort': listener_port,
                   'ToPort': listener_port
               } for ni in nlb_nis['NetworkInterfaces']]
           }
       else:
           print('Deleting resource, nothing to do')
       # The response is sent exactly once, from the finally block below
   except Exception as exception:
       print("Exception finding the private IPs of the NLB", str(exception))
       data = {}
       success = False
   finally:
       status_response = cfnresponse.SUCCESS if success else cfnresponse.FAILED
       print("Cloudformation status response: " + status_response)
       cfnresponse.send(event, context, status_response, data)

The Lambda function expects two properties: nlbDescription and listenerPort. Only nlbDescription is mandatory: it must be the name given to the NLB, as it is used to find all the ENIs (Elastic Network Interfaces, basically network adapters in the cloud) created for the load balancer. listenerPort, as the name implies, is the port the load balancer listener uses; if it is not set, port 80 is used by default.

The code makes use of the boto3 library provided by AWS to query the EC2 API. It sounds a bit weird that we need to use the EC2 API when there are no EC2 instances involved, but that is where we find the endpoint to get the ENIs associated with load balancers (in case anyone wonders, yes, ENIs associated with Lambdas running inside a VPC can also be found using the same method). We use the nlbDescription passed as argument to filter the ENIs we need. Unfortunately, tags added to the NLB are not copied across to its associated ENIs, so the only way to find the right ones is by checking their description, which includes the name given to the load balancer.
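The lookup above asks for a single page of at most 100 ENIs. As a minimal sketch, assuming an account where more interfaces could match, the same filter can be combined with the NextToken returned by describe_network_interfaces to walk every page. The function name `find_private_ips` is hypothetical, and the client is passed in as an argument (for example `boto3.client('ec2')`) so the logic can be tried against a stub.

```python
def find_private_ips(ec2_client, description_fragment, page_size=100):
    """Collect the private IPs of every ENI whose description contains the fragment."""
    request = {
        'Filters': [{
            'Name': 'description',
            'Values': ['*' + description_fragment + '*'],
        }],
        'MaxResults': page_size,
    }
    ips = []
    while True:
        page = ec2_client.describe_network_interfaces(**request)
        ips.extend(ni['PrivateIpAddress'] for ni in page['NetworkInterfaces'])
        token = page.get('NextToken')
        if not token:
            # No further pages: every matching ENI has been seen
            return ips
        request['NextToken'] = token
```

Because the filter is a wildcard match, a load balancer whose name is contained in the name of another one would also pick up the other’s ENIs, so distinctive names are worth choosing.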

I should also mention that CloudFormation expects a specific response from the Lambda invoked by the Custom Resource. The library cfnresponse provided by AWS helps with that. It expects a call back with the event and context CloudFormation first sends when invoking the Lambda, plus a status code (SUCCESS or FAILED) and the response data. The response must follow a specific format but the send method of the cfnresponse library takes care of that for us.
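To make that format concrete, here is a simplified sketch of the JSON document that ends up being sent (via an HTTP PUT) to the pre-signed URL found in the event’s ResponseURL field. It is not the library’s code: `build_response_body` is a hypothetical name, the HTTP call itself is left out, and reusing the log stream name as the physical resource id is an assumption of this sketch.

```python
import json


def build_response_body(event, log_stream_name, status, data, reason=None):
    """Build the JSON document CloudFormation expects from a custom resource."""
    body = {
        'Status': status,  # 'SUCCESS' or 'FAILED'
        'Reason': reason or 'See the details in CloudWatch Log Stream: ' + log_stream_name,
        # Assumption of this sketch: the log stream name doubles as physical id
        'PhysicalResourceId': log_stream_name,
        # The next three fields are echoed back from the incoming request
        'StackId': event['StackId'],
        'RequestId': event['RequestId'],
        'LogicalResourceId': event['LogicalResourceId'],
        # Name-value pairs that the template can later read with !GetAtt
        'Data': data,
    }
    return json.dumps(body)
```

The Data entry is what makes the privateIps attribute available to the rest of the template.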

Once the IPs are found, the data is exposed with the format specific to the property SecurityGroupIngress of the SecurityGroup resource. That way, it can then be used in the CloudFormation template like in this example:

NlbPrivateIpsFinder:
 Type: AWS::Serverless::Function
 Description: Lambda used to find the NLB private IPs
 Properties:
   FunctionName: 'NlbPrivateIpsFinder'
   Handler: nlb_private_ips_finder.lambda_handler
   Runtime: python3.7
   CodeUri: <path_to_code>
   Role: !GetAtt [ <lambda_role>, Arn ]
   AutoPublishAlias: PROD

NlbPrivateIps:
 Type: Custom::PrivateIps
 Version: 1.0
 DependsOn: NlbPrivateIpsFinder
 Properties:
   ServiceToken: !GetAtt [ NlbPrivateIpsFinder, Arn ]
   nlbDescription: '<load_balancer_name>'
   listenerPort: 80

BridgedPrivateAlbSecurityGroup:
 Type: AWS::EC2::SecurityGroup
 Description: Security group for the ALB
 Properties:
   GroupName: '<any_name>'
   GroupDescription: Security Group to handle ingress traffic into the ALB
   SecurityGroupIngress: !GetAtt [ NlbPrivateIps, privateIps ]  # Use of the Custom Resource

Application Load Balancer Security Group

One of the main problems with the NLB is that it does not support Security Groups. Security Groups are an integral part of the VPC architecture in AWS. Their stateful nature and the fact that one can configure allow rules that reference other Security Groups let users create network policies between services and EC2 instances very easily. Not being able to create a Security Group for the NLB forces us to configure the Security Group of the ALB with the NLB’s static IPs, which, as described above, are not straightforward to find.

In this solution there is no DNS lookup. There is no temporary storage either. Instead, the bulk of the job is done by searching the ENIs associated with the ALB by their description, as done before to configure the ALB Security Group.

The implementation uses a Lambda triggered by a CloudWatch Event every minute, the minimum granularity AWS allows at the time of writing. For each invocation, the code first gets all the ENIs with a description that matches the ALB’s. Then, it gets the current IPs registered as targets in the NLB’s Target Group.

With both groups found, a two-step process follows. First, the current ENIs’ IPs are registered in the Target Group. Then, the IPs found in the Target Group that are not in the list being registered are deregistered from it. The order here is important: otherwise we could find ourselves in a situation where all the current IPs are deregistered before a completely new list is registered, leaving the Target Group empty for as long as the process takes.
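The ordering can be isolated into two small functions, shown here as a sketch rather than the code from the repository. The names `plan_target_changes` and `apply_target_changes` are hypothetical, the client is passed in (for example `boto3.client('elbv2')`), and the guard that skips everything when no ALB IPs were found is an assumption added for this sketch.

```python
def plan_target_changes(current_ips, alb_ips, port):
    """Return (to_register, to_deregister) target lists for the NLB Target Group."""
    wanted = set(alb_ips)
    stale = set(current_ips) - wanted
    to_register = [{'Id': ip, 'Port': int(port)} for ip in sorted(wanted)]
    to_deregister = [{'Id': ip, 'Port': int(port)} for ip in sorted(stale)]
    return to_register, to_deregister


def apply_target_changes(elbv2_client, target_group_arn, to_register, to_deregister):
    """Register first and deregister second, so the group is never left empty."""
    if not to_register:
        # Assumption of this sketch: an empty ALB lookup is treated as a
        # failed lookup, so the existing targets are left untouched
        return
    elbv2_client.register_targets(
        TargetGroupArn=target_group_arn, Targets=to_register)
    if to_deregister:
        elbv2_client.deregister_targets(
            TargetGroupArn=target_group_arn, Targets=to_deregister)
```

Keeping the planning step free of AWS calls makes the set arithmetic easy to check in isolation before wiring it to the real client.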

This is the code that handles all that:

import boto3


def lambda_handler(event, context):
   port, alb_description, nlb_target_group_arn = __get_parameters_from(event)

   alb_ips = __get_alb_ips_from(alb_description)

   client = boto3.client('elbv2')
   nlb_target_group_health = client.describe_target_health(
       TargetGroupArn=nlb_target_group_arn
   )

   current_ips = __get_ips_from(nlb_target_group_health)
   client.register_targets(
       TargetGroupArn=nlb_target_group_arn,
       Targets=__create_targets_from(alb_ips, port)
   )

   invalid_ips = set(current_ips) - set(alb_ips)
   if len(invalid_ips) > 0:
       client.deregister_targets(
           TargetGroupArn=nlb_target_group_arn,
           Targets=__create_targets_from(invalid_ips, port)
       )


def __get_parameters_from(event):
   port = event['targetPort']
   alb_description = event['albDescription']
   nlb_target_group_arn = event['nlbTargetGroupArn']
   print("albDescription:", alb_description)
   print("targetPort:", port)
   print("nlbTargetGroupArn:", nlb_target_group_arn)
   return port, alb_description, nlb_target_group_arn


def __get_alb_ips_from(alb_description):
   client = boto3.client('ec2')
   alb_nis = client.describe_network_interfaces(Filters=[
       {
           'Name': 'description',
           'Values': ['*' + alb_description + '*']
       }
   ], MaxResults=100)
   return [ni['PrivateIpAddress'] for ni in alb_nis['NetworkInterfaces']]


def __create_targets_from(ips, port):
   return [{'Id': ip, 'Port': int(port)} for ip in ips]


def __get_ips_from(current_targets):
   return [target['Target']['Id']
           for target in current_targets['TargetHealthDescriptions']]

The following CloudFormation snippet shows the configuration (without the IAM policies required):

BridgedPrivateAlbListener:
 Type: AWS::ElasticLoadBalancingV2::Listener
 DependsOn: BridgedPrivateAlb
 Description: Listener to forward calls from the NLB to any configured target
 Properties:
   DefaultActions:
     - Type: fixed-response
       FixedResponseConfig:
         StatusCode: 404
   LoadBalancerArn: !Ref BridgedPrivateAlb
   Port: 80
   Protocol: HTTP

NlbTargetGroupUpdater:
 Type: AWS::Serverless::Function
 Description: Lambda used to update the NLB target group IPs with the latest IPs from the ALB.
 Properties:
   FunctionName: !Sub '${ResourcesPrefix}-NlbTgUpdater'
   Handler: nlb_target_group_updater.lambda_handler
   Runtime: python3.7
   CodeUri: ./nlb_target_group_updater/nlb_target_group_updater.py
   Role: !GetAtt [ <<lambda_role>>, Arn ]
   AutoPublishAlias: PROD
   Environment:
     Variables:
       targetPort: 80
       nlbTargetGroupArn: !Ref BridgedPrivateNlbTargetGroup
       albDescription: !GetAtt [ BridgedPrivateAlb, LoadBalancerName ]

NlbTargetGroupUpdaterTrigger:
 Type: AWS::Events::Rule
 Description: Single event to trigger the lambda that updates the NLB target group with the current ALB IPs
 Properties:
   Name: !Sub '${ResourcesPrefix}-NlbTgUpdaterTriggerEvent'
   Description: Single event to trigger the lambda that updates the NLB target group with the current ALB IPs
   ScheduleExpression: rate(1 minute)
   Targets:
     - Arn: !GetAtt [ NlbTargetGroupUpdater, Arn ]
       Id: nlbTargetGroupUpdaterTriggerId
       Input: !Sub
         '{
             "targetPort":"80",
             "nlbTargetGroupArn":"${BridgedPrivateNlbTargetGroup}",
             "albDescription":"${BridgedPrivateAlb.LoadBalancerName}"
         }'

Conclusion

AWS API Gateway is a great service to accelerate development of applications for startups and big enterprises alike. Apart from exposing the API without the need to maintain your own fleet of servers, it also gives you other features like quota control, canary releases or authentication/authorization control.

It has evolved along the way by integrating with many old and new services from AWS, and the introduction of PrivateLink has opened the door to still more integrations. But, given that the Application Load Balancer is the go-to load balancer for its features, one can but wonder about the technical reasons behind the decision to only allow integration with the Network Load Balancer at the moment.

PrivateLink is a rather new addition to the portfolio of available services and history has already shown us how AWS moves forward by fixing and improving its offering. So my guess is that, at some point in the (near) future, this solution will become obsolete, replaced by native support for ALB integration.

Link to the code.