Set Up API Gateway Robots.txt With AWS CDK

This guide demonstrates how to set up the robots.txt API Gateway resource with AWS CDK.

Wednesday, August 25, 2021

You may have found yourself in a situation where you need to control bot crawling of your API, possibly disabling it entirely. In such a case, you need to set up the robots.txt resource on your API Gateway. The setup is fairly straightforward; however, there are a few quirks to remember. This guide will help you set up the resource quickly and easily.

Note: The robots.txt resource needs to be defined on the root path of your API. If, for example, your API has the URL https://api.example.com, the robots.txt resource needs to be defined at https://api.example.com/robots.txt.

Note 2: If your API Gateway is defined with a base path mapping on a custom domain, e.g. https://api.example.com/petstore (petstore is the base path mapping), you need to create the robots.txt resource on the API whose base path mapping points to the root of the custom domain, i.e. https://api.example.com.

Integration

An easy way to return the robots.txt content from API Gateway is by using the API Gateway mock integration. The definition of the integration can look like
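A minimal sketch of such a definition follows, written as a plain TypeScript object so that it runs without the CDK installed. The type names ending in Sketch and the constant names are assumptions of this sketch, not CDK identifiers. The field names are meant to mirror the options accepted by the CDK MockIntegration construct, and the application/json content type of the request template is likewise an assumption.

```typescript
// Local structural types mirroring the field names of the CDK integration
// options used here; stand-ins so the sketch compiles without the CDK.
interface IntegrationResponseSketch {
  statusCode: string;
  responseTemplates: { [contentType: string]: string };
}

interface IntegrationOptionsSketch {
  requestTemplates: { [contentType: string]: string };
  integrationResponses: IntegrationResponseSketch[];
}

// robots.txt payload: deny crawling for all user agents, with an explicit
// group for bingbot.
const robotsTxtBody = [
  'User-agent: *',
  'Disallow: /',
  '',
  'User-agent: bingbot',
  'Disallow: /',
].join('\n');

const robotsTxtIntegrationOptions: IntegrationOptionsSketch = {
  // The request template tells the mock endpoint to answer with status 200.
  requestTemplates: {
    'application/json': '{ "statusCode": 200 }',
  },
  integrationResponses: [
    {
      statusCode: '200',
      // The response template is keyed by the text/plain content type.
      responseTemplates: {
        'text/plain': robotsTxtBody,
      },
    },
  ],
};

// In a CDK app, the object would typically be passed to the mock integration:
//   const robotsTxtIntegration = new MockIntegration(robotsTxtIntegrationOptions);
```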

The request mapping template is required to propagate the 200 status code to the mock endpoint. The response template for the 200 status code needs to be defined for the text/plain content type. The response payload itself needs to be a valid robots.txt payload; the example denies crawling of the API for all user agents (and explicitly for bingbot, see the note).

Note: In the example above, there is a specific rule for bingbot to disable the crawling of the API. After discussions with the Bing support team, I learned that bingbot ignores the * user-agent rule and requires a specific rule for the bingbot user-agent. Maybe this information will come in handy!
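If more crawlers turn out to need their own rule, the payload can be generated rather than written by hand. The following is a small sketch using a hypothetical helper, buildDenyAllRobotsTxt, which is an assumption of this guide and not part of the CDK or any library. It emits the wildcard group plus one explicit deny-all group per listed user agent.

```typescript
// Hypothetical helper: builds a robots.txt body that denies all crawling,
// with one explicit group per listed user agent after the wildcard group.
function buildDenyAllRobotsTxt(explicitUserAgents: string[]): string {
  const groups = ['*', ...explicitUserAgents].map(
    (userAgent) => `User-agent: ${userAgent}\nDisallow: /`,
  );
  // Groups are separated by a blank line.
  return groups.join('\n\n');
}

// Wildcard group plus an explicit bingbot group.
const denyAllBody = buildDenyAllRobotsTxt(['bingbot']);
```

The resulting string could be used as the text/plain response template of the mock integration.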

Method Options

The method options definition can look like
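A minimal sketch of the method options, again as a plain TypeScript object. The Sketch type names and the EMPTY_MODEL_STAND_IN constant are assumptions made so the snippet runs without the CDK; in a real CDK app, Model.EMPTY_MODEL would take the place of the stand-in.

```typescript
// Stand-in for the CDK constant Model.EMPTY_MODEL so the sketch has no
// dependencies; the modelId value is an assumption of this sketch.
const EMPTY_MODEL_STAND_IN = { modelId: 'Empty' };

// Local structural types mirroring the field names of the CDK method options.
interface MethodResponseSketch {
  statusCode: string;
  responseModels: { [contentType: string]: { modelId: string } };
}

interface MethodOptionsSketch {
  methodResponses: MethodResponseSketch[];
}

const robotsTxtMethodOptions: MethodOptionsSketch = {
  methodResponses: [
    {
      statusCode: '200',
      // The response model is registered for the text/plain content type.
      responseModels: { 'text/plain': EMPTY_MODEL_STAND_IN },
    },
  ],
};
```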

Notice that the content type of the response model for the 200 status code is also text/plain. We can use the Model.EMPTY_MODEL constant, as we do not need to define any response model.

Putting It Together

Having the mock integration and method response, we can create the actual robots.txt resource and its methods. It can look like
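A minimal sketch of this step, assuming a resource object that offers addResource and addMethod calls, as the root resource of a CDK RestApi does. ResourceLike, addRobotsTxt, and the in-memory fake are hypothetical names introduced only so the snippet runs without the CDK; in a CDK stack, the root argument would presumably be api.root, and the other two arguments would be the mock integration and the method options.

```typescript
// Minimal structural view of an API Gateway resource: only the two calls
// this sketch needs.
interface ResourceLike {
  addResource(pathPart: string): ResourceLike;
  addMethod(httpMethod: string, integration?: unknown, options?: unknown): unknown;
}

// Creates the robots.txt resource and registers both GET and HEAD on it.
function addRobotsTxt(
  root: ResourceLike,
  integration: unknown,
  methodOptions: unknown,
): ResourceLike {
  const robotsTxt = root.addResource('robots.txt');
  for (const httpMethod of ['GET', 'HEAD']) {
    robotsTxt.addMethod(httpMethod, integration, methodOptions);
  }
  return robotsTxt;
}

// In-memory fake used only to record the calls made; not part of the CDK.
const recordedCalls: string[] = [];
const fakeRoot: ResourceLike = {
  addResource(pathPart) {
    recordedCalls.push(`resource ${pathPart}`);
    return fakeRoot;
  },
  addMethod(httpMethod) {
    recordedCalls.push(`method ${httpMethod}`);
    return undefined;
  },
};

addRobotsTxt(fakeRoot, {}, {});
```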

You need to define both the GET and HEAD methods of the robots.txt resource, as some crawlers might first execute the HEAD method to see the content length of the response payload.

Further Reading

https://docs.aws.amazon.com/apigateway/latest/developerguide/how-to-mock-integration.html
https://docs.aws.amazon.com/cdk/api/latest/docs/aws-apigateway-readme.html
https://developers.google.com/search/docs/advanced/robots/intro
https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/HEAD