AWS Step Function to Trigger GitHub Self-Hosted Runners

I lost hours of sleep trying to resolve an issue with an EventBridge Input Transformer: overriding an ECS container environment variable with a value sourced from an event payload. This type of transformation had already been covered during the setup for GitHub self-hosted ephemeral runners, where the event value was a string. Dealing with arrays proved more challenging.

I never managed to find a root cause for the issue. However, web search results describing input transformers behaving in a similar manner seemed to indicate the likelihood of a bug.

The aim of this blog post is to outline the procedure for reproducing the issue, along with the process for gathering relevant diagnostics.

When all else had failed, further exploration and effort led to AWS Step Functions as an alternative way of fulfilling the same requirement. A brief overview of how this was achieved is also covered.

Before moving on, it is strongly advised to familiarize yourself with the content in the series Scalable self-hosted github runners on AWS cloud.

Problem Overview

  • An existing EventBridge rule includes a pattern to filter for GitHub workflow jobs in “queued” status arriving on a designated EventBus
  • A matched event should trigger the rule target to perform the following:
    • invoke an ECS task, consisting of a single container and some predefined environment variables, including RUNNER_LABELS
    • an Input Transformer should override the value of this variable with the value of an array sourced from the event payload at JSON path $.detail.workflow_job.labels
    • the ECS container should initialize and serve as a GitHub self-hosted ephemeral runner
  • Despite adhering to the guidelines, the override resulted in failed invocations of the ECS task

Reproducing the Issue

Source Event Sample

Assuming an event bus named github-actions-event-bus, the following is a sample event payload (trimmed for brevity), as captured from the event bus:

{
    "version": "0",
    "id": "d61295ba-edc2-4c45-9c2f-96b26c37ff07",
    "detail-type": "workflow_job",
    "source": "github.com",
    "account": "xxxxxxxxxxxx",
    "detail": {
        "action": "queued",
        "workflow_job": {
            "id": 899012345,
            "run_id": 6666666666,
            "status": "queued",
            "name": "test_job",
            "workflow_name": "test",
            "labels": [
                "new",
                "one"
            ]
        },
        "organization": {
            "login": "some-github-org"
        }
    }
}

  • The payload path $.detail.workflow_job.labels holds the workflow job runner labels, an array with a value of ["new","one"]
  • Ultimately, the InputPathsMap and InputTemplate for the EventBridge rule Input Transformer should be configured to reference this path/value

EventBridge Rule

Rule Details

Below are the details of the EventBridge rule.

Rule Target Details

The properties for the target associated with the rule are:

Using the sample source event, the rule above should trigger the target and perform the following transformation:

  • Input transformer assigns the environment variable RUNNER_LABELS a value of ["new","one"] for the container github-actions-self-hosted-runner-debian-container
  • The task should then launch in the target cluster

And this is where the problem lies: the task invocation fails.

Root Cause Diagnosis

Without any hints offered by CloudTrail logs, I went down the path of attaching an SQS dead letter queue to the target of the EventBridge rule. This enables us to capture any unsuccessful event-to-target delivery attempts.

Image 26

Monitoring Failed Invocations

After creating the queue and performing an action to trigger the rule, the first step was to monitor the invocation metrics via the AWS Console:

  • "Amazon EventBridge" > Rules > github-actions-trigger-runner
  • review Monitoring details
Image 8
  • The FailedInvocations metric confirms an invocation failed
  • To find out why, we can go ahead and check the SQS dead letter queue.

Inspect SQS Dead Letter Queue

  • After polling for the latest messages on the queue, and examining the output, we see the following JSON for the ContainerOverrides:
Image 9
  • Looking at the contents under the Attributes tab, the reason for the failure is: ERROR_CODE: (INVALID JSON).
Image 10
  • Running the JSON through a linter returns: “Valid JSON”
Image 12
  • At this point, I can’t help but think that there is an inherent problem/bug with the input transformation mechanism. Either that, or I’ve completely misinterpreted the official guidelines.
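
The console steps above can also be scripted. The following is a minimal sketch, not the procedure used in this post: it assumes the dead letter queue URL is known (the queue_url parameter is a placeholder), that boto3 and AWS credentials are available, and that EventBridge records the failure reason in the ERROR_CODE and ERROR_MESSAGE message attributes. It also repeats the linter check by attempting to parse the message body.

```python
import json


def summarize_dlq_message(message: dict) -> dict:
    """Summarize one SQS message (in the shape returned by receive_message)."""
    attrs = message.get("MessageAttributes", {})

    def attr(name: str):
        # EventBridge stores failure details as SQS message attributes.
        return attrs.get(name, {}).get("StringValue")

    try:
        json.loads(message.get("Body", ""))
        body_is_valid_json = True
    except json.JSONDecodeError:
        body_is_valid_json = False

    return {
        "error_code": attr("ERROR_CODE"),
        "error_message": attr("ERROR_MESSAGE"),
        "body_is_valid_json": body_is_valid_json,
    }


def poll_dlq(queue_url: str, max_messages: int = 10) -> list:
    """Fetch messages from the dead letter queue and summarize each one."""
    import boto3  # imported lazily so the helper above works without AWS access

    sqs = boto3.client("sqs")
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,
        MessageAttributeNames=["All"],  # attributes are omitted unless requested
        WaitTimeSeconds=5,
    )
    return [summarize_dlq_message(m) for m in response.get("Messages", [])]
```

Calling poll_dlq with the queue URL returns one summary per message. A summary where body_is_valid_json is True alongside an invalid-JSON error code would mirror the contradiction described above.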

Change of Tactic: Use AWS Step Functions

After some background reading, I became aware of AWS Step Functions/State Machines and their potential advantages, especially when used with Amazon States Language and intrinsic functions.

For all examples to follow, we will use the same event listed in the section Source Event Sample.

To tackle the problem, I decided to take the following approach:

  • Create a new Step Functions State Machine
  • Create a new EventBridge rule, with the target being the new Step Functions state machine
  • Configure the rule to send part of the matched event to the state machine. In summary:
    • Configure target input: “Part of the matched event”
    • Specify part of the matched event: $.detail

In the steps above, we are essentially sending $.detail from the matched event to the state machine. Why? Because this allows us to access more of the source event data, and not just $.detail.workflow_job.labels.
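
To make the effect of selecting $.detail concrete, here is a small Python sketch that mimics what the state machine receives. The function name select_input_path_detail is made up for this illustration, and the event is a trimmed copy of the sample above.

```python
import json

# Trimmed copy of the sample event from the Source Event Sample section.
EVENT = {
    "detail-type": "workflow_job",
    "source": "github.com",
    "detail": {
        "action": "queued",
        "workflow_job": {"status": "queued", "labels": ["new", "one"]},
        "organization": {"login": "some-github-org"},
    },
}


def select_input_path_detail(event: dict) -> dict:
    """Mimic selecting "$.detail": only the detail object is forwarded."""
    return event["detail"]


state_input = select_input_path_detail(EVENT)

# Paths inside the state machine are now relative to the detail object,
# so both the labels and the organization login are reachable.
print(json.dumps(state_input["workflow_job"]["labels"]))
print(state_input["organization"]["login"])
```

Top-level fields such as source and detail-type are no longer visible to the state machine, which is acceptable here because the rule has already filtered on them.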

Provision Step Function State Machine

The Step Functions state machine can be created through the AWS Console. For convenience, however, a CloudFormation (CFN) template is used here to provision the infrastructure.

CloudFormation Template

A sample CloudFormation template is shown below:

cfn_statemachine_template.yaml
AWSTemplateFormatVersion: 2010-09-09
Description: Cloudformation stack for EventBridge with target of Step functions state machine
Parameters:
  StateMachinename:
    Type: String
    Description: machine name
    Default: state-machine-ephemeral-runner
  BusName:
    Type: String
    Description: Eventbus name
    Default: github-actions-event-bus
  EBRuleName:
    Type: String
    Description: Eventbridge rule name
    Default: trigger-state-machine-ephemeral-runner
Resources:
  IAMManagedPolicyCW:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: !Sub ${StateMachinename}-cw-policy
      Path: /
      PolicyDocument: |-
        {
          "Statement": [
              {
                  "Action": [
                      "logs:CreateLogDelivery",
                      "logs:GetLogDelivery",
                      "logs:UpdateLogDelivery",
                      "logs:DeleteLogDelivery",
                      "logs:ListLogDeliveries",
                      "logs:PutResourcePolicy",
                      "logs:DescribeResourcePolicies",
                      "logs:DescribeLogGroups"
                  ],
                  "Effect": "Allow",
                  "Resource": "*"
              }
          ],
          "Version": "2012-10-17"
        }
      Roles:
        - !Ref IAMRole
  IAMManagedPolicyECS:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: !Sub ${StateMachinename}-ecs-policy
      Path: /
      PolicyDocument: !Sub |
        {
            "Version": "2012-10-17",
            "Statement":
            [
                {
                    "Effect": "Allow",
                    "Action": ["ecs:RunTask"],
                    "Resource": [
                        "arn:${AWS::Partition}:ecs:${AWS::Region}:${AWS::AccountId}:task-definition/github-actions-self-hosted-runner-debian-task*",
                        "arn:${AWS::Partition}:ecs:${AWS::Region}:${AWS::AccountId}:task-definition/github-actions-self-hosted-runner-debian-task:*"
                    ]
                },
                {
                    "Effect": "Allow",
                    "Action": "iam:PassRole",
                    "Resource": ["*"],
                    "Condition": {
                        "StringLike": {
                            "iam:PassedToService": "ecs-tasks.amazonaws.com"
                        }
                    }
                },
                {
                    "Effect": "Allow",
                    "Action": [
                        "ecs:StopTask",
                        "ecs:DescribeTasks",
                        "transfer:DescribeExecution",
                        "states:DescribeExecution",
                        "states:StartExecution",
                        "states:StopExecution"
                    ],
                    "Resource": "*"
                },
                {
                    "Effect": "Allow",
                    "Action": [
                        "events:PutTargets",
                        "events:PutRule",
                        "events:DescribeRule"
                    ],
                    "Resource": [
                        "arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:rule/${EBRuleName}"
                    ]
                }
            ]
        }
      Roles:
        - !Ref IAMRole
  IAMManagedPolicyXRAY:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: !Sub ${StateMachinename}-xray-policy
      Path: /
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - xray:PutTraceSegments
              - xray:PutTelemetryRecords
              - xray:GetSamplingRules
              - xray:GetSamplingTargets
            Resource:
              - '*'
      Roles:
        - !Ref IAMRole
  IAMManagedPolicyEB:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: !Sub ${StateMachinename}-eb-policy
      Path: /
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - states:StartExecution
            Resource:
              - !Sub ${StepFunctionsStateMachine}
      Roles:
        - !Ref IAMRoleEB
  StepFunctionsStateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      StateMachineName: !Sub ${StateMachinename}
      DefinitionString: !Sub |-
        {
          "Comment": "State Machine for GitHub Ephemeral Runners",
          "StartAt": "Construct_Vars_EB_Payload",
          "States": {
            "Construct_Vars_EB_Payload": {
              "Type": "Pass",
              "Next": "TriggerEphemeralActionsRunner",
              "Parameters": {
                "org.$": "$.organization.login",
                "labels.$": "States.JsonToString($.workflow_job.labels)",
                "config_args": "--unattended --replace --disableupdate --no-default-labels --ephemeral"
              }
            },
            "TriggerEphemeralActionsRunner": {
              "Type": "Task",
              "Resource": "arn:${AWS::Partition}:states:::aws-sdk:ecs:runTask",
              "Parameters": {
                "Cluster": "arn:${AWS::Partition}:ecs:${AWS::Region}:${AWS::AccountId}:cluster/GitHub-Actions-Runners",
                "TaskDefinition": "arn:${AWS::Partition}:ecs:${AWS::Region}:${AWS::AccountId}:task-definition/github-actions-self-hosted-runner-debian-task",
                "Overrides": {
                  "ContainerOverrides": [
                    {
                      "Environment": [
                        {
                          "Name": "ORGANIZATION",
                          "Value.$": "$.org"
                        },
                        {
                          "Name": "RUNNER_CONFIG_ARGS",
                          "Value.$": "$.config_args"
                        },
                        {
                          "Name": "RUNNER_LABELS",
                          "Value.$": "$.labels"
                        }
                      ],
                      "Name": "github-actions-self-hosted-runner-debian-container"
                    }
                  ]
                }
              },
              "End": true
            }
          }
        }
      RoleArn: !GetAtt IAMRole.Arn
      StateMachineType: STANDARD
      LoggingConfiguration:
        Destinations:
          - CloudWatchLogsLogGroup:
              LogGroupArn: !GetAtt LogsLogGroup.Arn
        IncludeExecutionData: true
        Level: ERROR
  IAMRole:
    Type: AWS::IAM::Role
    Properties:
      Path: /service-role/
      RoleName: !Sub ${StateMachinename}-role
      AssumeRolePolicyDocument: |-
        {
           "Version": "2012-10-17",
           "Statement": [
              {
                 "Effect": "Allow",
                 "Principal": {
                    "Service": "states.amazonaws.com"
                 },
                 "Action": "sts:AssumeRole"
              }
           ]
        }
      MaxSessionDuration: 3600
  LogsLogGroup:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      LogGroupName: !Sub /aws/state-machines/${StateMachinename}/
  LogsLogStream:
    Type: AWS::Logs::LogStream
    Properties:
      LogGroupName: !Ref LogsLogGroup
      LogStreamName: !Sub log-stream-${StateMachinename}
  IAMRoleEB:
    Type: AWS::IAM::Role
    Properties:
      Path: /service-role/
      RoleName: !Sub ${StateMachinename}-eb-role
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: events.amazonaws.com
            Action: sts:AssumeRole
      MaxSessionDuration: 3600
  EventsRule:
    Type: AWS::Events::Rule
    Properties:
      Name: !Sub ${EBRuleName}
      EventBusName: !Sub ${BusName}
      RoleArn: !GetAtt IAMRoleEB.Arn
      EventPattern: !Sub |-
        {
        "detail": {
            "organization": {
                "login": ["some-github-org"]
            },
            "workflow_job": {
                "status": ["queued"]
            }
        },
        "detail-type": ["workflow_job"],
        "source": ["github.com"],
        "account": ["xxxxxxxxxxxx"]
        }
      State: ENABLED
      Targets:
        - Arn: !Ref StepFunctionsStateMachine
          Id: !GetAtt StepFunctionsStateMachine.Name
          InputPath: $.detail
          RoleArn: !GetAtt IAMRoleEB.Arn

The following key resources will be created once the stack is deployed successfully.

  • EventBridge Rule: trigger-state-machine-ephemeral-runner
    • Target: state-machine-ephemeral-runner
  • Step Functions State Machine: state-machine-ephemeral-runner
    • LogGroup Name: /aws/state-machines/state-machine-ephemeral-runner/

The following resources are assumed to already be present:

ECS Cluster:
arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:cluster/GitHub-Actions-Runners

ECS Task Definition:
arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:task-definition/github-actions-self-hosted-runner-debian-task

EventBridge BusName:
github-actions-event-bus
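
With those prerequisites in place, the template could be deployed in several ways. The sketch below uses boto3 and is only one option, not necessarily how the stack in this post was created. The stack name is a placeholder, and the template is assumed to be saved locally as cfn_statemachine_template.yaml. Because the template sets explicit names on its IAM roles and managed policies, the CAPABILITY_NAMED_IAM capability has to be acknowledged.

```python
from pathlib import Path


def build_create_stack_args(template_body: str, stack_name: str) -> dict:
    """Build keyword arguments for CloudFormation's CreateStack call."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        # The template sets RoleName and ManagedPolicyName explicitly,
        # so CloudFormation requires the named-IAM capability.
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
    }


def deploy(template_path: str = "cfn_statemachine_template.yaml",
           stack_name: str = "state-machine-ephemeral-runner-stack") -> str:
    """Create the stack, wait for completion, and return the stack ID."""
    import boto3  # lazy import: only needed when actually deploying

    cfn = boto3.client("cloudformation")
    args = build_create_stack_args(Path(template_path).read_text(), stack_name)
    stack_id = cfn.create_stack(**args)["StackId"]
    # Block until CloudFormation reports CREATE_COMPLETE (raises on failure).
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return stack_id
```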

CloudFormation Stack: State Machine

  • The state machine provisioned by the CFN stack, as viewed through the AWS Console Step Functions service, is shown below
Image 28
  • On the left-hand side is the state machine JSON definition (also listed below), and on the right is the visual representation of the workflow.

{
  "Comment": "State Machine for GitHub Ephemeral Runners",
  "StartAt": "Construct_Vars_EB_Payload",
  "States": {
    "Construct_Vars_EB_Payload": {
      "Type": "Pass",
      "Next": "TriggerEphemeralActionsRunner",
      "Parameters": {
        "org.$": "$.organization.login",
        "labels.$": "States.JsonToString($.workflow_job.labels)",
        "config_args": "--unattended --replace --disableupdate --no-default-labels --ephemeral"
      }
    },
    "TriggerEphemeralActionsRunner": {
      "Type": "Task",
      "Resource": "arn:${AWS::Partition}:states:::aws-sdk:ecs:runTask",
      "Parameters": {
        "Cluster": "arn:${AWS::Partition}:ecs:${AWS::Region}:${AWS::AccountId}:cluster/GitHub-Actions-Runners",
        "TaskDefinition": "arn:${AWS::Partition}:ecs:${AWS::Region}:${AWS::AccountId}:task-definition/github-actions-self-hosted-runner-debian-task",
        "Overrides": {
          "ContainerOverrides": [
            {
              "Environment": [
                {
                  "Name": "ORGANIZATION",
                  "Value.$": "$.org"
                },
                {
                  "Name": "RUNNER_CONFIG_ARGS",
                  "Value.$": "$.config_args"
                },
                {
                  "Name": "RUNNER_LABELS",
                  "Value.$": "$.labels"
                }
              ],
              "Name": "github-actions-self-hosted-runner-debian-container"
            }
          ]
        }
      },
      "End": true
    }
  }
}

A description of each state/task follows.

State name: Construct_Vars_EB_Payload

The state type is Pass, i.e. "Type": "Pass". Its purpose is to derive parameters, expressed as key:value pairs, using $.detail as input. Values are transformed/cast into the desired format using the States Language intrinsic functions.

      "Parameters": {
        "org.$": "$.organization.login",
        "labels.$": "States.JsonToString($.workflow_job.labels)",
        "config_args": "--unattended --replace --disableupdate --no-default-labels --ephemeral"
      }

For example, the following,

"labels.$": "States.JsonToString($.workflow_job.labels)",

retrieves the array at path $.workflow_job.labels (relative to the input, $.detail) and applies the States.JsonToString intrinsic function to serialize it to a JSON string. The .$ suffix on the key tells Step Functions to evaluate the value as a path or intrinsic function rather than a literal. The resulting field is named labels, so the next state can reference the string as $.labels.
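
As a rough illustration of what this Pass state produces, the following Python sketch approximates its output for the sample event. It is an emulation only: json.dumps with compact separators stands in for States.JsonToString, and the function name simply echoes the state name.

```python
import json


def construct_vars_eb_payload(detail: dict) -> dict:
    """Approximate the output of the Construct_Vars_EB_Payload Pass state."""
    return {
        "org": detail["organization"]["login"],
        # Stand-in for States.JsonToString: serialize the array to compact JSON text.
        "labels": json.dumps(detail["workflow_job"]["labels"], separators=(",", ":")),
        "config_args": "--unattended --replace --disableupdate --no-default-labels --ephemeral",
    }


detail = {
    "workflow_job": {"labels": ["new", "one"]},
    "organization": {"login": "some-github-org"},
}
output = construct_vars_eb_payload(detail)
print(output["labels"])  # ["new","one"]
```

The point to note is that labels is now a single string rather than an array, which is the form an ECS environment variable value requires.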

The following instruction contains the name of the next state to execute, in this case TriggerEphemeralActionsRunner:

...
      "Next": "TriggerEphemeralActionsRunner",
...

State name: TriggerEphemeralActionsRunner

From the following, we can infer that this state runs an ECS task:

  "Type": "Task",
  "Resource": "arn:${AWS::Partition}:states:::aws-sdk:ecs:runTask",

The derived parameters (passed from the preceding state) are resolved to their respective values within ContainerOverrides before the task is run.

So, for example, recalling that,

"labels.$": "States.JsonToString($.workflow_job.labels)"

comes from the parameter definitions passed from the preceding state (Construct_Vars_EB_Payload), the value of $.labels used in the following override:

...
"ContainerOverrides": [
{
  "Environment": [
...
    {
      "Name": "RUNNER_LABELS",
      "Value.$": "$.labels"
    }
  ],
...

would resolve to the value "[\"new\",\"one\"]".
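
The resolution step can be sketched in Python as well. This is a deliberately simplified model, not how Step Functions is implemented: the resolve helper is hypothetical and handles only keys ending in .$ whose values are top-level paths such as $.labels.

```python
import json


def resolve(template, state_input: dict):
    """Resolve keys ending in ".$" whose values are simple "$.name" paths.

    Only top-level paths are handled; intrinsic functions and nested
    paths are out of scope for this illustration.
    """
    if isinstance(template, list):
        return [resolve(item, state_input) for item in template]
    if not isinstance(template, dict):
        return template
    resolved = {}
    for key, value in template.items():
        if key.endswith(".$"):
            # "Value.$": "$.labels" becomes "Value": state_input["labels"]
            resolved[key[:-2]] = state_input[value[2:]]
        else:
            resolved[key] = resolve(value, state_input)
    return resolved


state_input = {"labels": json.dumps(["new", "one"], separators=(",", ":"))}
environment = [{"Name": "RUNNER_LABELS", "Value.$": "$.labels"}]

resolved_env = resolve(environment, state_input)
# Serialized as JSON, the inner quotes are escaped.
print(json.dumps(resolved_env[0]["Value"]))  # "[\"new\",\"one\"]"
```

The escaped quotes only appear when the value is displayed as JSON. The container itself receives the plain text ["new","one"] in RUNNER_LABELS.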

Test Step Function State Machine

Once again, perform an action to trigger the event.
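
If queuing a real workflow job is inconvenient, a synthetic event can be published to the bus instead. The sketch below is an assumption-laden alternative rather than the method used in this post: it relies on boto3 and credentials with events:PutEvents permission on the bus, and the helper names are made up. EventBridge stamps the event with the caller's account ID, so the account filter in the rule pattern must match that account. Since no job is actually queued in GitHub, this only exercises the path from rule to state machine to ECS task.

```python
import json


def build_test_entry(labels, org="some-github-org",
                     bus="github-actions-event-bus") -> dict:
    """Build a PutEvents entry resembling a queued workflow_job event."""
    detail = {
        "action": "queued",
        "workflow_job": {"status": "queued", "labels": list(labels)},
        "organization": {"login": org},
    }
    return {
        "Source": "github.com",
        "DetailType": "workflow_job",
        "Detail": json.dumps(detail),  # PutEvents expects Detail as a JSON string
        "EventBusName": bus,
    }


def send_test_event(labels) -> dict:
    """Publish the synthetic event and return the raw PutEvents response."""
    import boto3  # lazy import: only needed when actually sending

    events = boto3.client("events")
    return events.put_events(Entries=[build_test_entry(labels)])
```

A call such as send_test_event(["new", "one"]) returns a response whose FailedEntryCount field indicates whether the bus accepted the event.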

Check the task environment variables from the target Cluster’s task list:

Image 30

Check the state machine execution log:

Image 29