Scale an AWS Aurora cluster's writer node

13 Mar 2018 - Aaron Dodd

While there is no “autoscaling” for RDS, for adjusting an instance on a schedule, the AWS CLI can be used. For an Aurora cluster, when you resize the primary node (the writer), AWS fails writes over to one of the readers but never switches that role back (see the AWS doc for details on the failover settings and logic). This would leave the cluster in a state where the up-sized node becomes a “reader” and one of the smaller nodes would remain a “writer”.

In this use-case, I needed to increase the capacity of the writer for a few hours a day for a known ingestion event (the fleet of readers could remain the same size and number), but then decrease the writer afterwards. The standard “aws rds modify-db-instance” call works as expected, but after scaling the instance, Aurora still leaves a smaller reader in the cluster as the primary (writer).

Below is a script that does the resizing, then waits until the change has taken effect, then switches the “writer” back to the newly resized node.

Example usage:

scriptname.sh db.m4.xlarge

Where the parameter is the new shape to apply.

The cluster ID and node names used can be found in the RDS / Cluster page in the AWS console.

#!/bin/bash
instance_size=$1
cluster_id="mycoolcluster-xxxx"
primary_node="mycoolcluster-node"
region="us-east-1"
pending_status=""
check_for='"PendingModifiedValues": {},'

aws rds modify-db-instance --db-instance-identifier "${primary_node}" --db-instance-class "${instance_size}" --apply-immediately --region "${region}"

echo "Checking status for ${primary_node}..."
until [ "${pending_status}" != "" ]; do
    pending_status=$(aws rds describe-db-instances --db-instance-identifier "${primary_node}" --region "${region}" --output json | grep "${check_for}")
    echo "${primary_node} still pending changes, waiting."
    sleep 10
done

echo "Failing back to ${primary_node}"
# sometimes it seems pending status is removed but node is not yet ready for failback (likely pending-reboot but not shown in CLI response)
# so if an error occurs, keep trying (there's probably a better way to do this)
while [ $? -ne 0 ]; do
    aws rds failover-db-cluster --db-cluster-identifier "${cluster_id}" --target-db-instance-identifier "${primary_node}" --region "${region}"
    sleep 5
done

Gist

This can run as either a cron or part of the ingestion process.

There are likely better ways to actually check for the status.