Add metrics to CloudWatch

FEBRUARY 10, 2023

Here's a Python class that can track and push metrics to AWS CloudWatch.

Metrics are reset to their initial values on creation and when metrics are uploaded to CloudWatch.

# metrics.py
'''
A metrics class ready to track and push metrics to AWS CloudWatch.
'''

from datetime import datetime
import os
import boto3


# CloudWatch metrics namespace.
METRICS_NAMESPACE = 'my_metrics_namespace'

# Duration to wait between metric uploads.
METRICS_UPLOAD_THRESHOLD_SECONDS = 50


class Metrics:
    '''
    Holds metrics, serializes them to CloudWatch format,
    and ingests foreign metric values.
    '''

    def __init__(self):
        self.reset()

    def reset(self):
        '''
        Resets metric values and last upload time.
        '''
        self.last_upload_time = datetime.now()
        # Your custom metrics and initial values
        # Note that here we're using 'my_prefix' as
        # a custom prefix in case you want this class
        # to add a prefix namespace to all its metrics.
        self.my_prefix_first_metric = 0
        self.my_prefix_second_metric = 0

    def to_data(self):
        '''
        Serializes metrics and their values.
        '''
        def to_cloudwatch_format(name, value):
            return {'MetricName': name, 'Value': value}

        result = []
        for name, value in vars(self).items():
            if name != 'last_upload_time':
                result.append(to_cloudwatch_format(name, value))
        return result

    def ingest(self, metrics, prefix=''):
        '''
        Adds foreign metric values to this metrics object.
        '''
        input_metric_names = [attr for attr in dir(metrics)
                              if not callable(getattr(metrics, attr))
                              and not attr.startswith("__")]

        # Iterate through foreign keys and add metric values.
        for metric_name in input_metric_names:

            # Get value of foreign metric.
            input_metric_value = getattr(metrics, metric_name)

            # Get metric key (no leading underscore when prefix is empty).
            metric_key = f'{prefix}_{metric_name}' if prefix else metric_name

            # Get metric value.
            metric_value = getattr(self, metric_key)

            # Add foreign values to this metrics object.
            setattr(
                self,
                metric_key,
                input_metric_value + metric_value
            )

    def upload(self, force=False):
        '''
        Uploads metrics to CloudWatch when time since last
        upload is above a duration or when forced.
        '''

        # Get time elapsed since last upload.
        seconds_since_last_upload = \
            (datetime.now() - self.last_upload_time).total_seconds()

        # Only upload if duration is greater than threshold,
        # or when the force flag is set to True.
        if (seconds_since_last_upload > METRICS_UPLOAD_THRESHOLD_SECONDS
                or force):
            # Upload metrics to CloudWatch.
            cloudwatch = boto3.client(
                'cloudwatch',
                region_name=os.getenv('AWS_REGION')
            )
            cloudwatch.put_metric_data(
                Namespace=METRICS_NAMESPACE,
                MetricData=self.to_data()
            )
            # Reset metrics.
            self.reset()

To use this class, we just have to instantiate a metrics object, track some metrics, and upload them.

# Create a metrics object.
metrics = Metrics()

# Add values to its metrics.
metrics.my_prefix_first_metric += 3
metrics.my_prefix_second_metric += 1

# Upload metrics to CloudWatch.
metrics.upload(force=True)

If you're processing metrics at a fast pace, you don't want to upload them every single time you increase their value, or CloudWatch will start throttling your requests. In certain cases, AWS CloudWatch's limit is 5 transactions per second (TPS) per account or AWS Region. When this limit is reached, you'll receive a RateExceeded throttling error.

By calling metrics.upload() (that is, with force=False), we upload at most once every METRICS_UPLOAD_THRESHOLD_SECONDS (in this example, once every 50 seconds).
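The throttling decision itself is small enough to isolate and test. Here's a minimal sketch of the same threshold check as a standalone gate. UploadGate is a hypothetical helper, not part of the class above, and it swaps datetime.now() for time.monotonic(), which isn't affected by system clock adjustments. The clock is injectable so the 100-iteration loop can be simulated without sleeping.

```python
import time


class UploadGate:
    """Hypothetical helper: decides whether enough time has passed
    since the last upload, mirroring the check in Metrics.upload()."""

    def __init__(self, threshold_seconds=50, clock=time.monotonic):
        self.threshold_seconds = threshold_seconds
        self.clock = clock  # injectable so the logic can be tested
        self.last_upload = self.clock()

    def should_upload(self, force=False):
        elapsed = self.clock() - self.last_upload
        if force or elapsed > self.threshold_seconds:
            # Record the upload time only when we actually upload.
            self.last_upload = self.clock()
            return True
        return False


# Simulate one tick per second instead of calling time.sleep(1).
fake_now = [0.0]
gate = UploadGate(threshold_seconds=50, clock=lambda: fake_now[0])

uploads = []
for second in range(1, 101):
    fake_now[0] = float(second)
    if gate.should_upload():
        uploads.append(second)

print(uploads)  # [51]
```

Over 100 simulated seconds, the gate opens once (at second 51), and the final forced upload after the loop flushes whatever accumulated since then.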

import time

# Create a metrics object.
metrics = Metrics()

for i in range(100):
    # Wait for illustration purposes,
    # as if we were doing work.
    time.sleep(1)

    # Add values to its metrics.
    metrics.my_prefix_first_metric += 3
    metrics.my_prefix_second_metric += 1

    # Only upload if more than the threshold
    # duration has passed since we last uploaded.
    metrics.upload()

# Force-upload metrics to CloudWatch once we're done.
metrics.upload(force=True)

Lastly, here's how to ingest foreign metrics with or without a prefix.

# We define a foreign metrics class.
class OtherMetrics:

    def __init__(self):
        self.reset()

    def reset(self):
        # Note that here we don't have 'my_prefix'.
        self.first_metric = 0
        self.second_metric = 0

# We instantiate both metric objects.
metrics = Metrics()
other_metrics = OtherMetrics()

# The foreign metrics track values.
other_metrics.first_metric += 15
other_metrics.second_metric += 3

# Then our main metrics class ingests those metrics.
metrics.ingest(other_metrics, prefix='my_prefix')

# Then our main metrics class has those values.
print(metrics.my_prefix_first_metric)
# Prints 15

print(metrics.my_prefix_second_metric)
# Prints 3
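The merge step can also be written as a standalone function. This is a minimal sketch, and ingest_counters, Target, and Source are hypothetical names used for illustration. It uses vars() instead of dir(), since vars() returns only instance attributes, so methods like reset() are skipped without a callable() check. Names map to f'{prefix}_{name}' only when a prefix is given, so an empty prefix keeps them unchanged.

```python
def ingest_counters(target, source, prefix=''):
    """Add each instance attribute of `source` onto `target`."""
    for name, value in vars(source).items():
        key = f'{prefix}_{name}' if prefix else name
        # Raises AttributeError if the target doesn't track this metric.
        current = getattr(target, key)
        setattr(target, key, current + value)


class Target:
    def __init__(self):
        self.my_prefix_first_metric = 1
        self.first_metric = 0


class Source:
    def __init__(self):
        self.first_metric = 15


target = Target()
ingest_counters(target, Source(), prefix='my_prefix')  # with a prefix
ingest_counters(target, Source())                      # without a prefix
print(target.my_prefix_first_metric, target.first_metric)  # 16 15
```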

If you found this useful, let me know!


AWS announces the Resource Explorer

NOVEMBER 10, 2022

Earlier this week, Amazon AWS announced yet another service release, this time called Resource Explorer.

AWS Resource Explorer [is] a managed capability that simplifies the search and discovery of resources, such as Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Kinesis streams, and Amazon DynamoDB tables, across AWS Regions in your AWS account. AWS Resource Explorer is available at no additional charge to you.

Start your resource search in the AWS Resource Explorer console, the AWS Command Line Interface (AWS CLI), the AWS SDKs, or the unified search bar from wherever you are in the AWS Management Console. From the search results displayed in the console, you can go to your resource’s service console and Region with a single step and take action.

To turn on AWS Resource Explorer, see the AWS Resource Explorer console. Read about getting started in our AWS Resource Explorer documentation, or explore the AWS Resource Explorer product page.

How SageMaker's model_dir, output_dir, and output_data_dir Parameters Work

DECEMBER 2, 2021

SageMaker can be quite confusing. Here are some notes I took while learning how the model and output parameters work.

model_dir is provided as an Estimator function parameter.
output_dir and output_data_dir are provided as Estimator hyperparameters.

(See how to provide these arguments in code below.)

After a successful run, whatever is saved to each of these directories will be uploaded to a specific S3 location within your job's folder and bucket.

model.tar.gz will contain files saved to /opt/ml/model
output.tar.gz will contain files saved to /opt/ml/output and (inside of the data subfolder) the files saved to /opt/ml/output/data

Here's the sample directory tree with a train.py entry point that saves a text file to each of these locations.

# Files saved to /opt/ml/model/
model.tar.gz
    model.txt

# Files saved to /opt/ml/output/
output.tar.gz
    output.txt
    success
    # Files saved to /opt/ml/output/data/
    data/
        output_data.txt

# Files in the Estimator's source_dir
source/
    sourcedir.tar.gz
        # All files in your source_dir
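To confirm what SageMaker uploaded, you can download the archives from your job's S3 folder (for example, with aws s3 cp) and list their contents. Here's a minimal sketch using Python's tarfile module. Since it can't reach S3, it builds a hypothetical stand-in for output.tar.gz that mimics the tree above. A real archive may also include directory entries such as data/.

```python
import os
import tarfile
import tempfile


def list_archive(path):
    """Return the sorted member names stored in a .tar.gz archive."""
    with tarfile.open(path, 'r:gz') as archive:
        return sorted(archive.getnames())


# Hypothetical stand-in for a downloaded output.tar.gz, so this runs offline.
members = ['output.txt', 'success', 'data/output_data.txt']
workdir = tempfile.mkdtemp()
os.makedirs(os.path.join(workdir, 'data'))
for member in members:
    with open(os.path.join(workdir, member), 'w') as f:
        f.write('')  # empty files are enough to check the layout

archive_path = os.path.join(tempfile.mkdtemp(), 'output.tar.gz')
with tarfile.open(archive_path, 'w:gz') as archive:
    for member in members:
        archive.add(os.path.join(workdir, member), arcname=member)

print(list_archive(archive_path))
# ['data/output_data.txt', 'output.txt', 'success']
```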

Here's how you'd override these locations in your Estimator.

# Create a TensorFlow Estimator
estimator = sagemaker.tensorflow.estimator.TensorFlow(
    ...
    model_dir='/opt/ml/model',
    hyperparameters={
        'output_data_dir': '/opt/ml/output/data/',
        'output_dir': '/opt/ml/output/',
    },
    ...
)

And here's how you'd read their values inside of your entry point, e.g., train.py. Even if you don't pass these three values to your Estimator or its hyperparameters, your entry point script can capture them by defaulting to SageMaker's environment variables SM_MODEL_DIR, SM_OUTPUT_DIR, and SM_OUTPUT_DATA_DIR, which default to /opt/ml/model, /opt/ml/output, and /opt/ml/output/data, respectively.

import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get('SM_MODEL_DIR'),
                        help="Directory to save model files.")
    parser.add_argument("--output_dir", type=str,
                        default=os.environ.get('SM_OUTPUT_DIR'),
                        help="Directory to save output artifacts.")
    parser.add_argument("--output_data_dir", type=str,
                        default=os.environ.get('SM_OUTPUT_DATA_DIR'),
                        help="Directory to save output data artifacts.")

    opt = parser.parse_args()
    print(f'model_dir › {opt.model_dir}')
    print(f'output_dir › {opt.output_dir}')
    print(f'output_data_dir › {opt.output_data_dir}')
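Outside of SageMaker, where the SM_* variables aren't set, the defaults above resolve to None. Here's a minimal sketch that falls back to the /opt/ml paths listed above instead. The parse_dirs function is hypothetical, and it takes an optional argv list so it can be exercised without a real command line.

```python
import argparse
import os


def parse_dirs(argv=None):
    """Resolve SageMaker directories: explicit flags first, then the
    SM_* environment variables, then the default /opt/ml paths."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--output_dir', type=str,
                        default=os.environ.get('SM_OUTPUT_DIR', '/opt/ml/output'))
    parser.add_argument('--output_data_dir', type=str,
                        default=os.environ.get('SM_OUTPUT_DATA_DIR',
                                               '/opt/ml/output/data'))
    return parser.parse_args(argv)


opt = parse_dirs(['--model_dir', './model'])
print(opt.model_dir)   # ./model (explicit flag wins)
print(opt.output_dir)  # SM_OUTPUT_DIR if set, otherwise /opt/ml/output
```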

Testing this functionality, I saved a text file to each of these locations to see what the SageMaker SDK was uploading to S3. (The resulting directory structure can be seen above.)

# Save a text file to model_dir
with open(os.path.join(opt.model_dir, 'model.txt'), 'w') as f:
    f.write('Contents of model.txt!')

# Save a text file to output_dir
with open(os.path.join(opt.output_dir, 'output.txt'), 'w') as f:
    f.write('Contents of output.txt!')

# Save a text file to output_data_dir
with open(os.path.join(opt.output_data_dir, 'output_data.txt'), 'w') as f:
    f.write('Contents of output_data.txt!')

What's the difference between output_dir and output_data_dir?

SageMaker provides two different folders: the parent output folder and the output data subfolder. According to official AWS GitHub samples, output_dir is the directory where training success or failure indications are written (an empty file named either success or failure), while output_data_dir is reserved for non-model artifacts, such as diagrams, TensorBoard logs, or any other artifacts you want to generate during training.

I hope the read was helpful!

List Lambda Functions and Lambda Layers with the AWS CLI

JULY 15, 2021

Here are a few helper functions to list Lambda functions and layers (and to count them) using the AWS Command Line Interface (AWS CLI) to inspect the serverless resources of your Amazon Web Services (AWS) account.

Listing Lambda Layers of a Function

aws lambda get-function --function-name {name|arn} | \
jq .Configuration.Layers

[
  {
    "Arn": "arn:aws:lambda:us-west-2:00000000:layer:layer-name:1",
    "CodeSize": 1231231
  }
]

Counting Lambda Layers of a Function

aws lambda get-function --function-name {name|arn} | \
jq '.Configuration.Layers | length'

# Returns 1 (or number of layers attached to function)

Counting Lambda Layers in an AWS account

aws lambda list-layers | \
jq '.Layers | length'

# Returns 4 (or number of layers in your account)

Listing All Layers in an AWS account

aws lambda list-layers

{
    "Layers": [
        {
            "LayerName": "layer-name",
            "LayerArn": "arn:aws:lambda:us-west-2:0123456789:layer:layer-name",
            "LatestMatchingVersion": {
                "LayerVersionArn": "arn:aws:lambda:us-west-2:0123456789:layer:layer-name:1",
                "Version": 1,
                "Description": "Layer Description",
                "CreatedDate": "2021-07-14T14:00:27.370+0000",
                "CompatibleRuntimes": [
                    "python3.7"
                ],
                "LicenseInfo": "MIT"
            }
        },
        {
            "LayerName": "another-layer-name",
            "LayerArn": "arn:aws:lambda:us-west-2:0123456789:layer:another-layer-name",
            "LatestMatchingVersion": {
                "LayerVersionArn": "arn:aws:lambda:us-west-2:0123456789:layer:another-layer-name:4",
                "Version": 4,
                "Description": "Layer Description",
                "CreatedDate": "2021-07-14T11:41:45.520+0000",
                "CompatibleRuntimes": [
                    "python3.6"
                ],
                "LicenseInfo": "MIT"
            }
        }
    ]
}

Listing Lambda Functions in an AWS account

aws lambda list-functions

{
    "Functions": [
        {
            "FunctionName": "function-name",
            "FunctionArn": "arn:aws:lambda:us-west-2:0123456789:function:function-name",
            "Runtime": "python3.7",
            "Role": "arn:aws:iam::0123456789:role/role-name",
            "Handler": "lambda_function.lambda_handler",
            "CodeSize": 1234,
            "Description": "Function description.",
            "Timeout": 30,
            "MemorySize": 128,
            "LastModified": "2021-07-14T16:48:19.052+0000",
            "CodeSha256": "28ua8s0aw0820492r=",
            "Version": "$LATEST",
            "Environment": {
                "Variables": {
                }
            },
            "TracingConfig": {
                "Mode": "PassThrough"
            },
            "RevisionId": "1b0be4c3-4eb6-4254-9061-050702646940",
            "Layers": [
                {
                    "Arn": "arn:aws:lambda:us-west-2:0123456789:layer:layer-name:1",
                    "CodeSize": 1563937
                }
            ],
            "PackageType": "Zip"
        }
    ]
}
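If you want the reverse view (which functions use a given layer), you can save the output of aws lambda list-functions to a file and post-process it. Here's a minimal sketch in Python. The sample input is trimmed from the output above, plus a hypothetical no-layers-function entry, since functions without layers may omit the Layers key.

```python
import json
from collections import defaultdict


def functions_by_layer(list_functions_json):
    """Map each layer version ARN to the functions that use it,
    given the JSON printed by `aws lambda list-functions`."""
    usage = defaultdict(list)
    for function in json.loads(list_functions_json)['Functions']:
        # Functions without layers may not have a "Layers" key at all.
        for layer in function.get('Layers', []):
            usage[layer['Arn']].append(function['FunctionName'])
    return dict(usage)


sample = '''
{"Functions": [
  {"FunctionName": "function-name",
   "Layers": [{"Arn": "arn:aws:lambda:us-west-2:0123456789:layer:layer-name:1",
               "CodeSize": 1563937}]},
  {"FunctionName": "no-layers-function"}
]}
'''
print(functions_by_layer(sample))
# {'arn:aws:lambda:us-west-2:0123456789:layer:layer-name:1': ['function-name']}
```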

Invoke an AWS Lambda Function with awscli

JANUARY 19, 2021

Here's how to execute a deployed AWS Lambda function with the AWS command-line interface.

Create a payload.json file that contains a JSON payload.

{
  "foo": "bar"
}

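Then convert the payload to base64, since AWS CLI version 2 expects binary parameters such as --payload to be base64-encoded by default.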

base64 payload.json
# returns ewogICJmb28iOiAiYmFyIgp9Cg==

And replace the contents of payload.json with that base64 string.

ewogICJmb28iOiAiYmFyIgp9Cg==
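If you'd rather generate the encoded payload from a script, here's a minimal sketch using Python's standard library. The encode_payload helper is hypothetical. It appends a trailing newline to match the payload.json file above, which is why the result ends in "Cg==" (the encoded newline).

```python
import base64
import json


def encode_payload(payload):
    """Return the base64 text of a JSON payload, formatted like payload.json."""
    text = json.dumps(payload, indent=2) + '\n'
    return base64.b64encode(text.encode('utf-8')).decode('ascii')


print(encode_payload({'foo': 'bar'}))  # ewogICJmb28iOiAiYmFyIgp9Cg==
```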

Invoke your Lambda function using that payload.

aws lambda invoke \
--function-name My-Lambda-Function-Name \
--payload file://payload.json \
output.json

The invocation's status (such as its StatusCode) will be printed in the console, and the function's response will be saved in output.json.

If you're developing locally, you can use the aws lambda update-function-code command to synchronize your local code with your Lambda function.

Update AWS Lambda Function Code with awscli

NOVEMBER 4, 2020

The Amazon Web Services (AWS) command-line interface, the AWS CLI, lets you update the code of a Lambda function right from your terminal. Here's how.

aws lambda update-function-code \
--function-name my-function-name \
--region us-west-2 \
--zip-file fileb://lambda.zip

Let's understand what you need to run this command.

aws lambda update-function-code - to run this command, you need the AWS CLI installed on your machine and your credentials configured for your account
--function-name - this is the name of an existing Lambda function in your AWS account
--region - the Region where your Lambda function lives (in this case, Oregon, whose code is us-west-2; you can see a list of Regions and their codes here)
--zip-file - this is the path to your zipped Lambda code, with the fileb:// prefix (in the example, there's a lambda.zip file in the current directory; alternatively, you can use --s3-bucket and --s3-key to use a zip file stored in an S3 bucket)
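For Python functions, the handler module (for example, lambda_function.py for a handler named lambda_function.lambda_handler) needs to sit at the root of the zip, so you zip the folder's contents rather than the folder itself. Here's a minimal sketch of building lambda.zip with Python's zipfile module. The build_lambda_zip helper and the one-file project are hypothetical.

```python
import os
import tempfile
import zipfile


def build_lambda_zip(source_dir, zip_path):
    """Zip every file under source_dir with paths relative to it,
    so top-level modules land at the root of the archive."""
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_path


# Hypothetical project with a single handler file.
src = tempfile.mkdtemp()
with open(os.path.join(src, 'lambda_function.py'), 'w') as f:
    f.write('def lambda_handler(event, context):\n    return event\n')

# Write the zip outside src so it doesn't include itself.
zip_path = build_lambda_zip(src, os.path.join(tempfile.mkdtemp(), 'lambda.zip'))
with zipfile.ZipFile(zip_path) as archive:
    print(archive.namelist())  # ['lambda_function.py']
```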

After your function code has been updated, you can invoke the Lambda function to verify everything is working as expected.

If you want to learn more about this command, here's the AWS CLI command reference guide, and here's the free Kindle version. Among other things, the aws lambda commands let you create Lambda layer versions, invoke functions, and much more.

Amazon SageMaker Lifecycle Configurations and Custom Kernel Environments

LAST UPDATED JUNE 3, 2020

2020.06.03

I've found that if creating or starting a notebook takes longer than 5 minutes, the notebook will fail, and re-creating the conda environment every time you start an existing notebook makes the wait really long. Another solution, which I now prefer, is to use the persistent-conda-ebs scripts (on-create.sh and on-start.sh) provided by Amazon SageMaker as examples. In short, on-create downloads Miniconda and creates an environment with whatever Python version you choose. You can customize that environment (say, by installing Python packages with pip or conda inside of it), and it persists across sessions. Future starts only run the on-start script, which has your notebook running in 1–2 minutes. Hope that helps! That's the way I'm using lifecycle configurations now.

2020.03.24

Here's something I learned about Amazon SageMaker today at work.

You can create notebook instances with different instance types (say, ml.t2.medium or ml.p3.2xlarge) and use a set of kernels that have been set up for you. These are conda (Anaconda) environments exposed as Jupyter notebook kernels that execute the commands you write in your Python notebook.

What I learned today is that you can create your own conda environments and expose them as kernels, so you're not limited to the kernels offered by Amazon AWS.

This is the sample environment I set up today. These commands should be run in a Terminal window on a SageMaker notebook instance, but they'll most likely run in any environment with conda installed.

# Create new conda environment named env_tf210_p36
$ conda create --name env_tf210_p36 python=3.6 tensorflow-gpu=2.1.0 ipykernel tensorflow-datasets matplotlib pillow keras

# Enable conda on bash
$ echo ". /home/ec2-user/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc

# Enter bash (if you're not already running in bash)
$ bash

# Activate your freshly created environment
$ conda activate env_tf210_p36

# Install GitHub dependencies
$ pip install git+https://github.com/tensorflow/examples.git

# Now you have your environment set up - Party!
# ..

# When you're ready to leave
$ conda deactivate

How do we expose our new conda environment as a SageMaker kernel?

# Activate the conda environment (as it has ipykernel installed)
$ conda activate env_tf210_p36

# Expose your conda environment with ipykernel
$ python -m ipykernel install --user --name env_tf210_p36 --display-name "My Env (tf_2.1.0 py_3.6)"

After reloading your notebook instance you should see your custom environment appear in the launcher and in the notebook kernel selector.
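Running python -m ipykernel install --user writes a kernel.json file into your user's Jupyter data directory, which on Linux is typically ~/.local/share/jupyter/kernels/<kernel-name>/. From a terminal, jupyter kernelspec list shows the same information. Here's a minimal sketch that reads those files from Python. The list_kernels helper is hypothetical, and a temporary folder stands in for the real kernels directory so the example runs anywhere.

```python
import json
import os
import tempfile


def list_kernels(kernels_dir):
    """Return {kernel_name: display_name} for each kernel.json found
    one level below kernels_dir (one subfolder per kernel)."""
    kernels = {}
    if not os.path.isdir(kernels_dir):
        return kernels
    for name in sorted(os.listdir(kernels_dir)):
        spec_path = os.path.join(kernels_dir, name, 'kernel.json')
        if os.path.isfile(spec_path):
            with open(spec_path) as f:
                kernels[name] = json.load(f).get('display_name', name)
    return kernels


# On a notebook instance you might pass
# os.path.expanduser('~/.local/share/jupyter/kernels') instead.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, 'env_tf210_p36'))
with open(os.path.join(demo_dir, 'env_tf210_p36', 'kernel.json'), 'w') as f:
    json.dump({'argv': ['python', '-m', 'ipykernel_launcher',
                        '-f', '{connection_file}'],
               'display_name': 'My Env (tf_2.1.0 py_3.6)',
               'language': 'python'}, f)

print(list_kernels(demo_dir))
# {'env_tf210_p36': 'My Env (tf_2.1.0 py_3.6)'}
```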

What if you don't want to repeat this process over and over and over?

You can create a lifecycle configuration on SageMaker that will run this initial environment creation setup every time you create a new notebook instance. (You create a new Lifecycle Configuration and paste the following code inside of the Create Notebook tab.)


#!/bin/bash

set -e

# OVERVIEW
# This script creates and configures the env_tf210_p36 environment.

sudo -u ec2-user -i <<EOF

echo ". /home/ec2-user/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc

# Create custom conda environment
conda create --name env_tf210_p36 python=3.6 tensorflow-gpu=2.1.0 ipykernel tensorflow-datasets matplotlib pillow keras -y

# Activate our freshly created environment
source /home/ec2-user/anaconda3/bin/activate env_tf210_p36

# Install git-repository dependencies
pip install -q git+https://github.com/tensorflow/examples.git

# Expose environment as kernel
python -m ipykernel install --user --name env_tf210_p36 --display-name My_Env_tf_2.1.0_py_3.6

# Deactivate environment
source /home/ec2-user/anaconda3/bin/deactivate

EOF

That way you won't have to set up each new notebook instance you create; you'll just have to pick the lifecycle configuration you created. Take a look at the Amazon SageMaker notebook instance Lifecycle Configuration samples.
