SageMaker can be quite confusing. Here are some notes I took while learning how the model and output parameters work.
model_dir is provided as a parameter to the Estimator itself, while output_dir and output_data_dir are provided as Estimator hyperparameters. (See how to provide these arguments in code below.)
After a successful run, whatever is saved to each of these directories will be uploaded to a specific S3 location within your job's folder and bucket.
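For example, with the SageMaker default bucket and a hypothetical training job name (both placeholders here; the exact prefix also depends on the Estimator's output_path and SDK version), the archives land somewhere like:
s3://sagemaker-<region>-<account-id>/<training-job-name>/output/model.tar.gz
s3://sagemaker-<region>-<account-id>/<training-job-name>/output/output.tar.gz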
model.tar.gz will contain the files saved to /opt/ml/model. output.tar.gz will contain the files saved to /opt/ml/output and, inside its data subfolder, the files saved to /opt/ml/output/data.
Here's the resulting directory tree from a train.py entry point that saves a text file to each of these locations.
# Files saved to /opt/ml/model/
model.tar.gz
    model.txt

# Files saved to /opt/ml/output/
output.tar.gz
    output.txt
    success
    # Files saved to /opt/ml/output/data/
    data/
        output_data.txt

# Files in the Estimator's source_dir
source/
    sourcedir.tar.gz
        # All files in your source_dir
Here's how you'd override these locations in your Estimator.
# Create a TensorFlow Estimator
estimator = sagemaker.tensorflow.estimator.TensorFlow(
    ...
    model_dir='/opt/ml/model',
    hyperparameters={
        'output_data_dir': '/opt/ml/output/data/',
        'output_dir': '/opt/ml/output/',
    },
    ...
)
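For context, here's a fuller, hypothetical version of the same Estimator showing how it ties to the train.py entry point and source_dir above. The role, instance type, and framework/Python versions are placeholders, and the parameter names assume SageMaker Python SDK v2 (v1 used train_instance_count/train_instance_type instead):

from sagemaker.tensorflow import TensorFlow

# All values below are illustrative placeholders.
estimator = TensorFlow(
    entry_point='train.py',              # the script shown below
    source_dir='source',                 # uploaded to S3 as sourcedir.tar.gz
    role='arn:aws:iam::123456789012:role/MySageMakerRole',  # hypothetical role
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.1.0',
    py_version='py3',
    model_dir='/opt/ml/model',
    hyperparameters={
        'output_data_dir': '/opt/ml/output/data/',
        'output_dir': '/opt/ml/output/',
    },
)

# Kick off the training job that produces the artifacts described above.
estimator.fit()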
And here's how you'd read their values inside your entry point, e.g., train.py. Note that even if you don't pass these three variables to your Estimator and its hyperparameters, you can capture them in your entry point script by defaulting to the SageMaker environment variables SM_MODEL_DIR, SM_OUTPUT_DIR, and SM_OUTPUT_DATA_DIR, which default to /opt/ml/model, /opt/ml/output, and /opt/ml/output/data, respectively.
import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get('SM_MODEL_DIR'),
                        help="Directory to save model files.")
    parser.add_argument("--output_dir", type=str,
                        default=os.environ.get('SM_OUTPUT_DIR'),
                        help="Directory to save output artifacts.")
    parser.add_argument("--output_data_dir", type=str,
                        default=os.environ.get('SM_OUTPUT_DATA_DIR'),
                        help="Directory to save output data artifacts.")
    opt = parser.parse_args()

    print(f'model_dir › {opt.model_dir}')
    print(f'output_dir › {opt.output_dir}')
    print(f'output_data_dir › {opt.output_data_dir}')
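With the overrides above in place, I'd expect the training log to print something like the following, since SageMaker passes the model_dir argument and the hyperparameters to the script as command-line arguments (my assumption; the values simply echo what was configured above):

model_dir › /opt/ml/model
output_dir › /opt/ml/output/
output_data_dir › /opt/ml/output/data/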
Testing this functionality, I saved a text file to each of these locations to see what the SageMaker SDK was uploading to S3. (The resulting directory structure can be seen above.)
# Save a text file to model_dir
with open(os.path.join(opt.model_dir, 'model.txt'), 'w') as f:
    f.write('Contents of model.txt!')

# Save a text file to output_dir
with open(os.path.join(opt.output_dir, 'output.txt'), 'w') as f:
    f.write('Contents of output.txt!')

# Save a text file to output_data_dir
with open(os.path.join(opt.output_data_dir, 'output_data.txt'), 'w') as f:
    f.write('Contents of output_data.txt!')
SageMaker provides two different folders: the parent output folder and the output data subfolder. According to official AWS GitHub samples, output_dir is the directory where the training success/failure indication is written (an empty file named either success or failure), while output_data_dir is reserved for non-model artifacts, such as diagrams, TensorBoard logs, or any other artifacts you want to generate during the training process.
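As a concrete sketch of that split (assuming matplotlib is available in the training container, and reusing opt from the argparse snippet above), train.py could save a loss plot to output_data_dir so it ends up in the data folder of output.tar.gz, while the model itself still goes to model_dir:

import os

import matplotlib
matplotlib.use('Agg')  # headless backend; training containers have no display
import matplotlib.pyplot as plt

# Placeholder loss values; a real job would collect these during training.
losses = [0.9, 0.5, 0.3, 0.2]

plt.plot(losses)
plt.title('Training loss')

# opt.output_data_dir comes from the argparse snippet above.
plt.savefig(os.path.join(opt.output_data_dir, 'loss.png'))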
I hope the read was helpful!
2020.06.03
I've found that if creating or starting a notebook takes longer than 5 minutes, the notebook will fail; on top of that, re-creating the conda environment every time you start an existing notebook makes the wait really long. The solution I now prefer is the persistent-conda-ebs scripts (on-create.sh and on-start.sh) that Amazon SageMaker provides as examples. In short, on-create downloads Miniconda and creates an environment with whatever Python version you choose; you can then customize that environment (say, by installing Python packages with pip or conda inside of it), and it persists across sessions, so future starts only run the on-start script and have your notebook running in 1–2 minutes. Hope that helps! That's the way I'm using lifecycle configurations now.
2020.03.24
Here's something I learned about Amazon SageMaker today at work.
You can create notebook instances with different instance types (say, ml.t2.medium or ml.p3.2xlarge) and use a set of kernels that have been set up for you. These are conda (Anaconda) environments exposed as Jupyter notebook kernels that execute the commands you write in the Python notebook.
What I learned today is that you can create your own conda environments and expose them as kernels, so you're not limited to the kernels offered by AWS.
This is the sample environment I set up today. These commands should be run in a terminal window on a SageMaker notebook instance, but they will most likely work in any environment with conda installed.
# Create new conda environment named env_tf210_p36
$ conda create --name env_tf210_p36 python=3.6 tensorflow-gpu=2.1.0 ipykernel tensorflow-datasets matplotlib pillow keras
# Enable conda on bash
$ echo ". /home/ec2-user/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc
# Enter bash (if you're not already running in bash)
$ bash
# Activate your freshly created environment
$ conda activate env_tf210_p36
# Install GitHub dependencies
$ pip install git+https://github.com/tensorflow/examples.git
# Now you have your environment set up - Party!
# ..
# When you're ready to leave
$ conda deactivate
How do we expose our new conda environment as a SageMaker kernel?
# Activate the conda environment (as it has ipykernel installed)
$ conda activate env_tf210_p36
# Expose your conda environment with ipykernel
$ python -m ipykernel install --user --name env_tf210_p36 --display-name "My Env (tf_2.1.0 py_3.6)"
After reloading your notebook instance you should see your custom environment appear in the launcher and in the notebook kernel selector.
What if you don't want to repeat this process over and over and over?
You can create a lifecycle configuration on SageMaker that will run this initial environment-creation setup every time you create a new notebook instance. (Create a new lifecycle configuration and paste the following script into the Create notebook tab.)
#!/bin/bash
set -e
# OVERVIEW
# This script creates and configures the env_tf210_p36 environment.
sudo -u ec2-user -i <<EOF
echo ". /home/ec2-user/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc
# Create custom conda environment
conda create --name env_tf210_p36 python=3.6 tensorflow-gpu=2.1.0 ipykernel tensorflow-datasets matplotlib pillow keras -y
# Activate our freshly created environment
source /home/ec2-user/anaconda3/bin/activate env_tf210_p36
# Install git-repository dependencies
pip install -q git+https://github.com/tensorflow/examples.git
# Expose environment as kernel
python -m ipykernel install --user --name env_tf210_p36 --display-name My_Env_tf_2.1.0_py_3.6
# Deactivate environment
source /home/ec2-user/anaconda3/bin/deactivate
EOF
That way you won't have to set up each new notebook instance you create; you'll just have to pick the lifecycle configuration you created. Take a look at the Amazon SageMaker notebook instance Lifecycle Configuration samples.