SageMaker Image Classification preprocessing

The Amazon SageMaker image classification algorithm is a supervised learning algorithm that supports multi-label classification. It takes an image as input and outputs one or more labels assigned to that image. It uses a convolutional neural network (ResNet) that can be trained from scratch or trained using transfer learning when a large number of training images are not available.

The recommended input format for the Amazon SageMaker image classification algorithms is Apache MXNet RecordIO. However, you can also use raw images in .jpg or .png format. Refer to this discussion for a broad overview of efficient data preparation and loading for machine learning systems.

We think that training with the RecordIO format is the easiest way to get started with your image classification PoC on SageMaker. For full details on the input-output interface, see this.

Prepare your dataset in ImageRecord format

Reference

Raw images are natural data format for computer vision tasks. However, when loading data from image files for training, disk IO might be a bottleneck. For instance, when training a ResNet50 model with ImageNet on an AWS p3.16xlarge instance, The parallel training on 8 GPUs makes it so fast, with which even reading images from ramdisk can’t catch up. To boost the performance on top-configured platform, we suggest users to train with MXNet’s ImageRecord format.

It is as simple as a few lines of code to create ImageRecord file for your own images.

Assuming we have a folder ./example, in which images are places in different subfolders representing classes:

./example/class_A/1.jpg
./example/class_A/2.jpg
./example/class_A/3.jpg
./example/class_B/4.jpg
./example/class_B/5.jpg
./example/class_B/6.jpg
./example/class_C/100.jpg
./example/class_C/1024.jpg
./example/class_D/65535.jpg
./example/class_D/0.jpg
...

Download prerequisite packages

wget https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py ./
pip install mxnet

Generate a .lst file, i.e. a list of these images containing label and filename information.

python im2rec.py ./example_rec ./example/ --recursive --list --num-thread 8 --test-ratio=0.3 --train-ratio=0.7

Then create the .rec files for training and validation

python im2rec.py ./example_rec ./example/ --recursive --pass-through --pack-label --num-thread 8

It gives you two more files: example_rec.idx and example_rec.rec. Now, you can use them to train!

Then use this link to upload your .rec files to s3!

For example, do:

import sagemaker
sess = sagemaker.Session()

trainpath = sess.upload_data(
    path='example_rec_train.rec', bucket='mybucketname',
    key_prefix='sagemaker/input')

testpath = sess.upload_data(
    path='example_rec_test.rec', bucket='mybucketname',
    key_prefix='sagemaker/input')
``

Updated on 07 Feb 2020