SageMaker Kmeans preprocessing

Per the documentation, “For training, the k-means algorithm expects data to be provided in the train channel (recommended S3DataDistributionType=ShardedByS3Key), with an optional test channel (recommended S3DataDistributionType=FullyReplicated) to score the data on. Both recordIO-wrapped-protobuf and CSV formats are supported for training. You can use either File mode or Pipe mode to train models on data that is formatted as recordIO-wrapped-protobuf or as CSV.”

Using the python SDK for Sagemaker, there is a far simpler way to use KMeans (yes, even simpler than a CSV)

Assume you start with a CSV that looks like the following (this is sample data from s3://aws-ml-blog-sagemaker-census-segmentation)

CensusId	State	County	TotalPop	Men		Women	Hispanic	White	Black	Native ...
0	1001	Alabama	Autauga	55221		26745	28476	2.6			75.8	18.5	0.4	...
1	1003	Alabama	Baldwin	195121		95314	99807	4.5			83.1	9.5		0.6	...
2	1005	Alabama	Barbour	26932		14497	12435	4.6			46.2	46.7	0.2 ...	
3	1007	Alabama	Bibb	22604		12073	10531	2.2			74.5	21.4	0.4	...
4	1009	Alabama	Blount	57710		28512	29198	8.6			87.9	1.5		0.3	...

Read the CSV file

import pandas as pd
data = pd.read_csv('data.csv', header=0, delimiter=",", low_memory=False)

Drop any column that has items that are NaN (Not a Number)


Consider doing one-hot-encoding or any other data preprocessing, but use this as template for the minimal data preprocessing that you will need to do. (Optional) For example, you can use SKlearn to do some standard preprocessing like scaling:

from sklearn.preprocessing import MinMaxScaler

Make sure your numerical values in the dataframe are Float values:

train_data = data_scaled.values.astype('float32')

Note - Consider using the dataframe as is, since the next training step will become easier. If you really need to convert to CSV, do:



It is possible that you have a large number of columns in your training dataset; it is common practice to then use Principle Component Analysis to reduce the number of columns while retaining most of the information. We can use the Sagemaker built in PCA algorithm to achieve this.

Using the train_data dataframe, you can do:

from sagemaker import PCA
from sagemaker import get_execution_role
role = get_execution_role()

pca_SM = PCA(role=role,
             output_path='s3://'+ bucket +'/counties/',

… and then train a PCA model:

train_data = train_data.values.astype('float32')

This reduces the number of columns you have in your original dataset to num_components=20

Once you are done training, store the new “components” in a new dataset, which will be your training dataset for Kmeans.

To do this, first deploy your model to Sagemaker, and then do a prediction:

pca_predictor = pca_SM.deploy(initial_instance_count=1, instance_type='ml.t2.medium')
result = pca_predictor.predict(train_data)

Then recover your train dataset to be used with kmeans

last_n = 5

for a in result:
data_transformed.index=data_scaled.index #Note that this uses indexes from a dataframe we used in previous steps

The variable last_n controls the number of columns you want to use (here, we are using the last 5 columns of data). You can also use all columns of data_transformed as it is already a dataset with only 20 columns, whereas your original dataset for Kmeans may have been 100's or 1000's of columns.

Related content: