Use cases

1. Preprocessing
- SageMaker Object Detection preprocessing
- Rekognition Object Detection preprocessing
- SageMaker KMeans preprocessing
- Autopilot preprocessing
- DeepAR preprocessing
- Personalize preprocessing
- Select, drop or extract columns
- Split dataset into train and test
- Upload to S3
- Forecast preprocessing
- Rekognition Classification preprocessing
- SageMaker Image Classification preprocessing
- XGBoost preprocessing
- BlazingText preprocessing
- Comprehend custom preprocessing

2. Training
- SageMaker Object Detection training
- Rekognition Object Detection training
- Forecast training
- Personalize training
- BlazingText training
- DeepAR training
- SageMaker KMeans training
- Comprehend custom training
- Autopilot training
- XGBoost training
- AutoGluon training
- Rekognition Classification training
- SageMaker Image Classification training

3. Inference
- SageMaker Object Detection inference
- Forecast inference
- Rekognition Object Detection inference
- Comprehend custom inference
- Personalize inference
- Autopilot inference
- BlazingText inference
- Custom SageMaker model inference
- DeepAR inference
- Rekognition Classification inference
- SageMaker Image Classification inference
- SageMaker KMeans inference
- XGBoost inference
SageMaker KMeans preprocessing
Per the documentation, “For training, the k-means algorithm expects data to be provided in the train channel (recommended S3DataDistributionType=ShardedByS3Key), with an optional test channel (recommended S3DataDistributionType=FullyReplicated) to score the data on. Both recordIO-wrapped-protobuf and CSV formats are supported for training. You can use either File mode or Pipe mode to train models on data that is formatted as recordIO-wrapped-protobuf or as CSV.”
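For reference, configuring those channels explicitly in the v1 SageMaker Python SDK looks roughly like the following sketch (the S3 URIs are placeholders, and the estimator is assumed to exist already):
from sagemaker.session import s3_input
# Hypothetical channel configuration; replace the S3 URIs with your own.
train_channel = s3_input('s3://<your-bucket>/train/', content_type='text/csv', distribution='ShardedByS3Key')
test_channel = s3_input('s3://<your-bucket>/test/', content_type='text/csv', distribution='FullyReplicated')
# estimator.fit({'train': train_channel, 'test': test_channel})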
Using the Python SDK for SageMaker, there is a far simpler way to use KMeans (yes, even simpler than a CSV).
Assume you start with a CSV that looks like the following (this is sample data from s3://aws-ml-blog-sagemaker-census-segmentation):
CensusId State County TotalPop Men Women Hispanic White Black Native ...
0 1001 Alabama Autauga 55221 26745 28476 2.6 75.8 18.5 0.4 ...
1 1003 Alabama Baldwin 195121 95314 99807 4.5 83.1 9.5 0.6 ...
2 1005 Alabama Barbour 26932 14497 12435 4.6 46.2 46.7 0.2 ...
3 1007 Alabama Bibb 22604 12073 10531 2.2 74.5 21.4 0.4 ...
4 1009 Alabama Blount 57710 28512 29198 8.6 87.9 1.5 0.3 ...
Read the CSV file:
import pandas as pd
data = pd.read_csv('data.csv', header=0, delimiter=",", low_memory=False)
Drop any row that contains NaN (Not a Number) values (dropna operates on rows by default):
data.dropna(inplace=True)
(Optional) Consider one-hot encoding or other data preprocessing; use this as a template for the minimal preprocessing you will need. For example, you can use scikit-learn to do standard preprocessing such as scaling:
from sklearn.preprocessing import MinMaxScaler

# Note: MinMaxScaler requires numeric input, so drop or encode
# non-numeric columns (e.g. State, County) before scaling.
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data))
data_scaled.columns = data.columns
data_scaled.index = data.index
Make sure the numerical values in the DataFrame are float values:
train_data = data_scaled.astype('float32')  # keep this as a DataFrame; the later steps call .values on it
Note - Consider using the DataFrame as is, since the next training step will become easier. If you really need to write a CSV, do:
train_data.to_csv('train.csv', index=False)
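If you do write out a CSV, you will typically upload it to S3 before training. A minimal sketch using boto3 (the bucket name and key are placeholders):
import boto3

s3 = boto3.client('s3')
# Hypothetical bucket and key; replace with your own.
s3.upload_file('train.csv', '<your-bucket>', 'counties/train/train.csv')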
Optional
It is possible that you have a large number of columns in your training dataset; it is then common practice to use Principal Component Analysis (PCA) to reduce the number of columns while retaining most of the information. We can use the SageMaker built-in PCA algorithm to achieve this.
Using the train_data DataFrame, you can do:
from sagemaker import PCA
from sagemaker import get_execution_role

role = get_execution_role()
bucket = '<your-bucket>'  # replace with your S3 bucket name
num_components = 20

pca_SM = PCA(role=role,
             train_instance_count=1,
             train_instance_type='ml.c4.xlarge',
             output_path='s3://' + bucket + '/counties/',
             num_components=num_components)
… and then train a PCA model:
train_data = train_data.values.astype('float32')  # convert the DataFrame to a float32 NumPy array for record_set
pca_SM.fit(pca_SM.record_set(train_data))
This reduces the number of columns in your original dataset to num_components=20.
Once you are done training, store the new “components” in a new dataset, which will be your training dataset for KMeans.
To do this, first deploy your model to Sagemaker, and then do a prediction:
pca_predictor = pca_SM.deploy(initial_instance_count=1, instance_type='ml.t2.medium')
result = pca_predictor.predict(train_data)
Then recover your training dataset to be used with KMeans:
data_transformed = pd.DataFrame()
last_n = 5

for a in result:
    b = a.label['projection'].float32_tensor.values
    data_transformed = data_transformed.append([list(b)])

data_transformed.index = data_scaled.index  # note that this reuses the index of a DataFrame from the earlier steps
data_transformed = data_transformed.iloc[:, -last_n:]
PCA_list = ['PC_' + str(i + 1) for i in range(last_n)]  # hypothetical column names; PCA_list was not defined in the original snippet
data_transformed.columns = PCA_list
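Once you have recovered the transformed dataset, consider deleting the PCA endpoint so you stop paying for it; in the v1 SDK this is a one-liner:
pca_predictor.delete_endpoint()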
The variable last_n controls the number of columns you want to use (here, we are using the last 5 columns of data). You can also use all columns of data_transformed, as it is already a dataset with only 20 columns, whereas your original dataset for KMeans may have had hundreds or thousands of columns.
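With preprocessing complete, the transformed data can feed the built-in KMeans estimator. A minimal sketch in the same v1 SDK style (the instance types, output path, and number of clusters k are assumptions, not values from the original walkthrough):
from sagemaker import KMeans

kmeans = KMeans(role=role,
                train_instance_count=1,
                train_instance_type='ml.c4.xlarge',
                output_path='s3://' + bucket + '/counties/',
                k=7)  # k=7 is an assumption; choose it for your data

# record_set expects a float32 NumPy array, as with PCA above.
kmeans_train_data = data_transformed.values.astype('float32')
kmeans.fit(kmeans.record_set(kmeans_train_data))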