- Use cases
1. Preprocessing
- SageMaker Object Detection preprocessing
- Rekognition Object Detection preprocessing
- SageMaker K-Means preprocessing
- Autopilot preprocessing
- DeepAR preprocessing
- Personalize preprocessing
- Select, drop or extract columns
- Split dataset into train and test
- Upload to S3
- Forecast preprocessing
- Rekognition Classification preprocessing
- SageMaker Image Classification preprocessing
- XGBoost preprocessing
- BlazingText preprocessing
- Comprehend custom preprocessing
2. Training
- SageMaker Object Detection training
- Rekognition Object Detection training
- Forecast training
- Personalize training
- BlazingText training
- DeepAR training
- SageMaker K-Means training
- Comprehend custom training
- Autopilot training
- XGBoost training
- AutoGluon training
- Rekognition Classification training
- SageMaker Image Classification training
3. Inference
- SageMaker Object Detection inference
- Forecast inference
- Rekognition Object Detection inference
- Comprehend custom inference
- Personalize inference
- Autopilot inference
- BlazingText inference
- Custom SageMaker model inference
- DeepAR inference
- Rekognition Classification inference
- SageMaker Image Classification inference
- SageMaker K-Means inference
- XGBoost inference
- Contribute a use case or contact us for help.
- Frequently Asked Questions
Personalize preprocessing
Import data
Amazon Personalize imports data only from files in comma-separated values (CSV) format. The first row of your CSV file must contain column headers, and those headers must map to the schema used to create the dataset. Don't enclose headers in quotation marks ("). Upload your file to an Amazon Simple Storage Service (Amazon S3) bucket that Amazon Personalize can access.
Let's assume your dataset contains a file named “ratings.csv” with the following data.
userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
It reads like this: user 1 gave movie 2 a rating of 3.5, and likewise for movies 29, 32, 47, 50, and so on.
Before you add a dataset to Amazon Personalize, you must define a schema for that dataset. Schemas are defined in Avro format. Save the following schema as “my-personalize-schema.json”.
{"type": "record",
"name": "Interactions",
"namespace": "com.amazonaws.personalize.schema",
"fields":[
{"name": "ITEM_ID", "type": "string"},
{"name": "USER_ID", "type": "string"},
{"name": "TIMESTAMP", "type": "long"}
],
"version": "1.0"}
Prepare dataset
Essentially, the following Python code does four things:
- Shuffles the reviews.
- Keeps only movies rated above 4 and drops the rating column; the goal is to recommend movies that users should really like.
- Renames the columns to the names used in the schema.
- Keeps only 100,000 interactions to minimize training time.
import boto3
import pandas
from sklearn.utils import shuffle

# Load the raw ratings and shuffle the rows so the sample taken below is random
ratings = pandas.read_csv('ratings.csv')
ratings = shuffle(ratings)

# Keep only interactions rated above 4, then drop the rating column
ratings = ratings[ratings['rating'] > 4]
ratings = ratings.drop(columns='rating')

# Rename the remaining columns to the names used in the schema
ratings.columns = ['USER_ID', 'ITEM_ID', 'TIMESTAMP']

# Keep only 100,000 interactions to minimize training time
ratings = ratings[:100000]
ratings.to_csv('ratings.processed.csv', index=False)

# Upload the processed file to S3
s3 = boto3.client('s3')
s3.upload_file('ratings.processed.csv', 'my-personalize-bucket', 'ratings.processed.csv')
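Uploading the file is not enough on its own: as noted above, the bucket must be accessible to Amazon Personalize. A minimal sketch of attaching a bucket policy that grants the service read access, assuming the example bucket name used throughout:
import json
import boto3

bucket = 'my-personalize-bucket'  # example bucket name from the code above

# Allow the Amazon Personalize service to list the bucket and read objects
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "personalize.amazonaws.com"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::" + bucket, "arn:aws:s3:::" + bucket + "/*"]
    }]
}

boto3.client('s3').put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))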
Create the dataset group, schema, dataset, and dataset import job - using Python and the CLI
Python
import boto3

personalize = boto3.client('personalize')

# Create the dataset group that will hold the Interactions dataset
dsg_response = personalize.create_dataset_group(name='my-personalize-dataset-group')
dsg_arn = dsg_response['datasetGroupArn']

# Register the Avro schema defined above
with open('my-personalize-schema.json') as f:
    schema_response = personalize.create_schema(
        name='MyPersonalizeSchema',
        schema=f.read())
schema_arn = schema_response['schemaArn']

# Create the Interactions dataset (the dataset group must be ACTIVE first)
ds_response = personalize.create_dataset(
    name='MyPersonalizeDataset',
    schemaArn=schema_arn,
    datasetGroupArn=dsg_arn,
    datasetType='Interactions')
ds_arn = ds_response['datasetArn']

# Import the processed CSV from S3; replace 'role_arn' with the ARN of an
# IAM role that Amazon Personalize can assume to read from your bucket
dsij_response = personalize.create_dataset_import_job(
    jobName='YourImportJob',
    datasetArn=ds_arn,
    dataSource={'dataLocation': 's3://my-personalize-bucket/ratings.processed.csv'},
    roleArn='role_arn')
dsij_arn = dsij_response['datasetImportJobArn']
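create_dataset_import_job is asynchronous; the data is only available once the job's status reaches ACTIVE. A minimal polling sketch, continuing from the variables above (no timeout or error handling):
import time

while True:
    status = personalize.describe_dataset_import_job(
        datasetImportJobArn=dsij_arn)['datasetImportJob']['status']
    print('Import job status:', status)
    if status in ('ACTIVE', 'CREATE FAILED'):
        break
    time.sleep(60)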
CLI
aws personalize create-dataset-group --name my-personalize-dataset-group
aws personalize create-schema --name my-personalize-schema \
    --schema file://my-personalize-schema.json
# $SCHEMA_ARN and $DATASET_GROUP_ARN come from the output of the commands above
aws personalize create-dataset --schema-arn $SCHEMA_ARN \
    --dataset-group-arn $DATASET_GROUP_ARN \
    --dataset-type Interactions
aws personalize create-dataset-import-job --job-name my-personalize-import-job \
    --role-arn $ROLE_ARN \
    --dataset-arn $DATASET_ARN \
    --data-source dataLocation=s3://my-personalize-bucket/ratings.processed.csv
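As with the Python version, the import job runs asynchronously. Its status can be checked with describe-dataset-import-job, assuming $DATASET_IMPORT_JOB_ARN holds the ARN returned by the previous command:
aws personalize describe-dataset-import-job \
    --dataset-import-job-arn $DATASET_IMPORT_JOB_ARN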