- Use cases
1. Preprocessing
- SageMaker Object Detection preprocessing
- Rekognition Object Detection preprocessing
- SageMaker Kmeans preprocessing
- Autopilot preprocessing
- DeepAR preprocessing
- Personalize preprocessing
- Select, drop or extract Columns
- Split dataset to Train and Test
- Upload to S3
- Forecast preprocessing
- Rekognition Classification preprocessing
- SageMaker Image Classification preprocessing
- XGBoost preprocessing
- BlazingText preprocessing
- Comprehend custom preprocessing
2. Training
- SageMaker Object Detection training
- Rekognition Object Detection training
- Forecast training
- Personalize training
- BlazingText training
- DeepAR training
- SageMaker Kmeans training
- Comprehend custom training
- Autopilot training
- XGBoost training
- Autogluon training
- Rekognition Classification training
- SageMaker Image Classification training
3. Inference
- SageMaker Object Detection inference
- Forecast inference
- Rekognition Object Detection inference
- Comprehend custom inference
- Personalize inference
- Autopilot inference
- BlazingText inference
- Custom SageMaker model inference
- DeepAR inference
- Rekognition Classification inference
- SageMaker Image Classification inference
- SageMaker Kmeans inference
- XGBoost inference
- Contribute a use case or contact us for help.
- Frequently Asked Questions
XGBoost preprocessing
Pre-reqs
Make sure you have a CSV file containing the column you want to predict along with the other data columns (features); call it 'file.csv'
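If you don't have a dataset handy, you can generate a small placeholder CSV to follow along. The column names and values below are purely hypothetical:

```python
import pandas as pd

# Hypothetical example data: two feature columns and one text label column
df = pd.DataFrame({
    'age': [25, 32, 47, 51, 38, 29],
    'city': ['Seattle', 'Austin', 'Seattle', 'Boston', 'Austin', 'Boston'],
    'label': ['yes', 'no', 'no', 'yes', 'no', 'yes'],
})
df.to_csv('file.csv', index=False)
```

With this file, the predictor column in the steps below would be 'label'.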
Make the necessary imports
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
Read data using pandas
data = pd.read_csv('file.csv')
Choose the column you want to predict
predictor_column_name = 'column-name'
Clean up the data
# split data into X (features) and y (the column you want to predict)
X = data.drop(predictor_column_name, axis=1)
y = data[predictor_column_name]
# convert text labels (y) into numbers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(y)
label_encoded_y = label_encoder.transform(y)
# convert features (X) to numbers (one-hot encode categorical columns)
features = pd.get_dummies(X).values
# put the encoded label in the first column, as SageMaker XGBoost expects
alldata = np.vstack((label_encoded_y.T, features.T)).T
# replace NaNs with zero and infinities with large finite numbers
alldata = np.nan_to_num(alldata)
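The cleanup step above can be checked end-to-end on a toy frame. The column names and values here are hypothetical; the point is to confirm that the label lands in the first column and can be decoded back:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data: one numeric feature, one categorical feature, one text label
X = pd.DataFrame({'age': [25, 32, 47], 'city': ['a', 'b', 'a']})
y = pd.Series(['yes', 'no', 'yes'])

label_encoder = LabelEncoder().fit(y)
label_encoded_y = label_encoder.transform(y)  # -> [1, 0, 1], classes_ is ['no', 'yes']
features = pd.get_dummies(X).values           # one-hot encodes 'city' into city_a, city_b
alldata = np.vstack((label_encoded_y.T, features.T)).T

# The first column decodes back to the original labels
decoded = label_encoder.inverse_transform(alldata[:, 0].astype(int))
```

Note that `LabelEncoder` assigns integers to classes in sorted order, which is why 'no' maps to 0 and 'yes' to 1.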
Split and write the data
# split into 70% train, 20% validation, 10% test (in row order)
train_data, validation_data, test_data = np.split(alldata, [int(0.7 * len(alldata)), int(0.9 * len(alldata))])
np.savetxt('train.csv', train_data, delimiter=',')
np.savetxt('validation.csv', validation_data, delimiter=',')
np.savetxt('test.csv', test_data, delimiter=',')
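`np.split` with these indices yields a 70/20/10 split in the order the rows appear, so if your CSV is sorted (for example by date or by class) you may want to shuffle first. A minimal sketch with a stand-in array:

```python
import numpy as np

# Stand-in for the `alldata` array built above: 10 rows, 2 columns
alldata = np.arange(20).reshape(10, 2).astype(float)

rng = np.random.default_rng(seed=0)
rng.shuffle(alldata)  # shuffle rows in place before splitting

train, validation, test = np.split(
    alldata, [int(0.7 * len(alldata)), int(0.9 * len(alldata))])
print(len(train), len(validation), len(test))  # 7 2 1
```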
Upload these three files to three separate folders / prefixes on S3 (see the Upload to S3 use case).
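One way to do the upload from Python is with boto3. This is a sketch, not part of the original walkthrough: the bucket name is a placeholder, and one prefix is used per channel as SageMaker training jobs typically expect:

```python
def upload_splits(bucket, s3=None):
    """Upload the three split CSVs to per-channel prefixes on S3."""
    if s3 is None:
        import boto3  # imported lazily so the function is easy to test
        s3 = boto3.client('s3')
    keys = []
    for filename, prefix in [('train.csv', 'train'),
                             ('validation.csv', 'validation'),
                             ('test.csv', 'test')]:
        key = f'{prefix}/{filename}'
        s3.upload_file(filename, bucket, key)  # Filename, Bucket, Key
        keys.append(key)
    return keys

# upload_splits('my-bucket')  # placeholder: replace with your bucket name
```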