- Use cases
-
1. Preprocessing
- SageMaker Object Detection preprocessing
- Rekognition Object Detection preprocessing
- SageMaker Kmeans preprocessing
- Autopilot preprocessing
- DeepAR preprocessing
- Personalize preprocessing
- Select, drop or extract Columns
- Split dataset to Train and Test
- Upload to s3
- Forecast preprocessing
- Rekognition Classification preprocessing
- SageMaker Image Classification preprocessing
- Xgboost preprocessing
- Blazingtext preprocessing
- Comprehend custom preprocessing
-
2. Training
- SageMaker Object Detection training
- Rekognition Object Detection training
- Forecast training
- Personalize training
- BlazingText training
- DeepAR training
- SageMaker Kmeans training
- Comprehend custom training
- Autopilot Training
- Xgboost Training
- Autogluon training
- Rekognition Classification training
- SageMaker Image Classification training
-
3. Inference
- SageMaker Object Detection inference
- Forecast inference
- Rekognition Object Detection inference
- Comprehend custom inference
- Personalize inference
- Autopilot Inference
- BlazingText Inference
- Custom SageMaker model Inference
- DeepAR Inference
- Rekognition Classification inference
- SageMaker Image Classification inference
- SageMaker Kmeans inference
- Xgboost Inference
- Contribute a use case or contact us for help.
- Frequently Asked Questions
Split dataset to Train and Test
Files are in a folder, and I like linux commands
The easiest way to do this is using the linux awk
command. Suppose you have a file called in.csv
or a directory of files that look like
Folder
├── in1.csv
├── in2.csv
├── .
├── .
└── in2000.csv
… and assuming the delimiter used is a comma (,), and you want to select the first three columns, do
!awk '{if( rand() <= 0.2){ print $0 > "test_data.csv"} else {print $0 > "train_data.csv"}}' Folder/in*csv
That's it!
To count the number of lines in the resulting csv files, do:
wc -l <filename>
How about in Python?
from numpy.random import RandomState
import pandas as pd
df = pd.read_csv('C:/Dataset.csv')
rng = RandomState()
#For a 70-30 split
train = df.sample(frac=0.7, random_state=rng)
test = df.loc[~df.index.isin(train.index)]
train.to_csv('train.csv',index=False,header=False)
test.to_csv('test.csv',index=False,header=False)