BlazingText preprocessing
From the SageMaker example here…
BlazingText expects a preprocessed text file in S3 with space-separated tokens. Each line of the file should contain a single sentence and the corresponding label(s), prefixed by "__label__".
What does this mean?
Let's say you have a CSV file with two columns, a category and the text of a document:
CATEGORY,Text of document 1
CATEGORY,Text of document 2
CATEGORY,Text of document 3
A "document" here can be a sentence, a paragraph or several paragraphs.
We recommend that each document be no longer than a paragraph. If the file has any columns beyond these two, drop them.
Read the file using pandas like this:
import pandas as pd
# names must be an ordered list; a set does not guarantee column order
data = pd.read_csv('documents.txt', names=['category', 'text'])
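Prefix each category with __label__, tokenize the text, and write out a preprocessed file:
import nltk
nltk.download('punkt')  # newer NLTK releases may also need nltk.download('punkt_tab')
# BlazingText expects every label to start with __label__
data.category = '__label__' + data.category.astype(str)
# Lowercase and tokenize; assign the result back, since apply does not modify in place
data.text = data.text.apply(lambda x: ' '.join(nltk.word_tokenize(str(x).lower())))
# One line per document: the label, a space, then the tokens. No headers or indices.
with open('out.csv', 'w') as f:
    for label, text in zip(data.category, data.text):
        f.write(label + ' ' + text + '\n')
The file that is ready for BlazingText will now look like…
__label__CATEGORY text of document 1
__label__CATEGORY text of document 2
__label__CATEGORY text of document 3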
Note: Repeat this for every file that is part of your dataset.
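Before uploading, it can help to check that every line of a preprocessed file has the expected shape. The following is a minimal sketch, assuming the `__label__` prefix used by BlazingText's supervised mode; the function name `find_bad_lines` and the sample lines are illustrative and not part of any library.

```python
def find_bad_lines(lines, prefix="__label__"):
    """Return (line_number, line) pairs that do not look like BlazingText input."""
    bad = []
    for number, line in enumerate(lines, start=1):
        tokens = line.split()
        # A label token is the prefix followed by a non-empty category name.
        labels = [t for t in tokens if t.startswith(prefix) and len(t) > len(prefix)]
        # At least one token must be ordinary text rather than a label.
        has_text = any(not t.startswith(prefix) for t in tokens)
        # The line must also begin with a label token.
        if not labels or not has_text or not tokens[0].startswith(prefix):
            bad.append((number, line))
    return bad


sample = [
    "__label__sports the team won the match",
    "sports,the team won the match",
    "__label__ missing category name",
]
print(find_bad_lines(sample))
```

In this sample the second and third lines are reported: one has no label prefix at all, the other has a prefix with no category name attached.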
Upload out.csv to an S3 location that looks like s3://bucketname/train/out.csv
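One way to script the upload is sketched below, assuming the boto3 library is installed and AWS credentials are already configured. The helper names `split_s3_uri` and `upload` are hypothetical; `upload_file` is the standard boto3 S3 client method, and `bucketname` is the placeholder from the path above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/path' into ('bucket', 'key/path')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a full S3 object URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def upload(local_path, uri, client=None):
    """Upload local_path to the given S3 URI and return (bucket, key)."""
    if client is None:
        import boto3  # imported lazily so the URI helper works without boto3
        client = boto3.client("s3")
    bucket, key = split_s3_uri(uri)
    client.upload_file(local_path, bucket, key)
    return bucket, key


print(split_s3_uri("s3://bucketname/train/out.csv"))
```

Calling `upload("out.csv", "s3://bucketname/train/out.csv")` would then place the file under the `train/` prefix of the bucket.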
Also see this link for information on how to split this output file into two files, for training and testing.
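As a rough illustration of one way to do such a split (not necessarily the method on the linked page), the sketch below shuffles the preprocessed lines and sets aside a fraction for testing. The 80/20 ratio, the fixed seed, and the function name `split_lines` are assumptions chosen for the example.

```python
import random


def split_lines(lines, test_fraction=0.2, seed=0):
    """Shuffle lines reproducibly and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(lines)
    # A seeded generator keeps the split identical across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


lines = [f"__label__cat{i % 2} text of document {i}" for i in range(10)]
train, test = split_lines(lines)
print(len(train), len(test))
```

Each list can then be written to its own file, one line per document, and uploaded to separate S3 prefixes.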