BlazingText preprocessing

From the SageMaker example:

BlazingText expects preprocessed text files in S3 with space-separated tokens. Each line of a file should contain a single sentence and the corresponding label(s), each prefixed by “__label__”.

What does this mean?

Let's say you have a CSV file with two columns, a category and the text of a document:

CATEGORY,Text of document 1
CATEGORY,Text of document 2
CATEGORY,Text of document 3

A “document” here can be a sentence, a paragraph or several paragraphs.

We recommend keeping each document to no more than a paragraph. If the file has any columns beyond these two, drop them.

Read the file using pandas like this:

import pandas as pd

# names must be an ordered list; a set does not guarantee column order
data = pd.read_csv('documents.txt', names=['category', 'text'])
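
If your file has a header row and more than the two columns, one way to drop the extras is to select columns at read time with `usecols`. The snippet below is a minimal sketch: the in-memory sample data and the extra `author` column are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical input with a header row and an extra "author" column.
raw = io.StringIO(
    "category,text,author\n"
    "SPORTS,The match ended in a draw.,alice\n"
    "TECH,The new phone ships next week.,bob\n"
)

# usecols keeps only the two columns needed for preprocessing.
data = pd.read_csv(raw, usecols=["category", "text"])
print(list(data.columns))
```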

Modify the category column and write out a preprocessed file:

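# BlazingText requires every label to carry the __label__ prefix
data['category'] = '__label__' + data['category'].astype(str)

import nltk
nltk.download('punkt')  # newer NLTK releases may need 'punkt_tab' instead

# Lowercase and tokenize each document, then rejoin the tokens with single spaces
data['text'] = data['text'].apply(lambda x: ' '.join(nltk.word_tokenize(str(x).lower())))

# One line per document: label, a space, then the tokens (no header, no index)
with open('out.csv', 'w') as f:
    f.write('\n'.join(data['category'] + ' ' + data['text']) + '\n')

Your file that is ready for BlazingText will now look like this (the text is lowercased and tokenized):

__label__CATEGORY text of document 1
__label__CATEGORY text of document 2
__label__CATEGORY text of document 3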
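
Before uploading, it can help to check the file against the format BlazingText's supervised mode expects: each line starts with a `__label__`-prefixed label, followed by space-separated tokens. The checker below is only a sketch of such a sanity check; `find_bad_lines` is a hypothetical helper name, not part of any library.

```python
def find_bad_lines(lines):
    """Return (line_number, line) pairs that do not look like BlazingText input."""
    prefix = "__label__"
    bad = []
    for number, line in enumerate(lines, start=1):
        parts = line.split()
        labels = [p for p in parts if p.startswith(prefix)]
        tokens = [p for p in parts if not p.startswith(prefix)]
        ok = (
            bool(parts)
            and parts[0].startswith(prefix)            # line begins with a label
            and all(len(l) > len(prefix) for l in labels)  # no empty label names
            and bool(tokens)                           # at least one text token
        )
        if not ok:
            bad.append((number, line.rstrip("\n")))
    return bad
```

For a file on disk you could call it as `find_bad_lines(open('out.csv'))` and inspect whatever it returns.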

Note: Repeat this for every file that is part of your dataset.
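
One way to repeat the steps over several input files is to wrap them in a function. The sketch below uses only the standard library and makes two assumptions that are not in the steps above: a simple regular-expression tokenizer stands in for `nltk.word_tokenize`, and labels are written with the `__label__` prefix that BlazingText expects. The function names are hypothetical.

```python
import csv
import re

# Words, or single punctuation marks; a rough stand-in for nltk.word_tokenize.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def to_blazingtext_line(category, text):
    """Return one '__label__<category> token token ...' line (no newline)."""
    tokens = TOKEN_RE.findall(text.lower())
    return "__label__" + category.strip() + " " + " ".join(tokens)

def preprocess_files(in_paths, out_path):
    """Convert two-column CSV files (category,text) into one output file.

    Returns the number of lines written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in in_paths:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.reader(f):
                    if len(row) < 2:
                        continue  # skip blank or malformed rows
                    out.write(to_blazingtext_line(row[0], row[1]) + "\n")
                    count += 1
    return count
```

A call such as `preprocess_files(sorted(glob.glob('data/*.csv')), 'out.csv')` would then cover a whole directory; the `data/` path is only an example. Category names containing spaces would need extra handling, since a space ends the label.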

Upload out.csv to an S3 location that looks like s3://bucketname/train/out.csv
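
If you prefer to upload from Python rather than the console, a minimal sketch with boto3 might look like the following. It assumes boto3 is installed and AWS credentials are already configured; `upload_training_file` is a hypothetical helper, and the bucket name is the placeholder from the path above.

```python
def upload_training_file(local_path, bucket, key="train/out.csv"):
    """Upload local_path to s3://<bucket>/<key> and return that URI."""
    import boto3  # imported here so the helper can be defined without boto3 installed

    # upload_file(Filename, Bucket, Key) is the standard S3 client call.
    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

Calling `upload_training_file('out.csv', 'bucketname')` would place the file at s3://bucketname/train/out.csv.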

Also see this link for information on how to split this output file into two files, for training and testing.
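
As a rough illustration of such a split (not necessarily the method described at the link), the sketch below shuffles the lines reproducibly and holds out a fraction for testing. The 80/20 ratio, the seed, and the function names are assumptions chosen for the example.

```python
import random

def split_lines(lines, test_fraction=0.2, seed=42):
    """Shuffle lines reproducibly and return (train, test) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def split_file(in_path, train_path, test_path, test_fraction=0.2, seed=42):
    """Split a preprocessed file into training and testing files."""
    with open(in_path, encoding="utf-8") as f:
        # Drop blank lines and make sure every line ends with a newline.
        lines = [line.rstrip("\n") + "\n" for line in f if line.strip()]
    train, test = split_lines(lines, test_fraction, seed)
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(train)
    with open(test_path, "w", encoding="utf-8") as f:
        f.writelines(test)
    return len(train), len(test)
```

For example, `split_file('out.csv', 'train.csv', 'test.csv')` writes the two files and returns how many lines went into each.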

