BlazingText preprocessing

From the SageMaker example here…

BlazingText expects a preprocessed text file in S3 with space-separated tokens. Each line of the file should contain a single sentence and the corresponding label(s), prefixed by "__label__".

What does this mean?

Let's say you have a CSV file with two columns, a category and the text of a document:

CATEGORY,Text of document 1
CATEGORY,Text of document 2
CATEGORY,Text of document 3

A “document” here can be a sentence, a paragraph or several paragraphs.

We recommend that each document be no longer than a paragraph. If the file has any columns beyond these two, drop them.

Read the file using pandas like this:

import pandas as pd

# The file has no header row, so supply the column names as an ordered list
data = pd.read_csv('documents.txt', names=['category', 'text'])
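If the file carries extra columns, one way to drop them while reading is sketched below. The helper name read_two_columns is hypothetical, and the sketch assumes that the category and the text are the first two columns and that the file has no header row.

```python
import pandas as pd

def read_two_columns(source):
    """Read a headerless CSV and keep only the first two columns.

    Assumption: column 0 is the category and column 1 is the document text.
    """
    raw = pd.read_csv(source, header=None)
    data = raw.iloc[:, :2].copy()          # drop everything after the second column
    data.columns = ['category', 'text']    # give the kept columns readable names
    return data
```

Calling read_two_columns('documents.txt') would then return a frame with only the category and text columns.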

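Modify the category column, tokenize the text, and write out a preprocessed file:

import nltk
nltk.download('punkt')  # tokenizer models for word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead

# Wrap each category in double underscores: CATEGORY becomes __CATEGORY__
data['category'] = '__' + data['category'].astype(str) + '__'

# Lowercase and tokenize each document, then store the result back in the column
data['text'] = data['text'].apply(lambda x: ' '.join(nltk.word_tokenize(str(x).lower())))

# Don't include headers or indices
data.to_csv('out.csv', index=False, header=False)

Your CSV file, now ready for BlazingText, will look like this (the text is lowercased and tokenized):

__CATEGORY__,text of document 1
__CATEGORY__,text of document 2
__CATEGORY__,text of document 3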
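Before uploading, it can help to confirm that every row has the shape shown above. The following is a minimal sketch using only the standard library; the function name find_bad_rows and the exact checks are assumptions for illustration, not part of BlazingText or SageMaker.

```python
import csv
import re

# First column should look like __CATEGORY__ (assumed from the sample output above)
LABEL_PATTERN = re.compile(r'^__.+__$')

def find_bad_rows(lines):
    """Return (row_number, reason) pairs for rows that are not '__CATEGORY__,text'."""
    problems = []
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != 2:
            problems.append((row_number, 'expected 2 columns, found %d' % len(row)))
        elif not LABEL_PATTERN.match(row[0]):
            problems.append((row_number, 'category is not wrapped in double underscores'))
        elif not row[1].strip():
            problems.append((row_number, 'text is empty'))
    return problems

def check_file(path):
    """Run the checks against a CSV file on disk."""
    with open(path, newline='', encoding='utf-8') as f:
        return find_bad_rows(f)
```

Calling check_file('out.csv') returns an empty list when every row passes, and otherwise lists the row numbers that need attention.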

Note: Repeat this for every file that is part of your dataset.

Upload out.csv to an S3 location that looks like s3://bucketname/train/out.csv
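The upload can be done from the AWS console or from code. As a rough sketch, assuming boto3 is installed and AWS credentials are already configured, the hypothetical helpers below split the S3 location into a bucket and key and pass them to the boto3 S3 client's upload_file call. The bucket name is the placeholder from the example above.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc or not parsed.path.strip('/'):
        raise ValueError('not a full S3 URI: %r' % uri)
    return parsed.netloc, parsed.path.lstrip('/')

def upload_to_s3(local_path, uri):
    """Upload a local file to the given S3 URI (requires boto3 and credentials)."""
    # Imported here so split_s3_uri stays usable without boto3 installed
    import boto3
    bucket, key = split_s3_uri(uri)
    boto3.client('s3').upload_file(local_path, bucket, key)
```

For example, upload_to_s3('out.csv', 's3://bucketname/train/out.csv') would place the file under the train/ prefix of the bucket.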

Also see this link for information on how to split this output file into two files, for training and testing.
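The linked instructions are not reproduced here. As one common approach, and not necessarily the one the link describes, the sketch below shuffles the lines of out.csv with a fixed seed and holds out a fraction for testing. The function names, the 20% test fraction, and the output file names are all assumptions, and the sketch relies on each document occupying exactly one line.

```python
import random

def split_lines(lines, test_fraction=0.2, seed=42):
    """Shuffle lines reproducibly and return (train_lines, test_lines)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def split_file(path='out.csv', train_path='train.csv', test_path='test.csv'):
    """Write the train and test portions of a one-document-per-line file."""
    with open(path, encoding='utf-8') as f:
        lines = f.readlines()
    train, test = split_lines(lines)
    with open(train_path, 'w', encoding='utf-8') as f:
        f.writelines(train)
    with open(test_path, 'w', encoding='utf-8') as f:
        f.writelines(test)
```

A fixed seed keeps the split reproducible, so rerunning the preprocessing does not move documents between the training and test files.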

Updated on 07 Feb 2020
