BlazingText training

Before you begin, complete the steps on the preprocessing for BlazingText page.

At the end of the preprocessing for BlazingText page, you converted your CSV file into a format that BlazingText accepts, and uploaded the file to s3://bucketname/train/out.csv.
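For reference, BlazingText's supervised mode expects one example per line, with the label prefixed by `__label__` and followed by the space-separated tokens of the sentence. The sketch below shows roughly what that conversion looks like. The function name `to_blazingtext_line`, the lowercasing, and the whitespace tokenization are assumptions for illustration, and may differ from what the preprocessing page actually does.

```python
def to_blazingtext_line(label, text):
    """Format one (label, text) pair as a BlazingText supervised-mode line."""
    # Assumed convention: labels must not contain whitespace,
    # so any whitespace inside the label becomes an underscore.
    clean_label = "_".join(str(label).split())
    # Naive tokenization (an assumption): lowercase, split on whitespace.
    tokens = text.lower().split()
    return "__label__{} {}".format(clean_label, " ".join(tokens))


print(to_blazingtext_line("positive", "Great  product, works well"))
```

A real pipeline would likely use a proper tokenizer so that punctuation is separated from words.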

In a SageMaker notebook, initialize the estimator:

import sagemaker

# Note: the keyword names below follow version 1 of the SageMaker Python SDK.
session = sagemaker.Session()
region = session.boto_region_name

estimator = sagemaker.estimator.Estimator(
    sagemaker_session=session,
    # Built-in BlazingText container image for the current region
    image_name=sagemaker.amazon.amazon_estimator.get_image_uri(region, "blazingtext", "latest"),
    role=sagemaker.get_execution_role(),
    train_instance_count=1,
    train_instance_type='ml.c4.2xlarge',
    base_job_name='blazingtext-poc',
    train_volume_size=30,    # EBS volume size in GB
    train_max_run=360000,    # maximum training time in seconds
    input_mode='File',
    output_path='s3://bucket-name/path/to/output')  # where model artifacts are written
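The estimator call above uses keyword names from version 1 of the SageMaker Python SDK. Version 2 renamed several of them, and replaced `get_image_uri` with `sagemaker.image_uris.retrieve("blazingtext", region)`. As a hedged aid for porting, the sketch below maps the old keyword names to the new ones. The helper `to_v2_kwargs` and the `V1_TO_V2` table are hypothetical names introduced here, not part of the SDK.

```python
# v1 -> v2 keyword renames for sagemaker.estimator.Estimator
V1_TO_V2 = {
    "image_name": "image_uri",
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "train_volume_size": "volume_size",
    "train_max_run": "max_run",
}


def to_v2_kwargs(v1_kwargs):
    """Return a copy of the keyword arguments with v1 names replaced by v2 names."""
    # Keys that were not renamed (role, input_mode, ...) pass through unchanged.
    return {V1_TO_V2.get(key, key): value for key, value in v1_kwargs.items()}


v1 = {"train_instance_count": 1, "train_instance_type": "ml.c4.2xlarge", "input_mode": "File"}
print(to_v2_kwargs(v1))
```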

In supervised mode, BlazingText trains a text classifier. Set the hyperparameters as follows:

estimator.set_hyperparameters(mode="supervised",
                            epochs=10,
                            min_count=2,
                            learning_rate=0.05,
                            vector_dim=10,
                            early_stopping=True,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=2)
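The early_stopping, patience, and min_epochs settings interact with epochs. As a rough illustration, the sketch below simulates a simple patience rule on a list of validation accuracies: training stops once at least min_epochs have run and the accuracy has not improved for patience consecutive epochs. The function name `epochs_run` and the exact counting rule are assumptions made for illustration; BlazingText's internal logic may differ in detail.

```python
def epochs_run(val_accuracies, epochs=10, min_epochs=5, patience=4):
    """Return how many epochs would run under a simple patience rule."""
    best = float("-inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, acc in enumerate(val_accuracies[:epochs], start=1):
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
        # Stop only once min_epochs have completed and patience is exhausted.
        if epoch >= min_epochs and stale >= patience:
            return epoch
    # No early stop: ran every available epoch up to the limit.
    return min(epochs, len(val_accuracies))


# Hypothetical validation accuracies, one per epoch.
history = [0.70, 0.75, 0.74, 0.74, 0.73, 0.72, 0.71, 0.70, 0.70, 0.70]
print(epochs_run(history))
```

With these made-up accuracies the best value occurs at epoch 2, so under this rule training would stop before reaching the full 10 epochs.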

To learn more, see the BlazingText hyperparameters documentation.

Next, train your BlazingText model using the SageMaker Python SDK.

Here, we assume that you have one folder containing one or more train files and another containing the test files. If you have not split the input files yet, do that first.

data_channels = {
    "train": "s3://bucketname/train/",
    "validation": "s3://bucketname/test/"
}

estimator.fit(inputs=data_channels, wait=True, logs=True)

When specifying the path to the input data, point to the folder rather than to an individual .csv file. This way a folder, such as train, can contain multiple .csv files.
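If you only have the full path of an uploaded file, the folder path for the channel can be derived from it. The sketch below is one way to do that with plain string handling. The helper name `channel_prefix` is hypothetical, and the rule that a last path component containing a dot is a file name is a heuristic assumption.

```python
def channel_prefix(s3_uri):
    """Return the folder prefix for an S3 URI that may point at a single file."""
    head, _, last = s3_uri.rstrip("/").rpartition("/")
    # Heuristic (assumption): a last component containing a dot is a file name.
    if "." in last:
        return head + "/"
    # Otherwise the URI already names a folder; normalize the trailing slash.
    return head + "/" + last + "/"


print(channel_prefix("s3://bucketname/train/out.csv"))
print(channel_prefix("s3://bucketname/test"))
```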