Comprehend custom preprocessing

From Comprehend Custom Classifier, it supports two modes: multi-class and multi-label

In multi-class classification, each document can have one and only one class assigned to it. The individual classes are mutually exclusive.

In multi-label classification, individual classes represent different categories, but these categories are somehow related and are not mutually exclusive.

Comprehend custom classifier expects training data with exactly two columns in each row. Column one is one of the possible labels, Column two is the content of the document itself.

What does it mean?

Let's say you have a CSV file with 2 columns,

Multi-Class Mode:

Sample label 1,Text of document 1
Sample label2,Text of document 2
samplelabel3,Text of document 3

A “document” here can be a sentence, a paragraph or several paragraphs.

Our recommendation is that you provide nothing more than a paragraph. If you have anything more than these two columns, drop them or join them into a single column.

For training dataset, the file format must conform to the following requirements:

File has exactly two columns in each row. Column one is one of the possible labels, Column two is the content of the document itself. No header Format UTF-8, carriage return ā€œ\nā€.

Labels must be uppercase, can be multi-token, have white space, consist of multiple words connected by underscores or hyphens, or may even contain a comma, as long as it is correctly escaped.

Read the file using pandas like this:

import pandas as pd

data = pd.read_csv('file.csv', names = {'label','text'})

Modify the label column and write out a preprocessed file:

data.label = data.label.upper()
data.label = data.label.replace (" ", "_")

# Don't include headers or indices
data.to_csv('out.csv',header=False,index=False,escapechar='\\',doublequote=False,quotechar='"')

Your CSV file that is ready for comprehend custom classifier training will now look like…

SAMPLE_LABEL_1,Text of document 1
SAMPLE_LABEL2,Text of document 2
SAMPLELABEL3,Text of document 3

Multi-Label Mode:

Sample label 1|Sample label2,Text of document 1
Sample label2,Text of document 2
Sample label 1|Sample label2|samplelabel3,Text of document 3

A “document” here can be a sentence, a paragraph or several paragraphs.

Our recommendation is that you provide nothing more than a paragraph. If you have anything more than these two columns, drop them or join them into a single column.

For training dataset, the file format must conform to the following requirements:

File has exactly two columns in each row. Column one is one, many, or all of the possible labels, each separated by a delimiter chosen from the available options. The default delimiter is bar (|). Column two is the content of the document itself. No header Format UTF-8, carriage return ā€œ\nā€.

Labels must be uppercase, can be multi-token, have white space, consist of multiple words connected by underscores or hyphens, or may even contain a comma, as long as it is correctly escaped.

Read the file using pandas like this:

import pandas as pd

data = pd.read_csv('file.csv', names = {'label','text'})

Modify the label column and write out a preprocessed file:

data.label = data.label.upper()
data.label = data.label.replace (" ", "_")

# Don't include headers or indices
data.to_csv('out.csv',header=False,index=False,escapechar='\\',doublequote=False,quotechar='"')

Your CSV file that is ready for comprehend custom classifier training will now look like…

SAMPLE_LABEL_1|SAMPLE_LABEL2,Text of document 1
SAMPLE_LABEL2,Text of document 2
SAMPLE_LABEL_1|SAMPLE_LABEL2|SAMPLELABEL3,Text of document 3

Note : Repeat this for all files that is part of your dataset