Comprehend custom training
Before starting, make sure you have completed the steps in the Comprehend custom preprocessing guide.
Custom classification is a two-step process:
- Identify your labels, then create and train a custom classifier to recognize those labels.
- Once Amazon Comprehend has trained the classifier, send unlabeled documents to be classified using that classifier.
Training a Custom Classifier
Using the AWS SDK for Python:
Instantiate the Boto3 client:

    import boto3

    client = boto3.client('comprehend', region_name='region')
To create a Classifier:
    create_response = client.create_document_classifier(
        InputDataConfig={'S3Uri': 's3://S3Bucket/docclass/file name'},
        DataAccessRoleArn='arn:aws:iam::account number:role/resource name',
        DocumentClassifierName='SampleCodeClassifier1',
        LanguageCode='en'
    )
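Training runs asynchronously, so the call above returns before the classifier is ready. A minimal polling sketch using the `describe_document_classifier` API; the `wait_for_classifier` helper name and the 60-second interval are our own choices, not part of the SDK:

```python
import time

def wait_for_classifier(client, classifier_arn, poll_seconds=60):
    """Poll until the classifier finishes training; return its final status."""
    while True:
        props = client.describe_document_classifier(
            DocumentClassifierArn=classifier_arn
        )['DocumentClassifierProperties']
        status = props['Status']
        # TRAINED means ready to use; IN_ERROR means training failed.
        if status in ('TRAINED', 'IN_ERROR'):
            return status
        time.sleep(poll_seconds)
```

You would pass it the ARN returned in `create_response['DocumentClassifierArn']`.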
To run a custom classifier job:
    start_response = client.start_document_classification_job(
        InputDataConfig={
            'S3Uri': 's3://S3Bucket/docclass/file name',
            'InputFormat': 'ONE_DOC_PER_LINE'
        },
        OutputDataConfig={'S3Uri': 's3://S3Bucket/output'},
        DataAccessRoleArn='arn:aws:iam::account number:role/resource name',
        DocumentClassifierArn='arn:aws:comprehend:region:account number:document-classifier/SampleCodeClassifier1'
    )
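The classification job is also asynchronous; `start_response` includes a `JobId` you can use to track it. A sketch that polls `describe_document_classification_job` until the job reaches a terminal state (the helper name and interval are illustrative assumptions):

```python
import time

def wait_for_job(client, job_id, poll_seconds=60):
    """Poll until the classification job reaches a terminal state."""
    while True:
        job = client.describe_document_classification_job(
            JobId=job_id
        )['DocumentClassificationJobProperties']
        status = job['JobStatus']
        # On COMPLETED, results are written to the OutputDataConfig S3Uri.
        if status in ('COMPLETED', 'FAILED', 'STOPPED'):
            return status
        time.sleep(poll_seconds)
```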
Some other notes
- To train a custom classifier, first identify the classes you want to use for classification. For more accurate training, we recommend at least 50 documents for each class.
- When training your classifier, the data must be in a single .csv file. The format of the data depends on which classifier mode you choose.
- After you train the custom classifier, you can analyze documents with either asynchronous (batch) or synchronous (real-time) operations.
- Multi-class mode supports up to 1 million examples containing up to 1,000 unique classes.
- Multi-label mode supports up to 1 million examples containing up to 100 unique classes.
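To illustrate the single-CSV requirement, here is a small sketch that writes a multi-class training file with one `LABEL,document text` row per example and no header; the labels, texts, and file name are invented for demonstration:

```python
import csv

# Hypothetical (label, text) training examples for multi-class mode.
rows = [
    ('SPAM', 'Win a free cruise by replying now'),
    ('NOT_SPAM', 'Meeting moved to 3pm tomorrow'),
]

with open('train.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for label, text in rows:
        writer.writerow([label, text])
```

The resulting `train.csv` is what you would upload to the `S3Uri` passed in `InputDataConfig`.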