XGBoost preprocessing


Make sure you have a CSV file that contains the column you want to predict along with the other data columns (features). This page calls it 'file.csv'.

Import the required libraries

from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
import numpy as np

Read data using pandas

data = pd.read_csv('file.csv')

Choose the column you want to predict

predictor_column_name = 'column-name'
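
Before cleaning, it can help to confirm that the chosen column exists and to see which columns are non-numeric or contain missing values. The following is a minimal sketch that uses a small made-up DataFrame in place of file.csv. The column names and the helper summarize_columns are hypothetical and not part of the original recipe.

```python
import numpy as np
import pandas as pd

# Toy stand-in for pd.read_csv('file.csv'); column names are illustrative only.
data = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "city": ["Oslo", "Lima", "Oslo", None],
    "label": ["yes", "no", "yes", "no"],
})
predictor_column_name = "label"


def summarize_columns(df, target):
    """Return a per-column summary of dtype and missing-value count."""
    if target not in df.columns:
        raise KeyError(f"column {target!r} not found in data")
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
    })


summary = summarize_columns(data, predictor_column_name)
print(summary)
```

Columns that are not numeric are the ones that get one-hot encoded in the next step, and the missing counts show how many values will later be replaced with zero.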

Clean up the data

# split data into X (features) and y (column you want to predict)
X = data.drop(predictor_column_name, axis=1)
Y = data[predictor_column_name]

# convert text labels (Y) into numbers
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

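# Convert features (X) to numbers by one-hot encoding the text columns
# (dtype=float keeps the dummy columns numeric instead of boolean)
features = pd.get_dummies(X, dtype=float).values
# Put the encoded label in the first column, followed by the features
alldata = np.column_stack((label_encoded_y, features))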

# Replace NaN with zero and infinities with large finite numbers
alldata = np.nan_to_num(alldata)
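
To see what the cleaning step produces, here is a small sketch that runs the same calls on a three-row toy table. The data and the toy_ variable names are made up for illustration. It assumes the label ends up in the first column, as in the code above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "size": [1.0, np.nan, 3.0],
    "color": ["red", "blue", "red"],
    "label": ["cat", "dog", "cat"],
})

toy_X = toy.drop("label", axis=1)
toy_y = toy["label"]

# Text labels become integers, assigned in sorted order of the classes.
toy_encoder = LabelEncoder().fit(toy_y)
toy_encoded_y = toy_encoder.transform(toy_y)

# dtype=float keeps the dummy columns numeric rather than boolean.
toy_dummies = pd.get_dummies(toy_X, dtype=float)
toy_features = toy_dummies.values

# Label goes in the first column, features follow; NaN becomes 0.
toy_all = np.nan_to_num(np.column_stack((toy_encoded_y, toy_features)))

print(list(toy_dummies.columns))
print(toy_all)

# The fitted encoder maps numeric predictions back to the original labels.
print(toy_encoder.inverse_transform([0, 1]))
```

Keeping the fitted encoder around is useful later, because inverse_transform turns the model's numeric predictions back into the original text labels.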

Split and write the data

# 70% train, 20% validation, 10% test, split in the existing row order
n_rows = len(alldata)
train_data, validation_data, test_data = np.split(alldata, [int(0.7 * n_rows), int(0.9 * n_rows)])

np.savetxt('train.csv', train_data, delimiter=',')
np.savetxt('validation.csv', validation_data, delimiter=',')
np.savetxt('test.csv', test_data, delimiter=',')
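
np.split cuts the rows in their original order, so if file.csv is sorted (for example by label or by date), the three files can end up with different distributions. One option, offered here as an assumption rather than as part of the original recipe, is to shuffle the rows with a fixed seed before splitting. The helper name shuffled_split, its parameters, and the seed are hypothetical.

```python
import numpy as np


def shuffled_split(rows, train_end_frac=0.7, validation_end_frac=0.9, seed=0):
    """Shuffle rows, then split into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rows[rng.permutation(len(rows))]
    train_end = int(train_end_frac * len(rows))
    validation_end = int(validation_end_frac * len(rows))
    return np.split(shuffled, [train_end, validation_end])


# Toy array standing in for alldata: 10 rows, 3 columns.
demo = np.arange(30, dtype=float).reshape(10, 3)
train_part, validation_part, test_part = shuffled_split(demo)
print(len(train_part), len(validation_part), len(test_part))
```

With the real data this would be called as shuffled_split(alldata) in place of the np.split line. The fixed seed keeps the split reproducible between runs.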

Upload these three files to three separate folders (prefixes) on S3 using this link
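
The upload can also be scripted. The sketch below assumes the boto3 library and its S3 client method upload_file(Filename, Bucket, Key). The bucket name, the key layout, and the helper upload_splits are hypothetical. The client is passed in as an argument so the function itself does not need AWS credentials to be defined.

```python
def upload_splits(s3_client, bucket, files_to_keys):
    """Upload each local file to its S3 key via the client's upload_file method."""
    for local_path, key in files_to_keys.items():
        s3_client.upload_file(local_path, bucket, key)


# Hypothetical layout: one prefix per split.
files_to_keys = {
    "train.csv": "train/train.csv",
    "validation.csv": "validation/validation.csv",
    "test.csv": "test/test.csv",
}

# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   upload_splits(boto3.client("s3"), "my-bucket", files_to_keys)
```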
