Supervised text classification
Text classification
Text classification models are used to categorize text into organized groups. Text is analyzed by a model and then the appropriate tags are applied based on the content. Machine learning models that can automatically apply tags for classification are known as classifiers.
Classifiers do not work out of the box; they must be trained before they can make predictions for specific texts. Training a classifier involves:
1) defining a set of tags that the model will work with
2) making associations between pieces of text and the corresponding tag or tags
Once enough texts have been tagged, the classifier can learn from those associations and begin to make predictions with new texts.
Supervised Learning for Text Classification.
PART-I : Training
- During training, a feature extractor is used to transform each input value to a feature set.
- These feature sets capture the basic information about each input that should be used to categorize it.
- Pairs of feature sets and labels are fed into the machine learning algorithm to produce a model.
PART-II : Prediction
- During prediction, the same feature extractor is used to transform unobserved inputs to feature sets. These feature sets are then fed into the model, which produces predicted labels.
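The two phases above can be sketched in plain Python. This is a minimal illustration, not the project's code: `extract_features`, `train`, and `predict` are hypothetical names, and the "model" is simply a table of per-label word counts rather than a real learning algorithm. The point is that the same feature extractor is used in both phases.

```python
from collections import Counter, defaultdict

def extract_features(text):
    """Feature extractor: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def train(labeled_texts):
    """Training: accumulate the feature sets seen for each label."""
    model = defaultdict(Counter)
    for text, label in labeled_texts:
        model[label].update(extract_features(text))
    return model

def predict(model, text):
    """Prediction: reuse the same extractor, then pick the best-scoring label."""
    features = extract_features(text)
    scores = {
        label: sum(counts[word] * n for word, n in features.items())
        for label, counts in model.items()
    }
    return max(scores, key=scores.get)

# Toy (text, label) pairs, made up for this sketch
training_data = [
    ("great movie loved it", "pos"),
    ("loved the acting", "pos"),
    ("terrible movie hated it", "neg"),
    ("hated the plot", "neg"),
]
model = train(training_data)
prediction = predict(model, "loved this great film")
print(prediction)
```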
Load dataset:
To prepare the dataset, load the downloaded data into a pandas dataframe containing two columns – text and label.
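from dataset_load import dataset_load  # loader class from dataset_load.py

# load the corpus and pull out the two columns used below
load_data = dataset_load()
trainDF = load_data.load_cvs_dataset("../corpus.csv")
txt_label = trainDF['label']
txt_text = trainDF['text']
This code segment is found in trainmodel_write.py.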
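import numpy as np
import pandas as pd

class dataset_load:
    def load_cvs_dataset(self, dataset_path):
        # Set the random seed so later NumPy-based shuffling is reproducible
        np.random.seed(500)
        # Read the CSV with pandas, skipping malformed rows.
        # on_bad_lines='skip' (pandas 1.3+) replaces error_bad_lines=False, removed in pandas 2.0
        Corpus = pd.read_csv(dataset_path, encoding='latin-1', on_bad_lines='skip')
        return Corpus
This code segment is found in dataset_load.py.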
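Before moving on, it can help to check that the loaded frame really has the two expected columns. The sketch below is an assumption-laden illustration: it reads a small in-memory CSV (a made-up stand-in for corpus.csv) rather than the real file, and the checks shown are suggestions, not part of the project's loader.

```python
import io
import pandas as pd

# Hypothetical stand-in for corpus.csv: a header row, then text,label pairs
csv_data = io.StringIO(
    "text,label\n"
    "great movie loved it,pos\n"
    "terrible plot,neg\n"
    "would watch again,pos\n"
)
df = pd.read_csv(csv_data)

# Fail early if either expected column is missing
missing = {"text", "label"} - set(df.columns)
if missing:
    raise ValueError(f"missing columns: {missing}")

# Drop rows with an empty text or label, then count documents per label
df = df.dropna(subset=["text", "label"])
label_counts = df["label"].value_counts().to_dict()
print(label_counts)
```

A heavily skewed label count is worth knowing about before reading any accuracy figure.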
Text Feature Engineering:
The next step is the feature engineering step. In this step, raw text data will be transformed into feature vectors and new features will be created using the existing dataset. We will implement the following different ideas in order to obtain relevant features from our dataset.
- Count Vectors as features
- TF-IDF Vectors as features
  a. Word level
  b. N-gram level
  c. Character level
Let's look at the implementation of these ideas in detail.
Count Vectors as features:
Count Vector is a matrix notation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell represents the frequency count of a particular term in a particular document.
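from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer

# split the dataset into training and validation sets
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(txt_text, txt_label)
# create a count vectorizer object and learn the vocabulary from the full corpus
count_vect = CountVectorizer(analyzer='word')
count_vect.fit(txt_text)
# transform the training and validation data using count vectorizer object
Train_X_count = count_vect.transform(Train_X)
Test_X_count = count_vect.transform(Test_X)
This code segment is found in count_vectorizer.py.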
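To make the row/column/cell description concrete, here is a small sketch on a two-document toy corpus (the corpus is invented for illustration). It uses the same `CountVectorizer` class and prints the resulting matrix alongside the term that each column stands for.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents; each becomes one row of the count matrix
corpus = ["the cat sat", "the cat sat on the mat"]

vect = CountVectorizer(analyzer="word")
matrix = vect.fit_transform(corpus)

# vocabulary_ maps each term to its column index; order the terms by that index
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
rows = matrix.toarray().tolist()

print(terms)
for row in rows:
    print(row)
```

The second row contains a 2 in the column for "the", because that term occurs twice in the second document.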
TF-IDF Vectors as features:
The TF-IDF score represents the relative importance of a term in a document and in the entire corpus. It is composed of two terms: the first is the normalized Term Frequency (TF); the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
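The two formulas can be worked through by hand on a toy corpus. The sketch below implements them exactly as written above; the function names and the corpus are made up for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this hand calculation.

```python
import math

# Toy corpus: each document is a list of terms
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]

def tf(term, doc):
    """Occurrences of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (number of documents / documents containing the term).

    Raises ZeroDivisionError for a term that appears in no document.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" appears in every document, so its IDF (and its score) is zero
score_the = tf_idf("the", docs[0], docs)
# "mat" appears in one document out of three: (1/6) * ln(3)
score_mat = tf_idf("mat", docs[0], docs)
print(score_the, score_mat)
```

A term that occurs everywhere carries no discriminating information, which is why its score collapses to zero even though it is the most frequent term in the document.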
TF-IDF Vectors can be generated at different levels of input tokens (words, characters, n-grams)
a. Word Level TF-IDF : Matrix representing tf-idf scores of every term in different documents
b. N-gram Level TF-IDF : N-grams are combinations of N consecutive terms. This matrix represents the tf-idf scores of N-grams
c. Character Level TF-IDF : Matrix representing tf-idf scores of character level n-grams in the corpus
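The difference between the three levels lies only in what counts as a token. As a rough sketch (the helper names are hypothetical, and real vectorizers also handle casing, punctuation and whitespace normalization), word and character n-grams can be generated like this:

```python
def word_ngrams(text, n):
    """All runs of n consecutive whitespace-separated words."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """All runs of n consecutive characters, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

unigrams = word_ngrams("new york city", 1)
bigrams = word_ngrams("new york city", 2)
char_bigrams = char_ngrams("cat", 2)
print(unigrams, bigrams, char_bigrams)
```

Each distinct n-gram then becomes one column of the tf-idf matrix, in the same way that each distinct word does at word level.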
Word Level TF-IDF :
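from sklearn import model_selection, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(txt_text, txt_label)
# encode the string labels as integers, fitting on the training labels only
encoder = preprocessing.LabelEncoder()
Train_Y = encoder.fit_transform(Train_Y)
# reuse the fitted mapping (transform, not fit_transform) so both sets share the same label ids
Test_Y = encoder.transform(Test_Y)
# word-level tf-idf, keeping at most the 5000 most frequent terms
tfidf_vect = TfidfVectorizer(analyzer='word', max_features=5000)
tfidf_vect.fit(txt_text)
Train_X_Tfidf = tfidf_vect.transform(Train_X)
Test_X_Tfidf = tfidf_vect.transform(Test_X)
This code segment is found in word_tf_idf.py.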
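Label encoding is a step where a subtle bug can creep in. The sketch below, on made-up labels, shows why the encoder should be fitted once on the training labels and then reused with `transform`: refitting it on the test labels can assign different integers to the same tag.

```python
from sklearn import preprocessing

train_labels = ["pos", "neg", "neu", "pos"]
test_labels = ["pos", "neu"]  # this split happens to contain no "neg" example

# Fit once on the training labels; classes are sorted: neg=0, neu=1, pos=2
encoder = preprocessing.LabelEncoder()
train_ids = encoder.fit_transform(train_labels).tolist()

# Reusing the fitted encoder keeps the mapping consistent
test_ids = encoder.transform(test_labels).tolist()

# Refitting on the test labels alone yields a different mapping: neu=0, pos=1
refit_ids = preprocessing.LabelEncoder().fit_transform(test_labels).tolist()

print(train_ids, test_ids, refit_ids)
```

With the refitted encoder, "pos" is 1 in the test set but 2 in the training set, so a perfectly correct prediction would be scored as wrong.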
N-gram Level TF-IDF:
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer

Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(txt_text, txt_label)
# tf-idf over word bigrams and trigrams, keeping at most 5000 features
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', ngram_range=(2, 3), max_features=5000)
tfidf_vect_ngram.fit(txt_text)
Train_X_ngram = tfidf_vect_ngram.transform(Train_X)
Test_X_ngram = tfidf_vect_ngram.transform(Test_X)
This code segment is found in ngram_tf_idf.py.
Character Level TF-IDF:
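from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer

Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(txt_text, txt_label)
# tf-idf over character 2-grams and 3-grams, keeping at most 5000 features
# (token_pattern is omitted: scikit-learn only applies it when analyzer='word')
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2, 3), max_features=5000)
tfidf_vect_ngram_chars.fit(txt_text)
Train_X_ngram_chars = tfidf_vect_ngram_chars.transform(Train_X)
Test_X_ngram_chars = tfidf_vect_ngram_chars.transform(Test_X)
This code segment is found in char_tf_idf.py.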
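Each feature set above performs its own call to `train_test_split`. Passing `random_state` explicitly is one way to make sure every feature set is evaluated on the same split; whether the original project relies on the global NumPy seed instead is not something the source states. A minimal sketch on made-up data, with a hypothetical helper name `split`:

```python
from sklearn.model_selection import train_test_split

# Made-up corpus of 20 documents with alternating labels
texts = [f"document {i}" for i in range(20)]
labels = ["pos" if i % 2 == 0 else "neg" for i in range(20)]

def split(texts, labels, seed=500):
    """Reproducible, label-stratified 75/25 split."""
    return train_test_split(
        texts, labels, test_size=0.25, random_state=seed, stratify=labels
    )

# Two independent calls with the same seed return identical splits
first = split(texts, labels)
second = split(texts, labels)
print(first == second)
```

The `stratify` argument keeps the label proportions similar in the training and validation parts, which matters when one tag is much rarer than the others.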
Model training and performance evaluation:
The final step in the text classification framework is to train a classifier using the features created in the previous step. Many different machine learning models can be used for this. We will implement the following classifiers:
Naive Bayes Classifier
Linear Classifier
Support Vector Machine
Bagging Models
Let's implement these models and understand their details. The following utility function can be used to train a model. It accepts the classifier, the feature vectors of the training and validation data, and the labels of the training and validation data as inputs. Using these inputs, the model is trained and its accuracy score on the validation data is computed.
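from sklearn.metrics import accuracy_score

def train_model(self, classifier, train_input, test_input, train_target, test_target, is_neural_net=False):
    # fit the classifier on the training data, then predict the validation labels
    classifier.fit(train_input, train_target)
    predictions = classifier.predict(test_input)
    if is_neural_net:
        # neural nets return per-class scores; take the index of the highest one
        predictions = predictions.argmax(axis=-1)
    # accuracy_score expects (y_true, y_pred)
    return accuracy_score(test_target, predictions)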
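To see the utility end to end, here is a self-contained sketch that calls a standalone version of it on a tiny, made-up dataset. The free function (no `self`) and the toy texts are assumptions for illustration; the project's version is a method of its `Classifier` class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def train_model(classifier, train_input, test_input, train_target, test_target):
    """Fit on the training data and return accuracy on the validation data."""
    classifier.fit(train_input, train_target)
    predictions = classifier.predict(test_input)
    return accuracy_score(test_target, predictions)

# Toy data, invented for this sketch
train_texts = ["good great", "bad awful"]
train_labels = ["pos", "neg"]
test_texts = ["good", "awful"]
test_labels = ["pos", "neg"]

# Fit the vectorizer on the training text, then reuse it for the validation text
vect = CountVectorizer()
train_X = vect.fit_transform(train_texts)
test_X = vect.transform(test_texts)

accuracy = train_model(MultinomialNB(), train_X, test_X, train_labels, test_labels)
print("accuracy is : ", accuracy * 100)
```

Because the function only relies on `fit` and `predict`, any scikit-learn style classifier can be passed in unchanged.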
Naive Bayes:
This section trains a Naive Bayes model, using the sklearn implementation, on each of the feature sets.
Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
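The independence assumption means the score for a class is just the class prior multiplied by one probability per word. The sketch below works this out by hand with add-one (Laplace) smoothing on made-up data; the function names are hypothetical and this is an illustration of the idea, not the sklearn implementation.

```python
import math
from collections import Counter

# Toy training data, invented for this sketch
train = [("good great", "pos"), ("good fine", "pos"), ("bad awful", "neg")]

word_counts = {}        # label -> Counter of word frequencies in that class
doc_counts = Counter()  # label -> number of training documents
for text, label in train:
    word_counts.setdefault(label, Counter()).update(text.split())
    doc_counts[label] += 1
vocab = {word for counts in word_counts.values() for word in counts}

def log_posterior(text, label, alpha=1.0):
    """log P(label) plus the sum of log P(word | label), one term per word."""
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    total_words = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocab:
            continue  # words never seen in training are ignored
        p = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
        score += math.log(p)
    return score

def classify(text):
    return max(doc_counts, key=lambda label: log_posterior(text, label))

print(classify("good"), classify("awful bad"))
```

Each word contributes its own term to the sum regardless of which other words are present, which is exactly the "naive" part of the method.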
from sklearn import naive_bayes

# Naive Bayes on character-level tf-idf features
model_input = char_tf_idf().convert_feature(txt_text, txt_label)
naive = naive_bayes.MultinomialNB()
accuracy = Classifier().train_model(naive, model_input.get_train_input(), model_input.get_test_input(), model_input.get_train_target(), model_input.get_test_target())
print("NB, char_tf_idf accuracy is : ", accuracy * 100)
# Naive Bayes on count vector features
model_input = count_vectorizer().convert_feature(txt_text, txt_label)
naive = naive_bayes.MultinomialNB()
accuracy = Classifier().train_model(naive, model_input.get_train_input(), model_input.get_test_input(), model_input.get_train_target(), model_input.get_test_target())
print("NB, count_vectorizer accuracy is : ", accuracy * 100)
# Naive Bayes on n-gram-level tf-idf features
model_input = ngram_tf_idf().convert_feature(txt_text, txt_label)
naive = naive_bayes.MultinomialNB()
accuracy = Classifier().train_model(naive, model_input.get_train_input(), model_input.get_test_input(), model_input.get_train_target(), model_input.get_test_target())
print("NB, ngram_tf_idf accuracy is : ", accuracy * 100)
# Naive Bayes on word-level tf-idf features
model_input = word_tf_idf().convert_feature(txt_text, txt_label)
naive = naive_bayes.MultinomialNB()
accuracy = Classifier().train_model(naive, model_input.get_train_input(), model_input.get_test_input(), model_input.get_train_target(), model_input.get_test_target())
print("NB, word_tf_idf accuracy is : ", accuracy * 100)
This code segment is found in Naive_Bay_Clf.py.
NB, char_tf_idf accuracy is : 81.28
NB, count_vectorizer accuracy is : 82.96
NB, ngram_tf_idf accuracy is : 81.92
NB, word_tf_idf accuracy is : 85.96000000000001
Linear Classifier:
This section implements a linear classifier (logistic regression).
Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic/sigmoid function.
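The logistic (sigmoid) function is what turns a weighted sum of features into a probability. A minimal sketch, with hypothetical weights chosen by hand rather than learned:

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical model with two features: the first votes positive, the second negative
weights = [2.0, -2.0]
bias = 0.0

p_first = predict_proba([1.0, 0.0], weights, bias)   # score z = 2
p_second = predict_proba([0.0, 1.0], weights, bias)  # score z = -2
print(p_first, p_second)
```

A score of zero maps to a probability of exactly 0.5, which is the decision boundary; training adjusts the weights so that documents of each class fall on the correct side of it.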
from sklearn import linear_model

# Text feature engineering
model_input = count_vectorizer().convert_feature(txt_text, txt_label)
# Build Text Classification Model and Evaluating the Model
naive = linear_model.LogisticRegression()
accuracy = Classifier().train_model(naive, model_input.get_train_input(), model_input.get_test_input(), model_input.get_train_target(), model_input.get_test_target())
print("Linear_Clf, count_vectorizer accuracy is : ", accuracy * 100)
# Text feature engineering
model_input = ngram_tf_idf().convert_feature(txt_text, txt_label)
# Build Text Classification Model and Evaluating the Model
naive = linear_model.LogisticRegression()
accuracy = Classifier().train_model(naive, model_input.get_train_input(), model_input.get_test_input(), model_input.get_train_target(), model_input.get_test_target())
print("Linear_Clf, ngram_tf_idf accuracy is : ", accuracy * 100)
# Text feature engineering
model_input = word_tf_idf().convert_feature(txt_text, txt_label)
# Build Text Classification Model and Evaluating the Model
naive = linear_model.LogisticRegression()
accuracy = Classifier().train_model(naive, model_input.get_train_input(), model_input.get_test_input(), model_input.get_train_target(), model_input.get_test_target())
print("Linear_Clf, word_tf_idf accuracy is : ", accuracy * 100)
Linear_Clf, char_tf_idf accuracy is : 84.36
Linear_Clf, count_vectorizer accuracy is : 85.92
Linear_Clf, ngram_tf_idf accuracy is : 82.64
Linear_Clf, word_tf_idf accuracy is : 87.4
This code segment is found in Linear_Clf.py.
SVM Model:
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression. The model finds the hyperplane (a line, in two dimensions) that best separates the two classes.
from sklearn import svm

# Text feature engineering
model_input = ngram_tf_idf().convert_feature(txt_text, txt_label)
# Build Text Classification Model and Evaluating the Model
naive = svm.SVC()
accuracy = Classifier().train_model(naive, model_input.get_train_input(), model_input.get_test_input(), model_input.get_train_target(), model_input.get_test_target())
print("Svm_clf, ngram_tf_idf accuracy is : ", accuracy * 100)
Svm_clf, ngram_tf_idf accuracy is : 51.76
This code segment is found in Svm_clf.py.
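The SVM score is well below that of the other models, and the source does not say why. One thing that may be worth checking is the kernel: `svm.SVC()` defaults to an RBF kernel, whereas a linear kernel is often a reasonable baseline for sparse text features. The sketch below only shows how to select the linear kernel on a made-up toy corpus; it makes no claim about how the change would affect the figure reported for the real dataset.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this sketch
train_texts = ["good great fine", "great good nice", "bad awful poor", "awful bad worse"]
train_labels = ["pos", "pos", "neg", "neg"]
test_texts = ["good nice", "poor worse"]

# Fit the tf-idf vocabulary on the training text and reuse it for the test text
vect = TfidfVectorizer(analyzer="word")
train_X = vect.fit_transform(train_texts)
test_X = vect.transform(test_texts)

# kernel="linear" fits a separating hyperplane directly in the tf-idf space
clf = svm.SVC(kernel="linear")
clf.fit(train_X, train_labels)
predictions = clf.predict(test_X).tolist()
print(predictions)
```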
Random Forest Model:
Random forest models are a type of ensemble model, specifically a bagging model. They are part of the tree-based model family.
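Bagging itself is a simple idea: train many models, each on a bootstrap sample of the data, and let them vote. The sketch below illustrates only that mechanism in plain Python. The base learner `train_stump` is a deliberately trivial, hypothetical stand-in; a real random forest trains a decision tree on each sample and also randomizes the features considered at each split.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement: one 'bag'."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most common prediction."""
    return Counter(predictions).most_common(1)[0][0]

def train_stump(sample):
    """Hypothetical base learner: always predicts the majority label of its bag."""
    majority = majority_vote([label for _, label in sample])
    return lambda text: majority

# Toy labelled data, invented for this sketch
rows = [("good", "pos"), ("great", "pos"), ("fine", "pos"), ("bad", "neg")]

rng = random.Random(500)
bags = [bootstrap_sample(rows, rng) for _ in range(25)]
models = [train_stump(bag) for bag in bags]

# The ensemble prediction is the vote over all base models
ensemble_prediction = majority_vote([model("some new text") for model in models])
print(ensemble_prediction)
```

Because every model sees a slightly different sample, their individual errors tend to differ, and voting averages some of that variance away.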
from sklearn import ensemble

# NOTE: clear_txt is not defined in this excerpt; substitute txt_text if no cleaning step was run
# Text feature engineering
model_input = count_vectorizer().convert_feature(clear_txt, txt_label)
# Build Text Classification Model and Evaluating the Model
naive = ensemble.RandomForestClassifier()
accuracy = Classifier().train_model(naive, model_input.get_train_input(), model_input.get_test_input(), model_input.get_train_target(), model_input.get_test_target())
print("RandomForest_Clf, count_vectorizer accuracy is : ", accuracy * 100)
# Text feature engineering
model_input = word_tf_idf().convert_feature(clear_txt, txt_label)
# Build Text Classification Model and Evaluating the Model
naive = ensemble.RandomForestClassifier()
accuracy = Classifier().train_model(naive, model_input.get_train_input(), model_input.get_test_input(), model_input.get_train_target(), model_input.get_test_target())
print("RandomForest_Clf, word_tf_idf accuracy is : ", accuracy * 100)
RandomForest_Clf, count_vectorizer accuracy is : 77.84
RandomForest_Clf, word_tf_idf accuracy is : 78.52
This code segment is found in RandomForest_Clf.py.