
supervised text classification with the multinomial naive bayes algorithm

Published Sep 07, 2019

Text classification models are used to categorize text into organized groups. Text is analyzed by a model and then the appropriate tags are applied based on the content. Machine learning models that can automatically apply tags for classification are known as classifiers.
Classifiers can't work automatically out of the box; they need to be trained before they can make specific predictions for new texts. Training a classifier is done by:

  1. defining a set of tags that the model will work with
  2. making associations between pieces of text and the corresponding tag or tags

Once enough texts have been tagged, the classifier can learn from those associations and begin to make predictions on new texts.

supervised learning in text classification
Data is usually either structured or unstructured; that is, it may come with labels attached, or it may be just a series of texts without labels. When we have data with labels, we can perform supervised learning on it. Supervised text classification basically means that we have a set of examples for which we already know the correct answers. So, for the machine to learn the way we do, we provide a set of texts and their labels as input. For example, suppose [‘apple’, ’banana’, ’mango’] belong to the fruit category and [‘potato’, ‘cabbage’, ’radish’] belong to the vegetable category. Now that we have the data with its labels, we can train the machine on it. The machine then tries to learn, much as we did in school, in order to predict the category of new data, which can be either fruit or vegetable. For example, if we pass new data such as “watermelon”, the machine should return the label “fruit”. Hopefully this gives a basic understanding of text classification.

supervised learning steps in text classification

PART-I : Training

  1. During training, a feature extractor is used to transform each input value to a feature set.
  2. These feature sets capture the basic information about each input that should be used to categorize it.
  3. Pairs of feature sets and labels are fed into the machine learning algorithm to produce a model.

PART-II : Prediction

  1. During prediction, the same feature extractor is used to transform unobserved inputs into feature sets. These feature sets are then fed into the model, which produces predicted labels.
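
A minimal sketch of this two-part workflow, using scikit-learn's CountVectorizer as the feature extractor and MultinomialNB as the learning algorithm (this is only an illustration with the hypothetical documents used later in this post; the rest of the post uses a custom implementation instead of scikit-learn):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# PART-I: Training
train_texts = ["The government shutdown",
               "Federal employees are protesting shutdown",
               "Turn melancholy forth to funerals"]
train_labels = ["news", "news", "poetry"]

vectorizer = CountVectorizer()                       # feature extractor
features = vectorizer.fit_transform(train_texts)     # texts -> feature sets
model = MultinomialNB().fit(features, train_labels)  # (feature sets, labels) -> model

# PART-II: Prediction
new_features = vectorizer.transform(["The shutdown affects federal employees benefit"])
print(model.predict(new_features))                   # -> ['news']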

multinomial naive bayes algorithm

The general idea of Naive Bayes:

  1. Represent a document X as a set of (w, frequency of w) pairs.
  2. For each label y, build a probabilistic model P(X| Y = y) of documents in class y.
  3. To classify, select label y which is most likely to generate X:
    y* = argmax_y P(Y = y) · P(X | Y = y)

Assumptions:

  1. The order of the words in document X makes no difference but repetitions of words do.
  2. Words appear independently of each other, given the document class.

Based on these assumptions, we have the following equations to estimate P(X | Y = y):

(1)  P(X | Y = y) = P(w₁, …, w_n | Y = y) = ∏ᵢ P(W = wᵢ | Y = y)

(2)  P(W = w | Y = y) = count(w, y) / Σ_w′ count(w′, y)

There are two problems with these equations.
For equation (1), if our document has more than 100 words, P(w₁, …, w_n | Y = y) will be a product of very small word probabilities (< 0.1), leading to the underflow problem => working with logarithms is desirable to maintain numerical stability:

log P(X | Y = y) = Σᵢ log P(W = wᵢ | Y = y)
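
A quick sketch of why the logarithm helps (illustrative numbers only):

import math

# 200 word probabilities, each well below 0.1
probs = [0.01] * 200

product = 1.0
for prob in probs:
    product *= prob
print(product)   # 0.0 -- the true value 1e-400 underflows double precision

log_sum = sum(math.log(prob) for prob in probs)
print(log_sum)   # about -921.03 -- easily representable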

For equation (2), if we have a new word w in the text we need to classify, P(W = w | Y = y) = 0, since w has never appeared in our training data. => One solution is to smooth the probabilities. Assume we have m examples with P(w|y) = p; this use of m and p is a Dirichlet prior for the multinomial distribution. Please note that there are many smoothing methods. The smoothed estimate becomes:

P(W = w | Y = y) = (count(w, y) + m·p) / (Σ_w′ count(w′, y) + m)
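
A small sketch of this m-estimate smoothing (the function and argument names below are my own, not part of the library used later in this post):

def smoothed_prob(word_count, total_class_tokens, m=1, p=0.5):
    """m-estimate of P(W = w | Y = y): (count(w, y) + m*p) / (total tokens in class y + m)."""
    return (word_count + m * p) / (total_class_tokens + m)

# An unseen word no longer gets zero probability:
print(smoothed_prob(0, 8))   # 0.5 / 9, about 0.056
print(smoothed_prob(2, 8))   # 2.5 / 9, about 0.278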

Putting all these together, we have the following algorithm:
y* = argmax_y [ log P(Y = y) + Σᵢ log( (count(wᵢ, y) + m·p) / (Σ_w′ count(w′, y) + m) ) ]

Now, let’s work on a hypothetical example to understand the algorithm:
Suppose we have 3 documents:
X₁ = “The government shutdown” with label y₁ = news
X₂ = “Federal employees are protesting shutdown” with label y₂ = news
X₃ = “Turn melancholy forth to funerals” with label y₃ = poetry
and a new document to classify:
X_new = “The shutdown affects federal employees benefit”
We can then build the following count table from our training data (word counts per label):

word          news  poetry
the             1     0
government      1     0
shutdown        2     0
federal         1     0
employees       1     0
are             1     0
protesting      1     0
turn            0     1
melancholy      0     1
forth           0     1
to              0     1
funerals        0     1
total           8     5

For the purpose of simplicity, I did not exclude stopwords, but in practice you definitely should. Besides, to prevent the zero-probability problem for unseen words, I define the smoothing parameters p = 0.5 and m = 1. We can then calculate the score of each label for the new document X_new as follows.

score(news)   = log P(Y = news)   + log(1.5/9) + log(2.5/9) + log(0.5/9) + log(1.5/9) + log(1.5/9) + log(0.5/9)
score(poetry) = log P(Y = poetry) + 6 · log(0.5/6)

We can see that the score for label “news” is higher than the score for label “poetry”, so we will classify X_new as “news”.
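
Here is a short, self-contained sketch of this worked example. I use document counts for the prior P(Y = y), so the exact numbers are illustrative, but the ranking comes out the same:

import math
from collections import Counter

train = [("The government shutdown", "news"),
         ("Federal employees are protesting shutdown", "news"),
         ("Turn melancholy forth to funerals", "poetry")]
x_new = "The shutdown affects federal employees benefit"

m, p = 1, 0.5  # smoothing parameters

# Count tokens per class and documents per class
class_word_counts = {}
class_doc_counts = Counter()
for text, label in train:
    class_doc_counts[label] += 1
    class_word_counts.setdefault(label, Counter()).update(text.lower().split())

# Score each label: log prior + sum of smoothed log word probabilities
scores = {}
for label, counts in class_word_counts.items():
    total_tokens = sum(counts.values())
    score = math.log(class_doc_counts[label] / len(train))
    for word in x_new.lower().split():
        score += math.log((counts[word] + m * p) / (total_tokens + m))
    scores[label] = score

print(scores)                       # roughly {'news': -12.84, 'poetry': -16.01}
print(max(scores, key=scores.get))  # -> news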

multinomial naive bayes algorithm example with the sapnil_machinelearning machine learning library:
To prepare the dataset, load the downloaded data into a pandas dataframe containing two columns: the text and its label (the 'category' column).

trainDF = load_cvs_dataset("../setup.csv")
txt_label = trainDF['category']
txt_text = trainDF['text']

This code segment is found in multinomial_example.py.

import numpy as np
import pandas as pd

def load_cvs_dataset(dataset_path):
    # Set random seed for reproducibility
    np.random.seed(500)
    # Load the data using pandas (on pandas >= 1.3, replace error_bad_lines=False
    # with on_bad_lines='skip')
    Corpus = pd.read_csv(dataset_path, encoding='latin-1', error_bad_lines=False)

    return Corpus

This code segment is found in dataset_load.py.

Text Feature Engineering:
The next step is the feature engineering step. In this step, raw text data will be transformed into feature vectors and new features will be created using the existing dataset. We will implement Count Vectors as features in order to obtain relevant features from our dataset.
Count Vectors as features:
Count Vector is a matrix notation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell represents the frequency count of a particular term in a particular document.
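
For illustration, here is a minimal sketch of such a document-term count matrix built with scikit-learn's CountVectorizer (the post's own library builds the equivalent structure in count_word_fit.py, shown further below):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The government shutdown",
        "Federal employees are protesting shutdown",
        "Turn melancholy forth to funerals"]

vectorizer = CountVectorizer()
count_matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # one column per term in the corpus
print(count_matrix.toarray())              # one row per document, cells hold term counts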

trainSet, testcopy, labelset, testlabelcopy = splitDataset(txt_text, txt_label, 0.2)
model_input = count_word_fit(trainSet, labelset)

This code segment is found in multinomial_example.py.

Clean the text of each document before generating the feature frequency matrix:


import re
import string
from collections import defaultdict

from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from stop_words import get_stop_words

# Map part-of-speech tags to WordNet tags for the lemmatizer (default: noun)
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

# Using Python's stop-words package to get the stop words in English
stop_words = get_stop_words('english')

# doc_list holds the raw text of every training document
for doc in doc_list:
    # Remove digits
    result_doc = re.sub(r'\d+', '', doc)
    # Tokenize the document
    words = word_tokenize(result_doc)
    # Lowercase every token
    low_tokens = [w.lower() for w in words]

    # Remove punctuation marks
    table = str.maketrans('', '', string.punctuation)
    pun_words = [w.translate(table) for w in low_tokens]
    emp_str_list = list(filter(None, pun_words))

    # Lemmatize the tokens, keeping only alphabetic, non-stop words
    Final_words = []
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag provides the 'tag', i.e. whether the word is a noun (N), verb (V) or something else
    for word, tag in pos_tag(emp_str_list):
        # Check for stop words and consider only alphabetic tokens
        if word not in stopwords.words('english') and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word, tag_map[tag[0]])
            Final_words.append(word_Final)

    # Remove any remaining stop words (NLTK's list; this is a second pass)
    stop_words_set = set(stopwords.words('english'))
    rvm_stop_words = [w for w in Final_words if w not in stop_words_set]

This code segment is found in count_word_fit.py.

After cleaning the text of each document, generate the feature frequency matrix for each class:

# Total token count per class, and per-word counts within each class
total_class_token = {}
class_eachtoken_count = {}

for class_label in class_labels:
    total_class_token[class_label] = 0
    class_eachtoken_count[class_label] = {}
    for voc in vocabulary:
        class_eachtoken_count[class_label][voc] = 0

doccount = 0
total_voca_count = 0
for doc in doc_list:
    words = doc.split(" ")
    # Class label of the current document
    class_label = temp_class_labels[doccount]

    # Count every known vocabulary word towards the document's class
    for word in words:
        if word in vocabulary:
            class_eachtoken_count[class_label][word] = class_eachtoken_count[class_label][word] + 1
            total_class_token[class_label] = total_class_token[class_label] + 1
            total_voca_count = total_voca_count + 1

    doccount = doccount + 1

This code segment is found in count_word_fit.py.
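
For the hypothetical news/poetry corpus from the worked example above, these structures would end up looking roughly like this (illustrative values, assuming no stop-word removal):

total_class_token = {'news': 8, 'poetry': 5}

class_eachtoken_count = {
    'news':   {'the': 1, 'government': 1, 'shutdown': 2, 'federal': 1,
               'employees': 1, 'are': 1, 'protesting': 1,
               'turn': 0, 'melancholy': 0, 'forth': 0, 'to': 0, 'funerals': 0},
    'poetry': {'the': 0, 'government': 0, 'shutdown': 0, 'federal': 0,
               'employees': 0, 'are': 0, 'protesting': 0,
               'turn': 1, 'melancholy': 1, 'forth': 1, 'to': 1, 'funerals': 1},
}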

multinomial naive bayes algorithm implementation

The following code segment is the prediction step of the multinomial naive bayes algorithm:

score_Class = 0
max_score = float('-inf')   # log scores are negative, so start from minus infinity
final_class_label = ''

vocabularyCount = model_data.get_vocabularyCount()
total_class_token = model_data.get_total_class_token()
for class_label in model_data.get_class_labels():
    # log prior of the class plus the sum of the smoothed log conditional
    # probabilities of the document's words (condProbabilityOfTermClass is
    # computed earlier for the current document)
    score_Class = math.log(total_class_token[class_label] / vocabularyCount) + condProbabilityOfTermClass[class_label]
    # Keep the label with the highest score
    if score_Class > max_score:
        max_score = score_Class
        final_class_label = class_label
final_doc_class_label['doc' + '-' + str(doccount)] = final_class_label
doccount = doccount + 1

This code segment is found in multinomial_nativebayes.py.

algorithm performance evaluation:

def accuracy_score(testlabelcopy, final_doc_class_label):
    label_count = 0
    wrong_count = 0
    for label in testlabelcopy:
        # Compare the true label with the predicted label for document 'doc-<index>'
        if label != final_doc_class_label['doc' + '-' + str(label_count)]:
            wrong_count = wrong_count + 1
        label_count = label_count + 1

    accuracy = ((len(testlabelcopy) - wrong_count) / len(testlabelcopy)) * 100

    return accuracy

This code segment is found in multinomial_nativebayes.py.
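
For example, with two test documents whose true labels are 'news' and 'poetry' but which are both predicted as 'news' (hypothetical values), half of the predictions are correct:

true_labels = ['news', 'poetry']
predictions = {'doc-0': 'news', 'doc-1': 'news'}

print(accuracy_score(true_labels, predictions))  # 50.0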

limitations of the multinomial naive bayes algorithm

  1. If a document in the test dataset contains very few of the known words in the vocabulary set, the posterior scores of the classes become almost identical, so the wrong class/category may be returned for that document. As a result, the accuracy of the multinomial naive bayes algorithm decreases.
  2. The basic bag-of-words (BOW) approach does not consider the meaning of a word in the document. It completely ignores the context in which the word is used; the same word can mean different things depending on the context or nearby words.
  3. For a large document, the vector size can be huge, resulting in a lot of computation and time. You may need to ignore words based on relevance to your use case.