
Learning From Unlabelled Data - EM Approach

Accurately labelled data is a bottleneck in many machine learning problems: labels are difficult and expensive to obtain, and even when we do obtain labelled data, the labels might not be 100% accurate. Many startups working in the machine learning space resort to crowdsourcing the labelling task.

Inspired by this research paper, I am going to try using lots of unlabelled data, in addition to a small amount of labelled data, to improve the prediction accuracy of a classifier. Results from the referred paper and from my own experiments show that a classification model built using labelled+unlabelled data is significantly better than one built on labelled data alone only when the amount of labelled data is much smaller than the amount of unlabelled data.

The algorithm behind learning from unlabelled data is pretty straightforward to understand. It is an iterative algorithm based on the principle of Expectation Maximization. Just as EM deals with incomplete data, learning from unlabelled data is also framed as an incomplete-data problem, where the unlabelled documents are treated as having missing class labels.

  1. Train a classifier on the labelled training data only, i.e. estimate the initial model parameters \theta^{(0)} from the labelled data alone. This is the initialization step.
  2. E Step: Based on the current estimates of the model parameters \theta^{(k)}, compute the expected class labels and their probabilities for the unlabelled documents.
  3. M Step: Re-compute the model parameters \theta^{(k+1)} based on both the labelled documents and the expected class labels (probabilities) for the unlabelled documents computed in the E step.
  4. Repeat the E and M steps until convergence.

The chosen classifier is assumed to be of the generative type, since we are assuming that the model parameters estimated from the labelled data alone also hold when computing the expected class labels for the unlabelled data, i.e. the unlabelled data is generated from the same model parameters. Thus we choose Naive Bayes as our base classifier.

Working with Naive Bayes and the EM algorithm involves the following assumptions:

  • The EM algorithm assumes the data has been generated from a mixture model (for text data, a mixture of multinomials rather than Gaussians), which in practice may not be true.
  • In Naive Bayes, the word events are assumed to be independent of each other, i.e. the model is assumed to generate the words of a document independently, and the position of a word in the document does not matter. This also may not be true.
  • In our implementation, it is assumed that each class has been generated from a single mixture component, i.e. there is a one-to-one correspondence between mixture components and class labels. This also may not be true: a class can have several sub-topics which might come from different mixture components. For example, a "Sports" class can contain both Cricket and Football related documents.

The model parameters for the Naive Bayes classifier are the prior probabilities of the class labels c_k, P(c_k), and the conditional probabilities of the document features w_j given the class labels, P(w_j|c_k).

To begin with, we include the required headers and define a bunch of typedefs in C++ for the different kinds of maps and nested maps used throughout the code.

#include <Rcpp.h>
#include <unordered_map>
#include <set>
#include <vector>
#include <cmath>
#include <limits>
#include <numeric>
#include <algorithm>

using namespace Rcpp;

typedef std::unordered_map<int, std::unordered_map<int, double>> Int_Int_Double;
typedef std::unordered_map<int, double> Int_Double;
typedef std::unordered_map<int, int> Int_Int;

The C++ function below computes the table of class label counts over the documents. It uses the following formula for the prior class counts:

T(c_k)=\sum_{i=1}^{|D|}{\Lambda}(i)P(y_i=c_k|d_i)

where |D| is the total number of documents (labelled + unlabelled)

{\Lambda}(i)=1 for a labelled document, and {\Lambda}(i)=\lambda for an unlabelled document, where \lambda is some constant between 0 and 1.

P(y_i=c_k|d_i) is the probability of the i-th document being in class c_k.

Note that, the raw counts T(c_k) are not normalized to probabilities at this stage. The probabilities are computed only at the time of computing the posterior class probabilities.

The function accepts a joint map ("rowClassProbs") from each document and class to the probability of that document belonging to that class; for example, if the 10th document has a probability of 0.87 of being in class 0, then "rowClassProbs[10][0]=0.87". It also accepts a map ("rowClassMap") from each document to its actual class label, so "rowClassMap[10]" is the class label assigned to labelled document number 10. We assign a class label of -1 to all unlabelled documents; thus for the unlabelled document number 100, "rowClassMap[100]" would be -1.

"rowClassProbs" corresponds to the term P(y_i=c_k|d_i) in the above equation and "rowClassMap" is used for deciding the value of {\Lambda}(i) in the above equation.

As defined above, "lambda" is the weighting factor \lambda for the unlabelled documents.

Int_Double computeClassCounts(Int_Int_Double &rowClassProbs, Int_Int &rowClassMap, 
                              const double &lambda) {
  
  Int_Double classPriorCounts;
  
  for (auto p = rowClassProbs.begin(); p != rowClassProbs.end(); ++p) {
    const Int_Double &classProbs = p->second;
    
    for (auto q = classProbs.begin(); q != classProbs.end(); ++q) {
      // Unlabelled documents (label -1) are down-weighted by lambda.
      if (rowClassMap[p->first] == -1) classPriorCounts[q->first] += lambda*(q->second);
      else classPriorCounts[q->first] += (q->second);
    }
  }
  
  return classPriorCounts;
}
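As a quick, hypothetical illustration of how the {\Lambda}(i) weighting plays out (the document ids and probabilities below are made up), consider one labelled and one unlabelled document:

void toyClassCounts() {
  // Document 1 is labelled with class 0; document 2 is unlabelled (label -1)
  // and carries soft class probabilities from the E step.
  Int_Int_Double rowClassProbs;
  rowClassProbs[1][0] = 1.0;
  rowClassProbs[2][0] = 0.6;
  rowClassProbs[2][1] = 0.4;
  
  Int_Int rowClassMap;
  rowClassMap[1] = 0;
  rowClassMap[2] = -1;
  
  // With lambda = 0.5: T(0) = 1.0 + 0.5*0.6 = 1.3 and T(1) = 0.5*0.4 = 0.2.
  Int_Double classCounts = computeClassCounts(rowClassProbs, rowClassMap, 0.5);
}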

The next function computes the class-conditional feature counts over all the documents (these are normalized to the probabilities P(w_j|c_k) later). The function uses the following equation to determine the counts:

T(w_j|c_k)=\sum_{i=1}^{|D|}{\Lambda}(i)N(w_j, d_i)P(y_i=c_k|d_i).

As above, the parameter "rowClassProbs" corresponds to the probabilities P(y_i=c_k|d_i) and "rowClassMap" is used for deciding the value of {\Lambda}(i) in the above equation.

N(w_j, d_i) is the weight (TF, TF-IDF, etc.) of the j-th feature in the i-th document, corresponding to the [i,j]-th entry in the Document Term Matrix.

The parameter "rowColValMap" corresponds to N(w_j, d_i) in the above equation.

Int_Int_Double computePriorClassFeatureCounts(Int_Int_Double &rowClassProbs, Int_Int_Double &rowColValMap,
                                              Int_Int &rowClassMap, const double &lambda) {
  
  Int_Int_Double priorCounts;
  
  for (auto p = rowColValMap.begin(); p != rowColValMap.end(); ++p) {
    int row = p->first;
    
    const Int_Double &colValMap = p->second;
    const Int_Double &classProbs = rowClassProbs[row];
    
    for (auto q = classProbs.begin(); q != classProbs.end(); ++q) {
      for (auto r = colValMap.begin(); r != colValMap.end(); ++r) {
        // Each feature weight is scaled by the document's class probability,
        // and additionally by lambda if the document is unlabelled.
        if (rowClassMap[row] == -1) priorCounts[q->first][r->first] += lambda*(r->second)*(q->second);
        else priorCounts[q->first][r->first] += (r->second)*(q->second);
      }
    }
  }
  
  return priorCounts;
}
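Continuing the same hypothetical toy setup, here is a sketch of how the feature counts accumulate (the feature id and TF weights are again made up):

void toyFeatureCounts(Int_Int_Double &rowClassProbs, Int_Int &rowClassMap) {
  // rowClassProbs and rowClassMap as constructed in the previous snippet.
  // Feature 7 occurs with weight 2 in labelled document 1 and weight 1
  // in unlabelled document 2.
  Int_Int_Double rowColValMap;
  rowColValMap[1][7] = 2.0;
  rowColValMap[2][7] = 1.0;
  
  // With lambda = 0.5: T(w_7|c_0) = 1.0*2.0 + 0.5*1.0*0.6 = 2.3 and
  // T(w_7|c_1) = 0.5*1.0*0.4 = 0.2.
  Int_Int_Double featureCounts = computePriorClassFeatureCounts(rowClassProbs, rowColValMap,
                                                                rowClassMap, 0.5);
}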

Next, we define the function for computing the posterior class probabilities P(y_i=c_k|d_i). It uses the following equations:

T(y_i=c_k|d_i)=P(c_k)P(d_i|c_k)

P(y_i=c_k|d_i)=\frac{T(y_i=c_k|d_i)}{\sum_{k=1}^{|C|}T(y_i=c_k|d_i)}

where P(c_k) are the normalized probabilities of the counts from the "computeClassCounts" function above, computed as:

P(c_k)=\frac{T(c_k)}{\sum_{k=1}^{|C|}T(c_k)}

From the word independence assumption of Naive Bayes, we have the following formula:

P(d_i|c_k)=\prod_{j=1}^{|W_{d_i}|}P(w_j|c_k)

where |W_{d_i}| is the number of text features in the document d_i. The product is over all the text features found inside document d_i. The values of P(w_j|c_k) are computed by normalizing the counts T(w_j|c_k) over all the features.

P(w_j|c_k)=\frac{T(w_j|c_k)}{\sum_{j=1}^{|W|}T(w_j|c_k)}
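One detail worth calling out before the code: the implementation below additionally applies Laplace (add-one) smoothing when normalizing the counts, which is where the +1 terms and the |C| and |W| terms in the code come from. The smoothed estimates actually used are:

P(c_k)=\frac{1+T(c_k)}{|C|+|D|}

P(w_j|c_k)=\frac{1+T(w_j|c_k)}{|W|+\sum_{j=1}^{|W|}T(w_j|c_k)}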

Int_Int_Double computePosteriorProbabilities(Int_Int_Double &priorClassFeatureCounts, Int_Double &priorClassCounts, 
                                             Int_Int_Double &rowColValMap, const std::set<int> &classLabels,
                                             const double &numDocs, const double &numFeatures) {
  
  Int_Int_Double posteriorProbs;
  
  Int_Double classFeatureProbsSum;
  
  int numClasses = (int)classLabels.size();
  
  // Total feature count mass per class, used as the normalizer for P(w_j|c_k).
  for (auto p = priorClassFeatureCounts.begin(); p != priorClassFeatureCounts.end(); ++p) {
    const Int_Double &x = p->second;
    for (auto q = x.begin(); q != x.end(); ++q) classFeatureProbsSum[p->first] += q->second;
  }
  
  for (auto p = rowColValMap.begin(); p != rowColValMap.end(); ++p) {
    Int_Double colValMap = p->second;
    
    for (auto q = classLabels.begin(); q != classLabels.end(); ++q) {
      
      // Laplace smoothed class prior: P(c_k) = (1 + T(c_k)) / (|C| + |D|).
      double classProb = (1+priorClassCounts[*q])/((double)numClasses + (double)numDocs);
      posteriorProbs[p->first][*q] += log(classProb);
      
      // Smoothing denominator for the feature probabilities: sum of counts plus |W|.
      double constant = (double)classFeatureProbsSum[*q] + (double)numFeatures;
      
      double featureProb = 0;
      
      for (auto r = colValMap.begin(); r != colValMap.end(); ++r) {
        if (priorClassFeatureCounts[*q].find(r->first) != priorClassFeatureCounts[*q].end()) 
          featureProb += log(1+priorClassFeatureCounts[*q][r->first])-log(constant);
        
        else featureProb -= log(constant);
      }
      
      posteriorProbs[p->first][*q] += featureProb;
    }
    
    Int_Double x = posteriorProbs[p->first];
    
    // Shift the log sums by their maximum before exponentiating, so that
    // exp() does not underflow to zero for very negative log sums.
    double maxVal = -std::numeric_limits<double>::max();
    
    for (auto q = x.begin(); q != x.end(); ++q) if (q->second > maxVal) maxVal = q->second;
    
    for (auto q = x.begin(); q != x.end(); ++q) posteriorProbs[p->first][q->first] = exp((q->second)-maxVal);
  }
  
  // Normalize so that the class probabilities for each document sum to 1.
  for (auto p = posteriorProbs.begin(); p != posteriorProbs.end(); ++p) {
    Int_Double x = p->second;

    double sum = 0;
    for (auto q = x.begin(); q != x.end(); ++q) sum += q->second;

    for (auto q = x.begin(); q != x.end(); ++q) posteriorProbs[p->first][q->first] = (q->second)/sum;
  }
  
  return posteriorProbs;
}

Note that instead of computing the product of the probabilities, I am computing their log sum inside the for loop. When there are many features in a document, the product of all the class feature probabilities can approach zero very quickly, and floating point arithmetic would round any sufficiently small value off to 0. To avoid this underflow, I work with the log sum values instead. For example, the machine would round e^{-1000} off to 0, but its logarithm is simply -1000.

After the log sum computation, we compute the final probability by taking the exponential of the log sum values. Note that here too, if the log sum value is too small (e.g. below -500 on my machine), the exponential term will again underflow to zero. We use the observation that we are computing normalized probabilities. For example, if the log sum values for class 0 and class 1 are 'a' and 'b' respectively, then the probabilities are given as:

P(0)=\frac{e^a}{e^a+e^b} and P(1)=\frac{e^b}{e^a+e^b}

Note that if we subtract a constant 'c' from both 'a' and 'b', we still get the same probabilities.

P(0)=\frac{e^{a-c}}{e^{a-c}+e^{b-c}}=\frac{e^{a} \cdot e^{-c}}{e^{-c} \cdot (e^{a}+e^{b})}=\frac{e^{a}}{e^{a}+e^{b}}

Thus we can choose the constant 'c' to be anything. In the code above, 'c' is the maximum of the log sum values (the standard log-sum-exp trick); choosing 'c' near the mean of 'a' and 'b' works equally well. For example, if 'a'=-700 and 'b'=-702, then with 'c'=-701, instead of computing e^{-700} and e^{-702} we only need to compute e^{1} and e^{-1}, which do not underflow to zero.
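The same trick can be packaged as a small standalone helper. Below is a minimal sketch (the function name "logScoresToProbs" is mine, not part of the model code) that mirrors what computePosteriorProbabilities does, using the maximum as the shift constant:

// Convert a vector of log scores into normalized probabilities using
// the max-shift (log-sum-exp) trick.
std::vector<double> logScoresToProbs(const std::vector<double> &logScores) {
  double maxVal = *std::max_element(logScores.begin(), logScores.end());
  
  std::vector<double> probs(logScores.size());
  double sum = 0;
  
  for (size_t i = 0; i < logScores.size(); i++) {
    // Safe to exponentiate: the shifted arguments are all <= 0.
    probs[i] = exp(logScores[i] - maxVal);
    sum += probs[i];
  }
  
  for (size_t i = 0; i < probs.size(); i++) probs[i] /= sum;
  
  return probs;
}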

Next we define the main function for doing either plain Naive Bayes model parameter estimation or estimation with semi-supervised learning (i.e. learning from both labelled and unlabelled data).

Unlike in the paper, where the EM iterations run until the likelihood of all the data has been maximized and converged to within a certain threshold, I use a maximum number of iterations parameter, as it saves me from re-computing the likelihood in each iteration (although the overall complexity would remain the same).
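For reference, the quantity the paper monitors for convergence is (ignoring the {\Lambda} weighting and the prior over the parameters) roughly the log likelihood of all the data,

l(\theta|D)=\sum_{d_i \in D^l}\log\left(P(y_i|\theta)P(d_i|y_i,\theta)\right)+\sum_{d_i \in D^u}\log\sum_{k=1}^{|C|}P(c_k|\theta)P(d_i|c_k,\theta)

where D^l and D^u denote the labelled and unlabelled documents respectively.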

The algorithm proceeds as follows:

Step 1: Initialization. Compute P^{(0)}(c_k) and P^{(0)}(w_j|c_k) as defined above, using only the labelled data.

Step 2: E Step. In the k-th iteration, compute P^{(k)}(y_i=c_k|d^u_i), but only for the unlabelled documents.

Step 3: M Step. In the k-th iteration, re-compute P^{(k+1)}(c_k) and P^{(k+1)}(w_j|c_k) from all the documents (labelled + unlabelled).

Repeat steps 2 and 3 for "maxIter" iterations.

The semi-supervised code gets activated only when there are unlabelled documents present in the input.

The output model from this function is returned to the calling R function as a List of two DataFrames via the Rcpp library: one data frame for the class counts and one for the class-feature counts.

// [[Rcpp::export]]
List cpp__nb(DataFrame inputSparseMatrix, std::vector<int> classLabels, 
             int numDocs, int numFeatures, double lambda=0.5, int maxIter=5) {
  
  std::vector<int> rows = inputSparseMatrix["i"];
  std::vector<int> cols = inputSparseMatrix["j"];
  std::vector<double> vals = inputSparseMatrix["v"];
  
  std::set<int> uniqueRows(rows.begin(), rows.end());
  std::set<int> uniqueLabels(classLabels.begin(), classLabels.end());
  
  if (uniqueLabels.find(-1) != uniqueLabels.end()) uniqueLabels.erase(uniqueLabels.find(-1));
  
  Int_Int_Double rowColValMap, unlabelledDocsRowColValMap, labelledDocsRowColValMap;
  for (size_t i = 0; i < rows.size(); i++) rowColValMap[rows[i]][cols[i]] = vals[i];

  Int_Int rowClassMap;
  // Row ids coming from R are 1-based, hence the (*p - 1) index into classLabels.
  for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) rowClassMap[*p] = classLabels[*p-1];
  
  for (auto p = rowColValMap.begin(); p != rowColValMap.end(); ++p) {
    if (rowClassMap[p->first] == -1) unlabelledDocsRowColValMap[p->first] = p->second;
    else labelledDocsRowColValMap[p->first] = p->second;
  }
  

  Int_Int_Double rowClassProbs;

  for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) if (rowClassMap[*p] != -1) rowClassProbs[*p][rowClassMap[*p]] = 1;

  Int_Double priorClassCounts = computeClassCounts(rowClassProbs, rowClassMap, lambda);
  Int_Int_Double priorClassFeatureCounts = computePriorClassFeatureCounts(rowClassProbs, rowColValMap, rowClassMap, lambda);

  if (unlabelledDocsRowColValMap.size() > 0) {
    
    int counter = 0;
    
    while (counter < maxIter) {
      
      Int_Int_Double unlabelledDocsRowClassProbs = computePosteriorProbabilities(priorClassFeatureCounts, priorClassCounts, 
                                                                                 unlabelledDocsRowColValMap, uniqueLabels,
                                                                                 (double)numDocs, (double)numFeatures);
      
      for (auto p = uniqueRows.begin(); p != uniqueRows.end(); ++p) if (rowClassMap[*p] == -1) rowClassProbs[*p] = unlabelledDocsRowClassProbs[*p];
      
      priorClassCounts = computeClassCounts(rowClassProbs, rowClassMap, lambda);
      priorClassFeatureCounts = computePriorClassFeatureCounts(rowClassProbs, rowColValMap, rowClassMap, lambda);
      
      counter++;
    }
  }
  
  
  std::vector<int> clLabels1, clLabels2, featureLabels2;
  std::vector<double> clCounts1, clFeatureCounts2;
  
  for (auto p = priorClassCounts.begin(); p != priorClassCounts.end(); ++p) {
    clLabels1.push_back(p->first);
    clCounts1.push_back(p->second);
  }
  
  for (auto p = priorClassFeatureCounts.begin(); p != priorClassFeatureCounts.end(); ++p) {
    Int_Double x = p->second;
    for (auto q = x.begin(); q != x.end(); ++q) {
      clLabels2.push_back(p->first);
      featureLabels2.push_back(q->first);
      clFeatureCounts2.push_back(q->second);
    }
  }
  
  return List::create(_["ClassProbs"]=DataFrame::create(_["Class"]=clLabels1, _["Count"]=clCounts1),
                      _["ClassFeatureProbs"]=DataFrame::create(_["Class"]=clLabels2, 
                                              _["Feature"]=featureLabels2, 
                                              _["Count"]=clFeatureCounts2));
}

Next we define the function for validating our model on unseen test documents. This part is pretty simple, as it mostly involves converting the input model (a List of DataFrames) back into the appropriate map data structures for faster execution in C++.

// [[Rcpp::export]]
Int_Int_Double cpp__nbTest(DataFrame inputSparseMatrix, List model) {
  
  std::vector<int> rows = inputSparseMatrix["i"];
  std::vector<int> cols = inputSparseMatrix["j"];
  std::vector<double> vals = inputSparseMatrix["v"];
  
  DataFrame clCounts = as<DataFrame>(model["ClassProbs"]);
  DataFrame clFeatureCounts = as<DataFrame>(model["ClassFeatureProbs"]);
  
  Int_Int_Double rowColValMap;
  for (size_t i = 0; i < rows.size(); i++) rowColValMap[rows[i]][cols[i]] = vals[i];
  
  Int_Double priorClassCounts;
  Int_Int_Double priorClassFeatureCounts;
  
  std::vector<int> clLabels1 = as<std::vector<int>>(clCounts["Class"]);
  std::vector<double> clCounts1 = as<std::vector<double>>(clCounts["Count"]);
  
  std::set<int> uniqueLabels(clLabels1.begin(), clLabels1.end());
  
  std::vector<int> clLabels2 = as<std::vector<int>>(clFeatureCounts["Class"]);
  std::vector<int> featureLabels2 = as<std::vector<int>>(clFeatureCounts["Feature"]);
  std::vector<double> clFeatureCounts2 = as<std::vector<double>>(clFeatureCounts["Count"]);
  
  std::set<int> featureSet(featureLabels2.begin(), featureLabels2.end());
  
  double numDocs = std::accumulate(clCounts1.begin(), clCounts1.end(), 0.0);
  
  double numFeatures = (double)featureSet.size();
  
  for (size_t i = 0; i < clLabels1.size(); i++) priorClassCounts[clLabels1[i]] = clCounts1[i];
  for (size_t i = 0; i < clLabels2.size(); i++) priorClassFeatureCounts[clLabels2[i]][featureLabels2[i]] = clFeatureCounts2[i];
  
  return computePosteriorProbabilities(priorClassFeatureCounts, priorClassCounts, rowColValMap, uniqueLabels, 
                                       numDocs, numFeatures);
}
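To turn the returned posterior map into hard predictions, a small helper like the following can be used (a sketch; "predictLabels" is my own name and not part of the exported code):

// Pick the most probable class for each test document from the
// posterior probabilities returned by cpp__nbTest.
Int_Int predictLabels(Int_Int_Double &posteriorProbs) {
  Int_Int predictions;
  
  for (auto p = posteriorProbs.begin(); p != posteriorProbs.end(); ++p) {
    double bestProb = -1;
    
    for (auto q = p->second.begin(); q != p->second.end(); ++q) {
      if (q->second > bestProb) {
        bestProb = q->second;
        predictions[p->first] = q->first;
      }
    }
  }
  
  return predictions;
}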

For testing the code, I have used the 20 Newsgroups data. The data consists of 18846 documents spanning 20 different classes. I read the documents into a sparse Document Term Matrix in R using the "tm" package. The text features used are 1-grams and 2-grams. After some feature selection and pruning, I was left with about 2811 unique features.

I divide the dataset into 3 parts using random stratified sampling: the first part is the set of labelled documents; the second part is treated as unlabelled data (these documents do have labels, but the labels are replaced with -1); and the third part is the validation data used for computing the accuracy of the model.

In the first experiment, I build the model only with the first part of the dataset and test it on the third part, the validation data. In this experiment there is no unlabelled data.

In the second experiment, I combine the first and second parts, train the model using semi-supervised learning, and then test the model on the same validation data.

Observations from the above two experiments:

  • With only 1% of the documents in supervised training and about 75% of the total documents in semi-supervised training, the accuracy on the validation set for the first experiment varied from 45% to 50%, whereas with \lambda=1 the accuracy in the second experiment varied from 65% to 70%. The takeaway is that with a small amount of labelled data and a large amount of unlabelled data, including the unlabelled data in the model improves its performance.
  • With 25% of the documents in supervised training and 50% in semi-supervised training, the accuracy on the validation set for the first experiment was around 82% to 83%, whereas the accuracy for the second experiment actually dropped with \lambda=1. With smaller values of \lambda, such as 0.1 or 0.05, the accuracy remained almost the same as in the first experiment.

Accuracy with Semi-Supervised Learning (Green Curve) vs. Supervised Learning (Blue Curve)

In the above plot, the green curve represents the accuracy on the validation data (25%) with semi-supervised learning; in the semi-supervised experiments, the labelled + unlabelled data is kept at 75% and \lambda=1. The blue curve represents the accuracy of supervised training on the same validation data, with the labelled data varied from 1% to 30%. As you can see, for small amounts of labelled data the semi-supervised approach gives better results than the supervised approach, but as the percentage of labelled data increases, the semi-supervised approach performs worse than the supervised one.

The conclusion is that learning from unlabelled data has a prominent effect only when the amount of labelled data is insufficient for the model to learn the variation in the data.

Access the full code on my Github profile.
