Active Learning is a semi-supervised technique that lets you label less data by selecting the samples that matter most from the standpoint of the learning process (the loss). It can have a huge impact on project cost when the amount of data is large and the labeling rate is high, for example in object detection and NLP-NER problems.
The article is based on the following code: Active Learning on MNIST
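Before diving into that code, here is the general shape of the procedure as a generic, framework-agnostic sketch (train_fn, predict_proba_fn and oracle_fn are placeholders of my own, not part of the linked notebook): train on what is labeled so far, score the unlabeled pool, and send the least confident samples to a human.

import numpy as np

def active_learning_loop(train_fn, predict_proba_fn, oracle_fn,
                         x_pool, batch_size=10, budget=400):
    '''Pool-based active learning: repeatedly label the least confident samples.'''
    x_labeled, y_labeled = [], []
    while len(x_labeled) < budget and len(x_pool) > 0:
        if x_labeled:
            probs = predict_proba_fn(x_pool)
            order = np.argsort(probs.max(axis=1))  #least confident first
            x_pool = x_pool[order]
        batch, x_pool = x_pool[:batch_size], x_pool[batch_size:]
        x_labeled.extend(batch)
        y_labeled.extend(oracle_fn(batch))
        train_fn(np.array(x_labeled), np.array(y_labeled))
    return np.array(x_labeled), np.array(y_labeled)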

Data for the experiment

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

#load MNIST and keep 4000 samples for training and 400 for testing
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_full = x_train[:4000] / 255
y_full = y_train[:4000]
x_test = x_test[:400] / 255
y_test = y_test[:400]
x_full.shape, y_full.shape, x_test.shape, y_test.shape

((4000, 28, 28), (4000,), (400, 28, 28), (400,))

plt.imshow(x_full[3999])

[figure: one of the training samples, a handwritten digit]

I will use a subset of the MNIST dataset, which consists of 60K labeled pictures of digits plus 10K test samples. To speed up training, I take 4000 samples (pictures) for training and 400 for testing (the neural network never sees the test set during training). For normalization, I divide the grayscale pixel values by 255.

Model, training and labeling processes

#build the computation graph
x = tf.placeholder(tf.float32, [None, 28, 28])
x_flat = tf.reshape(x, [-1, 28 * 28])
y_ = tf.placeholder(tf.int32, [None])
W = tf.Variable(tf.zeros([28 * 28, 10]), dtype=tf.float32)
b = tf.Variable(tf.zeros([10]), dtype=tf.float32)
y = tf.matmul(x_flat, W) + b
y_sm = tf.nn.softmax(y)
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_, logits=y))
train = tf.train.AdamOptimizer(0.1).minimize(loss)
accuracy = tf.reduce_mean(tf.cast(tf.equal(y_, tf.cast(tf.argmax(y, 1), tf.int32)), tf.float32))
#an interactive session becomes the default session for the .run() calls below
sess = tf.InteractiveSession()

As a framework, I use a TensorFlow computation graph that builds ten output neurons (one for every digit). W and b are the weights and biases of those neurons. The softmax output y_sm gives the probabilities (confidence) for each digit. The loss is the usual softmax cross entropy between the predictions and the labels. The optimizer is the popular Adam with a learning rate of 0.1. Accuracy on the test dataset serves as the main metric.
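For intuition, here is a tiny NumPy-only illustration (not part of the graph above) of the two quantities the rest of the article relies on: the softmax probabilities, whose maximum is later used as the model's confidence, and the cross entropy, which is minus the log-probability of the true class.

import numpy as np

logits = np.array([2.0, 0.5, -1.0])            #raw scores for three classes
probs = np.exp(logits) / np.exp(logits).sum()  #softmax: approx. [0.786, 0.175, 0.039]
confidence = probs.max()                       #maximum softmax output, approx. 0.79
cross_entropy = -np.log(probs[0])              #loss if class 0 is the true label, approx. 0.24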

def reset():
    '''Initialize data sets and session'''
    global x_labeled, y_labeled, x_unlabeled, y_unlabeled
    x_labeled = x_full[:0]
    y_labeled = y_full[:0]
    x_unlabeled = x_full
    y_unlabeled = y_full
    tf.global_variables_initializer().run()
    tf.local_variables_initializer().run() 

def fit():
    '''Train current labeled dataset until overfit.'''
    trial_count = 10
    acc = sess.run(accuracy, feed_dict={x:x_test, y_:y_test})
    weights = sess.run([W, b])
    while trial_count > 0:
        sess.run(train, feed_dict={x:x_labeled, y_:y_labeled})
        acc_new = sess.run(accuracy, feed_dict={x:x_test, y_:y_test})
        if acc_new <= acc:
            trial_count -= 1
        else:
            trial_count = 10
            weights = sess.run([W, b])
            acc = acc_new

    sess.run([W.assign(weights[0]), b.assign(weights[1])])    
    acc = sess.run(accuracy, feed_dict={x:x_test, y_:y_test})
    print('Labels:', x_labeled.shape[0], '\tAccuracy:', acc)

def label_manually(n):
    '''Human powered labeling (actually copying from the prelabeled MNIST dataset).'''
    global x_labeled, y_labeled, x_unlabeled, y_unlabeled
    x_labeled = np.concatenate([x_labeled, x_unlabeled[:n]])
    y_labeled = np.concatenate([y_labeled, y_unlabeled[:n]])
    x_unlabeled = x_unlabeled[n:]
    y_unlabeled = y_unlabeled[n:]

Here I define three procedures to make the rest of the code more convenient.
reset() – empties the labeled dataset, puts all data back into the unlabeled dataset and reinitializes the session variables.
fit() – runs training, trying to reach the best accuracy. If it cannot improve during ten consecutive attempts, the training stops and the last best weights are restored. We cannot simply train for a large, fixed number of epochs, because the model quickly overfits; the alternative would be intensive L2 regularization (see the sketch after this list).
label_manually() – an emulation of human labeling. In practice, we simply copy the labels from the already-labeled MNIST dataset.
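The L2 regularization mentioned above is not used in this article; a minimal sketch of how it could be added to the existing graph (the coefficient 1e-3 is an arbitrary illustration, not a tuned value):

#hypothetical alternative to early stopping: penalize large weights with an L2 term
l2_coef = 1e-3
loss_l2 = loss + l2_coef * tf.nn.l2_loss(W)
train_l2 = tf.train.AdamOptimizer(0.1).minimize(loss_l2)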

Ground Truth

#train on the full dataset of 4000
reset()
label_manually(4000)
fit()

Labels:   4000              Accuracy:  0.9225

If we are lucky enough to have the resources to label the whole dataset, we get 92.25% accuracy.

Clustering

#apply clustering
kmeans = tf.contrib.factorization.KMeansClustering(10, use_mini_batch=False)
kmeans.train(lambda: tf.train.limit_epochs(x_full.reshape(4000, 784).astype(np.float32), 10))

centers = kmeans.cluster_centers().reshape([10, 28, 28])
plt.imshow(np.concatenate([centers[i] for i in range(10)], axis=1))

[figure: the ten k-means cluster centroids]

Here I try k-means clustering to find groups of digits and use this information for automatic labeling. I run the TensorFlow clustering estimator and then visualize the resulting ten centroids. As you can see, the result is far from perfect – the digit “9” appears three times, sometimes mixed with “8” and “3”.
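In principle, the clusters could still be used for cheap labeling: a human labels only the ten centroids and each cluster's label is propagated to all of its members. A rough sketch of that idea (my own illustration, assuming KMeansClustering's predict_cluster_index method; y_centroid is a hypothetical array holding the ten human-provided centroid labels):

#assign every sample to its nearest centroid
cluster_idx = np.array(list(kmeans.predict_cluster_index(
    lambda: tf.train.limit_epochs(x_full.reshape(4000, 784).astype(np.float32), 1))))
#y_centroid would hold one human-provided label per centroid (placeholder values here)
y_centroid = np.zeros(10, dtype=np.int32)
#propagate each centroid's label to every member of its cluster
y_auto = y_centroid[cluster_idx]

Given the mixed centroids above, such labels would be quite noisy, so this path is not pursued further.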

Random Labeling

#try to run on random 400
reset()
label_manually(400)
fit()

Labels:  400     Accuracy:  0.8375

Let’s try labeling only 10% of the data (400 samples). We get 83.75% accuracy, which is quite far from the 92.25% of the ground truth.

Active Learning

#now try to run on 10
reset()
label_manually(10)
fit()

Labels: 10          Accuracy:  0.38

#pass the remaining 3990 unlabeled samples through the early model
res = sess.run(y_sm, feed_dict={x:x_unlabeled})
#find the least confident samples (lowest maximum softmax output)
pmax = np.amax(res, axis=1)
pidx = np.argsort(pmax)
#sort the unlabeled corpus by confidence
x_unlabeled = x_unlabeled[pidx]
y_unlabeled = y_unlabeled[pidx]
plt.plot(pmax[pidx])

[plot: sorted maximum softmax confidence over the 3990 unlabeled samples]

Now we will label the same 10% of the data (400 samples) using active learning. To do that, we take a first batch of 10 samples and train a very primitive model. Then, we pass the rest of the data (3990 samples) through this model and evaluate the maximum softmax output. This shows the probability that the selected class is the correct answer (in other words, the confidence of the neural network). After sorting, the plot shows that the confidence varies from about 20% to 100% across the samples. The idea is to select the next batch for labeling from the LEAST CONFIDENT samples.

#do the same in a loop until 400 samples are labeled
for i in range(39):
    label_manually(10)
    fit()
    
    res = sess.run(y_sm, feed_dict={x:x_unlabeled})
    pmax = np.amax(res, axis=1)
    pidx = np.argsort(pmax)
    x_unlabeled = x_unlabeled[pidx]
    y_unlabeled = y_unlabeled[pidx]

After running this procedure for 40 batches of 10 samples, the resulting accuracy is almost 90%. This is far more than the 83.75% achieved with randomly labeled data.
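As a side note, the maximum softmax output is only one possible uncertainty measure. A hedged sketch of an entropy-based alternative (my own addition, not used in the experiment):

def rank_by_entropy(probs):
    '''Return sample indices ordered from most to least uncertain (highest predictive entropy first).'''
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[::-1]

#usage sketch; it does not change the experiment above
#pidx = rank_by_entropy(sess.run(y_sm, feed_dict={x:x_unlabeled}))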

What to do with the rest of the unlabeled data

#pass the rest of the unlabeled data through the model and try to autolabel it
res = sess.run(y_sm, feed_dict={x:x_unlabeled})
y_autolabeled = res.argmax(axis=1)
x_labeled = np.concatenate([x_labeled, x_unlabeled])
y_labeled = np.concatenate([y_labeled, y_autolabeled])
#train on 400 samples labeled by active learning plus 3600 automatically labeled samples
fit()

Labels:  4000    Accuracy: 0.8975

The classical way would be to run the rest of the dataset through the existing model and label it automatically. Then, pushing it into the training process might help to tune the model further. In our case, though, it did not give any better result.
My approach is to do the same but, as in active learning, to take the confidence into consideration (the labeled set is rolled back to the 400 actively labeled samples before this step, which is why the label counts below start from 410):

#pass the rest of the unlabeled data (3600 samples) through the model for automatic labeling and show the most confident samples
res = sess.run(y_sm, feed_dict={x:x_unlabeled})
y_autolabeled = res.argmax(axis=1)
pmax = np.amax(res, axis=1)
pidx = np.argsort(pmax)
#sort by confidence
x_unlabeled = x_unlabeled[pidx]
y_autolabeled = y_autolabeled[pidx]
plt.plot(pmax[pidx])

[plot: sorted confidence over the remaining 3600 unlabeled samples]

#automatically label the 10 most confident samples and train on them
x_labeled = np.concatenate([x_labeled, x_unlabeled[-10:]])
y_labeled = np.concatenate([y_labeled, y_autolabeled[-10:]])
x_unlabeled = x_unlabeled[:-10]
fit()

Labels:   410        Accuracy:    0.8975

Here we run the rest of the unlabeled data through the model and see that the confidence still varies across the remaining samples. Thus, the idea is to take a batch of the ten MOST CONFIDENT samples, auto-label them, and retrain the model.

#process the rest of the unlabeled samples, starting from the most confident ones
for i in range(359):
    res = sess.run(y_sm, feed_dict={x:x_unlabeled})
    y_autolabeled = res.argmax(axis=1)
    pmax = np.amax(res, axis=1)
    pidx = np.argsort(pmax)
    x_unlabeled = x_unlabeled[pidx]
    y_autolabeled = y_autolabeled[pidx]
    x_labeled = np.concatenate([x_labeled, x_unlabeled[-10:]])
    y_labeled = np.concatenate([y_labeled, y_autolabeled[-10:]])
    x_unlabeled = x_unlabeled[:-10]
    fit()

This process takes some time and gives us an extra 0.75% of accuracy.

Results

Experiment                     Accuracy
4000 samples                   92.25%
400 random samples             83.75%
400 active-learned samples     89.75%
    + auto-labeling            90.50%

Conclusion

Of course, this approach has its drawbacks, such as the heavy use of computation resources and the fact that data labeling has to be mixed with early model evaluation in a special procedure. Also, the test data needs to be labeled as well. However, if the cost of a label is high (especially for NLP and CV projects), this method can save a significant amount of resources and drive better project results.

This very approach has already helped us save a few thousand dollars for one of our customers while working on a project related to Document Recognition and Classification. More details about the case can be obtained via the link.

Author:
Andy Bosyi, CEO/Lead Data Scientist MindCraft
Information Technology & Data Science