# Self Training Methods

## Pseudo Label

class tllib.self_training.pseudo_label.ConfidenceBasedSelfTrainingLoss(threshold)

Self-training loss that uses a confidence threshold to select reliable pseudo labels, from Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks (ICML 2013).

Parameters

threshold (float) – Confidence threshold.

Inputs:
• y: unnormalized classifier predictions.

• y_target: unnormalized classifier predictions which will be used for generating pseudo labels.

Returns

A tuple, including
• self_training_loss: self training loss with pseudo labels.

• mask: binary mask that indicates which samples are retained (whose confidence is above the threshold).

• pseudo_labels: generated pseudo labels.

Shape:
• y, y_target: $$(minibatch, C)$$ where C means the number of classes.

• self_training_loss: scalar.

• mask, pseudo_labels: $$(minibatch, )$$.
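The inputs and outputs described above can be illustrated with a small plain-Python sketch on nested lists. This is not the library implementation: the strict `>` comparison against the threshold and the averaging over the whole mini-batch are assumptions, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_based_self_training(y, y_target, threshold):
    """Return (loss, mask, pseudo_labels) for rows of logits y and y_target."""
    losses, mask, pseudo_labels = [], [], []
    for row, row_target in zip(y, y_target):
        probs_target = softmax(row_target)
        confidence = max(probs_target)
        label = probs_target.index(confidence)
        # Assumption: a sample is retained when its confidence exceeds the threshold.
        keep = 1.0 if confidence > threshold else 0.0
        # Cross entropy of the prediction y against the hard pseudo label.
        ce = -math.log(softmax(row)[label])
        losses.append(ce * keep)
        mask.append(keep)
        pseudo_labels.append(label)
    # Assumption: average over the whole mini-batch; masked samples contribute 0.
    return sum(losses) / len(losses), mask, pseudo_labels

y_target = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0]]
y = [[2.0, 0.5, 0.1], [0.3, 0.2, 0.1]]
loss, mask, pseudo_labels = confidence_based_self_training(y, y_target, threshold=0.9)
```

Here only the first sample is confident enough to be retained, so the second one contributes nothing to the loss.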

## $$\Pi$$ Model

class tllib.self_training.pi_model.ConsistencyLoss(distance_measure, reduction='mean')

Consistency loss between two predictions. Given distance measure $$D$$, predictions $$p_1, p_2$$, binary mask $$mask$$, the consistency loss is

$D(p_1, p_2) * mask$

Parameters
• distance_measure (callable) – Distance measure function.

• reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Default: 'mean'

Inputs:
• p1: the first prediction

• p2: the second prediction

• mask: binary mask. Default: 1 (all samples are used when calculating the loss).

Shape:
• p1, p2: $$(N, C)$$ where C means the number of classes.

• mask: $$(N, )$$ where N means mini-batch size.
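As a rough illustration of how the mask and the `reduction` options interact, here is a plain-Python sketch. It assumes the distance measure returns one value per sample; `squared_l2` is a hypothetical distance chosen only for this example.

```python
def consistency_loss(distance_measure, p1, p2, mask=None, reduction="mean"):
    """Masked consistency loss on plain lists of per-sample predictions."""
    if mask is None:
        mask = [1.0] * len(p1)  # default: use every sample
    per_sample = [distance_measure(a, b) * m for a, b, m in zip(p1, p2, mask)]
    if reduction == "none":
        return per_sample
    if reduction == "sum":
        return sum(per_sample)
    if reduction == "mean":
        return sum(per_sample) / len(per_sample)
    raise ValueError(f"unknown reduction: {reduction}")

def squared_l2(a, b):
    """Hypothetical distance measure: squared Euclidean distance of two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

p1 = [[0.9, 0.1], [0.2, 0.8]]
p2 = [[0.7, 0.3], [0.6, 0.4]]
masked = consistency_loss(squared_l2, p1, p2, mask=[1.0, 0.0], reduction="none")
mean_all = consistency_loss(squared_l2, p1, p2)
```

With `reduction="none"` the masked-out second sample shows up as an explicit zero, which makes it easy to check which samples actually contribute.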

class tllib.self_training.pi_model.L2ConsistencyLoss(reduction='mean')

L2 consistency loss. Given two predictions $$p_1, p_2$$ and binary mask $$mask$$, the L2 consistency loss is

$\text{MSELoss}(p_1, p_2) * mask$

## Mean Teacher

class tllib.self_training.mean_teacher.EMATeacher(model, alpha)

Exponential moving average model from Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results (NIPS 2017)

We use $$\theta_t'$$ to denote the parameters of the teacher model at training step t, and $$\theta_t$$ to denote the parameters of the student model at training step t. Given a decay factor $$\alpha$$, we update the teacher model as an exponential moving average of the student

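$\theta_t'=\alpha \theta_{t-1}' + (1-\alpha)\theta_t$

Parameters
• model (torch.nn.Module) – the student model.

• alpha (float) – decay factor $$\alpha$$ of the exponential moving average.

Inputs:
• x (tensor): input tensor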

Examples:

>>> # backbone, device and the mini-batch x are assumed to be defined elsewhere
>>> classifier = ImageClassifier(backbone, num_classes=31, bottleneck_dim=256).to(device)
>>> # initialize the teacher model
>>> teacher = EMATeacher(classifier, 0.9)
>>> num_iterations = 1000
>>> for _ in range(num_iterations):
...     # x denotes the input of one mini-batch;
...     # teacher(x) returns the teacher model's output
...     y_teacher = teacher(x)
...     # call teacher.update() whenever the teacher should be updated
...     teacher.update()
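The update rule itself is simple arithmetic. The following plain-Python sketch applies it to a flat list of numbers standing in for model parameters; `ema_update` is a hypothetical helper, and the student is held fixed purely for illustration.

```python
def ema_update(teacher_params, student_params, alpha):
    """Return alpha * teacher + (1 - alpha) * student, element by element."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [0.0, 0.0]    # theta'_{t-1}
student = [1.0, -2.0]   # theta_t, held fixed here for illustration
for _ in range(3):
    teacher = ema_update(teacher, student, alpha=0.9)
```

With a fixed student, the teacher closes a fraction $$1-\alpha$$ of the remaining gap at every step, so a larger $$\alpha$$ gives a slower, smoother teacher.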


## Self Ensemble

class tllib.self_training.self_ensemble.ClassBalanceLoss(num_classes)

Class balance loss that penalises the network for making predictions that exhibit large class imbalance. Given predictions $$p$$ with dimension $$(N, C)$$, we first calculate the mini-batch mean per-class probability $$p_{mean}$$ with dimension $$(C, )$$, where

$p_{mean}^j = \frac{1}{N} \sum_{i=1}^N p_i^j$

Then we calculate the binary cross entropy loss between $$p_{mean}$$ and a uniform probability vector $$u$$ of the same dimension, where $$u^j = \frac{1}{C}$$

$loss = \text{BCELoss}(p_{mean}, u)$

Parameters

num_classes (int) – Number of classes

Inputs:
• p (tensor): predictions from classifier

Shape:
• p: $$(N, C)$$ where C means the number of classes.
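The two steps above (batch mean, then binary cross entropy against the uniform vector) can be sketched in plain Python as follows. Averaging the binary cross entropy over the $$C$$ classes is an assumption of this sketch.

```python
import math

def class_balance_loss(p):
    """BCE between the batch-mean class probabilities and the uniform vector."""
    n, c = len(p), len(p[0])
    # p_mean^j = (1 / N) * sum_i p_i^j
    p_mean = [sum(row[j] for row in p) / n for j in range(c)]
    u = 1.0 / c
    # Assumption: the binary cross entropy is averaged over the C classes.
    bce = [-(u * math.log(q) + (1.0 - u) * math.log(1.0 - q)) for q in p_mean]
    return sum(bce) / c

# Predictions spread evenly over both classes across the batch.
balanced = class_balance_loss([[0.9, 0.1], [0.1, 0.9]])
# Predictions that favour class 0 for every sample.
skewed = class_balance_loss([[0.9, 0.1], [0.8, 0.2]])
```

The skewed batch yields the larger loss, which is the penalty for class imbalance that the description refers to.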

## UDA

class tllib.self_training.uda.StrongWeakConsistencyLoss(threshold, temperature)

Consistency loss between strongly and weakly augmented samples from Unsupervised Data Augmentation for Consistency Training (NIPS 2020).

Parameters
• threshold (float) – Confidence threshold.

• temperature (float) – Temperature.

Inputs:
• y_strong: unnormalized classifier predictions on strongly augmented samples.

• y: unnormalized classifier predictions on weakly augmented samples.

Shape:
• y, y_strong: $$(minibatch, C)$$ where C means the number of classes.

• Output: scalar.

## MCC: Minimum Class Confusion

class tllib.self_training.mcc.MinimumClassConfusionLoss(temperature)

Minimum Class Confusion loss minimizes the class confusion in the target predictions.

You can see more details in Minimum Class Confusion for Versatile Domain Adaptation (ECCV 2020)

Parameters

temperature (float) – The temperature for rescaling. The prediction reduces to the vanilla softmax when temperature is 1.0.

Note

Make sure that temperature is larger than 0.

Inputs: g_t
• g_t (tensor): unnormalized classifier predictions on target domain, $$g^t$$

Shape:
• g_t: $$(minibatch, C)$$ where C means the number of classes.

• Output: scalar.

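Examples:

>>> import torch
>>> from tllib.self_training.mcc import MinimumClassConfusionLoss
>>> batch_size, num_classes = 10, 2
>>> temperature = 2.0
>>> loss = MinimumClassConfusionLoss(temperature)
>>> # logits output from target domain
>>> g_t = torch.randn(batch_size, num_classes)
>>> output = loss(g_t)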
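To make the role of the temperature parameter concrete, the sketch below shows only the rescaling step in plain Python, namely a softmax over logits divided by the temperature. It is not the full MCC loss, and `softmax_with_temperature` is a hypothetical helper.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be larger than 0")
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

g_t = [2.0, 1.0, 0.0]
vanilla = softmax_with_temperature(g_t, 1.0)   # plain softmax
softened = softmax_with_temperature(g_t, 2.0)  # flatter distribution
```

A temperature above 1 flattens the distribution, so the largest probability in `softened` is smaller than in `vanilla`.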


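MCC can also serve as a regularizer for existing methods. Examples:

>>> import torch
>>> from tllib.modules.domain_discriminator import DomainDiscriminator
>>> from tllib.alignment.cdan import ConditionalDomainAdversarialLoss
>>> from tllib.self_training.mcc import MinimumClassConfusionLoss
>>> num_classes = 2
>>> feature_dim = 1024
>>> batch_size = 10
>>> temperature = 2.0
>>> # assumption: cdan_loss is a CDAN loss built on the discriminator, whose input
>>> # is the feature-prediction outer product of size feature_dim * num_classes
>>> discriminator = DomainDiscriminator(in_feature=feature_dim * num_classes, hidden_size=1024)
>>> cdan_loss = ConditionalDomainAdversarialLoss(discriminator, reduction='mean')
>>> mcc_loss = MinimumClassConfusionLoss(temperature)
>>> # features from source domain and target domain
>>> f_s, f_t = torch.randn(batch_size, feature_dim), torch.randn(batch_size, feature_dim)
>>> # logits output from source domain and target domain
>>> g_s, g_t = torch.randn(batch_size, num_classes), torch.randn(batch_size, num_classes)
>>> total_loss = cdan_loss(g_s, f_s, g_t, f_t) + mcc_loss(g_t)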


## MMT: Mutual Mean-Teaching

Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification (ICLR 2020)

State-of-the-art unsupervised domain adaptation methods use clustering algorithms to generate pseudo labels on the target domain, but these labels are noisy and thus harmful for training. Inspired by teacher-student approaches, the MMT framework provides robust soft pseudo labels in an on-line peer-teaching manner.

We denote the two networks as $$f_1,f_2$$ and their parameters as $$\theta_1,\theta_2$$. The authors also propose to use the temporally averaged model of each network, $$\text{ensemble}(f_1),\text{ensemble}(f_2)$$, to generate more reliable soft pseudo labels for supervising the other network. Specifically, the parameters of the temporally averaged models of the two networks at the current iteration $$T$$ are denoted as $$E^{(T)}[\theta_1]$$ and $$E^{(T)}[\theta_2]$$ respectively, and are calculated as

$E^{(T)}[\theta_1] = \alpha E^{(T-1)}[\theta_1] + (1-\alpha)\theta_1$
$E^{(T)}[\theta_2] = \alpha E^{(T-1)}[\theta_2] + (1-\alpha)\theta_2$

where $$E^{(T-1)}[\theta_1],E^{(T-1)}[\theta_2]$$ are the temporal average parameters of the two networks at the previous iteration $$(T-1)$$, the initial temporal average parameters are $$E^{(0)}[\theta_1]=\theta_1,E^{(0)}[\theta_2]=\theta_2$$, and $$\alpha$$ is the momentum.

These two networks cooperate with each other in three ways:

• When running the clustering algorithm, we average the features produced by $$\text{ensemble}(f_1)$$ and $$\text{ensemble}(f_2)$$ instead of considering only one of them.

• A soft triplet loss is optimized between $$f_1$$ and $$\text{ensemble}(f_2)$$, and vice versa, to force each network to learn from the temporal average of the other.

• A cross entropy loss is optimized between $$f_1$$ and $$\text{ensemble}(f_2)$$, and vice versa, to force each network to learn from the temporal average of the other.

The loss functions mentioned above are listed below. More details can be found in the training scripts.

class tllib.vision.models.reid.loss.SoftTripletLoss(margin=None, normalize_feature=False)

Soft triplet loss from Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification (ICLR 2020). Consider a triplet $$x,x_p,x_n$$ (anchor, positive, negative) with corresponding features $$f,f_p,f_n$$. We optimize for a smaller distance between $$f$$ and $$f_p$$ and a larger distance between $$f$$ and $$f_n$$. The inner product is adopted as the similarity measure, so the soft triplet loss is defined as

$loss = \mathcal{L}_{\text{bce}}(\frac{\text{exp}(f^Tf_p)}{\text{exp}(f^Tf_p)+\text{exp}(f^Tf_n)}, 1)$

where $$\mathcal{L}_{\text{bce}}$$ means binary cross entropy loss. We denote the first argument of the loss function above as $$T$$. When features from another teacher network are available, we can calculate $$T_{teacher}$$ and use it as the label, resulting in the following soft version

$loss = \mathcal{L}_{\text{bce}}(T, T_{teacher})$

Parameters
• margin (float, optional) – margin of triplet loss. If None, soft labels from another network will be adopted when computing loss. Default: None.

• normalize_feature (bool, optional) – if True, normalize features into unit norm first before computing loss. Default: False.
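The hard and soft variants of the loss defined above can be sketched for a single triplet in plain Python. The helper names are hypothetical, feature normalization and the margin option are left out, and the feature values are made up for illustration.

```python
import math

def triplet_score(f, f_p, f_n):
    """T = exp(f.f_p) / (exp(f.f_p) + exp(f.f_n)) with inner-product similarity."""
    sim_p = sum(a * b for a, b in zip(f, f_p))
    sim_n = sum(a * b for a, b in zip(f, f_n))
    # Algebraically equal to the ratio above: a sigmoid of the similarity gap.
    return 1.0 / (1.0 + math.exp(sim_n - sim_p))

def bce(prediction, target):
    """Binary cross entropy for a single probability and a (soft) target."""
    return -(target * math.log(prediction)
             + (1.0 - target) * math.log(1.0 - prediction))

# Student features for (anchor, positive, negative).
T = triplet_score([1.0, 0.0], [0.8, 0.2], [0.1, 0.9])
# Teacher features for the same triplet give the soft label.
T_teacher = triplet_score([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])

hard_loss = bce(T, 1.0)        # L_bce(T, 1)
soft_loss = bce(T, T_teacher)  # L_bce(T, T_teacher)
```

With the hard target the loss reduces to $$-\log T$$, while the soft version pulls $$T$$ toward the teacher's value rather than toward 1.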

class tllib.vision.models.reid.loss.CrossEntropyLoss

We use $$C$$ to denote the number of classes and $$N$$ to denote the mini-batch size. This criterion expects unnormalized predictions $$y\_{logits}$$ of shape $$(N, C)$$ and $$target\_{logits}$$ of the same shape $$(N, C)$$. We first normalize them into probability distributions over classes

$y = \text{softmax}(y\_{logits})$
$target = \text{softmax}(target\_{logits})$

Final objective is calculated as

$\text{loss} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^C -target_i^j \times \text{log} (y_i^j)$
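The objective above is a cross entropy with soft targets. A plain-Python sketch on nested lists, with hypothetical helper names, could look like this:

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(y_logits, target_logits):
    """Mean over the batch of -sum_j target_i^j * log(y_i^j)."""
    total = 0.0
    for row, row_target in zip(y_logits, target_logits):
        y = softmax(row)
        target = softmax(row_target)
        total += -sum(t * math.log(p) for t, p in zip(target, y))
    return total / len(y_logits)

uniform = soft_cross_entropy([[0.0, 0.0]], [[0.0, 0.0]])
agree = soft_cross_entropy([[2.0, 0.0]], [[2.0, 0.0]])
disagree = soft_cross_entropy([[0.0, 2.0]], [[2.0, 0.0]])
```

When prediction and target agree, the loss equals the entropy of the target distribution; it grows as the two distributions diverge.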

## Self Tuning

class tllib.self_training.self_tuning.Classifier(backbone, num_classes, projection_dim=1024, bottleneck_dim=1024, finetune=True, pool_layer=None)

Classifier class for Self-Tuning.

Parameters
• backbone (torch.nn.Module) – Any backbone to extract 2-d features from data

• num_classes (int) – Number of classes.

• projection_dim (int, optional) – Dimension of the projector head. Default: 1024

• finetune (bool) – Whether to finetune the classifier or train from scratch. Default: True

Inputs:
• x (tensor): input data fed to backbone

Outputs:
In the training mode,
• h: projections

• y: classifier’s predictions

In the eval mode,
• y: classifier’s predictions

Shape:
• Inputs: (minibatch, *) where * means any number of additional dimensions

• y: (minibatch, num_classes)

• h: (minibatch, projection_dim)

class tllib.self_training.self_tuning.SelfTuning(encoder_q, encoder_k, num_classes, K=32, m=0.999, T=0.07)

Self-Tuning module from Self-Tuning for Data-Efficient Deep Learning (ICML 2021).

Parameters
• encoder_q (Classifier) – Query encoder.

• encoder_k (Classifier) – Key encoder.

• num_classes (int) – Number of classes

• K (int) – Queue size. Default: 32

• m (float) – Momentum coefficient. Default: 0.999

• T (float) – Temperature. Default: 0.07

Inputs:
• im_q (tensor): input data fed to encoder_q

• im_k (tensor): input data fed to encoder_k

• labels (tensor): classification labels of input data

Outputs: pgc_logits, pgc_labels, y_q
• pgc_logits: projector’s predictions on both positive and negative samples

• pgc_labels: contrastive labels

• y_q: query classifier’s predictions

Shape:
• im_q, im_k: (minibatch, *) where * means any number of additional dimensions

• labels: (minibatch, )

• y_q: (minibatch, num_classes)

• pgc_logits: (minibatch, 1 + num_classes $$\times$$ K, projection_dim)

• pgc_labels: (minibatch, 1 + num_classes $$\times$$ K)

## FlexMatch

class tllib.self_training.flexmatch.DynamicThresholdingModule(threshold, warmup, mapping_func, num_classes, n_unlabeled_samples, device)

Dynamic thresholding module from FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling. At time $$t$$, for each category $$c$$, the learning status $$\sigma_t(c)$$ is estimated by the number of samples whose predictions fall into this class with confidence above a threshold (e.g. 0.95). Then, FlexMatch normalizes $$\sigma_t(c)$$ so that it ranges between 0 and 1

$\beta_t(c) = \frac{\sigma_t(c)}{\underset{c'}{\text{max}}~\sigma_t(c')}.$

The dynamic threshold is formulated as

$\mathcal{T}_t(c) = \mathcal{M}(\beta_t(c)) \cdot \tau,$

where $$\tau$$ denotes the pre-defined threshold (e.g. 0.95) and $$\mathcal{M}$$ denotes a (possibly non-linear) mapping function.

Parameters
• threshold (float) – The pre-defined confidence threshold

• warmup (bool) – Whether to perform threshold warm-up. If True, the number of unlabeled samples that have not been used yet is taken into account when normalizing $$\sigma_t(c)$$

• mapping_func (callable) – An increasing mapping function. For example, this function can be (1) concave $$\mathcal{M}(x)=\text{ln}(x+1)/\text{ln}2$$, (2) linear $$\mathcal{M}(x)=x$$, and (3) convex $$\mathcal{M}(x)=x/(2-x)$$

• num_classes (int) – Number of classes

• n_unlabeled_samples (int) – Size of the unlabeled dataset

• device (torch.device) – Device
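The threshold formula and the three example mapping functions can be sketched in plain Python as below. The function names are hypothetical, the convex mapping is taken as $$x/(2-x)$$, and returning a zero threshold when no sample has been counted yet is an assumption of this sketch.

```python
import math

def concave(x):
    return math.log(x + 1.0) / math.log(2.0)

def linear(x):
    return x

def convex(x):
    return x / (2.0 - x)

def dynamic_thresholds(sigma, tau, mapping_func):
    """T_t(c) = M(beta_t(c)) * tau with beta_t(c) = sigma_t(c) / max_c' sigma_t(c')."""
    peak = max(sigma)
    # Assumption: if no sample has been counted yet, every beta is taken as 0.
    betas = [s / peak if peak > 0 else 0.0 for s in sigma]
    return [mapping_func(b) * tau for b in betas]

sigma = [50, 100, 25]  # learning status per class
thresholds_concave = dynamic_thresholds(sigma, 0.95, concave)
thresholds_linear = dynamic_thresholds(sigma, 0.95, linear)
thresholds_convex = dynamic_thresholds(sigma, 0.95, convex)
```

All three mappings send 0 to 0 and 1 to 1, so the best-learned class always keeps the full threshold $$\tau$$, while the choice of mapping only changes how quickly the thresholds of harder classes rise.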

get_threshold(pseudo_labels)

Calculate and return dynamic threshold

update(idxes, selected_mask, pseudo_labels)

Update the learning status

Parameters
• idxes (tensor) – Indexes of corresponding samples

• selected_mask (tensor) – A binary mask, a value of 1 indicates the prediction for this sample will be updated

• pseudo_labels (tensor) – Network predictions
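One way to picture the bookkeeping behind `update` is a per-sample record of the last selected pseudo label. The class below is a hypothetical plain-Python sketch, not the module's implementation; in particular, the warm-up rule (normalizing by the larger of the maximum count and the number of unused samples) follows the FlexMatch paper and is assumed here.

```python
class LearningStatus:
    """Hypothetical tracker: one stored class per unlabeled sample, -1 if unused."""

    def __init__(self, num_classes, n_unlabeled_samples):
        self.num_classes = num_classes
        self.status = [-1] * n_unlabeled_samples

    def update(self, idxes, selected_mask, pseudo_labels):
        # Only samples whose mask value is 1 overwrite their stored class.
        for idx, selected, label in zip(idxes, selected_mask, pseudo_labels):
            if selected:
                self.status[idx] = label

    def sigma(self):
        """Number of recorded samples per class."""
        counts = [0] * self.num_classes
        for label in self.status:
            if label >= 0:
                counts[label] += 1
        return counts

    def beta(self, warmup):
        counts = self.sigma()
        denominator = max(counts)
        if warmup:
            # Assumption: unused samples also bound the normalizer during warm-up.
            denominator = max(denominator, self.status.count(-1))
        return [c / denominator if denominator > 0 else 0.0 for c in counts]

tracker = LearningStatus(num_classes=3, n_unlabeled_samples=10)
tracker.update(idxes=[0, 1, 2, 3], selected_mask=[1, 1, 0, 1], pseudo_labels=[0, 0, 2, 1])
```

Early in training most samples are still unused, so warm-up keeps every normalized status small and prevents the thresholds from rising before the counts are meaningful.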

## Debiased Self-Training

class tllib.self_training.dst.ImageClassifier(backbone, num_classes, bottleneck_dim=1024, width=2048, **kwargs)

Classifier with a non-linear pseudo head $$h_{\text{pseudo}}$$ and a worst-case estimation head $$h_{\text{worst}}$$ from Debiased Self-Training for Semi-Supervised Learning. Both heads are directly connected to the feature extractor $$\psi$$. We implement an end-to-end adversarial training procedure between $$\psi$$ and $$h_{\text{worst}}$$ by introducing a gradient reversal layer. Note that both heads can be safely discarded during inference and thus introduce no inference cost.

Parameters
• backbone (torch.nn.Module) – Any backbone to extract 2-d features from data

• num_classes (int) – Number of classes

• bottleneck_dim (int, optional) – Feature dimension of the bottleneck layer.

• width (int, optional) – Hidden dimension of the non-linear pseudo head and worst-case estimation head.

Inputs:
• x (tensor): input data fed to backbone

Outputs:
• outputs: predictions of the main head $$h$$

• outputs_adv: predictions of the worst-case estimation head $$h_{\text{worst}}$$

• outputs_pseudo: predictions of the pseudo head $$h_{\text{pseudo}}$$

Shape:
• Inputs: (minibatch, *) where * means any number of additional dimensions

• outputs, outputs_adv, outputs_pseudo: (minibatch, num_classes)

class tllib.self_training.dst.WorstCaseEstimationLoss(eta_prime)

Worst-case Estimation loss from Debiased Self-Training for Semi-Supervised Learning that forces the worst possible head $$h_{\text{worst}}$$ to predict correctly on all labeled samples $$\mathcal{L}$$ while making as many mistakes as possible on unlabeled data $$\mathcal{U}$$. In the classification task, it is defined as:

$loss(\mathcal{L}, \mathcal{U}) = \eta' \mathbb{E}_{y^l, y_{adv}^l \sim\hat{\mathcal{L}}} -\log\left(\frac{\exp(y_{adv}^l[h_{y^l}])}{\sum_j \exp(y_{adv}^l[j])}\right) + \mathbb{E}_{y^u, y_{adv}^u \sim\hat{\mathcal{U}}} -\log\left(1-\frac{\exp(y_{adv}^u[h_{y^u}])}{\sum_j \exp(y_{adv}^u[j])}\right),$

where $$y^l$$ and $$y^u$$ are logits output by the main head $$h$$ on labeled data and unlabeled data, respectively. $$y_{adv}^l$$ and $$y_{adv}^u$$ are logits output by the worst-case estimation head $$h_{\text{worst}}$$. $$h_y$$ refers to the predicted label when the logits output is $$y$$.

Parameters

eta_prime (float) – the trade-off hyperparameter $$\eta'$$.

Inputs:
• y_l: logits output $$y^l$$ by the main head on labeled data

• y_l_adv: logits output $$y^l_{adv}$$ by the worst-case estimation head on labeled data

• y_u: logits output $$y^u$$ by the main head on unlabeled data

• y_u_adv: logits output $$y^u_{adv}$$ by the worst-case estimation head on unlabeled data

Shape:
• Inputs: $$(minibatch, C)$$ where C denotes the number of classes.

• Output: scalar.
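The two expectation terms in the loss above can be sketched in plain Python on nested lists. This follows the formula as written, using the main head's predicted class on both labeled and unlabeled data; the helper names and the small clamp inside the second logarithm are assumptions of this sketch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

def worst_case_estimation_loss(y_l, y_l_adv, y_u, y_u_adv, eta_prime, eps=1e-6):
    # Labeled term: the worst-case head should agree with the main head's prediction.
    loss_l = sum(-math.log(softmax(adv)[argmax(main)])
                 for main, adv in zip(y_l, y_l_adv)) / len(y_l)
    # Unlabeled term: the worst-case head should disagree with the main head.
    # Assumption: clamp 1 - p away from 0 to keep the logarithm finite.
    loss_u = sum(-math.log(max(1.0 - softmax(adv)[argmax(main)], eps))
                 for main, adv in zip(y_u, y_u_adv)) / len(y_u)
    return eta_prime * loss_l + loss_u

y_l, y_l_adv = [[3.0, 0.0]], [[2.0, 0.0]]
y_u = [[0.0, 3.0]]
agrees = worst_case_estimation_loss(y_l, y_l_adv, y_u, [[0.0, 2.0]], eta_prime=1.0)
disagrees = worst_case_estimation_loss(y_l, y_l_adv, y_u, [[2.0, 0.0]], eta_prime=1.0)
```

The loss is lower when the worst-case head contradicts the main head on unlabeled data, which is the behaviour the description asks of $$h_{\text{worst}}$$.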