Self Training Methods
Pseudo Label

class tllib.self_training.pseudo_label.ConfidenceBasedSelfTrainingLoss(threshold)
Self-training loss that uses a confidence threshold to select reliable pseudo labels, from Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks (ICML 2013).
 Parameters
threshold (float) – Confidence threshold.
 Inputs:
y: unnormalized classifier predictions.
y_target: unnormalized classifier predictions which will be used for generating pseudo labels.
 Returns
 A tuple, including
self_training_loss: self training loss with pseudo labels.
mask: binary mask that indicates which samples are retained (whose confidence is above the threshold).
pseudo_labels: generated pseudo labels.
 Shape:
y, y_target: \((minibatch, C)\) where C means the number of classes.
self_training_loss: scalar.
mask, pseudo_labels \((minibatch, )\).
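The selection rule above can be sketched in plain Python so that the arithmetic is visible. This is an illustrative sketch, not the tllib implementation: the function name and toy logits are hypothetical, and it assumes the loss is the cross entropy against the hard pseudo label, masked and then averaged over the whole minibatch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_based_self_training_loss(y, y_target, threshold):
    """Return (loss, mask, pseudo_labels) for lists of per-sample logits."""
    mask, pseudo_labels, losses = [], [], []
    for logits, target_logits in zip(y, y_target):
        probs_target = softmax(target_logits)
        confidence = max(probs_target)
        label = probs_target.index(confidence)
        # Keep the sample only if its confidence is above the threshold.
        keep = 1.0 if confidence > threshold else 0.0
        # Cross entropy of the prediction y against the hard pseudo label.
        ce = -math.log(softmax(logits)[label])
        mask.append(keep)
        pseudo_labels.append(label)
        losses.append(ce * keep)
    # Assumption: rejected samples contribute 0 and still count in the mean.
    return sum(losses) / len(losses), mask, pseudo_labels


y = [[2.0, 0.0, 0.0], [0.1, 0.2, 0.3]]
y_target = [[5.0, 0.0, 0.0], [0.1, 0.2, 0.3]]
loss, mask, pseudo_labels = confidence_based_self_training_loss(y, y_target, threshold=0.9)
```

Only the first sample is confident enough to be retained here, so the second sample contributes nothing to the loss even though a pseudo label is still generated for it.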
\(\Pi\) Model

class tllib.self_training.pi_model.ConsistencyLoss(distance_measure, reduction='mean')
Consistency loss between two predictions. Given a distance measure \(D\), predictions \(p_1, p_2\), and a binary mask \(mask\), the consistency loss is
\[D(p_1, p_2) * mask\]
 Parameters
distance_measure (callable) – Distance measure function.
reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Default: 'mean'
 Inputs:
p1: the first prediction
p2: the second prediction
mask: binary mask. Default: 1. (use all samples when calculating loss)
 Shape:
p1, p2: \((N, C)\) where C means the number of classes.
mask: \((N, )\) where N means minibatch size.
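A minimal plain-Python sketch of the masked consistency loss and its three reduction modes. The squared L2 distance, the function names, and the toy probabilities are assumptions chosen for illustration; the sketch also assumes that the distance measure returns one value per sample.

```python
def squared_l2(p1, p2):
    """Per-sample squared Euclidean distance (a hypothetical distance measure)."""
    return [sum((a - b) ** 2 for a, b in zip(r1, r2)) for r1, r2 in zip(p1, p2)]


def consistency_loss(distance_measure, p1, p2, mask=None, reduction="mean"):
    """Compute D(p1, p2) * mask and apply the requested reduction."""
    distances = distance_measure(p1, p2)
    if mask is None:
        mask = [1.0] * len(distances)  # use all samples by default
    masked = [d * m for d, m in zip(distances, mask)]
    if reduction == "none":
        return masked
    if reduction == "sum":
        return sum(masked)
    if reduction == "mean":
        return sum(masked) / len(masked)
    raise ValueError("reduction must be 'none', 'mean' or 'sum'")


p1 = [[0.9, 0.1], [0.5, 0.5]]
p2 = [[0.7, 0.3], [0.1, 0.9]]
mask = [1.0, 0.0]  # ignore the second sample

per_sample = consistency_loss(squared_l2, p1, p2, mask, reduction="none")
total = consistency_loss(squared_l2, p1, p2, mask, reduction="sum")
mean = consistency_loss(squared_l2, p1, p2, mask, reduction="mean")
```

With this mask only the first sample contributes, so 'sum' is about 0.08 and 'mean' is about 0.04, because the mean still divides by the full number of elements.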
Mean Teacher

class tllib.self_training.mean_teacher.EMATeacher(model, alpha)
Exponential moving average model from Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results (NIPS 2017).
We use \(\theta_t'\) to denote the parameters of the teacher model at training step \(t\) and \(\theta_t\) to denote the parameters of the student model at training step \(t\). Given a decay factor \(\alpha\), we update the teacher model as an exponential moving average:
\[\theta_t'=\alpha \theta_{t-1}' + (1-\alpha)\theta_t\]
 Parameters
model (torch.nn.Module) – the student model
alpha (float) – decay factor for EMA.
 Inputs:
x (tensor): input tensor
Examples:
>>> classifier = ImageClassifier(backbone, num_classes=31, bottleneck_dim=256).to(device)
>>> # initialize teacher model
>>> teacher = EMATeacher(classifier, 0.9)
>>> num_iterations = 1000
>>> for _ in range(num_iterations):
>>>     # x denotes input of one minibatch
>>>     # you can get teacher model's output by teacher(x)
>>>     y_teacher = teacher(x)
>>>     # when you want to update teacher, you should call teacher.update()
>>>     teacher.update()
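To make the exponential moving average rule concrete, here is a small plain-Python sketch that applies the update to a flat list of parameters. The helper name `ema_update` and the toy values are hypothetical; it only illustrates the arithmetic and says nothing about how `EMATeacher.update()` is implemented internally.

```python
def ema_update(teacher_params, student_params, alpha):
    """One EMA step: theta_teacher <- alpha * theta_teacher + (1 - alpha) * theta_student."""
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_params, student_params)]


teacher = [1.0, 0.0]
student = [0.0, 1.0]

# A single step moves the teacher 10% of the way towards the student.
one_step = ema_update(teacher, student, alpha=0.9)

# With a fixed student, repeated steps make the teacher converge to it.
for _ in range(200):
    teacher = ema_update(teacher, student, alpha=0.9)
```

A larger decay factor makes the teacher change more slowly, which smooths out noise in the student's recent updates.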
Self Ensemble

class tllib.self_training.self_ensemble.ClassBalanceLoss(num_classes)
Class balance loss that penalises the network for making predictions that exhibit large class imbalance. Given predictions \(p\) with dimension \((N, C)\), we first calculate the minibatch mean per-class probability \(p_{mean}\) with dimension \((C, )\), where
\[p_{mean}^j = \frac{1}{N} \sum_{i=1}^N p_i^j\]
Then we calculate the binary cross entropy loss between \(p_{mean}\) and the uniform probability vector \(u\) of the same dimension, where \(u^j = \frac{1}{C}\):
\[loss = \text{BCELoss}(p_{mean}, u)\]
 Parameters
num_classes (int) – Number of classes
 Inputs:
p (tensor): predictions from classifier
 Shape:
p: \((N, C)\) where C means the number of classes.
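The two steps above can be sketched in plain Python. The function name and toy predictions are hypothetical, and the sketch assumes that the binary cross entropy is averaged over the classes, which is the default reduction of `torch.nn.BCELoss`.

```python
import math


def class_balance_loss(p):
    """BCE between the minibatch mean per-class probability and the uniform vector."""
    n, c = len(p), len(p[0])
    # Mean probability of each class over the minibatch.
    p_mean = [sum(row[j] for row in p) / n for j in range(c)]
    u = 1.0 / c  # every entry of the uniform probability vector
    terms = [-(u * math.log(pj) + (1.0 - u) * math.log(1.0 - pj)) for pj in p_mean]
    # Assumption: average over classes.
    return sum(terms) / c


balanced = class_balance_loss([[0.9, 0.1], [0.1, 0.9]])    # p_mean = [0.5, 0.5]
imbalanced = class_balance_loss([[0.9, 0.1], [0.8, 0.2]])  # p_mean = [0.85, 0.15]
```

The balanced minibatch attains the smaller value, since its mean prediction coincides with the uniform vector.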
UDA

class tllib.self_training.uda.StrongWeakConsistencyLoss(threshold, temperature)
Consistency loss between strongly and weakly augmented samples, from Unsupervised Data Augmentation for Consistency Training (NIPS 2020).
 Inputs:
y_strong: unnormalized classifier predictions on strongly augmented samples.
y: unnormalized classifier predictions on weakly augmented samples.
 Shape:
y, y_strong: \((minibatch, C)\) where C means the number of classes.
Output: scalar.
MCC: Minimum Class Confusion

class tllib.self_training.mcc.MinimumClassConfusionLoss(temperature)
Minimum Class Confusion loss minimizes the class confusion in the target predictions.
You can see more details in Minimum Class Confusion for Versatile Domain Adaptation (ECCV 2020).
 Parameters
temperature (float) – The temperature for rescaling; the prediction reduces to vanilla softmax if temperature is 1.0.
Note
Make sure that temperature is larger than 0.
 Inputs: g_t
g_t (tensor): unnormalized classifier predictions on target domain, \(g^t\)
 Shape:
g_t: \((minibatch, C)\) where C means the number of classes.
Output: scalar.
 Examples:
>>> temperature = 2.0
>>> loss = MinimumClassConfusionLoss(temperature)
>>> # logits output from target domain
>>> g_t = torch.randn(batch_size, num_classes)
>>> output = loss(g_t)
MCC can also serve as a regularizer for existing methods. Examples:
>>> import torch
>>> from tllib.alignment.cdan import ConditionalDomainAdversarialLoss
>>> from tllib.modules.domain_discriminator import DomainDiscriminator
>>> from tllib.self_training.mcc import MinimumClassConfusionLoss
>>> num_classes = 2
>>> feature_dim = 1024
>>> batch_size = 10
>>> temperature = 2.0
>>> # CDAN conditions features on predictions, so the discriminator input is feature_dim * num_classes
>>> discriminator = DomainDiscriminator(in_feature=feature_dim * num_classes, hidden_size=1024)
>>> cdan_loss = ConditionalDomainAdversarialLoss(discriminator, reduction='mean')
>>> mcc_loss = MinimumClassConfusionLoss(temperature)
>>> # features from source domain and target domain
>>> f_s, f_t = torch.randn(batch_size, feature_dim), torch.randn(batch_size, feature_dim)
>>> # logits output from source domain and target domain
>>> g_s, g_t = torch.randn(batch_size, num_classes), torch.randn(batch_size, num_classes)
>>> total_loss = cdan_loss(g_s, f_s, g_t, f_t) + mcc_loss(g_t)
MMT: Mutual Mean-Teaching
State-of-the-art unsupervised domain adaptation methods use clustering algorithms to generate pseudo labels on the target domain, which are noisy and thus harmful for training. Inspired by teacher-student approaches, the MMT framework provides robust soft pseudo labels in an online peer-teaching manner.
We denote the two networks as \(f_1,f_2\) and their parameters as \(\theta_1,\theta_2\). The authors also propose to use the temporal average model of each network, \(\text{ensemble}(f_1),\text{ensemble}(f_2)\), to generate more reliable soft pseudo labels for supervising the other network. Specifically, the parameters of the temporal average models of the two networks at the current iteration \(T\) are denoted as \(E^{(T)}[\theta_1]\) and \(E^{(T)}[\theta_2]\) respectively, which can be calculated as
\[E^{(T)}[\theta_1] = \alpha E^{(T-1)}[\theta_1] + (1-\alpha)\theta_1\]
\[E^{(T)}[\theta_2] = \alpha E^{(T-1)}[\theta_2] + (1-\alpha)\theta_2\]
where \(E^{(T-1)}[\theta_1],E^{(T-1)}[\theta_2]\) indicate the temporal average parameters of the two networks in the previous iteration \((T-1)\), the initial temporal average parameters are \(E^{(0)}[\theta_1]=\theta_1,E^{(0)}[\theta_2]=\theta_2\), and \(\alpha\) is the momentum.
These two networks cooperate with each other in three ways:
 When running the clustering algorithm, we average the features produced by \(\text{ensemble}(f_1)\) and
\(\text{ensemble}(f_2)\) instead of considering only one of them.
 A soft triplet loss is optimized between \(f_1\) and \(\text{ensemble}(f_2)\), and vice versa,
to force each network to learn from the temporal average of the other network.
 A cross entropy loss is optimized between \(f_1\) and \(\text{ensemble}(f_2)\), and vice versa,
to force each network to learn from the temporal average of the other network.
The loss functions mentioned above are listed below; more details can be found in the training scripts.

class tllib.vision.models.reid.loss.SoftTripletLoss(margin=None, normalize_feature=False)
Soft triplet loss from Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification (ICLR 2020). Consider a triplet \(x,x_p,x_n\) (anchor, positive, negative) with corresponding features \(f,f_p,f_n\). We optimize for a smaller distance between \(f\) and \(f_p\) and a larger distance between \(f\) and \(f_n\). The inner product is adopted as the similarity measure, so the soft triplet loss is defined as
\[loss = \mathcal{L}_{\text{bce}}(\frac{\text{exp}(f^Tf_p)}{\text{exp}(f^Tf_p)+\text{exp}(f^Tf_n)}, 1)\]
where \(\mathcal{L}_{\text{bce}}\) means binary cross entropy loss. We denote the first argument of the loss function above as \(T\). When features from a teacher network are available, we can calculate \(T_{teacher}\) and use it as the label, resulting in the following soft version
\[loss = \mathcal{L}_{\text{bce}}(T, T_{teacher})\]
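The hard and soft variants can be compared with a short plain-Python sketch that follows the formulas literally, using inner products as similarities. All names and feature vectors are hypothetical, and the sketch deliberately ignores the `margin` and `normalize_feature` options, so it should be read as an illustration of the formulas rather than of the class itself.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triplet_probability(f, f_p, f_n):
    """T = exp(f.f_p) / (exp(f.f_p) + exp(f.f_n)), computed in a stable way."""
    s_p, s_n = dot(f, f_p), dot(f, f_n)
    m = max(s_p, s_n)
    e_p, e_n = math.exp(s_p - m), math.exp(s_n - m)
    return e_p / (e_p + e_n)


def bce(pred, target, eps=1e-12):
    """Binary cross entropy for a single probability and a (soft) label."""
    return -(target * math.log(pred + eps)
             + (1.0 - target) * math.log(1.0 - pred + eps))


# Student features for anchor, positive and negative.
t_student = triplet_probability([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
hard_loss = bce(t_student, 1.0)  # label fixed to 1

# Teacher features give a soft label instead of the hard label 1.
t_teacher = triplet_probability([2.0, 0.0], [1.0, 0.0], [0.0, 1.0])
soft_loss = bce(t_student, t_teacher)
```

With the hard label the loss reduces to \(-\log T\); with the soft label the student is pulled towards the teacher's value of \(T\) rather than towards 1.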

class tllib.vision.models.reid.loss.CrossEntropyLoss
We use \(C\) to denote the number of classes and \(N\) to denote the minibatch size. This criterion expects unnormalized predictions \(y_{logits}\) of shape \((N, C)\) and \(target_{logits}\) of the same shape \((N, C)\). We first normalize them into probability distributions over classes
\[y = \text{softmax}(y_{logits})\]
\[target = \text{softmax}(target_{logits})\]
The final objective is calculated as
\[\text{loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^C target_i^j \times \text{log} (y_i^j)\]
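A plain-Python sketch of this soft-target cross entropy, written with the conventional leading minus sign so that the result is non-negative. The function names and toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def soft_cross_entropy(y_logits, target_logits):
    """Cross entropy between softmax(target_logits) and softmax(y_logits), averaged over N."""
    total = 0.0
    for y_row, t_row in zip(y_logits, target_logits):
        log_y = log_softmax(y_row)
        target = [math.exp(v) for v in log_softmax(t_row)]
        total -= sum(t * ly for t, ly in zip(target, log_y))
    return total / len(y_logits)


loss_uniform = soft_cross_entropy([[0.0, 0.0]], [[0.0, 0.0]])
loss_match = soft_cross_entropy([[2.0, 0.0]], [[2.0, 0.0]])
loss_mismatch = soft_cross_entropy([[0.0, 2.0]], [[2.0, 0.0]])
```

When prediction and target agree, the loss equals the entropy of the target distribution (\(\log 2\) in the uniform case); it grows as the prediction moves away from the target.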
Self Tuning

class tllib.self_training.self_tuning.Classifier(backbone, num_classes, projection_dim=1024, bottleneck_dim=1024, finetune=True, pool_layer=None)
Classifier class for Self-Tuning.
 Parameters
backbone (torch.nn.Module) – Any backbone to extract 2d features from data
num_classes (int) – Number of classes.
projection_dim (int, optional) – Dimension of the projector head. Default: 1024
finetune (bool) – Whether to finetune the classifier or train from scratch. Default: True
 Inputs:
x (tensor): input data fed to backbone
 Outputs:
 In the training mode,
h: projections
y: classifier’s predictions
 In the eval mode,
y: classifier’s predictions
 Shape:
Inputs: (minibatch, *) where * means, any number of additional dimensions
y: (minibatch, num_classes)
h: (minibatch, projection_dim)

class tllib.self_training.self_tuning.SelfTuning(encoder_q, encoder_k, num_classes, K=32, m=0.999, T=0.07)
Self-Tuning module from Self-Tuning for Data-Efficient Deep Learning (Self-Tuning, ICML 2021).
 Parameters
encoder_q (Classifier) – Query encoder.
encoder_k (Classifier) – Key encoder.
num_classes (int) – Number of classes
K (int) – Queue size. Default: 32
m (float) – Momentum coefficient. Default: 0.999
T (float) – Temperature. Default: 0.07
 Inputs:
im_q (tensor): input data fed to encoder_q
im_k (tensor): input data fed to encoder_k
labels (tensor): classification labels of input data
 Outputs: pgc_logits, pgc_labels, y_q
pgc_logits: projector’s predictions on both positive and negative samples
pgc_labels: contrastive labels
y_q: query classifier’s predictions
 Shape:
im_q, im_k: (minibatch, *) where * means, any number of additional dimensions
labels: (minibatch, )
y_q: (minibatch, num_classes)
pgc_logits: (minibatch, 1 + num_classes \(\times\) K, projection_dim)
pgc_labels: (minibatch, 1 + num_classes \(\times\) K)
FlexMatch

class tllib.self_training.flexmatch.DynamicThresholdingModule(threshold, warmup, mapping_func, num_classes, n_unlabeled_samples, device)
Dynamic thresholding module from FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling. At time \(t\), for each category \(c\), the learning status \(\sigma_t(c)\) is estimated by the number of samples whose predictions fall into this class with confidence above a threshold (e.g. 0.95). Then, FlexMatch normalizes \(\sigma_t(c)\) to the range between 0 and 1
\[\beta_t(c) = \frac{\sigma_t(c)}{\underset{c'}{\text{max}}~\sigma_t(c')}.\]
The dynamic threshold is formulated as
\[\mathcal{T}_t(c) = \mathcal{M}(\beta_t(c)) \cdot \tau,\]
where \(\tau\) denotes the predefined threshold (e.g. 0.95) and \(\mathcal{M}\) denotes a (possibly non-linear) mapping function.
 Parameters
 Parameters
threshold (float) – The predefined confidence threshold
warmup (bool) – Whether to perform threshold warmup. If True, the number of unlabeled samples that have not been used yet is taken into account when normalizing \(\sigma_t(c)\)
mapping_func (callable) – An increasing mapping function. For example, this function can be (1) concave \(\mathcal{M}(x)=\text{ln}(x+1)/\text{ln}2\), (2) linear \(\mathcal{M}(x)=x\), and (3) convex \(\mathcal{M}(x)=x/(2-x)\)
num_classes (int) – Number of classes
n_unlabeled_samples (int) – Size of the unlabeled dataset
device (torch.device) – Device
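The threshold computation can be illustrated with a small plain-Python sketch. The function names and counts are hypothetical. The convex mapping used here is \(x/(2-x)\), as in the FlexMatch paper, and the optional `n_unused` argument models warmup under the assumption that the number of unused unlabeled samples enters the normalizing denominator; how the module tracks \(\sigma_t(c)\) over time is not shown.

```python
import math


def concave(x):
    return math.log(x + 1.0) / math.log(2.0)


def linear(x):
    return x


def convex(x):
    return x / (2.0 - x)


def dynamic_thresholds(sigma, tau, mapping_func, n_unused=0):
    """Per-class thresholds M(beta_t(c)) * tau from the learning status sigma."""
    # Assumption: with warmup, unused samples compete with the largest class count.
    denominator = max(max(sigma), n_unused)
    if denominator == 0:
        betas = [0.0] * len(sigma)  # assumption: nothing learned yet
    else:
        betas = [s / denominator for s in sigma]
    return [mapping_func(b) * tau for b in betas]


sigma = [100, 50, 0]  # confident predictions counted per class
thresholds_linear = dynamic_thresholds(sigma, 0.95, linear)
thresholds_convex = dynamic_thresholds(sigma, 0.95, convex)
thresholds_warmup = dynamic_thresholds(sigma, 0.95, linear, n_unused=200)
```

Classes with few confident predictions receive a lower threshold, so more of their samples pass the confidence filter; the convex mapping lowers the threshold of poorly learned classes more aggressively than the linear one.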
Debiased Self-Training

class tllib.self_training.dst.ImageClassifier(backbone, num_classes, bottleneck_dim=1024, width=2048, **kwargs)
Classifier with a non-linear pseudo head \(h_{\text{pseudo}}\) and a worst-case estimation head \(h_{\text{worst}}\), from Debiased Self-Training for Semi-Supervised Learning. Both heads are directly connected to the feature extractor \(\psi\). We implement an end-to-end adversarial training procedure between \(\psi\) and \(h_{\text{worst}}\) by introducing a gradient reverse layer. Note that both heads can be safely discarded during inference and thus introduce no inference cost.
 Parameters
backbone (torch.nn.Module) – Any backbone to extract 2d features from data
num_classes (int) – Number of classes
bottleneck_dim (int, optional) – Feature dimension of the bottleneck layer.
width (int, optional) – Hidden dimension of the non-linear pseudo head and worst-case estimation head.
 Inputs:
x (tensor): input data fed to backbone
 Outputs:
outputs: predictions of the main head \(h\)
outputs_adv: predictions of the worst-case estimation head \(h_{\text{worst}}\)
outputs_pseudo: predictions of the pseudo head \(h_{\text{pseudo}}\)
 Shape:
Inputs: (minibatch, *) where * means, any number of additional dimensions
outputs, outputs_adv, outputs_pseudo: (minibatch, num_classes)

class tllib.self_training.dst.WorstCaseEstimationLoss(eta_prime)
Worst-case estimation loss from Debiased Self-Training for Semi-Supervised Learning, which forces the worst possible head \(h_{\text{worst}}\) to predict correctly on all labeled samples \(\mathcal{L}\) while making as many mistakes as possible on unlabeled data \(\mathcal{U}\). In the classification task, it is defined as:
\[loss(\mathcal{L}, \mathcal{U}) = \eta' \mathbb{E}_{y^l, y_{adv}^l \sim\hat{\mathcal{L}}} -\log\left(\frac{\exp(y_{adv}^l[h_{y^l}])}{\sum_j \exp(y_{adv}^l[j])}\right) + \mathbb{E}_{y^u, y_{adv}^u \sim\hat{\mathcal{U}}} -\log\left(1-\frac{\exp(y_{adv}^u[h_{y^u}])}{\sum_j \exp(y_{adv}^u[j])}\right),\]
where \(y^l\) and \(y^u\) are the logits output by the main head \(h\) on labeled data and unlabeled data, respectively, and \(y_{adv}^l\) and \(y_{adv}^u\) are the logits output by the worst-case estimation head \(h_{\text{worst}}\). \(h_y\) refers to the predicted label when the logits output is \(y\).
 Parameters
eta_prime (float) – the trade-off hyperparameter \(\eta'\).
 Inputs:
y_l: logits output \(y^l\) by the main head on labeled data
y_l_adv: logits output \(y^l_{adv}\) by the worst-case estimation head on labeled data
y_u: logits output \(y^u\) by the main head on unlabeled data
y_u_adv: logits output \(y^u_{adv}\) by the worst-case estimation head on unlabeled data
 Shape:
Inputs: \((minibatch, C)\) where C denotes the number of classes.
Output: scalar.
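A plain-Python sketch of the two terms, written as a quantity to be minimized by the worst-case head (negative log-likelihood form). The function names and toy logits are hypothetical, the small `eps` is an assumption added for numerical safety, and the expectations are taken as simple minibatch means.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    return max(range(len(values)), key=lambda j: values[j])


def worst_case_estimation_loss(y_l, y_l_adv, y_u, y_u_adv, eta_prime, eps=1e-12):
    # Labeled data: the worst-case head should agree with the main head's label.
    loss_l = sum(-math.log(softmax(adv)[argmax(main)] + eps)
                 for main, adv in zip(y_l, y_l_adv)) / len(y_l)
    # Unlabeled data: the worst-case head should disagree with the main head's label.
    loss_u = sum(-math.log(1.0 - softmax(adv)[argmax(main)] + eps)
                 for main, adv in zip(y_u, y_u_adv)) / len(y_u)
    return eta_prime * loss_l + loss_u


y_l, y_l_adv = [[3.0, 0.0]], [[3.0, 0.0]]  # heads agree on labeled data
y_u = [[3.0, 0.0]]

loss_disagree = worst_case_estimation_loss(y_l, y_l_adv, y_u, [[0.0, 3.0]], eta_prime=2.0)
loss_agree = worst_case_estimation_loss(y_l, y_l_adv, y_u, [[3.0, 0.0]], eta_prime=2.0)
```

The loss is small when the worst-case head matches the main head on labeled data and contradicts it on unlabeled data, and large when it simply copies the main head everywhere.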