Self Training Methods
Pseudo Label
- class tllib.self_training.pseudo_label.ConfidenceBasedSelfTrainingLoss(threshold)
Self-training loss that adopts a confidence threshold to select reliable pseudo labels, from Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks (ICML 2013).
- Parameters
threshold (float) – Confidence threshold.
- Inputs:
y: unnormalized classifier predictions.
y_target: unnormalized classifier predictions which will be used for generating pseudo labels.
- Returns
- A tuple, including
self_training_loss: self training loss with pseudo labels.
mask: binary mask that indicates which samples are retained (whose confidence is above the threshold).
pseudo_labels: generated pseudo labels.
- Shape:
y, y_target: \((minibatch, C)\) where C means the number of classes.
self_training_loss: scalar.
mask, pseudo_labels: \((minibatch, )\).
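A minimal usage sketch (not from the original docstring), assuming the forward call takes (y, y_target) as listed under Inputs and returns the tuple described above:
>>> import torch
>>> from tllib.self_training.pseudo_label import ConfidenceBasedSelfTrainingLoss
>>> batch_size, num_classes = 32, 10
>>> criterion = ConfidenceBasedSelfTrainingLoss(threshold=0.95)
>>> # unnormalized predictions on two views of the same unlabeled mini-batch
>>> y = torch.randn(batch_size, num_classes)
>>> y_target = torch.randn(batch_size, num_classes)
>>> self_training_loss, mask, pseudo_labels = criterion(y, y_target)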
\(\Pi\) Model
- class tllib.self_training.pi_model.ConsistencyLoss(distance_measure, reduction='mean')
Consistency loss between two predictions. Given a distance measure \(D\), predictions \(p_1, p_2\) and a binary mask \(mask\), the consistency loss is
\[D(p_1, p_2) * mask\]
- Parameters
distance_measure (callable) – Distance measure function.
reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed. Default: 'mean'
- Inputs:
p1: the first prediction
p2: the second prediction
mask: binary mask. Default: 1. (use all samples when calculating loss)
- Shape:
p1, p2: \((N, C)\) where C means the number of classes.
mask: \((N, )\) where N means mini-batch size.
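A minimal usage sketch (not from the original docstring); the per-sample squared-L2 distance passed as distance_measure is only an illustrative choice:
>>> import torch
>>> from tllib.self_training.pi_model import ConsistencyLoss
>>> batch_size, num_classes = 32, 10
>>> # illustrative distance measure that returns one value per sample
>>> squared_l2 = lambda p1, p2: ((p1 - p2) ** 2).sum(dim=1)
>>> consistency = ConsistencyLoss(distance_measure=squared_l2, reduction='mean')
>>> p1 = torch.randn(batch_size, num_classes).softmax(dim=1)
>>> p2 = torch.randn(batch_size, num_classes).softmax(dim=1)
>>> loss = consistency(p1, p2)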
Mean Teacher
- class tllib.self_training.mean_teacher.EMATeacher(model, alpha)
Exponential moving average model from Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results (NIPS 2017).
We use \(\theta_t'\) to denote the parameters of the teacher model and \(\theta_t\) to denote the parameters of the student model at training step \(t\). Given a decay factor \(\alpha\), we update the teacher model in an exponential moving average manner:
\[\theta_t'=\alpha \theta_{t-1}' + (1-\alpha)\theta_t\]
- Parameters
model (torch.nn.Module) – the student model
alpha (float) – decay factor for EMA.
- Inputs:
x (tensor): input tensor
Examples:
>>> classifier = ImageClassifier(backbone, num_classes=31, bottleneck_dim=256).to(device)
>>> # initialize teacher model
>>> teacher = EMATeacher(classifier, 0.9)
>>> num_iterations = 1000
>>> for _ in range(num_iterations):
>>>     # x denotes input of one mini-batch
>>>     # you can get teacher model's output by teacher(x)
>>>     y_teacher = teacher(x)
>>>     # when you want to update teacher, you should call teacher.update()
>>>     teacher.update()
Self Ensemble
- class tllib.self_training.self_ensemble.ClassBalanceLoss(num_classes)
Class balance loss that penalises the network for making predictions that exhibit large class imbalance. Given predictions \(p\) with dimension \((N, C)\), we first calculate the mini-batch mean per-class probability \(p_{mean}\) with dimension \((C, )\), where
\[p_{mean}^j = \frac{1}{N} \sum_{i=1}^N p_i^j\]
Then we calculate the binary cross entropy loss between \(p_{mean}\) and a uniform probability vector \(u\) of the same dimension, where \(u^j = \frac{1}{C}\):
\[loss = \text{BCELoss}(p_{mean}, u)\]
- Parameters
num_classes (int) – Number of classes
- Inputs:
p (tensor): predictions from classifier
- Shape:
p: \((N, C)\) where C means the number of classes.
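Since the formula is short, here is an illustrative re-implementation (not the library code) alongside the class itself, assuming its forward call takes the predictions p:
>>> import torch
>>> import torch.nn.functional as F
>>> from tllib.self_training.self_ensemble import ClassBalanceLoss
>>> batch_size, num_classes = 32, 10
>>> p = torch.randn(batch_size, num_classes).softmax(dim=1)   # predictions (N, C)
>>> p_mean = p.mean(dim=0)                                     # mini-batch mean per-class probability (C, )
>>> u = torch.full_like(p_mean, 1. / num_classes)              # uniform probability vector
>>> loss_by_hand = F.binary_cross_entropy(p_mean, u)           # BCELoss(p_mean, u)
>>> class_balance = ClassBalanceLoss(num_classes)
>>> loss = class_balance(p)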
UDA
- class tllib.self_training.uda.StrongWeakConsistencyLoss(threshold, temperature)
Consistency loss between strongly and weakly augmented samples from Unsupervised Data Augmentation for Consistency Training (NIPS 2020).
- Inputs:
y_strong: unnormalized classifier predictions on strongly augmented samples.
y: unnormalized classifier predictions on weakly augmented samples.
- Shape:
y, y_strong: \((minibatch, C)\) where C means the number of classes.
Output: scalar.
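A minimal usage sketch (not from the original docstring); the threshold and temperature values are illustrative, and the forward call is assumed to take (y_strong, y) as listed under Inputs:
>>> import torch
>>> from tllib.self_training.uda import StrongWeakConsistencyLoss
>>> batch_size, num_classes = 32, 10
>>> uda_loss = StrongWeakConsistencyLoss(threshold=0.8, temperature=0.4)
>>> # unnormalized predictions on strongly and weakly augmented views of the same batch
>>> y_strong = torch.randn(batch_size, num_classes)
>>> y = torch.randn(batch_size, num_classes)
>>> loss = uda_loss(y_strong, y)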
MCC: Minimum Class Confusion
- class tllib.self_training.mcc.MinimumClassConfusionLoss(temperature)
Minimum Class Confusion loss minimizes the class confusion in the target predictions.
You can see more details in Minimum Class Confusion for Versatile Domain Adaptation (ECCV 2020)
- Parameters
temperature (float) – The temperature for rescaling; the prediction reduces to the vanilla softmax when the temperature is 1.0.
Note
Make sure that temperature is larger than 0.
- Inputs: g_t
g_t (tensor): unnormalized classifier predictions on target domain, \(g^t\)
- Shape:
g_t: \((minibatch, C)\) where C means the number of classes.
Output: scalar.
- Examples:
>>> temperature = 2.0
>>> loss = MinimumClassConfusionLoss(temperature)
>>> # logits output from target domain
>>> g_t = torch.randn(batch_size, num_classes)
>>> output = loss(g_t)
MCC can also serve as a regularizer for existing methods. Examples:
>>> from tllib.modules.domain_discriminator import DomainDiscriminator
>>> from tllib.alignment.cdan import ConditionalDomainAdversarialLoss
>>> num_classes = 2
>>> feature_dim = 1024
>>> batch_size = 10
>>> temperature = 2.0
>>> discriminator = DomainDiscriminator(in_feature=feature_dim, hidden_size=1024)
>>> cdan_loss = ConditionalDomainAdversarialLoss(discriminator, reduction='mean')
>>> mcc_loss = MinimumClassConfusionLoss(temperature)
>>> # features from source domain and target domain
>>> f_s, f_t = torch.randn(batch_size, feature_dim), torch.randn(batch_size, feature_dim)
>>> # logits output from source domain and target domain
>>> g_s, g_t = torch.randn(batch_size, num_classes), torch.randn(batch_size, num_classes)
>>> total_loss = cdan_loss(g_s, f_s, g_t, f_t) + mcc_loss(g_t)
MMT: Mutual Mean-Teaching
State-of-the-art unsupervised domain adaptation methods utilize clustering algorithms to generate pseudo labels on the target domain, which are noisy and thus harmful for training. Inspired by teacher-student approaches, the MMT framework provides robust soft pseudo labels in an online peer-teaching manner.
We denote the two networks as \(f_1,f_2\) and their parameters as \(\theta_1,\theta_2\). The authors also propose to use the temporally averaged model of each network, \(\text{ensemble}(f_1)\) and \(\text{ensemble}(f_2)\), to generate more reliable soft pseudo labels for supervising the other network. Specifically, the parameters of the temporally averaged models of the two networks at current iteration \(T\) are denoted as \(E^{(T)}[\theta_1]\) and \(E^{(T)}[\theta_2]\) respectively, which can be calculated as
\[E^{(T)}[\theta_1] = \alpha E^{(T-1)}[\theta_1] + (1-\alpha)\theta_1, \quad E^{(T)}[\theta_2] = \alpha E^{(T-1)}[\theta_2] + (1-\alpha)\theta_2,\]
where \(E^{(T-1)}[\theta_1],E^{(T-1)}[\theta_2]\) indicate the temporally averaged parameters of the two networks in the previous iteration \((T-1)\), the initial temporally averaged parameters are \(E^{(0)}[\theta_1]=\theta_1,E^{(0)}[\theta_2]=\theta_2\), and \(\alpha\) is the momentum.
These two networks cooperate with each other in three ways:
- When running the clustering algorithm, we average the features produced by \(\text{ensemble}(f_1)\) and \(\text{ensemble}(f_2)\) instead of only considering one of them.
- A soft triplet loss is optimized between \(f_1\) and \(\text{ensemble}(f_2)\), and vice versa, to force one network to learn from the temporal average of the other network.
- A cross entropy loss is optimized between \(f_1\) and \(\text{ensemble}(f_2)\), and vice versa, to force one network to learn from the temporal average of the other network.
The above-mentioned loss functions are listed below; more details can be found in the training scripts.
- class tllib.vision.models.reid.loss.SoftTripletLoss(margin=None, normalize_feature=False)
Soft triplet loss from Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification (ICLR 2020). Consider a triplet \(x,x_p,x_n\) (anchor, positive, negative) with corresponding features \(f,f_p,f_n\). We optimize for a smaller distance between \(f\) and \(f_p\) and a larger distance between \(f\) and \(f_n\). The inner product is adopted as the similarity measure, so the soft triplet loss is defined as
\[loss = \mathcal{L}_{\text{bce}}(\frac{\text{exp}(f^Tf_p)}{\text{exp}(f^Tf_p)+\text{exp}(f^Tf_n)}, 1)\]
where \(\mathcal{L}_{\text{bce}}\) means the binary cross entropy loss. We denote the first argument of \(\mathcal{L}_{\text{bce}}\) above as \(T\). When features from a teacher network are available, we can calculate \(T_{teacher}\) in the same way and use it as the label, resulting in the following soft version
\[loss = \mathcal{L}_{\text{bce}}(T, T_{teacher})\]
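The two objectives above can be written out directly; the following is an illustrative re-implementation (not the library code), assuming features have already been arranged into anchor/positive/negative triplets:
>>> import torch
>>> import torch.nn.functional as F
>>> N, d = 32, 128
>>> # student features for anchor, positive and negative samples
>>> f, f_p, f_n = torch.randn(N, d), torch.randn(N, d), torch.randn(N, d)
>>> # T = exp(f^T f_p) / (exp(f^T f_p) + exp(f^T f_n)), i.e. a softmax over the two similarities
>>> sim = torch.stack([(f * f_p).sum(dim=1), (f * f_n).sum(dim=1)], dim=1)
>>> T = F.softmax(sim, dim=1)[:, 0]
>>> hard_loss = F.binary_cross_entropy(T, torch.ones_like(T))     # L_bce(T, 1)
>>> # soft version: T_teacher computed the same way from the teacher network's features
>>> f_t, f_tp, f_tn = torch.randn(N, d), torch.randn(N, d), torch.randn(N, d)
>>> sim_t = torch.stack([(f_t * f_tp).sum(dim=1), (f_t * f_tn).sum(dim=1)], dim=1)
>>> T_teacher = F.softmax(sim_t, dim=1)[:, 0]
>>> soft_loss = F.binary_cross_entropy(T, T_teacher)              # L_bce(T, T_teacher)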
- class tllib.vision.models.reid.loss.CrossEntropyLoss
We use \(C\) to denote the number of classes and \(N\) to denote the mini-batch size. This criterion expects unnormalized predictions \(y\_{logits}\) of shape \((N, C)\) and \(target\_{logits}\) of the same shape \((N, C)\). We first normalize them into probability distributions over classes
\[y = \text{softmax}(y\_{logits})\]
\[target = \text{softmax}(target\_{logits})\]
The final objective is calculated as
\[\text{loss} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^C -target_i^j \times \text{log} (y_i^j)\]
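The objective is equally easy to express directly; an illustrative re-implementation (not the library code):
>>> import torch
>>> import torch.nn.functional as F
>>> N, C = 32, 10
>>> y_logits, target_logits = torch.randn(N, C), torch.randn(N, C)
>>> log_y = F.log_softmax(y_logits, dim=1)        # log(y)
>>> target = F.softmax(target_logits, dim=1)      # target distribution
>>> loss = (-target * log_y).sum(dim=1).mean()    # average over the mini-batch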
Self Tuning
- class tllib.self_training.self_tuning.Classifier(backbone, num_classes, projection_dim=1024, bottleneck_dim=1024, finetune=True, pool_layer=None)
Classifier class for Self-Tuning.
- Parameters
backbone (torch.nn.Module) – Any backbone to extract 2-d features from data
num_classes (int) – Number of classes.
projection_dim (int, optional) – Dimension of the projector head. Default: 128
finetune (bool) – Whether to fine-tune the classifier or train it from scratch. Default: True
- Inputs:
x (tensor): input data fed to backbone
- Outputs:
- In the training mode,
h: projections
y: classifier’s predictions
- In the eval mode,
y: classifier’s predictions
- Shape:
Inputs: (minibatch, *) where * means any number of additional dimensions
y: (minibatch, num_classes)
h: (minibatch, projection_dim)
- class tllib.self_training.self_tuning.SelfTuning(encoder_q, encoder_k, num_classes, K=32, m=0.999, T=0.07)
Self-Tuning module from Self-Tuning for Data-Efficient Deep Learning (ICML 2021).
- Parameters
encoder_q (Classifier) – Query encoder.
encoder_k (Classifier) – Key encoder.
num_classes (int) – Number of classes
K (int) – Queue size. Default: 32
m (float) – Momentum coefficient. Default: 0.999
T (float) – Temperature. Default: 0.07
- Inputs:
im_q (tensor): input data fed to encoder_q
im_k (tensor): input data fed to encoder_k
labels (tensor): classification labels of input data
- Outputs: pgc_logits, pgc_labels, y_q
pgc_logits: projector’s predictions on both positive and negative samples
pgc_labels: contrastive labels
y_q: query classifier’s predictions
- Shape:
im_q, im_k: (minibatch, *) where * means any number of additional dimensions
labels: (minibatch, )
y_q: (minibatch, num_classes)
pgc_logits: (minibatch, 1 + num_classes \(\times\) K, projection_dim)
pgc_labels: (minibatch, 1 + num_classes \(\times\) K)
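A minimal usage sketch (not from the original docstring); backbone_q, backbone_k, im_q, im_k, labels and device are placeholders, as in the EMATeacher example above, and the output order is taken from the Outputs list:
>>> from tllib.self_training.self_tuning import Classifier, SelfTuning
>>> # build query and key encoders from the Classifier class above
>>> encoder_q = Classifier(backbone_q, num_classes=31)
>>> encoder_k = Classifier(backbone_k, num_classes=31)
>>> self_tuning = SelfTuning(encoder_q, encoder_k, num_classes=31, K=32, m=0.999, T=0.07).to(device)
>>> # im_q, im_k denote two augmented views of one mini-batch, labels their labels
>>> pgc_logits, pgc_labels, y_q = self_tuning(im_q, im_k, labels)
How the contrastive term on pgc_logits / pgc_labels is combined with the classification term on y_q is left to the training code.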
FlexMatch
- class tllib.self_training.flexmatch.DynamicThresholdingModule(threshold, warmup, mapping_func, num_classes, n_unlabeled_samples, device)
Dynamic thresholding module from FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling. At time \(t\), for each category \(c\), the learning status \(\sigma_t(c)\) is estimated by the number of samples whose predictions fall into this class and whose confidence is above a threshold (e.g. 0.95). Then, FlexMatch normalizes \(\sigma_t(c)\) so that its range is between 0 and 1:
\[\beta_t(c) = \frac{\sigma_t(c)}{\underset{c'}{\text{max}}~\sigma_t(c')}.\]
The dynamic threshold is formulated as
\[\mathcal{T}_t(c) = \mathcal{M}(\beta_t(c)) \cdot \tau,\]
where \(\tau\) denotes the pre-defined threshold (e.g. 0.95) and \(\mathcal{M}\) denotes a (possibly non-linear) mapping function.
- Parameters
threshold (float) – The pre-defined confidence threshold
warmup (bool) – Whether perform threshold warm-up. If True, the number of unlabeled data that have not been used will be considered when normalizing \(\sigma_t(c)\)
mapping_func (callable) – An increasing mapping function. For example, this function can be (1) concave \(\mathcal{M}(x)=\text{ln}(x+1)/\text{ln}2\), (2) linear \(\mathcal{M}(x)=x\), or (3) convex \(\mathcal{M}(x)=x/(2-x)\)
num_classes (int) – Number of classes
n_unlabeled_samples (int) – Size of the unlabeled dataset
device (torch.device) – Device
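The per-class threshold itself is straightforward to compute from the formulas above; the following illustrative sketch (not the module's API, whose methods are not listed here) uses the linear mapping \(\mathcal{M}(x)=x\):
>>> import torch
>>> num_classes, tau = 10, 0.95
>>> # confidence and predicted class for a pool of unlabeled samples
>>> confidence, pred = torch.randn(1000, num_classes).softmax(dim=1).max(dim=1)
>>> # learning status: number of confident predictions falling into each class
>>> sigma = torch.zeros(num_classes)
>>> for c in range(num_classes):
>>>     sigma[c] = ((pred == c) & (confidence >= tau)).sum()
>>> beta = sigma / sigma.max().clamp(min=1)      # normalize to [0, 1]
>>> dynamic_threshold = beta * tau               # T_t(c) = M(beta_t(c)) * tau with M(x) = x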
Debiased Self-Training
- class tllib.self_training.dst.ImageClassifier(backbone, num_classes, bottleneck_dim=1024, width=2048, **kwargs)
Classifier with a non-linear pseudo head \(h_{\text{pseudo}}\) and a worst-case estimation head \(h_{\text{worst}}\) from Debiased Self-Training for Semi-Supervised Learning. Both heads are directly connected to the feature extractor \(\psi\). We implement the end-to-end adversarial training procedure between \(\psi\) and \(h_{\text{worst}}\) by introducing a gradient reverse layer. Note that both heads can be safely discarded during inference, and thus introduce no inference cost.
- Parameters
backbone (torch.nn.Module) – Any backbone to extract 2-d features from data
num_classes (int) – Number of classes
bottleneck_dim (int, optional) – Feature dimension of the bottleneck layer.
width (int, optional) – Hidden dimension of the non-linear pseudo head and worst-case estimation head.
- Inputs:
x (tensor): input data fed to backbone
- Outputs:
outputs: predictions of the main head \(h\)
outputs_adv: predictions of the worst-case estimation head \(h_{\text{worst}}\)
outputs_pseudo: predictions of the pseudo head \(h_{\text{pseudo}}\)
- Shape:
Inputs: (minibatch, *) where * means any number of additional dimensions
outputs, outputs_adv, outputs_pseudo: (minibatch, num_classes)
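A minimal usage sketch in the style of the EMATeacher example above (backbone, x and device are placeholders; the return order is assumed to follow the Outputs list):
>>> from tllib.self_training.dst import ImageClassifier
>>> classifier = ImageClassifier(backbone, num_classes=31, bottleneck_dim=1024, width=2048).to(device)
>>> # x denotes input of one mini-batch
>>> outputs, outputs_adv, outputs_pseudo = classifier(x)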
- class tllib.self_training.dst.WorstCaseEstimationLoss(eta_prime)
Worst-case estimation loss from Debiased Self-Training for Semi-Supervised Learning that forces the worst possible head \(h_{\text{worst}}\) to predict correctly on all labeled samples \(\mathcal{L}\) while making as many mistakes as possible on unlabeled data \(\mathcal{U}\). In the classification task, it is defined as:
\[loss(\mathcal{L}, \mathcal{U}) = \eta' \mathbb{E}_{y^l, y_{adv}^l \sim\hat{\mathcal{L}}} -\log\left(\frac{\exp(y_{adv}^l[h_{y^l}])}{\sum_j \exp(y_{adv}^l[j])}\right) + \mathbb{E}_{y^u, y_{adv}^u \sim\hat{\mathcal{U}}} -\log\left(1-\frac{\exp(y_{adv}^u[h_{y^u}])}{\sum_j \exp(y_{adv}^u[j])}\right),\]
where \(y^l\) and \(y^u\) are logits output by the main head \(h\) on labeled data and unlabeled data, respectively. \(y_{adv}^l\) and \(y_{adv}^u\) are logits output by the worst-case estimation head \(h_{\text{worst}}\). \(h_y\) refers to the predicted label when the logits output is \(y\).
- Parameters
eta_prime (float) – the trade-off hyperparameter \(\eta'\).
- Inputs:
y_l: logits output \(y^l\) by the main head on labeled data
y_l_adv: logits output \(y^l_{adv}\) by the worst-case estimation head on labeled data
y_u: logits output \(y^u\) by the main head on unlabeled data
y_u_adv: logits output \(y^u_{adv}\) by the worst-case estimation head on unlabeled data
- Shape:
Inputs: \((minibatch, C)\) where C denotes the number of classes.
Output: scalar.
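A minimal usage sketch (not from the original docstring), assuming the forward call takes (y_l, y_l_adv, y_u, y_u_adv) in the order listed under Inputs:
>>> import torch
>>> from tllib.self_training.dst import WorstCaseEstimationLoss
>>> batch_size, num_classes = 32, 10
>>> worst_case_loss = WorstCaseEstimationLoss(eta_prime=2.0)
>>> # logits from the main head and the worst-case head on labeled data
>>> y_l, y_l_adv = torch.randn(batch_size, num_classes), torch.randn(batch_size, num_classes)
>>> # logits from the main head and the worst-case head on unlabeled data
>>> y_u, y_u_adv = torch.randn(batch_size, num_classes), torch.randn(batch_size, num_classes)
>>> loss = worst_case_loss(y_l, y_l_adv, y_u, y_u_adv)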