Models
Image Classification
ResNets
Modified from torchvision.models.resnet. @author: Junguang Jiang @contact: JiangJunguang1123@outlook.com
- class tllib.vision.models.resnet.ResNet(*args, **kwargs)
ResNets without fully connected layer
- property out_features
The dimension of output features
- tllib.vision.models.resnet.resnet18(pretrained=False, progress=True, **kwargs)
ResNet-18 model from “Deep Residual Learning for Image Recognition”
- tllib.vision.models.resnet.resnet34(pretrained=False, progress=True, **kwargs)
ResNet-34 model from “Deep Residual Learning for Image Recognition”
- tllib.vision.models.resnet.resnet50(pretrained=False, progress=True, **kwargs)
ResNet-50 model from “Deep Residual Learning for Image Recognition”
- tllib.vision.models.resnet.resnet101(pretrained=False, progress=True, **kwargs)
ResNet-101 model from “Deep Residual Learning for Image Recognition”
- tllib.vision.models.resnet.resnet152(pretrained=False, progress=True, **kwargs)
ResNet-152 model from “Deep Residual Learning for Image Recognition”
- tllib.vision.models.resnet.resnext50_32x4d(pretrained=False, progress=True, **kwargs)
ResNeXt-50 32x4d model from “Aggregated Residual Transformations for Deep Neural Networks”
- tllib.vision.models.resnet.resnext101_32x8d(pretrained=False, progress=True, **kwargs)
ResNeXt-101 32x8d model from “Aggregated Residual Transformations for Deep Neural Networks”
- tllib.vision.models.resnet.wide_resnet50_2(pretrained=False, progress=True, **kwargs)
Wide ResNet-50-2 model from “Wide Residual Networks”
The model is the same as ResNet except that the number of channels in the bottleneck is twice as large in every block. The number of channels in outer 1x1 convolutions is the same, e.g. the last block in ResNet-50 has 2048-512-2048 channels, while in Wide ResNet-50-2 it has 2048-1024-2048.
- tllib.vision.models.resnet.wide_resnet101_2(pretrained=False, progress=True, **kwargs)
Wide ResNet-101-2 model from “Wide Residual Networks”
The model is the same as ResNet except that the number of channels in the bottleneck is twice as large in every block. The number of channels in outer 1x1 convolutions is the same, e.g. the last block in ResNet-50 has 2048-512-2048 channels, while in Wide ResNet-50-2 it has 2048-1024-2048.
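A minimal usage sketch for these backbones. The 224 x 224 input size is an example value; out_features is 2048 for ResNet-50, and the exact shape of the returned feature tensor depends on whether the backbone keeps spatial dimensions, so it is only printed here:

import torch
from tllib.vision.models.resnet import resnet50

# ImageNet-pretrained ResNet-50 backbone without the fully connected layer.
backbone = resnet50(pretrained=True)
print(backbone.out_features)  # 2048, the dimension of the output features

# Forward a dummy batch of RGB images through the backbone.
images = torch.rand(4, 3, 224, 224)
features = backbone(images)
print(features.shape)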
LeNet
LeNet model from “Gradient-based learning applied to document recognition”
- Parameters
num_classes (int) – number of classes. Default: 10
Note
The input image size must be 28 x 28.
DTN
DTN model
- Parameters
num_classes (int) – number of classes. Default: 10
Note
The input image size must be 32 x 32.
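A construction sketch for the two digit models. The import path below is an assumption (it is not shown in this section), and so are the input channel counts; only the spatial sizes follow from the notes above:

import torch
# NOTE: assumed import path; adjust to where LeNet and DTN live in your tllib version.
from tllib.vision.models import LeNet, DTN

lenet = LeNet(num_classes=10)  # expects 28 x 28 inputs
dtn = DTN(num_classes=10)      # expects 32 x 32 inputs

x28 = torch.rand(8, 1, 28, 28)  # grayscale channel count is an assumption
x32 = torch.rand(8, 3, 32, 32)  # 3-channel input is an assumption
print(lenet(x28).shape)
print(dtn(x32).shape)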
Object Detection
- class tllib.vision.models.object_detection.meta_arch.TLGeneralizedRCNN(*args, finetune=False, **kwargs)
Generalized R-CNN for Transfer Learning. As in supervised learning, TLGeneralizedRCNN has the following three components:
1. Per-image feature extraction (aka backbone)
2. Region proposal generation
3. Per-region feature extraction and prediction
Different from supervised learning, TLGeneralizedRCNN
1. accepts unlabeled images during training (returns no losses for them)
2. returns detection outputs, features, and losses during training
- Parameters
backbone – a backbone module, must follow detectron2’s backbone interface
proposal_generator – a module that generates proposals using backbone features
roi_heads – a ROI head that performs per-region computation
pixel_mean, pixel_std – list or tuple with #channels element, representing the per-channel mean and std to be used to normalize the input image
input_format – describe the meaning of channels of input. Needed by visualization
vis_period – the period to run visualization. Set to 0 to disable.
finetune (bool) – whether to finetune the detector or train from scratch. Default: True
- Inputs:
batched_inputs: a list, batched outputs of DatasetMapper. Each item in the list contains the inputs for one image. For now, each item in the list is a dict that contains:
image: Tensor, image in (C, H, W) format.
instances (optional): ground-truth Instances
proposals (optional): Instances, precomputed proposals.
“height”, “width” (int): the output resolution of the model, used in inference. See postprocess() for details.
labeled (bool, optional): whether the image has a ground-truth label
- Outputs:
outputs: A list of dicts, where each dict is the output for one input image. The dict contains a key “instances” whose value is an Instances object and a key “features” whose value is the features of the middle layers. The Instances object has the following keys: “pred_boxes”, “pred_classes”, “scores”, “pred_masks”, “pred_keypoints”.
losses: A dict of different losses
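A sketch of the batched_inputs format described above (the image values and resolutions are made up, and building the actual model from a detectron2 config is omitted here); the same format is consumed by TLRetinaNet below:

import torch

labeled_item = {
    "image": torch.rand(3, 480, 640) * 255,  # image tensor in (C, H, W) format
    "height": 480,                           # output resolution used in inference
    "width": 640,
    # "instances": ...                       # ground-truth Instances for labeled images
    "labeled": True,
}
unlabeled_item = {
    "image": torch.rand(3, 480, 640) * 255,
    "height": 480,
    "width": 640,
    "labeled": False,  # unlabeled images contribute no losses during training
}
batched_inputs = [labeled_item, unlabeled_item]
# A TLGeneralizedRCNN instance `model` consumes this list directly as model(batched_inputs);
# during training it returns both detection outputs and losses, as listed under Outputs above.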
- class tllib.vision.models.object_detection.meta_arch.TLRetinaNet(*args, finetune=False, **kwargs)
RetinaNet for Transfer Learning.
Different from supervised learning, TLRetinaNet
1. accepts unlabeled images during training (returns no losses for them)
2. returns detection outputs, features, and losses during training
- Parameters
backbone – a backbone module, must follow detectron2’s backbone interface
head (nn.Module) – a module that predicts logits and regression deltas for each level from a list of per-level features
head_in_features (Tuple[str]) – Names of the input feature maps to be used in head
anchor_generator (nn.Module) – a module that creates anchors from a list of features. Usually an instance of AnchorGenerator
box2box_transform (Box2BoxTransform) – defines the transform from anchor boxes to instance boxes
anchor_matcher (Matcher) – label the anchors by matching them with ground truth.
num_classes (int) – number of classes. Used to label background proposals.
Loss parameters:
focal_loss_alpha (float) – focal_loss_alpha
focal_loss_gamma (float) – focal_loss_gamma
smooth_l1_beta (float) – smooth_l1_beta
box_reg_loss_type (str) – Options are “smooth_l1”, “giou”
Inference parameters:
test_score_thresh (float) – Inference cls score threshold, only anchors with score > INFERENCE_TH are considered for inference (to improve speed)
test_topk_candidates (int) – Select topk candidates before NMS
test_nms_thresh (float) – Overlap threshold used for non-maximum suppression (suppress boxes with IoU >= this threshold)
max_detections_per_image (int) – Maximum number of detections to return per image during inference (100 is based on the limit established for the COCO dataset).
Input parameters:
pixel_mean (Tuple[float]) – Values to be used for image normalization (BGR order). To train on images of different number of channels, set different mean & std. Default values are the mean pixel value from ImageNet: [103.53, 116.28, 123.675]
pixel_std (Tuple[float]) – When using pre-trained models in Detectron1 or any MSRA models, std has been absorbed into its conv1 weights, so the std needs to be set to 1. Otherwise, you can use [57.375, 57.120, 58.395] (ImageNet std)
vis_period (int) – The period (in terms of steps) for minibatch visualization at train time. Set to 0 to disable.
input_format (str) – Whether the model needs RGB, YUV, HSV etc.
finetune (bool) – whether to finetune the detector or train from scratch. Default: True
- Inputs:
batched_inputs: a list, batched outputs of DatasetMapper. Each item in the list contains the inputs for one image. For now, each item in the list is a dict that contains:
image: Tensor, image in (C, H, W) format.
instances (optional): ground-truth Instances
“height”, “width” (int): the output resolution of the model, used in inference. See postprocess() for details.
labeled (bool, optional): whether the image has a ground-truth label
- Outputs:
outputs: A list of dicts, where each dict is the output for one input image. The dict contains a key “instances” whose value is an Instances object and a key “features” whose value is the features of the middle layers. The Instances object has the following keys: “pred_boxes”, “pred_classes”, “scores”, “pred_masks”, “pred_keypoints”.
losses: A dict of different losses
- class tllib.vision.models.object_detection.proposal_generator.rpn.TLRPN(*args, **kwargs)
Region Proposal Network, introduced by Faster R-CNN.
- Parameters
in_features (list[str]) – list of names of input features to use
head (nn.Module) – a module that predicts logits and regression deltas for each level from a list of per-level features
anchor_generator (nn.Module) – a module that creates anchors from a list of features. Usually an instance of AnchorGenerator
anchor_matcher (Matcher) – label the anchors by matching them with ground truth.
box2box_transform (Box2BoxTransform) – defines the transform from anchor boxes to instance boxes
batch_size_per_image (int) – number of anchors per image to sample for training
positive_fraction (float) – fraction of foreground anchors to sample for training
pre_nms_topk (tuple[float]) – (train, test) that represents the number of top k proposals to select before NMS, in training and testing.
post_nms_topk (tuple[float]) – (train, test) that represents the number of top k proposals to select after NMS, in training and testing.
nms_thresh (float) – NMS threshold used to de-duplicate the predicted proposals
min_box_size (float) – remove proposal boxes with any side smaller than this threshold, in the unit of input image pixels
anchor_boundary_thresh (float) – legacy option
loss_weight (float|dict) – weights to use for losses. Can be a single float for weighting all RPN losses together, or a dict of individual weightings. Valid dict keys are: “loss_rpn_cls” – applied to classification loss; “loss_rpn_loc” – applied to box regression loss
box_reg_loss_type (str) – Loss type to use. Supported losses: “smooth_l1”, “giou”.
smooth_l1_beta (float) – beta parameter for the smooth L1 regression loss. Default to use L1 loss. Only used when box_reg_loss_type is “smooth_l1”
- Inputs:
images (ImageList): input images of length N
features (dict[str, Tensor]): input data as a mapping from feature map name to tensor. Axis 0 represents the number of images N in the input data; axes 1-3 are channels, height, and width, which may vary between feature maps (e.g., if a feature pyramid is used).
gt_instances (list[Instances], optional): a length N list of Instances. Each Instances stores ground-truth instances for the corresponding image.
labeled (bool, optional): whether the image has a ground-truth label. Default: True
- Outputs:
proposals: list[Instances]: contains fields “proposal_boxes”, “objectness_logits”
loss: dict[Tensor] or None
- class tllib.vision.models.object_detection.roi_heads.TLRes5ROIHeads(*args, **kwargs)
The ROIHeads in a typical “C4” R-CNN model, where the box and mask head share the cropping and the per-region feature computation by a Res5 block.
- Parameters
in_features (list[str]) – list of backbone feature map names to use for feature extraction
pooler (ROIPooler) – pooler to extract region features from the backbone
res5 (nn.Sequential) – a CNN to compute per-region features, to be used by box_predictor and mask_head. Typically this is a “res5” block from a ResNet.
box_predictor (nn.Module) – make box predictions from the features. Should have the same interface as FastRCNNOutputLayers.
mask_head (nn.Module) – transform features to make mask predictions
- Inputs:
images (ImageList):
features (dict[str,Tensor]): input data as a mapping from feature map name to tensor. Axis 0 represents the number of images N in the input data; axes 1-3 are channels, height, and width, which may vary between feature maps (e.g., if a feature pyramid is used).
proposals (list[Instances]): length N list of Instances. The i-th Instances contains object proposals for the i-th input image, with fields “proposal_boxes” and “objectness_logits”.
targets (list[Instances], optional): length N list of Instances. The i-th Instances contains the ground-truth per-instance annotations for the i-th input image. Specify targets during training only. It may have the following fields:
gt_boxes: the bounding box of each instance.
gt_classes: the label for each instance with a category ranging in [0, #class].
gt_masks: PolygonMasks or BitMasks, the ground-truth masks of each instance.
gt_keypoints: NxKx3, the ground-truth keypoints for each instance.
labeled (bool, optional): whether the image has a ground-truth label. Default: True
- Outputs:
list[Instances]: length N list of Instances containing the detected instances. Returned during inference only; may be [] during training.
dict[str->Tensor]: mapping from a named loss to a tensor storing the loss. Used during training only.
- class tllib.vision.models.object_detection.roi_heads.TLStandardROIHeads(*args, **kwargs)
It’s “standard” in the sense that there is no ROI transform sharing or feature sharing between tasks. Each head independently processes the input features by each head’s own pooler and head.
- Parameters
box_in_features (list[str]) – list of feature names to use for the box head.
box_pooler (ROIPooler) – pooler to extract region features for the box head
box_head (nn.Module) – transform features to make box predictions
box_predictor (nn.Module) – make box predictions from the features. Should have the same interface as FastRCNNOutputLayers.
mask_in_features (list[str]) – list of feature names to use for the mask pooler or mask head. None if not using mask head.
mask_pooler (ROIPooler) – pooler to extract region features from image features. The mask head will then take region features to make predictions. If None, the mask head will directly take the dict of image features defined by mask_in_features
mask_head (nn.Module) – transform features to make mask predictions
keypoint_in_features, keypoint_pooler, keypoint_head – similar to mask_*.
train_on_pred_boxes (bool) – whether to use proposal boxes or predicted boxes from the box head to train other heads.
- Inputs:
images (ImageList):
features (dict[str,Tensor]): input data as a mapping from feature map name to tensor. Axis 0 represents the number of images N in the input data; axes 1-3 are channels, height, and width, which may vary between feature maps (e.g., if a feature pyramid is used).
proposals (list[Instances]): length N list of Instances. The i-th Instances contains object proposals for the i-th input image, with fields “proposal_boxes” and “objectness_logits”.
targets (list[Instances], optional): length N list of Instances. The i-th Instances contains the ground-truth per-instance annotations for the i-th input image. Specify targets during training only. It may have the following fields:
gt_boxes: the bounding box of each instance.
gt_classes: the label for each instance with a category ranging in [0, #class].
gt_masks: PolygonMasks or BitMasks, the ground-truth masks of each instance.
gt_keypoints: NxKx3, the ground-truth keypoints for each instance.
labeled (bool, optional): whether the image has a ground-truth label. Default: True
- Outputs:
list[Instances]: length N list of Instances containing the detected instances. Returned during inference only; may be [] during training.
dict[str->Tensor]: mapping from a named loss to a tensor storing the loss. Used during training only.
Semantic Segmentation
Keypoint Detection
PoseResNet
- tllib.vision.models.keypoint_detection.pose_resnet.pose_resnet101(num_keypoints, pretrained_backbone=True, deconv_with_bias=False, finetune=False, progress=True, **kwargs)
Constructs a Simple Baseline model with a ResNet-101 backbone.
- Parameters
num_keypoints (int) – number of keypoints
pretrained_backbone (bool, optional) – If True, returns a model pre-trained on ImageNet. Default: True.
deconv_with_bias (bool, optional) – Whether to use bias in the deconvolution layer. Default: False
finetune (bool, optional) – Whether to use a 10x smaller learning rate in the backbone. Default: False
progress (bool, optional) – If True, displays a progress bar of the download to stderr. Default: True
- class tllib.vision.models.keypoint_detection.pose_resnet.PoseResNet(backbone, upsampling, feature_dim, num_keypoints, finetune=False)
Simple Baseline for keypoint detection.
- Parameters
backbone (torch.nn.Module) – Backbone to extract 2-d features from data
upsampling (torch.nn.Module) – Layer to upsample image feature to heatmap size
feature_dim (int) – The dimension of the features from upsampling layer.
num_keypoints (int) – Number of keypoints
finetune (bool, optional) – Whether to use a 10x smaller learning rate in the backbone. Default: False
- class tllib.vision.models.keypoint_detection.pose_resnet.Upsampling(in_channel=2048, hidden_dims=(256, 256, 256), kernel_sizes=(4, 4, 4), bias=False)
3-layer deconvolution used in Simple Baseline.
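A construction sketch for the Simple Baseline model. The keypoint count and the 256 x 256 input resolution are example values, not requirements stated above; the model predicts one heatmap per keypoint:

import torch
from tllib.vision.models.keypoint_detection.pose_resnet import pose_resnet101

# Simple Baseline with a ResNet-101 backbone; 21 keypoints is just an example value.
model = pose_resnet101(num_keypoints=21, pretrained_backbone=True)

images = torch.rand(2, 3, 256, 256)  # dummy batch; real inputs are cropped person images
heatmaps = model(images)
print(heatmaps.shape)  # one predicted heatmap per keypoint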
Joint Loss
- class tllib.vision.models.keypoint_detection.loss.JointsMSELoss(reduction='mean')
Typical MSE loss for keypoint detection.
- Parameters
reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output. Default: 'mean'
- Inputs:
output (tensor): heatmap predictions
target (tensor): heatmap labels
target_weight (tensor): whether the keypoint is visible. All keypoints are treated as visible if None. Default: None.
- Shape:
output: \((minibatch, K, H, W)\) where K means the number of keypoints, and H and W are the height and width of the heatmap respectively.
target: \((minibatch, K, H, W)\).
target_weight: \((minibatch, K)\).
Output: scalar by default. If reduction is 'none', then \((minibatch, K)\).
- class tllib.vision.models.keypoint_detection.loss.JointsKLLoss(reduction='mean', epsilon=0.0)
KL Divergence for keypoint detection proposed by Regressive Domain Adaptation for Unsupervised Keypoint Detection.
- Parameters
reduction (str, optional) – Specifies the reduction to apply to the output: 'none' | 'mean'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output. Default: 'mean'
- Inputs:
output (tensor): heatmap predictions
target (tensor): heatmap labels
target_weight (tensor): whether the keypoint is visible. All keypoints are treated as visible if None. Default: None.
- Shape:
output: \((minibatch, K, H, W)\) where K means the number of keypoints, and H and W are the height and width of the heatmap respectively.
target: \((minibatch, K, H, W)\).
target_weight: \((minibatch, K)\).
Output: scalar by default. If reduction is 'none', then \((minibatch, K)\).
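Both joint losses take the same heatmap shapes documented above; a minimal sketch with dummy data (the batch size, keypoint count, and heatmap resolution are example values):

import torch
from tllib.vision.models.keypoint_detection.loss import JointsMSELoss, JointsKLLoss

B, K, H, W = 4, 21, 64, 64
output = torch.rand(B, K, H, W)   # heatmap predictions
target = torch.rand(B, K, H, W)   # heatmap labels
target_weight = torch.ones(B, K)  # 1 where the keypoint is visible

mse = JointsMSELoss()(output, target, target_weight)
kl = JointsKLLoss()(output, target, target_weight)
print(mse.item(), kl.item())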
Re-Identification
Models
- class tllib.vision.models.reid.resnet.ReidResNet(*args, **kwargs)
Modified ResNet architecture for ReID from Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification (ICLR 2020). We change the stride of layer4_group1_conv2 and layer4_group1_downsample1 to 1. During the forward pass, we do not activate self.relu. Please refer to the source code for details.
@author: Baixu Chen @contact: cbx_99_hasta@outlook.com
- tllib.vision.models.reid.resnet.reid_resnet18(pretrained=False, progress=True, **kwargs)
Constructs a Reid-ResNet-18 model.
- tllib.vision.models.reid.resnet.reid_resnet34(pretrained=False, progress=True, **kwargs)
Constructs a Reid-ResNet-34 model.
- tllib.vision.models.reid.resnet.reid_resnet50(pretrained=False, progress=True, **kwargs)
Constructs a Reid-ResNet-50 model.
- tllib.vision.models.reid.resnet.reid_resnet101(pretrained=False, progress=True, **kwargs)
Constructs a Reid-ResNet-101 model.
- class tllib.vision.models.reid.identifier.ReIdentifier(backbone, num_classes, bottleneck=None, bottleneck_dim=-1, finetune=True, pool_layer=None)
Person re-identifier from Bag of Tricks and A Strong Baseline for Deep Person Re-identification (CVPR 2019). Given 2-d features \(f\) from the backbone network, the authors pass \(f\) through another BatchNorm1d layer to obtain \(bn\_f\), which is then passed through a Linear layer to output predictions. During training, \(f\) is used to compute the triplet loss, while during testing \(bn\_f\) is used as the feature. This may be a little confusing; the figures in the original paper will help you understand it better.
- property features_dim
The dimension of features before the final head layer
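A construction sketch (751 identities is just an example value, the number of identities in the Market-1501 training set):

from tllib.vision.models.reid.resnet import reid_resnet50
from tllib.vision.models.reid.identifier import ReIdentifier

backbone = reid_resnet50(pretrained=True)        # modified ResNet-50 backbone for ReID
model = ReIdentifier(backbone, num_classes=751)  # BatchNorm bottleneck + Linear head as described above
print(model.features_dim)                        # dimension of features before the final head layer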
Loss
- class tllib.vision.models.reid.loss.TripletLoss(margin, normalize_feature=False)
Triplet loss augmented with batch hard from In defense of the Triplet Loss for Person Re-Identification (ICCV 2017).
Sampler
- class tllib.utils.data.RandomMultipleGallerySampler(dataset, num_instances=4)
Sampler from In defense of the Triplet Loss for Person Re-Identification (ICCV 2017). Assuming there are \(N\) identities in the dataset, this implementation simply samples \(K\) images for every identity to form an iteration of size \(N\times K\). During training, we call the __iter__ method of the pytorch dataloader once we reach a StopIteration; this guarantees that every image in the dataset will eventually be selected and no training data is wasted.
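A usage sketch, where reid_dataset is a placeholder for a person ReID training set whose items can be grouped by identity, and the batch size is an example value:

from torch.utils.data import DataLoader
from tllib.utils.data import RandomMultipleGallerySampler

# reid_dataset: placeholder for a ReID training set (images annotated with person ids).
sampler = RandomMultipleGallerySampler(reid_dataset, num_instances=4)

# With num_instances=4 and batch_size=32, each mini-batch holds 8 identities
# with 4 images per identity, which is what batch-hard triplet mining expects.
loader = DataLoader(reid_dataset, batch_size=32, sampler=sampler, drop_last=True)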