Advanced configuration of parameter optimization#

Assume that the network uses the following optimizer for training:

import megengine.optimizer as optim
optimizer = optim.SGD(
    le_net.parameters(),    # parameter list: binds the given parameters to the optimizer
    lr=0.05,                # learning rate
)

This optimizer uses the same learning rate to optimize all parameters. This chapter will introduce how to use different learning rates for different parameters.

Using different learning rates for different parameters#

Optimizer supports grouping the parameters of the network, so that different parameter groups can be trained with different learning rates. A parameter group is represented by a dictionary with the following key-value pairs:

  • 'params': param_list specifies the parameters included in the parameter group. This key-value pair must be present.

  • 'lr': learning_rate specifies the learning rate of this parameter group. This key-value pair can be omitted; when it is omitted, the group uses the learning rate set on the optimizer.

The dictionaries of all parameter groups to be optimized are collected into a list, which is passed in when the Optimizer is instantiated.
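
For example, a parameter-group list passed to an optimizer has the following general shape; param_list_a and param_list_b are placeholder parameter lists used only for illustration, and a complete example with the actual LeNet parameters follows below:

optimizer = optim.SGD(
    [
        {'params': param_list_a},             # no 'lr' given: uses the optimizer's learning rate (0.05)
        {'params': param_list_b, 'lr': 0.01}, # uses its own learning rate of 0.01
    ],
    lr=0.05,
)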

To better illustrate parameter grouping, we first use the :meth:`~.megengine.module.Module.named_parameters` function provided by Module to group the network parameters. This function traverses all parameters of the network, yielding each parameter's name together with the corresponding parameter tensor:

for (name, param) in le_net.named_parameters():
    print(name, param.shape) # print the parameter name and the shape of the corresponding tensor
classifier.bias (10,)
classifier.weight (10, 84)
conv1.bias (1, 6, 1, 1)
conv1.weight (6, 1, 5, 5)
conv2.bias (1, 16, 1, 1)
conv2.weight (16, 6, 5, 5)
fc1.bias (120,)
fc1.weight (120, 400)
fc2.bias (84,)
fc2.weight (84, 120)

Based on the parameter names in ``LeNet``, we can put all convolution parameters into one group and all fully connected layer parameters into another:

conv_param_list = []
fc_param_list = []
for (name, param) in le_net.named_parameters():
    # all convolution parameters form one group, all fully connected layer parameters form another
    if 'conv' in name:
        conv_param_list.append(param)
    else:
        fc_param_list.append(param)

The following code sets a different learning rate for each of the parameter groups defined above:

import megengine.optimizer as optim

optimizer = optim.SGD(
    # the list of parameter groups (param_groups): each group may set its own learning rate,
    # or omit it and use the learning rate set on the optimizer
    [
        {'params': conv_param_list},            # group of convolution parameters, no custom learning rate
        {'params': fc_param_list, 'lr': 0.01}   # group of fully connected layer parameters, learning rate 0.01
    ],
    lr=0.05,  # groups that do not specify a learning rate, e.g. all convolution parameters, use this value
)

The list of parameter groups set in the optimizer is stored in its param_groups attribute, through which we can obtain the learning rate of each parameter group:

# print the number of parameters in each parameter group and the corresponding learning rate
print(len(optimizer.param_groups[0]['params']), optimizer.param_groups[0]['lr'])
print(len(optimizer.param_groups[1]['params']), optimizer.param_groups[1]['lr'])
4 0.05
6 0.01

Changing the learning rate during training#

MegEngine also supports modifying the learning rate during training. For example, once some parameters have been trained sufficiently, they no longer need to be optimized; the learning rate of the corresponding parameter group can then be set to zero. We modify the training code to illustrate this. The modified code trains for four epochs in total. At the end of the second epoch we set the learning rate of all fully connected layer parameters to zero, and at every epoch we print some parameter values of the fully connected layers in LeNet to show whether they are being updated.

print("original parameter: {}".format(optimizer.param_groups[1]['params'][0]))
for epoch in range(4):
    for step, (batch_data, batch_label) in enumerate(dataloader):
        _, loss = train_func(batch_data, batch_label, le_net, gm)
        optimizer.step()  # update parameter values according to the gradients
        optimizer.clear_grad() # set the gradients of the parameters to zero

    # print some parameter values of the fully connected layers in LeNet
    print("epoch: {}, parameter: {}".format(epoch, optimizer.param_groups[1]['params'][0]))

    if epoch == 1:
        # set the learning rate of all fully connected layer parameters to 0.0
        optimizer.param_groups[1]['lr'] = 0.0
        print("\nset lr zero\n")
original parameter: Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], device=xpux:0)
epoch: 0, parameter: Tensor([-0.0102  0.0082  0.0062 -0.0093 -0.0018  0.0132 -0.0064  0.0077 -0.0005 -0.007 ], device=xpux:0)
epoch: 1, parameter: Tensor([-0.0094  0.008   0.0066 -0.0105 -0.0026  0.0141 -0.008   0.0073  0.0015 -0.0071], device=xpux:0)

set lr zero

epoch: 2, parameter: Tensor([-0.0094  0.008   0.0066 -0.0105 -0.0026  0.0141 -0.008   0.0073  0.0015 -0.0071], device=xpux:0)
epoch: 3, parameter: Tensor([-0.0094  0.008   0.0066 -0.0105 -0.0026  0.0141 -0.008   0.0073  0.0015 -0.0071], device=xpux:0)

As the output shows, the parameter values keep being updated before the learning rate is set to 0, and stop changing after it is set to 0.

Most networks also keep reducing the learning rate during training. The following code shows how to decrease the learning rate linearly over the course of training in MegEngine:

total_epochs = 10
learning_rate = 0.05 # initial learning rate
for epoch in range(total_epochs):
    # set the learning rate for the current epoch
    for param_group in optimizer.param_groups: # param_groups contains all parameters that this optimizer updates
        # the learning rate decreases linearly, adjusted once per epoch
        param_group["lr"] = learning_rate * (1 - float(epoch) / total_epochs)

Using different optimizers for different parameters#

Different parameters can also be optimized with different optimizers. If the gradient-clearing (clear_grad) and update (step) operations of all optimizers are performed at the same time, you can define a MultipleOptimizer class: declare several different optimizers at initialization, and apply the corresponding operation to all of them whenever the clearing function and the update function are called.

class MultipleOptimizer(object):
    def __init__(self, *opts):
        self.opts = opts  # keep all the optimizers to be driven together

    def clear_grad(self):
        # clear the gradients of the parameters bound to every optimizer
        for opt in self.opts:
            opt.clear_grad()

    def step(self):
        # update the parameters bound to every optimizer according to their gradients
        for opt in self.opts:
            opt.step()

Suppose you want to use SGD to optimize all convolution parameters and Adam to optimize all fully connected layer parameters. The optimizer can then be defined as follows; different optimizers are applied to different parameters without changing the training code.

optimizer = MultipleOptimizer(
    optim.SGD(conv_param_list, lr=0.05), optim.Adam(fc_param_list, lr=0.01)
)

If the gradients of different parameters are not cleared and updated at the same time, you only need to define multiple optimizers and call the corresponding functions at different times.
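
A minimal sketch of such a schedule, reusing the conv_param_list, fc_param_list, dataloader, train_func, le_net and gm defined earlier in this chapter; the rule of updating only the convolution parameters in the first two epochs and only the fully connected layer parameters afterwards is an arbitrary choice for illustration:

conv_opt = optim.SGD(conv_param_list, lr=0.05)  # drives only the convolution parameters
fc_opt = optim.Adam(fc_param_list, lr=0.01)     # drives only the fully connected layer parameters

for epoch in range(4):
    for step, (batch_data, batch_label) in enumerate(dataloader):
        # gm is assumed to be attached to all parameters of le_net
        _, loss = train_func(batch_data, batch_label, le_net, gm)
        if epoch < 2:
            conv_opt.step()  # first two epochs: update only the convolution parameters
        else:
            fc_opt.step()    # later epochs: update only the fully connected layer parameters
        # clear the gradients of both groups so they do not accumulate across iterations
        conv_opt.clear_grad()
        fc_opt.clear_grad()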

Fixing some parameters so they are not optimized#

Besides grouping the parameters that should not be trained and setting their learning rate to zero, MegEngine offers another way to keep parameters fixed: bind only the parameters to be optimized to the gradient manager and the optimizer. As shown in the following code, only the convolution parameters in ``LeNet`` are optimized:

import megengine.optimizer as optim
from megengine.autodiff import GradManager

le_net = LeNet()
param_list = []
for (name, param) in le_net.named_parameters():
    if 'conv' in name: # train only the convolution parameters in LeNet
        param_list.append(param)

optimizer = optim.SGD(
    param_list, # parameter list
    lr=0.05,    # learning rate
)

gm = GradManager().attach(param_list)

The training code below builds on the setup above and prints the gradient of each parameter, so that the difference between the parameters can be seen more intuitively:

learning_rate = 0.05
total_epochs = 1 # train for only one epoch to keep the output short
for epoch in range(total_epochs):
    # set the learning rate for the current epoch
    for param_group in optimizer.param_groups:
        param_group["lr"] = learning_rate * (1 - float(epoch) / total_epochs)

    total_loss = 0
    for step, (batch_data, batch_label) in enumerate(dataloader):
        batch_data = tensor(batch_data)
        batch_label = tensor(batch_label)
        _, loss = train_func(batch_data, batch_label, le_net, gm)
        optimizer.step()  # update parameter values according to the gradients
        optimizer.clear_grad() # set the gradients of the parameters to zero
        total_loss += loss.numpy().item()

    # print the gradient of each parameter
    for (name, param) in le_net.named_parameters():
        if param.grad is None:
            print(name, param.grad)
        else:
            print(name, param.grad.sum())
classifier.bias None
classifier.weight None
conv1.bias Tensor([-0.0432], device=xpux:0)
conv1.weight Tensor([0.1256], device=xpux:0)
conv2.bias Tensor([0.0147], device=xpux:0)
conv2.weight Tensor([5.0205], device=xpux:0)
fc1.bias None
fc1.weight None
fc2.bias None
fc2.weight None

From the output, it can be seen that except for the convolution parameters, the other parameters have no gradient and will not be updated.