I’m currently trying to fine-tune an image-captioning model and I’m getting this error:
ValueError: Expected input batch_size (3) to match target batch_size (27).
I’m fairly sure it’s the loss function that’s incorrect, but I’m new to PyTorch and don’t know how to configure it correctly.
The model setup:

    dataset = CustomDataset(image_folder=args.image_folder,
                            image_to_caption=image_to_caption,
                            transform=transforms.ToTensor())

    dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True, drop_last=False)

    model = FineTuneModel(args.embed_size, args.hidden_size, vocab_size, args.num_layers)
    model = model.to(device)

    encoder = EncoderCNN(args.embed_size).to(device)
    decoder = DecoderRNN(args.embed_size, args.hidden_size, len(vocab), args.num_layers).to(device)
    
    
    criterion = nn.CrossEntropyLoss()
    params = list(decoder.parameters()) + list(encoder.linear.parameters()) + list(encoder.bn.parameters())

    optimizer = torch.optim.Adam(params, lr=args.learning_rate)

The training loop:

    for epoch in range(args.num_epochs):
        model.train()
        total_loss = 0
        for i, (images, captions, lengths) in enumerate(dataloader):
            images = images.to(device)
            captions = captions.to(device)
            targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]
            
            features = encoder(images)
            outputs = decoder(features, captions, lengths)
            loss = criterion(outputs, targets)
            decoder.zero_grad()
            encoder.zero_grad()
            loss.backward()
            optimizer.step()
            print(f'Epoch [{epoch+1}/{args.num_epochs}], Loss: {total_loss/len(dataloader)}')

        torch.save(model.decoder.state_dict(),
                os.path.join(args.fine_path,
                                'decoder-1-1.ckpt'))

        torch.save(model.encoder.state_dict(),
                os.path.join(args.fine_path,
                                'encoder-1-1.ckpt'))
    
    print("num_epochs: ",args.num_epochs)

I’m so sorry, I picked a draft by accident. At the start of the question I mentioned the batch_size error, but I’ve solved that issue, and now the error is "Not enough values to unpack (expected 3, got 2)".

Hi @Henriquept,

Can you share the stacktrace (rather than just the error message) as well? That’ll point to the line where the error is occurring!

Sure:
UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  return img, torch.tensor(tokenized_captions)
Traceback (most recent call last):
  File "/pytorch-tutorial/tutorials/03-advanced/image_captioning/finetuneme.py", line 159, in <module>
    main(args)
  File "/home/finetuneme.py", line 105, in main
    for i, (images, captions, lengths) in enumerate(dataloader):
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 3, got 2)

Your dataloader object is only returning 2 items, instead of the 3 (images, captions, lengths) that you’ve placed inside your for-loop. I’d print out the contents of the dataloader, check what it’s iterating over, and go from there.
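As a rough sketch (reusing the dataloader defined above), something like this will show how many items each batch contains and what they are:

```python
# Pull a single batch and inspect it: how many items, their types, and their shapes.
batch = next(iter(dataloader))
print(len(batch))  # 2 here, not the 3 your for-loop expects

for item in batch:
    # tensors have a .shape attribute; plain lists (e.g. lengths) do not
    print(type(item), getattr(item, "shape", item))
```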

Thanks, I’ve changed it to:

```python
for i, (images, captions) in enumerate(dataloader):
    images = images.to(device)
    captions = captions.to(device)

    lengths = [len(cap) for cap in captions]
    outputs = model(images, captions, lengths)

    loss = criterion(outputs.squeeze(0), captions.flatten())
    model.zero_grad()
    loss.backward()

    optimizer.step()

    total_loss += loss.item()
```
and now the error is:
Expected input batch_size (3) to match target batch_size (27).

line 112, in main
    loss = criterion(outputs.squeeze(0), captions.flatten())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Check the sizes of the outputs and captions tensors; they likely have different shapes. Perhaps the flatten call should be over a specific dim (rather than the entire tensor).
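For example (a sketch using the variable names from your loop), print the shapes right before the loss call, and note that flatten can be restricted to a range of dims:

```python
# Print the shapes of the exact tensors handed to the loss function.
print("Outputs: ", outputs.shape)
print("Captions:", captions.shape)

# torch.Tensor.flatten accepts start_dim/end_dim, so you can merge only the
# leading dims and keep the last one instead of flattening everything:
flat_captions = captions.flatten(start_dim=0, end_dim=1)
print("Flattened captions:", flat_captions.shape)
```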

Here are the prints:

Outputs:  torch.Size([3, 9956])
Captions:  torch.Size([1, 3, 9])

This is how I standardize the photo and the caption:

    def __getitem__(self, idx):
        img_name = os.path.join(self.image_folder, self.images_names[idx])
        img = Image.open(img_name)
        img = transform(img)

        #img = torch.randn(256, 256)
        #img.unsqueeze_(0).repeat(3, 1, 1)
        def tokenize(caption, vocabulary):
            words = caption.split()
            tokens = []
            for word in words:
                if word in vocabulary.word2idx:
                    tokens.append(vocabulary(word))
            return tokens

        captions = self.image_to_captions[str(idx)]
        tokenized_captions = [torch.tensor(tokenize(caption, vocab)) for caption in captions]
        tokenized_captions = pad_sequence(tokenized_captions, batch_first=True)

        return img, torch.tensor(tokenized_captions)

For nn.CrossEntropyLoss(), shouldn’t the input and target shapes line up? With outputs.squeeze(0) and captions.flatten(), the shapes passed to the loss function are [3, 9956] and [27] respectively, which don’t match.

Perhaps you need to map the outputs tensor to a reduced shape and then pass it to the loss function?
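For reference, here's a minimal self-contained sketch of the shapes nn.CrossEntropyLoss expects when the targets are class indices (the 3 and 9956 below just mirror the sizes printed above). This is also how the loop at the top of the thread lines things up: pack_padded_sequence flattens the padded captions into a 1-D tensor of token indices that matches the decoder outputs.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Class-index targets: logits are [N, num_classes], targets are [N] of dtype long.
logits = torch.randn(3, 9956)           # stand-in for the decoder outputs above
targets = torch.randint(0, 9956, (3,))  # one token index per output row
loss = criterion(logits, targets)
print(loss.item())
```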

Yes, I’m sure they need to be the same shape, but I don’t fully understand how to do that. If I map the output to the same shape as the caption, doesn’t it lose its value?

You’ll lose some information in the mapping from the 9956-length vector down to the 9-length vector, but it shouldn’t be too much of a problem.

A simple way would be to use an nn.Linear layer to project from 9956 down to 9, via something like:

    linear = nn.Linear(9956, 9)
    reduced_outputs = linear(outputs)  # affine projection from 9956 to 9
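Applied to the sizes printed in this thread, a self-contained sketch of that projection (the 9956 and 9 are just the numbers from your prints):

```python
import torch
import torch.nn as nn

outputs = torch.randn(3, 9956)     # stand-in for the decoder outputs above
linear = nn.Linear(9956, 9)        # affine projection from 9956 down to 9
reduced_outputs = linear(outputs)
print(reduced_outputs.shape)       # torch.Size([3, 9])
```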
I tried that; here are the prints, the traceback, and my loop:

Outputs:  torch.Size([3, 9956])
Outputs:  torch.Size([1, 3, 9])
Captions:  torch.Size([1, 3, 9])
Traceback (most recent call last):
  File "/home/es/Documents/projects/anothertry/pytorch-tutorial/tutorials/03-advanced/image_captioning/finetuneme.py", line 168, in <module>
    main(args)
  File "/homeles/Documents/projects/anothertry/pytorch-tutorial/tutorials/03-advanced/image_captioning/finetuneme.py", line 119, in main
    loss = criterion(outputs.squeeze(0), captions.flatten())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home//anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/es/anaconda3/lib/python3.11/site-packages/torch/nn/modules/loss.py", line 1185, in forward
    return F.cross_entropy(input, target, weight=self.weight,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/T/anaconda3/lib/python3.11/site-packages/torch/nn/functional.py", line 3086, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Expected input batch_size (3) to match target batch_size (27).
    for i, (images, captions) in enumerate(dataloader):
        images = images.to(device)
        captions = captions.to(device)

        lengths = [len(cap) for cap in captions]
        outputs = model(images, captions, lengths)
        print("Outputs: ", outputs.size())
        linear = nn.Linear(9956, 9).to(device)
        reduced_outputs = linear(outputs)
        reduced_outputs = reduced_outputs.unsqueeze(0)
        print("Outputs: ", reduced_outputs.size())
        print("Captions: ", captions.size())

        loss = criterion(outputs.squeeze(0), captions.flatten())
        model.zero_grad()
        loss.backward()

        optimizer.step()

        total_loss += loss.item()
        print(f'Epoch [{epoch+1}/{args.num_epochs}], Loss: {total_loss/len(dataloader)}')

When you flatten captions it becomes shape [27], whereas the outputs you pass to the loss are shape [3, 9956]. You need to print the shapes of the exact tensors you pass to the loss and check that they line up.
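A quick way to confirm what is actually reaching the loss (a sketch using the names from your loop):

```python
print(captions.flatten().shape)     # torch.Size([27])      -> 27 target values
print(outputs.squeeze(0).shape)     # torch.Size([3, 9956]) -> input batch of 3
print(reduced_outputs.shape)        # torch.Size([1, 3, 9])
```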

I removed the flatten because the size was already the same as outputs:
loss = criterion(outputs.squeeze(0), captions)
But now it changes to: ValueError: Expected input batch_size (3) to match target batch_size (1). If I remove the squeeze(0), it’s still the same error.

EDIT:
I forgot that I separated the outputs from the reduced outputs. I’ve fixed it and now the error is:

Error:
RuntimeError: Expected floating point type for target with class probabilities, got Long
line 119, in main
loss = criterion(reduced_outputs, captions)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The two tensors you pass to your loss function have different dtypes (one is torch.float32, the other is torch.long). You need to cast them to the same type (torch.float32), via .to(dtype=torch.float32).
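A minimal sketch of that cast, using the names from your loop (with a float target of the same shape as the input, CrossEntropyLoss treats it as class probabilities rather than class indices):

```python
# Cast the integer caption tensor to float32 so both loss arguments share a dtype.
loss = criterion(reduced_outputs, captions.to(dtype=torch.float32))
```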

Thanks, now this error appears:
RuntimeError: Expected input size [3, 9], got [3, 9, 256]

Again, you need to make sure the shapes are the same size and track down the operation that led to the mismatch in shapes.

When I try to print the shapes with print("Captions shape: ", captions.shape()), I get: TypeError: 'torch.Size' object is not callable

The .shape attribute isn’t a method; you just need to print captions.shape.

Prints:
Before Outputs size: torch.Size([3, 9956])
Before Outputs shape: torch.Size([3, 9956])
Outputs size: torch.Size([1, 3, 9])
Captions size: torch.Size([1, 3, 9])
Captions shape: torch.Size([1, 3, 9])
Outputs shape: torch.Size([1, 3, 9])

Error:
RuntimeError: Expected input size [3, 9], got [3, 9, 256]

EDIT: If I force the same size by doing outputs = outputs.view(3, 9), it brings back the issue: RuntimeError: Expected input size [3, 9], got [3, 9, 256]