I'm currently trying to fine-tune an image-captioning model and I'm getting this error:
ValueError: Expected input batch_size (3) to match target batch_size (27).
I'm fairly sure the loss function is what's configured incorrectly, but I'm new to PyTorch and don't know how to set it up properly.
The model setup:
```python
dataset = CustomDataset(image_folder=args.image_folder,
                        image_to_caption=image_to_caption,
                        transform=transforms.ToTensor())
dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True, drop_last=False)

model = FineTuneModel(args.embed_size, args.hidden_size, vocab_size, args.num_layers)
model = model.to(device)
encoder = EncoderCNN(args.embed_size).to(device)
decoder = DecoderRNN(args.embed_size, args.hidden_size, len(vocab), args.num_layers).to(device)

criterion = nn.CrossEntropyLoss()
params = list(decoder.parameters()) + list(encoder.linear.parameters()) + list(encoder.bn.parameters())
optimizer = torch.optim.Adam(params, lr=args.learning_rate)
```
The training loop:
```python
for epoch in range(args.num_epochs):
    model.train()
    total_loss = 0
    for i, (images, captions, lengths) in enumerate(dataloader):
        images = images.to(device)
        captions = captions.to(device)
        targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]

        features = encoder(images)
        outputs = decoder(features, captions, lengths)
        loss = criterion(outputs, targets)

        decoder.zero_grad()
        encoder.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f'Epoch [{epoch+1}/{args.num_epochs}], Loss: {total_loss/len(dataloader)}')

    torch.save(model.decoder.state_dict(),
               os.path.join(args.fine_path, 'decoder-1-1.ckpt'))
    torch.save(model.encoder.state_dict(),
               os.path.join(args.fine_path, 'encoder-1-1.ckpt'))

print("num_epochs: ", args.num_epochs)
```
I'm so sorry, I picked a draft by accident: at the start of the question I mentioned a batch_size error, but I've since solved that issue, and now the error is "Not enough values to unpack (expected 3, got 2)".
Hi @Henriquept,
Can you share the stacktrace (rather than just the error message) as well? That’ll point to the line where the error is occurring!
Sure:
UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
return img, torch.tensor(tokenized_captions)
Traceback (most recent call last):
  File "/pytorch-tutorial/tutorials/03-advanced/image_captioning/finetuneme.py", line 159, in <module>
    main(args)
  File "/home/finetuneme.py", line 105, in main
    for i, (images, captions, lengths) in enumerate(dataloader):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 3, got 2)
Your dataloader object is only returning 2 items, instead of the 3 (images, captions, lengths) that you've placed inside your for-loop. I'd print out the contents of the dataloader and check what it's iterating over and go from there.
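For example, something like this quick check (a minimal sketch, using the dataloader you defined above) would show exactly what each batch contains:

```python
# Minimal sketch: pull one batch from the dataloader and inspect it.
batch = next(iter(dataloader))
print(type(batch), len(batch))  # e.g. 2 items -> only (images, captions), no lengths
for item in batch:
    # tensors have a .shape attribute; print anything else directly
    print(item.shape if hasattr(item, "shape") else item)
```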
Thanks, I've changed it to:
```python
for i, (images, captions) in enumerate(dataloader):
    images = images.to(device)
    captions = captions.to(device)
    lengths = [len(cap) for cap in captions]

    outputs = model(images, captions, lengths)
    loss = criterion(outputs.squeeze(0), captions.flatten())

    model.zero_grad()
    loss.backward()
    optimizer.step()
    total_loss += loss.item()
```
and now the error is:
Expected input batch_size (3) to match target batch_size (27).
line 112, in main
loss = criterion(outputs.squeeze(0), captions.flatten())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Check the sizes of the outputs and captions tensors; they most likely have different shapes. Perhaps the flatten command should be over a specific dim (rather than the entire tensor).
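For instance, a couple of debug prints right before the loss call (using the variable names from your loop) would make the mismatch obvious:

```python
# Debug prints right before the loss call; names taken from the loop above.
print("outputs:", outputs.shape)
print("captions:", captions.shape)
print("captions.flatten():", captions.flatten().shape)
```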
Here are the prints:
Outputs: torch.Size([3, 9956])
Captions: torch.Size([1, 3, 9])
This is how I standardize the photo & the caption:
```python
def __getitem__(self, idx):
    img_name = os.path.join(self.image_folder, self.images_names[idx])
    img = Image.open(img_name)
    img = transform(img)
    #img = torch.randn(256, 256)
    #img.unsqueeze_(0).repeat(3, 1, 1)

    def tokenize(caption, vocabulary):
        words = caption.split()
        tokens = []
        for word in words:
            if word in vocabulary.word2idx:
                tokens.append(vocabulary(word))
        return tokens

    captions = self.image_to_captions[str(idx)]
    tokenized_captions = [torch.tensor(tokenize(caption, vocab)) for caption in captions]
    tokenized_captions = pad_sequence(tokenized_captions, batch_first=True)
    return img, torch.tensor(tokenized_captions)
```
For the nn.CrossEntropyLoss(), shouldn't the input tensors be the same shape? If you do outputs.squeeze(0), the shapes passed to the loss function are [1, 3, 9956] and [1, 3, 9] respectively, which aren't the same.
Perhaps you need to map the outputs tensor to a reduced shape and then pass it to the loss function?
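For reference, nn.CrossEntropyLoss accepts either class-index targets (logits of shape [N, C] with a target of shape [N]) or probability targets with the same shape as the logits (the probability-target form needs a reasonably recent PyTorch). A small self-contained illustration with made-up sizes, not your tensors:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(3, 5)                    # [N=3, C=5]

class_targets = torch.tensor([1, 0, 4])       # [N] of class indices (long dtype)
print(criterion(logits, class_targets))

prob_targets = torch.softmax(torch.randn(3, 5), dim=1)  # same shape as logits, float
print(criterion(logits, prob_targets))
```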
Yes, I'm sure they need to be the same shape, but I don't fully understand how to do that. If I map the output to the same shape as the caption, doesn't it lose its value?
You'll lose some information in the mapping from the 9956-length vector to the 9-length vector, but it shouldn't be too much of a problem.
A simple way to project would be to use an nn.Linear object to project from 9956 to 9, via something like,

```python
linear = nn.Linear(9956, 9)
reduced_outputs = linear(outputs)  # affine projection from 9956 to 9
```
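One caveat if you go this route: the nn.Linear would normally be created once (e.g. in the model's __init__ or before the training loop) rather than inside the loop, so it isn't re-initialised with random weights on every batch, and its parameters would need to be added to the optimizer if it's meant to be trained. A rough sketch, with `project` as an illustrative name:

```python
# Sketch only: build the projection once, on the same device as the model.
project = nn.Linear(9956, 9).to(device)

# then, inside the training loop:
reduced_outputs = project(outputs)  # [3, 9956] -> [3, 9]
```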
Outputs: torch.Size([3, 9956])
Outputs: torch.Size([1, 3, 9])
Captions: torch.Size([1, 3, 9])
Traceback (most recent call last):
File "/home/es/Documents/projects/anothertry/pytorch-tutorial/tutorials/03-advanced/image_captioning/finetuneme.py", line 168, in <module>
main(args)
File "/homeles/Documents/projects/anothertry/pytorch-tutorial/tutorials/03-advanced/image_captioning/finetuneme.py", line 119, in main
loss = criterion(outputs.squeeze(0), captions.flatten())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home//anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/es/anaconda3/lib/python3.11/site-packages/torch/nn/modules/loss.py", line 1185, in forward
return F.cross_entropy(input, target, weight=self.weight,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/T/anaconda3/lib/python3.11/site-packages/torch/nn/functional.py", line 3086, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: Expected input batch_size (3) to match target batch_size (27).
```python
for i, (images, captions) in enumerate(dataloader):
    images = images.to(device)
    captions = captions.to(device)
    lengths = [len(cap) for cap in captions]

    outputs = model(images, captions, lengths)
    print("Outputs: ", outputs.size())

    linear = nn.Linear(9956, 9).to(device)
    reduced_outputs = linear(outputs)
    reduced_outputs = reduced_outputs.unsqueeze(0)
    print("Outputs: ", reduced_outputs.size())
    print("Captions: ", captions.size())

    loss = criterion(outputs.squeeze(0), captions.flatten())

    model.zero_grad()
    loss.backward()
    optimizer.step()
    total_loss += loss.item()

print(f'Epoch [{epoch+1}/{args.num_epochs}], Loss: {total_loss/len(dataloader)}')
```
When you flatten captions it becomes shape [1, 27], whereas outputs is shape [1, 3, 9]. You need to print the shapes out and check they're the same.
I removed the flatten because the size was already the same as outputs:
loss = criterion(outputs.squeeze(0), captions)
But now it changes to: ValueError: Expected input batch_size (3) to match target batch_size (1). If I remove the squeeze(0) it's still the same error.
EDIT:
I forgot that I separated the outputs from the reduced outputs. I've fixed it and now the error is:
Error:
RuntimeError: Expected floating point type for target with class probabilities, got Long
line 119, in main
loss = criterion(reduced_outputs, captions)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The two tensors you pass to your loss function are of different dtypes (one is torch.float32, the other is torch.long). You need to cast them to the same type (torch.float32), via .to(dtype=torch.float32).
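For context on why the cast is needed: when the input and target passed to nn.CrossEntropyLoss have the same shape, PyTorch treats the target as class probabilities, which must be floating point. A minimal sketch of the suggested cast, using the variable names from your loop:

```python
# Cast the captions to float before the loss call, since a same-shape
# target is interpreted as class probabilities.
loss = criterion(reduced_outputs, captions.to(dtype=torch.float32))
```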
Thanks, now this error appears:
RuntimeError: Expected input size [3, 9], got [3, 9, 256]
Again, you need to make sure the shapes are the same and track the operation that led to the mismatch in shapes.
When I try to print the shapes I get:

print("Captions shape: ", captions.shape())
                          ^^^^^^^^^^^^^^^^
TypeError: 'torch.Size' object is not callable
The .shape attribute isn't a method; you just need to print captions.shape.
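For example:

```python
print("Captions shape: ", captions.shape)   # .shape is an attribute, no parentheses
print("Captions size: ", captions.size())   # .size() is the method equivalent
```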
Prints:
Before Outputs size: torch.Size([3, 9956])
Before Outputs shape: torch.Size([3, 9956])
Outputs size: torch.Size([1, 3, 9])
Captions size: torch.Size([1, 3, 9])
Captions shape: torch.Size([1, 3, 9])
Outputs shape: torch.Size([1, 3, 9])
Error:
RuntimeError: Expected input size [3, 9], got [3, 9, 256]
EDIT: If I force the same size by doing outputs = outputs.view(3, 9), it brings back the issue RuntimeError: Expected input size [3, 9], got [3, 9, 256].