[Q] wandb: ERROR Failed to sample metric: Not Supported

In the official example:

https://colab.research.google.com/github/wandb/examples/blob/master/colabs/pytorch/Simple_PyTorch_Integration.ipynb

Step 3: Track gradients with wandb.watch() and everything else with wandb.log()

def train(model, loader, criterion, optimizer, config):
    # Tell wandb to watch what the model gets up to: gradients, weights, and more!
    wandb.watch(model, criterion, log="all", log_freq=10)

    # Run training and track with wandb
    total_batches = len(loader) * config.epochs
    example_ct = 0  # number of examples seen
    batch_ct = 0
    for epoch in tqdm(range(config.epochs)):
        for _, (images, labels) in enumerate(loader):

            loss = train_batch(images, labels, model, optimizer, criterion)
            example_ct += len(images)
            batch_ct += 1

            # Report metrics every 25th batch
            if ((batch_ct + 1) % 25) == 0:
                train_log(loss, example_ct, epoch)

def train_batch(images, labels, model, optimizer, criterion):
    images, labels = images.to(device), labels.to(device)
    
    # Forward pass ➡
    outputs = model(images)
    loss = criterion(outputs, labels)
    
    # Backward pass ⬅
    optimizer.zero_grad()
    loss.backward()

    # Step with optimizer
    optimizer.step()

    return loss

  • Define train_log()

    def train_log(loss, example_ct, epoch):
        # Where the magic happens
        wandb.log({"epoch": epoch, "loss": loss}, step=example_ct)
        print(f"Loss after {str(example_ct).zfill(5)} examples: {loss:.3f}")
    

And here is my code:

(most of the other code is the same as the example)

def train(model, train_dataloader, loss_function, optimizer, config, device):
    model.train()
    # tell wandb to watch the model
    wandb.watch(model, loss_function, log="all", log_freq=10)
    total_batches = len(train_dataloader) * config.epochs
    example_count = 0
    batch_count = 0
    print(f"Use {device} to train!")
    for epoch in tqdm(range(config.epochs)):
        for _, (images, labels) in enumerate(train_dataloader):
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = loss_function(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            example_count += len(images)
            batch_count += 1
            # report metrics every 25 batches
            if (batch_count + 1) % 25 == 0:
                wandb.log({"epoch": epoch, "loss": loss}, step=example_count)
        print(f"epoch {epoch+1} loss: {loss}")
    print("TRAIN FINISHED!")

I want to use wandb.watch() to track the gradients and other parameters, but only “loss”, “accuracy”, and “epoch” show up in my dashboard.

Everything else works well except for tracking parameters. I don’t know whether “wandb: ERROR Failed to sample metric: Not Supported” is the root cause of the failure to track parameters such as gradients and weights.

By the way, when I use wandb without wandb.watch(), the “wandb: ERROR Failed to sample metric: Not Supported” error still appears, but it does not seem to affect wandb.log().
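
As a sanity check, here is a minimal, self-contained sketch (toy model, random data, and a hypothetical project name, so not the actual training code from this issue) that should produce “gradients/…” and “parameters/…” histogram panels in the dashboard if wandb.watch() itself is working:

import torch
import torch.nn as nn
import wandb

run = wandb.init(project="watch-debug")  # hypothetical project name

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# watch() must come after wandb.init() and receive the model that is actually trained
wandb.watch(model, loss_function, log="all", log_freq=10)

for step in range(200):
    images = torch.randn(64, 10)
    labels = torch.randint(0, 2, (64,))

    outputs = model(images)
    loss = loss_function(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # per the wandb docs, watch() emits gradient/parameter histograms every
    # `log_freq` batches, so train long enough for them to appear
    wandb.log({"loss": loss.item()})

wandb.finish()

If the histograms show up in this toy run but not in the real training run, the problem is more likely in the surrounding setup (for example, where init/watch are called) than in watch() itself.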

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments: 21 (5 by maintainers)

Most upvoted comments

It works well with wandb==0.13.4 on Windows.

@ShaohanTian @wangerlie does this issue only occur for you in the most recent versions? Would it be possible for you to check with 0.13.4?
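
If it helps to confirm which version a run is actually using, a quick check (sketch):

import wandb
print(wandb.__version__)  # should print 0.13.4 after pinning to that release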

Hi @ewth, it appears that you are getting a slightly different error, wandb: ERROR Failed to serialize metric: division by zero. Any chance you’re using define_metric? Could you check whether you’re dividing by zero anywhere during your training?
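
For context, a hedged sketch of a typical define_metric setup (names here are illustrative, not taken from @ewth’s code); a custom step or a rate such as correct / total with total == 0 is the kind of place a division by zero can sneak in:

import wandb

run = wandb.init(project="define-metric-demo")  # hypothetical project name

# tie "loss" to a custom x-axis instead of wandb's default step
wandb.define_metric("example_count")
wandb.define_metric("loss", step_metric="example_count")

for example_count in range(100, 1100, 100):
    wandb.log({"loss": 1.0 / example_count, "example_count": example_count})

wandb.finish()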