Published: January 30, 2023

Repeating the steps from the Hugging Face course chapter on writing a full training loop: https://huggingface.co/course/chapter3/4?fw=pt

from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from datasets import load_dataset
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_datasets = load_dataset('glue', 'mrpc')

def tokenize_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer)

# modify dataset to be used with pytorch model
tokenized_datasets = tokenized_datasets.remove_columns(['sentence1', 'sentence2', 'idx'])
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')
tokenized_datasets.set_format('torch')
print(f"dataset column names - {tokenized_datasets['train'].column_names}")


# create dataloaders
from torch.utils.data import DataLoader
train_dataloader = DataLoader(tokenized_datasets['train'], shuffle=True, batch_size=8, collate_fn=data_collator)
eval_dataloader = DataLoader(tokenized_datasets['validation'], batch_size=8, collate_fn=data_collator)

batch = next(iter(train_dataloader))
print({k:v.shape for k,v in batch.items()})

# load model with a 2-class classification head (MRPC is a binary paraphrase task)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# outputs = model(**batch)
# print(f"loss - {outputs.loss}. logits.shape - {outputs.logits.shape}")

# optimizer
from transformers import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)

# scheduler
from transformers import get_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler('linear', optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
print(f'num_training_steps - {num_training_steps}')

# detect device and move the model to it
import torch
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# training loop
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k:v.to(device) for k,v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

# evaluation loop
import evaluate
metric = evaluate.load('glue', 'mrpc')
model.eval()
for batch in eval_dataloader:
    batch = {k:v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch['labels'])
metric.compute()
Found cached dataset glue (/Users/achinta/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
Loading cached processed dataset at /Users/achinta/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-b8ca0e77d9b1a107.arrow
Loading cached processed dataset at /Users/achinta/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-2656b493beabb728.arrow
Loading cached processed dataset at /Users/achinta/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-e8576c3577f28c58.arrow
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
dataset column names - ['labels', 'input_ids', 'token_type_ids', 'attention_mask']
{'labels': torch.Size([8]), 'input_ids': torch.Size([8, 72]), 'token_type_ids': torch.Size([8, 72]), 'attention_mask': torch.Size([8, 72])}
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
num_training_steps - 1377
/Users/achinta/miniforge3/envs/ml/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
{'accuracy': 0.8578431372549019, 'f1': 0.8999999999999999}
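
The FutureWarning above suggests switching to the PyTorch implementation of AdamW. A drop-in sketch with the same learning rate (not part of the original run):

from torch.optim import AdamW

# same hyperparameters as above; avoids the transformers deprecation warning
optimizer = AdamW(model.parameters(), lr=5e-5)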

The 'accelerate' version of the code can be found at https://huggingface.co/course/chapter3/4?fw=pt
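
A rough sketch of that accelerate version, following the linked page: the Accelerator takes over device placement, so the manual .to(device) calls and plain loss.backward() go away (checkpoint, train_dataloader and eval_dataloader are reused from the code above):

from accelerate import Accelerator
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler
from tqdm.auto import tqdm

accelerator = Accelerator()
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=5e-5)

# prepare() wraps the dataloaders, model and optimizer for the current device(s)
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler('linear', optimizer=optimizer,
                             num_warmup_steps=0, num_training_steps=num_training_steps)

progress_bar = tqdm(range(num_training_steps))
model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)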


5 The datasets library

3.1 Time to slice and dice

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
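
Following the warning's own suggestion, the environment variable can be set at the top of the notebook, before any tokenizer is used (a minimal sketch):

import os

# set before creating/using any fast tokenizer to silence the fork warning
os.environ['TOKENIZERS_PARALLELISM'] = 'false'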
-rw-r--r--  1 achinta  staff    80M Oct  2  2018 drugsComTrain_raw.tsv
-rw-r--r--  1 achinta  staff    27M Oct  2  2018 drugsComTest_raw.tsv
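
The cell that produced the caching output below isn't shown; a minimal sketch of loading these TSV files, assuming the file names listed above (drug_dataset is a hypothetical variable name):

from datasets import load_dataset

data_files = {'train': 'drugsComTrain_raw.tsv', 'test': 'drugsComTest_raw.tsv'}
# the raw files are tab-separated, so pass delimiter='\t'
drug_dataset = load_dataset('csv', data_files=data_files, delimiter='\t')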
Using custom data configuration default-4eaca5caac99961c
Found cached dataset csv (/Users/achinta/.cache/huggingface/datasets/csv/default-4eaca5caac99961c/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)
CPU times: user 67.3 ms, sys: 37.3 ms, total: 105 ms
Wall time: 1.37 s