No GPU, No Party: Fine-Tune BERT for Sentiment Analysis with Vertex AI Custom Jobs | by Benjamin Etienne | Jun, 2024

Fine-tuning a BERT model on social media data

Getting and preparing the data

The dataset we’ll use comes from Kaggle; you can download it here: https://www.kaggle.com/datasets/farisdurrani/sentimentsearch (CC BY 4.0 License). In my experiments, I only used the datasets from Facebook and Twitter.

The following snippet will take the csv files and save 3 splits (train, validation, and test) wherever you want. I recommend saving them in Google Cloud Storage.

You can run the script with:

python make_splits.py --output-dir gs://your-bucket/
import pandas as pd
import argparse
import numpy as np
from sklearn.model_selection import train_test_split

def make_splits(output_dir):
    df = pd.concat([
        pd.read_csv("data/farisdurrani/twitter_filtered.csv"),
        pd.read_csv("data/farisdurrani/facebook_filtered.csv")
    ])
    df = df.dropna(subset=['sentiment'], axis=0)
    # Map the raw sentiment score to 3 classes: negative -> 0, neutral -> 1, positive -> 2
    df['Target'] = df['sentiment'].apply(lambda x: 1 if x == 0 else np.sign(x) + 1).astype(int)

    df_train, df_ = train_test_split(df, stratify=df['Target'], test_size=0.2)
    df_eval, df_test = train_test_split(df_, stratify=df_['Target'], test_size=0.5)

    print(f"Files will be saved in {output_dir}")
    df_train.to_csv(output_dir + "/train.csv", index=False)
    df_eval.to_csv(output_dir + "/eval.csv", index=False)
    df_test.to_csv(output_dir + "/test.csv", index=False)

    print(f"Train : ({df_train.shape}) samples")
    print(f"Val : ({df_eval.shape}) samples")
    print(f"Test : ({df_test.shape}) samples")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--output-dir')
    args, _ = parser.parse_known_args()
    make_splits(args.output_dir)

The data should look roughly like this:

(image from author)
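To make the label mapping concrete, here is a small sketch with made-up rows (illustrative only; the real rows come from the Kaggle csv files) showing how the raw sentiment score becomes the 3-class Target used below (0 = negative, 1 = neutral, 2 = positive):

import numpy as np
import pandas as pd

# Hypothetical rows for illustration -- the real data comes from the Kaggle files
df = pd.DataFrame({
    "bodyText": ["I love this!", "meh", "worst day ever"],
    "sentiment": [0.8, 0.0, -0.5],
})

# Same mapping as in make_splits: 0 -> neutral (1), negative -> 0, positive -> 2
df["Target"] = df["sentiment"].apply(lambda x: 1 if x == 0 else np.sign(x) + 1).astype(int)
print(df[["bodyText", "sentiment", "Target"]])
#          bodyText  sentiment  Target
# 0    I love this!        0.8       2
# 1             meh        0.0       1
# 2  worst day ever       -0.5       0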

Using a small pretrained BERT model

For our model, we’ll use a lightweight BERT model, BERT-Tiny. This model has already been pretrained on vast amounts of data, but not necessarily on social media data and not necessarily with the objective of doing sentiment analysis. This is why we will fine-tune it.

It contains only 2 layers with a 128-unit hidden dimension; the full list of models can be seen here if you want to pick a larger one.

Let’s first create a main.py file with all the necessary modules:

import pandas as pd
import time
import torch
import torch.nn as nn
import logging
import argparse
import numpy as np

from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertModel

def train_and_evaluate(**params):
    pass
    # will be updated as we go

Let’s also write down our requirements in a dedicated requirements.txt

transformers==4.40.1
torch==2.2.2
pandas==2.0.3
scikit-learn==1.3.2
gcsfs

We will now load 2 components to train our model:

  • The tokenizer, which will handle splitting the text inputs into tokens that BERT has been trained with.
  • The model itself.

You can obtain both from Huggingface here. You can also download them to Cloud Storage. That’s what I did, and will therefore load them with:


# Load pretrained tokenizer and BERT model
tokenizer = BertTokenizer.from_pretrained('models/bert_uncased_L-2_H-128_A-2/vocab.txt')
model = BertModel.from_pretrained('models/bert_uncased_L-2_H-128_A-2')
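
If you prefer not to copy the weights to Cloud Storage, a minimal alternative sketch is to pull them directly from the Hugging Face Hub by name (assuming the google/bert_uncased_L-2_H-128_A-2 checkpoint identifier, which corresponds to this BERT-Tiny variant):

from transformers import BertTokenizer, BertModel

# Assumed Hub identifier for BERT-Tiny (2 layers, hidden size 128); adjust if you use another checkpoint
MODEL_ID = "google/bert_uncased_L-2_H-128_A-2"

tokenizer = BertTokenizer.from_pretrained(MODEL_ID)
model = BertModel.from_pretrained(MODEL_ID)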

Let’s now add the following piece to our file:

class SentimentBERT(nn.Module):
    def __init__(self, bert_model):
        super().__init__()
        self.bert_module = bert_model
        self.dropout = nn.Dropout(0.1)
        self.final = nn.Linear(in_features=128, out_features=3, bias=True)

        # Uncomment the lines below if you only want to retrain certain layers.
        # self.bert_module.requires_grad_(False)
        # for param in self.bert_module.encoder.parameters():
        #     param.requires_grad = True

    def forward(self, inputs):
        ids, mask, token_type_ids = inputs['ids'], inputs['mask'], inputs['token_type_ids']
        # print(ids.size(), mask.size(), token_type_ids.size())
        x = self.bert_module(ids, mask, token_type_ids)
        x = self.dropout(x['pooler_output'])
        out = self.final(x)
        return out

A little break here. We have several options when it comes to reusing an existing model.

  • Transfer learning: we freeze the weights of the model and use it as a “feature extractor”, so we can append additional layers downstream. This is frequently used in Computer Vision, where models like VGG, Xception, etc. can be reused to train a custom model on small datasets
  • Fine-tuning: we unfreeze all or part of the weights of the model and retrain the model on a custom dataset. This is the preferred approach when training custom LLMs.

More details on transfer learning and fine-tuning here:

In our model, we have chosen to unfreeze the whole model, but feel free to freeze one or several layers of the pretrained BERT module and see how it influences the performance.

The key part here is to add a fully connected layer after the BERT module to “link” it to our classification task, hence the final layer with 3 units. This will allow us to reuse the pretrained BERT weights and adapt our model to our task.
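As a quick illustration of the frozen option, here is a minimal sketch (reusing the model loaded earlier and the SentimentBERT class above) that freezes the BERT module and checks that only the final linear layer remains trainable:

# Sketch: freeze the pretrained BERT weights and train only the classification head
classifier = SentimentBERT(bert_model=model)
classifier.bert_module.requires_grad_(False)

trainable = sum(p.numel() for p in classifier.parameters() if p.requires_grad)
total = sum(p.numel() for p in classifier.parameters())
print(f"Trainable parameters: {trainable} / {total}")
# Only the final layer (128 * 3 weights + 3 biases = 387 parameters) should remain trainable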

Creating the dataloaders

To create the dataloaders we will need the tokenizer loaded above. The tokenizer takes a string as input and returns several outputs, among which we can find the tokens (‘input_ids’ in our case):

The BERT tokenizer is a bit special and will return several outputs, but the most important one is the input_ids: these are the tokens used to encode our sentence. They might be words, or parts of words. For example, the word “looking” might be made of two tokens, “look” and “##ing”.
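Here is a minimal sketch of those outputs, reusing the tokenizer loaded earlier (the exact ids and the way words get split depend on the vocabulary):

example = tokenizer.encode_plus(
    "Looking forward to this!",
    add_special_tokens=True,      # adds the [CLS] and [SEP] tokens
    max_length=16,
    padding="max_length",         # pad up to max_length
    return_attention_mask=True,
)

print(list(example.keys()))
# ['input_ids', 'token_type_ids', 'attention_mask']

# Map the ids back to tokens to see how the sentence was split
print(tokenizer.convert_ids_to_tokens(example["input_ids"]))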

Let’s now create a dataloader module which will handle our datasets:

class BertDataset(Dataset):
    def __init__(self, df, tokenizer, max_length=100):
        super(BertDataset, self).__init__()
        self.df = df
        self.tokenizer = tokenizer
        self.target = self.df['Target']
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):

        X = self.df['bodyText'].values[idx]
        y = self.target.values[idx]

        inputs = self.tokenizer.encode_plus(
            X,
            pad_to_max_length=True,
            add_special_tokens=True,
            return_attention_mask=True,
            max_length=self.max_length,
            truncation=True,  # truncate longer posts so every sample keeps a fixed length
        )
        ids = inputs["input_ids"]
        token_type_ids = inputs["token_type_ids"]
        mask = inputs["attention_mask"]

        x = {
            'ids': torch.tensor(ids, dtype=torch.long).to(DEVICE),
            'mask': torch.tensor(mask, dtype=torch.long).to(DEVICE),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long).to(DEVICE)
        }
        y = torch.tensor(y, dtype=torch.long).to(DEVICE)

        return x, y
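
As a quick sanity check, a sketch of how this dataset plugs into a PyTorch DataLoader could look like this (assuming the train split saved earlier is readable at that path, and reusing the tokenizer from before):

import pandas as pd
from torch.utils.data import DataLoader

# Assumed path -- adjust to wherever you saved the splits (gcsfs lets pandas read gs:// paths)
df_train = pd.read_csv("gs://your-bucket/train.csv")

train_ds = BertDataset(df_train, tokenizer, max_length=100)
train_loader = DataLoader(dataset=train_ds, batch_size=32, shuffle=True)

# Inspect one batch: the inputs are a dict of tensors, the labels a tensor of class ids
inputs, labels = next(iter(train_loader))
print(inputs["ids"].shape, labels.shape)   # expected: torch.Size([32, 100]) torch.Size([32])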

Writing the main script to train the model

Let us first define two functions to handle the training and evaluation steps:

def train(epoch, model, dataloader, loss_fn, optimizer, max_steps=None):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 50
    start_time = time.time()

    for idx, (inputs, label) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(inputs)

        loss = loss_fn(predicted_label, label)
        loss.backward()
        optimizer.step()

        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)

        if idx % log_interval == 0:
            elapsed = time.time() - start_time
            print(
                "Epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f} | loss {:8.3f} ({:.3f}s)".format(
                    epoch, idx, len(dataloader), total_acc / total_count, loss.item(), elapsed
                )
            )
            total_acc, total_count = 0, 0
            start_time = time.time()

        if max_steps is not None:
            if idx == max_steps:
                return {'loss': loss.item(), 'acc': total_acc / total_count}

    return {'loss': loss.item(), 'acc': total_acc / total_count}


def evaluate(model, dataloader, loss_fn):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (inputs, label) in enumerate(dataloader):
            predicted_label = model(inputs)
            loss = loss_fn(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)

    return {'loss': loss.item(), 'acc': total_acc / total_count}

We’re getting closer to getting our main script up and running. Let’s stitch the pieces together. We have:

  • A BertDataset class to handle the loading of the data
  • A SentimentBERT model which takes our Tiny-BERT model and adds an additional layer for our custom use case
  • train() and evaluate() functions to handle the training and evaluation steps
  • A train_and_evaluate() function that bundles everything together

We will use argparse to be able to launch our script with arguments. Such arguments are typically the train/eval/test files, so we can run our model with any dataset, the path where our model will be saved, and parameters related to the training.

import pandas as pd
import time
import torch.nn as nn
import torch
import logging
import numpy as np
import argparse

from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertModel

logging.basicConfig(format='%(asctime)s [%(levelname)s]: %(message)s', level=logging.DEBUG)
logging.getLogger().setLevel(logging.INFO)

# --- CONSTANTS ---
BERT_MODEL_NAME = 'small_bert/bert_en_uncased_L-2_H-128_A-2'

if torch.cuda.is_available():
    logging.info(f"GPU: {torch.cuda.get_device_name(0)} is available.")
    DEVICE = torch.device('cuda')
else:
    logging.info("No GPU available. Training will run on CPU.")
    DEVICE = torch.device('cpu')

# --- Data preparation and tokenization ---
class BertDataset(Dataset):
    def __init__(self, df, tokenizer, max_length=100):
        super(BertDataset, self).__init__()
        self.df = df
        self.tokenizer = tokenizer
        self.target = self.df['Target']
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):

        X = self.df['bodyText'].values[idx]
        y = self.target.values[idx]

        inputs = self.tokenizer.encode_plus(
            X,
            pad_to_max_length=True,
            add_special_tokens=True,
            return_attention_mask=True,
            max_length=self.max_length,
            truncation=True,  # truncate longer posts so every sample keeps a fixed length
        )
        ids = inputs["input_ids"]
        token_type_ids = inputs["token_type_ids"]
        mask = inputs["attention_mask"]

        x = {
            'ids': torch.tensor(ids, dtype=torch.long).to(DEVICE),
            'mask': torch.tensor(mask, dtype=torch.long).to(DEVICE),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long).to(DEVICE)
        }
        y = torch.tensor(y, dtype=torch.long).to(DEVICE)

        return x, y

# --- Model definition ---
class SentimentBERT(nn.Module):
    def __init__(self, bert_model):
        super().__init__()
        self.bert_module = bert_model
        self.dropout = nn.Dropout(0.1)
        self.final = nn.Linear(in_features=128, out_features=3, bias=True)

    def forward(self, inputs):
        ids, mask, token_type_ids = inputs['ids'], inputs['mask'], inputs['token_type_ids']
        x = self.bert_module(ids, mask, token_type_ids)
        x = self.dropout(x['pooler_output'])
        out = self.final(x)
        return out

# --- Training loop ---
def train(epoch, model, dataloader, loss_fn, optimizer, max_steps=None):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 50
    start_time = time.time()

    for idx, (inputs, label) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(inputs)

        loss = loss_fn(predicted_label, label)
        loss.backward()
        optimizer.step()

        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)

        if idx % log_interval == 0:
            elapsed = time.time() - start_time
            print(
                "Epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f} | loss {:8.3f} ({:.3f}s)".format(
                    epoch, idx, len(dataloader), total_acc / total_count, loss.item(), elapsed
                )
            )
            total_acc, total_count = 0, 0
            start_time = time.time()

        if max_steps is not None:
            if idx == max_steps:
                return {'loss': loss.item(), 'acc': total_acc / total_count}

    return {'loss': loss.item(), 'acc': total_acc / total_count}


# --- Validation loop ---
def evaluate(model, dataloader, loss_fn):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (inputs, label) in enumerate(dataloader):
            predicted_label = model(inputs)
            loss = loss_fn(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)

    return {'loss': loss.item(), 'acc': total_acc / total_count}

# --- Main function ---
def train_and_evaluate(**params):

    logging.info("running with the following params :")
    logging.info(params)

    # Load pretrained tokenizer and BERT model
    # update the paths to whichever you are using
    tokenizer = BertTokenizer.from_pretrained('models/bert_uncased_L-2_H-128_A-2/vocab.txt')
    model = BertModel.from_pretrained('models/bert_uncased_L-2_H-128_A-2')

    # Training parameters
    epochs = int(params.get('epochs'))
    batch_size = int(params.get('batch_size'))
    learning_rate = float(params.get('learning_rate'))

    # Load the data
    df_train = pd.read_csv(params.get('training_file'))
    df_eval = pd.read_csv(params.get('validation_file'))
    df_test = pd.read_csv(params.get('testing_file'))

    # Create dataloaders
    train_ds = BertDataset(df_train, tokenizer, max_length=100)
    train_loader = DataLoader(dataset=train_ds, batch_size=batch_size, shuffle=True)
    eval_ds = BertDataset(df_eval, tokenizer, max_length=100)
    eval_loader = DataLoader(dataset=eval_ds, batch_size=batch_size)
    test_ds = BertDataset(df_test, tokenizer, max_length=100)
    test_loader = DataLoader(dataset=test_ds, batch_size=batch_size)

    # Create the model
    classifier = SentimentBERT(bert_model=model).to(DEVICE)
    total_parameters = sum([np.prod(p.size()) for p in classifier.parameters()])
    model_parameters = filter(lambda p: p.requires_grad, classifier.parameters())
    trainable_parameters = sum([np.prod(p.size()) for p in model_parameters])
    logging.info(f"Total params : {total_parameters} - Trainable : {trainable_parameters} ({trainable_parameters/total_parameters*100}% of total)")

    # Optimizer and loss functions
    optimizer = torch.optim.Adam([p for p in classifier.parameters() if p.requires_grad], learning_rate)
    loss_fn = nn.CrossEntropyLoss()

    # If dry run, we only train for a single step
    logging.info(f'Training model with {BERT_MODEL_NAME}')
    if params.get('dry_run'):
        logging.info("Dry run mode")
        epochs = 1
        steps_per_epoch = 1
    else:
        steps_per_epoch = None

    # Action!
    for epoch in range(1, epochs + 1):
        epoch_start_time = time.time()
        train_metrics = train(epoch, classifier, train_loader, loss_fn=loss_fn, optimizer=optimizer, max_steps=steps_per_epoch)
        eval_metrics = evaluate(classifier, eval_loader, loss_fn=loss_fn)

        print("-" * 59)
        print(
            "End of epoch {:3d} - time: {:5.2f}s - loss: {:.4f} - accuracy: {:.4f} - valid_loss: {:.4f} - valid accuracy {:.4f} ".format(
                epoch, time.time() - epoch_start_time, train_metrics['loss'], train_metrics['acc'], eval_metrics['loss'], eval_metrics['acc']
            )
        )
        print("-" * 59)

    if params.get('dry_run'):
        # If dry run, we do not run the evaluation on the test set
        return None

    test_metrics = evaluate(classifier, test_loader, loss_fn=loss_fn)

    metrics = {
        'train': train_metrics,
        'val': eval_metrics,
        'test': test_metrics,
    }
    logging.info(metrics)

    # save model weights to a single file
    if params.get('job_dir') is None:
        logging.warning("No job dir provided, model will not be saved")
    else:
        logging.info("Saving model to {} ".format(params.get('job_dir')))
        torch.save(classifier.state_dict(), params.get('job_dir'))
    logging.info("Bye bye")

if __name__ == '__main__':
    # Create arguments here
    parser = argparse.ArgumentParser()
    parser.add_argument('--training-file', required=True, type=str)
    parser.add_argument('--validation-file', required=True, type=str)
    parser.add_argument('--testing-file', type=str)
    parser.add_argument('--job-dir', type=str)
    parser.add_argument('--epochs', type=float, default=2)
    parser.add_argument('--batch-size', type=float, default=1024)
    parser.add_argument('--learning-rate', type=float, default=0.01)
    parser.add_argument('--dry-run', action="store_true")

    # Parse them
    args, _ = parser.parse_known_args()

    # Execute training
    train_and_evaluate(**vars(args))
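
With the arguments defined above, a quick local run could look something like this (paths and values are illustrative; --dry-run trains a single step to validate the setup):

python main.py \
    --training-file gs://your-bucket/train.csv \
    --validation-file gs://your-bucket/eval.csv \
    --testing-file gs://your-bucket/test.csv \
    --job-dir model.pt \
    --epochs 10 \
    --batch-size 128 \
    --learning-rate 0.0001 \
    --dry-run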

This is great, but unfortunately, this model will take a long time to train. Indeed, with around 4.7M parameters to train, one step takes around 3s on a 16GB MacBook Pro with an Intel chip.

3s per step can be quite long when you have 1238 steps to go and 10 epochs to complete: roughly 3 × 1238 × 10 ≈ 37,000 seconds, i.e. more than 10 hours of training…

No GPU, no party.
