Advancement in artificial intelligence and natural language processing has revolutionized how many people interact with data and media. Large language models (LLMs)—also known as transformer-based models—are at the forefront of this advancement. LLMs have garnered attention for their ability to generate human-like language, and have emerged as a transformative force, facilitating content generation, retrieval, and translation. LLMs may be customized to fit a specific use-case in a process known as “fine-tuning.” Here, engineers take a pre-trained language model and further train it on domain-specific data to adapt its language generation capabilities for particular tasks or applications. This paper explores natural language processing, developing LLMs, and fine-tuning applications.
🌸👋🏻 Join 10,000+ followers! Let’s take this to your inbox. You’ll receive occasional emails about whatever’s on my mind—offensive security, open source, academics, boats, software freedom, you get the idea.
Language
Natural language refers to the communication system humans use for verbal and written expression. It’s characterized by its flexibility, ambiguity, and complexity. Natural language can be contrasted against artificial languages, which are languages designed for specific purposes (e.g., programming languages, mathematical notations, and communication among fictional communities).
Artificial languages typically have rigid structures and rules. For this reason, humans use artificial languages like Python and Assembly to communicate with computers and other electronic systems.
Artificial intelligence (AI) is the “simulation of human intelligence in machines,” enabling them to perform tasks that typically require human cognitive abilities (News Index). AI’s relation to artificial and natural language lies in the development of language models, a type of computational model that leverages AI techniques to process and generate human-like and artificial languages.
Computational models
Computational models are mathematical or algorithmic representations (e.g., y = mx + b) of real-world systems or processes that can be simulated or executed using computers. Accordingly, a primary goal of these models is to understand the behavior and properties of a studied phenomenon.
Language models are a type of computational model; Generative Pre-trained Transformer 3 (GPT-3) is one popular example. These models leverage statistical and probabilistic techniques to understand and predict the structure and meaning of human language, enabling tasks such as text generation and machine translation.
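As a toy illustration of this statistical approach, the sketch below builds a hypothetical bigram model: it counts which word follows which in a tiny invented corpus, then predicts the most frequent successor.

```python
from collections import Counter, defaultdict

# Toy corpus; a real language model trains on vastly more text.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each word follows each other word (bigram counts).
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent next word after `word` in the corpus."""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" most often here
```

Models like GPT-3 replace these raw counts with learned neural representations, but the underlying goal of predicting the next token is the same.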
Machine translation is the automated process of converting text or speech from one language into another using computational methods and algorithms.
Natural language processing (NLP) is a “field of AI focusing on the interaction between computers and human language” (GMS). NLP enables machines to process and generate natural language through computational models. Humans and their communication often inspire the development of these models.
Artificial neural networks
For example, artificial neural networks (ANNs) are computational models inspired by the interconnected neurons in the human brain. They can perform tasks such as pattern recognition, classification, and regression.
Regression is a statistical technique. It models the relationship between a dependent variable and one or more independent variables to predict numerical values.
ANNs consist of interconnected nodes organized in layers. Each node takes input from the previous layer, or directly from the input layer in the case of the first layer. Then, each node produces an output, which becomes the input for the next layer.
Nodes are similar to human neurons; they can perform operations on incoming information—specifically, the input data—within the network.
The layers are similar to how humans process information—hierarchically and abstractly—with each layer learning more complex and higher-level representations from the input data. Put simply, the layers of nodes are analogous to how humans sequence correlations between events, behavior, and concepts hierarchically.
Text processing
For instance, when humans read and understand text, they engage in a hierarchical comprehension process, just like ANNs do when processing sequential data like text. When reading a sentence or a paragraph, humans start by processing individual words and basic linguistic elements.
As they move forward, they group words into phrases and sentences, understanding the relationships and context between different parts of the text. This hierarchical processing allows humans to extract meaning, infer connections, and interpret the overall message conveyed by the text.
Similarly, in NLP models like ANNs, the architecture’s design mimics this hierarchical understanding. The layers of nodes in the network sequentially process words and phrases, learning higher-level patterns and representations as the information flows through the network.
This hierarchical approach allows NLP models to capture the contextual dependencies between words, sentences, and paragraphs, enabling them to comprehend and generate human-like language.
Artificial layering techniques
Neural networks are structured in layers, which enable hierarchical learning and abstraction of features from the input data. There are three broad types of layers: input, hidden, and output.
The input layer receives the raw data, such as images, text, or numerical features, and passes it to the first hidden layer.
The hidden layers, as the name suggests, are not directly observable from the outside and serve to process and learn features from the data in an intermediate fashion.
The output layer produces the final output of the network, which could be a classification label, a numerical value, etc.
Layering in neural networks allows the model to learn complex patterns and representations by progressively extracting more abstract features in the hidden layers. Consequently, this hierarchical approach to learning allows neural networks to capture intricate relationships in the data and make high-level abstractions.
The depth of the network contributes to the network’s ability to learn and represent increasingly complex patterns and dependencies in the input data. This attribute is why deep neural networks with multiple hidden layers became a significant area of research.
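To make the input, hidden, and output flow concrete, here is a minimal sketch of a forward pass through such layers. The weights and biases are made-up values; a real network would learn them during training.

```python
def layer(inputs, weights, biases):
    """One fully connected layer: each node computes a weighted sum plus a bias."""
    return [sum(w * x for w, x in zip(node_weights, inputs)) + b
            for node_weights, b in zip(weights, biases)]

# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
x = [1.0, 2.0]                                             # input layer
hidden = layer(x, [[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1])   # hidden layer
output = layer(hidden, [[1.0, -1.0]], [0.0])               # output layer
print(output)  # roughly [-0.9]
```

Stacking more `layer` calls deepens the network, which is exactly what the "deep" in deep learning refers to.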
ANN layering example
Let’s consider an example of a text classification task using a neural network.
Suppose we have a dataset of movie reviews with each review labeled as “positive” or “negative” sentiment. We aim to build a neural network that automatically classifies new movie reviews into these sentiment categories.
The neural network would use an input layer that takes the text data. Then, the neural network would convert each word in the review into a numerical representation, such as word embeddings.
The network would have one or more hidden layers with multiple neurons. Each layer processes the embeddings to learn patterns and relationships between words. The model then applies a non-linear activation function, such as ReLU, to the hidden layer neurons to introduce non-linearity.
Non-linearity is a mathematical or conceptual relationship that does not follow a straight-line or proportional pattern between input and output.
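For example, the ReLU activation mentioned above is just a clamp at zero, which is what makes the layer's output non-linear:

```python
def relu(x):
    """ReLU: pass positive values through unchanged, clamp negatives to zero."""
    return max(0.0, x)

print([relu(v) for v in [-2.0, -0.5, 0.0, 1.5]])  # [0.0, 0.0, 0.0, 1.5]
```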
Then, the network adjusts its weights and biases during training, guided by the error between its predicted sentiment for each review and the actual labeled sentiment in the training data. This adjustment happens through back-propagation, by which the network learns to recognize the linguistic patterns and sentiment cues indicative of positive or negative movie reviews.
Once trained, the neural network can take in new, unclassified movie reviews and predict their sentiment as either positive or negative based on the learned language patterns and features in the text.
Numerical representation example
Word embedding
Before giving an example of how the input layer in ANNs converts words to a numerical representation, we need to understand word embeddings.
Word embedding is an NLP technique that represents words as dense vectors in a continuous space. Basically, it captures semantic relationships between words.
Dense vectors
A dense vector is a mathematical representation containing real-numbered values that efficiently encode multiple features of an entity, such as a word in NLP (e.g., [0.2, -1.3, 0.8]).
Continuous space
Continuous space refers to a mathematical domain where values can take any real number, such as the range of possible coordinates on a line (e.g., all points between 0 and 1).
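To tie these two definitions together, the sketch below compares dense vectors using cosine similarity, a standard way to measure how close two directions in a continuous space are. The vector values are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1 mean very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented dense vectors; real embeddings have hundreds of dimensions.
v1 = [0.2, -1.3, 0.8]
v2 = [0.3, -1.1, 0.9]

print(cosine_similarity(v1, v1))  # ~1.0: a vector is maximally similar to itself
print(cosine_similarity(v1, v2))  # close to 1: the vectors point the same way
```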
Numerical example
Now, we can discuss how an ANN’s input layer can numerically represent words. Here is an example of how words can be converted to numerical representations using word embeddings.
Suppose we have the following movie review.
The movie was fantastic, with a captivating storyline and brilliant performances.
Step 1: Tokenization
The review is first tokenized, breaking it down into individual words.
["The", "movie", "was", "fantastic", ",", "with", "a", "captivating", "storyline", "and", "brilliant", "performances", "."]
Step 2: Vocabulary Creation
Next, we create a vocabulary of unique words present in the entire dataset or corpus. The vocabulary might look like this.
["The", "movie", "was", "fantastic", "with", "a", "captivating", "storyline", "and", "brilliant", "performances", "."]
Step 3: Word Embeddings
We use pre-trained word embeddings, such as Word2Vec or GloVe, which map each word in the vocabulary to a fixed-size numerical vector. These embeddings encode semantic information about each word’s meaning based on the contexts in which it appears in a large corpus.
A corpus is a structured collection of written or spoken texts used for linguistic analysis and research.
As an illustration, suppose the word embeddings are as follows.
"fantastic" -> [0.5, 0.2, -0.3]
"captivating" -> [0.3, -0.1, 0.4]
"brilliant" -> [0.1, 0.6, 0.8]

…and so on for other words in the vocabulary.
Step 4: Numerical Representation
Finally, we represent the original movie review as a sequence of numerical vectors, one for each word in the review, based on their corresponding word embeddings:
["The", "movie", "was", "fantastic", ",", "with", "a", "captivating", "storyline", "and", "brilliant", "performances", "."]
becomes:
[embedding for "The"], [embedding for "movie"], [embedding for "was"], [embedding for "fantastic"], [embedding for ","], [embedding for "with"], [embedding for "a"], [embedding for "captivating"], [embedding for "storyline"], [embedding for "and"], [embedding for "brilliant"], [embedding for "performances"], [embedding for "."]
In this way, the words are converted to numerical representations (word embeddings) that preserve their semantic meaning and can be used as input to the ANNs for sentiment analysis or other NLP tasks.
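A minimal sketch of this lookup step, assuming a hypothetical three-word embedding table (the values from Step 3) and a zero vector for out-of-vocabulary tokens:

```python
# Hypothetical embedding table; real systems load Word2Vec or GloVe vectors.
embeddings = {
    "fantastic":   [0.5, 0.2, -0.3],
    "captivating": [0.3, -0.1, 0.4],
    "brilliant":   [0.1, 0.6, 0.8],
}
UNK = [0.0, 0.0, 0.0]  # fallback for words missing from the table

def embed(tokens):
    """Map each token to its embedding vector, falling back to UNK when unseen."""
    return [embeddings.get(token.lower(), UNK) for token in tokens]

print(embed(["The", "movie", "was", "fantastic"]))
```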
Machine learning
NLP models like ANNs can use various computational techniques to interact with natural language. One popular technique is machine learning (ML), a subset of AI that harnesses algorithms and statistical models to enable computers to learn from data and make predictions without being explicitly programmed for each task.
When NLP models use ML, computers can analyze large amounts of language data, extracting patterns and relationships to interact with human language more effectively and accurately.
Fine-tuning and transfer learning
Fine-tuning in NLP refers to adapting a pre-trained language model to a specific NLP task by updating its parameters on a task-specific dataset, enhancing its ability to perform specialized language-related tasks.
It is a form of transfer learning, a subfield of ML; transfer learning involves taking a pre-trained model and adapting it to perform a specific task or domain by fine-tuning its parameters on a smaller dataset relevant to that task. Fine-tuning is often used in deep learning, a subset of ML that focuses on training ANNs with multiple layers to learn and represent complex patterns and features from data.
Deep learning
Deep learning is a subset of ML. It uses ANNs to model and process complex patterns in data, achieving higher levels of abstraction and accuracy in various tasks.
“Deep” refers to the number of layers in a neural network architecture. Deep neural networks have hidden layers between their input and output layers. The term “deep” reflects the depth of these layered architectures.
The depth of the network allows it to capture and model highly abstract and non-linear relationships within the data, enabling it to perform tasks like image recognition, NLP, and more with substantial accuracy.
Architecture
Each deep learning technique is known as an architecture. For instance, deep architectures refer to neural network models—like the ANNs discussed previously—with multiple layers that enable them to learn complex hierarchical representations from data, allowing for the extraction of intricate features and patterns.
Architectures represent the overall structure and design of ANNs, including the arrangement of nodes, connections, and layers, which determine how data is processed and transformed to perform specific tasks.
Different architectures are tailored to handle various types of data and functions, such as image recognition (e.g., convolutional neural networks), sequence modeling (e.g., recurrent neural networks), or language understanding (e.g., transformer-based architectures). Each architecture is optimized to extract relevant features and relationships from the data, allowing deep learning models to achieve high performance and accuracy in myriad applications.
Transformers
Transformer-based architecture is foundational to LLMs and NLP, serving as the core structure that enables LLMs to excel in NLP tasks by effectively capturing contextual relationships and patterns in text data.
Transformer architectures are deep learning models that leverage self-attention mechanisms to process and capture long-range dependencies in sequential data, making them effective at understanding and generating natural language.
Self-attention
Self-attention mechanisms are a key component of transformer architectures. They allow a model to weigh the importance of different positions in a sequence to generate contextually informed representations for each element based on its interactions with other elements in the sequence.
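A bare-bones sketch of this idea follows, where each position's vector acts directly as its own query, key, and value. Real transformers first apply learned projection matrices, which are omitted here.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Minimal self-attention over a sequence of vectors."""
    d = len(seq[0])
    out = []
    for q in seq:                                  # one query per position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq]                    # similarity to every key
        weights = softmax(scores)                  # attention weights
        out.append([sum(w * v[i] for w, v in zip(weights, seq))
                    for i in range(d)])            # weighted sum of values
    return out

# Three toy 2-dimensional "word vectors".
attended = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(attended)
```

Each output vector blends information from every position in the sequence, weighted by similarity, which is how distant but related words can influence each other's representations.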
Long-range
A long-range dependency refers to a relationship between elements in a sequence that are distant from each other, making it challenging for traditional models to capture and understand the contextual connections between these elements without losing relevant information.
The cat that I adopted from the shelter yesterday is playful.
In the sentence above, there is a long-range dependency between the word “cat” and the word “playful.” The word “cat” is the subject of the sentence, and the adjective “playful” describes the cat’s behavior. However, “playful” appears several words away from “cat” in the sentence. Capturing this long-range dependency is crucial for understanding the relationship between the subject and adjective in the sentence.
NLP models with self-attention mechanisms, such as transformer architectures, can effectively handle such long-range dependencies and generate contextually informed representations, leading to better NLP capabilities.
Transformers and NLP
NLP models use transformers by employing self-attention mechanisms to process and understand sequential data, such as sentences or documents, effectively capturing contextual relationships between words and generating contextually informed representations. Transformers themselves leverage deep learning techniques, specifically neural networks, to process natural language text.
The core component of transformers is the self-attention mechanism, which allows the model to focus on different parts of the input sequence simultaneously, capturing long-range dependencies and interactions between elements. This attention mechanism enables transformers to overcome the limitations of traditional sequential models, like recurrent neural networks (RNNs), in handling long-range dependencies efficiently.
RNNs are ANNs designed to handle sequential data by maintaining internal memory, allowing them to capture temporal dependencies, while traditional ANNs process individual data points independently.
In summary, transformers are a specific type of neural network architecture that has become influential in NLP due to their ability to effectively process sequential data with long-range dependencies.
So, what are LLMs again?
Remember, LLMs are NLP models based on transformer architectures that have been pre-trained on large amounts of text data.
The LLMs leverage deep learning by using multiple layers of transformer-based neural networks to learn intricate patterns and context in language data during their pre-training phase. Here, deep learning enables the models to capture semantic relationships and contextual information effectively.
In LLMs, neural networks are used within the transformer architecture to process sequential data, such as words or sentences, through self-attention mechanisms.
These mechanisms allow the model to focus on different parts of the input sequence simultaneously, capturing dependencies and interrelations between words to create contextually informed language representations, which form the foundation for their exceptional language understanding and generation capabilities.
Lastly, LLMs use fine-tuning, a process of adjusting and optimizing a pre-trained model, by training the pre-existing ML model on domain-specific or task-specific data to enhance its performance for specialized applications.
Transformer-based neural networks excel at capturing long-range dependencies in sequential data by employing self-attention mechanisms, whereas traditional ANNs process data without specialized mechanisms for handling long-range relationships.
Fine-tuning
Fine-tuning involves various techniques for adjusting and refining pre-trained ML models, including gradient-based optimization, parameter updates, and training on task-specific data, to enhance performance for specific tasks or domains. Three common approaches are single-task fine-tuning, multi-task learning, and domain adaptation.
Gradient-based optimization
Gradient-based optimization is a method for iteratively adjusting parameters in a model using gradients to minimize a loss function.
A loss function quantifies the difference between the predicted output of an ML model and the actual target values in the training data, and minimizing it helps the model learn optimal parameters that lead to accurate predictions for better performance on new, unseen data.
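A minimal sketch of gradient-based optimization: fitting a one-parameter model y = w * x to toy data by repeatedly stepping opposite the gradient of the mean squared error. The learning rate and data are invented for illustration.

```python
# Toy data generated by the true relationship y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0      # initial parameter
lr = 0.05    # learning rate (step size)
for _ in range(200):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad  # step opposite the gradient to reduce the loss

print(round(w, 3))  # converges near 2.0
```

Fine-tuning an LLM runs this same loop, just over millions of parameters and a task-specific loss.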
Parameter updates
Parameter updates involve adjusting the weights and biases of an ML model using optimization techniques to minimize the loss function during training.
Single-task
Single-task fine-tuning involves refining a pre-trained model on a particular task or dataset to optimize its performance for that specific task alone. It customizes a pre-trained model for a specific task, such as sentiment analysis on customer reviews, by training it further on relevant data to improve its accuracy and effectiveness for that particular task.
Gradient vs single-task
Gradient-based optimization is a method for iteratively adjusting parameters in a model using gradients to minimize a loss function, while single-task fine-tuning involves refining a pre-trained model on a specific task to improve its performance.
Gradient-based optimization and single-task fine-tuning are similar in the sense that both involve adjusting model parameters to improve performance on a specific task. However, gradient-based optimization is a broader concept used during training, while single-task fine-tuning specifically refers to refining a pre-trained model for a particular task.
Multi-task
In general, multi-task learning refers to training a single ML model to perform multiple related tasks simultaneously, leveraging shared knowledge and representations to enhance overall performance across those tasks.
Multi-task learning is used to improve the performance of ML models by allowing them to learn from and share information across multiple related tasks, such as training a model to perform both part-of-speech tagging and named entity recognition, where the knowledge gained from one task can benefit the other and lead to better performance overall.
Domain adaptation
Domain adaptation involves adapting an ML model already trained on one domain (source domain) to perform well on a different, but related, domain (target domain), typically by adjusting the model’s features or parameters to mitigate the differences between the two domains and improve its performance in the target.
Parameter updates vs domain adaptation
Parameter updates involve adjusting the weights and biases of an ML model using optimization techniques to minimize the loss function during training, while domain adaptation refers to adapting a model trained on one domain to perform well on a different but related domain by minimizing distributional discrepancies.
Both parameter updates and domain adaptation involve modifying a model to perform better in specific contexts, but they focus on different aspects: parameter updates refine a model’s internal values during training, while domain adaptation adjusts a model to handle different data distributions.
Evaluating NLP models
Evaluating NLP models involves assessing their performance and effectiveness through various metrics and techniques, such as accuracy, precision, recall, F1 score, perplexity, BLEU score, and human evaluations, to gauge their ability to understand and generate human-like language.
Accuracy
Accuracy measures the correctness of predictions made by an ML model by calculating the ratio of correctly predicted instances to the total number of instances in a dataset, often expressed as a percentage.
Precision
Precision is a metric that calculates the proportion of “true positive” predictions (correctly identified positive instances) over the total number of positive predictions made by a model, providing insight into the model’s ability to avoid false positives.
Precision = TP / (TP + FP)
TP = True positive
FP = False positive
Recall
Recall is a metric that calculates the proportion of “true positive” predictions found over the total number of positive instances marked in a dataset, reflecting the model’s ability to capture all relevant positive cases.
Recall = TP / (TP + FN)
TP = True positives
FN = False negatives

F1 score
F1 scores combine precision and recall by calculating the harmonic mean of these two values, providing a balanced assessment of a model’s performance in terms of false positives and false negatives.
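The four metrics above can be computed directly from the counts in a confusion matrix; the counts below are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical results: 80 true positives, 10 false positives,
# 20 false negatives, 90 true negatives.
acc, prec, rec, f1 = classification_metrics(80, 10, 20, 90)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```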
Perplexity
Perplexity is a measurement used to quantify how well a probability distribution or language model predicts a given sequence of tokens, with lower perplexity indicating better predictive performance.
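A minimal sketch of the calculation, where perplexity is the exponential of the average negative log-probability the model assigned to each token. The probabilities below are invented for illustration.

```python
import math

def perplexity(probabilities):
    """Perplexity = exp of the average negative log-probability per token."""
    n = len(probabilities)
    return math.exp(-sum(math.log(p) for p in probabilities) / n)

# Probabilities a hypothetical model assigned to each token it predicted.
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.2, 0.1, 0.25, 0.15]

print(perplexity(confident))  # low perplexity: good predictions
print(perplexity(uncertain))  # high perplexity: poor predictions
```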
BLEU score
Bilingual Evaluation Understudy (BLEU) evaluates the quality of machine-generated translations or text by comparing them to one or more reference translations, measuring the degree of overlap in n-grams (contiguous sequences of words) between the generated and reference text.
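A simplified sketch of the idea, keeping only clipped unigram precision and the brevity penalty; real BLEU averages precisions over 1- to 4-grams.

```python
import math
from collections import Counter

def unigram_bleu(candidate, reference):
    """Simplified BLEU: clipped unigram precision times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate word's count by its count in the reference.
    overlap = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = overlap / len(cand)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(unigram_bleu("the cat sat on the mat", "the cat is on the mat"))
```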
Experiment: Fine-Tuning LLMs for Sentiment Analysis in Python
Much of LLMs’ potential lies in fine-tuning, which tailors these models to specific tasks, domains, or datasets. Fine-tuning allows developers to leverage the pre-trained knowledge of these models while adapting them to suit their particular needs. This experiment explores the process of fine-tuning LLMs and compares the results with those of traditional ML models.
Notes
16 September 2023: I might develop the machine learning portion of this exercise if further interest is expressed.
30 July 2023: I took on too much work when I signed on to this project, so I created this exercise just with traditional ML techniques, specifically logistic regression with TF-IDF vectorization.
Out of curiosity, I originally wanted to compare the results with TF-IDF vectorizations and fine-tuning LLMs. TF-IDF vectorizations already generate very accurate results with this dataset, yet are considered less accurate than LLMs. However, I hope to continue this project after I submit it. So, this functionality may be available soon.
Abstract
The objective of this experiment is to explore the effectiveness of fine-tuning a pre-trained LLM (such as GPT-3) on a movie sentiment analysis task. We will compare the performance of a fine-tuned LLM with a traditional ML approach (Logistic Regression with TF-IDF) on the same sentiment analysis task.
At present, I only finalized the ML approach, so there is no comparison with a pre-trained LLM.
This experiment investigates the effectiveness of traditional ML techniques—specifically logistic regression with TF-IDF vectorization—for sentiment analysis on the movie review dataset (Stanford). The study explores the performance of the logistic regression model on a subset of movie reviews based on a user-defined sample percentage. Then, the subset goes through text preprocessing, including lowercase conversion, removal of non-word characters, tokenization, and stopword elimination. The evaluation metrics—accuracy, precision, recall, and F1-score—assess the model’s performance on sentiment classification. The findings highlight the applicability of the chosen approach for sentiment analysis tasks and provide insights into the classification performance of movie reviews.
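A sketch of the preprocessing steps described above, assuming a small illustrative stopword list (the experiment's actual list is fuller):

```python
import re

# A small stopword list for illustration; the experiment uses a fuller set.
STOPWORDS = {"the", "a", "an", "was", "with", "and", "is", "of", "to"}

def preprocess(review):
    """Apply the preprocessing steps: lowercase, strip punctuation, tokenize, drop stopwords."""
    text = review.lower()                 # lowercase conversion
    text = re.sub(r"[^\w\s]", " ", text)  # remove non-word characters
    tokens = text.split()                 # tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword elimination

print(preprocess("The movie was fantastic, with a captivating storyline!"))
```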
I will discuss the LLM component if and/or when I finish programming it.
The experiment involves multiple phases.
Methodology
Methodology refers to a systematic approach or set of principles used to guide and structure the process of conducting research, analysis, or problem-solving in a particular field or discipline.
Background
[TLDR: reading this paper]: The background phase provides an overview of pre-trained transformer-based models, elucidating their architecture, pre-training objectives, and the transfer learning paradigm. It covers how these models capture contextual information and develop a strong foundation for subsequent fine-tuning processes. In addition, the background section explores various fine-tuning approaches, such as single-task fine-tuning, multi-task learning, and domain adaptation. Lastly, it analyzes techniques used to determine the effectiveness of fine-tuned models and how they are crucial to gauge the performance of fine-tuning and NLP models, like accuracy, F1 score, and perplexity.
Data preparation
The data preparation phase involves acquiring the dataset and performing necessary text preprocessing to ensure data integrity. Subsequently, the pre-trained model is augmented with a classification head tailored to sentiment analysis, and the fine-tuning process ensues while keeping the pre-trained LLM’s parameters fixed except for the classification head. In parallel, the traditional ML model is constructed using Logistic Regression, and hyperparameters are optimized through cross-validation.
Evaluation
The evaluation phase assesses both models’ performance on the testing set using accuracy, precision, recall, and F1-score metrics. The analysis of the outcomes ascertains the fine-tuned LLM’s strengths and weaknesses relative to the traditional approach, and an examination of transfer learning in NLP tasks displays potential implications.
The experiment reveals that fine-tuning LLMs yields notable improvements in sentiment analysis accuracy compared to traditional ML methods. Transfer learning allows for leveraging pre-existing linguistic knowledge and adapting it to specific downstream tasks, even with limited labeled data. The implications of this approach are discussed, including its potential applications in customer feedback analysis, chatbots, and content generation.
Ultimately, the experiment provides insights into fine-tuning LLMs for sentiment analysis and demonstrates transfer learning in NLP.
Conclusion and continuation
In short, through this paper and exercise, it is clear that LLMs are NLP models based on transformer architectures pre-trained on vast amounts of text data. We learned that deep learning is a subset of ML that uses ANNs to model and process complex patterns in data. Additionally, we learned that neural networks are computational models inspired by the human brain’s interconnected neurons. They consist of layers of interconnected nodes (neurons) that process and transform input data to produce output.
LLMs leverage deep learning by using multiple layers of transformer-based neural networks to learn intricate patterns and context in language data during their pre-training phase. The deep learning enables the models to capture semantic relationships and contextual information effectively. In LLMs, neural networks within the transformer architecture process sequential data, such as words or sentences, through self-attention mechanisms.
These mechanisms allow models to focus on various segments of the input sequence simultaneously. Here, they capture dependencies and interrelations between words to create contextually informed language representations. These form the foundation for their exceptional language understanding and generation capabilities.
Future work
Future research for this project should focus on three main areas:
- Implementing and fine-tuning LLMs for sentiment analysis using pre-trained models.
- Comparing their performance with traditional ML approaches regarding accuracy, precision, recall, and F1-score metrics.
- Exploring the broader implications of transfer learning in NLP, especially in applications such as customer feedback analysis.
These investigations aim to demonstrate the potential benefits of LLMs in sentiment analysis and emphasize the significance of transfer learning in NLP tasks.
Areas of interest: where language models and fine tuning fail
Here are some areas that might be of particular interest.
Sample efficiency and generalization
Presently, fine-tuning methods often require a substantial amount of domain-specific data to adapt a pre-trained model to a new task. Improving the sample efficiency and generalization capabilities of fine-tuned models would be beneficial.
Few-shot and zero-shot learning
Developing models that can perform well with few or zero examples is a challenging yet promising direction. Future work could enhance few-shot and zero-shot learning techniques, which allow models to understand and generalize from very limited task-specific examples.
Domain adaptation and multilingual learning
Enhancing the ability of models to quickly adapt to different domains or languages with minimal data is an important area. Engineers should explore techniques that enable more efficient cross-domain and cross-lingual transfer.
Ethical and bias considerations
In addition, addressing bias and ethical concerns in fine-tuned and large language models is crucial. Future work may involve designing methods to reduce biases, making models more aware of and accountable for potential ethical issues, and ensuring that fine-tuned models adhere to guidelines and principles.
Explainability and interpretability
Creating methods to make fine-tuned models more interpretable and explainable is an ongoing challenge. This includes understanding how models make decisions, providing meaningful explanations for their outputs, and enabling users to trust and verify model behavior.
Structured and formal information handling
Integrating structured and formal information (such as knowledge graphs or databases) into fine-tuned models could lead to more accurate and contextually rich language understanding and generation.
Multi-modal fusion
Exploring how to effectively combine and learn from different modalities (text, images, audio, etc.) in fine-tuning and language models can enable more comprehensive and context-aware understanding.
Continual learning and catastrophic forgetting
Catastrophic forgetting occurs when fine-tuning on a new task erases knowledge learned from previous tasks. Overcoming it is important for creating more versatile and capable fine-tuned models.
Resource-efficient fine-tuning
Additionally, researching techniques to fine-tune models efficiently with limited computational resources is essential for wider adoption and democratization of NLP technologies.
Robustness and adversarial defense
Enhancing fine-tuned models’ robustness to adversarial attacks and unintended inputs remains a significant challenge. It requires methods that can better handle noisy, diverse, or out-of-distribution data.
Interactive and adaptive learning
Lastly, developing methods that allow fine-tuned models to interactively learn and adapt from feedback could lead to more personalized and responsive interactions.
Overall, I hope you enjoyed this post on fine-tuning large language models. Lastly, if you want to learn more about my research, consider reading Open Source Security Economics and Ethics.
This blog post was made for Introduction to Artificial Intelligence (CSCI 331) under Professor Thomas Borrelli.

