Transformers


Neural Network Architecture

  1. Artificial Neural Networks (ANN)
  • Structure: Basic neural networks with fully connected layers (input → hidden → output).
  • Use Case: General-purpose tasks like regression and classification on structured data.
  • Key Feature: Static data flow, no time or spatial context.
  2. Recurrent Neural Networks (RNN)
  • Structure: Includes loops to process sequential data, maintaining hidden states for context.
  • Use Case: Time-series data, text, and speech (e.g., language modeling, sentiment analysis).
  • Key Feature: Captures temporal dependencies, but prone to vanishing gradients.
  3. Convolutional Neural Networks (CNN)
  • Structure: Specialized layers (convolutions, pooling) for spatial hierarchies in data.
  • Use Case: Image and video processing, object detection, and spatial data tasks.
  • Key Feature: Efficient feature extraction from images using localized patterns.
  4. Transformers
  • Structure: Built on self-attention mechanisms, replacing recurrence or convolutions.
  • Use Case: Text (e.g., BERT, GPT), images (Vision Transformers), and multi-modal tasks.
  • Key Feature: Captures global context, scales well for large datasets, and enables parallel processing.
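To make the comparison concrete, here is a minimal sketch that instantiates a tiny member of each family with PyTorch; all layer sizes are arbitrary placeholders, not values from the text.

```python
import torch.nn as nn

# Minimal, illustrative instance of each architecture family; all sizes are placeholders.
ann = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))  # fully connected layers
rnn = nn.RNN(input_size=32, hidden_size=64, batch_first=True)         # loops over time steps with a hidden state
cnn = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)        # localized spatial feature extraction
transformer = nn.TransformerEncoder(                                  # self-attention over the whole sequence
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
    num_layers=2,
)
```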

LSTM-Based Architecture

Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) designed to address the vanishing gradient problem in traditional RNNs, enabling better learning and memory of long-range dependencies in sequences.

Key Components of LSTM-Based Architecture

  1. Cell State:
  • The central component of LSTM, carrying information throughout the sequence.
  • It acts like a memory, maintaining and updating long-term information across time steps.
  2. Gates: LSTMs use gates to control the flow of information. These gates are learned by the network during training. There are three primary gates (a minimal code sketch follows this list):
    1. Forget Gate:
    • Decides which information to discard from the cell state.
    • The forget gate outputs values between 0 and 1, indicating which parts of the cell state should be kept or forgotten.
    2. Input Gate:
    • Decides what new information to store in the cell state.
    • The input gate controls how much new information is written to the cell state, while the candidate memory cell proposes the new values to be added.
    3. Output Gate:
    • Determines what the next hidden state (h_t) should be, which will be used in the next time step and for output.
    • The output gate filters information that will be passed to the next time step and the final output.
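As referenced above, here is a minimal sketch of a single LSTM time step in PyTorch, showing how the forget, input, and output gates and the candidate memory interact with the cell state. The class name, tensor shapes, and sizes are illustrative assumptions, not a specific library's API.

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """One LSTM time step: forget, input, and output gates plus candidate memory."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # A single linear layer produces all four gate pre-activations at once.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        z = self.gates(torch.cat([x_t, h_prev], dim=-1))
        f, i, g, o = z.chunk(4, dim=-1)
        f = torch.sigmoid(f)            # forget gate: what to discard from the cell state
        i = torch.sigmoid(i)            # input gate: how much new information to write
        g = torch.tanh(g)               # candidate memory: potential new information
        o = torch.sigmoid(o)            # output gate: what to expose as the hidden state
        c_t = f * c_prev + i * g        # updated cell state (long-term memory)
        h_t = o * torch.tanh(c_t)       # hidden state passed to the next time step
        return h_t, c_t

# Illustrative usage with made-up sizes: batch of 4, input dim 16, hidden dim 32.
cell = LSTMCellSketch(input_size=16, hidden_size=32)
x_t = torch.randn(4, 16)
h_prev, c_prev = torch.zeros(4, 32), torch.zeros(4, 32)
h_t, c_t = cell(x_t, h_prev, c_prev)
```

PyTorch's built-in nn.LSTM implements the same computation over entire sequences and is what you would normally use in practice.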

Limitations of LSTM

  • Static Word Embeddings:
    • Use fixed word embeddings (e.g., Word2Vec, GloVe).
    • Cannot adjust word representations based on context.
  • Long-Range Dependencies:
    • Struggles to capture long-range dependencies.
    • Context becomes blurry for distant words.
  • No Contextual Adaptation:
    • The embedding for a word is the same regardless of context (e.g., "bank" as a riverbank vs. a financial institution).
  • Sequential Processing:
    • Processes tokens one-by-one, limiting parallelization.
    • Slow training and inference.
  • Difficulty in Complex Contextual Tasks:
    • Struggles with tasks like word sense disambiguation and coreference resolution.
    • Cannot dynamically adjust embeddings for ambiguous meanings.
  • Inefficient Memory Usage:
    • Memory of past tokens can be lost over long sequences.
    • Poor at remembering distant relationships between words.
  • Limited Flexibility:
    • Cannot fine-tune embeddings dynamically based on evolving context.
  • Parallelization Bottleneck:
    • Cannot process the entire sequence at once.
    • Training and inference are slower compared to transformer-based models.

Transformers

Transformers with Self-Attention: A Deep Dive

The Transformer architecture, introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), revolutionized the field of natural language processing (NLP) by abandoning traditional sequential processing (like RNNs or LSTMs) and introducing a parallelizable, attention-based mechanism. One of the key innovations is self-attention, which allows the model to attend to different parts of a sequence simultaneously, without the need to process tokens in order.

The Transformer architecture consists of two main parts:

  • Encoder – Responsible for processing the input sequence and creating a rich, context-aware representation.
  • Decoder – Uses the context from the encoder to generate the output sequence (in tasks like machine translation).
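As a rough sketch of this encoder/decoder split, PyTorch's built-in nn.Transformer module wires both parts together; the model size, number of heads, and sequence lengths below are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Toy encoder/decoder Transformer; all dimensions are placeholders.
model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(1, 10, 64)  # input sequence: 10 token embeddings (encoder side)
tgt = torch.randn(1, 7, 64)   # output generated so far: 7 token embeddings (decoder side)

# The encoder builds context-aware representations of `src`; the decoder attends to them
# while producing the output sequence (causal masks are omitted in this sketch).
out = model(src, tgt)
print(out.shape)              # torch.Size([1, 7, 64])
```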

Self-Attention

In the "Attention is All You Need" paper, the self-attention mechanism is a way for a model to weigh the importance of each word in a sentence relative to every other word, regardless of their position in the sequence. Here's how it works:

  • Contextual Focus: Self-attention allows each word to attend to all other words in the sentence. For example, in the sentence "The cat eats the mouse," the word "cat" might focus on "eats" and "mouse" to understand the action and the object, while "eats" might focus on "cat" and "mouse" to understand what is being eaten.
  • Parallel Relationships: Every word in the sequence computes relationships (attentions) with all the other words simultaneously. This means that the model doesn't have to process the words one by one (like in LSTMs or RNNs), and can look at the entire sequence in parallel.
  • Dynamic Attention: The mechanism computes a dynamic weight (attention score) that tells the model how much importance each word should have when considering the others. For example, in "The cat eats the mouse," the word "eats" would have a stronger connection (higher attention score) to "cat" and "mouse" than to "the."
  • Contextual Understanding: This attention allows the model to build a better contextual representation of each word, based on its relationship with other words in the sentence. Each word ends up with a new vector representation that incorporates information from the entire sequence.
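The computation behind these attention scores is scaled dot-product attention: each token's query is compared against every token's key, the scores are normalized with a softmax, and the result weights the values. Below is a minimal single-head sketch in PyTorch; the projection matrices and sizes are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over token embeddings x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                        # queries, keys, values
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # pairwise token affinities
    weights = F.softmax(scores, dim=-1)                        # attention scores, one row per token
    return weights @ v, weights                                # context-aware representations

# Illustrative usage: 5 tokens, embedding dimension 8 (random projections for the sketch).
x = torch.randn(5, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)  # torch.Size([5, 8]) torch.Size([5, 5])
```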

Limitations of LSTM Solved by Self-Attention

  • Long-Range Dependencies:

    • LSTM Limitation: Struggles with capturing long-term relationships in sequences.
    • Self-Attention Solution: Can directly attend to any token, regardless of distance, efficiently capturing long-range dependencies.
  • Sequential Processing:

    • LSTM Limitation: Processes tokens one-by-one, limiting parallelization.
    • Self-Attention Solution: Processes the entire sequence simultaneously, enabling parallelization for faster computation.
  • Fixed Context Window:

    • LSTM Limitation: Uses a fixed-length context vector to represent the entire sequence, which may lose information in long sentences.
    • Self-Attention Solution: Dynamically creates context-aware embeddings, allowing each token to focus on relevant parts of the sequence.
  • Limited Focus:

    • LSTM Limitation: May fail to emphasize key words or relationships over long distances.
    • Self-Attention Solution: Can focus more on important tokens and dynamically adjust attention across the entire sequence.
  • Difficulty with Complex Dependencies:

    • LSTM Limitation: Struggles with tasks like word sense disambiguation or coreference resolution due to sequential processing.
    • Self-Attention Solution: Allows direct modeling of complex relationships between all tokens, improving tasks like disambiguation and coreference.

BERT (Bidirectional Encoder Representations from Transformers)

BERT is a pre-trained transformer-based model for NLP tasks developed by Google. It uses a transformer encoder architecture to capture context in a bidirectional manner, meaning it considers both the left and right context of a word, unlike previous models that only looked at one direction.

Key Features of BERT

  1. Bidirectional Context:
    • Traditional models like LSTMs or GPT process text in one direction (left-to-right or right-to-left), limiting their understanding of context.
    • BERT is bidirectional, meaning it attends to the entire sentence at once, so the representation of each word is informed by the words both before and after it.
  2. Transformer Encoder:
    • BERT uses the transformer encoder architecture, which employs self-attention to understand relationships between words in a sequence without relying on sequential processing (like LSTMs).
    • The encoder processes the input in parallel, making it faster and more efficient.
  3. Pre-training and Fine-tuning:
    • Pre-training: BERT is pre-trained on a large corpus of text (Wikipedia + BookCorpus) using two main tasks:
      • Masked Language Modeling (MLM): Randomly masks some words in the sentence and the model learns to predict them.
      • Next Sentence Prediction (NSP): The model is trained to predict whether the second sentence actually follows the first in the original text (helpful for tasks like question answering).
    • Fine-tuning: Once pre-trained, BERT can be fine-tuned on specific tasks (like classification, NER, or question answering) by adding a small task-specific output layer on top.
  4. Input Representation:
    • BERT takes word pieces (subword tokens) as input, enabling it to handle out-of-vocabulary words.
    • The input is represented as:
      • Token IDs: The words (or subwords) are mapped to numerical token IDs.
      • Segment Embeddings: Indicate if a token belongs to sentence A or sentence B (important for tasks like question answering).
      • Position Embeddings: Represent the position of each word in the sequence.
  5. Fine-Tuned for Specific Tasks:
    • After pre-training, BERT can be fine-tuned for a variety of NLP tasks by simply adding a task-specific output layer. These tasks include:
      • Sentence Classification (e.g., sentiment analysis)
      • Named Entity Recognition (NER)
      • Question Answering
      • Text Summarization
      • Natural Language Inference (NLI)
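To illustrate the input representation described above, the following sketch tokenizes a sentence pair with the Hugging Face transformers library (assuming the bert-base-uncased checkpoint is available); the example sentences are made up.

```python
from transformers import AutoTokenizer

# Assumes the Hugging Face `transformers` package and the bert-base-uncased checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A made-up sentence pair: sentence A and sentence B (as used for NSP or question answering).
encoded = tokenizer("Who wrote the paper?", "Vaswani et al. wrote it.", return_tensors="pt")

print(encoded["input_ids"])       # WordPiece token IDs, with [CLS] and [SEP] added
print(encoded["token_type_ids"])  # segment embeddings: 0 for sentence A, 1 for sentence B
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist()))
```

Position embeddings are added inside the model itself, so they do not appear in the tokenizer output.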

How BERT Works

  • Input Encoding: The input sequence is tokenized and converted into embeddings (word pieces, segment IDs, position embeddings).
  • Self-Attention: The self-attention mechanism allows BERT to attend to all words in the sentence simultaneously, learning their relationships and contextual importance.
  • Bidirectional Encoding: Unlike traditional models, BERT considers both the left and right context of each word during training, enabling richer representations.
  • Fine-Tuning: For a specific task, the pre-trained BERT model is fine-tuned with labeled task data. The task-specific output layer is trained to generate the desired output (e.g., classification label, token prediction).
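A single fine-tuning step for sentence classification might look like the sketch below, again assuming the Hugging Face transformers library; the example texts, labels, and learning rate are illustrative, not from the text.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumes the Hugging Face `transformers` package; texts, labels, and hyperparameters are illustrative.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["great movie!", "terrible plot."], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])            # 1 = positive, 0 = negative (made-up sentiment labels)

outputs = model(**batch, labels=labels)  # pre-trained encoder + newly added classification head
outputs.loss.backward()                  # gradients flow through both the head and BERT itself
optimizer.step()
optimizer.zero_grad()
```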