Introduction to State of the Art NLP | BERT

Drawing directly from the BERT research paper, let's introduce the BERT model.


BERT stands for Bidirectional Encoder Representations from Transformers. Other models such as ELMo and OpenAI GPT use only unidirectional context during training, which means at any point in time the model has seen only the previous tokens of the text, not the ones that follow. BERT is designed to learn bidirectional representations from unlabeled text by conditioning on both the left and right context.
This results in a pre-trained model that can be fine-tuned easily with just one additional output layer. Some of the applications are:

  • Question answering models
  • Language inference models
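That "one additional output layer" is typically just a linear classifier on top of BERT's pooled [CLS] vector. As a toy sketch in plain Python (using a made-up 4-dimensional hidden vector instead of BERT's real 768 dimensions):

```python
def classification_head(h_cls, W, b):
    """Compute logits = W @ h_cls + b: the single extra layer
    added on top of the pre-trained encoder for fine-tuning."""
    return [sum(w * x for w, x in zip(row, h_cls)) + bias
            for row, bias in zip(W, b)]

h_cls = [0.5, -1.0, 2.0, 0.0]       # pooled [CLS] hidden vector (toy values)
W = [[1.0, 0.0, 0.0, 0.0],          # one row of weights per output class
     [0.0, 0.0, 1.0, 0.0]]
b = [0.1, -0.1]
logits = classification_head(h_cls, W, b)  # one logit per class
```

During fine-tuning, both this head and the pre-trained encoder weights are updated end to end; only the head starts from scratch.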

What problem is BERT solving?

Okay, I think I mentioned this at the start, so let's talk about it. Pretty much everyone has heard of OpenAI GPT. Its authors used a unidirectional approach, but is that even a bad thing? What if unidirectional learning is better? The problem with such an approach is that, at any given time, the model can only use knowledge from the previous tokens, which can be very harmful when applying fine-tuning to token-level tasks. The solution proposed in the paper is a transformer-based model that uses a masked language model (MLM) objective.

What is a Masked Language Model (MLM)?

The model randomly masks some of the tokens (words) in the training text and then tries to predict the vocabulary ID (in simple words, the actual token) of each masked word based on its context. Because the model uses context from both sides to predict a word, it can train a deep bidirectional transformer. More about BERT in the next article.

How BERT Is Modeled

Model Architecture

It is a multi-layered bidirectional Transformer encoder. In the paper, the number of layers (i.e., transformer blocks) is denoted as L, the hidden size as H, and the number of self-attention heads as A. There are two model sizes: BERT-Base (L = 12, H = 768, A = 12), with about 110M total parameters, the same as OpenAI GPT, and BERT-Large (L = 24, H = 1024, A = 16), with around 340M parameters.
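As a rough sanity check on those parameter counts, here is a back-of-the-envelope estimate. It is only a sketch: it assumes the standard 30,522-token WordPiece vocabulary, 512 positions, and the usual 4H feed-forward size, and it ignores biases, layer norms, and the pooler, so it slightly undercounts.

```python
def bert_param_estimate(L, H, vocab_size=30522, max_pos=512, type_vocab=2):
    """Rough BERT parameter count: embeddings plus L transformer blocks.
    Each block has ~4*H*H attention weights (Q, K, V, output projections)
    and ~8*H*H feed-forward weights (intermediate size 4*H)."""
    embeddings = (vocab_size + max_pos + type_vocab) * H
    per_block = 4 * H * H + 8 * H * H
    return embeddings + L * per_block

base = bert_param_estimate(L=12, H=768)     # ~109M, close to the quoted 110M
large = bert_param_estimate(L=24, H=1024)   # ~334M, close to the quoted 340M
```

Note that the feed-forward weights dominate: two-thirds of each block's 12·H² weights sit in the 4H-wide intermediate layer.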

Input/Output Representations

The input representation can handle both a single sentence and a pair of sentences. A "sequence" refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together. Some considerations:

  • The first token of every sequence is always a special classification token, [CLS].
  • Sentence pairs are packed together into a single sequence. The differentiation is made in two ways.
    • First, we separate them with a special token [SEP].
    • Second, a learned embedding is added to every token indicating whether it belongs to sentence A or sentence B.
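The packing scheme above can be sketched in a few lines. This is a simplified illustration: real BERT tokenizes into WordPiece subwords, while here we just split on whitespace.

```python
def pack_pair(sentence_a, sentence_b=None):
    """Build BERT-style input tokens and segment IDs:
    [CLS] A-tokens [SEP] (then B-tokens [SEP] if a second sentence is given)."""
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)          # everything so far is sentence A
    if sentence_b is not None:
        b_tokens = sentence_b.split() + ["[SEP]"]
        tokens += b_tokens
        segment_ids += [1] * len(b_tokens)   # sentence B gets segment ID 1
    return tokens, segment_ids

tokens, segs = pack_pair("the man went to the store", "he bought milk")
```

The segment IDs are what the learned A/B embedding is indexed by, on top of the token and position embeddings.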

Pre-Training BERT

The BERT model is trained using two unsupervised tasks.

Task 1: Masked LM

The task is simple: mask some part of the text and then predict those masked tokens. Around 15% of all tokens in each sequence are masked at random. The only problem is that this creates a mismatch between pre-training and fine-tuning, since the [MASK] token never appears during fine-tuning. To fix this, the masked words are not always replaced by the [MASK] token. The training data generator chooses 15% of the token positions at random for prediction. If the i-th token is chosen, it is replaced with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, or (3) the unchanged i-th token 10% of the time. Then Ti, the final hidden vector for the i-th token, is used to predict the original token with a cross-entropy loss.
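The 15% selection with the 80/10/10 split can be sketched as follows. This is a simplified illustration that works on word strings rather than real WordPiece vocabulary IDs:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style masking: pick ~15% of positions for prediction; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)   # original token at each predicted position
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok          # the model must recover this token
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: keep the original token (but still predict it)
    return masked, labels

vocab = ["the", "man", "went", "to", "store", "milk", "bought"]
sentence = ("the man went to the store and bought a gallon of milk "
            "because the store had milk on sale today").split()
masked, labels = mask_tokens(sentence, vocab)
```

Note that the positions left unchanged (case 3) still carry a label, so the model cannot learn to simply copy its input at unmasked-looking positions.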

Task 2: Next Sentence Prediction (NSP)

Many important tasks such as question answering (QA) or natural language inference (NLI) are based on understanding the relationship between two sentences, which is not captured by Task 1. In order to train a model that understands sentence relationships, BERT is pre-trained on a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext).
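Generating those pairs from a corpus can be sketched like this. It is a simplified illustration: the randomly drawn sentence can occasionally happen to be the true next one, a rare collision that real pipelines also tolerate.

```python
import random

def make_nsp_example(sentences, i, rng):
    """Build one (A, B, label) NSP example starting at sentence i:
    50% of the time B is the true next sentence, else a random one."""
    a = sentences[i]
    if rng.random() < 0.5 and i + 1 < len(sentences):
        return a, sentences[i + 1], "IsNext"        # true continuation
    b = sentences[rng.randrange(len(sentences))]    # random sentence
    return a, b, "NotNext"

corpus = ["the man went to the store", "he bought a gallon of milk",
          "penguins are flightless birds", "they live in the antarctic"]
rng = random.Random(0)
examples = [make_nsp_example(corpus, i, rng) for i in range(len(corpus) - 1)]
```

Each (A, B) pair is then packed into one sequence with [CLS] and [SEP] as described earlier, and the [CLS] output is trained to predict the IsNext/NotNext label.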


I think with this we can now understand how the BERT model is trained and what its general use cases are. In the next article of the series, we will dive deeper into how to fine-tune a BERT model.

Also, if anyone needs an article on Transformers, just let us know in the comments.


If you find it interesting, we would really like to hear from you.

Ping us at Instagram/@the.blur.code


Thanks for reading!! 

Happy Coding
