WikiText-103
You can view the WikiText-103 leaderboard here.
Getting Started
You'll need the following in the root of your repository:
- sotabench.py file - contains benchmarking logic; the server will run this on each commit
- requirements.txt file - Python dependencies to be installed before running sotabench.py
- sotabench_setup.sh (optional) - any advanced dependencies or setup, e.g. compilation
You can write whatever you want in your sotabench.py file to get language model predictions on the WikiText-103 dataset.
But you will need to record your results for the server, and you'll want to avoid doing things like downloading the dataset on the server. So you should:
- Point to the server WikiText-103 data path - popular datasets are pre-downloaded on the server.
- Include an Evaluation object in the sotabench.py file to record the results.
- Use Caching (optional) - to speed up evaluation by hashing the first batch of predictions.
We explain how to do these various steps below.
Server Data Location
The WikiText-103 development data is located in the root of your repository on the server at .data/nlp/wikitext-103/wikitext-103-v1.zip.
The archive contains a folder wikitext-103 with the following files:
- wiki.train.tokens
- wiki.valid.tokens
- wiki.test.tokens
It is the original zip file released here.
We are running the benchmark on the wiki.test.tokens dataset.
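For local experiments without the evaluator, the archive layout described above can be unpacked by hand. The following is a minimal sketch using only the standard library; extract_test_set is a hypothetical helper name, not part of sotabencheval, and it only assumes that the archive contains wikitext-103/wiki.test.tokens as listed above.

```python
import zipfile
from pathlib import Path


def extract_test_set(archive_path, target_dir):
    """Unpack the WikiText-103 zip and return the path to wiki.test.tokens.

    Hypothetical helper for illustration; not the sotabencheval implementation.
    """
    archive_path, target_dir = Path(archive_path), Path(target_dir)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
    # The archive is expected to contain a top-level wikitext-103 folder.
    test_file = target_dir / "wikitext-103" / "wiki.test.tokens"
    if not test_file.is_file():
        raise FileNotFoundError(test_file)
    return test_file
```

On the server, this would be called with the archive path given above and any writable target directory. The helper methods described next do the same job for you.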
We have two helper methods that will unpack the dataset for you and give you the pathlib.Path to the test file.
The first option, test_set_path, is available once you instantiate the WikiText103Evaluator:
    ...

    evaluator = WikiText103Evaluator(
        model_name="Transformer-XL Large",
        paper_arxiv_id="1901.02860",
        paper_pwc_id="transformer-xl-attentive-language-models",
        local_root='/content/wikitext-103'
    )
    # test_set_path is a pathlib.Path and points to wiki.test.tokens
    with evaluator.test_set_path.open() as f:
        test_data = torch.tensor(tokenizer.encode(f.read())).to("cuda")
There is a second option available if you are evaluating multiple models and need to use the same dataset multiple times - WikiText103Evaluator.get_test_set_path(local_root). This will get the path before you initialize a WikiText evaluator:
    from sotabencheval.language_modelling import WikiText103Evaluator

    test_file_path = WikiText103Evaluator.get_test_set_path('/home/ubuntu/my_data/wiki103')
    with test_file_path.open() as f:
        content = f.read()
How Do I Initialize an Evaluator?
Add this to your code - before you start batching over the dataset and making predictions:
    from sotabencheval.language_modelling import WikiText103Evaluator

    evaluator = WikiText103Evaluator(model_name='Model name as found in paperswithcode website')
If you are reproducing a model from a paper, you can enter the arXiv ID. If you use the same model name string as on the WikiText-103 leaderboard, you will enable direct comparison with the paper's model. If the arXiv ID is not available, you can use the paperswithcode.com ID (paper_pwc_id) instead. Below is an example of an evaluator that matches Transformer-XL:
    from sotabencheval.language_modelling import WikiText103Evaluator

    evaluator = WikiText103Evaluator(
        model_name="Transformer-XL Large",
        paper_arxiv_id="1901.02860",
        paper_pwc_id="transformer-xl-attentive-language-models",
        local_root="path_to_your_data",
    )
The above will directly compare with the result of the paper when run on the server.
How Do I Evaluate Predictions?
The evaluator object has an .add(log_probs, targets) method to submit predictions by batch or in full.
We expect you to give us the log probability of a batch of target tokens and the target tokens themselves.
The log_probs can be one of:
- a 0d "tensor" (np.ndarray/torch.tensor) - summed log probability of all targets tokens
- a 2d "tensor" (np.ndarray/torch.tensor) - log probabilities of each target token; the log_probs.shape should match targets.shape
- a 3d "tensor" (np.ndarray/torch.tensor) - distribution of log probabilities for each position in the sequence; we will gather the probabilities of target tokens for you.
We recommend the second or third option, as it allows us to check your perplexity calculations.
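To make the relationship between the three accepted forms concrete, here is a small sketch with made-up numbers. Plain nested lists stand in for tensors so that the example runs without any dependencies; evaluator.add itself expects np.ndarray or torch tensors as listed above, and the sizes and probabilities are purely illustrative.

```python
import math

# Toy example: 1 sequence, 3 positions, vocabulary of 4 tokens (all values made up).
position_probs = [0.1, 0.2, 0.3, 0.4]  # same distribution at every position
targets = [[2, 0, 3]]                  # shape (1, 3): target token ids

# 3d form, shape (1, 3, 4): a log-probability distribution for each position
log_probs_3d = [[[math.log(p) for p in position_probs] for _ in range(3)]]

# 2d form, shape (1, 3): only the log probability of each target token
log_probs_2d = [
    [dist[token] for dist, token in zip(seq_dists, seq_targets)]
    for seq_dists, seq_targets in zip(log_probs_3d, targets)
]

# 0d form: a single number, the summed log probability of all target tokens
log_probs_0d = sum(sum(row) for row in log_probs_2d)
```

With real tensors, the step from the 3d to the 2d form is a gather along the last axis (for example torch.gather or np.take_along_axis), and the step to the 0d form is a sum.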
If your model uses subword tokenization, you don't need to convert subwords to full words. You are free to report the probability of each subword: we will adjust the perplexity normalization accordingly. Just make sure to set subword_tokenization=True in your evaluator.
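The reason normalization matters can be sketched in a few lines. This assumes that "adjusting the normalization" means dividing the summed log probability by the number of words rather than the number of subword pieces; the text above does not spell out the exact procedure, so treat this as an illustration of the idea, not as the evaluator's code. The perplexity function and the toy numbers are hypothetical.

```python
import math


def perplexity(summed_log_prob, num_units):
    """exp of the negative average log probability per unit (word or subword)."""
    return math.exp(-summed_log_prob / num_units)


# Toy text: 2 words that a tokenizer splits into 4 subword pieces,
# each piece predicted with probability 0.5 (made-up numbers).
num_words, num_subwords = 2, 4
summed_log_prob = num_subwords * math.log(0.5)

per_subword = perplexity(summed_log_prob, num_subwords)  # normalized by pieces
per_word = perplexity(summed_log_prob, num_words)        # normalized by words
```

In this toy case the per-subword value is 2 while the per-word value is 4, which is why subword-level and word-level perplexities are not directly comparable unless both are normalized by the same unit count.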
Here is an example of how to report results (using PyTorch):
    evaluator = WikiText103Evaluator(
        model_name='GPT-2 Small',
        paper_pwc_id="language-models-are-unsupervised-multitask",
        local_root="path_to_your_data",
        subword_tokenization=True
    )

    # run your data preprocessing; in the case of GPT-2 the preprocessing removes Moses artifacts

    with torch.no_grad():
        model.eval()
        for input, target in data_loader:
            output = model(input)
            # convert raw model outputs to log probabilities over the vocabulary
            log_probs = torch.log_softmax(output, dim=-1)
            # pick the log probability of each target token; squeeze so the shape matches target
            target_log_probs = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
            evaluator.add(target_log_probs, target)
When you are done, you can get the results locally by running:
evaluator.get_results()
But for the server you want to save the results by running:
evaluator.save()
This method serialises the results and model metadata and stores them in the server database.
How Do I Cache Evaluation?
Sotabench reruns your script on every commit. This is good because it acts like continuous integration in checking for bugs and changes, but can be annoying if the model hasn't changed and evaluation is lengthy.
Fortunately sotabencheval has caching logic that you can use.
The idea is that after the first batch, we hash the model outputs and the current metrics, and this tells us if the model is the same given the dataset. You can include hashing within an evaluation loop as follows (this example is for a PyTorch repository):
    with torch.no_grad():
        for input, target in data_loader:
            # ...
            output = model(input)
            log_probs = #...
            evaluator.add(log_probs, target)

            if evaluator.cache_exists:
                break

    evaluator.save()
If the hash is the same as the one stored on the server, we infer that the model hasn't changed, so we simply return the cached results rather than running the whole evaluation again.
Caching is very useful if you have large models, or a repository that is evaluating multiple models, as it speeds up evaluation significantly.
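The idea of fingerprinting the first batch can be sketched as follows. This is not sotabencheval's actual implementation: first_batch_fingerprint is a hypothetical function, and the choices of rounding to three decimals (so that tiny floating-point noise does not change the hash) and of SHA-256 are assumptions made for illustration.

```python
import hashlib
import json


def first_batch_fingerprint(log_probs, metrics, ndigits=3):
    """Hash rounded first-batch outputs together with the current metrics.

    Hypothetical sketch; the real cache key may be computed differently.
    """
    payload = {
        "log_probs": [round(float(v), ndigits) for v in log_probs],
        "metrics": {k: round(float(v), ndigits) for k, v in metrics.items()},
    }
    # sort_keys gives a stable serialization, so equal inputs give equal hashes.
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

A cache lookup is then a dictionary or database lookup keyed by this fingerprint: if the key is already present, the stored results can be returned and the remaining batches skipped.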
A full sotabench.py example
Below we show an implementation for a model from the huggingface/transformers repository. This incorporates all the features explained above: (a) using the server data, (b) using the WikiText-103 Evaluator, and (c) caching the evaluation logic:
    import torch
    from tqdm import tqdm
    from sotabencheval.language_modelling import WikiText103Evaluator

    model = torch.hub.load('huggingface/transformers', 'modelWithLMHead', 'transfo-xl-wt103').to("cuda")
    tokenizer = torch.hub.load('huggingface/transformers', 'tokenizer', 'transfo-xl-wt103')

    evaluator = WikiText103Evaluator(
        model_name="Transformer-XL Large",
        paper_arxiv_id="1901.02860",
        paper_pwc_id="transformer-xl-attentive-language-models",
        local_root='/content/wikitext-103'
    )

    with evaluator.test_set_path.open() as f:
        test_data = torch.tensor(tokenizer.encode(f.read()))

    seq_len = 128
    with torch.no_grad():
        evaluator.reset_timer()
        model.eval()
        X, Y, mems = test_data[None, :-1], test_data[None, 1:], None
        for s in tqdm(range(0, X.shape[-1], seq_len)):
            x, y = X[..., s:s+seq_len].to("cuda"), Y[..., s:s+seq_len].to("cuda")
            log_probs, mems, *_ = model(input_ids=x, mems=mems)
            evaluator.add(log_probs, y)
            if evaluator.cache_exists:
                break
    evaluator.save()
    evaluator.print_results()
You can run this example on Google Colab.
Need More Help?
Head on over to the Natural Language Processing section of the sotabench forums if you have any questions or difficulties.