Welcome to Robustness Gym¶
Robustness Gym is a toolkit for evaluating natural language processing models.
Robustness Gym is under active development, so expect rough edges. Feedback and contributions are welcome and appreciated. You can submit bugs and feature suggestions on GitHub Issues and submit contributions using a pull request.
You can get started by going to the installation page.
Installation¶
This page describes how to get Robustness Gym installed and ready to use. Head to the tutorials to start using Robustness Gym after installation.
Installing the Robustness Gym package¶
This is the only package you need to install to get set up.
Install with pip¶
pip install robustnessgym
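To check that the installation succeeded, you can try importing the package:
python -c "import robustnessgym"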
Optional Installation¶
The steps below aren’t necessary unless you need these features.
Progress bars in Jupyter¶
Enable the following Jupyter extensions to display progress bars properly.
jupyter nbextension enable --py widgetsnbextension
jupyter labextension install @jupyter-widgets/jupyterlab-manager
TextBlob setup¶
To use TextBlob, download and install the TextBlob corpora.
python -m textblob.download_corpora
Installing Spacy GPU¶
To install Spacy with GPU support, follow the steps below.
pip install cupy
pip install spacy[cuda]
python -m spacy download en_core_web_sm
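As a quick sanity check (not part of the Robustness Gym setup itself), you can ask Spacy whether it picked up the GPU; spacy.prefer_gpu() returns True when a GPU was allocated:
import spacy
# True if Spacy successfully allocated the GPU
print(spacy.prefer_gpu())
nlp = spacy.load('en_core_web_sm')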
Installing neuralcoref¶
The standard version of neuralcoref does not use GPUs for prediction; a pending pull request adds this functionality. Follow the steps below to use it.
git clone https://github.com/dirkgr/neuralcoref.git
cd neuralcoref
git checkout GpuFix
pip install -r requirements.txt
pip install -e .
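To verify the install, a minimal check is to attach neuralcoref to a Spacy pipeline (this assumes the en_core_web_sm model from the previous section is available):
import spacy
import neuralcoref
nlp = spacy.load('en_core_web_sm')
# Attach the coreference resolver to the Spacy pipeline
neuralcoref.add_to_pipe(nlp)
doc = nlp('My sister has a dog. She loves him.')
print(doc._.has_coref)  # True if any coreference clusters were found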
Robustness Gym in a Nutshell¶
What is Robustness Gym? Should you use it? Read this page to find some quick answers to common questions.
The Big Picture¶
Robustness Gym was built out of our own frustration at being unable to systematically evaluate and test our machine learning models.
Traditionally, evaluation has consisted of a few simple steps:
Load some data
Generate predictions using a model
Compute aggregate metrics
This is no longer sufficient: models are increasingly being deployed in real-world use cases, and aggregate performance is too coarse to make meaningful model assessments. Modern evaluation is about understanding whether models are robust to all the scenarios they might encounter, and where the tradeoffs lie.
This is reflected in Robustness Gym, which distills these modern goals into a new workflow:
Load some data
Compute and cache side-information on data
Build slices of data
Evaluate across the slices
Report and share findings
Iterate
We’ll go into what these steps mean and how to use them in Robustness Gym next.
The Robustness Gym Workflow¶
1. Load some data¶
Loading data in Robustness Gym is easy. We extend the Huggingface datasets library, so all datasets there are immediately available through the Robustness Gym Dataset class.
import robustnessgym as rg
# Load the boolq dataset
dataset = rg.Dataset.load_dataset('boolq')
# Load the first 10 training examples
dataset = rg.Dataset.load_dataset('boolq', split='train[:10]')
# Load from jsonl file
dataset = rg.Dataset.from_json("file.jsonl")
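Since Dataset extends Huggingface datasets, the usual inspection methods should carry over; a quick sketch, assuming the standard datasets API surface:
# Peek at the loaded data
print(len(dataset))          # number of examples
print(dataset.column_names)  # the available columns
print(dataset[0])            # the first example, as a dict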
2. Compute and cache side-information¶
One of the most common operations in evaluation is interpreting and analyzing the examples in a dataset. This can mean tagging data, adding information about examples from a knowledge base, or making predictions about the examples.
It’s often useful to have this information available conveniently stored alongside the example, ready to use for analysis.
This is the idea of the CachedOperation class in Robustness Gym. Think of it as a .map() over your dataset, except it provides convenience functions to retrieve any information you cache.
Robustness Gym ships with a few cached operations that you can use out-of-the-box.
from robustnessgym import SpacyOp, Stanza, TextBlob
# Create the Spacy CachedOperation
spacy_op = SpacyOp()
# Apply it on the "text" column of a dataset
dataset = spacy_op(batch_or_dataset=dataset, columns=["text"])
# Easily retrieve whatever information you need, wherever you need it
# Retrieve the tokens extracted by Spacy for the first 2 examples in the dataset
tokens = SpacyOp.retrieve(batch=dataset[:2], columns=["text"], proc_fns=SpacyOp.tokens)
# Retrieve everything Spacy cached for the first 2 examples, and process it yourself
spacy_info = SpacyOp.retrieve(batch=dataset[:2], columns=["text"])
# ...do stuff with spacy_info
3. Build slices¶
Robustness Gym supports a general set of abstractions to create slices of data. Slices are just datasets that are constructed by applying an instance of the SliceBuilder class in Robustness Gym.
Robustness Gym currently supports slices of four kinds:
Evaluation Sets: slices constructed from a pre-existing dataset
Subpopulations: slices constructed by filtering a larger dataset
Transformations: slices constructed by transforming a dataset
Attacks: slices constructed by attacking a dataset adversarially
3.1 Evaluation Sets¶
from robustnessgym import Dataset, Slice
# Evaluation Sets: direct construction of a slice
boolq_slice = Slice(Dataset.load_dataset('boolq'))
3.2 Subpopulations¶
from robustnessgym import NumTokensSubpopulation
# A simple subpopulation that splits the dataset into 3 slices.
# The intervals act as buckets: the first slice will contain examples whose
# text has a length between 0 and 4 tokens; the last interval is in percentiles
length_sp = NumTokensSubpopulation(intervals=[(0, 4), (8, 12), ("80%", "100%")])
# Apply it
dataset, slices, membership = length_sp(batch_or_dataset=dataset, columns=['text'])
# dataset is an updated dataset where every example is tagged with its slice
# slices is a list of Slice objects: think of this as a list of 3 datasets
# membership is a matrix of shape (n x 3) with 0/1 entries indicating
# whether each of the n examples belongs to each slice
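To see what came back, you can inspect the returned objects directly; this is just Python and numpy on the returned values:
# Each Slice is itself a dataset
for sl in slices:
    print(sl.identifier, len(sl))
# membership has one row per example and one column per slice
print(membership.shape)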
3.3 Transformations¶
from robustnessgym import EasyDataAugmentation
# Easy Data Augmentation (https://github.com/jasonwei20/eda_nlp)
eda = EasyDataAugmentation(num_transformed=2)
# Apply it
dataset, eda_slices, eda_membership = eda(batch_or_dataset=dataset, columns=['text'])
# eda_slices contains 2 transformed versions of the original dataset
3.4 Attacks¶
from robustnessgym import TextAttack
from textattack.models.wrappers import HuggingFaceModelWrapper
# TextAttack
textattack = TextAttack.from_recipe(recipe='BAEGarg2019',
                                    model=HuggingFaceModelWrapper(...))
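Since attacks are SliceBuilders, applying one should follow the same calling convention as the other builders; a sketch, assuming the recipe and model wrapper above have been filled in:
# Apply the attack to build adversarial slices
dataset, attacked_slices, attacked_membership = textattack(batch_or_dataset=dataset,
                                                           columns=['text'])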
4. Evaluate slices¶
At this point, you can just use your own code (e.g. in numpy) to calculate metrics, since the slices are just datasets.
import numpy as np
def accuracy(true: np.ndarray, pred: np.ndarray):
    """
    Your function for computing accuracy.
    """
    # Coerce to arrays so == compares elementwise
    return np.mean(np.asarray(true) == np.asarray(pred))
# Some model in your code
model = MyModel()
# Evaluation on the length slices
metrics = {}
for sl in slices:
    metrics[sl.identifier] = accuracy(true=sl["label"], pred=model.predict(sl['text']))
Robustness Gym includes a TestBench abstraction to make this process easier.
from robustnessgym import TestBench, Identifier, BinarySentiment
# Construct a testbench
testbench = TestBench(
    # Your identifier for the testbench
    identifier=Identifier(_name="MyTestBench"),
    # The task this testbench should be used to evaluate
    task=BinarySentiment(),
)
# Add slices
testbench.add_slices(slices)
# Evaluate: Robustness Gym knows what metrics to use from the task
metrics = testbench.evaluate(model)
You can also get a Robustness Report using the TestBench.
# Create the report
report = testbench.create_report(model)
# Generate the figures
_, figure = report.figures()
figure.write_image('my_figure.pdf', engine="kaleido")
Quickstart¶
This page gives a quick overview on how to start using Robustness Gym.
The central operation in Robustness Gym is the construction of slices of data: a slice is just a dataset that is used to test specific model properties.
Robustness Gym comes with a set of general abstractions to build slices with ease. We’ll use a simple example to show you how these work.
Robustness Gym also has a lot of built-in functionality that you can use out-of-the-box (thanks to some other great open-source projects) for creating slices. You can read more about these in [](), and check out []() if you’d like to contribute some of your own slice building code to Robustness Gym.
Let’s dive in quickly!
Building Slices¶
Robustness Gym contains a SliceBuilder class for writing code to build slices. This class defines a common interface that all SliceBuilders must follow:
Any SliceBuilder object can be called using slicebuilder(batch_or_dataset, columns).
This call always returns a (dataset, slices, matrix) tuple.
To see how this works, let’s walk through a simple example. We’re going to:
Create a dummy dataset containing just 4 text examples.
Use a ScoreSubpopulation (a kind of SliceBuilder) to build 2 slices.
Let’s start by creating the dataset.
from robustnessgym import Dataset, Identifier
dataset = Dataset.from_batch({
    'text': ['a person is walking',
             'a person is running',
             'a person is sitting',
             'a person is walking on a street eating a bagel']
}, identifier=Identifier(_name='MyDataset'))
Here, we used the .from_batch(..) method to create a dataset called MyDataset. This dataset has a single column called text, with 4 examples or rows.
The Identifier class is used to store identifying information for Dataset objects, SliceBuilder objects, and more.
Tip
Most objects in Robustness Gym have a .identifier property that can be used to inspect the object.
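For instance, you can print the identifier of the dataset we just created:
# Prints the identifier assigned at construction time, e.g. MyDataset
print(dataset.identifier)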
Next, let’s create the ScoreSubpopulation to build slices.
def length(batch, columns):
    """
    A simple function to compute the length of all examples in a batch.
    batch: a dict of lists
    columns: a list of str
    return: a list of lengths
    """
    assert len(columns) == 1, "Pass in a single column."
    # The name of the column to grab text from
    column_name = columns[0]
    text_batch = batch[column_name]
    # Tokenize the text using .split() and calculate the number of tokens
    return [len(text.split()) for text in text_batch]
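You can try the function directly on a batch from the dataset; with whitespace tokenization, the four examples have 4, 4, 4 and 10 tokens respectively:
# Treat the whole dataset as a single batch of 4 examples
print(length(batch=dataset[:4], columns=['text']))  # [4, 4, 4, 10]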
We pause here to point out a few things:
The def func(batch, columns) signature is a common pattern in Robustness Gym for adding custom functionality.
The batch here refers to a batch of data, e.g. {'text': ['a person is walking', 'a person is running'], 'index': [0, 1]} is a batch of size 2 from the dataset (dataset[:2]).
The columns parameter specifies the relevant columns of the batch. This has some advantages, e.g. suppose otherdataset has a column of text named sentence instead. We can reuse length for both datasets:
length(batch=dataset[:2], columns=['text'])
length(batch=otherdataset[:2], columns=['sentence'])
length returns a list of scores (lengths in this case). This is an important ingredient of the ScoreSubpopulation, which constructs (as the name suggests) slices by bucketing examples based on their score.
We tokenized text inside the length function. This is bad: tokenization is a basic step in text processing, and we should only do it once. If it were some other, more expensive operation, we should definitely do it only once.
Let’s keep going and wrap length in a ScoreSubpopulation.
from robustnessgym import ScoreSubpopulation
# Create the score subpopulation for length
length_sp = ScoreSubpopulation(intervals=[(0, 5), (5, 10)], score_fn=length)
The ScoreSubpopulation requires:
a list of intervals; each interval is a tuple containing the range of scores (here, lengths) that are considered part of that slice.
a score_fn, used to assign scores to a batch of examples.
Let’s run this on the dataset.
# Run the length subpopulation on the dataset
dataset, slices, membership = length_sp(batch_or_dataset=dataset, columns=['text'])
This call just executes the length function on the dataset, and buckets the examples based on which intervals they fall in. As we briefly mentioned earlier, this returns the (dataset, slices, membership) tuple:
dataset now tags each example with slice information, i.e. which slices the example belongs to.
slices is a list of Slice objects (2 here, since we specified 2 intervals). Each Slice object is a dataset containing just the examples that were part of the slice.
membership is an np.array matrix of shape (n, m), where n is the number of examples in the original dataset and m is the number of slices built. Entry (i, j) is 1 if example i is in slice j.
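Concretely, for our 4-example dataset with intervals (0, 5) and (5, 10), the three short examples should land in the first bucket and the long one in the second (assuming interval endpoints are inclusive), so membership looks something like:
print(membership)
# [[1 0]
#  [1 0]
#  [1 0]
#  [0 1]]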
And that’s (almost) it! Most code you write in Robustness Gym will follow a similar workflow. Before we end, we take a short segue to talk about the other major abstraction in Robustness Gym: the CachedOperation class.
Caching Information¶
As we noted earlier, we tokenized text inside the length function, when we should ideally run this step separately and reuse it across multiple SliceBuilder objects.
When creating Robustness Gym, we noticed this pattern frequently: cache some information (CachedOperation), and use that information to build some slices (SliceBuilder).
Let’s look at the same example as before, and use a CachedOperation for tokenization this time.
from robustnessgym import CachedOperation, Identifier
def tokenize(batch, columns):
    """
    A simple function to tokenize a batch of examples.
    batch: a dict of lists
    columns: a list of str
    return: a list of tokenized text
    """
    assert len(columns) == 1, "Pass in a single column."
    # The name of the column to grab text from
    column_name = columns[0]
    text_batch = batch[column_name]
    # Tokenize the text using .split()
    return [text.split() for text in text_batch]

# Create the CachedOperation
cachedop = CachedOperation(apply_fn=tokenize,
                           identifier=Identifier(_name="Tokenizer"))
We’ve written tokenize with the familiar func(batch, columns) function signature. This function is then wrapped into a CachedOperation for use.
Tip
A CachedOperation can be created with any func(batch, columns). The only constraint is that it must return a list, with size equal to that of the batch.
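Before we can retrieve anything, the cached operation has to actually be run over the dataset, mirroring how SpacyOp was applied earlier, so the tokens are in the cache when we ask for them:
# Apply the tokenizer and cache its output alongside the examples
dataset = cachedop(batch_or_dataset=dataset, columns=['text'])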
Let’s create our ScoreSubpopulation for length again.
from robustnessgym.decorators import singlecolumn
# @singlecolumn checks that a single column is passed in
@singlecolumn
def length(batch, columns):
    """
    A simple function to compute the length of all examples in a batch,
    using the cached tokenization.
    batch: a dict of lists
    columns: a list of str
    return: a list of lengths
    """
    # The name of the column to grab text from
    column_name = columns[0]
    # Retrieve the cached tokens and calculate the number of tokens
    return CachedOperation.retrieve(
        batch=batch,
        columns=[column_name],
        proc_fns=lambda tokens_batch: [len(tokens) for tokens in tokens_batch]
    )
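From here, the wrapping step is identical to before, except that the subpopulation now scores examples using the cached tokens instead of re-tokenizing:
# Wrap the cache-backed length function and build slices again
length_sp = ScoreSubpopulation(intervals=[(0, 5), (5, 10)], score_fn=length)
dataset, slices, membership = length_sp(batch_or_dataset=dataset, columns=['text'])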
Robustness Gym ships with CachedOperations that use standard text processing pipelines to tokenize and tag text.
There’s a ton more to Robustness Gym (and more coming). Here are some pointers on where to head next, depending on your specific goals:
If you want a more detailed tutorial and walkthrough, head to the [Tutorial 1]() Jupyter notebook.
If you’d like to see what SliceBuilders are available in Robustness Gym today, check out []().
If you’re interested in a walkthrough of the SliceBuilder class in more detail, head to [](). Head to []() for a deep dive into the CachedOperation class. This is recommended for expert users.
If you’d like to learn more about the motivation behind Robustness Gym, check out []().
If you’re interested in becoming a contributor, read []().