transformertc.datatc

Data loading functions for token-level classification with BERT. Reads data in the CONLL format.

Module Contents

transformertc.datatc.ExampleAttType
transformertc.datatc.logger
class transformertc.datatc.InputExampleTC(guid: str, tokens: List[str], labels: List[str] = None)

Bases: object

A single training/test example for token classification.

Note

This class is a structure to hold examples with included serialization and deserialization methods. Therefore, the __init__ method’s arguments are also the class attributes.

guid

Unique id for the example. Usually including its subset name (e.g. train-5).

Type

str

tokens

The sequence of tokens.

Type

list of str

labels

The sequence of labels. Defaults to None.

Type

list of str, optional

__repr__(self)

String representation of the object.

to_dict(self)

Serializes this instance to a Python dictionary.

to_json_string(self)

Serializes this instance to a JSON string.
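The serialization methods can be pictured with a small stand-in. ExampleSketch below is a hypothetical class, not the library's implementation. It assumes only that to_dict returns the constructor arguments and that to_json_string dumps that dictionary; the JSON formatting options are also assumptions.

```python
import copy
import json
from typing import List, Optional


class ExampleSketch:
    """Hypothetical stand-in for InputExampleTC (not the library's code)."""

    def __init__(self, guid: str, tokens: List[str], labels: Optional[List[str]] = None):
        self.guid = guid
        self.tokens = tokens
        self.labels = labels

    def to_dict(self) -> dict:
        # Deep copy so callers cannot mutate the example through the dict.
        return copy.deepcopy(self.__dict__)

    def to_json_string(self) -> str:
        # Assumed formatting: indented output with sorted keys.
        return json.dumps(self.to_dict(), indent=2, sort_keys=True)


example = ExampleSketch("train-5", ["EU", "rejects", "German"], ["B-ORG", "O", "B-MISC"])
# Because the constructor arguments are also the attributes, a round trip is direct.
restored = ExampleSketch(**json.loads(example.to_json_string()))
```

Since the attributes mirror the __init__ arguments, the dictionary produced by to_dict can be passed straight back to the constructor.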

class transformertc.datatc.InputFeaturesTC(input_ids: List[int], attention_mask: List[int], token_type_ids: List[int], label_ids: List[int], token_positions: List[int])

Bases: object

A single set of features of data, probably corresponding to a single InputExampleTC.

In the context of the transformers library, features refers to a transformer’s input e.g. (subword) token ids, attention masks, segment ids, label ids, etc.

Note

This class is a structure to hold features with included serialization and deserialization methods. Therefore, the __init__ method’s arguments are also the class attributes.

input_ids

Indices of input tokens in the vocabulary.

Type

list of int

attention_mask

Mask to avoid performing attention on padding token indices. Mask values are usually selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED (padded) tokens.

Type

list of int

token_type_ids

Segment token indices to indicate first and second portions of the inputs.

Type

list of int

label_ids

Label ids corresponding to the input token ids.

Type

list of int

token_positions

Positions of the original tokens.

Type

list of int

__eq__(self, other)

Comparison with other objects.

__repr__(self)

String representation of the object.

to_dict(self)

Serializes this instance to a Python dictionary.

to_json_string(self)

Serializes this instance to a JSON string.
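The relationship between input_ids and attention_mask can be sketched as follows. pad_to_length is a hypothetical helper, and the padding id of 0 and the token ids in the usage lines are illustrative assumptions rather than values taken from the library.

```python
from typing import List, Tuple


def pad_to_length(input_ids: List[int], max_length: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Truncate or pad ids to max_length and build the matching attention mask."""
    ids = input_ids[:max_length]
    mask = [1] * len(ids)             # 1 = real token, attended to
    padding = max_length - len(ids)
    ids = ids + [pad_id] * padding    # pad_id is an assumed padding id
    mask = mask + [0] * padding       # 0 = padding, not attended to
    return ids, mask


# Illustrative ids only; real ids come from the tokenizer vocabulary.
ids, mask = pad_to_length([101, 7270, 102], max_length=5)
```

Both lists end up with exactly max_length entries, which is what lets the fields of many examples be stacked into fixed-size batches.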

class transformertc.datatc.ResultTC(name: str, token_start: int, token_end: int, label: str, score: float = -1.0)

Bases: object

A single set of results from inference (token classification).

Note

This class is a structure to hold results with included serialization and deserialization methods. Therefore, the __init__ method’s arguments are also the class attributes.

name

The token or tokens classified. For NER, this contains the named entity mention.

Type

str

label

The label associated with the token or tokens. For NER, this is the entity type, e.g. ORG.

Type

str

token_start

The position of the first token for this result. For NER, this is the position of the first token in the mention.

Type

int

token_end

The position of the last token for this result (inclusive).

Type

int

score

The score or probability associated with this result. For NER, this is the mean of the individual token scores, with each token score being the softmax output. A negative number can be used to indicate the absence of a score. Defaults to -1.

Type

float

Example:

'If the film In the Mood for Love , ...'
 0  1   2    3  4   5    6   7    8 ...

ResultTC(
    name="In the Mood for Love",
    label="WORK_OF_ART",
    token_start=3,
    token_end=7,
    score=0.9969802618026733
)

__repr__(self)

String representation of the object.

to_dict(self)

Serializes this instance to a Python dictionary.

to_json_string(self)

Serializes this instance to a JSON string.

to_txt_string(self)

Serializes this instance to a JSON string without formatting.
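How per-token predictions could be turned into the fields of a ResultTC can be sketched as below. group_bio is a hypothetical helper that assumes BIO-style tags; it returns plain dictionaries with the same field names rather than ResultTC objects, and it is not the library's own inference code.

```python
from statistics import mean
from typing import Dict, List


def group_bio(tokens: List[str], tags: List[str], scores: List[float]) -> List[Dict]:
    """Group BIO-tagged tokens into spans with inclusive start/end positions."""
    results, start = [], None
    for i, tag in enumerate(tags + ["O"]):       # sentinel closes a trailing span
        inside = start is not None and tag == "I-" + tags[start][2:]
        if start is not None and not inside:
            results.append({
                "name": " ".join(tokens[start:i]),
                "label": tags[start][2:],
                "token_start": start,
                "token_end": i - 1,              # inclusive end position
                "score": mean(scores[start:i]),  # mean of the token scores
            })
            start = None
        if tag.startswith("B-"):
            start = i
    return results


tokens = "If the film In the Mood for Love , ...".split()
tags = ["O"] * 3 + ["B-WORK_OF_ART"] + ["I-WORK_OF_ART"] * 4 + ["O"] * 2
scores = [0.99] * 3 + [1.0, 0.9, 1.0, 0.9, 1.0] + [0.99] * 2  # made-up scores
spans = group_bio(tokens, tags, scores)
```

The single span found here covers positions 3 to 7, matching the token_start and token_end of the example above; the scores are invented for illustration.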

class transformertc.datatc.DataProcessor

Bases: object

Base class for data converters for classification data sets.

Made to support both sequence classification and token classification tasks.

abstract get_train_examples(self)

Gets a collection of InputExampleTC for the train set.

abstract get_dev_examples(self)

Gets a collection of InputExampleTC for the dev set.

abstract get_test_examples(self)

Gets a collection of InputExampleTC for the test set.

abstract get_labels(self)

Gets the list of labels for this data set.

classmethod _read_tsv(cls, input_file: str, quotechar: str = None)

Reads a tab separated value file.

Parameters
  • input_file (str) – the path to the file to read.

  • quotechar (str, optional) – the CSV quotechar. Defaults to None.

classmethod _read_conll_file(cls, input_file: str, label_col: int = -1)

Reads a CONLL format text file.

Parameters
  • input_file (str) – the path to the file to read.

  • label_col (int, optional) – the index of the CONLL column to read as the label. Defaults to -1 which in python indexing means the last column. 0 is assumed to be the tokens.
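The effect of label_col can be sketched with a small reader. read_conll_lines is a hypothetical function written for illustration; it assumes whitespace-separated columns, blank lines between sentences, and that -DOCSTART- lines are skipped, none of which is confirmed to match the library's reader exactly.

```python
from typing import Iterable, List, Tuple

Sentence = Tuple[List[str], List[str]]


def read_conll_lines(lines: Iterable[str], label_col: int = -1) -> List[Sentence]:
    """Parse CONLL-style lines into (tokens, labels) pairs, one per sentence."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    labels: List[str] = []
    for line in lines:
        line = line.strip()
        # Assumption: a blank line or a -DOCSTART- marker ends the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])           # column 0 holds the token
        labels.append(columns[label_col])   # -1 selects the last column
    if tokens:                              # flush a final sentence with no trailing blank line
        sentences.append((tokens, labels))
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
""".splitlines()
parsed = read_conll_lines(sample)               # labels from the last column
chunks = read_conll_lines(sample, label_col=2)  # labels from the chunk column
```

With the default label_col the entity tags are read; passing 2 selects the chunk tags from the same file instead.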

class transformertc.datatc.CONLLProcessor(data_dir: str, label_col: int = -1)

Bases: transformertc.datatc.DataProcessor

Processor for CONLL-style datasets.

The dataset is assumed to have the following structure:

data_dir
|-- train.txt
|-- dev.txt
|-- test.txt
|-- labels.txt

With labels.txt being an ordered, line-delimited list of labels e.g.:

O
B-PER
I-PER
...

While train.txt, dev.txt, and test.txt are CONLL-style formatted files, e.g.:

-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC

get_train_examples(self)

See base class.

get_dev_examples(self)

See base class.

get_test_examples(self)

See base class.

get_labels(self)

See base class.

_create_examples(self, lines, set_type)

Creates examples for the different subsets (splits).

transformertc.datatc.convert_example_to_features(example: InputExampleTC, label2id: Dict[str, int], tokenizer: Any, max_length: int = 512, ignore_lbl_id: int = -100) → Tuple[InputFeaturesTC, List[str]]

Converts a single InputExampleTC to an InputFeaturesTC.

Parameters
  • example (InputExampleTC) – an example to convert to features.

  • label2id (dict key str, value int) – a dictionary mapping label strings to their respective label ids.

  • tokenizer (obj) – the transformer tokenizer object.

  • max_length (int) – the maximum length of the post-tokenized tokens and the respective associated fields in an InputFeaturesTC. Sequences longer will be truncated, sequences shorter will be padded. This length includes any special tokens that must be added such as [CLS] and [SEP] in BERT.

  • ignore_lbl_id (int, optional) – a value of a label id to be ignored, used for subword tokens. This is typically negative. Usually -1 or torch.nn.CrossEntropyLoss().ignore_index. Defaults to -100.

Returns

tuple containing:

  • features (InputFeaturesTC) – the data in example converted into features. Given a task-specific InputExampleTC, it should return a task-specific InputFeaturesTC for that task.

  • sw_tokens (list of str) – the list of (probably) subword tokens as tokenized by the tokenizer.

Return type

tuple

Raises

AssertionError – If lengths of the respective InputFeaturesTC will not match.
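The interplay of subword tokenization, ignore_lbl_id, truncation and padding can be sketched as follows. Everything here is an illustrative assumption: toy_tokenize is a made-up tokenizer that cuts words into 4-character pieces, align_labels is a hypothetical helper, and giving the real label to the first subword only (and the ignore id to special and padding positions) is one common convention, not necessarily the library's.

```python
from typing import Dict, List, Tuple


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical subword tokenizer: 4-character pieces, '##' on continuations."""
    pieces = [word[i:i + 4] for i in range(0, len(word), 4)]
    return pieces[:1] + ["##" + p for p in pieces[1:]]


def align_labels(tokens: List[str], labels: List[str], label2id: Dict[str, int],
                 max_length: int = 8, ignore_lbl_id: int = -100
                 ) -> Tuple[List[str], List[int], List[int]]:
    sw_tokens: List[str] = []
    label_ids: List[int] = []
    for token, label in zip(tokens, labels):
        pieces = toy_tokenize(token)
        if not pieces:                      # skip empty tokens to keep lists aligned
            continue
        sw_tokens.extend(pieces)
        # Assumed convention: only the first subword carries the real label.
        label_ids.extend([label2id[label]] + [ignore_lbl_id] * (len(pieces) - 1))
    # Reserve two positions for [CLS] and [SEP], then truncate.
    sw_tokens = ["[CLS]"] + sw_tokens[:max_length - 2] + ["[SEP]"]
    label_ids = [ignore_lbl_id] + label_ids[:max_length - 2] + [ignore_lbl_id]
    attention_mask = [1] * len(sw_tokens)
    padding = max_length - len(sw_tokens)
    sw_tokens += ["[PAD]"] * padding
    label_ids += [ignore_lbl_id] * padding
    attention_mask += [0] * padding
    # All fields must share the same length, mirroring the documented check.
    assert len(sw_tokens) == len(label_ids) == len(attention_mask) == max_length
    return sw_tokens, label_ids, attention_mask


label2id = {"O": 0, "B-ORG": 1, "B-MISC": 2}
sw_tokens, label_ids, attention_mask = align_labels(
    ["EU", "rejects", "German"], ["B-ORG", "O", "B-MISC"], label2id)
```

Positions holding ignore_lbl_id contribute nothing to a loss that is configured to ignore that id, so continuation subwords, special tokens and padding do not affect training in this sketch.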

transformertc.datatc.log_example_features(example: InputExampleTC, features: InputFeaturesTC, tokens: List[str]) → None

Logs an InputExampleTC and its conversion to InputFeaturesTC.

transformertc.datatc.convert_examples_to_features(examples: Sequence[InputExampleTC], labels: List[str], tokenizer: Any, max_length: int = 512, ignore_lbl_id: int = -100) → List[InputFeaturesTC]

Converts a sequence of InputExampleTC to a list of InputFeaturesTC.

Parameters
  • examples (list of InputExampleTC) – Sequence of InputExampleTC containing the examples to be converted to features.

  • tokenizer (obj) – Instance of a transformer tokenizer that will tokenize the example tokens and convert them to model specific ids.

  • max_length (int) – the maximum length of the post-tokenized tokens and the respective associated fields in an InputFeaturesTC. Sequences longer will be truncated, sequences shorter will be padded. This length includes any special tokens that must be added such as [CLS] and [SEP] in BERT.

  • ignore_lbl_id (int, optional) – a value of a label id to be ignored, used for subword tokens. This is typically negative. Usually -1 or torch.nn.CrossEntropyLoss().ignore_index. Defaults to -100.

Returns

A list of task-specific InputFeaturesTC which can be fed to the model.

transformertc.datatc.convert_tokens_to_example(tokens: List[str], guid: str = '', labels: List[str] = None) → InputExampleTC

Creates an InputExampleTC from a list of tokens.

Note

This function is meant to be used at inference time (i.e. not during training or evaluation) where data is expected not to have labels.

Parameters
  • tokens (list of str) – Sequence of tokens to be converted to an InputExampleTC.

  • guid (str, optional) – a unique identifier for the example. Defaults to empty string.

  • labels (list of str, optional) – a label for each corresponding token. Since this function is primarily made for inference, it defaults to a list of empty strings that exist for compatibility. However, a different string signifying no label could be used.

Returns

InputExampleTC containing the tokens passed as input.

transformertc.datatc.convert_features_to_pytorch_dataset(all_features: List[InputFeaturesTC]) → TensorDataset

Converts a list of features into a pytorch dataset.

Parameters

all_features (list of InputFeaturesTC) – the list of InputFeaturesTC originating from a list of InputExampleTC that will constitute the dataset.

Returns

A pytorch TensorDataset containing the features with the attributes of features occupying the following dimensions:

0 - input (token) ids
1 - attention mask
2 - token types (or segment ids)
3 - label ids

transformertc.datatc.convert_features_to_dataset(all_features: List[InputFeaturesTC], dataset_type: str = 'pytorch') → TensorDataset

Converts a list of features into a dataset.

Parameters
  • all_features (list of InputFeaturesTC) – the list of InputFeaturesTC originating from a list of InputExampleTC that will constitute the dataset.

  • dataset_type (str, optional) – the type of dataset. Currently only pytorch is supported. Defaults to 'pytorch'.

Returns

A pytorch TensorDataset.

Raises

ValueError – If dataset_type is not supported.
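The column layout and the dataset_type check can be sketched without PyTorch. to_columns and to_dataset are hypothetical helpers that use plain lists and dictionaries in place of tensors and InputFeaturesTC objects, and they assume the generic function uses the same four-column order as the PyTorch-specific one.

```python
from typing import Dict, List, Sequence, Tuple

Columns = Tuple[List[List[int]], ...]


def to_columns(all_features: Sequence[Dict[str, List[int]]]) -> Columns:
    """Stack features column-wise in the order: ids, mask, token types, labels."""
    order = ("input_ids", "attention_mask", "token_type_ids", "label_ids")
    return tuple([f[name] for f in all_features] for name in order)


def to_dataset(all_features: Sequence[Dict[str, List[int]]],
               dataset_type: str = "pytorch") -> Columns:
    # Only one backend is supported; anything else is rejected up front.
    if dataset_type != "pytorch":
        raise ValueError(f"Unsupported dataset_type: {dataset_type!r}")
    return to_columns(all_features)


feats = [
    {"input_ids": [101, 5, 102], "attention_mask": [1, 1, 1],
     "token_type_ids": [0, 0, 0], "label_ids": [-100, 1, -100]},
    {"input_ids": [101, 102, 0], "attention_mask": [1, 1, 0],
     "token_type_ids": [0, 0, 0], "label_ids": [-100, -100, -100]},
]
columns = to_dataset(feats)
```

In a PyTorch setting, each of the four columns would typically be converted to a long tensor and the tensors passed together to torch.utils.data.TensorDataset.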

class transformertc.datatc.DataBunchLoaderTC(data_dir: str, slen: int, backend: str = 'pytorch', cached_data_dir: str = None)

Bases: object

A class for grouping multiple subsets of a token classification dataset.

The idea is that training and evaluation commonly require train, development (or validation), and test subsets (or splits). Generally, all of these share the same parameters (e.g. max sequence length), the same set of labels, etc. Objects of this class hold all the required parameters and options for the 3 subsets of a token classification dataset.

See CONLLProcessor for the directory structure and data file requirements.

tokenizer

the (subword) tokenizer of the pretrained transformer model.

get_labels(self)

Returns the list of labels.

property tokenizer(self)

Get the tokenizer.

_path_for_cached(self, subset: str = 'train')

Given a subset name, return the path for its cached data.

_load_cached(self, subset: str = 'train')

Load a cached subset.

_save_cached(self, features: List[InputFeaturesTC], subset: str = 'train')

Save features to a cache file.

_get_features(self, subset: str = 'train')

Features for a subset from either cache or by generating them.

If features are not already cached and if cached_data_dir is not None, it will cache the features generated.

Returns

List of InputFeaturesTC for the subset.
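The cache-or-generate behaviour described for _get_features can be sketched as below. get_features is a hypothetical free function, and the pickle format and the cached_<subset>.pkl file name are assumptions made for illustration; the library's actual cache layout is not documented here.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def get_features(subset: str, generate: Callable[[str], List],
                 cached_data_dir: Optional[str] = None) -> List:
    """Return features for a subset from cache when present, else generate them."""
    if cached_data_dir is None:
        return generate(subset)                            # caching disabled
    path = Path(cached_data_dir) / f"cached_{subset}.pkl"  # assumed file naming
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)
    features = generate(subset)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(features, handle)                      # cache for the next call
    return features


calls: List[str] = []


def fake_generate(subset: str) -> List:
    """Stand-in for the expensive read-and-tokenize step."""
    calls.append(subset)
    return [{"input_ids": [101, 102]}]


with tempfile.TemporaryDirectory() as cache_dir:
    first = get_features("train", fake_generate, cache_dir)
    second = get_features("train", fake_generate, cache_dir)  # served from cache
```

The second call returns the same features without invoking the generator again, which is the point of caching tokenized data between runs.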

get_train(self, batch_size: int)

Returns the train dataloader.

get_dev(self, batch_size: int)

Returns the dev dataloader.

get_test(self, batch_size: int = 1)

Returns the test dataloader.