`transformertc.datatc`¶

Data loading functions for token-level classification with BERT. Reads data in the CONLL format.

Module Contents¶

transformertc.datatc.ExampleAttType¶

transformertc.datatc.logger¶

class transformertc.datatc.InputExampleTC(guid: str, tokens: List[str], labels: List[str] = None)¶

Bases: object

A single training/test example for token classification.

Note

This class is a structure to hold examples with included serialization and deserialization methods. Therefore, the __init__ method’s arguments are also the class attributes.

guid¶

Unique id for the example. Usually including its subset name (e.g. train-5).

Type: str

tokens¶

The sequence of tokens.

Type: list of str

labels¶

The sequence of labels. Defaults to None.

Type: list of str, optional

__repr__(self)¶: String representation of the object.

to_dict(self)¶: Serializes this instance to a Python dictionary.

to_json_string(self)¶: Serializes this instance to a JSON string.
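As a rough illustration of the structure described above, the sketch below mirrors the documented attributes and serialization methods with a stand-in class. The class name `ExampleSketch` and the JSON formatting options are assumptions made for illustration; the actual InputExampleTC implementation may differ.

```python
import copy
import json
from typing import List, Optional


class ExampleSketch:
    """Hypothetical stand-in for InputExampleTC (not the library class)."""

    def __init__(self, guid: str, tokens: List[str],
                 labels: Optional[List[str]] = None):
        # The constructor arguments double as the instance attributes.
        self.guid = guid
        self.tokens = tokens
        self.labels = labels

    def to_dict(self) -> dict:
        # Deep copy so callers cannot mutate the instance through the dict.
        return copy.deepcopy(self.__dict__)

    def to_json_string(self) -> str:
        # The indent and key ordering are assumptions.
        return json.dumps(self.to_dict(), indent=2, sort_keys=True)


example = ExampleSketch("train-5", ["EU", "rejects"], ["B-ORG", "O"])
# Round trip: serialize to JSON, then rebuild an equivalent object.
restored = ExampleSketch(**json.loads(example.to_json_string()))
```

Because the attributes match the constructor arguments, the dictionary form can be passed straight back to the constructor, which is what makes the round trip in the last line possible.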

class transformertc.datatc.InputFeaturesTC(input_ids: List[int], attention_mask: List[int], token_type_ids: List[int], label_ids: List[int], token_positions: List[int])¶

Bases: object

A single set of features of data, probably corresponding to a single InputExampleTC.

In the context of the transformers library, features refers to a transformer’s input e.g. (subword) token ids, attention masks, segment ids, label ids, etc.

Note

This class is a structure to hold features with included serialization and deserialization methods. Therefore, the __init__ method’s arguments are also the class attributes.

input_ids¶

Indices of input tokens in the vocabulary.

Type: list of int

attention_mask¶

Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]: usually 1 for tokens that are NOT MASKED and 0 for MASKED (padded) tokens.

Type: list of int

token_type_ids¶

Segment token indices to indicate first and second portions of the inputs.

Type: list of int

label_ids¶

Label ids corresponding to the input token ids.

Type: list of int

token_positions¶

Positions of the original tokens.

Type: list of int

__eq__(self, other)¶: Comparison with other objects.

__repr__(self)¶: String representation of the object.

to_dict(self)¶: Serializes this instance to a Python dictionary.

to_json_string(self)¶: Serializes this instance to a JSON string.

class transformertc.datatc.ResultTC(name: str, token_start: int, token_end: int, label: str, score: float = -1.0)¶

Bases: object

A single set of results from inference (token classification).

Note

This class is a structure to hold results with included serialization and deserialization methods. Therefore, the __init__ method’s arguments are also the class attributes.

name¶

The token or tokens classified. For NER, this contains the named entity mention.

Type: str

label¶

The label associated with the token or tokens. For NER, this is the entity type, e.g. ORG.

Type: str

token_start¶

The position of the first token for this result. For NER, this is the position of the first token in the mention.

Type: int

token_end¶

The position of the last token (inclusive, as in the example below).

Type: int

score¶

The score or probability associated with this result. For NER, this is the mean of the individual token scores, with each token score being the softmax output. A negative number can be used to indicate the absence of a score. Defaults to -1.0.

Type: float

Example:

'If the film In the Mood for Love , ...'
 0  1   2    3  4   5    6   7    8 ...

ResultTC(
    name="In the Mood for Love",
    label="WORK_OF_ART",
    token_start=3,
    token_end=7,
    score=0.9969802618026733
)
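One way to arrive at such a result from per-token predictions is sketched below, assuming BIO-style labels and one softmax score per token. The function `group_spans` and its plain-dict output are hypothetical; they only illustrate how token_start, token_end (inclusive), and the mean score relate, and are not the library's decoding code.

```python
from statistics import mean
from typing import Dict, List


def group_spans(tokens: List[str], labels: List[str],
                scores: List[float]) -> List[Dict]:
    """Group BIO-labelled tokens into spans (hypothetical helper)."""
    spans, start = [], None

    def close(end: int) -> None:
        # Emit the span covering tokens[start..end], both ends inclusive.
        spans.append({
            "name": " ".join(tokens[start:end + 1]),
            "label": labels[start][2:],  # strip the "B-" prefix
            "token_start": start,
            "token_end": end,
            "score": mean(scores[start:end + 1]),
        })

    for i, label in enumerate(labels):
        # A span continues only on an I- label of the same entity type.
        continues = start is not None and label == "I-" + labels[start][2:]
        if start is not None and not continues:
            close(i - 1)
            start = None
        if label.startswith("B-"):
            start = i
    if start is not None:
        close(len(labels) - 1)
    return spans


tokens = "If the film In the Mood for Love , ...".split()
labels = ["O"] * 3 + ["B-WORK_OF_ART"] + ["I-WORK_OF_ART"] * 4 + ["O"] * 2
scores = [0.99] * len(tokens)
spans = group_spans(tokens, labels, scores)
```

With these inputs, the single span found covers token positions 3 to 7, matching the example above.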

__repr__(self)¶: String representation of the object.

to_dict(self)¶: Serializes this instance to a Python dictionary.

to_json_string(self)¶: Serializes this instance to a JSON string.

to_txt_string(self)¶: Serializes this instance to a JSON string without formatting.

class transformertc.datatc.DataProcessor¶

Bases: object

Base class for data converters for classification data sets.

Made to support both sequence classification and token classification tasks.

abstract get_train_examples(self)¶: Gets a collection of InputExampleTC for the train set.

abstract get_dev_examples(self)¶: Gets a collection of InputExampleTC for the dev set.

abstract get_test_examples(self)¶: Gets a collection of InputExampleTC for the test set.

abstract get_labels(self)¶: Gets the list of labels for this data set.

classmethod _read_tsv(cls, input_file: str, quotechar: str = None)¶

Reads a tab separated value file.

Parameters

input_file (str) – the path to the file to read.
quotechar (str, optional) – the CSV quotechar. Defaults to None.

classmethod _read_conll_file(cls, input_file: str, label_col: int = -1)¶

Reads a CONLL format text file.

Parameters

input_file (str) – the path to the file to read.
label_col (int, optional) – the index of the CONLL column to read as the label. Defaults to -1, which in Python indexing means the last column. Column 0 is assumed to hold the tokens.

class transformertc.datatc.CONLLProcessor(data_dir: str, label_col: int = -1)¶

Bases: transformertc.datatc.DataProcessor

Processor for CONLL-style datasets.

The dataset is assumed to have the following structure:

data_dir
|-- train.txt
|-- dev.txt
|-- test.txt
|-- labels.txt

With labels.txt being an ordered, line-delimited list of labels e.g.:

O
B-PER
I-PER
...

While train.txt, dev.txt, and test.txt are CONLL-style formatted files, e.g.:

-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
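A minimal sketch of reading this format is shown below. It assumes whitespace-separated columns, blank lines between sentences, and that -DOCSTART- lines are skipped. The function name `read_conll_lines` is hypothetical, and the actual _read_conll_file may handle these cases differently.

```python
from typing import Iterable, List, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, labels)


def read_conll_lines(lines: Iterable[str], label_col: int = -1) -> List[Sentence]:
    """Parse CONLL-style lines into (tokens, labels) pairs (hypothetical helper)."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    labels: List[str] = []
    for line in lines:
        columns = line.split()
        if not columns or columns[0] == "-DOCSTART-":
            # A blank line or document marker ends the current sentence.
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        tokens.append(columns[0])          # column 0 holds the token
        labels.append(columns[label_col])  # -1 selects the last column
    if tokens:
        sentences.append((tokens, labels))
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
"""
parsed = read_conll_lines(sample.splitlines())
pos_tags = read_conll_lines(sample.splitlines(), label_col=1)
```

Passing a different label_col, as in the last line, selects another column (here the part-of-speech tags) as the label sequence.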

get_train_examples(self)¶: See base class.

get_dev_examples(self)¶: See base class.

get_test_examples(self)¶: See base class.

get_labels(self)¶: See base class.

_create_examples(self, lines, set_type)¶: Creates examples for the different subsets (splits).

transformertc.datatc.convert_example_to_features(example: InputExampleTC, label2id: Dict[str, int], tokenizer: Any, max_length: int = 512, ignore_lbl_id: int = -100) → Tuple[InputFeaturesTC, List[str]]¶

Converts a single InputExampleTC to an InputFeaturesTC.

Parameters

example (InputExampleTC) – an example to convert to features.
label2id (dict of str to int) – a dictionary mapping label strings to their respective label ids.
tokenizer (obj) – the transformer tokenizer object.
max_length (int, optional) – the maximum length of the post-tokenized tokens and the respective associated fields in an InputFeaturesTC. Longer sequences will be truncated and shorter sequences will be padded. This length includes any special tokens that must be added, such as [CLS] and [SEP] in BERT. Defaults to 512.
ignore_lbl_id (int, optional) – a label id value to be ignored, used for subword tokens. This is typically negative, usually -1 or torch.nn.CrossEntropyLoss().ignore_index. Defaults to -100.

Returns

tuple containing:

features (InputFeaturesTC) – the data in example converted into features. Given a task-specific InputExampleTC, it should return a task-specific InputFeaturesTC for that task.

sw_tokens (list of str) – the (probably subword) tokens as produced by the tokenizer.

Return type

tuple

Raises

AssertionError – If the lengths of the respective InputFeaturesTC fields do not match.
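The roles of max_length and ignore_lbl_id can be illustrated with a simplified sketch. It assumes that only the first subword of each original token carries the token's label, that special tokens and padding receive ignore_lbl_id, and it uses a toy tokenizer (`toy_subwords`) in place of a real transformer tokenizer. All names here are hypothetical, and the library's actual conversion may differ in detail.

```python
from typing import Dict, List, Tuple


def toy_subwords(token: str) -> List[str]:
    """Toy tokenizer: split into pieces of at most 3 characters (assumption)."""
    pieces = [token[i:i + 3] for i in range(0, len(token), 3)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(tokens: List[str], labels: List[str], label2id: Dict[str, int],
                 max_length: int = 8, ignore_lbl_id: int = -100
                 ) -> Tuple[List[str], List[int], List[int]]:
    sw_tokens, label_ids = ["[CLS]"], [ignore_lbl_id]
    for token, label in zip(tokens, labels):
        pieces = toy_subwords(token)
        sw_tokens.extend(pieces)
        # Only the first subword keeps the label; the rest are ignored.
        label_ids.extend([label2id[label]] + [ignore_lbl_id] * (len(pieces) - 1))
    # Truncate to leave room for [SEP], then pad up to max_length.
    sw_tokens = sw_tokens[:max_length - 1] + ["[SEP]"]
    label_ids = label_ids[:max_length - 1] + [ignore_lbl_id]
    attention_mask = [1] * len(sw_tokens)
    padding = max_length - len(sw_tokens)
    sw_tokens += ["[PAD]"] * padding
    label_ids += [ignore_lbl_id] * padding
    attention_mask += [0] * padding
    # All fields must end up with the same length.
    assert len(sw_tokens) == len(label_ids) == len(attention_mask) == max_length
    return sw_tokens, label_ids, attention_mask


label2id = {"O": 0, "B-ORG": 1}
sw, ids, mask = align_labels(["EU", "rejects"], ["B-ORG", "O"], label2id)
```

Positions labelled with ignore_lbl_id are skipped by a loss function configured with the same ignore index, so only one prediction per original token contributes to training.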

transformertc.datatc.log_example_features(example: InputExampleTC, features: InputFeaturesTC, tokens: List[str]) → None¶: Logs an InputExampleTC and its conversion to InputFeaturesTC.

transformertc.datatc.convert_examples_to_features(examples: Sequence[InputExampleTC], labels: List[str], tokenizer: Any, max_length: int = 512, ignore_lbl_id: int = -100) → List[InputFeaturesTC]¶

Converts a sequence of InputExampleTC to a list of InputFeaturesTC.

Parameters

examples (list of InputExampleTC) – Sequence of InputExampleTC containing the examples to be converted to features.
labels (list of str) – the list of labels for the data set.
tokenizer (obj) – Instance of a transformer tokenizer that will tokenize the example tokens and convert them to model-specific ids.
max_length (int, optional) – the maximum length of the post-tokenized tokens and the respective associated fields in an InputFeaturesTC. Longer sequences will be truncated and shorter sequences will be padded. This length includes any special tokens that must be added, such as [CLS] and [SEP] in BERT. Defaults to 512.
ignore_lbl_id (int, optional) – a label id value to be ignored, used for subword tokens. This is typically negative, usually -1 or torch.nn.CrossEntropyLoss().ignore_index. Defaults to -100.

Returns

A list of task-specific InputFeaturesTC, one per InputExampleTC, which can be fed to the model.

transformertc.datatc.convert_tokens_to_example(tokens: List[str], guid: str = '', labels: List[str] = None) → InputExampleTC¶

Creates an InputExampleTC from a list of tokens.

Note

This function is meant to be used at inference time (i.e. not during training or evaluation) where data is expected not to have labels.

Parameters

tokens (list of str) – Sequence of tokens to be converted to an InputExampleTC.
guid (str, optional) – a unique identifier for the example. Defaults to empty string.
labels (list of str, optional) – a label for each corresponding token. Since this function is primarily made for inference, it defaults to a list of empty strings that exist for compatibility. However, a different string signifying no label could be used.

Returns

InputExampleTC containing the tokens passed as input.

transformertc.datatc.convert_features_to_pytorch_dataset(all_features: List[InputFeaturesTC]) → TensorDataset¶

Converts a list of features into a pytorch dataset.

Parameters

all_features (list of InputFeaturesTC) – the list of InputFeaturesTC originating from a list of InputExampleTC that will constitute the dataset.

Returns

A pytorch TensorDataset containing the features, with the attributes of the features occupying the following dimensions:

0 - input (token) ids
1 - attention mask
2 - token types (or segment ids)
3 - label ids

transformertc.datatc.convert_features_to_dataset(all_features: List[InputFeaturesTC], dataset_type: str = 'pytorch') → TensorDataset¶

Converts a list of features into a dataset.

Parameters

all_features (list of InputFeaturesTC) – the list of InputFeaturesTC originating from a list of InputExampleTC that will constitute the dataset.
dataset_type (str, optional) – the type of dataset; currently only pytorch is supported. Defaults to 'pytorch'.

Returns

A pytorch TensorDataset.

Raises

ValueError – If dataset_type is not supported.

class transformertc.datatc.DataBunchLoaderTC(data_dir: str, slen: int, backend: str = 'pytorch', cached_data_dir: str = None)¶

Bases: object

A class for grouping multiple subsets of a token classification dataset.

Training and evaluation commonly require train, development (or validation), and test subsets (or splits). Generally, all of these share the same parameters (e.g. max sequence length) and the same set of labels. Objects of this class contain all the required parameters and options for the three subsets of a token classification dataset.

See CONLLProcessor for the directory structure and data file requirements.

tokenizer¶: the (subword) tokenizer of the pretrained transformer model.

get_labels(self)¶: Returns the list of labels.

property tokenizer(self): Get the tokenizer.

_path_for_cached(self, subset: str = 'train')¶: Given a subset name, return the path for its cached data.

_load_cached(self, subset: str = 'train')¶: Load a cached subset.

_save_cached(self, features: List[InputFeaturesTC], subset: str = 'train')¶: Save features to a cache file.

_get_features(self, subset: str = 'train')¶

Features for a subset from either cache or by generating them.

If features are not already cached and if cached_data_dir is not None, it will cache the features generated.

Returns: List of InputFeaturesTC for the subset.
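The cache-or-generate behaviour described above could look roughly like the following sketch. The pickle format, the file naming scheme, and the helper names (`get_features`, `build`) are assumptions made for illustration, not the library's actual implementation.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def get_features(subset: str, build: Callable[[str], List[dict]],
                 cached_data_dir: Optional[str] = None) -> List[dict]:
    """Return features for a subset, reading or writing a cache if configured."""
    if cached_data_dir is None:
        return build(subset)  # caching disabled
    path = Path(cached_data_dir) / f"{subset}.features.pkl"  # hypothetical name
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)  # cache hit
    features = build(subset)  # cache miss: generate, then store
    with path.open("wb") as handle:
        pickle.dump(features, handle)
    return features


calls = []


def build(subset: str) -> List[dict]:
    calls.append(subset)  # record each time features are generated
    return [{"input_ids": [101, 102]}]


with tempfile.TemporaryDirectory() as cache_dir:
    first = get_features("train", build, cache_dir)
    second = get_features("train", build, cache_dir)  # served from the cache
```

In this sketch the second call never reaches `build`, which is the point of caching: tokenizing a large subset is done once and reused across runs.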

get_train(self, batch_size: int)¶: Returns the train dataloader.

get_dev(self, batch_size: int)¶: Returns the dev dataloader.

get_test(self, batch_size: int = 1)¶: Returns the test dataloader.
