transformertc.datatc¶
Data loading functions for token-level classification with BERT. Reads data in the CONLL format.
Module Contents¶
-
transformertc.datatc.ExampleAttType¶
-
transformertc.datatc.logger¶
-
class
transformertc.datatc.InputExampleTC(guid: str, tokens: List[str], labels: List[str] = None)¶ Bases: object
A single training/test example for token classification.
Note
This class is a structure to hold examples with included serialization and deserialization methods. Therefore, the __init__ method’s arguments are also the class attributes.
-
guid¶ Unique id for the example. Usually including its subset name (e.g. train-5).
- Type
str
-
tokens¶ The sequence of tokens.
- Type
list of str
-
labels¶ The sequence of labels. Defaults to None.
- Type
list of str, optional
-
__repr__(self)¶ String representation of the object.
-
to_dict(self)¶ Serializes this instance to a Python dictionary.
-
to_json_string(self)¶ Serializes this instance to a JSON string.
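The note above says the constructor arguments double as attributes and that the class can serialize itself. A minimal sketch of that shape, assuming a plain dataclass stands in for InputExampleTC; the name ExampleSketch and the JSON formatting options are illustrative assumptions, not the library's implementation:

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ExampleSketch:
    """Hypothetical stand-in for InputExampleTC (not the library class)."""

    guid: str
    tokens: List[str]
    labels: Optional[List[str]] = None

    def to_dict(self) -> dict:
        # asdict copies the field values, so the result is independent
        # of the instance it was built from.
        return asdict(self)

    def to_json_string(self) -> str:
        # Indentation and key order are assumptions chosen for readability.
        return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"


example = ExampleSketch(guid="train-5", tokens=["EU", "rejects"], labels=["B-ORG", "O"])

# Round trip: JSON string -> dict -> new instance with the same fields.
restored = ExampleSketch(**json.loads(example.to_json_string()))
```

Because the fields mirror the constructor arguments, deserialization in this sketch is just unpacking the parsed dictionary back into the constructor.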
-
-
class
transformertc.datatc.InputFeaturesTC(input_ids: List[int], attention_mask: List[int], token_type_ids: List[int], label_ids: List[int], token_positions: List[int])¶ Bases: object
A single set of features of data, probably corresponding to a single InputExampleTC.
In the context of the transformers library, features refers to a transformer’s input, e.g. (subword) token ids, attention masks, segment ids, label ids, etc.
Note
This class is a structure to hold features with included serialization and deserialization methods. Therefore, the __init__ method’s arguments are also the class attributes.
-
input_ids¶ Indices of input tokens in the vocabulary.
- Type
list of int
-
attention_mask¶ Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1], usually: 1 for tokens that are NOT MASKED, 0 for MASKED (padded) tokens.
- Type
list of int
-
token_type_ids¶ Segment token indices to indicate first and second portions of the inputs.
- Type
list of int
-
label_ids¶ Label ids corresponding to the input token ids.
- Type
list of int
-
token_positions¶ positions of original tokens.
- Type
list of int
-
__eq__(self, other)¶ Comparison with other objects.
-
__repr__(self)¶ String representation of the object.
-
to_dict(self)¶ Serializes this instance to a Python dictionary.
-
to_json_string(self)¶ Serializes this instance to a JSON string.
-
-
class
transformertc.datatc.ResultTC(name: str, token_start: int, token_end: int, label: str, score: float = -1.0)¶ Bases: object
A single set of results from inference (token classification).
Note
This class is a structure to hold results with included serialization and deserialization methods. Therefore, the __init__ method’s arguments are also the class attributes.
-
name¶ the token or tokens classified. For NER this contains the named entity mention.
- Type
str
-
label¶ the label associated with the token or tokens. For NER, this is the entity type e.g. ORG.
- Type
str
-
token_start¶ the position of the first token for this result. For NER, this is the position of the first token in the mention.
- Type
int
-
token_end¶ the position of the last token.
- Type
int
-
score¶ the score or probability associated with this result. For NER, this is the mean of the individual token scores, with each token score being the softmax output. A negative number can be used to indicate the absence of a score. Defaults to -1.
- Type
float
Example:
    'If the film In the Mood for Love , ...'
      0  1   2    3  4   5    6   7    8 ...
    ResultTC(
        name="In the Mood for Love",
        label="WORK_OF_ART",
        token_start=3,
        token_end=7,
        score=0.9969802618026733
    )
-
__repr__(self)¶ String representation of object.
-
to_dict(self)¶ Serializes this instance to a Python dictionary.
-
to_json_string(self)¶ Serializes this instance to a JSON string.
-
to_txt_string(self)¶ Serializes this instance to a JSON string without formatting.
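The ResultTC example above implies that token_end is inclusive and, for NER, that the score is the mean of the per-token scores. A minimal sketch of building such a result from a token span, assuming a NamedTuple stands in for ResultTC; ResultSketch, span_to_result, the space-joined name, and the score values are all hypothetical:

```python
from statistics import mean
from typing import List, NamedTuple


class ResultSketch(NamedTuple):
    """Hypothetical stand-in for ResultTC (not the library class)."""

    name: str
    token_start: int
    token_end: int
    label: str
    score: float = -1.0  # negative means "no score available"


def span_to_result(tokens: List[str], token_scores: List[float],
                   start: int, end: int, label: str) -> ResultSketch:
    # token_end is inclusive, so the slice runs to end + 1.
    return ResultSketch(
        name=" ".join(tokens[start:end + 1]),  # joining with spaces is an assumption
        token_start=start,
        token_end=end,
        label=label,
        score=mean(token_scores[start:end + 1]),
    )


tokens = "If the film In the Mood for Love , ...".split()
token_scores = [0.5] * len(tokens)  # made-up scores, for illustration only
result = span_to_result(tokens, token_scores, 3, 7, "WORK_OF_ART")
```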
-
-
class
transformertc.datatc.DataProcessor¶ Bases: object
Base class for data converters for classification data sets.
Made to support both sequence classification and token classification tasks.
-
abstract
get_train_examples(self)¶ Gets a collection of InputExampleTC for the train set.
-
abstract
get_dev_examples(self)¶ Gets a collection of InputExampleTC for the dev set.
-
abstract
get_test_examples(self)¶ Gets a collection of InputExampleTC for the test set.
-
abstract
get_labels(self)¶ Gets the list of labels for this data set.
-
classmethod
_read_tsv(cls, input_file: str, quotechar: str = None)¶ Reads a tab separated value file.
- Parameters
input_file (str) – the path to the file to read.
quotechar (str, optional) – the CSV quotechar. Defaults to None.
-
classmethod
_read_conll_file(cls, input_file: str, label_col: int = -1)¶ Reads a CONLL format text file.
- Parameters
input_file (str) – the path to the file to read.
label_col (int, optional) – the index of the CONLL column to read as the label. Defaults to -1, which in Python indexing means the last column. Column 0 is assumed to hold the tokens.
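The column convention described for _read_conll_file can be sketched as follows. This is not the library's reader: read_conll_lines is a hypothetical helper, and treating blank lines and -DOCSTART- rows as sentence boundaries is an assumption. The sample text includes one made-up extra sentence to show the grouping.

```python
from typing import Iterable, List, Tuple

Sentence = Tuple[List[str], List[str]]


def read_conll_lines(lines: Iterable[str], label_col: int = -1) -> List[Sentence]:
    """Group CONLL rows into (tokens, labels) pairs, one pair per sentence."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    labels: List[str] = []
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and -DOCSTART- rows end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])          # column 0 holds the token
        labels.append(columns[label_col])  # -1 selects the last column
    if tokens:
        sentences.append((tokens, labels))
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC

Peter NNP B-NP B-PER
""".splitlines()

ner_sentences = read_conll_lines(sample)                 # last column (NER tags)
chunk_sentences = read_conll_lines(sample, label_col=2)  # third column (chunk tags)
```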
-
-
class
transformertc.datatc.CONLLProcessor(data_dir: str, label_col: int = -1)¶ Bases: transformertc.datatc.DataProcessor
Processor for CONLL-style datasets.
The dataset is assumed to have the following structure:
    data_dir
    |-- train.txt
    |-- dev.txt
    |-- test.txt
    |-- labels.txt
With labels.txt being an ordered, line-delimited list of labels e.g.:
    O
    B-PER
    I-PER
    ...
While train.txt, dev.txt, and test.txt are CONLL-style formatted files, e.g.:
    -DOCSTART- -X- -X- O
    EU NNP B-NP B-ORG
    rejects VBZ B-VP O
    German JJ B-NP B-MISC
-
get_train_examples(self)¶ See base class.
-
get_dev_examples(self)¶ See base class.
-
get_test_examples(self)¶ See base class.
-
get_labels(self)¶ See base class.
-
_create_examples(self, lines, set_type)¶ Creates examples for the different subsets (splits).
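Since labels.txt is described as an ordered, line-delimited list, a label-to-id mapping of the kind convert_example_to_features expects can be derived from line order. A minimal sketch under that assumption; read_labels and build_label2id are hypothetical helpers, not part of the module, and the temporary directory only stands in for a real data_dir:

```python
import os
import tempfile
from typing import Dict, List


def read_labels(data_dir: str) -> List[str]:
    """Read labels.txt, keeping the file order and skipping empty lines."""
    path = os.path.join(data_dir, "labels.txt")
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def build_label2id(labels: List[str]) -> Dict[str, int]:
    # Assumption: a label's id is its line position in labels.txt.
    return {label: index for index, label in enumerate(labels)}


# Write a tiny labels.txt into a throwaway directory for the demonstration.
with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "labels.txt"), "w", encoding="utf-8") as handle:
        handle.write("O\nB-PER\nI-PER\n")
    labels = read_labels(data_dir)

label2id = build_label2id(labels)
```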
-
-
transformertc.datatc.convert_example_to_features(example: InputExampleTC, label2id: Dict[str, int], tokenizer: Any, max_length: int = 512, ignore_lbl_id: int = -100) → Tuple[InputFeaturesTC, List[str]]¶ Converts a single InputExampleTC to an InputFeaturesTC.
- Parameters
example (InputExampleTC) – an example to convert to features.
label2id (dict of str to int) – a dictionary mapping label strings to their respective label ids.
tokenizer (obj) – the transformer tokenizer object.
max_length (int) – the maximum length of the post-tokenized tokens and the respective associated fields in an InputFeaturesTC. Longer sequences will be truncated, shorter sequences will be padded. This length includes any special tokens that must be added, such as [CLS] and [SEP] in BERT.
ignore_lbl_id (int, optional) – a value of a label id to be ignored, used for subword tokens. This is typically negative. Usually -1 or torch.nn.CrossEntropyLoss().ignore_index.
- Returns
tuple containing:
- features (InputFeaturesTC) containing the data in example converted into features. Given a task-specific InputExampleTC it should return a task-specific InputFeaturesTC for that task.
- sw_tokens (list of str) containing a list of (probably) subword tokens as tokenized by the tokenizer.
- Return type
(tuple)
- Raises
AssertionError – If the lengths of the respective InputFeaturesTC fields do not match.
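The truncation, padding, and ignore_lbl_id behaviour described above can be sketched without a real tokenizer. Everything here is an assumption for illustration: toy_tokenize is a made-up subword splitter, align_labels is a hypothetical helper, and the convention that only the first subword of a token keeps its label (with special and padding positions also ignored) is a common choice rather than something this page confirms.

```python
from typing import Dict, List, Tuple


def toy_tokenize(token: str) -> List[str]:
    """Hypothetical subword tokenizer: 3-character pieces, '##' on continuations."""
    pieces = [token[i:i + 3] for i in range(0, len(token), 3)] or [token]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]


def align_labels(tokens: List[str], labels: List[str], label2id: Dict[str, int],
                 max_length: int = 10, ignore_lbl_id: int = -100
                 ) -> Tuple[List[str], List[int], List[int]]:
    sw_tokens: List[str] = []
    label_ids: List[int] = []
    for token, label in zip(tokens, labels):
        pieces = toy_tokenize(token)
        sw_tokens.extend(pieces)
        # Assumption: the first subword carries the label, the rest are ignored.
        label_ids.extend([label2id[label]] + [ignore_lbl_id] * (len(pieces) - 1))

    # max_length includes [CLS] and [SEP], so only max_length - 2 subwords fit.
    keep = max_length - 2
    sw_tokens = ["[CLS]"] + sw_tokens[:keep] + ["[SEP]"]
    label_ids = [ignore_lbl_id] + label_ids[:keep] + [ignore_lbl_id]
    attention_mask = [1] * len(sw_tokens)

    # Pad every field to max_length; padded positions get mask 0.
    padding = max_length - len(sw_tokens)
    sw_tokens += ["[PAD]"] * padding
    label_ids += [ignore_lbl_id] * padding
    attention_mask += [0] * padding

    # Mirrors the documented length check on the resulting fields.
    assert len(sw_tokens) == len(label_ids) == len(attention_mask) == max_length
    return sw_tokens, label_ids, attention_mask


tokens = ["EU", "rejects", "German"]
labels = ["B-ORG", "O", "B-MISC"]
label2id = {"O": 0, "B-ORG": 1, "B-MISC": 2}
sw_tokens, label_ids, attention_mask = align_labels(tokens, labels, label2id)
```

With max_length=10 the six subwords plus two special tokens leave two padded positions; a smaller max_length truncates the subword sequence before [SEP] is appended.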
-
transformertc.datatc.log_example_features(example: InputExampleTC, features: InputFeaturesTC, tokens: List[str]) → None¶ Logs an InputExampleTC and its conversion to InputFeaturesTC.
-
transformertc.datatc.convert_examples_to_features(examples: Sequence[InputExampleTC], labels: List[str], tokenizer: Any, max_length: int = 512, ignore_lbl_id: int = -100) → List[InputFeaturesTC]¶ Converts a sequence of InputExampleTC to a list of InputFeaturesTC.
- Parameters
examples (list of InputExampleTC) – Sequence of InputExampleTC containing the examples to be converted to features.
tokenizer (obj) – Instance of a transformer tokenizer that will tokenize the example tokens and convert them to model-specific ids.
max_length (int) – the maximum length of the post-tokenized tokens and the respective associated fields in an InputFeaturesTC. Longer sequences will be truncated, shorter sequences will be padded. This length includes any special tokens that must be added, such as [CLS] and [SEP] in BERT.
ignore_lbl_id (int, optional) – a value of a label id to be ignored, used for subword tokens. This is typically negative. Usually -1 or torch.nn.CrossEntropyLoss().ignore_index.
- Returns
If the input is a list of InputExampleTC, will return a list of task-specific InputFeaturesTC which can be fed to the model.
-
transformertc.datatc.convert_tokens_to_example(tokens: List[str], guid: str = '', labels: List[str] = None) → InputExampleTC¶ Creates an InputExampleTC from a list of tokens.
Note
This function is meant to be used at inference time (i.e. not during training or evaluation), where data is expected not to have labels.
- Parameters
tokens (list of str) – Sequence of tokens to be converted to an InputExampleTC.
guid (str, optional) – a unique identifier for the example. Defaults to empty string.
labels (list of str, optional) – a label for each corresponding token. Since this function is primarily made for inference, it defaults to a list of empty strings that exist for compatibility. However, a different string signifying no label could be used.
- Returns
InputExampleTC containing the tokens passed as input.
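The default-label behaviour described above (one empty string per token when no labels are given) can be sketched like this. tokens_to_example_sketch is a hypothetical helper that returns a plain dictionary in place of an InputExampleTC, and the length check is an added assumption rather than documented behaviour:

```python
from typing import Dict, List, Optional


def tokens_to_example_sketch(tokens: List[str], guid: str = "",
                             labels: Optional[List[str]] = None) -> Dict[str, object]:
    """Build an example-like dict; a stand-in for InputExampleTC."""
    if labels is None:
        # Inference data has no labels, so fill in one placeholder per token.
        labels = [""] * len(tokens)
    if len(labels) != len(tokens):
        # Assumption: mismatched lengths are treated as an error.
        raise ValueError("expected exactly one label per token")
    return {"guid": guid, "tokens": list(tokens), "labels": list(labels)}


example = tokens_to_example_sketch(["EU", "rejects", "German"], guid="infer-0")
```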
-
transformertc.datatc.convert_features_to_pytorch_dataset(all_features: List[InputFeaturesTC]) → TensorDataset¶ Converts a list of features into a PyTorch dataset.
- Parameters
all_features (list of InputFeaturesTC) – the list of InputFeaturesTC originating from a list of InputExampleTC that will constitute the dataset.
- Returns
A PyTorch TensorDataset containing the features, with the attributes of the features occupying the following dimensions:
    0 - input (token) ids
    1 - attention mask
    2 - token types (or segment ids)
    3 - label ids
-
transformertc.datatc.convert_features_to_dataset(all_features: List[InputFeaturesTC], dataset_type: str = 'pytorch') → TensorDataset¶ Converts a list of features into a dataset.
- Parameters
all_features (list of InputFeaturesTC) – the list of InputFeaturesTC originating from a list of InputExampleTC that will constitute the dataset.
dataset_type (str) – the type of dataset; currently only pytorch is supported.
- Returns
A PyTorch TensorDataset.
- Raises
ValueError – if dataset_type is not supported.
-
class
transformertc.datatc.DataBunchLoaderTC(data_dir: str, slen: int, backend: str = 'pytorch', cached_data_dir: str = None)¶ Bases: object
A class for grouping multiple subsets of a token classification dataset.
The idea is that training and evaluation commonly require train, development (or validation), and test subsets (or splits). Generally, all of these share the same parameters (e.g. max sequence length), the same set of labels, etc. This class is made for objects that contain all the required parameters and options for the 3 subsets of a token classification dataset.
See CONLLProcessor for the directory structure and data file requirements.
-
tokenizer¶ the (subword) tokenizer of the pretrained transformer model.
-
get_labels(self)¶ Returns the list of labels.
-
property
tokenizer(self) Get the tokenizer.
-
_path_for_cached(self, subset: str = 'train')¶ Given a subset name, return the path for its cached data.
-
_load_cached(self, subset: str = 'train')¶ Load a cached subset.
-
_save_cached(self, features: List[InputFeaturesTC], subset: str = 'train')¶ Save features to a cache file.
-
_get_features(self, subset: str = 'train')¶ Features for a subset from either cache or by generating them.
If features are not already cached and if cached_data_dir is not None, it will cache the features generated.
- Returns
List of InputFeaturesTC for the subset.
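The cache-or-generate behaviour described for _get_features can be sketched as below. This is an illustration under stated assumptions, not the class's code: get_features_sketch, the cached_<subset>.json file name, the JSON storage format, and the fake_generate callback are all hypothetical.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Optional

Features = List[Dict[str, List[int]]]


def get_features_sketch(subset: str, generate: Callable[[str], Features],
                        cached_data_dir: Optional[str] = None) -> Features:
    """Return features for a subset, using a cache file when caching is enabled."""
    if cached_data_dir is None:
        return generate(subset)  # caching disabled: always regenerate
    # Assumption: one cache file per subset, named after the subset.
    path = os.path.join(cached_data_dir, f"cached_{subset}.json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    features = generate(subset)
    os.makedirs(cached_data_dir, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(features, handle)
    return features


calls: List[str] = []


def fake_generate(subset: str) -> Features:
    # Records each call so the cache hit is observable; values are made up.
    calls.append(subset)
    return [{"input_ids": [101, 7, 102], "label_ids": [-100, 1, -100]}]


with tempfile.TemporaryDirectory() as cache_dir:
    first = get_features_sketch("train", fake_generate, cache_dir)   # generates and caches
    second = get_features_sketch("train", fake_generate, cache_dir)  # served from the cache
```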
-
get_train(self, batch_size: int)¶ Returns the train dataloader.
-
get_dev(self, batch_size: int)¶ Returns the dev dataloader.
-
get_test(self, batch_size: int = 1)¶ Returns the test dataloader.
-