transformertc.berttc

BertTC class: BERT for token classification tasks.

For examples of how to use this class, see the examples directory.

Module Contents

class transformertc.berttc.BertTC(config: BertConfig, configtc: ConfigTC, tokenizer: BertTokenizer, model: BertForTokenClassification)

Bases: object

BertTC class: BERT for token classification tasks.

This class allows:
  • loading pretrained and/or fine-tuned BERT models;

  • fine-tuning (really, training) models;

  • using fine-tuned models for classification (inference).

It acts primarily as a wrapper around a transformer model, its config object, and its tokenizer, bundled together with a handful of useful functions: save/load and fine-tune/classify.

config

pretrained BertConfig object from transformers.

Type

BertConfig

configtc

ConfigTC object.

Type

ConfigTC

tokenizer

pretrained BertTokenizer from transformers.

Type

BertTokenizer

model

pretrained BERT model with the token classification layers added in (but not necessarily trained).

Type

BertForTokenClassification

Sets the attributes.

save_pretrained(self, save_directory: str)

Save to a given directory.

to(self, device)

Send model to a specific device.

classmethod from_pretrained(cls, model_path)

Load from a given path.
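
The save/load methods above can be combined into a round trip. The following is a minimal sketch, not part of the library: save_and_reload is a hypothetical helper, and it assumes only the three documented methods (save_pretrained, from_pretrained, to). The "cpu" default for device is also an assumption, since the accepted device values are not documented here.

```python
def save_and_reload(model, save_directory: str, device: str = "cpu"):
    """Persist a BertTC-like object, load it back, and move it to a device.

    `model` is duck-typed: any object exposing save_pretrained(), the
    from_pretrained() classmethod, and to() as documented above will work.
    This helper is hypothetical and is not provided by transformertc.
    """
    # Write the wrapper's state to the given directory.
    model.save_pretrained(save_directory)
    # Rebuild a fresh instance of the same class from that directory.
    reloaded = type(model).from_pretrained(save_directory)
    # The docs do not say whether to() returns anything, so ignore its result.
    reloaded.to(device)
    return reloaded
```

Because the return value of to() is not documented, the sketch keeps a reference to the reloaded object instead of chaining the call.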

classmethod create_from_pretrained(cls, model_name_or_path, labels, max_seq_length=0, task_format='BIO')

classify(self, texts: List[List[str]], batch_size: int = None, n_jobs: int = -1, progressbar: bool = False)

Classify a list of tokenized texts with this model.

Parameters
  • texts (list of list of str) – list of (word) tokenized documents (e.g. sentences). Example: [['This', 'is', '1'], ['And', 'this', 'is', '2']].

  • batch_size (int) – size of the batches to use. Defaults to None, in which case len(texts) is tried as the batch size.

  • n_jobs (int) – number of threads/processes to use when converting texts to features (i.e. InputFeaturesTC). Defaults to -1, which uses as many as there are CPU cores.

  • progressbar (bool) – show a progress bar (via TQDM) while classifying.

Returns

A list of lists of ResultTC corresponding to the list of texts.
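
The input shape and the batch_size default described above can be illustrated with plain Python. This is a sketch under stated assumptions: tokenize_on_whitespace and iter_batches are hypothetical helpers, whitespace splitting stands in for whatever word tokenizer you actually use, and the batching logic only mirrors the documented default (None means one batch of len(texts)), not the library's internal implementation.

```python
from typing import Iterator, List, Optional


def tokenize_on_whitespace(sentences: List[str]) -> List[List[str]]:
    """Turn raw sentences into the list-of-token-lists shape classify() expects."""
    return [sentence.split() for sentence in sentences]


def iter_batches(
    texts: List[List[str]], batch_size: Optional[int] = None
) -> Iterator[List[List[str]]]:
    """Yield consecutive batches; None falls back to a single batch of len(texts)."""
    size = batch_size if batch_size is not None else max(1, len(texts))
    if size < 1:
        raise ValueError("batch_size must be a positive integer or None")
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


texts = tokenize_on_whitespace(["This is 1", "And this is 2"])
# texts is now [['This', 'is', '1'], ['And', 'this', 'is', '2']]

# With a BertTC instance called `model` (not constructed here), the call
# would then look like: results = model.classify(texts, batch_size=2)
```

For large inputs, passing an explicit batch_size is likely safer than relying on the default, since a single batch of len(texts) may not fit in memory.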

finetune(self, dataloader, epochs: int = 4, lr: float = 5e-05, wdecay: float = 0.0, warmup_steps: int = 0, adam_epsilon: float = 1e-08, progressbar: bool = False)

Fine-tune pretrained model on a TC task.

Parameters
  • dataloader (DataLoader) – a PyTorch DataLoader.

  • epochs (int) – number of epochs to fine-tune for.

  • lr (float) – the learning rate.

  • wdecay (float) – weight decay.

  • warmup_steps (int) – number of steps to run linear warmup for.

  • adam_epsilon (float) – epsilon parameter for Adam optimizer.

  • progressbar (bool) – show a TQDM progress bar during fine-tuning.
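
To make the interaction between lr and warmup_steps concrete, here is a small sketch of a linear warmup schedule. It is an assumption that finetune follows this exact shape: the documentation above only says the warmup is linear. The linear decay after warmup mirrors the common schedule in the transformers library, and the total step count (epochs times batches per epoch) is likewise assumed. linear_warmup_factor is a hypothetical helper.

```python
def linear_warmup_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step.

    Rises linearly from 0 to 1 over `warmup_steps`, then decays linearly
    to 0 at `total_steps` (the decay phase is an assumption, see above).
    """
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


base_lr = 5e-05          # the documented default for lr
total_steps = 4 * 25     # assumed: epochs=4 and a dataloader with 25 batches

# Effective learning rate at a few steps with 10 warmup steps.
schedule = {
    step: base_lr * linear_warmup_factor(step, warmup_steps=10, total_steps=total_steps)
    for step in (0, 5, 10, 55, 100)
}
```

With the default warmup_steps=0 the factor starts at 1, so training begins at the full learning rate.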