Attention: This MediaPipe Solutions Preview is an early release.

mediapipe_model_maker.text_classifier.Dataset

Dataset library for text classifier.

Inherits From: ClassificationDataset, Dataset

mediapipe_model_maker.text_classifier.Dataset(
    dataset: tf.data.Dataset,
    label_names: List[str],
    tfrecord_cache_files: Optional[cache_files_lib.TFRecordCacheFiles] = None,
    size: Optional[int] = None
)

Args
`dataset`	A tf.data.Dataset object that contains a potentially large set of elements, where each element is a pair of (input_data, target). The `input_data` is the raw input data, such as an image or text, while the `target` is the ground truth for that input, e.g. the classification label of the image.
`size`	The size of the dataset. tf.data.Dataset doesn't provide a function to get the length directly, since it is lazy-loaded and may be infinite.

Attributes
`label_names`
`num_classes`
`size`	Returns the size of the dataset. Same functionality as calling len. See the len method definition for more information.

Methods

`from_csv`

@classmethod
from_csv(
    filename: str,
    csv_params: mediapipe_model_maker.text_classifier.CSVParams,
    shuffle: bool = True,
    cache_dir: Optional[str] = None,
    num_shards: int = 1
) -> 'Dataset'

Loads text with labels from a CSV file.

Args
`filename`	Name of the CSV file.
`csv_params`	Parameters used for reading the CSV file.
`shuffle`	If True, randomly shuffle the data.
`cache_dir`	Optional parameter to specify where to store the preprocessed dataset. Only used for BERT models.
`num_shards`	Optional number of shards for the preprocessed dataset. Note that using more than 1 shard will reorder the dataset. Only used for BERT models.

Returns
Dataset containing (text, label) pairs and other related info.
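
A minimal sketch of preparing input for `from_csv`, assuming a two-column file with a header row. The column names `text` and `label`, the sample rows, and the helper names `write_sample_csv` and `load_dataset` are illustrative assumptions, and the `CSVParams` fields used here (`text_column`, `label_column`, `delimiter`) should be checked against the installed version.

```python
import csv


def write_sample_csv(path):
    """Write a tiny two-column CSV with a header row (illustrative data)."""
    rows = [
        {"text": "great movie", "label": "positive"},
        {"text": "terrible plot", "label": "negative"},
    ]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return rows


def load_dataset(path):
    """Load the CSV via from_csv; requires mediapipe-model-maker."""
    # Imported lazily so the CSV helper above works without the package.
    from mediapipe_model_maker import text_classifier

    # Assumed CSVParams fields; verify against your installed version.
    params = text_classifier.CSVParams(
        text_column="text", label_column="label", delimiter=","
    )
    return text_classifier.Dataset.from_csv(filename=path, csv_params=params)
```

Calling `load_dataset("sample.csv")` after `write_sample_csv("sample.csv")` would be the typical order; `cache_dir` and `num_shards` are left at their defaults here because they apply only to BERT models.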

`gen_tf_dataset`

gen_tf_dataset(
    batch_size: int = 1,
    is_training: bool = False,
    shuffle: bool = False,
    preprocess: Optional[Callable[..., Any]] = None,
    drop_remainder: bool = False
) -> tf.data.Dataset

Generates a batched tf.data.Dataset for training/evaluation.

Args
`batch_size`	An integer, the returned dataset will be batched by this size.
`is_training`	A boolean, when True, the returned dataset will be optionally shuffled and repeated as an endless dataset.
`shuffle`	A boolean, when True, the returned dataset will be shuffled to create randomness during model training.
`preprocess`	A function taking three arguments, in order: feature, label, and the boolean is_training.
`drop_remainder`	A boolean, whether to drop the final batch if it has fewer than batch_size elements.

Returns
A TF dataset ready to be consumed by Keras model.
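
The effect of `batch_size` and `drop_remainder` can be illustrated with a plain-Python sketch. This is not the library's implementation, which operates on a tf.data.Dataset; the helper name `batch` and the sample list are hypothetical and only mirror the batching semantics described above.

```python
def batch(items, batch_size, drop_remainder=False):
    """Group items into lists of batch_size; optionally drop a short final batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    # A final batch shorter than batch_size is the "remainder".
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


examples = ["a", "b", "c", "d", "e"]
full = batch(examples, batch_size=2)
trimmed = batch(examples, batch_size=2, drop_remainder=True)
```

With five elements and a batch size of 2, `full` keeps a final single-element batch while `trimmed` discards it, which matters when a model expects a fixed batch dimension.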

`split`

split(
    fraction: float
) -> Tuple[ds._DatasetT, ds._DatasetT]

Splits dataset into two sub-datasets with the given fraction.

Primarily used for splitting the data set into training and testing sets.

Args
`fraction`	A float, the fraction of the original data that goes into the first returned sub-dataset.

Returns
The two sub-datasets produced by the split.
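
A plain-Python sketch of a fraction-based split, assuming the first sub-dataset takes the leading `int(size * fraction)` elements and the second takes the rest. The helper name `split_by_fraction` is hypothetical, and the library's actual split logic may differ in detail.

```python
def split_by_fraction(items, fraction):
    """Return (first, second), where first holds the leading share of items."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    cut = int(len(items) * fraction)  # assumed rounding: truncate toward zero
    return items[:cut], items[cut:]


train, test = split_by_fraction(list(range(10)), 0.8)
```

Because the split takes a prefix, shuffling the data when it is loaded helps keep both parts representative.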

`__len__`

__len__() -> int

Returns the number of elements in the dataset.

If size is not set, this method will fall back to using the __len__ method of the tf.data.Dataset in self._dataset. Calling len on a tf.data.Dataset instance may throw a TypeError because the dataset may be lazy-loaded with an unknown size or have infinite size.

In most cases, however, when an instance of this class is created by a helper function such as `from_csv`, the size of the dataset is computed in advance and the _size instance variable is already set.

Raises
TypeError if self._size is not set and the cardinality of self._dataset is INFINITE_CARDINALITY or UNKNOWN_CARDINALITY.
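
The fallback behaviour can be sketched in plain Python, with a generator standing in for a lazily loaded dataset of unknown length. The class name `SizedWrapper` is hypothetical, and this is an illustration of the pattern rather than the library's code.

```python
class SizedWrapper:
    """Wraps data and reports an explicit size when one was provided."""

    def __init__(self, data, size=None):
        self._data = data
        self._size = size

    def __len__(self):
        if self._size is not None:
            return self._size
        # Falls back to the wrapped object; raises TypeError if it has no len().
        return len(self._data)


def lazy_items():
    """A generator has no known length, like a lazily loaded dataset."""
    for i in range(3):
        yield i


known = len(SizedWrapper([1, 2, 3]))
explicit = len(SizedWrapper(lazy_items(), size=3))
```

Passing `size` explicitly is what lets `len` succeed for the generator; without it, `len(SizedWrapper(lazy_items()))` raises a TypeError.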
