API

Data

carpedm.data.download

Download scripts.

This module provides the interface for downloading raw datasets from their source.

Datasets Currently Available for Download
ID Dataset
pmjtc
provided by the Center for Open Data in the Humanities (CODH).

Example

Data may be downloaded externally using the provided script:

$ download_data --data-dir <download/to/this/directory> --data-id pmjtc

Note

If an expected data subdirectory already exists in the specified target data-dir, that data will not be downloaded, even if the subdirectory is empty. This should be fixed in a future version.
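
The skip behavior described in this note can be sketched as follows. This is a minimal illustration, not carpedm's implementation: the helper needs_download and the subdirectory names book_a and book_b are hypothetical.

```python
import os
import tempfile


def needs_download(data_dir, expected_subdirs):
    """Return the expected subdirectories that are missing from data_dir.

    Hypothetical helper: it mirrors the documented behavior, where a
    subdirectory that exists (even if empty) counts as already downloaded.
    """
    return [
        name for name in expected_subdirs
        if not os.path.isdir(os.path.join(data_dir, name))
    ]


# An empty subdirectory is enough to suppress the download of that data.
with tempfile.TemporaryDirectory() as data_dir:
    os.mkdir(os.path.join(data_dir, "book_a"))  # exists but is empty
    missing = needs_download(data_dir, ["book_a", "book_b"])
    print(missing)  # ['book_b']
```

Given the behavior in the note, removing an empty or partial subdirectory before rerunning download_data should cause that data to be fetched again.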

Todo

  • Update get_books_list once list is included in downloadables.
  • Check subdirectory contents.
  • Generalize download structure for other datasets.
carpedm.data.download.get_books_list(dataset='pmjtc')[source]

Retrieve list of books/images in dataset.

Parameters:dataset (str) – Identifier for dataset for which to retrieve information.
Returns:Names of dataset subdirectories and/or files.
Return type:list of str
carpedm.data.download.maybe_download(directory, dataset='pmjtc')[source]

Download character dataset if BOOKS not in directory.

Parameters:
  • directory (str) – Directory where dataset is located or should be saved.
  • dataset (str) – Identifier for dataset to download.

carpedm.data.io

Input and output.

This module provides functionality for reading and writing data.

Todo

  • Tests
    • DataWriter
    • CSVParser
class carpedm.data.io.CSVParser(csv_file, data_dir, bib_id)[source]

Utility class for parsing coordinate CSV files.

character(row)[source]

Convert CSV row to a Character object.

Returns:The next character
Return type:Character
characters()[source]

Generates rest of characters in CSV.

Yields:carpedm.data.util.Character – The next character.
parse_characters(charset)[source]

Generate metadata for single character images.

Parameters:charset (CharacterSet) – Character set.

A more efficient implementation of parse_sequences when image_scope='seq' and seq_len=1.

Only characters in the character set are included.

Returns:Single character image meta data.
Return type:list of carpedm.data.util.ImageMeta
parse_lines()[source]

Generate metadata for vertical lines of characters.

Characters not in character set or vocabulary will be labeled as unknown when converted to integer IDs.

Returns:Line image meta data.
Return type:list of carpedm.data.util.ImageMeta
parse_pages()[source]

Generate metadata for full page images.

Includes every character on page. Characters not in character set or vocabulary will be labeled as unknown when converted to integer IDs.

Returns:Page image meta data.
Return type:list of carpedm.data.util.ImageMeta
parse_sequences(charset, len_min, len_max)[source]

Generate metadata for images of character sequences.

Only includes sequences of chars in the desired character set. If len_min == len_max, sequence length is deterministic, else each sequence is of random length from [len_min, len_max].

Parameters:
  • charset (CharacterSet) – The character set.
  • len_min (int) – Minimum sequence length.
  • len_max (int) – Maximum sequence length.
Returns:

Sequence image meta data.

Return type:

list of carpedm.data.util.ImageMeta
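
The length rule for parse_sequences (fixed when len_min == len_max, otherwise random within [len_min, len_max]) can be sketched as below. The function name sample_sequence_length is hypothetical, and drawing uniformly with both bounds inclusive is an assumption; the documentation only says the length is random within the range.

```python
import random


def sample_sequence_length(len_min, len_max, rng=random):
    """Pick a sequence length following the documented rule.

    Assumption: lengths are drawn uniformly and both bounds are inclusive.
    """
    if len_min > len_max:
        raise ValueError("len_min must not exceed len_max")
    if len_min == len_max:
        return len_min  # deterministic length
    return rng.randint(len_min, len_max)  # random length in [len_min, len_max]


rng = random.Random(0)  # seeded generator for reproducible sampling
lengths = [sample_sequence_length(2, 5, rng) for _ in range(10)]
print(lengths)
```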

class carpedm.data.io.DataWriter(format_out, images, image_shape, vocab, chunk, character, line, label, bbox, subdirs)[source]

Utility for writing data to disk in various formats.

available_formats

list – The available formats.

References

Heavy modification of _process_dataset in the input pipeline for the TensorFlow im2txt models.

write(fname_prefix, num_threads, num_shards)[source]

Write data to disk.

Parameters:
  • fname_prefix (str) – Path base for data files.
  • num_threads (int) – Number of threads to run in parallel.
  • num_shards (int) – Total number of shards to write, if any.
Returns:

Total number of examples written.

Return type:

int

carpedm.data.lang

Language-specific and unicode utilities.

Todo

  • Variable UNK token in Vocabulary
class carpedm.data.lang.CharacterSet(charset, name=None)[source]

Character set abstract class.

in_charset(unicode)[source]

Check if a character is in the defined character set.

Parameters:unicode (str) – String representation of unicode value.
presets

Pre-defined character sets.

Returns:Character set IDs.
Return type:list of str
class carpedm.data.lang.JapaneseUnicodes(charset)[source]

Utility for accessing and manipulating Japanese character unicodes.

Inherits from CharacterSet.

Unicode ranges taken from [1] with edits for exceptions.

References

[1] http://www.unicode.org/charts/

presets()[source]

Pre-defined character sets.

Returns:Character set IDs.
Return type:list of str
class carpedm.data.lang.Vocabulary(reserved, vocab)[source]

Simple vocabulary wrapper.

References

Lightly modified TensorFlow “im2txt” Vocabulary.

char_to_id(char)[source]

Returns the integer id of a character string.

get_num_classes()[source]

Returns the number of classes, including <UNK>.

get_num_reserved()[source]

Returns number of reserved IDs.

id_to_char(char_id)[source]

Returns the character string of an integer id.

carpedm.data.lang.char2code(unicode)[source]

Returns the ASCII representation of the code point (e.g. U+3055) for a unicode character.

Parameters:unicode (str) –
Raises:TypeError – string is length two.
carpedm.data.lang.code2char(code)[source]

Returns the unicode string for the character.

carpedm.data.lang.code2hex(code)[source]

Returns hex integer for a unicode string.

The argument code can be either an ASCII representation (e.g. U+3055, <UNK>) or a unicode character.

Parameters:code (str) – Code to convert.
Returns:
Return type:int
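
The conversions between a character, its U+XXXX code string, and the integer code point can be illustrated with the standard library alone. The helpers below (code_to_int, code_to_char, char_to_code) are hypothetical stand-ins, not the carpedm functions; handling of reserved tokens such as <UNK> is deliberately left out because the source does not specify it.

```python
def code_to_int(code):
    """Convert a 'U+XXXX' string to its integer code point."""
    if not code.upper().startswith("U+"):
        raise ValueError("expected a code of the form U+XXXX")
    return int(code[2:], 16)


def code_to_char(code):
    """Convert a 'U+XXXX' string to the character it names."""
    return chr(code_to_int(code))


def char_to_code(char):
    """Convert a single character to its 'U+XXXX' string.

    Assumption: anything other than a single character is rejected
    with TypeError.
    """
    if len(char) != 1:
        raise TypeError("expected a single character")
    return "U+{:04X}".format(ord(char))


print(hex(code_to_int("U+3055")))  # 0x3055
print(char_to_code(code_to_char("U+3055")))  # U+3055
```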

carpedm.data.meta

Image metadata management.

This module loads and manages metadata stored as CSV files in the raw data directory.

carpedm.data.meta.DEFAULT_SEED

int – The default random seed.

Examples

import carpedm as dm

Load, view, and generate a dataset of single kana characters.

single_kana = dm.data.MetaLoader(data_dir=dm.data.sample, image_scope='char', charset=dm.data.CharacterSet('kana'))
single_kana.view_images(subset='train', shape=(64,64))
single_kana.generate_dataset(out_dir='/tmp/pmjtc_data', subset='train')

Load and view a dataset of sequences of 3 kanji.

kanji_seq = dm.data.MetaLoader(data_dir=dm.data.sample, image_scope='seq', seq_len=3, charset=dm.data.CharacterSet('kanji'))
kanji_seq.view_images(subset='dev', shape=(None, 64))

Load and view a dataset of full pages.

full_page = dm.data.MetaLoader(data_dir=dm.data.sample, image_scope='page', charset=dm.data.CharacterSet('all'))
full_page.view_images(subset='test', shape=None)

Note

Unless stated otherwise, image shape arguments in this module should be a tuple (height, width). Tuple values may be one of the following:

  1. int
    specifies the absolute size (in pixels) for that axis
  2. float
    specifies a rescale factor relative to the original image size
  3. None
    the corresponding axis size will be computed such that the aspect ratio is maintained. If both height and width are None, no resize is performed.
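
The three kinds of shape values can be resolved to concrete pixel sizes as in the sketch below. The function resolve_shape and its helper are hypothetical, and rounding to the nearest pixel is an assumption; the library may round differently.

```python
def _resolve_axis(value, original):
    """Resolve one axis: int is absolute, float is a scale factor."""
    if value is None:
        return None
    if isinstance(value, float):
        return int(round(value * original))
    return int(value)


def resolve_shape(original, shape):
    """Resolve a (height, width) shape argument against the original size."""
    orig_h, orig_w = original
    if shape is None:
        return orig_h, orig_w
    new_h = _resolve_axis(shape[0], orig_h)
    new_w = _resolve_axis(shape[1], orig_w)
    if new_h is None and new_w is None:
        return orig_h, orig_w  # no resize
    if new_h is None:
        # Keep the aspect ratio: scale height by the same factor as width.
        new_h = int(round(orig_h * new_w / orig_w))
    elif new_w is None:
        new_w = int(round(orig_w * new_h / orig_h))
    return new_h, new_w


print(resolve_shape((100, 200), (None, 64)))  # (32, 64)
print(resolve_shape((100, 200), (0.5, 50)))   # (50, 50)
print(resolve_shape((100, 200), None))        # (100, 200)
```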

Caution

If the new shape is smaller than the original, information will be lost due to interpolation.

Todo

  • Tests
    • generate_dataset
  • Sort characters by reading order, i.e. character ID.
  • Rewrite data as CSV following original format
  • Data generator option instead of writing data.
  • Output formats and/or generator return types for generate_dataset
    • numpy
    • hdf5
    • pandas DataFrame
  • Chunked generate_dataset option to include partial characters.
  • Low-priority:
    • Fix bounding box display error in view_images
    • specify number of character type in sequence
      • e.g. 2 Kanji, 1 kana
    • Instead of padding, fill specified shape with surrounding
class carpedm.data.meta.MetaLoader(data_dir, test_split='hnsd00000', dev_split=0.1, dev_factor=1, vocab_size=None, min_freq=0, reserved=('<PAD>', '<GO>', '<END>', '<UNK>'), charset=<carpedm.data.lang.JapaneseUnicodes object>, image_scope='char', seq_len=None, seq_maxlen=None, verbose=False, seed=None)[source]

Class for loading image metadata.

data_stats(which_sets=('train', 'dev', 'test'), which_stats=('majority', 'frequency', 'unknowns'), save_dir=None, include=(None, None))[source]

Print or show data statistics.

Parameters:
  • which_sets (tuple) – Data subsets to see statistics for.
  • which_stats (tuple) – Statistics to view. Default gives all options.
  • save_dir (str) – If not None, save figures/files to this directory.
  • include (tuple) – Include class IDs from this range.
generate_dataset(out_dir, subset, format_store='tfrecords', shape_store=None, shape_in=None, num_shards=8, num_threads=4, target_id='image/seq/char/id', sparse_labels=False, chunk=False, character=True, line=False, label=True, bbox=False, overwrite=False)[source]

Generate data usable by machine learning algorithm.

Parameters:
  • out_dir (str) – Directory to write the data to if ‘generator’ not in format_store.
  • subset (str) – The subset of data to generate.
  • format_store (str) – Format to save the data as.
  • shape_store (tuple or None) – Size to which images are resized for storage (on disk). The default is to not perform any resize. Please see this note on image shape for more information.
  • shape_in (tuple or None) – Size to which images are resized by interpolation or padding before being input to a model. Please see this note on image shape for more information.
  • num_shards (int) – Number of sharded output files.
  • num_threads (int) – Number of threads to run in parallel.
  • target_id (str) – Determines the target feature (one of keys in dict returned by ImageMeta.generate_features).
  • sparse_labels (bool) – Provide sparse_labels, only used for TFRecords.
  • chunk (bool) –

    Instead of using the original image, extract non-overlapping chunks and corresponding features from the original image on a regular grid. Pad the original image to divide by shape evenly.

    Note

    Currently only characters that fit entirely in the block will be propagated to appropriate features.

  • character (bool) – Include character info, e.g. label, bbox.
  • line (bool) – Include line info (bbox) in features.
  • label (bool) – Include label IDs in features.
  • bbox (str or None) – If not None, include bounding boxes in features, expressed in this unit (e.g. ‘pixel’ or ‘ratio’ [of image]).
  • overwrite (bool) – Overwrite any existing data.
Returns:

Object for accessing batches of data.

Return type:

carpedm.data.providers.DataProvider

max_image_size(subset, static_shape=(None, None))[source]

Retrieve the maximum image size (in pixels).

Parameters:
  • subset (str or None) – Data subset from which to get image sizes. If None, return max sizes of all images.
  • static_shape (tuple of int) – Define static dimensions. Axes that are None will be of variable size.
Returns:

Maximum size (height, width)

Return type:

tuple

view_images(subset, shape=None)[source]

View and explore images in a data subset.

Parameters:
  • subset (str) – The subset to iterate through. One of {‘train’, ‘dev’, ‘test’}.
  • shape (tuple or None) – Shape to which images are resized. Please see this note on image shape for more information.
carpedm.data.meta.num_examples_per_epoch(data_dir, subset)[source]

Retrieve number of examples per epoch.

Parameters:
  • data_dir (str) – Directory where processed dataset is stored.
  • subset (str) – Data subset.
Returns:

Number of examples.

Return type:

int

carpedm.data.ops

Data operations.

This module contains several non-module-specific data operations.

Todo

  • Tests
    • to_sequence_example, parse_sequence_example
    • sparsify_label
    • shard_batch
    • same_line
    • ixs_in_region
    • seq_norm_bbox_values
carpedm.data.ops.in_line(xmin_line, xmax_line, ymin_line, xmin_new, xmax_new, ymax_new)[source]

Heuristic for determining whether a character is in a line.

Note

Currently dependent on the order in which characters are added. For example, a character may vertically overlap with a line, but adding it to the line would be out of reading order. This should be fixed in a future version.

Parameters:
  • xmin_line (list of int) – Minimum x-coordinate of characters in the line the new character is tested against.
  • xmax_line (list of int) – Maximum x-coordinate of characters in the line the new character is tested against.
  • ymin_line (int) – Minimum y-coordinate of line the new character is tested against.
  • xmin_new (int) – Minimum x-coordinate of new character.
  • xmax_new (int) – Maximum x-coordinate of new character.
  • ymax_new (int) – Maximum y-coordinate of new character.
Returns:

The new character vertically overlaps with the “average” character in the line.

Return type:

bool

carpedm.data.ops.in_region(obj, region, entire=True)[source]

Test if an object is in a region.

Parameters:
  • obj (tuple or BBox) – Object bounding box (xmin, xmax, ymin, ymax) or point (x, y).
  • region (tuple or BBox) – Region (xmin, xmax, ymin, ymax).
  • entire (bool) – Object is entirely contained in region.
Returns:

Result

Return type:

bool
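
A containment test of this kind can be sketched as follows, using the documented tuple layouts (xmin, xmax, ymin, ymax) and (x, y). The function box_in_region is a hypothetical stand-in. Treating bounds as inclusive, and treating entire=False as "any overlap", are assumptions.

```python
def box_in_region(obj, region, entire=True):
    """Test whether a point or bounding box lies in a region.

    obj is a point (x, y) or a box (xmin, xmax, ymin, ymax);
    region is a box (xmin, xmax, ymin, ymax). Bounds are inclusive.
    """
    rx_min, rx_max, ry_min, ry_max = region
    if len(obj) == 2:
        x, y = obj
        return rx_min <= x <= rx_max and ry_min <= y <= ry_max
    xmin, xmax, ymin, ymax = obj
    if entire:
        # Every edge of the box must fall inside the region.
        return (rx_min <= xmin and xmax <= rx_max
                and ry_min <= ymin and ymax <= ry_max)
    # Assumption: otherwise any overlap with the region is enough.
    return (xmin <= rx_max and xmax >= rx_min
            and ymin <= ry_max and ymax >= ry_min)


region = (0, 10, 0, 10)
print(box_in_region((2, 4, 2, 4), region))                 # True
print(box_in_region((8, 12, 2, 4), region))                # False
print(box_in_region((8, 12, 2, 4), region, entire=False))  # True
```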

carpedm.data.ops.ixs_in_region(bboxes, y1, y2, x1, x2)[source]

Heuristic for determining objects in a region.

Parameters:
  • bboxes (list of carpedm.data.util.BBox) – Bounding boxes for object boundaries.
  • y1 (int) – Top (lowest row index) of region.
  • y2 (int) – Bottom (highest row index) of region.
  • x1 (int) – left side (lowest column index) of region.
  • x2 (int) – right side (highest column index) of region.
Returns:

Indices of objects inside region.

Return type:

list of int

carpedm.data.ops.parse_sequence_example(serialized)[source]

Parse a sequence example.

Parameters:serialized (tf.Tensor) – Serialized 0-D tensor of type string.
Returns:Dictionary of features.
Return type:dict
carpedm.data.ops.seq_norm_bbox_values(bboxes, height, width)[source]

Sequence and normalize bounding box values.

Parameters:
  • bboxes (list of carpedm.data.util.BBox) – Bounding boxes to process.
  • height (int) – Height (in pixels) of image bboxes are in.
  • width (int) – Width (in pixels) of image bboxes are in.
Returns:

tuple containing:

list of float: Normalized minimum x-values

list of float: Normalized minimum y-values

list of float: Normalized maximum x-values

list of float: Normalized maximum y-values

Return type:

tuple
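
The return layout (four parallel lists, one per coordinate) can be illustrated with a small stand-in. The BBox namedtuple and normalize_bboxes below are hypothetical; dividing x-values by the width and y-values by the height is an assumption about what "normalize" means here.

```python
from collections import namedtuple

# Stand-in with the same field order as carpedm.data.util.BBox.
BBox = namedtuple("BBox", ["xmin", "xmax", "ymin", "ymax"])


def normalize_bboxes(bboxes, height, width):
    """Split boxes into per-coordinate lists scaled to [0, 1]."""
    xmins = [box.xmin / width for box in bboxes]
    ymins = [box.ymin / height for box in bboxes]
    xmaxs = [box.xmax / width for box in bboxes]
    ymaxs = [box.ymax / height for box in bboxes]
    return xmins, ymins, xmaxs, ymaxs


boxes = [BBox(xmin=10, xmax=30, ymin=20, ymax=60)]
print(normalize_bboxes(boxes, height=80, width=40))
# ([0.25], [0.25], [0.75], [0.75])
```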

carpedm.data.ops.shard_batch(features, labels, batch_size, num_shards)[source]

Shard a batch of examples.

Parameters:
  • features (dict) – Dictionary of features.
  • labels (tf.Tensor) – labels
  • batch_size (int) – The batch size.
  • num_shards (int) – Number of shards into which batch is split.
Returns:

Features as a list of dictionaries.

Return type:

list of dict

carpedm.data.ops.sparsify_label(label, length)[source]

Convert a regular Tensor into a SparseTensor.

Parameters:
  • label (tf.Tensor) – The label to convert.
  • length (tf.Tensor) – Length of the label
Returns:

tf.SparseTensor

carpedm.data.ops.to_sequence_example(feature_dict)[source]

Convert features to TensorFlow SequenceExample.

Parameters:feature_dict (dict) – Dictionary of features.
Returns:tf.train.SequenceExample

carpedm.data.preproc

Preprocessing methods.

This module provides methods for preprocessing images.

Todo

  • Tests
    • convert_to_grayscale
    • normalize
    • pad_borders
  • Fix and generalize distort_image
carpedm.data.preproc.convert_to_grayscale(image)[source]

Convert RGB image to grayscale.

carpedm.data.preproc.normalize(image)[source]

Rescale pixels values (to [-1, 1]).

carpedm.data.preproc.pad_borders_or_shrink(image, char_bbox, line_bbox, shape, maintain_aspect=True)[source]

Pad or resize the image.

If the desired shape is larger than the original, then that axis is padded equally on both sides with the mean pixel value in the image. Otherwise, the image is resized with BILINEAR interpolation such that the aspect ratio is maintained.

Parameters:
  • image (tf.Tensor) – Image tensor [height, width, channels].
  • char_bbox (tf.Tensor) – Character bounding box [4].
  • line_bbox (tf.Tensor) – Line bounding box [4].
  • shape (tuple of int) – Output shape.
  • maintain_aspect (bool) – Maintain the aspect ratio.
Returns:

tuple containing:

tf.Tensor: Resized image.

tf.Tensor: Adjusted character bounding boxes.

tf.Tensor: Adjusted line bounding boxes.

Return type:

tuple
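
The padding branch (each axis padded on both sides with the mean pixel value) can be sketched on a plain nested list. This is an illustration only: pad_to_shape is hypothetical, it works on a 2-D grayscale list rather than a tf.Tensor, it does not cover the resize branch or the bounding box adjustment, and placing the extra pixel at the bottom or right when the difference is odd is an assumption.

```python
def pad_to_shape(image, shape):
    """Pad a 2-D grayscale image (list of rows) to (height, width).

    Padding uses the mean pixel value and is split across both sides.
    """
    height, width = len(image), len(image[0])
    target_h, target_w = shape
    if target_h < height or target_w < width:
        raise ValueError("this sketch only pads; it does not shrink")
    mean = sum(sum(row) for row in image) / (height * width)
    top = (target_h - height) // 2
    bottom = target_h - height - top
    left = (target_w - width) // 2
    right = target_w - width - left
    body = [[mean] * left + list(row) + [mean] * right for row in image]
    blank_rows = lambda n: [[mean] * target_w for _ in range(n)]
    return blank_rows(top) + body + blank_rows(bottom)


padded = pad_to_shape([[0, 2], [4, 6]], (4, 4))
for row in padded:
    print(row)
```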

carpedm.data.providers

Data providers for Task input function.

This module provides a generic interface for providing data useable by machine learning algorithms.

A provider may either (1) receive data from the method that initialized it, or (2) receive a directory path where the data to load is stored.

Todo

  • Generator
    • numpy
    • pandas DataFrame
class carpedm.data.providers.DataProvider(target_id)[source]

Data provider abstract class.

make_batch(batch_size)[source]

Generator method that returns a new batch with each call.

Parameters:batch_size (int) – Number of examples per batch.
Returns:

tuple containing:

dict: Batch features.

array_like: Batch targets.

Return type:

tuple
class carpedm.data.providers.TFDataSet(target_id, data_dir, subset, num_examples, pad_shape, sparse_labels)[source]

TensorFlow DataSet provider from TFRecords stored on disk.

make_batch(batch_size, single_char=False)[source]

Generator method that returns a new batch with each call.

Parameters:batch_size (int) – Number of examples per batch.
Returns:

tuple containing:

dict: Batch features.

array_like: Batch targets.

Return type:

tuple

carpedm.data.util

Data utilities.

This module provides utility methods/classes used by other data modules.

Todo

  • Tests
    • generate_features
  • Refactor generate_features
  • Fix class_mask for overlapping characters.
class carpedm.data.util.BBox(xmin, xmax, ymin, ymax)[source]

Bounding box helper class.

class carpedm.data.util.Character(label, image_id, x, y, block_id, char_id, w, h)[source]

Helper class for storing a single character.

class carpedm.data.util.ImageMeta(filepath, full_image=False, first_char=None)[source]

Class for storing and manipulating image metadata.

add_char(char)[source]

Add a character to the image.

Parameters:char (Character) – The character to add.
char_bboxes

Bounding boxes for characters.

Returned bounding boxes are relative to (xmin(), ymin()).

Returns:The return values.
Return type:list of carpedm.data.util.BBox
char_labels

Character labels

Returns:The return value.
Return type:list of str
char_mask

Generate pseudo-pixel-level character mask.

Pixels within character bounding boxes are assigned to positive class (1), others assigned negative class (0).

Returns:Character mask of shape (height, width, 1)
Return type:numpy.ndarray
class_mask(vocab)[source]

Generate a character class image mask.

Note

Where characters overlap, the last character added is arbitrarily the one that will be represented in the mask. This should be fixed in a future version.

Parameters:vocab (Vocabulary) – The vocabulary for converting to ID.
Returns:Class mask of shape (height, width, 1)
Return type:numpy.ndarray
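
The overlap behavior in the note follows naturally from painting boxes in insertion order, as the sketch below shows. The function paint_class_mask is hypothetical; it returns a 2-D nested list rather than a (height, width, 1) array, and the background value 0 and exclusive upper bounds are assumptions.

```python
def paint_class_mask(height, width, boxes_with_ids, background=0):
    """Fill each box (xmin, xmax, ymin, ymax) with its class ID, in order."""
    mask = [[background] * width for _ in range(height)]
    for (xmin, xmax, ymin, ymax), class_id in boxes_with_ids:
        for y in range(max(ymin, 0), min(ymax, height)):
            for x in range(max(xmin, 0), min(xmax, width)):
                mask[y][x] = class_id  # later boxes overwrite earlier ones
    return mask


# Two overlapping characters: class 7 is added last, so it wins the overlap.
mask = paint_class_mask(3, 4, [((0, 3, 0, 2), 5), ((2, 4, 1, 3), 7)])
for row in mask:
    print(row)
```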
combine_with(images)[source]
Parameters:images (list of ImageMeta) –
full_h

Height (in pixels) of full raw parent image.

Returns:The return value.
Return type:int
full_w

Width (in pixels) of full raw parent image.

Returns:The return value.
Return type:int
generate_features(image_shape, vocab, chunk, character, line, label, bbox)[source]
Parameters:
  • image_shape (tuple or None) – Shape (height, width) to which images are resized, or the size of each chunk if chunk is True.
  • vocab (Vocabulary or None) – Vocabulary for converting characters to IDs. Required if character and label.
  • chunk (bool) – Instead of using the original image, return a list of image chunks and corresponding features extracted from the original image on a regular grid. The original image is padded to divide evenly by chunk shape.
  • character (bool) – Include character info (ID, bbox).
  • line (bool) – Include line info (bbox) in features.
  • label (bool) – Include label IDs in features.
  • bbox (str or None) – If not None, include bounding boxes in features, expressed in this unit (e.g. ‘pixel’ or ‘ratio’ [of image]).
Returns:

Feature dictionaries.

Return type:

list of dict

height

Height (in pixels) in full parent image original scale.

Returns:The return value.
Return type:int
line_bboxes

Bounding boxes for lines in the image.

Note: Currently only meaningful when using full page image.

Returns:The return values.
Return type:list of BBox
line_mask

Generate pseudo-pixel-level line mask.

Pixels within line bounding boxes are assigned to positive class (1), others assigned negative class (0).

Returns:Line mask of shape (height, width, 1)
Return type:numpy.ndarray
load_image(shape)[source]

Load image and resize to shape.

If shape is None or (None, None), original size is maintained.

Parameters:shape (tuple or None) – Output dimensions (height, width).
Returns:Resized image.
Return type:numpy.ndarray
new_shape(shape, ratio=False)[source]

Resolves (and computes) input shape to a consistent type.

Parameters:
  • shape (tuple or None) – New shape of image (height, width), with potentially inconsistent types.
  • ratio (bool) – Return new size as ratio of original size.
Returns:

tuple containing:

int or float: Absolute or relative height.

int or float: Absolute or relative width.

Return type:

tuple

num_chars

Number of characters in the image.

Returns:The return value.
Return type:int
valid_char(char, same_line=False)[source]

Check if char is a valid character to include in image.

Parameters:
  • char (Character) – The character to validate.
  • same_line (bool) – Consider whether char is in the same line as those already in the image example.
Returns:

True for valid, False otherwise.

Return type:

bool

width

Width (in pixels) in full parent image original scale.

Returns:The return value.
Return type:int
xmax

Image’s maximum x-coordinate (column) in raw parent image.

Returns:The return value.
Return type:int
xmin

Image’s minimum x-coordinate (column) in raw parent image.

Returns:The return value.
Return type:int
ymax

Image’s maximum y-coordinate (row) in raw parent image.

Returns:The return value.
Return type:int
ymin

Image’s minimum y-coordinate (row) in raw parent image.

Returns:The return value.
Return type:int
class carpedm.data.util.ImageTFOps[source]

Helper class for decoding and resizing images.

carpedm.data.util.image_path(data_dir, bib_id, image_id)[source]

Generate path to a specified image.

Parameters:
  • data_dir (str) – Path to top-level data directory.
  • bib_id (str) – Bibliography ID.
  • image_id (str) – Image ID.

Returns: String

Neural Networks

carpedm.nn.conv

Convolutional layers and components.

class carpedm.nn.conv.CNN(kernel_size=((3, 3), (3, 3), (3, 3), (3, 3)), num_filters=(64, 96, 128, 160), padding='same', pool_size=((2, 2), (2, 2), (2, 2), (2, 2)), pool_stride=(2, 2, 2, 2), pool_every_n=1, pooling_fn=max_pooling2d, activation_fn=relu, *args, **kwargs)[source]

Modular convolutional neural network layer class.

name

Unique identifier for the model.

The model name will serve as directory name for model-specific results and as the top-level tf.variable_scope.

Returns:The model name.
Return type:str

carpedm.nn.op

Operations for transforming network layer or input.

carpedm.nn.rnn

Recurrent layers and components.

carpedm.nn.util

Utilities for managing and visualizing neural network layers.

carpedm.nn.util.activation_summary(x)[source]

Helper to create summaries for activations.

Creates one summary that provides a histogram of activations and one that measures their sparsity.

Parameters:x – Tensor
Returns:nothing
carpedm.nn.util.name_nice(raw)[source]

Convert tensor name to a nice format.

Remove ‘tower_[0-9]/’ from the name in case this is a multi-GPU training session. This helps the clarity of presentation on tensorboard.
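
The prefix removal can be expressed as a single regular expression substitution. The function strip_tower_prefix below is a hypothetical stand-in for illustration; allowing more than one digit in the tower index is an assumption.

```python
import re


def strip_tower_prefix(raw):
    """Drop 'tower_<n>/' from a tensor name."""
    return re.sub(r"tower_[0-9]+/", "", raw)


print(strip_tower_prefix("tower_0/conv1/weights"))  # conv1/weights
print(strip_tower_prefix("conv1/weights"))          # conv1/weights
```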

Models

carpedm.models.generic

This module defines base model classes.

class carpedm.models.generic.Model[source]

Abstract class for models.

forward_pass(features, data_format, axes_order, is_training)[source]

Main model functionality.

Must be implemented by subclass.

Parameters:
  • features (array_like or dict) – Input features.
  • data_format (str) – Image format expected for computation, ‘channels_last’ (NHWC) or ‘channels_first’ (NCHW).
  • axes_order (list or None) – If not None, is a list defining the axes order to which image input should be transposed in order to match data_format.
  • is_training (bool) – Training if true, else evaluating.
Returns:

The return value, e.g. class logits.

Return type:

array_like or dict

initialize_pretrained(pretrained_dir)[source]

Initialize a pre-trained model or sub-model.

Parameters:pretrained_dir (str) –

Path to directory where pretrained model is stored. May be used to extract model/sub-model name. For example:

name = pretrained_dir.split('/')[-1].split('_')[0]
Returns:Map from pre-trained variable to model variable.
Return type:dict
name

Unique identifier for the model.

Used to identify results generated with the model.

Must be implemented by subclass.

Returns:The model name.
Return type:str
class carpedm.models.generic.TFModel[source]

Abstract class for TensorFlow models.

_forward_pass(features, data_format, axes_order, is_training, reuse)[source]

Main model functionality.

Must be implemented by subclass.

forward_pass(features, data_format, axes_order, is_training, new_var_scope=False, reuse=False)[source]

Wrapper for making nested variable scopes.

Extends Model.

Parameters:
  • new_var_scope (bool) – Use a new variable scope.
  • reuse (bool) – Reuse variables with same scope.
name

Unique identifier for the model.

The model name will serve as directory name for model-specific results and as the top-level tf.variable_scope.

Returns:The model name.
Return type:str

Tasks

carpedm.tasks.generic

Base task class.

Todo

  • Get rid of model_fn dependency on input_fn.
  • LONG TERM: Training methods other than TensorFlow Estimator.
class carpedm.tasks.generic.Task(data_dir, task_dir, test_split='hnsd00000', dev_split=0.1, dev_factor=1, dataset_format='tfrecords', num_shards=8, num_threads=8, shape_store=None, shape_in=None, vocab_size=None, min_frequency=0, seed=None, **kwargs)[source]

Abstract class for Tasks.

__init__(data_dir, task_dir, test_split='hnsd00000', dev_split=0.1, dev_factor=1, dataset_format='tfrecords', num_shards=8, num_threads=8, shape_store=None, shape_in=None, vocab_size=None, min_frequency=0, seed=None, **kwargs)[source]

Initializer.

Parameters:
  • data_dir (str) – Directory where raw data is stored.
  • task_dir (str) – Top-level directory for storing task data and results.
  • test_split (float or str) – Either the ratio of all data to use for testing or specific bibliography ID(s). Use comma-separated IDs for multiple books.
  • dev_split (float or str) – Either the ratio of training data to use for dev/val or specific bibliography ID(s). Use comma-separated IDs for multiple books.
  • dev_factor (int) – Size of development set should be divisible by this value. Useful for training on multiple GPUs.
  • dataset_format (str) – Base storage unit for the dataset.
  • vocab_size (int) – Maximum vocab size.
  • min_frequency (int) – Minimum frequency of type to be included in vocab.
  • shape_store (tuple or None) – Size to which images are resized for storage, if needed, e.g. for TFRecords. The default is to not perform any resize. Please see this note on image shape for more information.
  • shape_in (tuple or None) – Size to which images are resized by interpolation or padding before being input to a model. Please see this note on image shape for more information.
  • num_shards (int) – Number of sharded output files.
  • num_threads (int) – Number of threads to run in parallel.
  • seed (int or None) – Number for seeding rng.
  • **kwargs – Unused arguments.
__metaclass__

alias of abc.ABCMeta

__weakref__

list of weak references to the object (if defined)

bbox

When creating a dataset, generate appropriate bounding boxes for the task (determined by e.g. self.character, self.line).

Returns:Use bounding boxes.
Return type:bool
character

When creating a dataset, tell the meta_loader to generate character features, e.g. label, bbox.

Returns:Use character features.
Return type:bool
character_set

The Japanese characters (e.g. kana, kanji) of interest.

Preset character sets may include the following component sets:

  • hiragana
  • katakana
  • kana
  • kanji
  • punct (punctuation)
  • misc
Returns:The character set.
Return type:CharacterSet
chunk

When creating a dataset, instead of using the original image, extract non-overlapping chunks of size image_shape and the corresponding features from the original image on a regular grid. The original image is padded to divide evenly by image_shape.

Note: currently only objects that are entirely contained in the block will have their features propagated.

Returns:
Return type:bool
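
The grid layout implied by chunking (pad the image up to a multiple of the chunk size, then cut non-overlapping blocks) can be sketched as below. The function chunk_grid is hypothetical, regions are given as (y1, y2, x1, x2) to match the argument order of ixs_in_region, and which side receives the padding is not specified by the source, so the sketch only computes sizes and region boundaries.

```python
import math


def chunk_grid(height, width, chunk_shape):
    """Return the padded image size and the chunk regions on a regular grid."""
    chunk_h, chunk_w = chunk_shape
    rows = math.ceil(height / chunk_h)
    cols = math.ceil(width / chunk_w)
    padded = (rows * chunk_h, cols * chunk_w)
    regions = [
        (r * chunk_h, (r + 1) * chunk_h, c * chunk_w, (c + 1) * chunk_w)
        for r in range(rows)
        for c in range(cols)
    ]
    return padded, regions


padded, regions = chunk_grid(100, 130, (64, 64))
print(padded)        # (128, 192)
print(len(regions))  # 6
```

A character would then be assigned to a chunk only if its bounding box lies entirely inside one of these regions, in line with the note above.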
image_scope

Portion of original image for each example.

Available scopes are ‘char’, ‘seq’, ‘line’, ‘page’.

Returns:Task image scope
Return type:str
input_fn(batch_size, subset, num_shards, overwrite=False)[source]

Returns (sharded) batches of data.

Parameters:
  • batch_size (int) – The batch_size
  • subset (str) – The subset to use. One of {train, dev, test}.
  • num_shards (int) – Number of data_shards to produce.
  • overwrite (bool) – Overwrite existing data.
Returns:

tuple containing:

list: Features of length num_shards.

list: Labels of length num_shards.

Return type:

tuple

label

When creating a dataset, generate character labels.

Returns:Use character labels
Return type:bool
line

When creating a dataset, tell the meta_loader to generate line features, e.g. bbox.

Returns:Use line features.
Return type:bool
loss_fn(features, model_output, targets, is_training)[source]

Computes an appropriate loss for the task.

Must be implemented in subclass.

Parameters:
  • features (dict) – Additional features for computing loss.
  • model_output (tf.Tensor or dict of tf.Tensor) – Model output used for computing the batch loss, e.g. class logits.
  • targets (tf.Tensor) – Ground truth targets.
  • is_training (bool) – The model is training.
Returns:

Losses of type ‘int32’ and shape [batch_size, 1]

Return type:

tf.Tensor

max_sequence_length

Maximum sequence length.

Only used if image_scope == 'seq'.

Returns:
Return type:int or None
model_fn(model, variable_strategy, num_gpus, num_workers, devices=None)[source]

Model function used by TensorFlow Estimator class.

Parameters:
  • model (pmjtc.models.generic.Model) – The model to run.
  • variable_strategy (str) – Where to locate variable operations, either ‘CPU’ or ‘GPU’.
  • num_gpus (int) – Number of GPUs to use, if available.
  • devices (tuple) – Specific devices to use. If provided, overrides num_gpus.
  • num_workers (int) – Parameter for distributed training.

Returns:

num_classes

Total number of output nodes, includes reserved tokens.

regularization(hparams)[source]
Parameters:hparams – Hyperparameters, e.g. weight_decay

Returns:

reserved

Reserved tokens for the task.

The index of each token in the returned tuple will be used as its integer ID.

Returns:The reserved characters
Return type:tuple
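
The way reserved tokens take the lowest integer IDs can be sketched with a plain dictionary. The helpers build_vocab and lookup are hypothetical and are not the carpedm Vocabulary class; the reserved tuple is the default shown in the MetaLoader signature, and falling back to <UNK> for unseen characters is an assumption based on the parser notes about unknown labels.

```python
RESERVED = ("<PAD>", "<GO>", "<END>", "<UNK>")


def build_vocab(reserved, chars):
    """Map reserved tokens to their tuple index, then number the characters."""
    char_to_id = {token: index for index, token in enumerate(reserved)}
    for char in chars:
        if char not in char_to_id:
            char_to_id[char] = len(char_to_id)
    return char_to_id


def lookup(char_to_id, char, unk="<UNK>"):
    """Return the ID of char, or the <UNK> ID if it is not in the vocabulary."""
    return char_to_id.get(char, char_to_id[unk])


vocab = build_vocab(RESERVED, ["\u3042", "\u3044"])  # two hiragana characters
print(vocab["<PAD>"], vocab["<UNK>"])  # 0 3
print(lookup(vocab, "\u6f22"))         # 3 (not in the vocabulary)
print(len(vocab))                      # 6 classes, reserved tokens included
```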
results(loss, tower_features, tower_preds, tower_targets, is_training)[source]

Accumulates predictions, computes metrics, and determines the tensors to log and/or visualize.

Parameters:
  • loss (tf.float) – Global loss.
  • tower_features (list of dict) – Tower feature dicts.
  • tower_preds (list) – Tower predictions.
  • tower_targets (list of tf.Tensor) – Tower targets.
  • is_training (bool) – The model is training.
Returns:

tuple containing:

dict: The tensors to log.

dict: All predictions.

dict: Evaluation metrics.

Return type:

tuple

sequence_length

If max_sequence_length is None, this gives the deterministic length of a sequence, else the minimum sequence length.

Only used if image_scope == 'seq'.

Returns:
Return type:int or None
sparse_labels

Generate labels as a SparseTensor, e.g. for CTC loss.

Returns:Use sparse labels.
Return type:bool
target

Determines the value against which predictions are compared.

For a list of possible targets, refer to carpedm.data.util.ImageMeta.generate_features()

Returns:feature key for the target
Return type:str
task_data_dir

Directory where task data is stored.

Returns:str

Utilities

carpedm.util.eval

Evaluation helpers.

carpedm.util.eval.confusion_matrix_metric(labels, predictions, num_classes)[source]

A confusion matrix metric.

Parameters:
  • labels (tf.Tensor) – Ground truth labels.
  • predictions (tf.Tensor) – Predictions.
  • num_classes (int) – Number of classes.
Returns:

tf.update_op:

Return type:

tf.Tensor

carpedm.util.eval.plot_confusion_matrix(cm, classes, normalize=False, save_as=None, title='Confusion matrix')[source]

This function prints and plots the confusion matrix. Normalization can be applied by setting normalize=True.

Slight modification of methods here

carpedm.util.registry

Registry for models and tasks.

Define a new model by subclassing models.Model and register it:

@registry.register_model
class MyModel(models.Model):
    ...

Access by snake-cased name: registry.model("my_model").

See all the models registered: registry.list_models().

References

  1. Lightly modified Tensor2Tensor registry.
carpedm.util.registry.default_name(obj_class)[source]

Convert class name to the registry’s default name for the class.

Parameters:obj_class – the name of a class
Returns:The registry’s default name for the class.
carpedm.util.registry.default_object_name(obj)[source]

Convert object to the registry’s default name for the object class.

Parameters:obj – an object instance
Returns:The registry’s default name for the class of the object.
carpedm.util.registry.display_list_by_prefix(names_list, starting_spaces=0)[source]

Creates a help string for names_list grouped by prefix.

carpedm.util.registry.help_string()[source]

Generate help string with contents of registry.

carpedm.util.registry.model(name)[source]

Retrieve a model by name.

carpedm.util.registry.register_model(name=None)[source]

Register a model. name defaults to class name snake-cased.

carpedm.util.registry.register_task(name=None)[source]

Register a Task. name defaults to cls name snake-cased.

carpedm.util.registry.task(name)[source]

Retrieve a task by name.

carpedm.util.train

Training utilities.

This module provides utilities for training machine learning models. It uses or makes slight modifications to code from the TensorFlow CIFAR-10 estimator tutorial.

carpedm.util.train.config_optimizer(params)[source]

Configure the optimizer used for training.

Sets the learning rate schedule and optimization algorithm.

Parameters:params (tf.contrib.training.HParams) – Hyperparameters.
Returns:tf.train.Optimizer