API¶
Data¶
carpedm.data.download¶
Download scripts.
This module provides the interface for downloading raw datasets from their source.
ID | Dataset |
---|---|
pmjtc | provided by the Center for Open Data in the Humanities (CODH). |
Example
Data may be downloaded externally using the provided script:
$ download_data --data-dir <download/to/this/directory> --data-id pmjtc
Note
If an expected data subdirectory already exists in the specified target data-dir, that data will not be downloaded, even if the subdirectory is empty. This should be fixed in a future version.
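The skip rule in the note above can be sketched as follows. This is only an illustration of the described behaviour, not the download script's code: the function name should_download and the check_contents flag are hypothetical, and the flag models the stricter check listed in the Todo.

```python
import os
import tempfile


def should_download(data_dir, subdir, check_contents=False):
    """Return True if `subdir` still needs to be downloaded into `data_dir`.

    With check_contents=False this mirrors the note: any existing
    subdirectory, even an empty one, suppresses the download.
    check_contents=True is a hypothetical stricter variant.
    """
    target = os.path.join(data_dir, subdir)
    if not os.path.isdir(target):
        return True
    if check_contents:
        # Treat an empty directory as "not downloaded yet".
        return len(os.listdir(target)) == 0
    return False


with tempfile.TemporaryDirectory() as data_dir:
    os.mkdir(os.path.join(data_dir, "book_a"))  # empty placeholder directory
    skipped = not should_download(data_dir, "book_a")
    retried = should_download(data_dir, "book_a", check_contents=True)
    missing = should_download(data_dir, "book_b")
```

Until the contents check exists, removing an empty or partial subdirectory by hand before re-running the script is the safe workaround.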
Todo
- Update get_books_list once list is included in downloadables.
- Check subdirectory contents.
- Generalize download structure for other datasets.
carpedm.data.io¶
Input and output.
This module provides functionality for reading and writing data.
Todo
- Tests
- DataWriter
- CSVParser
-
class
carpedm.data.io.
CSVParser
(csv_file, data_dir, bib_id)[source]¶ Utility class for parsing coordinate CSV files.
-
character
(row)[source]¶ Convert CSV row to a Character object.
Returns: The next character Return type: Character
-
characters
()[source]¶ Generates rest of characters in CSV.
Yields: carpedm.data.util.Character
– The next character.
-
parse_characters
(charset)[source]¶ Generate metadata for single character images.
A more efficient implementation of parse_sequences when image_scope='seq' and seq_len=1. Only characters in the character set are included.
Parameters: charset (CharacterSet) – Character set.
Returns: Single character image meta data. Return type: list of carpedm.data.util.ImageMeta
-
parse_lines
()[source]¶ Generate metadata for vertical lines of characters.
Characters not in character set or vocabulary will be labeled as unknown when converted to integer IDs.
Returns: Line image meta data. Return type: list of carpedm.data.util.ImageMeta
-
parse_pages
()[source]¶ Generate metadata for full page images.
Includes every character on page. Characters not in character set or vocabulary will be labeled as unknown when converted to integer IDs.
Returns: Page image meta data. Return type: list of carpedm.data.util.ImageMeta
-
parse_sequences
(charset, len_min, len_max)[source]¶ Generate metadata for images of character sequences.
Only includes sequences of chars in the desired character set. If len_min == len_max, sequence length is deterministic, else each sequence is of random length from [len_min, len_max].
Parameters: - charset (CharacterSet) – The character set.
- len_min (int) – Minimum sequence length.
- len_max (int) – Maximum sequence length.
Returns: Sequence image meta data.
Return type:
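The length rule described for parse_sequences can be sketched as below. The helper name sample_sequence_length is hypothetical, and drawing uniformly from the closed interval is an assumption; the source only says the length is random within [len_min, len_max].

```python
import random


def sample_sequence_length(len_min, len_max, rng=random):
    """Pick a sequence length following the rule described above.

    Deterministic when len_min == len_max; otherwise drawn from the
    closed interval [len_min, len_max] (uniform draw is an assumption).
    """
    if len_min > len_max:
        raise ValueError("len_min must not exceed len_max")
    if len_min == len_max:
        return len_min
    return rng.randint(len_min, len_max)  # randint includes both endpoints


fixed = sample_sequence_length(3, 3)
varied = [sample_sequence_length(2, 5, random.Random(seed)) for seed in range(20)]
```

Passing a seeded random.Random instance keeps the sampled lengths reproducible across runs.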
-
-
class
carpedm.data.io.
DataWriter
(format_out, images, image_shape, vocab, chunk, character, line, label, bbox, subdirs)[source]¶ Utility for writing data to disk in various formats.
-
available_formats
¶ list – The available formats.
References
Heavy modification of _process_dataset in the input pipeline for the TensorFlow im2txt models.
carpedm.data.lang¶
Language-specific and unicode utilities.
Todo
- Variable UNK token in Vocabulary
-
class
carpedm.data.lang.
CharacterSet
(charset, name=None)[source]¶ Character set abstract class.
-
class
carpedm.data.lang.
JapaneseUnicodes
(charset)[source]¶ Utility for accessing and manipulating Japanese character unicodes.
Inherits from CharacterSet.
Unicode ranges taken from [1] with edits for exceptions.
References
-
class
carpedm.data.lang.
Vocabulary
(reserved, vocab)[source]¶ Simple vocabulary wrapper.
References
Lightly modified TensorFlow “im2txt” Vocabulary.
carpedm.data.meta¶
Image metadata management.
This module loads and manages metadata stored as CSV files in the raw data directory.
-
carpedm.data.meta.
DEFAULT_SEED
¶ int – The default random seed.
Examples
import carpedm as dm
Load, view, and generate a dataset of single kana characters.
single_kana = dm.data.MetaLoader(data_dir=dm.data.sample, image_scope='char', charset=dm.data.CharacterSet('kana'))
single_kana.view_images(subset='train', shape=(64,64))
single_kana.generate_dataset(out_dir='/tmp/pmjtc_data', subset='train')
Load and view a dataset of sequences of 3 kanji.
kanji_seq = dm.data.MetaLoader(data_dir=dm.data.sample, image_scope='seq', seq_len=3, charset=dm.data.CharacterSet('kanji'))
kanji_seq.view_images(subset='dev', shape=(None, 64))
Load and view a dataset of full pages.
full_page = dm.data.MetaLoader(data_dir=dm.data.sample, image_scope='page', charset=dm.data.CharacterSet('all'))
full_page.view_images(subset='test', shape=None)
Note
Unless stated otherwise, image shape arguments in this module should be a tuple (height, width). Tuple values may be one of the following:
int
- specifies the absolute size (in pixels) for that axis
float
- specifies a rescale factor relative to the original image size
None
- the corresponding axis size will be computed such that the aspect ratio is maintained. If both height and width are None, no resize is performed.
Caution
If the new shape is smaller than the original, information will be lost due to interpolation.
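To make the (height, width) rules in the note above concrete, here is a minimal sketch of how such a tuple could be resolved to pixel sizes. The function resolve_shape is hypothetical and is not the library's new_shape method; rounding to the nearest pixel is an assumption.

```python
def resolve_shape(original, shape):
    """Resolve a shape argument to absolute (height, width) in pixels.

    original: (height, width) of the source image, as ints.
    shape: None, or a tuple whose entries are int (absolute size),
           float (rescale factor), or None (keep aspect ratio).
    """
    orig_h, orig_w = original
    if shape is None or tuple(shape) == (None, None):
        return orig_h, orig_w  # no resize is performed

    def resolve_axis(value, size):
        if value is None:
            return None
        if isinstance(value, float):
            return int(round(size * value))  # relative rescale factor
        return int(value)  # absolute size in pixels

    new_h = resolve_axis(shape[0], orig_h)
    new_w = resolve_axis(shape[1], orig_w)
    # Fill in the missing axis so that the aspect ratio is maintained.
    if new_h is None:
        new_h = int(round(orig_h * new_w / orig_w))
    elif new_w is None:
        new_w = int(round(orig_w * new_h / orig_h))
    return new_h, new_w


square = resolve_shape((100, 200), (64, 64))
keep_aspect = resolve_shape((100, 200), (None, 64))
half_height = resolve_shape((100, 200), (0.5, None))
untouched = resolve_shape((100, 200), None)
```

Note that (64, 64) on a 100 by 200 image changes the aspect ratio, while (None, 64) does not.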
Todo
- Tests
  - generate_dataset
- Sort characters by reading order, i.e. character ID.
- Rewrite data as CSV following original format
- Data generator option instead of writing data.
- Output formats and/or generator return types for generate_dataset
  - numpy
  - hdf5
  - pandas DataFrame
- Chunked generate_dataset option to include partial characters.
- Low-priority:
  - Fix bounding box display error in view_images
  - specify number of character type in sequence
    - e.g. 2 Kanji, 1 kana
  - Instead of padding, fill specified shape with surrounding
-
class
carpedm.data.meta.
MetaLoader
(data_dir, test_split='hnsd00000', dev_split=0.1, dev_factor=1, vocab_size=None, min_freq=0, reserved=('<PAD>', '<GO>', '<END>', '<UNK>'), charset=<carpedm.data.lang.JapaneseUnicodes object>, image_scope='char', seq_len=None, seq_maxlen=None, verbose=False, seed=None)[source]¶ Class for loading image metadata.
-
data_stats
(which_sets=('train', 'dev', 'test'), which_stats=('majority', 'frequency', 'unknowns'), save_dir=None, include=(None, None))[source]¶ Print or show data statistics.
Parameters:
-
generate_dataset
(out_dir, subset, format_store='tfrecords', shape_store=None, shape_in=None, num_shards=8, num_threads=4, target_id='image/seq/char/id', sparse_labels=False, chunk=False, character=True, line=False, label=True, bbox=False, overwrite=False)[source]¶ Generate data usable by machine learning algorithm.
Parameters: - out_dir (str) – Directory to write the data to if ‘generator’ not in format_store.
- subset (str) – The subset of data to generate.
- format_store (str) – Format to save the data as.
- shape_store (tuple or None) – Size to which images are resized for storage (on disk). The default is to not perform any resize. Please see this note on image shape for more information.
- shape_in (tuple or None) – Size to which images are resized by interpolation or padding before being input to a model. Please see this note on image shape for more information.
- num_shards (int) – Number of sharded output files.
- num_threads (int) – Number of threads to run in parallel.
- target_id (str) – Determines the target feature (one of keys in dict returned by ImageMeta.generate_features).
- sparse_labels (bool) – Provide sparse_labels, only used for TFRecords.
- chunk (bool) – Instead of using the original image, extract non-overlapping chunks and corresponding features from the original image on a regular grid. Pad the original image to divide by shape evenly.
Note: Currently only characters that fit entirely in the block will be propagated to appropriate features.
- character (bool) – Include character info, e.g. label, bbox.
- line (bool) – Include line info (bbox) in features.
- label (bool) – Include label IDs in features.
- bbox (str or None) – If not None, include bbox in features as unit (e.g. ‘pixel’, ‘ratio’ [of image])
- overwrite (bool) – Overwrite any existing data.
Returns: Object for accessing batches of data.
Return type:
-
max_image_size
(subset, static_shape=(None, None))[source]¶ Retrieve the maximum image size (in pixels).
Parameters: Returns: Maximum size (height, width)
Return type:
-
view_images
(subset, shape=None)[source]¶ View and explore images in a data subset.
Parameters: - subset (str) – The subset to iterate through. One of {‘train’, ‘dev’, ‘test’}.
- shape (tuple or None) – Shape to which images are resized. Please see this note on image shape for more information.
-
carpedm.data.ops¶
Data operations.
This module contains several non-module-specific data operations.
Todo
- Tests
- to_sequence_example, parse_sequence_example
- sparsify_label
- shard_batch
- same_line
- ixs_in_region
- seq_norm_bbox_values
-
carpedm.data.ops.
in_line
(xmin_line, xmax_line, ymin_line, xmin_new, xmax_new, ymax_new)[source]¶ Heuristic for determining whether a character is in a line.
Note
Currently dependent on the order in which characters are added. For example, a character may vertically overlap with a line, but adding it to the line would be out of reading order. This should be fixed in a future version.
Parameters: - xmin_line (list of int) – Minimum x-coordinate of characters in the line the new character is tested against.
- xmax_line (list of int) – Maximum x-coordinate of characters in the line the new character is tested against.
- ymin_line (int) – Minimum y-coordinate of line the new character is tested against.
- xmin_new (int) – Minimum x-coordinate of new character.
- xmax_new (int) – Maximum x-coordinate of new character.
- ymax_new (int) – Maximum y-coordinate of new character.
Returns: The new character vertically overlaps with the “average” character in the line.
Return type:
-
carpedm.data.ops.
in_region
(obj, region, entire=True)[source]¶ Test if an object is in a region.
Parameters: Returns: Result
Return type:
-
carpedm.data.ops.
ixs_in_region
(bboxes, y1, y2, x1, x2)[source]¶ Heuristic for determining objects in a region.
Parameters: Returns: Indices of objects inside region.
Return type:
-
carpedm.data.ops.
parse_sequence_example
(serialized)[source]¶ Parse a sequence example.
Parameters: serialized (tf.Tensor) – Serialized 0-D tensor of type string. Returns: Dictionary of features. Return type: dict
-
carpedm.data.ops.
seq_norm_bbox_values
(bboxes, height, width)[source]¶ Sequence and normalize bounding box values.
Parameters: - bboxes (list of carpedm.data.util.BBox) – Bounding boxes to process.
- width (int) – Width (in pixels) of image bboxes are in.
- height (int) – Height (in pixels) of image bboxes are in.
Returns: tuple containing:
Return type:
-
carpedm.data.ops.
shard_batch
(features, labels, batch_size, num_shards)[source]¶ Shard a batch of examples.
Parameters: Returns: Features as a list of dictionaries.
Return type:
carpedm.data.preproc¶
Preprocessing methods.
This module provides methods for preprocessing images.
Todo
- Tests
- convert_to_grayscale
- normalize
- pad_borders
- Fix and generalize distort_image
-
carpedm.data.preproc.
pad_borders_or_shrink
(image, char_bbox, line_bbox, shape, maintain_aspect=True)[source]¶ Pad or resize the image.
If the desired shape is larger than the original, then that axis is padded equally on both sides with the mean pixel value in the image. Otherwise, the image is resized with BILINEAR interpolation such that the aspect ratio is maintained.
Parameters:
Returns: Resized image. tf.Tensor: Adjusted character bounding boxes. tf.Tensor: Adjusted line bounding boxes.
Return type: tf.Tensor
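The pad-or-shrink rule described for pad_borders_or_shrink can be illustrated with plain arithmetic on shapes. This sketch is not the TensorFlow implementation: plan_pad_or_shrink is a hypothetical helper, and padding the remaining space after a shrink is an assumption. It only computes the resized shape and the per-side padding that would then be filled with the mean pixel value.

```python
def plan_pad_or_shrink(original, target):
    """Return (resized_shape, (pad_top, pad_bottom), (pad_left, pad_right)).

    original, target: (height, width) in pixels.
    """
    orig_h, orig_w = original
    tgt_h, tgt_w = target
    # Shrink only if an axis is too large; a single scale factor for both
    # axes keeps the aspect ratio.
    scale = min(1.0, tgt_h / orig_h, tgt_w / orig_w)
    new_h = max(1, int(round(orig_h * scale)))
    new_w = max(1, int(round(orig_w * scale)))
    # Whatever space is left on an axis is split evenly between both sides.
    pad_h = tgt_h - new_h
    pad_w = tgt_w - new_w
    pad_top, pad_left = pad_h // 2, pad_w // 2
    return (new_h, new_w), (pad_top, pad_h - pad_top), (pad_left, pad_w - pad_left)


padded_only = plan_pad_or_shrink((50, 40), (64, 64))
shrunk = plan_pad_or_shrink((128, 64), (64, 64))
```

Bounding boxes would need the same scale and offsets applied, which is why the function documented above also returns adjusted character and line boxes.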
carpedm.data.providers¶
Data providers for Task input function.
This module provides a generic interface for providing data usable by machine learning algorithms.
A provider may either (1) receive data from the method that initialized it, or (2) receive a directory path where the data to load is stored.
Todo
- Generator
- numpy
- pandas DataFrame
carpedm.data.util¶
Data utilities.
This module provides utility methods/classes used by other data modules.
Todo
- Tests
- generate_features
- Refactor generate_features
- Fix class_mask for overlapping characters.
-
class
carpedm.data.util.
Character
(label, image_id, x, y, block_id, char_id, w, h)[source]¶ Helper class for storing a single character.
-
class
carpedm.data.util.
ImageMeta
(filepath, full_image=False, first_char=None)[source]¶ Class for storing and manipulating image metadata.
-
add_char
(char)[source]¶ Add a character to the image.
Parameters: char (Character) – The character to add.
-
char_bboxes
¶ Bounding boxes for characters.
Returned bounding boxes are relative to (xmin(), ymin()).
Returns: The return values. Return type: list of carpedm.data.util.BBox
-
char_mask
¶ Generate pseudo-pixel-level character mask.
Pixels within character bounding boxes are assigned to positive class (1), others assigned negative class (0).
Returns: Character mask of shape (height, width, 1) Return type: numpy.ndarray
-
class_mask
(vocab)[source]¶ Generate a character class image mask.
Note
Where characters overlap, the last character added is arbitrarily the one that will be represented in the mask. This should be fixed in a future version.
Parameters: vocab (Vocabulary) – The vocabulary for converting to ID. Returns: Class mask of shape (height, width, 1) Return type: numpy.ndarray
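The two mask rules above (char_mask and class_mask) can be sketched with plain nested lists. Everything here is illustrative: the Box tuple and its field order are hypothetical, maximum coordinates are treated as exclusive by assumption, and the result is a 2-D list rather than the (height, width, 1) numpy.ndarray the library returns.

```python
from collections import namedtuple

# Hypothetical box layout; the library's BBox fields may differ.
Box = namedtuple("Box", ["xmin", "xmax", "ymin", "ymax"])


def class_mask(height, width, boxes_with_ids, background=0):
    """Paint each box with its class ID; later boxes overwrite earlier ones."""
    mask = [[background] * width for _ in range(height)]
    for box, class_id in boxes_with_ids:
        for y in range(max(0, box.ymin), min(height, box.ymax)):
            for x in range(max(0, box.xmin), min(width, box.xmax)):
                mask[y][x] = class_id
    return mask


def char_mask(height, width, boxes):
    """Binary mask: 1 inside any character box, 0 elsewhere."""
    return class_mask(height, width, [(box, 1) for box in boxes])


first = Box(xmin=0, xmax=3, ymin=0, ymax=3)
second = Box(xmin=2, xmax=4, ymin=2, ymax=4)  # overlaps `first` at (2, 2)
classes = class_mask(4, 4, [(first, 5), (second, 7)])
binary = char_mask(4, 4, [first, second])
```

The pixel at row 2, column 2 ends up with class 7, which is the "last character added wins" behaviour the note describes.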
-
generate_features
(image_shape, vocab, chunk, character, line, label, bbox)[source]¶ Parameters: - image_shape (tuple or None) – Shape (height, width) to which images are resized, or the size of each chunk if chunk == True.
- vocab (Vocabulary or None) – Vocabulary for converting characters to IDs. Required if character and label.
- chunk (bool) – Instead of using the original image, return a list of image chunks and corresponding features extracted from the original image on a regular grid. The original image is padded to divide evenly by chunk shape.
- character (bool) – Include character info (ID, bbox).
- line (bool) – Include line info (bbox) in features.
- label (bool) – Include label IDs in features.
- bbox (str or None) – If not None, include bbox in features as unit (e.g. ‘pixel’, ‘ratio’ [of image])
Returns: Feature dictionaries.
Return type:
-
height
¶ Height (in pixels) in full parent image original scale.
Returns: The return value. Return type: int
-
line_bboxes
¶ Bounding boxes for lines in the image.
Note: Currently only meaningful when using full page image.
Returns: The return values. Return type: list of BBox
-
line_mask
¶ Generate pseudo-pixel-level line mask.
Pixels within line bounding boxes are assigned to positive class (1), others assigned negative class (0).
Returns: Line mask of shape (height, width, 1) Return type: numpy.ndarray
-
load_image
(shape)[source]¶ Load image and resize to shape.
If shape is None or (None, None), original size is maintained.
Parameters: shape (tuple or None) – Output dimensions (height, width). Returns: Resized image. Return type: numpy.ndarray
-
new_shape
(shape, ratio=False)[source]¶ Resolves (and computes) input shape to a consistent type.
Parameters: Returns: Absolute or relative height int or float: Absolute or relative width
Return type:
-
valid_char
(char, same_line=False)[source]¶ Check if char is a valid character to include in image.
Parameters: Returns: True for valid, False otherwise.
Return type:
-
width
¶ Width (in pixels) in full parent image original scale.
Returns: The return value. Return type: int
-
xmax
¶ Image’s maximum x-coordinate (column) in raw parent image.
Returns: The return value. Return type: int
-
xmin
¶ Image’s minimum x-coordinate (column) in raw parent image.
Returns: The return value. Return type: int
-
Neural Networks¶
carpedm.nn.conv¶
Convolutional layers and components.
-
class
carpedm.nn.conv.
CNN
(kernel_size=((3, 3), (3, 3), (3, 3), (3, 3)), num_filters=(64, 96, 128, 160), padding='same', pool_size=((2, 2), (2, 2), (2, 2), (2, 2)), pool_stride=(2, 2, 2, 2), pool_every_n=1, pooling_fn=<MagicMock name='mock.max_pooling2d' id='140502995016616'>, activation_fn=<MagicMock name='mock.relu' id='140502994992376'>, *args, **kwargs)[source]¶ Modular convolutional neural network layer class.
carpedm.nn.op¶
Operations for transforming network layer or input.
carpedm.nn.rnn¶
Recurrent layers and components.
carpedm.nn.util¶
Utilities for managing and visualizing neural network layers.
Models¶
carpedm.models.generic¶
This module defines base model classes.
-
class
carpedm.models.generic.
Model
[source]¶ Abstract class for models.
-
forward_pass
(features, data_format, axes_order, is_training)[source]¶ Main model functionality.
Must be implemented by subclass.
Parameters: - features (array_like or dict) – Input features.
- data_format (str) – Image format expected for computation, ‘channels_last’ (NHWC) or ‘channels_first’ (NCHW).
- axes_order (list or None) – If not None, is a list defining the axes order to which image input should be transposed in order to match data_format.
- is_training (bool) – Training if true, else evaluating.
Returns: The return value, e.g. class logits.
Return type: array_like or dict
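As a small illustration of the axes_order parameter, the sketch below applies an axes order to a shape tuple. The helper transpose_shape and the two constants are hypothetical; they follow the usual transpose convention in which output axis i takes input axis axes_order[i].

```python
def transpose_shape(shape, axes_order):
    """Shape that results from transposing a tensor of `shape` by `axes_order`."""
    return tuple(shape[axis] for axis in axes_order)


# Batch (N), height (H), width (W), channels (C).
NHWC_TO_NCHW = [0, 3, 1, 2]  # 'channels_last' input, 'channels_first' compute
NCHW_TO_NHWC = [0, 2, 3, 1]  # the inverse permutation

batch_nhwc = (8, 64, 48, 3)
batch_nchw = transpose_shape(batch_nhwc, NHWC_TO_NCHW)
round_trip = transpose_shape(batch_nchw, NCHW_TO_NHWC)
```

When the stored images already match data_format, axes_order would presumably be None and no transpose is needed.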
-
initialize_pretrained
(pretrained_dir)[source]¶ Initialize a pre-trained model or sub-model.
Parameters: pretrained_dir (str) – Path to directory where pretrained model is stored. May be used to extract model/sub-model name. For example:
name = pretrained_dir.split('/')[-1].split('_')[0]
Returns: Map from pre-trained variable to model variable. Return type: dict
-
-
class
carpedm.models.generic.
TFModel
[source]¶ Abstract class for TensorFlow models.
-
_forward_pass
(features, data_format, axes_order, is_training, reuse)[source]¶ Main model functionality.
Must be implemented by subclass.
-
Tasks¶
carpedm.tasks.generic¶
Base task class.
Todo
- Get rid of model_fn dependency on input_fn.
- LONG TERM: Training methods other than TensorFlow Estimator.
-
class
carpedm.tasks.generic.
Task
(data_dir, task_dir, test_split='hnsd00000', dev_split=0.1, dev_factor=1, dataset_format='tfrecords', num_shards=8, num_threads=8, shape_store=None, shape_in=None, vocab_size=None, min_frequency=0, seed=None, **kwargs)[source]¶ Abstract class for Tasks.
-
__init__
(data_dir, task_dir, test_split='hnsd00000', dev_split=0.1, dev_factor=1, dataset_format='tfrecords', num_shards=8, num_threads=8, shape_store=None, shape_in=None, vocab_size=None, min_frequency=0, seed=None, **kwargs)[source]¶ Initializer.
Parameters: - data_dir (str) – Directory where raw data is stored.
- task_dir (str) – Top-level directory for storing task data and results.
- test_split (float or str) – Either the ratio of all data to use for testing or specific bibliography ID(s). Use comma-separated IDs for multiple books.
- dev_split (float or str) – Either the ratio of training data to use for dev/val or specific bibliography ID(s). Use comma-separated IDs for multiple books.
- dev_factor (int) – Size of development set should be divisible by this value. Useful for training on multiple GPUs.
- dataset_format (str) – Base storage unit for the dataset.
- vocab_size (int) – Maximum vocab size.
- min_frequency (int) – Minimum frequency of type to be included in vocab.
- shape_store (tuple or None) – Size to which images are resized for storage, if needed, e.g. for TFRecords. The default is to not perform any resize. Please see this note on image shape for more information.
- shape_in (tuple or None) – Size to which images are resized by interpolation or padding before being input to a model. Please see this note on image shape for more information.
- num_shards (int) – Number of sharded output files.
- num_threads (int) – Number of threads to run in parallel.
- seed (int or None) – Number for seeding rng.
- **kwargs – Unused arguments.
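The split parameters above accept either a ratio or bibliography IDs. The following sketch shows one way to interpret them; parse_split and dev_set_size are hypothetical helpers, and rounding the development set down to a multiple of dev_factor is an assumption (the source only says the size should be divisible by it).

```python
def parse_split(split):
    """Interpret a split argument as ('ids', [...]) or ('ratio', float).

    Strings are always treated as comma-separated bibliography IDs, since
    some IDs could otherwise be mistaken for numbers.
    """
    if isinstance(split, str):
        return "ids", [part.strip() for part in split.split(",") if part.strip()]
    return "ratio", float(split)


def dev_set_size(num_train, dev_split, dev_factor=1):
    """Development set size for a ratio split, divisible by dev_factor."""
    size = int(num_train * dev_split)
    return size - size % dev_factor  # round down (assumption)


default_test = parse_split("hnsd00000")
two_books = parse_split("book_a, book_b")  # placeholder IDs
default_dev = parse_split(0.1)
dev_size = dev_set_size(1003, 0.1, dev_factor=8)
```

With dev_factor equal to the number of GPUs, every device receives the same number of development examples.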
-
__metaclass__
¶ alias of
abc.ABCMeta
-
__weakref__
¶ list of weak references to the object (if defined)
-
bbox
¶ When creating a dataset, generate appropriate bounding boxes for the task (determined by e.g. self.character, self.line).
Returns: Use bounding boxes. Return type: bool
-
character
¶ When creating a dataset, tell the meta_loader to generate character features, e.g. label, bbox.
Returns: Use character features. Return type: bool
-
character_set
¶ The Japanese characters (e.g. kana, kanji) of interest.
Preset character sets may include the following component sets:
- hiragana
- katakana
- kana
- kanji
- punct (punctuation)
- misc
Returns: The character set. Return type: CharacterSet
-
chunk
¶ When creating a dataset, instead of using the original image, extract non-overlapping chunks of size image_shape and the corresponding features from the original image on a regular grid. The original image is padded to divide evenly by image_shape.
Note: currently only objects that are entirely contained in the block will have their features propagated.
Returns: Return type: bool
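The chunking rule can be sketched as grid arithmetic. chunk_grid and fully_inside are hypothetical helpers, the (ymin, xmin, ymax, xmax) box layout is an assumption, and so is placing all padding on the bottom and right edges.

```python
import math


def chunk_grid(image_shape, chunk_shape):
    """Return (padded_shape, origins) for a regular non-overlapping grid.

    image_shape, chunk_shape: (height, width) in pixels.
    origins: top-left (y, x) corner of every chunk, row by row.
    """
    height, width = image_shape
    chunk_h, chunk_w = chunk_shape
    rows = math.ceil(height / chunk_h)
    cols = math.ceil(width / chunk_w)
    padded = (rows * chunk_h, cols * chunk_w)  # divides evenly by chunk shape
    origins = [(r * chunk_h, c * chunk_w) for r in range(rows) for c in range(cols)]
    return padded, origins


def fully_inside(box, origin, chunk_shape):
    """True if box (ymin, xmin, ymax, xmax) lies entirely within the chunk."""
    ymin, xmin, ymax, xmax = box
    y0, x0 = origin
    chunk_h, chunk_w = chunk_shape
    return y0 <= ymin and ymax <= y0 + chunk_h and x0 <= xmin and xmax <= x0 + chunk_w


padded_shape, origins = chunk_grid((100, 70), (64, 64))
kept = fully_inside((10, 10, 50, 50), origins[0], (64, 64))
dropped = fully_inside((60, 10, 70, 50), origins[0], (64, 64))  # crosses a chunk border
```

A character that straddles a chunk border is contained in no chunk, so under the current behaviour its features are dropped entirely.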
-
image_scope
¶ Portion of original image for each example.
Available scopes are ‘char’, ‘seq’, ‘line’, ‘page’.
Returns: Task image scope Return type: str
-
input_fn
(batch_size, subset, num_shards, overwrite=False)[source]¶ Returns (sharded) batches of data.
Parameters: Returns: Features of length num_shards. (list): Labels of length num_shards.
Return type: (list)
-
label
¶ When creating a dataset, generate character labels.
Returns: Use character labels Return type: bool
-
line
¶ When creating a dataset, tell the meta_loader to generate line features, e.g. bbox.
Returns: Use line features. Return type: bool
-
loss_fn
(features, model_output, targets, is_training)[source]¶ Computes an appropriate loss for the task.
Must be implemented in subclass.
Parameters: Returns: Losses of type ‘int32’ and shape [batch_size, 1]
Return type: tf.Tensor
-
max_sequence_length
¶ Maximum sequence length.
Only used if image_scope == 'seq'.
Returns: Return type: int or None
-
model_fn
(model, variable_strategy, num_gpus, num_workers, devices=None)[source]¶ Model function used by TensorFlow Estimator class.
Parameters: - model (pmjtc.models.generic.Model) – The model to run.
- variable_strategy (str) – Where to locate variable operations, either ‘CPU’ or ‘GPU’.
- num_gpus (int) – Number of GPUs to use, if available.
- devices (tuple) – Specific devices to use. If provided, overrides num_gpus.
- num_workers (int) – Parameter for distributed training.
Returns:
-
num_classes
¶ Total number of output nodes, includes reserved tokens.
-
reserved
¶ Reserved tokens for the task.
The index of each token in the returned tuple will be used as its integer ID.
Returns: The reserved characters Return type: tuple
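The relationship between reserved tokens, integer IDs, and unknown characters can be sketched as follows. build_vocab and char_to_id are hypothetical helpers and not the Vocabulary class; the reserved tuple is the default shown in the MetaLoader signature.

```python
RESERVED = ("<PAD>", "<GO>", "<END>", "<UNK>")


def build_vocab(reserved, characters):
    """Map tokens to IDs: reserved tokens first, using their tuple index."""
    token_to_id = {token: index for index, token in enumerate(reserved)}
    for char in characters:
        if char not in token_to_id:
            token_to_id[char] = len(token_to_id)
    return token_to_id


def char_to_id(token_to_id, char, unk="<UNK>"):
    """Characters missing from the vocabulary fall back to the unknown ID."""
    return token_to_id.get(char, token_to_id[unk])


vocab = build_vocab(RESERVED, "あいあ")
known_id = char_to_id(vocab, "い")
unknown_id = char_to_id(vocab, "漢")
num_classes = len(vocab)  # reserved tokens count as output nodes too
```

Because IDs come from tuple positions, reordering the reserved tuple changes every reserved ID, so it should stay fixed once data has been generated.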
-
results
(loss, tower_features, tower_preds, tower_targets, is_training)[source]¶ Accumulates predictions, computes metrics, and determines the tensors to log and/or visualize.
Parameters: Returns: The tensors to log dict: All predictions dict: Evaluation metrics
Return type:
-
sequence_length
¶ If max_sequence_length is None, this gives the deterministic length of a sequence, else the minimum sequence length.
Only used if image_scope == 'seq'.
Returns: Return type: int or None
-
sparse_labels
¶ Generate labels as a SparseTensor, e.g. for CTC loss.
Returns: Use sparse labels. Return type: (bool)
-
target
¶ Determines the value against which predictions are compared.
For a list of possible targets, refer to carpedm.data.util.ImageMeta.generate_features()
Returns: feature key for the target Return type: str
-
task_data_dir
¶ Directory where task data is stored.
Returns: str
-
Utilities¶
carpedm.util.eval¶
Evaluation helpers.
carpedm.util.registry¶
Registry for models and tasks.
Define a new model by subclassing models.Model and register it:
@registry.register_model
class MyModel(models.Model):
    ...
Access by snake-cased name: registry.model("my_model").
See all the models registered: registry.list_models().
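The registration pattern described above can be sketched in a few lines. This is a simplified stand-in written for illustration, not the carpedm or Tensor2Tensor source: the _MODELS dictionary, the regular expressions, and the error messages are assumptions, and only the function names mirror the ones documented here.

```python
import re

_MODELS = {}  # hypothetical module-level store: name -> class

_FIRST_CAP = re.compile(r"(.)([A-Z][a-z0-9]+)")
_ALL_CAP = re.compile(r"([a-z0-9])([A-Z])")


def default_name(obj_class):
    """Snake-case a class name, e.g. MyModel -> my_model."""
    partial = _FIRST_CAP.sub(r"\1_\2", obj_class.__name__)
    return _ALL_CAP.sub(r"\1_\2", partial).lower()


def register_model(name=None):
    """Usable as @register_model or as @register_model("custom_name")."""

    def decorator(model_cls, registration_name=None):
        key = registration_name or default_name(model_cls)
        if key in _MODELS:
            raise LookupError("Model %s already registered." % key)
        _MODELS[key] = model_cls
        return model_cls

    if callable(name):  # bare decorator: `name` is actually the class
        return decorator(name)
    return lambda model_cls: decorator(model_cls, name)


def model(name):
    """Look up a registered model class by name."""
    if name not in _MODELS:
        raise LookupError("Model %s never registered." % name)
    return _MODELS[name]


def list_models():
    return sorted(_MODELS)


@register_model
class MyModel:
    pass


@register_model("tiny")
class AnotherModel:
    pass
```

Registering by decorator keeps model selection a matter of passing a string, so new models can be added without touching the code that looks them up.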
References
- Lightly modified Tensor2Tensor registry.
-
carpedm.util.registry.
default_name
(obj_class)[source]¶ Convert class name to the registry’s default name for the class.
Parameters: obj_class – the name of a class Returns: The registry’s default name for the class.
-
carpedm.util.registry.
default_object_name
(obj)[source]¶ Convert object to the registry’s default name for the object class.
Parameters: obj – an object instance Returns: The registry’s default name for the class of the object.
-
carpedm.util.registry.
display_list_by_prefix
(names_list, starting_spaces=0)[source]¶ Creates a help string for names_list grouped by prefix.
-
carpedm.util.registry.
register_model
(name=None)[source]¶ Register a model.
name defaults to class name snake-cased.
carpedm.util.train¶
Training utilities.
This module provides utilities for training machine learning models. It uses or makes slight modifications to code from the TensorFlow CIFAR-10 estimator tutorial.