Basic Usage

Getting Started

A small sample dataset is provided and can be accessed as follows:

from carpedm.data import sample as PATH_TO_SAMPLE_DATA

This small dataset is useful for getting started and for debugging.

Full datasets can be downloaded with:

$ download_data -d <download/to/this/directory> -i <dataset-id>

The download may take a while. For a list of available dataset IDs, use:

$ download_data -h

Exploring the Data

To quickly load and review data for a task, use the carpedm.data.meta.MetaLoader class directly. Here are some example datasets that vary each image’s scope and the characters included.

import carpedm as dm


# Create objects for storing metadata
single_kana = dm.data.MetaLoader(data_dir=dm.data.sample, image_scope='char', charset=dm.data.CharacterSet('kana'))
kanji_seq = dm.data.MetaLoader(data_dir=dm.data.sample, image_scope='seq', seq_len=3, charset=dm.data.CharacterSet('kanji'))
full_page = dm.data.MetaLoader(data_dir=dm.data.sample, image_scope='page', charset=dm.data.CharacterSet('all'))

Note that these objects only store the metadata for images in the dataset, so they are relatively time and space efficient. Assuming matplotlib is installed (see Optional Dependencies), you can use view_images to actually load and view images within the dataset, or use generate_dataset to save training data for a machine learning algorithm. For example:

single_kana.view_images(subset='train', shape=(64,64))
kanji_seq.view_images(subset='dev', shape=(None, 64))
full_page.view_images(subset='test', shape=None)

# Save the data as TFRecords (default format_store)
single_kana.generate_dataset(out_dir='/tmp/pmjtc_data', subset='train')

Note

Currently, view_images does not work in a Jupyter notebook instance.

Training a Model

The MetaLoader class on its own is useful for rapid data exploration, but the Tasks module provides a high-level interface for the entire training pipeline, from loading the raw data and automatically generating model-ready datasets, to actually training and evaluating a model.

Next, we will walk through a simple example that uses the provided single character recognition task and a simple baseline Convolutional Neural Network model.

First, let’s set our TensorFlow verbosity so we can see the training progress.

import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)

Next, we’ll initialize our single kana recognition task:

import carpedm as dm
from carpedm.util import registry

# Task definition
args = {'data_dir': dm.data.sample,
        'task_dir': '/tmp/carpedm_tasks',
        'shape_store': None,
        'shape_in': (64, 64)}
task = registry.task('ocr_single_kana')(**args)

Most of the Task functionality, such as the target character_set, the sequence_length (used when image_scope == 'seq', i.e. when the images are character sequences), and the loss_fn, is encapsulated in the class definition. However, there are two required run-time task arguments: data_dir tells the task where to find the raw data, and task_dir tells it where to store task-specific data and results. The optional run-time arguments shape_store and shape_in determine the size of images when they are stored on disk and when they are fed into our neural network, respectively. If shape_store or shape_in is not provided, the original image size is used.

Caution

Using the default for shape_in may break a model expecting fixed-size input.

For more information and a full list of optional arguments, please refer to the Tasks API.

A task can be accessed from the registry with the appropriate task ID. By default, the ID for a stored task is a “snake_cased” version of the task class name. Custom tasks can be added to the registry by applying the @registry.register_task decorator to the new class, importing that class in tasks.__init__, and then importing carpedm (more specifically, the carpedm.tasks package).

Now let’s define our hyperparameters for training and our model.

from carpedm.util import registry

# Training Hyperparameters
num_epochs = 30
training_hparams = {'train_batch_size': 32,
                    'eval_batch_size': 1,
                    'data_format': 'channels_last',
                    'optimizer': 'sgd',
                    'learning_rate': 1e-3,
                    'momentum': 0.96,
                    'weight_decay': 2e-4,
                    'gradient_clipping': None,
                    'lr_decay_steps': None,
                    'init_dir': None,  # for pre-trained models
                    'sync': False}

# Model hyperparameters and definition
model_hparams = {}
model = registry.model('single_char_baseline')(num_classes=task.num_classes, **model_hparams)

The training_hparams above represent the minimal set that must be defined for training to run. In practice, you may want to use a tool like argparse and define some defaults so you don’t have to specify each one every time. Accessing and registering models is similar to the process for tasks (see here for more details).
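As a rough sketch of the argparse approach, the parser below reproduces the defaults from the dictionary above and converts the parsed arguments back into a plain dict. The flag names and the build_parser helper are assumptions for illustration, not part of CarpeDM.

```python
import argparse


def build_parser():
    """Build a parser whose defaults mirror the training_hparams dict."""
    parser = argparse.ArgumentParser(
        description='Training hyperparameters (illustrative sketch).')
    parser.add_argument('--train-batch-size', type=int, default=32)
    parser.add_argument('--eval-batch-size', type=int, default=1)
    parser.add_argument('--data-format', default='channels_last')
    parser.add_argument('--optimizer', default='sgd')
    parser.add_argument('--learning-rate', type=float, default=1e-3)
    parser.add_argument('--momentum', type=float, default=0.96)
    parser.add_argument('--weight-decay', type=float, default=2e-4)
    parser.add_argument('--gradient-clipping', type=float, default=None)
    parser.add_argument('--lr-decay-steps', type=int, default=None)
    parser.add_argument('--init-dir', default=None)  # for pre-trained models
    parser.add_argument('--sync', action='store_true')
    return parser


# An explicit argument list keeps the sketch runnable anywhere;
# in a script, call parse_args() with no arguments to read sys.argv.
parsed = build_parser().parse_args(['--learning-rate', '0.01'])

# argparse turns '--learning-rate' into the attribute 'learning_rate',
# so vars() yields the same keys as the hand-written dictionary.
training_hparams = vars(parsed)
```

Only the learning rate is overridden here; every other entry keeps its default.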

The single_char_baseline model is fully defined except for the number of classes to predict, so it doesn’t take any hyperparameters.

To distinguish this model from others, we should define a unique job_id, which can then be used in some boilerplate TensorFlow configuration.

import os
import re

# Unique job_id
experiment_id = 'example'
shape = re.sub(r'([,])', '_', re.sub(r'([() ])', '', str(args['shape_in'])))
job_id = os.path.join(experiment_id, shape, model.name)
task.job_id = job_id  # Used to check for first model initialization.
job_dir = os.path.join(task.task_log_dir, job_id)

# TensorFlow Configuration
sess_config = tf.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=False,
    intra_op_parallelism_threads=0,
    gpu_options=tf.GPUOptions(force_gpu_compatible=True))
config = tf.estimator.RunConfig(session_config=sess_config,
                                model_dir=job_dir,
                                save_summary_steps=10)
hparams = tf.contrib.training.HParams(is_chief=config.is_chief,
                                      **training_hparams)

We include shape_in in the job ID to avoid conflicts with loading models meant for images of different sizes. Although we don’t do so here for simplicity, it would also be a good idea to include training hyperparameter settings in the job ID, as those are not represented in model.name.
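One possible way to fold training hyperparameters into the job ID is sketched below. The helpers shape_tag and hparams_tag are hypothetical, and the string 'single_char_baseline' merely stands in for model.name, whose actual value is not shown in this guide.

```python
import os
import re


def shape_tag(shape):
    """Turn a shape such as (64, 64) into a path-friendly string."""
    no_brackets = re.sub(r'[() ]', '', str(shape))
    return no_brackets.replace(',', '_')


def hparams_tag(hparams, keys):
    """Join selected hyperparameters as 'key-value' pairs in sorted key order."""
    return '_'.join('{}-{}'.format(k, hparams[k]) for k in sorted(keys))


training_hparams = {'optimizer': 'sgd',
                    'learning_rate': 1e-3,
                    'train_batch_size': 32}

tag = hparams_tag(training_hparams,
                  ['optimizer', 'learning_rate', 'train_batch_size'])

# 'single_char_baseline' is a placeholder for model.name.
job_id = os.path.join('example', shape_tag((64, 64)),
                      'single_char_baseline', tag)
```

Sorting the keys keeps the tag stable regardless of dictionary ordering, so the same settings always map to the same job directory.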

Now comes the important part: defining the input and model functions used by a TensorFlow Estimator.

# Input and model functions
train_input_fn = task.input_fn(hparams.train_batch_size,
                               subset='train',
                               num_shards=1,
                               overwrite=False)
eval_input_fn = task.input_fn(hparams.eval_batch_size,
                              subset='dev',
                              num_shards=1,
                              overwrite=False)
model_fn = task.model_fn(model, num_gpus=0, variable_strategy='CPU',
                         num_workers=config.num_worker_replicas or 1)

As we can see, the Task interface makes this extremely easy! The appropriate data subset for the task is generated (and saved) once automatically when task.input_fn is called. You can overwrite previously saved data by setting the overwrite parameter to True. The num_shards parameter can be used for training in parallel, e.g. on multiple GPUs.

model_fn is a bit more complicated under the hood, but its components are simple:

It uses model.forward_pass to generate predictions,
task.loss_fn to train the model,
and task.results to compile results.
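The way these three pieces fit together can be illustrated with a toy, framework-free sketch. It is a schematic only: the make_model_fn helper, the ToyModel and ToyTask classes, and their method signatures are assumptions and do not reflect CarpeDM's actual interfaces or any TensorFlow machinery.

```python
def make_model_fn(model, task):
    """Return a function that chains forward pass, loss, and results."""
    def model_fn(features, labels):
        predictions = model.forward_pass(features)
        loss = task.loss_fn(predictions, labels)
        results = task.results(predictions, labels)
        return {'predictions': predictions, 'loss': loss, 'results': results}
    return model_fn


class ToyModel:
    def forward_pass(self, features):
        # Predict class 1 for positive inputs, class 0 otherwise.
        return [1 if x > 0 else 0 for x in features]


class ToyTask:
    def loss_fn(self, predictions, labels):
        # Zero-one loss: the fraction of wrong predictions.
        wrong = sum(p != y for p, y in zip(predictions, labels))
        return wrong / len(labels)

    def results(self, predictions, labels):
        return {'accuracy': 1.0 - self.loss_fn(predictions, labels)}


model_fn = make_model_fn(ToyModel(), ToyTask())
out = model_fn(features=[-2.0, 0.5, 3.0, -1.0], labels=[0, 1, 0, 0])
```

The point of the split is that the model only knows how to produce predictions, while the task owns the objective and the reporting, so either side can be swapped independently.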

I don’t assume access to any GPUs, hence the values for num_gpus and variable_strategy. variable_strategy tells the training manager where to collect and update variables. You can ignore the num_workers parameter, unless you want to use special distributed training, e.g. on Google Cloud.

Note

The input_fn definitions must come before the model_fn definition because model_fn relies on a variable, original_format, defined in input_fn. This dependence will likely be removed in future versions.

We’re almost ready to train. We just need to tell it how long to train,

# Number of training steps
train_examples = dm.data.num_examples_per_epoch(task.task_data_dir, 'train')
eval_examples = dm.data.num_examples_per_epoch(task.task_data_dir, 'dev')

if eval_examples % hparams.eval_batch_size != 0:
    raise ValueError(('validation set size (%d) must be multiple of '
                      'eval_batch_size (%d)') % (eval_examples,
                                                 hparams.eval_batch_size))

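eval_steps = eval_examples // hparams.eval_batch_size
train_steps = num_epochs * ((train_examples // hparams.train_batch_size) or 1)

# Specs consumed by tf.estimator.train_and_evaluate
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=train_steps)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, steps=eval_steps)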

define our training manager,

estimator = tf.estimator.Estimator(model_fn=model_fn, config=config, params=hparams)

and hit the train button!

tf.estimator.train_and_evaluate(estimator, train_spec=train_spec, eval_spec=eval_spec)
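The step arithmetic used above can be checked by hand. The sketch below repeats the same formulas as standalone functions (the function names are assumptions) and applies them to the sample dataset's 1,447 training examples with a batch size of 32 and 30 epochs; the evaluation set size of 200 is a made-up figure.

```python
def training_steps(train_examples, train_batch_size, num_epochs):
    """Total training steps; at least one step per epoch."""
    steps_per_epoch = (train_examples // train_batch_size) or 1
    return num_epochs * steps_per_epoch


def evaluation_steps(eval_examples, eval_batch_size):
    """Evaluation steps; requires the set size to divide evenly."""
    if eval_examples % eval_batch_size != 0:
        raise ValueError('validation set size (%d) must be multiple of '
                         'eval_batch_size (%d)'
                         % (eval_examples, eval_batch_size))
    return eval_examples // eval_batch_size


# 1447 // 32 = 45 steps per epoch, so 30 epochs give 30 * 45 = 1350 steps.
train_steps = training_steps(1447, 32, 30)

# With fewer examples than one batch, the "or 1" guard still yields
# one step per epoch.
tiny_steps = training_steps(10, 32, 30)

# Hypothetical evaluation set of 200 examples with batch size 1.
eval_steps = evaluation_steps(200, 1)
```

Note that the floor division ignores the 7 examples left over after 45 full batches (1447 - 45 * 32 = 7) when counting steps.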

Putting it all together, we have a very minimal main.py module for training models. Running it took 8 minutes on a MacBook Pro, including both data generation and model training. At the end of 30 epochs, it achieved a development set accuracy of 65.27%. Not great, but this example only uses the small sample dataset (1,447 training examples). Considering that this task and dataset have 70 character classes and a 4.19% majority class, we are already doing much better than chance!
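"Better than chance" can be made precise with two simple baselines: guessing uniformly among the classes, and always predicting the most frequent class. The sketch below computes both; the function names and the toy label list are assumptions for illustration, not data from the sample dataset.

```python
from collections import Counter


def uniform_chance(num_classes):
    """Expected accuracy when guessing uniformly at random."""
    return 1.0 / num_classes


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Uniform guessing over the 70 classes mentioned above.
chance_70 = uniform_chance(70)

# Made-up labels: 'a' occurs 3 times out of 10.
toy_labels = ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'e', 'f', 'g']
toy_majority = majority_baseline(toy_labels)
```

For the sample dataset, uniform guessing over 70 classes gives roughly 1.43%, and the reported 65.27% is more than 15 times the 4.19% majority-class rate.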

Running this same code for the full currently available PMJTC dataset takes much longer but, as you would expect when adding more data, achieves a higher accuracy (see Benchmarks). Though certainly indicative of the benefit of more data, the accuracies presented in the benchmarks are not a fair comparison to the one above, for two reasons:

The full dataset has more kana character classes: 131.

The development sets on which accuracies are reported are different.

Conclusion

I hope that this guide has introduced the basics of using CarpeDM and encourages you to define your own models and tasks, and conduct enriching research on Pre-modern Japanese Text Characters and beyond!

Seize the Data Manager!