Spaces: deanna-emery/ASL-MoViNet-T5-translator


deanna-emery committed on Dec 4, 2023

Commit 5672777

1 Parent(s): 9e6df20

updates


This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50)

modeling/official/README-TPU.md +32 -0
modeling/official/README.md +166 -0
modeling/official/__init__.py +14 -0
modeling/official/common/__init__.py +15 -0
modeling/official/common/dataset_fn.py +44 -0
modeling/official/common/distribute_utils.py +233 -0
modeling/official/common/distribute_utils_test.py +124 -0
modeling/official/common/flags.py +114 -0
modeling/official/common/registry_imports.py +20 -0
modeling/official/common/streamz_counters.py +27 -0
modeling/official/core/__init__.py +31 -0
modeling/official/core/actions.py +236 -0
modeling/official/core/actions_test.py +131 -0
modeling/official/core/base_task.py +360 -0
modeling/official/core/base_trainer.py +498 -0
modeling/official/core/base_trainer_test.py +363 -0
modeling/official/core/config_definitions.py +309 -0
modeling/official/core/exp_factory.py +32 -0
modeling/official/core/export_base.py +182 -0
modeling/official/core/export_base_test.py +133 -0
modeling/official/core/file_writers.py +80 -0
modeling/official/core/file_writers_test.py +53 -0
modeling/official/core/input_reader.py +591 -0
modeling/official/core/registry.py +101 -0
modeling/official/core/registry_test.py +88 -0
modeling/official/core/savedmodel_checkpoint_manager.py +258 -0
modeling/official/core/savedmodel_checkpoint_manager_test.py +125 -0
modeling/official/core/task_factory.py +70 -0
modeling/official/core/test_utils.py +59 -0
modeling/official/core/tf_example_builder.py +144 -0
modeling/official/core/tf_example_builder_test.py +165 -0
modeling/official/core/tf_example_feature_key.py +62 -0
modeling/official/core/tf_example_feature_key_test.py +49 -0
modeling/official/core/train_lib.py +372 -0
modeling/official/core/train_lib_test.py +280 -0
modeling/official/core/train_utils.py +610 -0
modeling/official/core/train_utils_test.py +215 -0
modeling/official/legacy/README.md +5 -0
modeling/official/legacy/__init__.py +14 -0
modeling/official/legacy/albert/README.md +4 -0
modeling/official/legacy/albert/__init__.py +14 -0
modeling/official/legacy/albert/configs.py +50 -0
modeling/official/legacy/bert/README.md +395 -0
modeling/official/legacy/bert/__init__.py +15 -0
modeling/official/legacy/bert/bert_cloud_tpu.md +110 -0
modeling/official/legacy/bert/bert_models.py +365 -0
modeling/official/legacy/bert/bert_models_test.py +106 -0
modeling/official/legacy/bert/common_flags.py +125 -0
modeling/official/legacy/bert/configs.py +104 -0
modeling/official/legacy/bert/export_tfhub.py +139 -0

modeling/official/README-TPU.md ADDED

	@@ -0,0 +1,32 @@

+# Officially Supported TensorFlow 2.1+ Models on Cloud TPU
+## Natural Language Processing
+*   [bert](nlp/bert): A powerful pre-trained language representation model:
+    BERT, which stands for Bidirectional Encoder Representations from
+    Transformers.
+    [BERT FineTuning with Cloud TPU](https://cloud.google.com/tpu/docs/tutorials/bert-2.x) provides step-by-step instructions on Cloud TPU training. See the [BERT MNLI Tensorboard.dev metrics](https://tensorboard.dev/experiment/LijZ1IrERxKALQfr76gndA) for the MNLI fine-tuning task.
+*   [transformer](nlp/transformer): A transformer model to translate the WMT
+    English to German dataset.
+    See [Training transformer on Cloud TPU](https://cloud.google.com/tpu/docs/tutorials/transformer-2.x) for step-by-step instructions on Cloud TPU training.
+## Computer Vision
+*   [efficientnet](vision/image_classification): A family of convolutional
+    neural networks that scale by balancing network depth, width, and
+    resolution and can be used to classify ImageNet's dataset of 1000 classes.
+    See [Tensorboard.dev training metrics](https://tensorboard.dev/experiment/KnaWjrq5TXGfv0NW5m7rpg/#scalars).
+*   [mnist](vision/image_classification): A basic model to classify digits
+    from the MNIST dataset. See [Running MNIST on Cloud TPU](https://cloud.google.com/tpu/docs/tutorials/mnist-2.x) tutorial and [Tensorboard.dev metrics](https://tensorboard.dev/experiment/mIah5lppTASvrHqWrdr6NA).
+*   [mask-rcnn](vision/detection): An object detection and instance segmentation model. See [Tensorboard.dev training metrics](https://tensorboard.dev/experiment/LH7k0fMsRwqUAcE09o9kPA).
+*   [resnet](vision/image_classification): A deep residual network that can
+    be used to classify ImageNet's dataset of 1000 classes.
+    See [Training ResNet on Cloud TPU](https://cloud.google.com/tpu/docs/tutorials/resnet-2.x) tutorial and [Tensorboard.dev metrics](https://tensorboard.dev/experiment/CxlDK8YMRrSpYEGtBRpOhg).
+*   [retinanet](vision/detection): A fast and powerful object detector. See [Tensorboard.dev training metrics](https://tensorboard.dev/experiment/b8NRnWU3TqG6Rw0UxueU6Q).
+*   [shapemask](vision/detection): An object detection and instance segmentation model using shape priors. See [Tensorboard.dev training metrics](https://tensorboard.dev/experiment/ZbXgVoc6Rf6mBRlPj0JpLA).
+## Recommendation
+*   [dlrm](recommendation/ranking): [Deep Learning Recommendation Model for
+Personalization and Recommendation Systems](https://arxiv.org/abs/1906.00091).
+*   [dcn v2](recommendation/ranking): [Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/abs/2008.13535).
+*   [ncf](recommendation): Neural Collaborative Filtering. See [Tensorboard.dev training metrics](https://tensorboard.dev/experiment/0k3gKjZlR1ewkVTRyLB6IQ).

modeling/official/README.md ADDED

	@@ -0,0 +1,166 @@

+<div align="center">
+  <img src="https://storage.googleapis.com/tf_model_garden/tf_model_garden_logo.png">
+</div>
+# TensorFlow Official Models
+The TensorFlow official models are a collection of models
+that use TensorFlow’s high-level APIs.
+They are intended to be well-maintained, tested, and kept up to date
+with the latest TensorFlow API.
+They should also be reasonably optimized for fast performance while still
+being easy to read.
+These models are used as end-to-end tests, ensuring that the models run
+with the same or improved speed and performance with each new TensorFlow build.
+The API documentation of the latest stable release is published to
+[tensorflow.org](https://www.tensorflow.org/api_docs/python/tfm).
+## More models to come!
+The team is actively developing new models.
+In the near future, we will add:
+* State-of-the-art language understanding models.
+* State-of-the-art image classification models.
+* State-of-the-art object detection and instance segmentation models.
+* State-of-the-art video classification models.
+## Table of Contents
+- [Models and Implementations](#models-and-implementations)
+  * [Computer Vision](#computer-vision)
+    + [Image Classification](#image-classification)
+    + [Object Detection and Segmentation](#object-detection-and-segmentation)
+    + [Video Classification](#video-classification)
+  * [Natural Language Processing](#natural-language-processing)
+  * [Recommendation](#recommendation)
+- [How to get started with the official models](#how-to-get-started-with-the-official-models)
+- [Contributions](#contributions)
+## Models and Implementations
+### [Computer Vision](vision/README.md)
+#### Image Classification
+| Model | Reference (Paper) |
+|-------|-------------------|
+| [ResNet](vision/MODEL_GARDEN.md) | [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) |
+| [ResNet-RS](vision/MODEL_GARDEN.md) | [Revisiting ResNets: Improved Training and Scaling Strategies](https://arxiv.org/abs/2103.07579) |
+| [EfficientNet](vision/MODEL_GARDEN.md) | [EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks](https://arxiv.org/abs/1905.11946) |
+| [Vision Transformer](vision/MODEL_GARDEN.md) | [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) |
+#### Object Detection and Segmentation
+| Model | Reference (Paper) |
+|-------|-------------------|
+| [RetinaNet](vision/MODEL_GARDEN.md) | [Focal Loss for Dense Object Detection](https://arxiv.org/abs/1708.02002) |
+| [Mask R-CNN](vision/MODEL_GARDEN.md) | [Mask R-CNN](https://arxiv.org/abs/1703.06870) |
+| [YOLO](projects/yolo/README.md) | [YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors](https://arxiv.org/abs/2207.02696) |
+| [SpineNet](vision/MODEL_GARDEN.md) | [SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization](https://arxiv.org/abs/1912.05027) |
+| [Cascade RCNN-RS and RetinaNet-RS](vision/MODEL_GARDEN.md) | [Simple Training Strategies and Model Scaling for Object Detection](https://arxiv.org/abs/2107.00057)|
+#### Video Classification
+| Model | Reference (Paper) |
+|-------|-------------------|
+| [Mobile Video Networks (MoViNets)](projects/movinet) | [MoViNets: Mobile Video Networks for Efficient Video Recognition](https://arxiv.org/abs/2103.11511) |
+### [Natural Language Processing](nlp/README.md)
+#### Pre-trained Language Model
+| Model | Reference (Paper) |
+|-------|-------------------|
+| [ALBERT](nlp/MODEL_GARDEN.md#available-model-configs) | [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942) |
+| [BERT](nlp/MODEL_GARDEN.md#available-model-configs) | [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) |
+| [ELECTRA](nlp/tasks/electra_task.py) | [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555) |
+#### Neural Machine Translation
+| Model | Reference (Paper) |
+|-------|-------------------|
+| [Transformer](nlp/MODEL_GARDEN.md#available-model-configs) | [Attention Is All You Need](https://arxiv.org/abs/1706.03762) |
+#### Natural Language Generation
+| Model | Reference (Paper) |
+|-------|-------------------|
+| [NHNet (News Headline generation model)](projects/nhnet) | [Generating Representative Headlines for News Stories](https://arxiv.org/abs/2001.09386) |
+#### Knowledge Distillation
+| Model | Reference (Paper) |
+|-------|-------------------|
+| [MobileBERT](projects/mobilebert) | [MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices](https://arxiv.org/abs/2004.02984) |
+### Recommendation
+Model                            | Reference (Paper)
+-------------------------------- | -----------------
+[DLRM](recommendation/ranking)   | [Deep Learning Recommendation Model for Personalization and Recommendation Systems](https://arxiv.org/abs/1906.00091)
+[DCN v2](recommendation/ranking) | [Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems](https://arxiv.org/abs/2008.13535)
+[NCF](recommendation)            | [Neural Collaborative Filtering](https://arxiv.org/abs/1708.05031)
+## How to get started with the official models
+*   The official models in the master branch are developed using the
+[master branch of TensorFlow 2](https://github.com/tensorflow/tensorflow/tree/master).
+When you clone the repository or download the `pip` binary for the master branch of
+the official models, the master branch of TensorFlow is downloaded as a
+dependency. This is equivalent to the following:
+```shell
+pip3 install tf-models-nightly
+pip3 install tensorflow-text-nightly # when model uses `nlp` packages
+```
+*   For stable versions targeting a specific release, the TensorFlow-models
+repository version numbers match the target TensorFlow release. For
+example, [TensorFlow-models v2.8.x](https://github.com/tensorflow/models/releases/tag/v2.8.0)
+is compatible with [TensorFlow v2.8.x](https://github.com/tensorflow/tensorflow/releases/tag/v2.8.0).
+This is equivalent to the following:
+```shell
+pip3 install tf-models-official==2.8.0
+pip3 install tensorflow-text==2.8.0 # when the model uses `nlp` packages
+```
+Starting from the 2.9.x release, we release the modeling library as the
+`tensorflow_models` package, and users can `import tensorflow_models` directly to
+access the exported symbols. If you are
+using the latest nightly version or the GitHub code directly, please follow the
+docstrings on GitHub.
+Please follow the steps below before running models in this repository.
+### Requirements
+* The latest TensorFlow Model Garden release and the latest TensorFlow 2
+  * If you are on a version of TensorFlow earlier than 2.2, please
+upgrade your TensorFlow to [the latest TensorFlow 2](https://www.tensorflow.org/install/).
+* Python 3.7+
+Our integration tests run with Python 3.7. Although Python 3.6 should work, we
+don't recommend earlier versions.
+### Installation
+Please check [here](https://github.com/tensorflow/models#Installation) for the
+instructions.
+Available pypi packages:
+* [tf-models-official](https://pypi.org/project/tf-models-official/)
+* [tf-models-nightly](https://pypi.org/project/tf-models-nightly/): nightly
+release with the latest changes.
+* [tf-models-no-deps](https://pypi.org/project/tf-models-no-deps/): without
+`tensorflow` and `tensorflow-text` in the `install_requires` list.
+## Contributions
+If you want to contribute, please review the [contribution guidelines](https://github.com/tensorflow/models/wiki/How-to-contribute).
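
Reviewer note on the README's version-matching rule: the getting-started section says a stable Model Garden release pairs with the TensorFlow release of the same number (the v2.8.x example). A minimal sketch of that check follows. `release_line` and `same_release_line` are hypothetical helpers, not part of either package, and the sketch assumes that "compatible" means equal major and minor version numbers, which is how the README's example reads.

```python
from typing import Tuple


def release_line(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '2.8.0'."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def same_release_line(models_version: str, tf_version: str) -> bool:
    """Hypothetical check: do both packages belong to the same 2.x line?"""
    return release_line(models_version) == release_line(tf_version)


# tf-models-official 2.8.0 alongside TensorFlow 2.8.3: same release line.
matched = same_release_line("2.8.0", "2.8.3")
# tf-models-official 2.8.0 alongside TensorFlow 2.9.1: different release line.
mismatched = same_release_line("2.8.0", "2.9.1")
```

In a real environment the two version strings could come from `importlib.metadata.version("tf-models-official")` and `importlib.metadata.version("tensorflow")`, provided both packages are installed.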

modeling/official/__init__.py ADDED

	@@ -0,0 +1,14 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

modeling/official/common/__init__.py ADDED

	@@ -0,0 +1,15 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

modeling/official/common/dataset_fn.py ADDED

	@@ -0,0 +1,44 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""Utility library for picking an appropriate dataset function."""
+import functools
+from typing import Any, Callable, Type, Union
+import tensorflow as tf, tf_keras
+PossibleDatasetType = Union[Type[tf.data.Dataset], Callable[[tf.Tensor], Any]]
+def pick_dataset_fn(file_type: str) -> PossibleDatasetType:
+  if file_type == 'tfrecord':
+    return tf.data.TFRecordDataset
+  if file_type == 'tfrecord_compressed':
+    return functools.partial(tf.data.TFRecordDataset, compression_type='GZIP')
+  raise ValueError('Unrecognized file_type: {}'.format(file_type))
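
Reviewer note on `pick_dataset_fn`: for compressed records it returns a `functools.partial` that pre-binds `compression_type`, so callers can treat both return values as "something that takes filenames". The sketch below shows that pattern without requiring TensorFlow. `FakeRecordDataset` is a hypothetical stand-in for `tf.data.TFRecordDataset`, and the file names are made up.

```python
import functools


class FakeRecordDataset:
    """Hypothetical stand-in for tf.data.TFRecordDataset (illustration only)."""

    def __init__(self, filenames, compression_type=None):
        self.filenames = filenames
        self.compression_type = compression_type


def pick_dataset_fn(file_type):
    """Mirrors the dispatch in the diff, using the stand-in class."""
    if file_type == 'tfrecord':
        return FakeRecordDataset
    if file_type == 'tfrecord_compressed':
        # partial() fixes compression_type now; filenames are supplied later.
        return functools.partial(FakeRecordDataset, compression_type='GZIP')
    raise ValueError('Unrecognized file_type: {}'.format(file_type))


# Both return values are called the same way, with only the filenames.
plain = pick_dataset_fn('tfrecord')(['a.tfrecord'])
gzipped = pick_dataset_fn('tfrecord_compressed')(['a.tfrecord.gz'])
```

The caller never needs to know whether compression is involved; that decision is made once, when the function is picked.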

modeling/official/common/distribute_utils.py ADDED

	@@ -0,0 +1,233 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Helper functions for running models in a distributed setting."""
+import json
+import os
+import tensorflow as tf, tf_keras
+def _collective_communication(all_reduce_alg):
+  """Return a CollectiveCommunication based on all_reduce_alg.
+  Args:
+    all_reduce_alg: a string specifying which collective communication to pick,
+      or None.
+  Returns:
+    tf.distribute.experimental.CollectiveCommunication object
+  Raises:
+    ValueError: if `all_reduce_alg` not in [None, "ring", "nccl"]
+  """
+  collective_communication_options = {
+      None: tf.distribute.experimental.CollectiveCommunication.AUTO,
+      "ring": tf.distribute.experimental.CollectiveCommunication.RING,
+      "nccl": tf.distribute.experimental.CollectiveCommunication.NCCL
+  }
+  if all_reduce_alg not in collective_communication_options:
+    raise ValueError(
+        "When used with `multi_worker_mirrored`, valid values for "
+        "all_reduce_alg are [`ring`, `nccl`].  Supplied value: {}".format(
+            all_reduce_alg))
+  return collective_communication_options[all_reduce_alg]
+def _mirrored_cross_device_ops(all_reduce_alg, num_packs):
+  """Return a CrossDeviceOps based on all_reduce_alg and num_packs.
+  Args:
+    all_reduce_alg: a string specifying which cross device op to pick, or None.
+    num_packs: an integer specifying number of packs for the cross device op.
+  Returns:
+    tf.distribute.CrossDeviceOps object or None.
+  Raises:
+    ValueError: if `all_reduce_alg` not in [None, "nccl", "hierarchical_copy"].
+  """
+  if all_reduce_alg is None:
+    return None
+  mirrored_all_reduce_options = {
+      "nccl": tf.distribute.NcclAllReduce,
+      "hierarchical_copy": tf.distribute.HierarchicalCopyAllReduce
+  }
+  if all_reduce_alg not in mirrored_all_reduce_options:
+    raise ValueError(
+        "When used with `mirrored`, valid values for all_reduce_alg are "
+        "[`nccl`, `hierarchical_copy`].  Supplied value: {}".format(
+            all_reduce_alg))
+  cross_device_ops_class = mirrored_all_reduce_options[all_reduce_alg]
+  return cross_device_ops_class(num_packs=num_packs)
+def tpu_initialize(tpu_address):
+  """Initializes TPU for TF 2.x training.
+  Args:
+    tpu_address: string, bns address of master TPU worker.
+  Returns:
+    A TPUClusterResolver.
+  """
+  cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
+      tpu=tpu_address)
+  if tpu_address not in ("", "local"):
+    tf.config.experimental_connect_to_cluster(cluster_resolver)
+  tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
+  return cluster_resolver
+def get_distribution_strategy(distribution_strategy="mirrored",
+                              num_gpus=0,
+                              all_reduce_alg=None,
+                              num_packs=1,
+                              tpu_address=None,
+                              **kwargs):
+  """Return a Strategy for running the model.
+  Args:
+    distribution_strategy: a string specifying which distribution strategy to
+      use. Accepted values are "off", "one_device", "mirrored",
+      "parameter_server", "multi_worker_mirrored", and "tpu" -- case
+      insensitive. "tpu" means to use TPUStrategy using `tpu_address`.
+      "off" means to use the default strategy which is obtained from
+      tf.distribute.get_strategy (for details on the default strategy, see
+      https://www.tensorflow.org/guide/distributed_training#default_strategy).
+    num_gpus: Number of GPUs to run this model.
+    all_reduce_alg: Optional. Specifies which algorithm to use when performing
+      all-reduce. For `MirroredStrategy`, valid values are "nccl" and
+      "hierarchical_copy". For `MultiWorkerMirroredStrategy`, valid values are
+      "ring" and "nccl".  If None, DistributionStrategy will choose based on
+      device topology.
+    num_packs: Optional.  Sets the `num_packs` in `tf.distribute.NcclAllReduce`
+      or `tf.distribute.HierarchicalCopyAllReduce` for `MirroredStrategy`.
+    tpu_address: Optional. String that represents TPU to connect to. Must not be
+      None if `distribution_strategy` is set to `tpu`.
+    **kwargs: Additional kwargs for internal usages.
+  Returns:
+    tf.distribute.Strategy object.
+  Raises:
+    ValueError: if `distribution_strategy` is "off" or "one_device" and
+      `num_gpus` is larger than 1; or `num_gpus` is negative or if
+      `distribution_strategy` is `tpu` but `tpu_address` is not specified.
+  """
+  del kwargs
+  if num_gpus < 0:
+    raise ValueError("`num_gpus` can not be negative.")
+  if not isinstance(distribution_strategy, str):
+    msg = ("distribution_strategy must be a string but got: %s." %
+           (distribution_strategy,))
+    if distribution_strategy == False:  # pylint: disable=singleton-comparison,g-explicit-bool-comparison
+      msg += (" If you meant to pass the string 'off', make sure you add "
+              "quotes around 'off' so that yaml interprets it as a string "
+              "instead of a bool.")
+    raise ValueError(msg)
+  distribution_strategy = distribution_strategy.lower()
+  if distribution_strategy == "off":
+    if num_gpus > 1:
+      raise ValueError(f"When {num_gpus} GPUs are specified, "
+                       "distribution_strategy flag cannot be set to `off`.")
+    # Return the default distribution strategy.
+    return tf.distribute.get_strategy()
+  if distribution_strategy == "tpu":
+    # When tpu_address is an empty string, we communicate with local TPUs.
+    cluster_resolver = tpu_initialize(tpu_address)
+    return tf.distribute.TPUStrategy(cluster_resolver)
+  if distribution_strategy == "multi_worker_mirrored":
+    return tf.distribute.experimental.MultiWorkerMirroredStrategy(
+        communication=_collective_communication(all_reduce_alg))
+  if distribution_strategy == "one_device":
+    if num_gpus == 0:
+      return tf.distribute.OneDeviceStrategy("device:CPU:0")
+    if num_gpus > 1:
+      raise ValueError("`OneDeviceStrategy` can not be used for more than "
+                       "one device.")
+    return tf.distribute.OneDeviceStrategy("device:GPU:0")
+  if distribution_strategy == "mirrored":
+    if num_gpus == 0:
+      devices = ["device:CPU:0"]
+    else:
+      devices = ["device:GPU:%d" % i for i in range(num_gpus)]
+    return tf.distribute.MirroredStrategy(
+        devices=devices,
+        cross_device_ops=_mirrored_cross_device_ops(all_reduce_alg, num_packs))
+  if distribution_strategy == "parameter_server":
+    cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
+    return tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)
+  raise ValueError("Unrecognized Distribution Strategy: %r" %
+                   distribution_strategy)
+def configure_cluster(worker_hosts=None, task_index=-1):
+  """Set multi-worker cluster spec in TF_CONFIG environment variable.
+  Args:
+    worker_hosts: comma-separated list of worker ip:port pairs.
+    task_index: index of the worker.
+  Returns:
+    Number of workers in the cluster.
+  """
+  tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
+  if tf_config:
+    num_workers = (
+        len(tf_config["cluster"].get("chief", [])) +
+        len(tf_config["cluster"].get("worker", [])))
+  elif worker_hosts:
+    workers = worker_hosts.split(",")
+    num_workers = len(workers)
+    if num_workers > 1 and task_index < 0:
+      raise ValueError("Must specify task_index when number of workers > 1")
+    task_index = 0 if num_workers == 1 else task_index
+    os.environ["TF_CONFIG"] = json.dumps({
+        "cluster": {
+            "worker": workers
+        },
+        "task": {
+            "type": "worker",
+            "index": task_index
+        }
+    })
+  else:
+    num_workers = 1
+  return num_workers
+def get_strategy_scope(strategy):
+  if strategy:
+    strategy_scope = strategy.scope()
+  else:
+    strategy_scope = DummyContextManager()
+  return strategy_scope
+class DummyContextManager(object):
+  def __enter__(self):
+    pass
+  def __exit__(self, *args):
+    pass
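
Reviewer note on `configure_cluster`: when `TF_CONFIG` is not already set, the function builds a worker-only cluster spec from a comma-separated host list. The sketch below reproduces just that branch as a pure function, so the resulting JSON can be inspected. `build_tf_config` is a hypothetical helper that returns the string instead of writing to `os.environ`, and the host addresses are made up.

```python
import json


def build_tf_config(worker_hosts: str, task_index: int = -1) -> str:
    """Build the TF_CONFIG JSON string for a worker-only cluster."""
    workers = worker_hosts.split(",")
    if len(workers) > 1 and task_index < 0:
        raise ValueError("Must specify task_index when number of workers > 1")
    # A single worker is always task 0, regardless of the supplied index.
    index = 0 if len(workers) == 1 else task_index
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": index},
    })


# Second worker (index 1) in a two-worker cluster.
config = json.loads(
    build_tf_config("10.0.0.1:2222,10.0.0.2:2222", task_index=1))
# Single worker: the default task_index of -1 is replaced by 0.
single = json.loads(build_tf_config("10.0.0.1:2222"))
```

Each worker process would receive the same `cluster` entry and a different `task.index`.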

modeling/official/common/distribute_utils_test.py ADDED

	@@ -0,0 +1,124 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tests for distribution util functions."""
+import sys
+import tensorflow as tf, tf_keras
+from official.common import distribute_utils
+TPU_TEST = 'test_tpu' in sys.argv[0]
+class DistributeUtilsTest(tf.test.TestCase):
+  """Tests for distribute util functions."""
+  def test_invalid_args(self):
+    with self.assertRaisesRegex(ValueError, '`num_gpus` can not be negative.'):
+      _ = distribute_utils.get_distribution_strategy(num_gpus=-1)
+    with self.assertRaisesRegex(ValueError,
+                                '.*If you meant to pass the string .*'):
+      _ = distribute_utils.get_distribution_strategy(
+          distribution_strategy=False, num_gpus=0)
+    with self.assertRaisesRegex(ValueError, 'When 2 GPUs are specified.*'):
+      _ = distribute_utils.get_distribution_strategy(
+          distribution_strategy='off', num_gpus=2)
+    with self.assertRaisesRegex(ValueError,
+                                '`OneDeviceStrategy` can not be used.*'):
+      _ = distribute_utils.get_distribution_strategy(
+          distribution_strategy='one_device', num_gpus=2)
+  def test_one_device_strategy_cpu(self):
+    ds = distribute_utils.get_distribution_strategy('one_device', num_gpus=0)
+    self.assertEqual(ds.num_replicas_in_sync, 1)
+    self.assertEqual(len(ds.extended.worker_devices), 1)
+    self.assertIn('CPU', ds.extended.worker_devices[0])
+  def test_one_device_strategy_gpu(self):
+    ds = distribute_utils.get_distribution_strategy('one_device', num_gpus=1)
+    self.assertEqual(ds.num_replicas_in_sync, 1)
+    self.assertEqual(len(ds.extended.worker_devices), 1)
+    self.assertIn('GPU', ds.extended.worker_devices[0])
+  def test_mirrored_strategy(self):
+    # CPU only.
+    _ = distribute_utils.get_distribution_strategy(num_gpus=0)
+    # 5 GPUs.
+    ds = distribute_utils.get_distribution_strategy(num_gpus=5)
+    self.assertEqual(ds.num_replicas_in_sync, 5)
+    self.assertEqual(len(ds.extended.worker_devices), 5)
+    for device in ds.extended.worker_devices:
+      self.assertIn('GPU', device)
+    _ = distribute_utils.get_distribution_strategy(
+        distribution_strategy='mirrored',
+        num_gpus=2,
+        all_reduce_alg='nccl',
+        num_packs=2)
+    with self.assertRaisesRegex(
+        ValueError,
+        'When used with `mirrored`, valid values for all_reduce_alg are.*'):
+      _ = distribute_utils.get_distribution_strategy(
+          distribution_strategy='mirrored',
+          num_gpus=2,
+          all_reduce_alg='dummy',
+          num_packs=2)
+  def test_mwms(self):
+    distribute_utils.configure_cluster(worker_hosts=None, task_index=-1)
+    ds = distribute_utils.get_distribution_strategy(
+        'multi_worker_mirrored', all_reduce_alg='nccl')
+    self.assertIsInstance(
+        ds, tf.distribute.experimental.MultiWorkerMirroredStrategy)
+    with self.assertRaisesRegex(
+        ValueError,
+        'When used with `multi_worker_mirrored`, valid values.*'):
+      _ = distribute_utils.get_distribution_strategy(
+          'multi_worker_mirrored', all_reduce_alg='dummy')
+  def test_no_strategy(self):
+    ds = distribute_utils.get_distribution_strategy('off')
+    self.assertIs(ds, tf.distribute.get_strategy())
+  def test_tpu_strategy(self):
+    if not TPU_TEST:
+      self.skipTest('Only Cloud TPU VM instances can have local TPUs.')
+    with self.assertRaises(ValueError):
+      _ = distribute_utils.get_distribution_strategy('tpu')
+    ds = distribute_utils.get_distribution_strategy('tpu', tpu_address='local')
+    self.assertIsInstance(
+        ds, tf.distribute.TPUStrategy)
+  def test_invalid_strategy(self):
+    with self.assertRaisesRegex(
+        ValueError,
+        'distribution_strategy must be a string but got: False. If'):
+      distribute_utils.get_distribution_strategy(False)
+    with self.assertRaisesRegex(
+        ValueError, 'distribution_strategy must be a string but got: 1'):
+      distribute_utils.get_distribution_strategy(1)
+  def test_get_strategy_scope(self):
+    ds = distribute_utils.get_distribution_strategy('one_device', num_gpus=0)
+    with distribute_utils.get_strategy_scope(ds):
+      self.assertIs(tf.distribute.get_strategy(), ds)
+    with distribute_utils.get_strategy_scope(None):
+      self.assertIsNot(tf.distribute.get_strategy(), ds)
+if __name__ == '__main__':
+  tf.test.main()
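
Reviewer note on `test_invalid_strategy`: the special-cased error message exists because YAML 1.1 parsers read an unquoted `off` as the boolean `False`, so a config value meant as the string `'off'` can arrive as a bool. A minimal sketch of the same guard follows. `normalize_strategy_name` is a hypothetical helper, not a function in this diff, and its message wording is shortened. It also uses `is False` where the diff uses `== False`, so that the integer `0` does not trigger the hint.

```python
def normalize_strategy_name(value):
    """Hypothetical guard mirroring the type check in get_distribution_strategy."""
    if not isinstance(value, str):
        msg = "distribution_strategy must be a string but got: %s." % (value,)
        if value is False:
            # Likely cause: an unquoted `off` in a YAML config.
            msg += " If you meant to pass the string 'off', quote it in YAML."
        raise ValueError(msg)
    # Strategy names are matched case-insensitively.
    return value.lower()


name = normalize_strategy_name("Mirrored")

try:
    normalize_strategy_name(False)
    hint = None
except ValueError as err:
    hint = str(err)
```

Here `name` holds the lower-cased strategy string and `hint` holds the error text raised for the boolean input.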

modeling/official/common/flags.py ADDED

	@@ -0,0 +1,114 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""The central place to define flags."""
+from absl import flags
+def define_flags():
+  """Defines flags.
+  All flags are defined as optional, but in practice most models use some of
+  these flags and so mark_flags_as_required() should be called after calling
+  this function. Typically, 'experiment', 'mode', and 'model_dir' are required.
+  For example:
+  ```
+  from absl import flags
+  from official.common import flags as tfm_flags  # pylint: disable=line-too-long
+  ...
+  tfm_flags.define_flags()
+  flags.mark_flags_as_required(['experiment', 'mode', 'model_dir'])
+  ```
+  All flags are optional because unit tests often do not set or use any of
+  them.
+  """
+  flags.DEFINE_string(
+      'experiment', default=None, help=
+      'The experiment type registered, specifying an ExperimentConfig.')
+  flags.DEFINE_enum(
+      'mode',
+      default=None,
+      enum_values=[
+          'train', 'eval', 'train_and_eval', 'continuous_eval',
+          'continuous_train_and_eval', 'train_and_validate',
+          'train_and_post_eval'
+      ],
+      help='Mode to run: `train`, `eval`, `train_and_eval`, '
+      '`continuous_eval`, `continuous_train_and_eval`, '
+      '`train_and_post_eval` and `train_and_validate` (which is not '
+      'implemented in the open source version).')
+  flags.DEFINE_string(
+      'model_dir',
+      default=None,
+      help='The directory where the model and training/evaluation summaries '
+      'are stored.')
+  flags.DEFINE_multi_string(
+      'config_file',
+      default=None,
+      help='YAML/JSON files which specify overrides. The override order '
+      'follows the order of args. Note that each file '
+      'can be used as an override template to override the default parameters '
+      'specified in Python. If the same parameter is specified in both '
+      '`--config_file` and `--params_override`, `config_file` will be used '
+      'first, followed by params_override.')
+  flags.DEFINE_string(
+      'params_override',
+      default=None,
+      help='a YAML/JSON string or a YAML file which specifies additional '
+      'overrides over the default parameters and those specified in '
+      '`--config_file`. Note that this is supposed to be used only to override '
+      'the model parameters, but not the parameters like TPU specific flags. '
+      'One canonical use case of `--config_file` and `--params_override` is '
+      'users first define a template config file using `--config_file`, then '
+      'use `--params_override` to adjust the minimal set of tuning parameters, '
+      'for example setting up different `train_batch_size`. The final override '
+      'order of parameters: default_model_params --> params from config_file '
+      '--> params in params_override. See also the help message of '
+      '`--config_file`.')
+  # Libraries that rely on gin often define these flags inside their own
+  # files, which causes conflicts, so duplicate definitions are ignored here.
+  try:
+    flags.DEFINE_multi_string(
+        'gin_file', default=None, help='List of paths to the config files.')
+  except flags.DuplicateFlagError:
+    pass
+  try:
+    flags.DEFINE_multi_string(
+        'gin_params',
+        default=None,
+        help='Newline separated list of Gin parameter bindings.')
+  except flags.DuplicateFlagError:
+    pass
+  flags.DEFINE_string(
+      'tpu',
+      default=None,
+      help='The Cloud TPU to use for training. This should be either the name '
+      'used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 '
+      'url.')
+  flags.DEFINE_string(
+      'tf_data_service', default=None, help='The tf.data service address')
+  flags.DEFINE_string(
+      'tpu_platform', default=None, help='TPU platform type.')
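
The `--params_override` help text above describes the override order as defaults, then each `--config_file` in argument order, then `--params_override`. The sketch below illustrates that order with plain nested dictionaries. It is not the library's own parsing code: `merge_overrides`, `resolve_params`, and the `train_steps` key are hypothetical names chosen for the example, and the real flags take YAML/JSON rather than Python dicts.

```python
import copy


def merge_overrides(base, override):
  """Recursively merges `override` into a deep copy of `base`."""
  merged = copy.deepcopy(base)
  for key, value in override.items():
    if isinstance(value, dict) and isinstance(merged.get(key), dict):
      merged[key] = merge_overrides(merged[key], value)
    else:
      merged[key] = copy.deepcopy(value)
  return merged


def resolve_params(defaults, config_files, params_override):
  """Applies config files in order, then params_override last."""
  params = defaults
  for file_params in config_files:
    params = merge_overrides(params, file_params)
  if params_override:
    params = merge_overrides(params, params_override)
  return params


defaults = {'trainer': {'train_steps': 1000}, 'task': {'train_batch_size': 32}}
config_files = [
    {'trainer': {'train_steps': 5000}},
    {'task': {'train_batch_size': 64}},
]
params_override = {'task': {'train_batch_size': 128}}
resolved = resolve_params(defaults, config_files, params_override)
```

In this sketch the second config file sets `train_batch_size` to 64, but the later `params_override` value wins, which matches the order stated in the help message.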

modeling/official/common/registry_imports.py ADDED

	@@ -0,0 +1,20 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""All necessary imports for registration."""
+# pylint: disable=unused-import
+from official import vision
+from official.nlp import tasks
+from official.nlp.configs import experiment_configs
+from official.utils.testing import mock_task

modeling/official/common/streamz_counters.py ADDED

	@@ -0,0 +1,27 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Global streamz counters."""
+from tensorflow.python.eager import monitoring
+progressive_policy_creation_counter = monitoring.Counter(
+    "/tensorflow/training/fast_training/progressive_policy_creation",
+    "Counter for the number of ProgressivePolicy creations.")
+stack_vars_to_vars_call_counter = monitoring.Counter(
+    "/tensorflow/training/fast_training/tf_vars_to_vars",
+    "Counter for the number of low-level stacking API calls.")

modeling/official/core/__init__.py ADDED

	@@ -0,0 +1,31 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Core is shared by both `nlp` and `vision`."""
+from official.core import actions
+from official.core import base_task
+from official.core import base_trainer
+from official.core import config_definitions
+from official.core import exp_factory
+from official.core import export_base
+from official.core import file_writers
+from official.core import input_reader
+from official.core import registry
+from official.core import savedmodel_checkpoint_manager
+from official.core import task_factory
+from official.core import tf_example_builder
+from official.core import tf_example_feature_key
+from official.core import train_lib
+from official.core import train_utils

modeling/official/core/actions.py ADDED

	@@ -0,0 +1,236 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Provides TFM orbit actions and associated helper functions/classes."""
+import os
+from typing import List
+from absl import logging
+import gin
+import orbit
+import tensorflow as tf, tf_keras
+from official.core import base_trainer
+from official.core import config_definitions
+from official.modeling import optimization
+class PruningAction:
+  """Train action to update pruning-related information.
+  This action updates pruning steps at the end of the training loop and logs
+  pruning metrics to TensorBoard.
+  This action must be used when training a pruned model to avoid pruning errors.
+  """
+  def __init__(
+      self,
+      export_dir: str,
+      model: tf_keras.Model,
+      optimizer: tf_keras.optimizers.Optimizer,
+  ):
+    """Initializes the instance.
+    Args:
+      export_dir: `str` for the export directory of the pruning summaries.
+      model: `tf_keras.Model` model instance used for training. This will be
+        used to assign a pruning step to each prunable weight.
+      optimizer: `tf_keras.optimizers.Optimizer` optimizer instance used for
+        training. This will be used to find the current training steps.
+    """
+    # TODO(b/221490190): Avoid local import when the bug is fixed.
+    import tensorflow_model_optimization as tfmot  # pylint: disable=g-import-not-at-top
+    self._optimizer = optimizer
+    self.update_pruning_step = tfmot.sparsity.keras.UpdatePruningStep()
+    self.update_pruning_step.set_model(model)
+    self.update_pruning_step.on_train_begin()
+    self.pruning_summaries = tfmot.sparsity.keras.PruningSummaries(
+        log_dir=export_dir)
+    model.optimizer = optimizer
+    self.pruning_summaries.set_model(model)
+  def __call__(self, output: orbit.runner.Output):
+    """Update pruning step and log pruning summaries.
+    Args:
+      output: The train output.
+    """
+    self.update_pruning_step.on_epoch_end(batch=None)
+    self.pruning_summaries.on_epoch_begin(epoch=None)
+class EMACheckpointing:
+  """Eval action to save checkpoint with average weights when EMA is used.
+  This action swaps the weights of the model with the average weights, then it
+  saves the checkpoint under export_dir/ema_checkpoints. Checkpointing is
+  expensive for large models, so doing this action in eval is more efficient
+  than training.
+  """
+  def __init__(self,
+               export_dir: str,
+               optimizer: tf_keras.optimizers.Optimizer,
+               checkpoint: tf.train.Checkpoint,
+               max_to_keep: int = 1):
+    """Initializes the instance.
+    Args:
+      export_dir: `str` for the export directory of the EMA average weights.
+      optimizer: `tf_keras.optimizers.Optimizer` optimizer instance used for
+        training. This will be used to swap the model weights with the average
+        weights.
+      checkpoint: `tf.train.Checkpoint` instance.
+      max_to_keep: `int` for max checkpoints to keep in ema_checkpoints subdir.
+    """
+    if not isinstance(optimizer, optimization.ExponentialMovingAverage):
+      raise ValueError('Optimizer has to be instance of '
+                       'optimization.ExponentialMovingAverage for '
+                       'EMACheckpointing action')
+    export_dir = os.path.join(export_dir, 'ema_checkpoints')
+    tf.io.gfile.makedirs(os.path.dirname(export_dir))
+    self._optimizer = optimizer
+    self._checkpoint = checkpoint
+    self._checkpoint_manager = tf.train.CheckpointManager(
+        checkpoint,
+        directory=export_dir,
+        max_to_keep=max_to_keep,
+        checkpoint_name='average_weights')
+  def __call__(self, output: orbit.runner.Output):
+    """Swaps model weights, and saves the checkpoint.
+    Args:
+      output: The train or eval output.
+    """
+    self._optimizer.swap_weights()
+    self._checkpoint_manager.save(checkpoint_number=self._optimizer.iterations)
+    self._optimizer.swap_weights()
+class RecoveryAction:
+  """Train action to recover from loss blowup.
+  Checks the loss value against the given threshold. If applicable, recovers
+  the model by reading the checkpoint on disk.
+  """
+  def __init__(self, checkpoint_manager: tf.train.CheckpointManager):
+    self.checkpoint_manager = checkpoint_manager
+  def __call__(self, _):
+    """Recovers the training by triggering checkpoint restoration."""
+    # Loads the previous good checkpoint.
+    checkpoint_path = self.checkpoint_manager.restore_or_initialize()
+    logging.warning('Recovering the model from checkpoint: %s.',
+                    checkpoint_path)
+class RecoveryCondition:
+  """Recovery Condition."""
+  def __init__(self,
+               global_step: tf.Variable,
+               loss_upper_bound: float,
+               recovery_begin_steps: int = 0,
+               recovery_max_trials: int = 3):
+    self.recover_counter = 0
+    self.recovery_begin_steps = recovery_begin_steps
+    self.recovery_max_trials = recovery_max_trials
+    self.loss_upper_bound = loss_upper_bound
+    self.global_step = global_step
+  def __call__(self, outputs: orbit.runner.Output):
+    loss_value = outputs['training_loss']
+    if tf.math.is_nan(loss_value):
+      self.recover_counter += 1
+      if self.recover_counter > self.recovery_max_trials:
+        raise RuntimeError(
+            'The loss value is NaN after training loop and it happens %d times.'
+            % self.recover_counter)
+      return True
+    if (self.global_step >= self.recovery_begin_steps and
+        loss_value > self.loss_upper_bound):
+      self.recover_counter += 1
+      if self.recover_counter > self.recovery_max_trials:
+        raise RuntimeError(
+            f'The loss value is {loss_value}, which is larger than the bound {self.loss_upper_bound}, happens {self.recover_counter} times.'
+        )
+      return True
+    return False
+@gin.configurable
+def get_eval_actions(params: config_definitions.ExperimentConfig,
+                     trainer: base_trainer.Trainer,
+                     model_dir: str) -> List[orbit.Action]:
+  """Gets eval actions for TFM trainer."""
+  eval_actions = []
+  # Adds ema checkpointing action to save the average weights under
+  # ema_checkpoints subdir.
+  if isinstance(trainer.optimizer, optimization.ExponentialMovingAverage):
+    eval_actions.append(
+        EMACheckpointing(
+            export_dir=model_dir,
+            optimizer=trainer.optimizer,
+            checkpoint=trainer.checkpoint,
+            max_to_keep=params.trainer.max_to_keep))
+  return eval_actions
+@gin.configurable
+def get_train_actions(
+    params: config_definitions.ExperimentConfig, trainer: base_trainer.Trainer,
+    model_dir: str,
+    checkpoint_manager: tf.train.CheckpointManager) -> List[orbit.Action]:
+  """Gets train actions for TFM trainer."""
+  train_actions = []
+  # Adds pruning callback actions.
+  if hasattr(params.task, 'pruning') and params.task.pruning:
+    train_actions.append(
+        PruningAction(
+            export_dir=model_dir,
+            model=trainer.model,
+            optimizer=trainer.optimizer))
+  if params.trainer.recovery_max_trials >= 0:
+    recovery_condition = RecoveryCondition(
+        global_step=trainer.global_step,
+        loss_upper_bound=params.trainer.loss_upper_bound,
+        recovery_begin_steps=params.trainer.recovery_begin_steps,
+        recovery_max_trials=params.trainer.recovery_max_trials,
+    )
+    recover_action = orbit.actions.ConditionalAction(
+        condition=recovery_condition,
+        action=RecoveryAction(checkpoint_manager),
+    )
+    train_actions.append(recover_action)
+  if (
+      params.trainer.preemption_on_demand_checkpoint
+      and trainer.strategy.cluster_resolver
+  ):
+    on_demand_checkpoint_action = orbit.actions.SaveCheckpointIfPreempted(
+        trainer.strategy.cluster_resolver,
+        checkpoint_manager,
+        trainer.global_step,
+        keep_running_after_save=True,
+    )
+    train_actions.append(on_demand_checkpoint_action)
+  return train_actions
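
The decision made by `RecoveryCondition` above can be sketched without TensorFlow or Orbit: a NaN loss always counts as a failure, a loss above the bound counts only once `global_step` has reached `recovery_begin_steps`, and exceeding `recovery_max_trials` raises instead of recovering again. `SimpleRecoveryCondition` is a hypothetical pure-Python stand-in that takes the step and loss as plain numbers; it is an illustration under those assumptions, not the class from `actions.py`.

```python
import math


class SimpleRecoveryCondition:
  """Pure-Python sketch of the recovery decision; names are illustrative."""

  def __init__(self, loss_upper_bound, recovery_begin_steps=0,
               recovery_max_trials=3):
    self.loss_upper_bound = loss_upper_bound
    self.recovery_begin_steps = recovery_begin_steps
    self.recovery_max_trials = recovery_max_trials
    self.recover_counter = 0

  def __call__(self, global_step, loss_value):
    # NaN is checked regardless of the step; the upper bound only applies
    # once training has reached recovery_begin_steps.
    is_nan = math.isnan(loss_value)
    too_large = (global_step >= self.recovery_begin_steps and
                 loss_value > self.loss_upper_bound)
    if not (is_nan or too_large):
      return False
    self.recover_counter += 1
    if self.recover_counter > self.recovery_max_trials:
      raise RuntimeError(
          f'Loss {loss_value} triggered recovery '
          f'{self.recover_counter} times.')
    return True


condition = SimpleRecoveryCondition(loss_upper_bound=0.5, recovery_max_trials=2)
decisions = [condition(0, 0.4), condition(1, 0.6), condition(2, float('nan'))]
try:
  condition(3, 0.7)
  raised = False
except RuntimeError:
  raised = True
```

With `recovery_max_trials=2`, the first two failures request a recovery and the third raises, mirroring the sequence exercised in `test_recovery_condition`.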

modeling/official/core/actions_test.py ADDED

	@@ -0,0 +1,131 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tests for TFM actions."""
+import os
+from absl.testing import parameterized
+import numpy as np
+import orbit
+import tensorflow as tf, tf_keras
+from tensorflow.python.distribute import combinations
+from tensorflow.python.distribute import strategy_combinations
+from official.core import actions
+from official.modeling import optimization
+class TestModel(tf_keras.Model):
+  def __init__(self):
+    super().__init__()
+    self.value = tf.Variable(0.0)
+    self.dense = tf_keras.layers.Dense(2)
+    _ = self.dense(tf.zeros((2, 2), tf.float32))
+  def call(self, x, training=None):
+    return self.value + x
+class ActionsTest(tf.test.TestCase, parameterized.TestCase):
+  @combinations.generate(
+      combinations.combine(
+          distribution=[
+              strategy_combinations.cloud_tpu_strategy,
+              strategy_combinations.one_device_strategy,
+          ],))
+  def test_ema_checkpointing(self, distribution):
+    with distribution.scope():
+      directory = self.create_tempdir()
+      model = TestModel()
+      optimizer = tf_keras.optimizers.SGD()
+      optimizer = optimization.ExponentialMovingAverage(
+          optimizer, trainable_weights_only=False)
+      # Creates average weights for the model variables. Average weights are
+      # initialized to zero.
+      optimizer.shadow_copy(model)
+      checkpoint = tf.train.Checkpoint(model=model)
+      # Changes model.value to 3, average value is still 0.
+      model.value.assign(3)
+      # Checks model.value is 3
+      self.assertEqual(model(0.), 3)
+      ema_action = actions.EMACheckpointing(directory, optimizer, checkpoint)
+      ema_action({})
+      self.assertNotEmpty(
+          tf.io.gfile.glob(os.path.join(directory, 'ema_checkpoints')))
+      checkpoint.read(
+          tf.train.latest_checkpoint(
+              os.path.join(directory, 'ema_checkpoints')))
+      # Checks model.value is 0 after swapping.
+      self.assertEqual(model(0.), 0)
+      # Raises an error for a normal optimizer.
+      with self.assertRaisesRegex(ValueError,
+                                  'Optimizer has to be instance of.*'):
+        _ = actions.EMACheckpointing(directory, tf_keras.optimizers.SGD(),
+                                     checkpoint)
+  @combinations.generate(
+      combinations.combine(
+          distribution=[
+              strategy_combinations.default_strategy,
+              strategy_combinations.cloud_tpu_strategy,
+              strategy_combinations.one_device_strategy_gpu,
+          ],))
+  def test_recovery_condition(self, distribution):
+    with distribution.scope():
+      global_step = orbit.utils.create_global_step()
+      recover_condition = actions.RecoveryCondition(
+          global_step, loss_upper_bound=0.5, recovery_max_trials=2)
+      outputs = {'training_loss': 0.6}
+      self.assertTrue(recover_condition(outputs))
+      self.assertTrue(recover_condition(outputs))
+      with self.assertRaises(RuntimeError):
+        recover_condition(outputs)
+      global_step = orbit.utils.create_global_step()
+      recover_condition = actions.RecoveryCondition(
+          global_step, loss_upper_bound=0.5, recovery_max_trials=2)
+      outputs = {'training_loss': tf.constant([np.nan], tf.float32)}
+      self.assertTrue(recover_condition(outputs))
+      self.assertTrue(recover_condition(outputs))
+      with self.assertRaises(RuntimeError):
+        recover_condition(outputs)
+  @combinations.generate(
+      combinations.combine(
+          distribution=[
+              strategy_combinations.one_device_strategy_gpu,
+              strategy_combinations.one_device_strategy,
+          ],))
+  def test_pruning(self, distribution):
+    with distribution.scope():
+      directory = self.get_temp_dir()
+      model = TestModel()
+      optimizer = tf_keras.optimizers.SGD()
+      pruning = actions.PruningAction(directory, model, optimizer)
+      pruning({})
+if __name__ == '__main__':
+  tf.test.main()

modeling/official/core/base_task.py ADDED

	@@ -0,0 +1,360 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Defines the base task abstraction."""
+import abc
+import functools
+from typing import Optional
+from absl import logging
+import tensorflow as tf, tf_keras
+from official.core import config_definitions
+from official.modeling import optimization
+from official.modeling import performance
+from official.modeling.privacy import configs
+from official.modeling.privacy import ops
+OptimizationConfig = optimization.OptimizationConfig
+RuntimeConfig = config_definitions.RuntimeConfig
+DifferentialPrivacyConfig = configs.DifferentialPrivacyConfig
+class Task(tf.Module, metaclass=abc.ABCMeta):
+  """A single-replica view of training procedure.
+  Tasks provide artifacts for training/validation procedures, including
+  loading/iterating over Datasets, training/validation steps, calculating the
+  loss and customized metrics with reduction.
+  """
+  # Special keys in train/validate step returned logs.
+  loss = "loss"
+  def __init__(self,
+               params,
+               logging_dir: Optional[str] = None,
+               name: Optional[str] = None):
+    """Task initialization.
+    Args:
+      params: the task configuration instance, which can be any of dataclass,
+        ConfigDict, namedtuple, etc.
+      logging_dir: a string pointing to where the model, summaries etc. will be
+        saved. You can also write additional stuff in this directory.
+      name: the task name.
+    """
+    super().__init__(name=name)
+    self._task_config = params
+    self._logging_dir = (
+        logging_dir or ""
+    )  # Empty directory hints current working dir.
+  @property
+  def task_config(self):
+    return self._task_config
+  @property
+  def logging_dir(self) -> str:
+    return self._logging_dir
+  @classmethod
+  def create_optimizer(cls, optimizer_config: OptimizationConfig,
+                       runtime_config: Optional[RuntimeConfig] = None,
+                       dp_config: Optional[DifferentialPrivacyConfig] = None):
+    """Creates a TF optimizer from configurations.
+    Args:
+      optimizer_config: the parameters of the Optimization settings.
+      runtime_config: the parameters of the runtime.
+      dp_config: the parameter of differential privacy.
+    Returns:
+      A tf.optimizers.Optimizer object.
+    """
+    gradient_transformers = None
+    if dp_config is not None:
+      logging.info("Adding differential privacy transform with config %s.",
+                   dp_config.as_dict())
+      noise_stddev = dp_config.clipping_norm * dp_config.noise_multiplier
+      gradient_transformers = [
+          functools.partial(
+              ops.clip_l2_norm, l2_norm_clip=dp_config.clipping_norm),
+          functools.partial(
+              ops.add_noise, noise_stddev=noise_stddev)
+      ]
+    opt_factory = optimization.OptimizerFactory(optimizer_config)
+    optimizer = opt_factory.build_optimizer(
+        opt_factory.build_learning_rate(),
+        gradient_transformers=gradient_transformers
+        )
+    # Configures the optimizer when loss_scale is set in the runtime config.
+    # This helps avoid overflow/underflow in float16 computations.
+    if runtime_config:
+      optimizer = performance.configure_optimizer(
+          optimizer,
+          use_float16=runtime_config.mixed_precision_dtype == "float16",
+          loss_scale=runtime_config.loss_scale)
+    return optimizer
+  def initialize(self, model: tf_keras.Model):
+    """[Optional] A callback function used as CheckpointManager's init_fn.
+    This function will be called when no checkpoint is found for the model.
+    If there is a checkpoint, the checkpoint will be loaded and this function
+    will not be called. You can use this callback function to load a pretrained
+    checkpoint, saved under a directory other than the model_dir.
+    Args:
+      model: The keras.Model built or used by this task.
+    """
+    ckpt_dir_or_file = self.task_config.init_checkpoint
+    logging.info("Trying to load pretrained checkpoint from %s",
+                 ckpt_dir_or_file)
+    if ckpt_dir_or_file and tf.io.gfile.isdir(ckpt_dir_or_file):
+      ckpt_dir_or_file = tf.train.latest_checkpoint(ckpt_dir_or_file)
+    if not ckpt_dir_or_file:
+      logging.info("No checkpoint file found from %s. Will not load.",
+                   ckpt_dir_or_file)
+      return
+    if hasattr(model, "checkpoint_items"):
+      checkpoint_items = model.checkpoint_items
+    else:
+      checkpoint_items = dict(model=model)
+    ckpt = tf.train.Checkpoint(**checkpoint_items)
+    status = ckpt.read(ckpt_dir_or_file)
+    status.expect_partial().assert_existing_objects_matched()
+    logging.info("Finished loading pretrained checkpoint from %s",
+                 ckpt_dir_or_file)
+  def build_model(self) -> tf_keras.Model:
+    """[Optional] Creates model architecture.
+    Returns:
+      A model instance.
+    """  # pytype: disable=bad-return-type  # typed-keras
+  @abc.abstractmethod
+  def build_inputs(self,
+                   params,
+                   input_context: Optional[tf.distribute.InputContext] = None):
+    """Returns a dataset or a nested structure of dataset functions.
+    Dataset functions define per-host datasets with the per-replica batch size.
+    With distributed training, this method runs on remote hosts.
+    Args:
+      params: hyperparams to create input pipelines, which can be any of
+        dataclass, ConfigDict, namedtuple, etc.
+      input_context: optional distribution input pipeline context.
+    Returns:
+      A nested structure of per-replica input functions.
+    """
+  def build_losses(self, labels, model_outputs, aux_losses=None) -> tf.Tensor:
+    """Standard interface to compute losses.
+    Args:
+      labels: optional label tensors.
+      model_outputs: a nested structure of output tensors.
+      aux_losses: auxiliary loss tensors, i.e. `losses` in keras.Model.
+    Returns:
+      The total loss tensor.
+    """
+    del model_outputs, labels
+    if aux_losses is None:
+      losses = [tf.constant(0.0, dtype=tf.float32)]
+    else:
+      losses = aux_losses
+    total_loss = tf.add_n(losses)
+    return total_loss
+  def build_metrics(self, training: bool = True):
+    """Gets streaming metrics for training/validation."""
+    del training
+    return []
+  def process_metrics(self, metrics, labels, model_outputs, **kwargs):
+    """Process and update metrics.
+    Called when using custom training loop API.
+    Args:
+      metrics: a nested structure of metrics objects. The return of function
+        self.build_metrics.
+      labels: a tensor or a nested structure of tensors.
+      model_outputs: a tensor or a nested structure of tensors. For example,
+        output of the keras model built by self.build_model.
+      **kwargs: other args.
+    """
+    for metric in metrics:
+      metric.update_state(labels, model_outputs)
+  def process_compiled_metrics(self, compiled_metrics, labels, model_outputs):
+    """Process and update compiled_metrics.
+    Called when using the compile/fit API.
+    Args:
+      compiled_metrics: the compiled metrics (model.compiled_metrics).
+      labels: a tensor or a nested structure of tensors.
+      model_outputs: a tensor or a nested structure of tensors. For example,
+        output of the keras model built by self.build_model.
+    """
+    compiled_metrics.update_state(labels, model_outputs)
+  def train_step(self,
+                 inputs,
+                 model: tf_keras.Model,
+                 optimizer: tf_keras.optimizers.Optimizer,
+                 metrics=None):
+    """Does forward and backward.
+    With distribution strategies, this method runs on devices.
+    Args:
+      inputs: a dictionary of input tensors.
+      model: the model, forward pass definition.
+      optimizer: the optimizer for this training step.
+      metrics: a nested structure of metrics objects.
+    Returns:
+      A dictionary of logs.
+    """
+    if isinstance(inputs, tuple) and len(inputs) == 2:
+      features, labels = inputs
+    else:
+      features, labels = inputs, inputs
+    with tf.GradientTape() as tape:
+      outputs = model(features, training=True)
+      # Computes per-replica loss.
+      if model.compiled_loss:
+        loss = model.compiled_loss(
+            labels, outputs, regularization_losses=model.losses)
+        loss += self.build_losses(
+            labels=labels, model_outputs=outputs, aux_losses=None)
+      else:
+        loss = self.build_losses(
+            labels=labels, model_outputs=outputs, aux_losses=model.losses)
+      # Scales loss as the default gradients allreduce performs sum inside the
+      # optimizer.
+      scaled_loss = loss / tf.distribute.get_strategy().num_replicas_in_sync
+      # For mixed precision, when a LossScaleOptimizer is used, the loss is
+      # scaled to avoid numeric underflow.
+      if isinstance(optimizer,
+                    tf_keras.mixed_precision.LossScaleOptimizer):
+        scaled_loss = optimizer.get_scaled_loss(scaled_loss)
+    tvars = model.trainable_variables
+    grads = tape.gradient(scaled_loss, tvars)
+    if isinstance(optimizer,
+                  tf_keras.mixed_precision.LossScaleOptimizer):
+      grads = optimizer.get_unscaled_gradients(grads)
+    optimizer.apply_gradients(list(zip(grads, tvars)))
+    logs = {self.loss: loss}
+    if metrics:
+      self.process_metrics(metrics, labels, outputs)
+    if model.compiled_metrics:
+      self.process_compiled_metrics(model.compiled_metrics, labels, outputs)
+      logs.update({m.name: m.result() for m in metrics or []})
+      logs.update({m.name: m.result() for m in model.metrics})
+    return logs
+  def validation_step(self, inputs, model: tf_keras.Model, metrics=None):
+    """Validation step.
+    With distribution strategies, this method runs on devices.
+    Args:
+      inputs: a dictionary of input tensors.
+      model: the keras.Model.
+      metrics: a nested structure of metrics objects.
+    Returns:
+      A dictionary of logs.
+    """
+    if isinstance(inputs, tuple) and len(inputs) == 2:
+      features, labels = inputs
+    else:
+      features, labels = inputs, inputs
+    outputs = self.inference_step(features, model)
+    loss = self.build_losses(
+        labels=labels, model_outputs=outputs, aux_losses=model.losses)
+    logs = {self.loss: loss}
+    if metrics:
+      self.process_metrics(metrics, labels, outputs)
+    if model.compiled_metrics:
+      self.process_compiled_metrics(model.compiled_metrics, labels, outputs)
+      logs.update({m.name: m.result() for m in metrics or []})
+      logs.update({m.name: m.result() for m in model.metrics})
+    return logs
+  def inference_step(self, inputs, model: tf_keras.Model):
+    """Performs the forward step.
+    With distribution strategies, this method runs on devices.
+    Args:
+      inputs: a dictionary of input tensors.
+      model: the keras.Model.
+    Returns:
+      Model outputs.
+    """
+    return model(inputs, training=False)
+  def aggregate_logs(self, state, step_logs):
+    """Optional aggregation over logs returned from a validation step.
+    Given step_logs from a validation step, this function aggregates the logs
+    after each eval_step() (see eval_reduce() function in
+    official/core/base_trainer.py). It runs on CPU and can be used to aggregate
+    metrics during validation, when there are too many metrics that cannot fit
+    into TPU memory. Note that this may increase latency due to data transfer
+    between TPU and CPU. Also, the step output from a validation step may be a
+    tuple with elements from replicas, and a concatenation of the elements is
+    needed in such case.
+    Args:
+      state: The current state of training, for example, it can be a sequence of
+        metrics.
+      step_logs: Logs from a validation step. Can be a dictionary.
+    """
+    pass
+  def reduce_aggregated_logs(self,
+                             aggregated_logs,
+                             global_step: Optional[tf.Tensor] = None):
+    """Optional reduce of aggregated logs over validation steps.
+    This function reduces aggregated logs at the end of validation, and can be
+    used to compute the final metrics. It runs on CPU and is called from each
+    eval_end() in the base trainer (see eval_end() in
+    official/core/base_trainer.py).
+    Args:
+      aggregated_logs: Aggregated logs over multiple validation steps.
+      global_step: An optional variable of global step.
+    Returns:
+      A dictionary of reduced results.
+    """
+    return {}
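
The two hooks above are no-ops in the base task, so a subclass has to supply both halves. Below is a minimal sketch of how they might fit together, using plain Python lists instead of tensors. `ToyTask`, `run_toy_eval`, and the `predictions`/`labels` keys are hypothetical names made up for illustration, and the sketch assumes the first call receives `state=None`, which matches the default in the trainer's `eval_reduce` signature.

```python
# Hypothetical stand-in for a Task subclass. Plain lists replace tensors so
# the control flow is easy to follow; nothing here calls TensorFlow.
class ToyTask:

  def aggregate_logs(self, state=None, step_logs=None):
    # Assumption: the first call gets state=None, so build the accumulator
    # lazily and return it so the caller can thread it through later steps.
    if state is None:
      state = {"predictions": [], "labels": []}
    state["predictions"].extend(step_logs["predictions"])
    state["labels"].extend(step_logs["labels"])
    return state

  def reduce_aggregated_logs(self, aggregated_logs, global_step=None):
    # Runs once after the last step and turns raw outputs into metrics.
    preds = aggregated_logs["predictions"]
    labels = aggregated_logs["labels"]
    correct = sum(int(p == l) for p, l in zip(preds, labels))
    return {"accuracy": correct / max(len(labels), 1)}


def run_toy_eval(task, all_step_logs):
  # Mimics the shape of an eval loop: aggregate per step, reduce at the end.
  state = None
  for step_logs in all_step_logs:
    state = task.aggregate_logs(state, step_logs)
  return task.reduce_aggregated_logs(state)
```

Keeping the per-step work to simple accumulation and deferring the metric computation to the reduce step is one way to keep host-side overhead per step small; whether that matters depends on the task.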

modeling/official/core/base_trainer.py ADDED

	@@ -0,0 +1,498 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Standard Trainer implementation.
+The base trainer implements the Orbit `StandardTrainable` and
+`StandardEvaluable` interfaces. Trainers inside this project should be
+interchangeable and independent of model architectures and tasks.
+"""
+import functools
+from typing import Union, Optional
+from absl import logging
+import gin
+import orbit
+import tensorflow as tf, tf_keras
+from official.core import base_task
+from official.core import config_definitions
+from official.modeling import optimization
+ExperimentConfig = config_definitions.ExperimentConfig
+TrainerConfig = config_definitions.TrainerConfig
+class _AsyncTrainer(orbit.StandardTrainer, orbit.StandardEvaluator):
+  """Trainer class for both sync and async Strategy."""
+  def init_async(self):
+    """Initializes the Async Trainer base class."""
+    assert isinstance(self._strategy, tf.distribute.Strategy)
+    self._is_async = isinstance(
+        self._strategy, tf.distribute.experimental.ParameterServerStrategy)
+    self._coordinator = None
+    if self._is_async:
+      self._coordinator = (
+          tf.distribute.experimental.coordinator.ClusterCoordinator(
+              self._strategy))
+  def coordinator_for_async(
+      self,
+  ) -> tf.distribute.experimental.coordinator.ClusterCoordinator:
+    if not self._coordinator:
+      raise ValueError(
+          "Coordinator uninitialized for async run. Call init_async() first."
+      )
+    return self._coordinator
+  def join(self):
+    """Join all async steps. Only useful in async training."""
+    if getattr(self, "_is_async", False):
+      self.coordinator_for_async().join()
+  def create_train_loop_fn(self):
+    """Creates a training loop from the given step function and options."""
+    train_loop_fn = super().create_train_loop_fn()
+    if getattr(self, "_is_async", False):
+      def _async_loop_fn(iterator, num_steps):
+        self.coordinator_for_async().schedule(
+            train_loop_fn, args=(iterator, num_steps)
+        )
+      return _async_loop_fn
+    else:
+      return train_loop_fn
+  def create_eval_loop_fn(self, has_state: bool):
+    """Creates an eval loop from the given step function and options."""
+    eval_loop_fn = super().create_eval_loop_fn(has_state)
+    if getattr(self, "_is_async", False):
+      if has_state:
+        raise ValueError(
+            "Stateful eval loop is not supported in async training.")
+      def _async_loop_fn(iterator, num_steps, state=None, reduce_fn=None):
+        assert state is None
+        assert reduce_fn is None
+        self.coordinator_for_async().schedule(
+            eval_loop_fn, args=(iterator, num_steps)
+        )
+      return _async_loop_fn
+    else:
+      return eval_loop_fn
+  def distribute_dataset(self, dataset_or_fn, *args, **kwargs):
+    """A utility function to help create a `tf.distribute.DistributedDataset`.
+    Args:
+      dataset_or_fn: An instance of `tf.data.Dataset`, or a "dataset function"
+        returning a `tf.data.Dataset`. If it is a function, it may optionally
+        have an argument named `input_context` which will be passed a
+        `tf.distribute.InputContext` instance.
+      *args: Any positional arguments to pass through to `dataset_or_fn`.
+      **kwargs: Any keyword arguments to pass through to `dataset_or_fn`.
+    Returns:
+      A distributed Dataset.
+    """
+    if getattr(self, "_is_async", False):
+      per_worker_dataset_fn = functools.partial(
+          orbit.utils.make_distributed_dataset, self._strategy, dataset_or_fn,
+          *args, **kwargs)
+      per_worker_dataset_fn = tf.function(per_worker_dataset_fn)
+      return self.coordinator_for_async().create_per_worker_dataset(
+          per_worker_dataset_fn
+      )
+    else:
+      return orbit.utils.make_distributed_dataset(self._strategy, dataset_or_fn,
+                                                  *args, **kwargs)
+def get_runtime_options(config: ExperimentConfig):
+  """Get tf.distribute.RunOptions from config."""
+  xla_options = {}
+  if config.runtime.tpu_enable_xla_dynamic_padder is not None:
+    xla_options["enable_xla_dynamic_padder"] = (
+        config.runtime.tpu_enable_xla_dynamic_padder)
+  return tf.distribute.RunOptions(
+      experimental_xla_options=tf.tpu.XLAOptions(**xla_options))
+@gin.configurable
+class Trainer(_AsyncTrainer):
+  """Implements the common trainer shared for TensorFlow models."""
+  # pylint: disable=super-init-not-called
+  def __init__(
+      self,
+      config: ExperimentConfig,
+      task: base_task.Task,
+      model: tf_keras.Model,
+      optimizer: tf.optimizers.Optimizer,
+      train: bool = True,
+      evaluate: bool = True,
+      train_dataset: Optional[Union[tf.data.Dataset,
+                                    tf.distribute.DistributedDataset]] = None,
+      validation_dataset: Optional[Union[
+          tf.data.Dataset, tf.distribute.DistributedDataset]] = None,
+      checkpoint_exporter=None):
+    """Initialize common trainer for TensorFlow models.
+    Args:
+      config: An `ExperimentConfig` instance specifying experiment config.
+      task: A base_task.Task instance.
+      model: The model instance, e.g. a tf_keras.Model instance.
+      optimizer: tf.optimizers.Optimizer instance.
+      train: bool, whether or not this trainer will be used for training.
+        Defaults to True.
+      evaluate: bool, whether or not this trainer will be used for evaluation.
+        Defaults to True.
+      train_dataset: a dataset object created for training. With tf.distribute,
+        it needs to be a `DistributedDataset`.
+      validation_dataset: a dataset object created for evaluation. With
+        tf.distribute, it needs to be a `DistributedDataset`. The evaluator will
+        create a dataset iterator for each eval round, so the dataset does not
+        need to repeat.
+      checkpoint_exporter: an object that has the `maybe_export_checkpoint`
+        interface.
+    """
+    # Gets the current distribution strategy. If not inside any strategy scope,
+    # it gets a single-replica no-op strategy.
+    self._strategy = tf.distribute.get_strategy()
+    self._validate_params(
+        config,
+        check_train_data=train_dataset is None,
+        check_validation_data=validation_dataset is None)
+    self._config = config
+    self._task = task
+    self._model = model
+    self._optimizer = optimizer
+    self._checkpoint_exporter = checkpoint_exporter
+    self._recovery = None
+    # Runtime options are only applied to train_step.
+    # We use default for eval_step.
+    self._runtime_options = get_runtime_options(config)
+    # Creates a shadow copy of the weights to store weights moving average.
+    if isinstance(self._optimizer, optimization.ExponentialMovingAverage
+                 ) and not self._optimizer.has_shadow_copy:
+      self._optimizer.shadow_copy(self._model)
+    # global_step increases by 1 after each training iteration.
+    # We should have global_step.numpy() == self.optimizer.iterations.numpy()
+    # when there is only 1 optimizer.
+    self._global_step = orbit.utils.create_global_step()
+    if hasattr(self.model, "checkpoint_items"):
+      checkpoint_items = self.model.checkpoint_items
+    else:
+      checkpoint_items = {}
+    self._checkpoint = tf.train.Checkpoint(
+        global_step=self.global_step,
+        model=self.model,
+        optimizer=self.optimizer,
+        **checkpoint_items)
+    self._train_loss = tf_keras.metrics.Mean("training_loss", dtype=tf.float32)
+    self._validation_loss = tf_keras.metrics.Mean(
+        "validation_loss", dtype=tf.float32)
+    model_metrics = model.metrics if hasattr(model, "metrics") else []
+    self.init_async()
+    if train:
+      self._train_metrics = self.task.build_metrics(
+          training=True) + model_metrics
+      train_dataset = train_dataset or self.distribute_dataset(
+          self.task.build_inputs, self.config.task.train_data)
+      orbit.StandardTrainer.__init__(
+          self,
+          train_dataset,
+          options=orbit.StandardTrainerOptions(
+              use_tf_while_loop=config.trainer.train_tf_while_loop,
+              use_tf_function=config.trainer.train_tf_function,
+              use_tpu_summary_optimization=config.trainer.allow_tpu_summary))
+    if evaluate:
+      self._validation_metrics = self.task.build_metrics(
+          training=False) + model_metrics
+      validation_dataset = validation_dataset or self.distribute_dataset(
+          self.task.build_inputs, self.config.task.validation_data)
+      orbit.StandardEvaluator.__init__(
+          self,
+          validation_dataset,
+          options=orbit.StandardEvaluatorOptions(
+              use_tf_function=config.trainer.eval_tf_function,
+              use_tf_while_loop=config.trainer.eval_tf_while_loop))
+  def _validate_params(self,
+                       config,
+                       check_train_data=True,
+                       check_validation_data=True):
+    r"""Validates the configuration object passed to the Trainer.
+    The experiment configuration should be structured as:
+    \trainer
+    \task
+      \train_data
+      \validation_data
+    Args:
+      config: a namedtuple, dataclass, ConfigDict, etc.
+      check_train_data: whether to check task.train_data field.
+      check_validation_data: whether to check task.validation_data field.
+    """
+    if not hasattr(config, "trainer"):
+      raise AttributeError("The trainer requires the configuration contains an"
+                           " attribute `trainer`.")
+    if not hasattr(config, "task"):
+      raise AttributeError("The trainer requires the configuration contains an"
+                           " attribute `task`.")
+    if check_train_data and not hasattr(config.task, "train_data"):
+      raise AttributeError("The trainer requires the configuration contains an"
+                           " attribute `task.train_data`.")
+    if check_validation_data and not hasattr(config.task, "validation_data"):
+      raise AttributeError("The trainer requires the configuration contains an"
+                           " attribute `task.validation_data`.")
+  @property
+  def strategy(self):
+    return self._strategy
+  @property
+  def config(self):
+    return self._config
+  @property
+  def task(self):
+    return self._task
+  @property
+  def model(self):
+    return self._model
+  @property
+  def optimizer(self):
+    if hasattr(self, "_optimizer"):
+      return self._optimizer
+    else:
+      return None
+  @property
+  def global_step(self):
+    return self._global_step
+  @property
+  def train_loss(self):
+    """Accesses the training loss metric object."""
+    return self._train_loss
+  @property
+  def validation_loss(self):
+    """Accesses the validation loss metric object."""
+    return self._validation_loss
+  @property
+  def train_metrics(self):
+    """Accesses all training metric objects."""
+    return self._train_metrics
+  @property
+  def validation_metrics(self):
+    """Accesses all validation metric objects."""
+    return self._validation_metrics
+  def initialize(self):
+    """A callback function.
+    This function will be called when no checkpoint is found for the model.
+    If there is a checkpoint, the checkpoint will be loaded and this function
+    will not be called. Tasks may use this callback function to load a
+    pretrained checkpoint, saved under a directory other than the model_dir.
+    """
+    self.task.initialize(self.model)
+  @property
+  def checkpoint(self):
+    """Accesses the training checkpoint."""
+    return self._checkpoint
+  @property
+  def checkpoint_exporter(self):
+    """Accesses the checkpoint exporter."""
+    return self._checkpoint_exporter
+  def train_loop_end(self):
+    """See base class."""
+    self.join()
+    logs = {}
+    for metric in self.train_metrics + [self.train_loss]:
+      logs[metric.name] = metric.result()
+      metric.reset_states()
+    if callable(self.optimizer.learning_rate):
+      # A custom optimizer may not have `optimizer.iterations`, so fall back
+      # to the global step in that case.
+      if hasattr(self.optimizer, "iterations"):
+        logs["learning_rate"] = self.optimizer.learning_rate(
+            self.optimizer.iterations)
+      else:
+        logs["learning_rate"] = self.optimizer.learning_rate(self.global_step)
+    else:
+      logs["learning_rate"] = self.optimizer.learning_rate
+    return logs
+  def next_train_inputs(self, iterator):
+    """Fetches the next inputs for the model during train.
+    This method consumes the input iterator and returns the next inputs for the
+    model.
+    This method provides a way to control how to fetch the next model input, and
+    what data to send to the model.
+    Note: This function runs on the host side when accelerators are used.
+    Note: Depending on the training setup this may or may not run in eager mode.
+    In most cases it will be run in graph mode.
+    Args:
+      iterator: Dataset iterator to generate the next inputs from.
+    Returns:
+      The inputs to the model.
+    """
+    return next(iterator)
+  def train_step(self, iterator):
+    """See base class."""
+    def step_fn(inputs):
+      if self.config.runtime.enable_xla and (self.config.runtime.num_gpus > 0):
+        task_train_step = tf.function(self.task.train_step, jit_compile=True)
+      else:
+        task_train_step = self.task.train_step
+      logs = task_train_step(
+          inputs,
+          model=self.model,
+          optimizer=self.optimizer,
+          metrics=self.train_metrics)
+      self._train_loss.update_state(logs[self.task.loss])
+      self.global_step.assign_add(1)
+    inputs = self.next_train_inputs(iterator)
+    self.strategy.run(step_fn, args=(inputs,), options=self._runtime_options)
+  def eval_begin(self):
+    """Sets up metrics."""
+    for metric in self.validation_metrics + [self.validation_loss]:
+      metric.reset_states()
+    # Swaps weights to test on weights moving average.
+    if self.optimizer and isinstance(self.optimizer,
+                                     optimization.ExponentialMovingAverage):
+      self.optimizer.swap_weights()
+  def next_eval_inputs(self, iterator):
+    """Fetches the next inputs for the model during eval.
+    This method consumes the input iterator and returns the next inputs for the
+    model and an additional logs dict. The output dict remains in the host (not
+    sent to GPUs/TPUs) and is merged with the model outputs which will be
+    processed later in `aggregate_logs`. This is useful for sending extra logs
+    downstream that are not compatible with the accelerators.
+    Note: This function runs on the host side when accelerators are used.
+    Note: Depending on the training setup this may or may not run in eager mode.
+    In most cases it will be run in graph mode.
+    Args:
+      iterator: Dataset iterator to generate the next inputs from.
+    Returns:
+      The inputs to the model, and an additional logs dictionary. The logs
+      are not passed to the model, instead they are merged with model output
+      logs.
+    """
+    passthrough_logs = dict()
+    return next(iterator), passthrough_logs
+  def eval_step(self, iterator):
+    """See base class."""
+    def step_fn(inputs):
+      logs = self.task.validation_step(
+          inputs, model=self.model, metrics=self.validation_metrics)
+      if self.task.loss in logs:
+        self._validation_loss.update_state(logs[self.task.loss])
+      return logs
+    inputs, passthrough_logs = self.next_eval_inputs(iterator)
+    distributed_outputs = self.strategy.run(step_fn, args=(inputs,))
+    logs = tf.nest.map_structure(
+        self.strategy.experimental_local_results, distributed_outputs
+    )
+    if set(logs.keys()) & set(passthrough_logs.keys()):
+      logging.warning(
+          (
+              "Conflict between the passthrough log keys and the returned model"
+              " log keys. Found %r keys in the passthrough logs and %r keys in"
+              " the model logs. Model log keys take precedence."
+          ),
+          passthrough_logs.keys(),
+          logs.keys(),
+      )
+    return passthrough_logs | logs
+  def eval_end(self, aggregated_logs=None):
+    """Processes evaluation results."""
+    self.join()
+    logs = {}
+    for metric in self.validation_metrics:
+      logs[metric.name] = metric.result()
+    if self.validation_loss.count.numpy() != 0:
+      logs[self.validation_loss.name] = self.validation_loss.result()
+    else:
+      # `self.validation_loss` metric was not updated, because the validation
+      # loss was not returned from the task's `validation_step` method.
+      logging.info("The task did not report validation loss.")
+    if aggregated_logs:
+      metrics = self.task.reduce_aggregated_logs(
+          aggregated_logs, global_step=self.global_step)
+      logs.update(metrics)
+    if self._checkpoint_exporter:
+      self._checkpoint_exporter.maybe_export_checkpoint(
+          self.checkpoint, logs, self.global_step.numpy())
+      metric_name = self.config.trainer.best_checkpoint_eval_metric
+      logs["best_" +
+           metric_name] = self._checkpoint_exporter.best_ckpt_logs[metric_name]
+    # Swaps back weights after testing when EMA is used.
+    # This happens after best checkpoint export so that average weights used for
+    # eval are exported instead of regular weights.
+    if self.optimizer and isinstance(self.optimizer,
+                                     optimization.ExponentialMovingAverage):
+      self.optimizer.swap_weights()
+    return logs
+  def eval_reduce(self, state=None, step_outputs=None):
+    return self.task.aggregate_logs(state, step_outputs)
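
The merge at the end of `eval_step` relies on the dict union operator, where the right-hand operand wins on duplicate keys. The sketch below isolates that behavior in plain Python (3.9 or newer). `merge_eval_logs` is a hypothetical helper written for illustration and is not part of the trainer.

```python
def merge_eval_logs(passthrough_logs, model_logs):
  """Merges host-side logs with model logs; model logs win on conflict."""
  # Keys present in both dicts are the ones the trainer warns about.
  conflicts = set(passthrough_logs) & set(model_logs)
  # With `a | b`, values from `b` override values from `a` for shared keys.
  merged = passthrough_logs | model_logs
  return merged, conflicts
```

For example, merging `{"id": 7, "loss": -1.0}` with `{"loss": 0.25}` keeps `id` from the passthrough logs and takes `loss` from the model logs, which is why the warning says the model log keys take precedence.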

modeling/official/core/base_trainer_test.py ADDED

	@@ -0,0 +1,363 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tests for tensorflow_models.core.trainers.trainer."""
+# pylint: disable=g-direct-tensorflow-import
+import gc
+import multiprocessing
+import os
+import sys
+from absl.testing import parameterized
+import orbit
+import portpicker
+import tensorflow as tf, tf_keras
+from tensorflow.python.distribute import combinations
+from tensorflow.python.distribute import strategy_combinations
+from official.core import base_trainer as trainer_lib
+from official.core import config_definitions as cfg
+from official.core import train_lib
+from official.utils.testing import mock_task
+TPU_TEST = 'test_tpu' in sys.argv[0]
+GPU_TEST = 'test_gpu' in sys.argv[0]
+def all_strategy_combinations():
+  return combinations.combine(
+      distribution=[
+          strategy_combinations.default_strategy,
+          strategy_combinations.cloud_tpu_strategy,
+          strategy_combinations.one_device_strategy_gpu,
+      ],)
+def create_in_process_cluster(num_workers, num_ps):
+  """Creates and starts local servers and returns the cluster_resolver."""
+  worker_ports = [portpicker.pick_unused_port() for _ in range(num_workers)]
+  ps_ports = [portpicker.pick_unused_port() for _ in range(num_ps)]
+  cluster_dict = {}
+  cluster_dict['worker'] = ['localhost:%s' % port for port in worker_ports]
+  if num_ps > 0:
+    cluster_dict['ps'] = ['localhost:%s' % port for port in ps_ports]
+  cluster_spec = tf.train.ClusterSpec(cluster_dict)
+  # Workers need some inter_ops threads to work properly.
+  worker_config = tf.compat.v1.ConfigProto()
+  if multiprocessing.cpu_count() < num_workers + 1:
+    worker_config.inter_op_parallelism_threads = num_workers + 1
+  for i in range(num_workers):
+    tf.distribute.Server(
+        cluster_spec,
+        job_name='worker',
+        task_index=i,
+        config=worker_config,
+        protocol='grpc')
+  for i in range(num_ps):
+    tf.distribute.Server(
+        cluster_spec, job_name='ps', task_index=i, protocol='grpc')
+  cluster_resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
+      cluster_spec, rpc_layer='grpc')
+  return cluster_resolver
+def dataset_fn(input_context=None):
+  del input_context
+  def dummy_data(_):
+    return tf.zeros((1, 1), dtype=tf.float32)
+  dataset = tf.data.Dataset.range(1)
+  dataset = dataset.repeat()
+  dataset = dataset.map(
+      dummy_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)
+  return dataset
+class MockAsyncTrainer(trainer_lib._AsyncTrainer):
+  """Mock AsyncTrainer to test the _AsyncTrainer class."""
+  def __init__(self):
+    self._strategy = tf.distribute.get_strategy()
+    self.init_async()
+    self.global_step = tf.Variable(
+        0,
+        dtype=tf.int64,
+        name='global_step',
+        trainable=False,
+        aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA)
+    self.eval_global_step = tf.Variable(
+        0,
+        dtype=tf.int64,
+        name='eval_global_step',
+        trainable=False,
+        aggregation=tf.VariableAggregation.ONLY_FIRST_REPLICA)
+    train_dataset = self.distribute_dataset(dataset_fn)
+    orbit.StandardTrainer.__init__(
+        self, train_dataset, options=orbit.StandardTrainerOptions())
+    validation_dataset = self.distribute_dataset(dataset_fn)
+    orbit.StandardEvaluator.__init__(
+        self,
+        validation_dataset,
+        options=orbit.StandardEvaluatorOptions(use_tf_while_loop=True))
+  def train_loop_begin(self):
+    self.global_step.assign(0)
+  def train_step(self, iterator):
+    def replica_step(_):
+      self.global_step.assign_add(1)
+    self._strategy.run(replica_step, args=(next(iterator),))
+  def train_loop_end(self):
+    self.join()
+    return self.global_step.numpy()
+  def eval_begin(self):
+    self.eval_global_step.assign(0)
+  def eval_step(self, iterator):
+    def replica_step(_):
+      self.eval_global_step.assign_add(1)
+    self._strategy.run(replica_step, args=(next(iterator),))
+  def eval_end(self):
+    self.join()
+    return self.eval_global_step.numpy()
+class TrainerTest(tf.test.TestCase, parameterized.TestCase):
+  def setUp(self):
+    super().setUp()
+    self._config = cfg.ExperimentConfig(
+        trainer=cfg.TrainerConfig(
+            optimizer_config=cfg.OptimizationConfig({
+                'optimizer': {
+                    'type': 'sgd'
+                },
+                'learning_rate': {
+                    'type': 'constant'
+                }
+            })))
+  def tearDown(self):
+    gc.collect()
+    # This will only contain uncollectable garbage, i.e. reference cycles
+    # involving objects with __del__ defined.
+    self.assertEmpty(gc.garbage)
+    super().tearDown()
+  def create_test_trainer(self, config, model_dir=None, task=None):
+    task = task or mock_task.MockTask(config.task, logging_dir=model_dir)
+    ckpt_exporter = train_lib.maybe_create_best_ckpt_exporter(config, model_dir)
+    trainer = trainer_lib.Trainer(
+        config,
+        task,
+        model=task.build_model(),
+        optimizer=task.create_optimizer(config.trainer.optimizer_config,
+                                        config.runtime),
+        checkpoint_exporter=ckpt_exporter)
+    return trainer
+  @combinations.generate(all_strategy_combinations())
+  def test_trainer_train(self, distribution):
+    with distribution.scope():
+      trainer = self.create_test_trainer(self._config)
+      logs = trainer.train(tf.convert_to_tensor(5, dtype=tf.int32))
+      self.assertIn('training_loss', logs)
+      self.assertIn('learning_rate', logs)
+  @combinations.generate(all_strategy_combinations())
+  def test_trainer_passing_datasets(self, distribution):
+    with distribution.scope():
+      task = mock_task.MockTask(self._config)
+      train_dataset = orbit.utils.make_distributed_dataset(
+          distribution, task.build_inputs, self._config.task.train_data)
+      validation_dataset = orbit.utils.make_distributed_dataset(
+          distribution, task.build_inputs, self._config.task.validation_data)
+      self._config.task.train_data = None
+      self._config.task.validation_data = None
+      trainer = trainer_lib.Trainer(
+          self._config,
+          task,
+          model=task.build_model(),
+          optimizer=task.create_optimizer(self._config.trainer.optimizer_config,
+                                          self._config.runtime),
+          train_dataset=train_dataset,
+          validation_dataset=validation_dataset)
+    logs = trainer.train(tf.convert_to_tensor(5, dtype=tf.int32))
+    self.assertIn('training_loss', logs)
+    self.assertIn('learning_rate', logs)
+    logs = trainer.evaluate(tf.convert_to_tensor(5, dtype=tf.int32))
+    self.assertIn('validation_loss', logs)
+  def test_base_async_trainer(self):
+    if TPU_TEST or GPU_TEST:
+      self.skipTest('Async training is not available on GPU/TPU.')
+    num_workers = 3
+    num_ps = 2
+    cluster_resolver = create_in_process_cluster(num_workers, num_ps)
+    distribution = tf.distribute.experimental.ParameterServerStrategy(
+        cluster_resolver)
+    with distribution.scope():
+      trainer = MockAsyncTrainer()
+      trainer.init_async()
+      self.assertIsInstance(
+          trainer._coordinator,
+          tf.distribute.experimental.coordinator.ClusterCoordinator)
+      self.assertEqual(trainer.train(tf.constant(10)), 10)
+      self.assertEqual(trainer.evaluate(tf.constant(11)), 11)
+  def test_async_trainer_train(self):
+    if TPU_TEST or GPU_TEST:
+      self.skipTest('Async training is not available on GPU/TPU.')
+    num_workers = 3
+    num_ps = 2
+    cluster_resolver = create_in_process_cluster(num_workers, num_ps)
+    distribution = tf.distribute.experimental.ParameterServerStrategy(
+        cluster_resolver)
+    with distribution.scope():
+      config = cfg.ExperimentConfig(**self._config.as_dict())
+      config.trainer.eval_tf_while_loop = True
+      trainer = self.create_test_trainer(config)
+      logs = trainer.train(tf.convert_to_tensor(5, dtype=tf.int32))
+      self.assertIn('training_loss', logs)
+      self.assertIn('learning_rate', logs)
+  def test_async_trainer_validate(self):
+    if TPU_TEST or GPU_TEST:
+      self.skipTest('Async training is not available on GPU/TPU.')
+    num_workers = 3
+    num_ps = 2
+    cluster_resolver = create_in_process_cluster(num_workers, num_ps)
+    distribution = tf.distribute.experimental.ParameterServerStrategy(
+        cluster_resolver)
+    with distribution.scope():
+      config = cfg.ExperimentConfig(**self._config.as_dict())
+      config.trainer.eval_tf_while_loop = True
+      trainer = self.create_test_trainer(config)
+      logs = trainer.evaluate(tf.convert_to_tensor(5, dtype=tf.int32))
+      self.assertIn('acc', logs)
+      self.assertIn('validation_loss', logs)
+  @combinations.generate(all_strategy_combinations())
+  def test_trainer_validate(self, distribution):
+    with distribution.scope():
+      trainer = self.create_test_trainer(self._config)
+      logs = trainer.evaluate(tf.convert_to_tensor(5, dtype=tf.int32))
+      self.assertEqual(logs['counter'], 5. * distribution.num_replicas_in_sync)
+      self.assertIn('validation_loss', logs)
+  @combinations.generate(all_strategy_combinations())
+  def test_trainer_validate_without_loss(self, distribution):
+    class MockTaskWithoutValidationLoss(mock_task.MockTask):
+      def validation_step(self, inputs, model, metrics=None):
+        # Disable validation loss.
+        logs = super().validation_step(inputs, model)
+        del logs[self.loss]
+        return logs
+    with distribution.scope():
+      task = MockTaskWithoutValidationLoss()
+      trainer = self.create_test_trainer(self._config, task=task)
+      logs = trainer.evaluate(tf.convert_to_tensor(5, dtype=tf.int32))
+      self.assertEqual(logs['counter'], 5. * distribution.num_replicas_in_sync)
+      self.assertNotIn('validation_loss', logs)
+  @combinations.generate(
+      combinations.combine(
+          mixed_precision_dtype=['float32', 'bfloat16', 'float16'],
+          loss_scale=[None, 'dynamic', 128, 256],
+      ))
+  def test_configure_optimizer(self, mixed_precision_dtype, loss_scale):
+    config = cfg.ExperimentConfig(
+        runtime=cfg.RuntimeConfig(
+            mixed_precision_dtype=mixed_precision_dtype, loss_scale=loss_scale),
+        trainer=cfg.TrainerConfig(
+            optimizer_config=cfg.OptimizationConfig({
+                'optimizer': {
+                    'type': 'sgd'
+                },
+                'learning_rate': {
+                    'type': 'constant'
+                },
+            })))
+    trainer = self.create_test_trainer(config)
+    if mixed_precision_dtype == 'float16':
+      self.assertIsInstance(trainer.optimizer,
+                            tf_keras.mixed_precision.LossScaleOptimizer)
+      if loss_scale in (None, 'dynamic'):
+        self.assertTrue(trainer.optimizer.dynamic)
+      else:
+        self.assertFalse(trainer.optimizer.dynamic)
+        self.assertEqual(trainer.optimizer.initial_scale, loss_scale)
+    else:
+      self.assertIsInstance(
+          trainer.optimizer,
+          (tf_keras.optimizers.SGD, tf_keras.optimizers.legacy.SGD))
+    metrics = trainer.train(tf.convert_to_tensor(5, dtype=tf.int32))
+    self.assertIn('training_loss', metrics)
+  def test_export_best_ckpt(self):
+    config = cfg.ExperimentConfig(
+        trainer=cfg.TrainerConfig(
+            best_checkpoint_export_subdir='best_ckpt',
+            best_checkpoint_eval_metric='acc',
+            optimizer_config=cfg.OptimizationConfig({
+                'optimizer': {
+                    'type': 'sgd'
+                },
+                'learning_rate': {
+                    'type': 'constant'
+                }
+            })))
+    model_dir = self.get_temp_dir()
+    trainer = self.create_test_trainer(config, model_dir=model_dir)
+    trainer.train(tf.convert_to_tensor(1, dtype=tf.int32))
+    trainer.evaluate(tf.convert_to_tensor(1, dtype=tf.int32))
+    self.assertTrue(
+        tf.io.gfile.exists(os.path.join(model_dir, 'best_ckpt', 'info.json')))
+  def test_model_with_compiled_loss(self):
+    task = mock_task.MockTask()
+    model = task.build_model()
+    model.compile(loss=tf_keras.losses.CategoricalCrossentropy())
+    trainer = trainer_lib.Trainer(
+        self._config,
+        task,
+        model=model,
+        optimizer=task.create_optimizer(self._config.trainer.optimizer_config))
+    logs = trainer.train(tf.convert_to_tensor(5, dtype=tf.int32))
+    self.assertIn('training_loss', logs)
+if __name__ == '__main__':
+  tf.test.main()
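The parameter grid in `test_configure_optimizer` above expands to 3 x 4 = 12 cases. As a rough sketch of what its assertions check, the helper below encodes the same decision table in plain Python, without TensorFlow. The name `expected_optimizer_kind` and the string labels it returns are assumptions made for illustration; they are not part of the trainer.

```python
import itertools

DTYPES = ['float32', 'bfloat16', 'float16']
LOSS_SCALES = [None, 'dynamic', 128, 256]


def expected_optimizer_kind(mixed_precision_dtype, loss_scale):
  """Mirrors the test's assertions; hypothetical helper for illustration."""
  if mixed_precision_dtype != 'float16':
    # float32 and bfloat16: the test expects a plain SGD optimizer.
    return ('sgd', None)
  if loss_scale in (None, 'dynamic'):
    # float16 without an explicit number: dynamic loss scaling.
    return ('loss_scale_dynamic', None)
  # float16 with a number: fixed loss scale equal to that number.
  return ('loss_scale_fixed', loss_scale)


# Same cross product that combinations.combine generates in the test.
cases = list(itertools.product(DTYPES, LOSS_SCALES))
table = {case: expected_optimizer_kind(*case) for case in cases}
```

Only the `float16` rows depend on `loss_scale`; for the other two dtypes the test expects an unwrapped SGD optimizer regardless of the value.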

modeling/official/core/config_definitions.py ADDED

	@@ -0,0 +1,309 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Common configuration settings."""
+import dataclasses
+from typing import Optional, Sequence, Union
+from official.modeling.hyperparams import base_config
+from official.modeling.optimization.configs import optimization_config
+from official.modeling.privacy import configs as dp_configs
+OptimizationConfig = optimization_config.OptimizationConfig
+@dataclasses.dataclass
+class DataConfig(base_config.Config):
+  """The base configuration for building datasets.
+  Attributes:
+    input_path: The path to the input. It can be either (1) a str indicating a
+      file path/pattern, or (2) a str indicating multiple file paths/patterns
+      separated by comma (e.g. "a, b, c" or no spaces "a,b,c"), or (3) a list of
+      str, each of which is a file path/pattern or multiple file paths/patterns
+      separated by comma, or (4) a dictionary of the previous three approaches
+      for more advanced data mixing using named access. It should not be
+      specified when the following `tfds_name` is specified.
+    tfds_name: The name of the tensorflow dataset (TFDS). It should not be
+      specified when the above `input_path` is specified.
+    tfds_split: A str indicating which split of the data to load from TFDS. It
+      is required when above `tfds_name` is specified.
+    global_batch_size: The global batch size across all replicas.
+    is_training: Whether this data is used for training or not. This flag is
+      useful for consumers of this object to determine whether the data should
+      be repeated or shuffled.
+    drop_remainder: Whether the last batch should be dropped in the case it has
+      fewer than `global_batch_size` elements.
+    shuffle_buffer_size: The buffer size used for shuffling training data.
+    cache: Whether to cache dataset examples. If `True`, we will cache the
+      dataset after applying the decode_fn and parse_fn. It can be used to avoid
+      re-reading from disk, re-decoding and re-parsing the example on the second
+      epoch, but it requires significant memory overhead.
+    cycle_length: The number of files that will be processed concurrently when
+      interleaving files.
+    block_length: The number of consecutive elements to produce from each input
+      element before cycling to another input element when interleaving files.
+    deterministic: A boolean controlling whether determinism should be enforced.
+    sharding: Whether sharding is used in the input pipeline.
+    enable_tf_data_service: A boolean indicating whether to enable tf.data
+      service for the input pipeline.
+    tf_data_service_address: The URI of a tf.data service to offload
+      preprocessing onto during training. The URI should be in the format
+      "protocol://address", e.g. "grpc://tf-data-service:5050". It can be
+      overridden by `FLAGS.tf_data_service` flag in the binary.
+    tf_data_service_job_name: The name of the tf.data service job. This argument
+      makes it possible for multiple datasets to share the same job. The default
+      behavior is that the dataset creates anonymous, exclusively owned jobs.
+    tfds_data_dir: A str specifying the directory to read/write TFDS data.
+    tfds_as_supervised: A bool. When loading dataset from TFDS, if True, the
+      returned tf.data.Dataset will have a 2-tuple structure (input, label)
+      according to builder.info.supervised_keys; if False, the default, the
+      returned tf.data.Dataset will have a dictionary with all the features.
+    tfds_skip_decoding_feature: A str to indicate which features are skipped for
+      decoding when loading dataset from TFDS. Use comma to separate multiple
+      features. The main use case is to skip the image/video decoding for better
+      performance.
+    enable_shared_tf_data_service_between_parallel_trainers: A bool. When set to
+      true, only a single tf.data service will be started, and it will be shared
+      between all the trainers running simultaneously, e.g. using vizier to tune
+      hyperparameters. This will save CPU and RAM resources compared to running
+      a separate tf.data service for each trainer. Notice that if batch size is
+      different for different trainers, the field
+      apply_tf_data_service_before_batching also needs to be true so that only a
+      single tf.data service instance will be created. In this case, tf.data
+      service will be applied before batching operation. So make sure to not
+      apply any processing steps after batching (e.g. in postprocess_fn) since
+      they wouldn't be parallelized by tf.data service and may slow down your
+      tf.data pipeline. When using shared tf.data service, the tf.data dataset
+      must be infinite, and a slow trainer may skip certain training examples.
+      More details about shared tf.data service can be found at:
+      https://www.tensorflow.org/api_docs/python/tf/data/experimental/service#sharing_tfdata_service_with_concurrent_trainers.
+    apply_tf_data_service_before_batching: A bool. If set to True, tf.data
+      service will be applied before batching operation. This is useful to make
+      sure only a single tf.data service instance is created when
+      enable_shared_tf_data_service_between_parallel_trainers is true and batch
+      size is changing between parallel trainers.
+    trainer_id: A string. The id of the trainer if there are multiple parallel
+      trainers running at the same time, e.g. in the vizier tuning case. It will
+      be automatically set if this field is needed. Users do not need to set it
+      when creating experiment configs.
+    seed: An optional seed to use for deterministic shuffling/preprocessing.
+    prefetch_buffer_size: An int specifying the buffer size of prefetch
+      datasets. If None, the buffer size is autotuned. Specifying this is useful
+      in case autotuning uses up too much memory by making the buffer size too
+      high.
+    autotune_algorithm: If specified, use this algorithm for AUTOTUNE. See:
+      https://www.tensorflow.org/api_docs/python/tf/data/experimental/AutotuneAlgorithm
+  """
+  input_path: Union[Sequence[str], str, base_config.Config] = ""
+  tfds_name: Union[str, base_config.Config] = ""
+  tfds_split: str = ""
+  global_batch_size: int = 0
+  is_training: Optional[bool] = None
+  drop_remainder: bool = True
+  shuffle_buffer_size: int = 100
+  cache: bool = False
+  cycle_length: Optional[int] = None
+  block_length: int = 1
+  deterministic: Optional[bool] = None
+  sharding: bool = True
+  enable_tf_data_service: bool = False
+  tf_data_service_address: Optional[str] = None
+  tf_data_service_job_name: Optional[str] = None
+  tfds_data_dir: str = ""
+  tfds_as_supervised: bool = False
+  tfds_skip_decoding_feature: str = ""
+  enable_shared_tf_data_service_between_parallel_trainers: bool = False
+  apply_tf_data_service_before_batching: bool = False
+  trainer_id: Optional[str] = None
+  seed: Optional[int] = None
+  prefetch_buffer_size: Optional[int] = None
+  autotune_algorithm: Optional[str] = None
+@dataclasses.dataclass
+class RuntimeConfig(base_config.Config):
+  """High-level configurations for Runtime.
+  These include parameters that are not directly related to the experiment,
+  e.g. directories, accelerator type, etc.
+  Attributes:
+    distribution_strategy: e.g. 'mirrored', 'tpu', etc.
+    enable_xla: Whether or not to enable XLA.
+    per_gpu_thread_count: thread count per GPU.
+    gpu_thread_mode: Whether and how the GPU device uses its own threadpool.
+    dataset_num_private_threads: Number of threads for a private threadpool
+      created for all datasets computation.
+    tpu: The address of the TPU to use, if any.
+    num_gpus: The number of GPUs to use, if any.
+    worker_hosts: comma-separated list of worker ip:port pairs for running
+      multi-worker models with DistributionStrategy.
+    task_index: If multi-worker training, the task index of this worker.
+    all_reduce_alg: Defines the algorithm for performing all-reduce.
+    num_packs: Sets `num_packs` in the cross device ops used in
+      MirroredStrategy.  For details, see tf.distribute.NcclAllReduce.
+    mixed_precision_dtype: dtype of mixed precision policy. It can be 'float32',
+      'float16', or 'bfloat16'.
+    loss_scale: The type of loss scale (e.g. 'dynamic'), or a float value. This
+      is used when setting the mixed precision policy.
+    run_eagerly: Whether or not to run the experiment eagerly.
+    batchnorm_spatial_persistent: Whether or not to enable the spatial
+      persistent mode for CuDNN batch norm kernel for improved GPU performance.
+  """
+  distribution_strategy: str = "mirrored"
+  enable_xla: bool = False
+  gpu_thread_mode: Optional[str] = None
+  dataset_num_private_threads: Optional[int] = None
+  per_gpu_thread_count: int = 0
+  tpu: Optional[str] = None
+  num_gpus: int = 0
+  worker_hosts: Optional[str] = None
+  task_index: int = -1
+  all_reduce_alg: Optional[str] = None
+  num_packs: int = 1
+  mixed_precision_dtype: Optional[str] = None
+  loss_scale: Optional[Union[str, float]] = None
+  run_eagerly: bool = False
+  batchnorm_spatial_persistent: bool = False
+  # XLA runtime params.
+  # XLA params are only applied to the train_step.
+  # These arguments can improve training speed. They can also improve eval, but
+  # may reduce usability, and users would need to make changes to their code.
+  # Whether to enable XLA dynamic padder
+  # infrastructure to handle dynamic shapes inputs inside XLA. True by
+  # default. Disabling this may cause correctness issues with dynamic shapes
+  # inputs, as XLA will just assume the inputs are with padded shapes. However
+  # users can optionally set it to False to improve device time if masking is
+  # already handled in the user side.
+  # If None, will respect XLA default.
+  tpu_enable_xla_dynamic_padder: Optional[bool] = None
+  # Global model parallelism configurations.
+  num_cores_per_replica: int = 1
+  default_shard_dim: int = -1
+  use_tpu_mp_strategy: bool = False
+  def model_parallelism(self):
+    return dict(
+        num_cores_per_replica=self.num_cores_per_replica,
+        default_shard_dim=self.default_shard_dim)
+@dataclasses.dataclass
+class TrainerConfig(base_config.Config):
+  """Configuration for trainer.
+  Attributes:
+    optimizer_config: optimizer config, it includes optimizer, learning rate,
+      and warmup schedule configs.
+    train_tf_while_loop: whether or not to use tf while loop.
+    train_tf_function: whether or not to use tf_function for training loop.
+    eval_tf_function: whether or not to use tf_function for eval.
+    eval_tf_while_loop: whether or not to use tf while loop for eval.
+    allow_tpu_summary: Whether to allow summaries to happen inside the XLA
+      program that runs on TPU, through automatic outside compilation.
+    steps_per_loop: number of steps per loop to report training metrics. This
+      can also be used to reduce host worker communication in a TPU setup.
+    summary_interval: number of steps between each summary.
+    checkpoint_interval: number of steps between checkpoints.
+    max_to_keep: max checkpoints to keep.
+    continuous_eval_timeout: maximum number of seconds to wait between
+      checkpoints. If set to None, continuous eval will wait indefinitely. This
+      is only used in continuous_train_and_eval and continuous_eval modes. The
+      default value is 1 hour.
+    train_steps: number of train steps.
+    validation_steps: number of eval steps. If -1, the entire eval dataset is
+      used.
+    validation_interval: number of training steps to run between evaluations.
+    best_checkpoint_export_subdir: if set, the trainer will keep track of the
+      best evaluation metric, and export the corresponding best checkpoint under
+      `model_dir/best_checkpoint_export_subdir`. Note that this only works if
+      mode contains eval (such as `train_and_eval`, `continuous_eval`, and
+      `continuous_train_and_eval`).
+    best_checkpoint_eval_metric: for exporting the best checkpoint, which
+      evaluation metric the trainer should monitor. This can be any evaluation
+      metric that appears on TensorBoard.
+    best_checkpoint_metric_comp: for exporting the best checkpoint, how the
+      trainer should compare the evaluation metrics. This can be either `higher`
+      (higher the better) or `lower` (lower the better).
+    validation_summary_subdir: A 'str', sub directory for saving eval summary.
+    preemption_on_demand_checkpoint: whether or not to save on-demand
+      checkpoints after a preemption.
+  """
+  optimizer_config: OptimizationConfig = dataclasses.field(
+      default_factory=OptimizationConfig
+  )
+  # Orbit settings.
+  train_tf_while_loop: bool = True
+  train_tf_function: bool = True
+  eval_tf_function: bool = True
+  eval_tf_while_loop: bool = False
+  allow_tpu_summary: bool = False
+  # Trainer intervals.
+  steps_per_loop: int = 1000
+  summary_interval: int = 1000
+  checkpoint_interval: int = 1000
+  # Checkpoint manager.
+  max_to_keep: int = 5
+  continuous_eval_timeout: int = 60 * 60
+  # Train/Eval routines.
+  train_steps: int = 0
+  # Sets validation steps to be -1 to evaluate the entire dataset.
+  validation_steps: int = -1
+  validation_interval: int = 1000
+  # Best checkpoint export.
+  best_checkpoint_export_subdir: str = ""
+  best_checkpoint_eval_metric: str = ""
+  best_checkpoint_metric_comp: str = "higher"
+  # Blowup recovery.
+  loss_upper_bound: float = 1e6
+  recovery_begin_steps: int = 0  # Enforcing the loss bound after these steps.
+  # When max trials < 0, there is no recovery module; when max trials = 0, we
+  # check the condition and fail the job if it occurs; when max trials > 0,
+  # we restore the model states.
+  recovery_max_trials: int = 0
+  validation_summary_subdir: str = "validation"
+  # Preemption on-demand checkpoint.
+  preemption_on_demand_checkpoint: bool = True  # copybara-replace
+@dataclasses.dataclass
+class TaskConfig(base_config.Config):
+  """Config passed to task."""
+  init_checkpoint: str = ""
+  model: Optional[base_config.Config] = None
+  train_data: DataConfig = dataclasses.field(default_factory=DataConfig)
+  validation_data: DataConfig = dataclasses.field(default_factory=DataConfig)
+  name: Optional[str] = None
+  # Configs for differential privacy
+  # These configs are only effective if you use create_optimizer in
+  # tensorflow_models/official/core/base_task.py
+  # DEPRECATED b/264611883
+  differential_privacy_config: Optional[
+      dp_configs.DifferentialPrivacyConfig] = None
+  # Whether to show image summary. Useful to visualize model predictions. Only
+  # works for vision tasks.
+  allow_image_summary: bool = False
+@dataclasses.dataclass
+class ExperimentConfig(base_config.Config):
+  """Top-level configuration."""
+  task: TaskConfig = dataclasses.field(default_factory=TaskConfig)
+  trainer: TrainerConfig = dataclasses.field(default_factory=TrainerConfig)
+  runtime: RuntimeConfig = dataclasses.field(default_factory=RuntimeConfig)
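The nested fields of `ExperimentConfig` are declared with `dataclasses.field(default_factory=...)` rather than with a single shared default instance. A minimal stdlib-only sketch of the effect follows. The classes `Trainer` and `Experiment` are hypothetical stand-ins and do not inherit from `base_config.Config`, so they show only the dataclass behavior, not the override machinery of the real configs.

```python
import dataclasses


@dataclasses.dataclass
class Trainer:
  # Stand-in for TrainerConfig, using two of its documented defaults.
  train_steps: int = 0
  validation_steps: int = -1


@dataclasses.dataclass
class Experiment:
  # default_factory builds a fresh Trainer for every Experiment instance.
  trainer: Trainer = dataclasses.field(default_factory=Trainer)


exp_a = Experiment()
exp_b = Experiment()
exp_a.trainer.train_steps = 100  # Does not leak into exp_b.
```

Because each `Experiment` owns its own `Trainer`, mutating one experiment's settings leaves every other experiment at its defaults.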

modeling/official/core/exp_factory.py ADDED

	@@ -0,0 +1,32 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Experiment factory methods."""
+from official.core import config_definitions as cfg
+from official.core import registry
+_REGISTERED_CONFIGS = {}
+def register_config_factory(name):
+  """Register ExperimentConfig factory method."""
+  return registry.register(_REGISTERED_CONFIGS, name)
+def get_exp_config(exp_name: str) -> cfg.ExperimentConfig:
+  """Looks up the `ExperimentConfig` according to the `exp_name`."""
+  exp_creater = registry.lookup(_REGISTERED_CONFIGS, exp_name)
+  return exp_creater()

modeling/official/core/export_base.py ADDED

	@@ -0,0 +1,182 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Base class for model export."""
+import abc
+import functools
+import time
+from typing import Any, Callable, Dict, Mapping, List, Optional, Text, Union
+from absl import logging
+import tensorflow as tf, tf_keras
+MAX_DIRECTORY_CREATION_ATTEMPTS = 10
+class ExportModule(tf.Module, metaclass=abc.ABCMeta):
+  """Base Export Module."""
+  def __init__(self,
+               params,
+               model: Union[tf.Module, tf_keras.Model],
+               inference_step: Optional[Callable[..., Any]] = None,
+               *,
+               preprocessor: Optional[Callable[..., Any]] = None,
+               postprocessor: Optional[Callable[..., Any]] = None):
+    """Instantiates an ExportModule.
+    Examples:
+    `inference_step` must be a function that has `model` as a kwarg or the
+    second positional argument.
+    ```
+    def _inference_step(inputs, model=None):
+      return model(inputs, training=False)
+    module = ExportModule(params, model, inference_step=_inference_step)
+    ```
+    `preprocessor` and `postprocessor` could be either functions or `tf.Module`.
+    The usages of preprocessor and postprocessor are managed by the
+    implementation of `serve()` method.
+    Args:
+      params: A dataclass for parameters to the module.
+      model: A model instance which contains weights and forward computation.
+      inference_step: An optional callable to forward-pass the model. It is
+        bound to `model` with `functools.partial`. If not specified, the
+        model's own `call` (Keras) or `__call__` with `training=False` is used.
+      preprocessor: An optional callable to preprocess the inputs.
+      postprocessor: An optional callable to postprocess the model outputs.
+    """
+    super().__init__(name=None)
+    self.model = model
+    self.params = params
+    if inference_step is not None:
+      self.inference_step = functools.partial(inference_step, model=self.model)
+    else:
+      if issubclass(type(model), tf_keras.Model):
+        # Default to self.model.call instead of self.model.__call__ to avoid
+        # keras tracing logic designed for training.
+        # Since most of Model Garden's call methods do not have a training
+        # kwarg, or its default is False, we don't pass anything here.
+        # Please pass a custom inference step if your model has training=True
+        # as default.
+        self.inference_step = self.model.call
+      else:
+        self.inference_step = functools.partial(
+            self.model.__call__, training=False)
+    self.preprocessor = preprocessor
+    self.postprocessor = postprocessor
+  @abc.abstractmethod
+  def serve(self) -> Mapping[Text, tf.Tensor]:
+    """The bare inference function which should run on all devices.
+    Expects tensors to be passed in through keyword arguments. Returns a
+    dictionary of tensors, whose keys will be used inside the SignatureDef.
+    """
+  @abc.abstractmethod
+  def get_inference_signatures(
+      self, function_keys: Dict[Text, Text]) -> Mapping[Text, Any]:
+    """Get defined function signatures."""
+def export(export_module: ExportModule,
+           function_keys: Union[List[Text], Dict[Text, Text]],
+           export_savedmodel_dir: Text,
+           checkpoint_path: Optional[Text] = None,
+           timestamped: bool = True,
+           save_options: Optional[tf.saved_model.SaveOptions] = None,
+           checkpoint: Optional[tf.train.Checkpoint] = None) -> Text:
+  """Exports to SavedModel format.
+  Args:
+    export_module: an ExportModule with the keras Model and serving tf.functions.
+    function_keys: a list of string keys to retrieve pre-defined serving
+      signatures. The signature keys will be set with defaults. If a dictionary
+      is provided, the values will be used as signature keys.
+    export_savedmodel_dir: Output saved model directory.
+    checkpoint_path: Object-based checkpoint path or directory.
+    timestamped: Whether to export the savedmodel to a timestamped directory.
+    save_options: `SaveOptions` for `tf.saved_model.save`.
+    checkpoint: An optional tf.train.Checkpoint. If provided, the export module
+      will use it to read the weights.
+  Returns:
+    The savedmodel directory path.
+  """
+  ckpt_dir_or_file = checkpoint_path
+  if ckpt_dir_or_file is not None and tf.io.gfile.isdir(ckpt_dir_or_file):
+    ckpt_dir_or_file = tf.train.latest_checkpoint(ckpt_dir_or_file)
+  if ckpt_dir_or_file:
+    if checkpoint is None:
+      checkpoint = tf.train.Checkpoint(model=export_module.model)
+    checkpoint.read(
+        ckpt_dir_or_file).assert_existing_objects_matched().expect_partial()
+  if isinstance(function_keys, list):
+    if len(function_keys) == 1:
+      function_keys = {
+          function_keys[0]: tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY
+      }
+    else:
+      raise ValueError(
+          'If the function_keys is a list, it must contain a single element. %s'
+          % function_keys)
+  signatures = export_module.get_inference_signatures(function_keys)
+  if timestamped:
+    export_dir = get_timestamped_export_dir(export_savedmodel_dir).decode(
+        'utf-8')
+  else:
+    export_dir = export_savedmodel_dir
+  tf.saved_model.save(
+      export_module, export_dir, signatures=signatures, options=save_options)
+  return export_dir
+def get_timestamped_export_dir(export_dir_base):
+  """Builds a path to a new subdirectory within the base directory.
+  Args:
+    export_dir_base: A string containing a directory to write the exported graph
+      and checkpoints.
+  Returns:
+    The full path of the new subdirectory (which is not actually created yet).
+  Raises:
+    RuntimeError: if repeated attempts fail to obtain a unique timestamped
+      directory name.
+  """
+  attempts = 0
+  while attempts < MAX_DIRECTORY_CREATION_ATTEMPTS:
+    timestamp = int(time.time())
+    result_dir = tf.io.gfile.join(
+        tf.compat.as_bytes(export_dir_base), tf.compat.as_bytes(str(timestamp)))
+    if not tf.io.gfile.exists(result_dir):
+      # Collisions are still possible (though extremely unlikely): this
+      # directory is not actually created yet, but it will be almost
+      # instantly on return from this function.
+      return result_dir
+    time.sleep(1)
+    attempts += 1
+    logging.warning('Directory %s already exists; retrying (attempt %s/%s)',
+                    str(result_dir), attempts, MAX_DIRECTORY_CREATION_ATTEMPTS)
+  raise RuntimeError('Failed to obtain a unique export directory name after '
+                     f'{MAX_DIRECTORY_CREATION_ATTEMPTS} attempts.')
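`ExportModule.__init__` binds a custom `inference_step` to the model with `functools.partial(inference_step, model=self.model)`, which is why the step must accept `model` as a keyword or as its second positional argument. A small sketch of that binding in plain Python follows. `ToyModel` is a hypothetical stand-in that doubles its input; it is not a Keras model.

```python
import functools


class ToyModel:
  """Hypothetical stand-in for a model; doubles each input value."""

  def __call__(self, inputs, training=False):
    return [2 * x for x in inputs]


def _inference_step(inputs, model=None):
  return model(inputs, training=False)


model = ToyModel()
# The same binding ExportModule.__init__ applies to a custom inference_step.
inference_step = functools.partial(_inference_step, model=model)
outputs = inference_step([1, 2, 3])
```

After binding, `serve()` implementations can call `self.inference_step(x)` with the inputs alone, as `TestModule.serve` does in the test file.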

modeling/official/core/export_base_test.py ADDED

	@@ -0,0 +1,133 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tests for official.core.export_base."""
+import os
+from typing import Any, Dict, Mapping, Text
+import tensorflow as tf, tf_keras
+from official.core import export_base
+class TestModule(export_base.ExportModule):
+  @tf.function
+  def serve(self, inputs: tf.Tensor) -> Mapping[Text, tf.Tensor]:
+    x = inputs if self.preprocessor is None else self.preprocessor(
+        inputs=inputs)
+    x = self.inference_step(x)
+    x = self.postprocessor(x) if self.postprocessor else x
+    return {'outputs': x}
+  def get_inference_signatures(
+      self, function_keys: Dict[Text, Text]) -> Mapping[Text, Any]:
+    input_signature = tf.TensorSpec(shape=[None, None], dtype=tf.float32)
+    return {'foo': self.serve.get_concrete_function(input_signature)}
+class ExportBaseTest(tf.test.TestCase):
+  def test_export_module(self):
+    tmp_dir = self.get_temp_dir()
+    model = tf_keras.layers.Dense(2)
+    inputs = tf.ones([2, 4], tf.float32)
+    expected_output = model(inputs, training=False)
+    module = TestModule(params=None, model=model)
+    ckpt_path = tf.train.Checkpoint(model=model).save(
+        os.path.join(tmp_dir, 'ckpt'))
+    export_dir = export_base.export(
+        module, ['foo'],
+        export_savedmodel_dir=tmp_dir,
+        checkpoint_path=ckpt_path,
+        timestamped=True)
+    self.assertTrue(os.path.exists(os.path.join(export_dir, 'saved_model.pb')))
+    self.assertTrue(
+        os.path.exists(
+            os.path.join(export_dir, 'variables', 'variables.index')))
+    self.assertTrue(
+        os.path.exists(
+            os.path.join(export_dir, 'variables',
+                         'variables.data-00000-of-00001')))
+    imported = tf.saved_model.load(export_dir)
+    output = imported.signatures['foo'](inputs)
+    self.assertAllClose(output['outputs'].numpy(), expected_output.numpy())
+  def test_custom_inference_step(self):
+    tmp_dir = self.get_temp_dir()
+    model = tf_keras.layers.Dense(2)
+    inputs = tf.ones([2, 4], tf.float32)
+    def _inference_step(inputs, model):
+      return tf.nn.softmax(model(inputs, training=False))
+    module = TestModule(
+        params=None, model=model, inference_step=_inference_step)
+    expected_output = _inference_step(inputs, model)
+    ckpt_path = tf.train.Checkpoint(model=model).save(
+        os.path.join(tmp_dir, 'ckpt'))
+    export_dir = export_base.export(
+        module, ['foo'],
+        export_savedmodel_dir=tmp_dir,
+        checkpoint_path=ckpt_path,
+        timestamped=False)
+    imported = tf.saved_model.load(export_dir)
+    output = imported.signatures['foo'](inputs)
+    self.assertAllClose(output['outputs'].numpy(), expected_output.numpy())
+  def test_processors(self):
+    model = tf.Module()
+    inputs = tf.zeros((), tf.float32)
+    def _inference_step(inputs, model):
+      del model
+      return inputs + 1.0
+    def _preprocessor(inputs):
+      print(inputs)
+      return inputs + 0.1
+    module = TestModule(
+        params=None,
+        model=model,
+        inference_step=_inference_step,
+        preprocessor=_preprocessor)
+    output = module.serve(inputs)
+    self.assertAllClose(output['outputs'].numpy(), 1.1)
+    class _PostProcessor(tf.Module):
+      def __call__(self, inputs):
+        return inputs + 0.01
+    module = TestModule(
+        params=None,
+        model=model,
+        inference_step=_inference_step,
+        preprocessor=_preprocessor,
+        postprocessor=_PostProcessor())
+    output = module.serve(inputs)
+    self.assertAllClose(output['outputs'].numpy(), 1.11)
+  def test_get_timestamped_export_dir(self):
+    export_dir = self.get_temp_dir()
+    timed_dir = export_base.get_timestamped_export_dir(
+        export_dir_base=export_dir)
+    self.assertFalse(tf.io.gfile.exists(timed_dir))
+    self.assertIn(export_dir, str(timed_dir))
+if __name__ == '__main__':
+  tf.test.main()
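The expected values in `test_processors` (1.1 and 1.11) follow from chaining the three steps on a zero input: 0 + 0.1, then + 1.0, then + 0.01. The sketch below reproduces that arithmetic in plain Python. The free function `serve` is an illustrative analogue of `TestModule.serve`, not the TensorFlow implementation, and it takes the processors as arguments instead of reading them from `self`.

```python
import math


def serve(inputs, inference_step, preprocessor=None, postprocessor=None):
  """Plain-Python analogue of TestModule.serve, for illustration only."""
  x = inputs if preprocessor is None else preprocessor(inputs)
  x = inference_step(x)
  x = postprocessor(x) if postprocessor else x
  return {'outputs': x}


def inference_step(x):
  return x + 1.0


def preprocessor(x):
  return x + 0.1


def postprocessor(x):
  return x + 0.01


without_post = serve(0.0, inference_step, preprocessor=preprocessor)
with_post = serve(0.0, inference_step, preprocessor, postprocessor)
```

Comparing with `math.isclose` (or `assertAllClose` in the test) rather than `==` avoids spurious failures from floating-point rounding.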

modeling/official/core/file_writers.py ADDED

	@@ -0,0 +1,80 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""File writer functions for dataset preparation, infra validation, and unit tests."""
+import io
+from typing import Optional, Sequence, Union
+import tensorflow as tf, tf_keras
+def write_small_dataset(examples: Sequence[Union[tf.train.Example,
+                                                 tf.train.SequenceExample]],
+                        output_path: str,
+                        file_type: str = 'tfrecord') -> None:
+  """Writes `examples` to a file at `output_path` with type `file_type`.
+  CAVEAT: This function is not recommended for writing large datasets, since it
+  will loop through `examples` and perform write operations sequentially.
+  Args:
+    examples: List of tf.train.Example or tf.train.SequenceExample.
+    output_path: Output path for the dataset.
+    file_type: A string indicating the file format, could be: 'tfrecord',
+      'tfrecords', 'tfrecord_compressed', 'tfrecords_gzip', 'riegeli'. The
+      string is case insensitive.
+  """
+  file_type = file_type.lower()
+  if file_type == 'tfrecord' or file_type == 'tfrecords':
+    _write_tfrecord(examples, output_path)
+  elif file_type == 'tfrecord_compressed' or file_type == 'tfrecords_gzip':
+    _write_tfrecord(examples, output_path,
+                    tf.io.TFRecordOptions(compression_type='GZIP'))
+  elif file_type == 'riegeli':
+    _write_riegeli(examples, output_path)
+  else:
+    raise ValueError(f'Unknown file_type: {file_type}')
+def _write_tfrecord(examples: Sequence[Union[tf.train.Example,
+                                             tf.train.SequenceExample]],
+                    output_path: str,
+                    options: Optional[tf.io.TFRecordOptions] = None) -> None:
+  """Writes `examples` to a TFRecord file at `output_path`.
+  Args:
+    examples: A list of tf.train.Example or tf.train.SequenceExample.
+    output_path: Output path for the dataset.
+    options: Options used for manipulating TFRecord files.
+  """
+  with tf.io.TFRecordWriter(output_path, options) as writer:
+    for example in examples:
+      writer.write(example.SerializeToString())
+def _write_riegeli(examples: Sequence[Union[tf.train.Example,
+                                            tf.train.SequenceExample]],
+                   output_path: str) -> None:
+  """Writes `examples` to a Riegeli file at `output_path`.
+  Args:
+    examples: A list of tf.train.Example or tf.train.SequenceExample.
+    output_path: Output path for the dataset.
+  """
+  with io.FileIO(output_path, 'wb') as fileio:
+    import riegeli  # pylint: disable=g-import-not-at-top
+    with riegeli.RecordWriter(fileio) as writer:
+      writer.write_messages(examples)

modeling/official/core/file_writers_test.py ADDED

	@@ -0,0 +1,53 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tests for file_writers."""
+import os
+from absl.testing import parameterized
+import tensorflow as tf, tf_keras
+from official.core import file_writers
+from official.core import tf_example_builder
+class FileWritersTest(tf.test.TestCase, parameterized.TestCase):
+  def setUp(self):
+    super().setUp()
+    example_builder = tf_example_builder.TfExampleBuilder()
+    example_builder.add_bytes_feature('foo', 'Hello World!')
+    self._example = example_builder.example
+  @parameterized.parameters('tfrecord', 'TFRecord', 'tfrecords',
+                            'tfrecord_compressed', 'TFRecord_Compressed',
+                            'tfrecords_gzip')
+  def test_write_small_dataset_success(self, file_type):
+    temp_dir = self.create_tempdir()
+    temp_dataset_file = os.path.join(temp_dir.full_path, 'train')
+    file_writers.write_small_dataset([self._example], temp_dataset_file,
+                                     file_type)
+    self.assertTrue(os.path.exists(temp_dataset_file))
+  def test_write_small_dataset_unrecognized_format(self):
+    file_type = 'bar'
+    temp_dir = self.create_tempdir()
+    temp_dataset_file = os.path.join(temp_dir.full_path, 'train')
+    with self.assertRaises(ValueError):
+      file_writers.write_small_dataset([self._example], temp_dataset_file,
+                                       file_type)
+if __name__ == '__main__':
+  tf.test.main()

modeling/official/core/input_reader.py ADDED

	@@ -0,0 +1,591 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""A common dataset reader."""
+import dataclasses
+import random
+from typing import Any, Callable, Dict, List, Optional, Sequence, Text, Union
+from absl import logging
+import tensorflow as tf, tf_keras
+import tensorflow_datasets as tfds
+from official.core import config_definitions as cfg
+def _get_random_integer():
+  return random.randint(0, (1 << 31) - 1)
+def _maybe_map_fn(dataset: tf.data.Dataset,
+                  fn: Optional[Callable[..., Any]] = None) -> tf.data.Dataset:
+  """Calls dataset.map if a valid function is passed in."""
+  return dataset if fn is None else dataset.map(
+      fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
+def match_files(input_path: Union[Sequence[str], str]) -> List[str]:
+  """Matches files from an input_path."""
+  matched_files = []
+  # Read dataset from files.
+  usage = ('`input_path` should be either (1) a str indicating a file '
+           'path/pattern, or (2) a str indicating multiple file '
+           'paths/patterns separated by commas (e.g. "a, b, c" or, without '
+           'spaces, "a,b,c"), or (3) a list of str, each of which is a file '
+           'path/pattern or multiple file paths/patterns separated by '
+           'commas, but got: %s')
+  if isinstance(input_path, str):
+    input_path_list = [input_path]
+  elif isinstance(input_path, (list, tuple)):
+    if any(not isinstance(x, str) for x in input_path):
+      raise ValueError(usage % input_path)
+    input_path_list = input_path
+  else:
+    raise ValueError(usage % input_path)
+  for input_path in input_path_list:
+    input_patterns = input_path.strip().split(',')
+    for input_pattern in input_patterns:
+      input_pattern = input_pattern.strip()
+      if not input_pattern:
+        continue
+      if '*' in input_pattern or '?' in input_pattern:
+        tmp_matched_files = tf.io.gfile.glob(input_pattern)
+        if not tmp_matched_files:
+          raise ValueError('%s does not match any files.' % input_pattern)
+        matched_files.extend(tmp_matched_files)
+      else:
+        matched_files.append(input_pattern)
+  if not matched_files:
+    raise ValueError('%s does not match any files.' % input_path)
+  return matched_files
+def _read_files_then_shard(matched_files: List[str],
+                           dataset_fn,
+                           input_context: Optional[
+                               tf.distribute.InputContext] = None,
+                           sharding: bool = False,
+                           repeat: bool = False) -> tf.data.Dataset:
+  """Sends all data files to every worker and then shard by data."""
+  dataset = dataset_fn(matched_files)
+  # When `input_file` is a path to a single file or the number of files is
+  # less than the number of input pipelines, disable auto sharding
+  # so that same input file is sent to all workers.
+  options = tf.data.Options()
+  options.experimental_distribute.auto_shard_policy = (
+      tf.data.experimental.AutoShardPolicy.OFF)
+  dataset = dataset.with_options(options)
+  # Do not enable sharding if tf.data service is enabled, as sharding will be
+  # handled inside tf.data service.
+  if sharding and input_context and (input_context.num_input_pipelines > 1):
+    dataset = dataset.shard(input_context.num_input_pipelines,
+                            input_context.input_pipeline_id)
+  if repeat:
+    dataset = dataset.repeat()
+  return dataset
+def _shard_files_then_read(matched_files: List[str],
+                           dataset_fn,
+                           input_context: Optional[
+                               tf.distribute.InputContext] = None,
+                           seed: Optional[Union[int, tf.Tensor]] = None,
+                           is_training: bool = False,
+                           sharding: bool = False,
+                           cache: bool = False,
+                           cycle_length: Optional[int] = None,
+                           block_length: Optional[int] = None,
+                           deterministic: bool = False) -> tf.data.Dataset:
+  """Shards the data files and then sends a split to every worker to read."""
+  dataset = tf.data.Dataset.from_tensor_slices(matched_files)
+  # Shuffle and repeat at file level.
+  # If cache is enabled, `reshuffle_each_iteration` is set to False,
+  # because we will read the same cached data in every iteration anyway.
+  if is_training:
+    # We need a seed to shuffle the files so that when each TPU worker gets
+    # its own shard the files do not overlap.
+    if sharding and seed is None:
+      seed = _get_random_integer()
+    dataset = dataset.shuffle(
+        len(matched_files),
+        seed=seed,
+        reshuffle_each_iteration=not cache)
+  # Do not enable sharding if tf.data service is enabled, as sharding will be
+  # handled inside tf.data service.
+  if sharding and input_context and (input_context.num_input_pipelines > 1):
+    dataset = dataset.shard(input_context.num_input_pipelines,
+                            input_context.input_pipeline_id)
+  # If cache is enabled, we will call `repeat()` later after `cache()`.
+  if is_training and not cache:
+    dataset = dataset.repeat()
+  dataset = dataset.interleave(
+      map_func=dataset_fn,
+      cycle_length=cycle_length,
+      block_length=block_length,
+      num_parallel_calls=(cycle_length
+                          if cycle_length else tf.data.experimental.AUTOTUNE),
+      deterministic=deterministic)
+  return dataset
+def _read_tfds(tfds_name: Text,
+               tfds_data_dir: Text,
+               tfds_split: Text,
+               tfds_skip_decoding_feature: Text,
+               tfds_as_supervised: bool,
+               input_context: Optional[tf.distribute.InputContext] = None,
+               seed: Optional[Union[int, tf.Tensor]] = None,
+               is_training: bool = False,
+               cache: bool = False,
+               cycle_length: Optional[int] = None,
+               block_length: Optional[int] = None) -> tf.data.Dataset:
+  """Reads a dataset from tfds."""
+  repeat_filenames = is_training and not cache
+  read_config = tfds.ReadConfig(
+      interleave_cycle_length=cycle_length,
+      interleave_block_length=block_length,
+      input_context=input_context,
+      shuffle_seed=seed,
+      repeat_filenames=repeat_filenames,
+      # Only assert cardinality when we have a finite dataset.
+      assert_cardinality=not repeat_filenames,
+      skip_prefetch=True)
+  decoders = {}
+  if tfds_skip_decoding_feature:
+    for skip_feature in tfds_skip_decoding_feature.split(','):
+      decoders[skip_feature.strip()] = tfds.decode.SkipDecoding()
+  if tfds_name.startswith('mldataset.'):
+    dataset = tfds.load(name=tfds_name,
+                        split=tfds_split,
+                        as_supervised=tfds_as_supervised,
+                        decoders=decoders if decoders else None,
+                        read_config=read_config)
+  else:
+    builder = tfds.builder(tfds_name, data_dir=tfds_data_dir)
+    if builder.info.splits:
+      num_shards = len(builder.info.splits[tfds_split].file_instructions)
+    else:
+      # The tfds mock path often does not provide splits.
+      num_shards = 1
+    load_kwargs = dict(
+        name=tfds_name, download=True, split=tfds_split,
+        shuffle_files=is_training, as_supervised=tfds_as_supervised,
+        decoders=decoders if decoders else None)
+    if tfds_data_dir:
+      load_kwargs.update({'data_dir': tfds_data_dir})
+    if input_context and num_shards < input_context.num_input_pipelines:
+      # The number of files in the dataset split is smaller than the number of
+      # input pipelines. We read the entire dataset first and then shard in the
+      # host memory.
+      read_config = dataclasses.replace(read_config, input_context=None)
+      load_kwargs.update({'read_config': read_config})
+      dataset = tfds.load(**load_kwargs)
+      dataset = dataset.shard(input_context.num_input_pipelines,
+                              input_context.input_pipeline_id)
+    else:
+      load_kwargs.update({'read_config': read_config})
+      dataset = tfds.load(**load_kwargs)
+  return dataset
+class InputReader:
+  """Input reader that returns a tf.data.Dataset instance."""
+  # A static random number which is the same across different InputReader
+  # instances.
+  static_randnum = _get_random_integer()
+  def __init__(
+      self,
+      params: cfg.DataConfig,
+      dataset_fn=tf.data.TFRecordDataset,
+      decoder_fn: Optional[Callable[..., Any]] = None,
+      combine_fn: Optional[Callable[..., Any]] = None,
+      sample_fn: Optional[Callable[..., Any]] = None,
+      parser_fn: Optional[Callable[..., Any]] = None,
+      filter_fn: Optional[Callable[..., tf.Tensor]] = None,
+      transform_and_batch_fn: Optional[
+          Callable[
+              [tf.data.Dataset, Optional[tf.distribute.InputContext]],
+              tf.data.Dataset,
+          ]
+      ] = None,
+      postprocess_fn: Optional[Callable[..., Any]] = None,
+  ):
+    """Initializes an InputReader instance.
+    Args:
+      params: A config_definitions.DataConfig object.
+      dataset_fn: A `callable` that takes the input files and returns a
+        `tf.data.Dataset`. For example, it can be `tf.data.TFRecordDataset`.
+      decoder_fn: An optional `callable` that takes the serialized data string
+        and decodes them into the raw tensor dictionary.
+      combine_fn: An optional `callable` that takes a dictionary of
+        `tf.data.Dataset` objects as input and outputs a combined dataset. It
+        will be executed after the decoder_fn and before the sample_fn.
+      sample_fn: An optional `callable` that takes a `tf.data.Dataset` object as
+        input and outputs the transformed dataset. It performs sampling on the
+        decoded raw tensors dict before the parser_fn.
+      parser_fn: An optional `callable` that takes the decoded raw tensors dict
+        and parse them into a dictionary of tensors that can be consumed by the
+        model. It will be executed after decoder_fn.
+      filter_fn: An optional `callable` mapping a dataset element to a boolean.
+        It will be executed after parser_fn.
+      transform_and_batch_fn: An optional `callable` that takes a
+        `tf.data.Dataset` object and an optional `tf.distribute.InputContext` as
+        input, and returns a `tf.data.Dataset` object. It will be executed after
+        `parser_fn` to transform and batch the dataset; if None, after
+        `parser_fn` is executed, the dataset will be batched into per-replica
+        batch size.
+      postprocess_fn: An optional `callable` that processes batched tensors. It
+        will be executed after batching.
+    """
+    if params.input_path and params.tfds_name:
+      raise ValueError('At most one of `input_path` and `tfds_name` can be '
+                       'specified, but got %s and %s.' %
+                       (params.input_path, params.tfds_name))
+    if (isinstance(params.input_path, cfg.base_config.Config) or
+        isinstance(params.tfds_name, cfg.base_config.Config)
+        ) and combine_fn is None:
+      raise ValueError(
+          'A combine_fn is required if `input_path` or `tfds_name` is a dict.')
+    self._tfds_name = params.tfds_name
+    self._tfds_data_dir = params.tfds_data_dir
+    self._matched_files = None
+    if not params.input_path:
+      # Read dataset from TFDS.
+      if not params.tfds_split:
+        raise ValueError(
+            '`tfds_name` is %s, but `tfds_split` is not specified.' %
+            params.tfds_name)
+    else:
+      self._matched_files = self.get_files(params.input_path)
+    self._global_batch_size = params.global_batch_size
+    self._is_training = params.is_training
+    self._drop_remainder = params.drop_remainder
+    self._shuffle_buffer_size = params.shuffle_buffer_size
+    self._cache = params.cache
+    self._cycle_length = params.cycle_length
+    self._block_length = params.block_length
+    self._deterministic = params.deterministic
+    self._sharding = params.sharding
+    self._tfds_split = params.tfds_split
+    self._tfds_as_supervised = params.tfds_as_supervised
+    self._tfds_skip_decoding_feature = params.tfds_skip_decoding_feature
+    self._dataset_fn = dataset_fn
+    self._decoder_fn = decoder_fn
+    self._combine_fn = combine_fn
+    self._sample_fn = sample_fn
+    self._parser_fn = parser_fn
+    self._transform_and_batch_fn = transform_and_batch_fn
+    self._postprocess_fn = postprocess_fn
+    self._filter_fn = filter_fn
+    self._seed = params.seed
+    self._prefetch_buffer_size = (
+        params.prefetch_buffer_size or tf.data.experimental.AUTOTUNE)
+    self._autotune_algorithm = params.autotune_algorithm
+    # When tf.data service is enabled, each data service worker should get
+    # different random seeds. Thus, we set `seed` to None.
+    # Sharding should also be disabled because tf.data service handles how
+    # each worker shards data with `processing_mode` in the distribute method.
+    if params.enable_tf_data_service:
+      self._seed = None
+      self._sharding = False
+    self._enable_tf_data_service = (
+        params.enable_tf_data_service and params.tf_data_service_address)
+    self._tf_data_service_address = params.tf_data_service_address
+    self._enable_shared_tf_data_service_between_parallel_trainers = (
+        params.enable_shared_tf_data_service_between_parallel_trainers)
+    self._apply_tf_data_service_before_batching = (
+        params.apply_tf_data_service_before_batching)
+    self._trainer_id = params.trainer_id
+    if self._enable_tf_data_service:
+      # Add a random seed as the tf.data service job name suffix, so tf.data
+      # service doesn't reuse the previous state if TPU worker gets preempted.
+      # The global batch size is also part of the tf.data service job name.
+      # When batch size is tuned with vizier while tf.data service is enabled,
+      # each vizier trial needs its own job name: once the batch size changes,
+      # the dataset is a different instance from the tf.data perspective.
+      # Otherwise, the model would read tensors from the wrong tf.data service
+      # job, which would cause a mismatch on the batch size dimension.
+      self._tf_data_service_job_name = (
+          f'{params.tf_data_service_job_name}_bs{params.global_batch_size}_'
+          f'{self.static_randnum}')
+      self._enable_round_robin_tf_data_service = params.get(
+          'enable_round_robin_tf_data_service', False)
+      if self._enable_shared_tf_data_service_between_parallel_trainers:
+        # When shared tf.data service is enabled, only a single tf.data service
+        # instance should be created and shared between parallel trainers. If
+        # the global batch size is different across trainers,
+        # params.apply_tf_data_service_before_batching should be set to true
+        # because tf.data service with different batch sizes will be considered
+        # separate tf.data service instances.
+        self._tf_data_service_job_name = (
+            f'{params.tf_data_service_job_name}_{self.static_randnum}')
+  def get_files(self, input_path):
+    """Gets matched files. Can be overridden by subclasses."""
+    if not input_path:
+      return None
+    # we want to combine / mix datasets
+    if isinstance(input_path, cfg.base_config.Config):
+      matched_files = {}
+      for k, v in input_path.as_dict().items():
+        matched_files[k] = match_files(v)
+    # single dataset
+    else:
+      matched_files = match_files(input_path)
+    return matched_files
+  def _read_data_source(
+      self,
+      matched_files: Union[Dict[str, List[str]], List[str]],
+      dataset_fn,
+      input_context: Optional[tf.distribute.InputContext] = None,
+  ):
+    """Reads the data source (files/tfds) to a dataset."""
+    def _files_to_dataset(files: List[str]) -> tf.data.Dataset:
+      if len(files) > 1:
+        if input_context and (len(files) < input_context.num_input_pipelines):
+          logging.warning(
+              (
+                  'The number of files %d is less than the number of input '
+                  'pipelines %d. We will send all input files to every worker. '
+                  'Please consider sharding your data into more files.'
+              ),
+              len(files),
+              input_context.num_input_pipelines,
+          )
+          return _read_files_then_shard(
+              files,
+              dataset_fn,
+              input_context,
+              sharding=self._sharding,
+              repeat=self._is_training and not self._cache)
+        else:
+          return _shard_files_then_read(
+              files,
+              dataset_fn,
+              input_context,
+              seed=self._seed,
+              is_training=self._is_training,
+              sharding=self._sharding,
+              cache=self._cache,
+              cycle_length=self._cycle_length,
+              block_length=self._block_length,
+              deterministic=self._deterministic)
+      elif len(files) == 1:
+        return _read_files_then_shard(
+            files,
+            dataset_fn,
+            input_context,
+            sharding=self._sharding,
+            repeat=self._is_training and not self._cache)
+      else:
+        raise ValueError('It is unexpected that `tfds_builder` is None and '
+                         'there is also no `files`.')
+    if self._tfds_name:
+      if isinstance(self._tfds_name, cfg.base_config.Config):
+        dataset = {}
+        for k, tfds_name in self._tfds_name.as_dict().items():
+          dataset[k] = _read_tfds(
+              tfds_name=tfds_name,
+              tfds_data_dir=self._tfds_data_dir,
+              tfds_split=self._tfds_split,
+              tfds_skip_decoding_feature=self._tfds_skip_decoding_feature,
+              tfds_as_supervised=self._tfds_as_supervised,
+              input_context=input_context,
+              seed=self._seed,
+              is_training=self._is_training,
+              cache=self._cache,
+              cycle_length=self._cycle_length,
+              block_length=self._block_length)
+      else:
+        dataset = _read_tfds(
+            tfds_name=self._tfds_name,
+            tfds_data_dir=self._tfds_data_dir,
+            tfds_split=self._tfds_split,
+            tfds_skip_decoding_feature=self._tfds_skip_decoding_feature,
+            tfds_as_supervised=self._tfds_as_supervised,
+            input_context=input_context,
+            seed=self._seed,
+            is_training=self._is_training,
+            cache=self._cache,
+            cycle_length=self._cycle_length,
+            block_length=self._block_length)
+    elif isinstance(matched_files, (list, tuple)):
+      dataset = _files_to_dataset(matched_files)
+    elif isinstance(matched_files, dict):
+      dataset = {}
+      for k, fs in matched_files.items():
+        dataset[k] = _files_to_dataset(fs)
+    else:
+      raise ValueError('`matched_files` should be a list or dict.')
+    return dataset
+  def _decode_and_parse_dataset(
+      self,
+      dataset: Union[tf.data.Dataset, Dict[Text, tf.data.Dataset]],
+      batch_size: int,
+      input_context: Optional[tf.distribute.InputContext] = None
+  ) -> tf.data.Dataset:
+    """Returns a tf.data.Dataset object after shuffling, decoding, and parsing."""
+    def _shuffle_and_decode(ds):
+      # If cache is enabled, we will call `shuffle()` later after `cache()`.
+      if self._is_training and not self._cache:
+        ds = ds.shuffle(self._shuffle_buffer_size, seed=self._seed)
+      # Decode
+      ds = _maybe_map_fn(ds, self._decoder_fn)
+      return ds
+    dataset = tf.nest.map_structure(_shuffle_and_decode, dataset)
+    if tf.nest.is_nested(dataset):
+      dataset = self._combine_fn(dataset)
+    if self._sample_fn is not None:
+      dataset = dataset.apply(self._sample_fn)
+    dataset = _maybe_map_fn(dataset, self._parser_fn)
+    if self._filter_fn is not None:
+      dataset = dataset.filter(self._filter_fn)
+    if self._cache:
+      dataset = dataset.cache()
+      if self._is_training:
+        dataset = dataset.repeat()
+        dataset = dataset.shuffle(self._shuffle_buffer_size, seed=self._seed)
+    # Applies tf.data service before batching operations. This is useful when
+    # tf.data service is shared between parallel trainers, and batch size is
+    # changing between parallel trainers. When batch size is changing, tf.data
+    # services will be considered different instances if applied after batching
+    # operations, which makes it difficult to share between parallel trainers.
+    # However, if there are additional expensive operations in
+    # self._transform_and_batch_fn and self._postprocess_fn, the entire tf.data
+    # pipeline could be slowed down. In this case, try to move these dataset
+    # operations into early stages if possible.
+    if (self._enable_shared_tf_data_service_between_parallel_trainers and
+        self._apply_tf_data_service_before_batching):
+      dataset = self._maybe_apply_data_service(dataset, input_context)
+    if self._transform_and_batch_fn is not None:
+      dataset = self._transform_and_batch_fn(dataset, input_context)
+    else:
+      per_replica_batch_size = input_context.get_per_replica_batch_size(
+          batch_size) if input_context else batch_size
+      dataset = dataset.batch(
+          per_replica_batch_size, drop_remainder=self._drop_remainder)
+    return dataset
+  def _maybe_apply_data_service(
+      self,
+      dataset: tf.data.Dataset,
+      input_context: Optional[tf.distribute.InputContext] = None
+  ) -> tf.data.Dataset:
+    """Potentially distributes a dataset."""
+    if self._enable_tf_data_service and input_context:
+      if self._enable_round_robin_tf_data_service:
+        replicas_per_input_pipeline = input_context.num_replicas_in_sync // (
+            input_context.num_input_pipelines)
+        base_consumer_index = input_context.input_pipeline_id * (
+            replicas_per_input_pipeline)
+        num_consumers = input_context.num_input_pipelines * (
+            replicas_per_input_pipeline)
+        range_dataset = tf.data.Dataset.range(replicas_per_input_pipeline)
+        tfds_kwargs = {
+            'processing_mode': 'parallel_epochs',
+            'service': self._tf_data_service_address,
+            'job_name': self._tf_data_service_job_name,
+            'num_consumers': num_consumers
+        }
+        if self._enable_shared_tf_data_service_between_parallel_trainers:
+          raise ValueError('Shared tf.data service does not support round-robin'
+                           ' tf.data service.')
+        dataset = range_dataset.map(lambda i: dataset.apply(  # pylint: disable=g-long-lambda
+            tf.data.experimental.service.distribute(
+                consumer_index=base_consumer_index + i, **tfds_kwargs)))
+        # Use parallel interleave to read multiple batches from a tf.data
+        # service worker in parallel.
+        dataset = dataset.interleave(
+            lambda x: x,
+            cycle_length=replicas_per_input_pipeline,
+            num_parallel_calls=replicas_per_input_pipeline,
+            deterministic=True)
+      else:
+        tfds_kwargs = {
+            'processing_mode': 'parallel_epochs',
+            'service': self._tf_data_service_address,
+            'job_name': self._tf_data_service_job_name,
+        }
+        if self._enable_shared_tf_data_service_between_parallel_trainers:
+          tfds_kwargs.update({
+              'processing_mode':
+                  tf.data.experimental.service.ShardingPolicy.OFF,
+              'cross_trainer_cache':
+                  tf.data.experimental.service.CrossTrainerCache(
+                      trainer_id=self._trainer_id)
+          })
+        dataset = dataset.apply(
+            tf.data.experimental.service.distribute(**tfds_kwargs))
+    return dataset
+  def read(self,
+           input_context: Optional[tf.distribute.InputContext] = None,
+           dataset: Optional[tf.data.Dataset] = None) -> tf.data.Dataset:
+    """Generates a tf.data.Dataset object."""
+    if dataset is None:
+      dataset = self._read_data_source(self._matched_files, self._dataset_fn,
+                                       input_context)
+    dataset = self._decode_and_parse_dataset(dataset, self._global_batch_size,
+                                             input_context)
+    dataset = _maybe_map_fn(dataset, self._postprocess_fn)
+    if not (self._enable_shared_tf_data_service_between_parallel_trainers and
+            self._apply_tf_data_service_before_batching):
+      dataset = self._maybe_apply_data_service(dataset, input_context)
+    if self._deterministic is not None:
+      options = tf.data.Options()
+      options.deterministic = self._deterministic
+      dataset = dataset.with_options(options)
+    if self._autotune_algorithm:
+      options = tf.data.Options()
+      options.autotune.autotune_algorithm = (
+          tf.data.experimental.AutotuneAlgorithm[self._autotune_algorithm])
+      dataset = dataset.with_options(options)
+    return dataset.prefetch(self._prefetch_buffer_size)
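
The choice between `_read_files_then_shard` and `_shard_files_then_read` in `_read_data_source` can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not the module's implementation: `shard` and `plan_for_worker` are hypothetical helpers, the file names are made up, list slicing stands in for `tf.data.Dataset.shard`, and training mode (file-level shuffling) is assumed.

```python
import random
from typing import List, Tuple


def shard(items: List[str], num_shards: int, index: int) -> List[str]:
    """Keeps items whose position % num_shards == index (stand-in for shard)."""
    return items[index::num_shards]


def plan_for_worker(files: List[str], num_pipelines: int, pipeline_id: int,
                    seed: int) -> Tuple[str, List[str]]:
    """Returns the strategy name and the files this worker would open."""
    if not files:
        raise ValueError("No input files.")
    if len(files) == 1 or len(files) < num_pipelines:
        # Too few files to split: every worker opens all of them and the
        # records (not the files) are sharded afterwards.
        return "read_files_then_shard", list(files)
    # Enough files: shuffle with a seed shared by all workers so that each
    # worker sees the same order, then take a disjoint slice per worker.
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    return "shard_files_then_read", shard(shuffled, num_pipelines, pipeline_id)


files = [f"train-{i:02d}.tfrecord" for i in range(8)]
plans = [plan_for_worker(files, 4, worker, seed=7) for worker in range(4)]
for worker, (strategy, worker_files) in enumerate(plans):
    print(worker, strategy, worker_files)
```

The shared seed is what keeps the per-worker file lists disjoint: if each worker shuffled with its own seed, the slices would be taken from different orderings and could overlap.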

modeling/official/core/registry.py ADDED

	@@ -0,0 +1,101 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Registry utility."""
+def register(registered_collection, reg_key):
+  """Register decorated function or class to collection.
+  Register decorated function or class into registered_collection, in a
+  hierarchical order. For example, when reg_key="my_model/my_exp/my_config_0"
+  the decorated function or class is stored under
+  registered_collection["my_model"]["my_exp"]["my_config_0"].
+  This decorator is supposed to be used together with the lookup() function in
+  this file.
+  Args:
+    registered_collection: a dictionary. The decorated function or class will be
+      put into this collection.
+    reg_key: The key for retrieving the registered function or class. If reg_key
+      is a string, it can be hierarchical like my_model/my_exp/my_config_0
+  Returns:
+    A decorator function
+  Raises:
+    KeyError: when function or class to register already exists.
+  """
+  def decorator(fn_or_cls):
+    """Put fn_or_cls in the dictionary."""
+    if isinstance(reg_key, str):
+      hierarchy = reg_key.split("/")
+      collection = registered_collection
+      for h_idx, entry_name in enumerate(hierarchy[:-1]):
+        if entry_name not in collection:
+          collection[entry_name] = {}
+        collection = collection[entry_name]
+        if not isinstance(collection, dict):
+          raise KeyError(
+              "Collection path {} at position {} already registered as "
+              "a function or class.".format(entry_name, h_idx))
+      leaf_reg_key = hierarchy[-1]
+    else:
+      collection = registered_collection
+      leaf_reg_key = reg_key
+    if leaf_reg_key in collection:
+      raise KeyError("Function or class {} registered multiple times.".format(
+          leaf_reg_key))
+    collection[leaf_reg_key] = fn_or_cls
+    return fn_or_cls
+  return decorator
+def lookup(registered_collection, reg_key):
+  """Lookup and return decorated function or class in the collection.
+  Lookup decorated function or class in registered_collection, in a
+  hierarchical order. For example, when
+  reg_key="my_model/my_exp/my_config_0",
+  this function will return
+  registered_collection["my_model"]["my_exp"]["my_config_0"].
+  Args:
+    registered_collection: a dictionary. The decorated function or class will be
+      retrieved from this collection.
+    reg_key: The key for retrieving the registered function or class. If reg_key
+      is a string, it can be hierarchical like my_model/my_exp/my_config_0
+  Returns:
+    The registered function or class.
+  Raises:
+    LookupError: when reg_key cannot be found.
+  """
+  if isinstance(reg_key, str):
+    hierarchy = reg_key.split("/")
+    collection = registered_collection
+    for h_idx, entry_name in enumerate(hierarchy):
+      if entry_name not in collection:
+        raise LookupError(
+            f"collection path {entry_name} at position {h_idx} is never "
+            f"registered. Please make sure the {entry_name} and its library is "
+            "imported and linked to the trainer binary.")
+      collection = collection[entry_name]
+    return collection
+  else:
+    if reg_key not in registered_collection:
+      raise LookupError(
+          f"registration key {reg_key} is never "
+          f"registered. Please make sure the {reg_key} and its library is "
+          "imported and linked to the trainer binary.")
+    return registered_collection[reg_key]
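
Reviewer note: the hierarchical key handling in `register()` and `lookup()` may be easier to follow with a small runnable sketch. The block below uses compact stand-ins that mirror the logic in the diff above so that it runs without the `official` package; the `configs` dictionary and `my_config_0` function are hypothetical names chosen for illustration, and the error messages are shortened.

```python
# Stand-ins mirroring registry.register()/registry.lookup() from the diff.
# They are a sketch for illustration, not the library code itself.
def register(registered_collection, reg_key):
  def decorator(fn_or_cls):
    collection = registered_collection
    leaf_reg_key = reg_key
    if isinstance(reg_key, str):
      *parents, leaf_reg_key = reg_key.split("/")
      for h_idx, entry_name in enumerate(parents):
        # Create intermediate dicts on demand, then descend into them.
        collection = collection.setdefault(entry_name, {})
        if not isinstance(collection, dict):
          raise KeyError(f"{entry_name} at position {h_idx} is a leaf.")
    if leaf_reg_key in collection:
      raise KeyError(f"{leaf_reg_key} registered multiple times.")
    collection[leaf_reg_key] = fn_or_cls
    return fn_or_cls
  return decorator


def lookup(registered_collection, reg_key):
  if not isinstance(reg_key, str):
    # Non-string keys (for example classes) are looked up flat.
    if reg_key not in registered_collection:
      raise LookupError(f"{reg_key} is never registered.")
    return registered_collection[reg_key]
  collection = registered_collection
  for h_idx, entry_name in enumerate(reg_key.split("/")):
    if entry_name not in collection:
      raise LookupError(f"{entry_name} at position {h_idx} is never registered.")
    collection = collection[entry_name]
  return collection


configs = {}  # hypothetical collection


@register(configs, "my_model/my_exp/my_config_0")
def my_config_0():
  return {"learning_rate": 0.1}


# The slash-separated key becomes a path through nested dictionaries.
assert configs == {"my_model": {"my_exp": {"my_config_0": my_config_0}}}
assert lookup(configs, "my_model/my_exp/my_config_0") is my_config_0
# A partial path returns the nested dictionary rather than a leaf.
assert lookup(configs, "my_model/my_exp") == {"my_config_0": my_config_0}
```

One consequence worth noting: because the decorator returns `fn_or_cls` unchanged, the decorated name stays usable as a normal function or class after registration.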

modeling/official/core/registry_test.py ADDED

	@@ -0,0 +1,88 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tests for registry."""
+import tensorflow as tf, tf_keras
+from official.core import registry
+class RegistryTest(tf.test.TestCase):
+  def test_register(self):
+    collection = {}
+    @registry.register(collection, 'functions/func_0')
+    def func_test():
+      pass
+    self.assertEqual(registry.lookup(collection, 'functions/func_0'), func_test)
+    @registry.register(collection, 'classes/cls_0')
+    class ClassRegistryKey:
+      pass
+    self.assertEqual(
+        registry.lookup(collection, 'classes/cls_0'), ClassRegistryKey)
+    @registry.register(collection, ClassRegistryKey)
+    class ClassRegistryValue:
+      pass
+    self.assertEqual(
+        registry.lookup(collection, ClassRegistryKey), ClassRegistryValue)
+  def test_register_hierarchy(self):
+    collection = {}
+    @registry.register(collection, 'functions/func_0')
+    def func_test0():
+      pass
+    @registry.register(collection, 'func_1')
+    def func_test1():
+      pass
+    @registry.register(collection, func_test1)
+    def func_test2():
+      pass
+    expected_collection = {
+        'functions': {
+            'func_0': func_test0,
+        },
+        'func_1': func_test1,
+        func_test1: func_test2,
+    }
+    self.assertEqual(collection, expected_collection)
+  def test_register_error(self):
+    collection = {}
+    @registry.register(collection, 'functions/func_0')
+    def func_test0():  # pylint: disable=unused-variable
+      pass
+    with self.assertRaises(KeyError):
+      @registry.register(collection, 'functions/func_0/sub_func')
+      def func_test1():  # pylint: disable=unused-variable
+        pass
+    with self.assertRaises(LookupError):
+      registry.lookup(collection, 'non-exist')
+if __name__ == '__main__':
+  tf.test.main()

modeling/official/core/savedmodel_checkpoint_manager.py ADDED

	@@ -0,0 +1,258 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Custom checkpoint manager that also exports saved models."""
+import os
+import re
+import time
+from typing import Callable, List, Mapping, Optional, Union
+from absl import logging
+import tensorflow as tf, tf_keras
+SAVED_MODULES_PATH_SUFFIX = 'saved_modules'
+def make_saved_modules_directory_name(checkpoint_name: str) -> str:
+  return f'{checkpoint_name}_{SAVED_MODULES_PATH_SUFFIX}'
+class SavedModelCheckpointManager(tf.train.CheckpointManager):
+  """A CheckpointManager that also exports `SavedModel`s."""
+  def __init__(self,
+               checkpoint: tf.train.Checkpoint,
+               directory: str,
+               max_to_keep: int,
+               modules_to_export: Optional[Mapping[str, tf.Module]] = None,
+               keep_checkpoint_every_n_hours: Optional[int] = None,
+               checkpoint_name: str = 'ckpt',
+               step_counter: Optional[tf.Variable] = None,
+               checkpoint_interval: Optional[int] = None,
+               init_fn: Optional[Callable[[], None]] = None):
+    """See base class."""
+    super().__init__(
+        checkpoint=checkpoint,
+        directory=directory,
+        max_to_keep=max_to_keep,
+        keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours,
+        checkpoint_name=checkpoint_name,
+        step_counter=step_counter,
+        checkpoint_interval=checkpoint_interval,
+        init_fn=init_fn)
+    self._modules_to_export = modules_to_export
+    self._savedmodels = self.get_existing_savedmodels()
+  def save(self,
+           checkpoint_number: Optional[int] = None,
+           check_interval: bool = True,
+           options: Optional[tf.train.CheckpointOptions] = None):
+    """See base class."""
+    checkpoint_path = super().save(
+        checkpoint_number=checkpoint_number,
+        check_interval=check_interval,
+        options=options)
+    if not checkpoint_path:  # Nothing got written.
+      return
+    if not self._modules_to_export:  # No modules to export.
+      logging.info('Skip saving SavedModel due to empty modules_to_export.')
+      return checkpoint_path
+    # Save the models for the checkpoint that just got written.
+    saved_modules_directory = make_saved_modules_directory_name(checkpoint_path)
+    # Atomic export of SavedModel. Write into a temporary directory and then
+    # rename it to the final directory once writing has finished.
+    # This avoids readers picking up an unfinished SavedModel.
+    saved_modules_directory_tmp = saved_modules_directory + '_temp'
+    for model_name, model in self._modules_to_export.items():
+      signatures = getattr(model, 'saved_model_signatures', None)
+      if signatures is not None:
+        tf.saved_model.save(
+            obj=model,
+            export_dir=os.path.join(saved_modules_directory_tmp, model_name),
+            signatures=signatures)
+    if tf.io.gfile.exists(saved_modules_directory_tmp):
+      tf.io.gfile.rename(saved_modules_directory_tmp, saved_modules_directory)
+    saved_modules_directories_to_keep = [
+        make_saved_modules_directory_name(ckpt) for ckpt in self.checkpoints
+    ]
+    existing_saved_modules_dirs = self.get_existing_savedmodels()
+    self._savedmodels = []
+    # Keep savedmodels in the same order as checkpoints (from oldest to newest).
+    for saved_modules_dir_to_keep in saved_modules_directories_to_keep:
+      if saved_modules_dir_to_keep in existing_saved_modules_dirs:
+        self._savedmodels.append(saved_modules_dir_to_keep)
+    for existing_saved_modules_dir in existing_saved_modules_dirs:
+      if existing_saved_modules_dir not in self._savedmodels:
+        tf.io.gfile.rmtree(existing_saved_modules_dir)
+    return checkpoint_path
+  def get_existing_savedmodels(self) -> List[str]:
+    """Gets a list of all existing SavedModel paths in `directory`.
+    Returns:
+      A list of all existing SavedModel paths.
+    """
+    saved_modules_glob = make_saved_modules_directory_name(
+        self._checkpoint_prefix + '-*')
+    savedmodels = tf.io.gfile.glob(saved_modules_glob)
+    # Filter out temporary savedmodel.
+    savedmodels = [
+        savedmodel
+        for savedmodel in savedmodels
+        if savedmodel.endswith(SAVED_MODULES_PATH_SUFFIX)
+    ]
+    return savedmodels
+  @property
+  def latest_savedmodel(self) -> Union[str, None]:
+    """The path of the most recent SavedModel in `directory`.
+    Returns:
+      The latest SavedModel path. If there are no SavedModels, returns `None`.
+    """
+    if self._savedmodels:
+      return self._savedmodels[-1]
+    return None
+  @property
+  def savedmodels(self) -> List[str]:
+    """A list of managed SavedModels.
+    Returns:
+      A list of SavedModel paths, sorted from oldest to newest.
+    """
+    return self._savedmodels
+  @property
+  def modules_to_export(self) -> Union[Mapping[str, tf.Module], None]:
+    return self._modules_to_export
+  def get_savedmodel_number_from_path(self,
+                                      savedmodel_path: str) -> Union[int, None]:
+    """Gets the savedmodel_number/checkpoint_number from savedmodel filepath.
+    The savedmodel_number is the global step when used with the Orbit controller.
+    Args:
+      savedmodel_path: savedmodel directory path.
+    Returns:
+      Savedmodel number or None if no matched pattern found in savedmodel path.
+    """
+    pattern = rf'\d+_{SAVED_MODULES_PATH_SUFFIX}$'
+    savedmodel_number = re.search(pattern, savedmodel_path)
+    if savedmodel_number:
+      savedmodel_number = savedmodel_number.group()
+      return int(savedmodel_number[:-len(SAVED_MODULES_PATH_SUFFIX) - 1])
+    return None
+  def savedmodels_iterator(self,
+                           min_interval_secs: float = 0,
+                           timeout: Optional[float] = None,
+                           timeout_fn: Optional[Callable[[], bool]] = None):
+    """Continuously yield new SavedModel files as they appear.
+    The iterator only checks for new savedmodels when control flow has been
+    reverted to it. The logic is the same as `tf.train.checkpoints_iterator`.
+    Args:
+      min_interval_secs: The minimum number of seconds between yielding
+        savedmodels.
+      timeout: The maximum number of seconds to wait between savedmodels. If
+        left as `None`, then the process will wait indefinitely.
+      timeout_fn: Optional function to call after a timeout.  If the function
+        returns True, then it means that no new savedmodels will be generated
+        and the iterator will exit.  The function is called with no arguments.
+    Yields:
+      String paths to latest SavedModel files as they arrive.
+    """
+    savedmodel_path = None
+    while True:
+      new_savedmodel_path = self.wait_for_new_savedmodel(
+          savedmodel_path, timeout=timeout)
+      if new_savedmodel_path is None:
+        if not timeout_fn:
+          # timed out
+          logging.info('Timed-out waiting for a savedmodel.')
+          return
+        if timeout_fn():
+          # The timeout_fn indicated that we are truly done.
+          return
+        else:
+          # The timeout_fn indicated that more savedmodels may come.
+          continue
+      start = time.time()
+      savedmodel_path = new_savedmodel_path
+      yield savedmodel_path
+      time_to_next_eval = start + min_interval_secs - time.time()
+      if time_to_next_eval > 0:
+        time.sleep(time_to_next_eval)
+  def wait_for_new_savedmodel(
+      self,
+      last_savedmodel: Optional[str] = None,
+      seconds_to_sleep: float = 1.0,
+      timeout: Optional[float] = None) -> Union[str, None]:
+    """Waits until a new savedmodel file is found.
+    Args:
+      last_savedmodel: The last savedmodel path used or `None` if we're
+        expecting a savedmodel for the first time.
+      seconds_to_sleep: The number of seconds to sleep for before looking for a
+        new savedmodel.
+      timeout: The maximum number of seconds to wait. If left as `None`, then
+        the process will wait indefinitely.
+    Returns:
+      A new savedmodel path, or None if the timeout was reached.
+    """
+    logging.info('Waiting for new savedmodel at %s', self._directory)
+    stop_time = time.time() + timeout if timeout is not None else None
+    last_savedmodel_number = -1
+    if last_savedmodel:
+      last_savedmodel_number = self.get_savedmodel_number_from_path(
+          last_savedmodel)
+    while True:
+      if stop_time is not None and time.time() + seconds_to_sleep > stop_time:
+        return None
+      existing_savedmodels = {}
+      for savedmodel_path in self.get_existing_savedmodels():
+        savedmodel_number = self.get_savedmodel_number_from_path(
+            savedmodel_path)
+        if savedmodel_number is not None:
+          existing_savedmodels[savedmodel_number] = savedmodel_path
+      # Find the first savedmodel with larger step number as next savedmodel.
+      savedmodel_path = None
+      existing_savedmodels = dict(sorted(existing_savedmodels.items()))
+      for savedmodel_number in existing_savedmodels:
+        if savedmodel_number > last_savedmodel_number:
+          savedmodel_path = existing_savedmodels[savedmodel_number]
+          break
+      if savedmodel_path:
+        logging.info('Found new savedmodel at %s', savedmodel_path)
+        return savedmodel_path
+      else:
+        time.sleep(seconds_to_sleep)
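
Reviewer note: the way `wait_for_new_savedmodel` picks the next export can be sketched without TensorFlow. The block below is a simplified illustration, assuming a plain list of paths instead of `tf.io.gfile.glob` results; the helper names `savedmodel_number` and `next_savedmodel` and the `/tmp/run/...` paths are hypothetical, and a regex capture group is used in place of the string slicing in the diff.

```python
import re
from typing import Iterable, Optional

SAVED_MODULES_PATH_SUFFIX = 'saved_modules'  # same constant as in the diff


def savedmodel_number(path: str) -> Optional[int]:
  """Extracts the trailing step number, or None if the path does not match."""
  match = re.search(rf'(\d+)_{SAVED_MODULES_PATH_SUFFIX}$', path)
  return int(match.group(1)) if match else None


def next_savedmodel(paths: Iterable[str],
                    last_number: int = -1) -> Optional[str]:
  """Returns the path with the smallest step number above `last_number`."""
  numbered = {}
  for path in paths:
    number = savedmodel_number(path)
    if number is not None:
      numbered[number] = path
  for number in sorted(numbered):
    if number > last_number:
      return numbered[number]
  return None


paths = [
    '/tmp/run/ckpt-30_saved_modules',
    '/tmp/run/ckpt-10_saved_modules',
    # Temporary export directories do not end with the suffix, so they are
    # skipped, which is what makes the rename-based export safe to poll.
    '/tmp/run/ckpt-20_saved_modules_temp',
]

assert savedmodel_number('/tmp/run/ckpt-20_saved_modules_temp') is None
assert next_savedmodel(paths) == '/tmp/run/ckpt-10_saved_modules'
assert next_savedmodel(paths, last_number=10) == '/tmp/run/ckpt-30_saved_modules'
assert next_savedmodel(paths, last_number=30) is None
```

Selecting the smallest larger step, rather than the newest one, means an evaluator that falls behind still visits each surviving export in order.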

modeling/official/core/savedmodel_checkpoint_manager_test.py ADDED

	@@ -0,0 +1,125 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+import time
+from typing import Iterable
+import tensorflow as tf, tf_keras
+from official.core import savedmodel_checkpoint_manager
+def _models_exist(checkpoint_path: str, models: Iterable[str]) -> bool:
+  for model_name in models:
+    if not tf.io.gfile.isdir(
+        os.path.join(
+            savedmodel_checkpoint_manager.make_saved_modules_directory_name(
+                checkpoint_path), model_name)):
+      return False
+  return True
+class _ModelForTest(tf_keras.Model):
+  def __init__(self, hidden_size: int = 8):
+    super().__init__()
+    self.dense = tf_keras.layers.Dense(hidden_size)
+  @tf.function(input_signature=[tf.TensorSpec([None, 16])])
+  def call(self, inputs):
+    return self.dense(inputs)
+  @property
+  def saved_model_signatures(self):
+    # Build SavedModel signatures.
+    return dict(serving_default=self.call)
+class CheckpointManagerTest(tf.test.TestCase):
+  def _create_manager(self, max_to_keep: int = 1) -> tf.train.CheckpointManager:
+    """Sets up SavedModelCheckpointManager object.
+    Args:
+      max_to_keep: max number of savedmodels to keep.
+    Returns:
+      created savedmodel manager.
+    """
+    models = {
+        'model_1': _ModelForTest(12),
+        'model_2': _ModelForTest(14),
+    }
+    checkpoint = tf.train.Checkpoint()
+    manager = savedmodel_checkpoint_manager.SavedModelCheckpointManager(
+        checkpoint=checkpoint,
+        directory=self.get_temp_dir(),
+        max_to_keep=max_to_keep,
+        modules_to_export=models)
+    return manager
+  def test_max_to_keep(self):
+    manager = self._create_manager()
+    models = manager.modules_to_export
+    first_path = manager.save()
+    second_path = manager.save()
+    savedmodel = savedmodel_checkpoint_manager.make_saved_modules_directory_name(
+        manager.latest_checkpoint)
+    self.assertEqual(savedmodel, manager.latest_savedmodel)
+    self.assertTrue(_models_exist(second_path, models.keys()))
+    self.assertFalse(_models_exist(first_path, models.keys()))
+  def test_returns_none_after_timeout(self):
+    manager = self._create_manager()
+    start = time.time()
+    ret = manager.wait_for_new_savedmodel(
+        None, timeout=1.0, seconds_to_sleep=0.5)
+    end = time.time()
+    self.assertIsNone(ret)
+    # We've waited 0.5 second.
+    self.assertGreater(end, start + 0.5)
+    # The timeout kicked in.
+    self.assertLess(end, start + 0.6)
+  def test_saved_model_iterator(self):
+    manager = self._create_manager(max_to_keep=2)
+    self.assertIsNotNone(manager.save(checkpoint_number=1))
+    self.assertIsNotNone(manager.save(checkpoint_number=2))
+    self.assertIsNotNone(manager.save(checkpoint_number=3))
+    # Savedmodels are in time order.
+    expected_savedmodels = manager.savedmodels
+    # Order not guaranteed.
+    existing_savedmodels = manager.get_existing_savedmodels()
+    savedmodels = list(manager.savedmodels_iterator(timeout=3.0))
+    self.assertEqual(savedmodels, expected_savedmodels)
+    self.assertEqual(set(savedmodels), set(existing_savedmodels))
+  def test_saved_model_iterator_timeout_fn(self):
+    manager = self._create_manager()
+    timeout_fn_calls = [0]
+    def timeout_fn():
+      timeout_fn_calls[0] += 1
+      return timeout_fn_calls[0] > 3
+    results = list(
+        manager.savedmodels_iterator(timeout=0.1, timeout_fn=timeout_fn))
+    self.assertEqual([], results)
+    self.assertEqual(4, timeout_fn_calls[0])
+if __name__ == '__main__':
+  tf.test.main()

modeling/official/core/task_factory.py ADDED

	@@ -0,0 +1,70 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""A global factory to register and access all registered tasks."""
+from official.core import registry
+_REGISTERED_TASK_CLS = {}
+# TODO(b/158741360): Add type annotations once pytype checks across modules.
+def register_task_cls(task_config_cls):
+  """Decorates a factory of Tasks for lookup by a subclass of TaskConfig.
+  This decorator supports registration of tasks as follows:
+  ```
+  @dataclasses.dataclass
+  class MyTaskConfig(TaskConfig):
+    # Add fields here.
+    pass
+  @register_task_cls(MyTaskConfig)
+  class MyTask(Task):
+    # Inherits def __init__(self, task_config).
+    pass
+  my_task_config = MyTaskConfig()
+  my_task = get_task(my_task_config)  # Returns MyTask(my_task_config).
+  ```
+  Besides a class itself, other callables that create a Task from a TaskConfig
+  can be decorated by the result of this function, as long as there is at most
+  one registration for each config class.
+  Args:
+    task_config_cls: a subclass of TaskConfig (*not* an instance of TaskConfig).
+      Each task_config_cls can only be used for a single registration.
+  Returns:
+    A callable for use as class decorator that registers the decorated class
+    for creation from an instance of task_config_cls.
+  """
+  return registry.register(_REGISTERED_TASK_CLS, task_config_cls)
+def get_task(task_config, **kwargs):
+  """Creates a Task (of suitable subclass type) from task_config."""
+  # TODO(hongkuny): deprecate the task factory to use config.BUILDER.
+  if task_config.BUILDER is not None:
+    return task_config.BUILDER(task_config, **kwargs)
+  return get_task_cls(task_config.__class__)(task_config, **kwargs)
+# The user-visible get_task() is defined after classes have been registered.
+# TODO(b/158741360): Add type annotations once pytype checks across modules.
+def get_task_cls(task_config_cls):
+  task_cls = registry.lookup(_REGISTERED_TASK_CLS, task_config_cls)
+  return task_cls
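
Reviewer note: the docstring example in `register_task_cls` depends on `TaskConfig` and `Task`, which live elsewhere in the package. The block below is a self-contained sketch of the same flow, assuming minimal stand-ins for both classes and a flat dictionary in place of `registry.register`; `MyTaskConfig`, `MyTask`, `SubConfig`, and the `num_classes` and `logging_dir` names are hypothetical.

```python
import dataclasses

_REGISTERED_TASK_CLS = {}


@dataclasses.dataclass
class TaskConfig:
  """Stand-in for the real TaskConfig (assumption: BUILDER defaults to None)."""
  BUILDER = None  # plain class attribute, not a dataclass field


class Task:
  """Stand-in for the real Task base class."""

  def __init__(self, task_config, **kwargs):
    self.task_config = task_config
    self.kwargs = kwargs


def register_task_cls(task_config_cls):
  def decorator(task_cls):
    if task_config_cls in _REGISTERED_TASK_CLS:
      raise KeyError(f'{task_config_cls} registered multiple times.')
    _REGISTERED_TASK_CLS[task_config_cls] = task_cls
    return task_cls
  return decorator


def get_task(task_config, **kwargs):
  # A config-provided builder takes precedence over the registry.
  if task_config.BUILDER is not None:
    return task_config.BUILDER(task_config, **kwargs)
  config_cls = task_config.__class__
  if config_cls not in _REGISTERED_TASK_CLS:
    raise LookupError(f'{config_cls} is never registered.')
  return _REGISTERED_TASK_CLS[config_cls](task_config, **kwargs)


@dataclasses.dataclass
class MyTaskConfig(TaskConfig):
  num_classes: int = 10


@register_task_cls(MyTaskConfig)
class MyTask(Task):
  pass


my_task = get_task(MyTaskConfig(num_classes=3), logging_dir='/tmp/logs')
assert isinstance(my_task, MyTask)
assert my_task.task_config.num_classes == 3
assert my_task.kwargs == {'logging_dir': '/tmp/logs'}
```

Because the lookup key is the exact config class, a subclass such as a hypothetical `class SubConfig(MyTaskConfig)` would not resolve to `MyTask` unless it is registered separately.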

modeling/official/core/test_utils.py ADDED

	@@ -0,0 +1,59 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Utils for testing."""
+import tensorflow as tf, tf_keras
+class FakeKerasModel(tf_keras.Model):
+  """Fake keras model for testing."""
+  def __init__(self):
+    super().__init__()
+    self.dense = tf_keras.layers.Dense(4, activation=tf.nn.relu)
+    self.dense2 = tf_keras.layers.Dense(4, activation=tf.nn.relu)
+  def call(self, inputs):  # pytype: disable=signature-mismatch  # overriding-parameter-count-checks
+    return self.dense2(self.dense(inputs))
+class _Dense(tf.Module):
+  """A dense layer."""
+  def __init__(self, input_dim, output_size, name=None):
+    super().__init__(name=name)
+    with self.name_scope:
+      self.w = tf.Variable(
+          tf.random.normal([input_dim, output_size]), name='w')
+      self.b = tf.Variable(tf.zeros([output_size]), name='b')
+  @tf.Module.with_name_scope
+  def __call__(self, x):
+    y = tf.matmul(x, self.w) + self.b
+    return tf.nn.relu(y)
+class FakeModule(tf.Module):
+  """Fake model using tf.Module for testing."""
+  def __init__(self, input_size, name=None):
+    super().__init__(name=name)
+    with self.name_scope:
+      self.dense = _Dense(input_size, 4, name='dense')
+      self.dense2 = _Dense(4, 4, name='dense_1')
+  @tf.Module.with_name_scope
+  def __call__(self, x):
+    return self.dense2(self.dense(x))

modeling/official/core/tf_example_builder.py ADDED

	@@ -0,0 +1,144 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Builder class for preparing tf.train.Example."""
+# https://www.python.org/dev/peps/pep-0563/#enabling-the-future-behavior-in-python-3-7
+from __future__ import annotations
+from typing import Mapping, Sequence, Union
+import numpy as np
+import tensorflow as tf, tf_keras
+BytesValueType = Union[bytes, Sequence[bytes], str, Sequence[str]]
+_to_array = lambda v: [v] if not isinstance(v, (list, np.ndarray)) else v
+_to_bytes = lambda v: v.encode() if isinstance(v, str) else v
+_to_bytes_array = lambda v: list(map(_to_bytes, _to_array(v)))
+class TfExampleBuilder(object):
+  """Builder class for preparing tf.train.Example.
+  Read API doc at https://www.tensorflow.org/api_docs/python/tf/train/Example.
+  Example usage:
+    >>> example_builder = TfExampleBuilder()
+    >>> example = (
+            example_builder.add_bytes_feature('feature_a', 'foobarbaz')
+            .add_ints_feature('feature_b', [1, 2, 3])
+            .example)
+  """
+  def __init__(self) -> None:
+    self._example = tf.train.Example()
+  @property
+  def example(self) -> tf.train.Example:
+    """Returns the generated tf.train.Example proto (not a copy)."""
+    return self._example
+  @property
+  def serialized_example(self) -> bytes:
+    """Returns the serialized bytes of the generated tf.train.Example proto."""
+    return self._example.SerializeToString()
+  def set(self, example: tf.train.Example) -> TfExampleBuilder:
+    """Sets the example."""
+    self._example = example
+    return self
+  def reset(self) -> TfExampleBuilder:
+    """Resets the example to an empty proto."""
+    self._example = tf.train.Example()
+    return self
+  ###### Basic APIs for primitive data types ######
+  def add_feature_dict(
+      self, feature_dict: Mapping[str, tf.train.Feature]) -> TfExampleBuilder:
+    """Adds the predefined `feature_dict` to the example.
+    Note: Prefer the feature-type-specific methods where possible.
+    Args:
+      feature_dict: A dictionary from tf.Example feature key to
+        tf.train.Feature.
+    Returns:
+      The builder object for subsequent method calls.
+    """
+    for k, v in feature_dict.items():
+      self._example.features.feature[k].CopyFrom(v)
+    return self
+  def add_feature(self, key: str,
+                  feature: tf.train.Feature) -> TfExampleBuilder:
+    """Adds predefined `feature` with `key` to the example.
+    Args:
+      key: String key of the feature.
+      feature: The feature to be added to the example.
+    Returns:
+      The builder object for subsequent method calls.
+    """
+    self._example.features.feature[key].CopyFrom(feature)
+    return self
+  def add_bytes_feature(self, key: str,
+                        value: BytesValueType) -> TfExampleBuilder:
+    """Adds byte(s) or string(s) with `key` to the example.
+    Args:
+      key: String key of the feature.
+      value: The byte(s) or string(s) to be added to the example.
+    Returns:
+      The builder object for subsequent method calls.
+    """
+    return self.add_feature(
+        key,
+        tf.train.Feature(
+            bytes_list=tf.train.BytesList(value=_to_bytes_array(value))))
+  def add_ints_feature(self, key: str,
+                       value: Union[int, Sequence[int]]) -> TfExampleBuilder:
+    """Adds integer(s) with `key` to the example.
+    Args:
+      key: String key of the feature.
+      value: The integer(s) to be added to the example.
+    Returns:
+      The builder object for subsequent method calls.
+    """
+    return self.add_feature(
+        key,
+        tf.train.Feature(int64_list=tf.train.Int64List(value=_to_array(value))))
+  def add_floats_feature(
+      self, key: str, value: Union[float, Sequence[float]]) -> TfExampleBuilder:
+    """Adds float(s) with `key` to the example.
+    Args:
+      key: String key of the feature.
+      value: The float(s) to be added to the example.
+    Returns:
+      The builder object for subsequent method calls.
+    """
+    return self.add_feature(
+        key,
+        tf.train.Feature(float_list=tf.train.FloatList(value=_to_array(value))))
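
Reviewer note: the `_to_array`, `_to_bytes`, and `_to_bytes_array` helpers decide how scalar and sequence inputs reach the proto constructors. The block below is a simplified sketch of that normalization, assuming only built-in types; the `np.ndarray` branch from the diff is deliberately left out so the example runs without NumPy, and the unprefixed function names are hypothetical.

```python
def to_array(v):
  # The original also passes np.ndarray through unchanged; omitted here.
  return v if isinstance(v, list) else [v]


def to_bytes(v):
  # str values are UTF-8 encoded; bytes pass through untouched.
  return v.encode() if isinstance(v, str) else v


def to_bytes_array(v):
  return [to_bytes(item) for item in to_array(v)]


# A single value is wrapped into a one-element list.
assert to_array(123) == [123]
assert to_bytes_array('foobarbaz') == [b'foobarbaz']
# Lists are kept as-is, and str and bytes items may be mixed.
assert to_bytes_array(['Hello World!', b'Good Morning!']) == [
    b'Hello World!', b'Good Morning!']
# Tuples are not in the isinstance check, so a tuple counts as ONE value.
assert to_array((1, 2, 3)) == [(1, 2, 3)]
```

The last assertion suggests a caveat: although the type hints accept any `Sequence`, passing a tuple would be wrapped rather than expanded, so callers may want to convert tuples to lists first.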

modeling/official/core/tf_example_builder_test.py ADDED

	@@ -0,0 +1,165 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tests for tf_example_builder.
+See `test_add_image_matrix_feature_with_fake_image` for the typical structure of
+a unit test.
+"""
+from absl.testing import parameterized
+import tensorflow as tf, tf_keras
+from official.core import tf_example_builder
+class TfExampleBuilderTest(tf.test.TestCase, parameterized.TestCase):
+  def test_init_an_empty_example(self):
+    example_builder = tf_example_builder.TfExampleBuilder()
+    example = example_builder.example
+    self.assertProtoEquals('', example)
+  def test_init_an_empty_serialized_example(self):
+    example_builder = tf_example_builder.TfExampleBuilder()
+    example = example_builder.serialized_example
+    self.assertProtoEquals('', example)
+  def test_add_feature(self):
+    example_builder = tf_example_builder.TfExampleBuilder()
+    example_builder.add_feature(
+        'foo',
+        tf.train.Feature(
+            bytes_list=tf.train.BytesList(value=[b'Hello World!'])))
+    example = example_builder.example
+    # Use proto text to show what the entire proto would look like.
+    self.assertProtoEquals(
+        """
+        features: {
+          feature: {
+            key: "foo"
+            value: {
+              bytes_list: {
+                value: "Hello World!"
+              }
+            }
+          }
+        }""", example)
+  def test_add_feature_dict(self):
+    example_builder = tf_example_builder.TfExampleBuilder()
+    example_builder.add_feature_dict({
+        'foo':
+            tf.train.Feature(
+                bytes_list=tf.train.BytesList(value=[b'Hello World!'])),
+        'bar':
+            tf.train.Feature(
+                int64_list=tf.train.Int64List(value=[299, 792, 458]))
+    })
+    example = example_builder.example
+    # Use proto text to show what the entire proto would look like.
+    self.assertProtoEquals(
+        """
+        features: {
+          feature: {
+            key: "foo"
+            value: {
+              bytes_list: {
+                value: "Hello World!"
+              }
+            }
+          }
+          feature: {
+            key: "bar"
+            value: {
+              int64_list: {
+                value: 299
+                value: 792
+                value: 458
+              }
+            }
+          }
+        }""", example)
+  @parameterized.named_parameters(
+      ('single_bytes', b'Hello World!', b'Hello World!'),
+      ('single_string', 'Hello World!', b'Hello World!'))
+  def test_add_single_byte_feature(self, value, expected_value):
+    example_builder = tf_example_builder.TfExampleBuilder()
+    example_builder.add_bytes_feature('foo', value)
+    example = example_builder.example
+    # Use constructor to easily work with test parameters.
+    self.assertProtoEquals(
+        tf.train.Example(
+            features=tf.train.Features(
+                feature={
+                    'foo':
+                        tf.train.Feature(
+                            bytes_list=tf.train.BytesList(
+                                value=[expected_value]))
+                })), example)
+  @parameterized.named_parameters(
+      ('multiple_bytes', [b'Hello World!', b'Good Morning!'
+                         ], [b'Hello World!', b'Good Morning!']),
+      ('multiple_string', ['Hello World!', 'Good Morning!'
+                         ], [b'Hello World!', b'Good Morning!']))
+  def test_add_multiple_bytes_feature(self, values, expected_values):
+    example_builder = tf_example_builder.TfExampleBuilder()
+    example_builder.add_bytes_feature('foo', values)
+    example = example_builder.example
+    self.assertProtoEquals(
+        tf.train.Example(
+            features=tf.train.Features(
+                feature={
+                    'foo':
+                        tf.train.Feature(
+                            bytes_list=tf.train.BytesList(
+                                value=expected_values))
+                })), example)
+  @parameterized.named_parameters(
+      ('single_integer', 123, [123]),
+      ('multiple_integers', [123, 456, 789], [123, 456, 789]))
+  def test_add_ints_feature(self, value, expected_value):
+    example_builder = tf_example_builder.TfExampleBuilder()
+    example_builder.add_ints_feature('bar', value)
+    example = example_builder.example
+    self.assertProtoEquals(
+        tf.train.Example(
+            features=tf.train.Features(
+                feature={
+                    'bar':
+                        tf.train.Feature(
+                            int64_list=tf.train.Int64List(value=expected_value))
+                })), example)
+  @parameterized.named_parameters(
+      ('single_float', 3.14, [3.14]),
+      ('multiple_floats', [3.14, 1.57, 6.28], [3.14, 1.57, 6.28]))
+  def test_add_floats_feature(self, value, expected_value):
+    example_builder = tf_example_builder.TfExampleBuilder()
+    example_builder.add_floats_feature('baz', value)
+    example = example_builder.example
+    self.assertProtoEquals(
+        tf.train.Example(
+            features=tf.train.Features(
+                feature={
+                    'baz':
+                        tf.train.Feature(
+                            float_list=tf.train.FloatList(value=expected_value))
+                })), example)
+if __name__ == '__main__':
+  tf.test.main()
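
The parameterized cases above imply that `add_bytes_feature` accepts either a single value or a list, and either `str` or `bytes`. The builder's implementation is not part of this excerpt, so the following is only a minimal sketch of that normalization in plain Python. The helper name `normalize_bytes_values` and the UTF-8 encoding are assumptions made for illustration, not the library's API.

```python
from typing import List, Sequence, Union

BytesLike = Union[bytes, str]


def normalize_bytes_values(
    value: Union[BytesLike, Sequence[BytesLike]]) -> List[bytes]:
  """Returns `value` as a list of bytes (hypothetical helper)."""
  # A lone str/bytes must be wrapped first; otherwise it would be iterated
  # element by element like any other sequence.
  if isinstance(value, (bytes, str)):
    value = [value]
  # Assumption: text values are encoded as UTF-8.
  return [v.encode('utf-8') if isinstance(v, str) else v for v in value]


print(normalize_bytes_values('Hello World!'))
print(normalize_bytes_values([b'Hello World!', 'Good Morning!']))
```

Checking for `str`/`bytes` before treating the input as a sequence is the key design point, since both types are themselves iterable.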

modeling/official/core/tf_example_feature_key.py ADDED

	@@ -0,0 +1,62 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Data classes for tf.Example proto feature keys.
+Feature keys are grouped by feature types. Key names follow conventions in
+go/tf-example.
+"""
+import dataclasses
+import functools
+from typing import Optional
+# Disable init function to use the one defined in base class.
+dataclass = functools.partial(dataclasses.dataclass(init=False))
+@dataclass
+class TfExampleFeatureKeyBase:
+  """Base dataclass for defining tf.Example proto feature keys.
+  This class defines the logic of adding prefix to feature keys. Subclasses
+  will define feature keys for a specific feature type in data fields.
+  NOTE: Please follow subclass examples in this module to define feature keys
+  for a new feature type.
+  """
+  def __init__(self, prefix: Optional[str] = None):
+    """Instantiates the feature key class.
+    Adds a string prefix to all fields of a feature key instance if `prefix` is
+    neither None nor empty.
+    Example usage:
+    >>> test_key = EncodedImageFeatureKey()
+    >>> test_key.encoded
+    image/encoded
+    >>> test_key = EncodedImageFeatureKey('prefix')
+    >>> test_key.encoded
+    prefix/image/encoded
+    Args:
+      prefix: A prefix string that will be added before the feature key string
+        with a trailing slash '/'.
+    """
+    if prefix:
+      for field in dataclasses.fields(self):  # pytype: disable=wrong-arg-types  # re-none
+        key_name = field.name
+        key_value = getattr(self, key_name)
+        setattr(self, key_name, f'{prefix}/{key_value}')

modeling/official/core/tf_example_feature_key_test.py ADDED

	@@ -0,0 +1,49 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tests for tf_example_feature_key."""
+import dataclasses
+import inspect
+from absl.testing import absltest
+from absl.testing import parameterized
+from official.core import tf_example_feature_key
+@tf_example_feature_key.dataclass
+class TestFeatureKey(tf_example_feature_key.TfExampleFeatureKeyBase):
+  test: str = 'foo/bar'
+class TfExampleFeatureKeyTest(parameterized.TestCase):
+  def test_add_prefix_success(self):
+    test_key = TestFeatureKey('prefix')
+    self.assertEqual(test_key.test, 'prefix/foo/bar')
+  @parameterized.parameters(None, '')
+  def test_add_prefix_skip_success(self, prefix):
+    test_key = TestFeatureKey(prefix)
+    self.assertEqual(test_key.test, 'foo/bar')
+  def test_all_feature_key_classes_are_valid(self):
+    for _, obj in inspect.getmembers(tf_example_feature_key):
+      if inspect.isclass(obj):
+        self.assertTrue(dataclasses.is_dataclass(obj))
+        self.assertTrue(
+            issubclass(obj, tf_example_feature_key.TfExampleFeatureKeyBase))
+if __name__ == '__main__':
+  absltest.main()

modeling/official/core/train_lib.py ADDED

	@@ -0,0 +1,372 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""TFM common training driver library."""
+# pytype: disable=attribute-error
+import os
+import tempfile
+from typing import Any, List, Mapping, Optional, Tuple
+# Import libraries
+from absl import logging
+import orbit
+import tensorflow as tf, tf_keras
+from official.core import actions
+from official.core import base_task
+from official.core import base_trainer
+from official.core import config_definitions
+from official.core import train_utils
+maybe_create_best_ckpt_exporter = train_utils.maybe_create_best_ckpt_exporter
+class OrbitExperimentRunner:
+  """Runs experiment with Orbit training loop.
+  The default experiment runner for model garden experiments. User can
+  customize the experiment pipeline by subclassing this class and replacing
+  components or functions.
+  For example, an experiment runner with customized checkpoint manager:
+  ```python
+  class MyExpRunnerWithExporter(OrbitExperimentRunner):
+    def _maybe_build_checkpoint_manager(self):
+      # Replaces the default CheckpointManager with a customized one.
+      return MyCheckpointManager(*args)
+  # In user code, instead of the original
+  # `OrbitExperimentRunner(..).run()`, now user can do:
+  MyExpRunnerWithExporter(**needed_kwargs).run()
+  ```
+  Similar override can be done to other components.
+  """
+  def __init__(
+      self,
+      distribution_strategy: tf.distribute.Strategy,
+      task: base_task.Task,
+      mode: str,
+      params: config_definitions.ExperimentConfig,
+      model_dir: str,
+      run_post_eval: bool = False,
+      save_summary: bool = True,
+      train_actions: Optional[List[orbit.Action]] = None,
+      eval_actions: Optional[List[orbit.Action]] = None,
+      trainer: Optional[base_trainer.Trainer] = None,
+      controller_cls=orbit.Controller,
+      summary_manager: Optional[orbit.utils.SummaryManager] = None,
+      eval_summary_manager: Optional[orbit.utils.SummaryManager] = None,
+      enable_async_checkpointing: bool = False,
+  ):
+    """Constructor.
+    Args:
+      distribution_strategy: A distribution strategy.
+      task: A Task instance.
+      mode: A 'str', specifying the mode. Can be 'train', 'eval',
+        'train_and_eval', 'train_and_post_eval' or 'continuous_eval'.
+      params: ExperimentConfig instance.
+      model_dir: A 'str', a path to store model checkpoints and summaries.
+      run_post_eval: Whether to run post eval once after training, metrics logs
+        are returned.
+      save_summary: Whether to save train and validation summary.
+      train_actions: Optional list of Orbit train actions.
+      eval_actions: Optional list of Orbit eval actions.
+      trainer: the base_trainer.Trainer instance. It should be created within
+        the strategy.scope().
+      controller_cls: The controller class to manage the train and eval process.
+        Must be an orbit.Controller subclass.
+      summary_manager: Instance of the summary manager to override default
+        summary manager.
+      eval_summary_manager: Instance of the eval summary manager to override
+        default eval summary manager.
+      enable_async_checkpointing: Optional boolean indicating whether to enable
+        async checkpoint saving.
+    """
+    self.strategy = distribution_strategy or tf.distribute.get_strategy()
+    self._params = params
+    self._model_dir = model_dir
+    self._mode = mode
+    self._run_post_eval = run_post_eval
+    self._trainer = trainer or self._build_trainer(
+        task,
+        train='train' in mode,
+        evaluate=('eval' in mode) or run_post_eval)
+    assert self.trainer is not None
+    self._checkpoint_manager = self._maybe_build_checkpoint_manager()
+    self._summary_manager = summary_manager
+    self._eval_summary_manager = eval_summary_manager
+    self._controller = self._build_controller(
+        trainer=self.trainer if 'train' in mode else None,
+        evaluator=self.trainer,
+        save_summary=save_summary,
+        train_actions=train_actions,
+        eval_actions=eval_actions,
+        controller_cls=controller_cls,
+        enable_async_checkpointing=enable_async_checkpointing)
+  @property
+  def params(self) -> config_definitions.ExperimentConfig:
+    """The whole experiment parameters object."""
+    return self._params
+  @property
+  def model_dir(self) -> str:
+    """Path to the model folder, which stores checkpoints, params, log, etc."""
+    return self._model_dir
+  @property
+  def trainer(self) -> base_trainer.Trainer:
+    """The underlying Orbit Trainer object."""
+    return self._trainer
+  @property
+  def checkpoint_manager(self) -> Optional[tf.train.CheckpointManager]:
+    """The CheckpointManager that stores the checkpoints in a train job."""
+    return self._checkpoint_manager
+  @property
+  def controller(self) -> orbit.Controller:
+    """The Orbit controller object."""
+    return self._controller
+  def _build_trainer(self, task: base_task.Task, train: bool,
+                     evaluate: bool) -> base_trainer.Trainer:
+    """Create trainer."""
+    with self.strategy.scope():
+      trainer = train_utils.create_trainer(
+          self.params,
+          task,
+          train=train,
+          evaluate=evaluate,
+          checkpoint_exporter=self._build_best_checkpoint_exporter())
+    return trainer
+  def _build_best_checkpoint_exporter(self):
+    return maybe_create_best_ckpt_exporter(self.params, self.model_dir)
+  def _maybe_build_checkpoint_manager(
+      self) -> Optional[tf.train.CheckpointManager]:
+    """Maybe create a CheckpointManager."""
+    assert self.trainer is not None
+    if self.trainer.checkpoint:
+      if self.model_dir is None:
+        raise ValueError('model_dir must be specified, but got None')
+      if (not self.strategy) or self.strategy.extended.should_checkpoint:
+        ckpt_path = self.model_dir
+        max_to_keep = self.params.trainer.max_to_keep
+      else:
+        # In multi worker training we need every worker to save checkpoint,
+        # because variables can trigger synchronization on read and
+        # synchronization needs all workers to participate. To avoid workers
+        # overriding each other we save to a temporary directory on non-chief
+        # workers.
+        ckpt_path = tempfile.mkdtemp()
+        max_to_keep = 1
+      checkpoint_manager = tf.train.CheckpointManager(
+          self.trainer.checkpoint,
+          directory=ckpt_path,
+          max_to_keep=max_to_keep,
+          step_counter=self.trainer.global_step,
+          checkpoint_interval=self.params.trainer.checkpoint_interval,
+          init_fn=self.trainer.initialize)
+    else:
+      checkpoint_manager = None
+    return checkpoint_manager
+  def _build_controller(
+      self,
+      trainer,
+      evaluator,
+      save_summary: bool = True,
+      train_actions: Optional[List[orbit.Action]] = None,
+      eval_actions: Optional[List[orbit.Action]] = None,
+      controller_cls=orbit.Controller,
+      enable_async_checkpointing: bool = False,
+  ) -> orbit.Controller:
+    """Builds an Orbit controller."""
+    train_actions = [] if not train_actions else train_actions
+    if trainer:
+      checkpoint_manager = self.checkpoint_manager
+      assert checkpoint_manager, 'Checkpoint manager required but undefined.'
+      train_actions += actions.get_train_actions(
+          self.params,
+          trainer,
+          self.model_dir,
+          checkpoint_manager=checkpoint_manager,
+      )
+    eval_actions = [] if not eval_actions else eval_actions
+    if evaluator:
+      eval_actions += actions.get_eval_actions(self.params, evaluator,
+                                               self.model_dir)
+    if save_summary:
+      eval_summary_dir = os.path.join(
+          self.model_dir, self.params.trainer.validation_summary_subdir
+      )
+    else:
+      eval_summary_dir = None
+    controller = controller_cls(
+        strategy=self.strategy,
+        trainer=trainer,
+        evaluator=evaluator,
+        global_step=self.trainer.global_step,
+        steps_per_loop=self.params.trainer.steps_per_loop,
+        checkpoint_manager=self.checkpoint_manager,
+        enable_async_checkpointing=enable_async_checkpointing,
+        summary_dir=os.path.join(self.model_dir, 'train')
+        if (save_summary)
+        else None,
+        eval_summary_dir=eval_summary_dir,
+        summary_interval=self.params.trainer.summary_interval
+        if (save_summary)
+        else None,
+        train_actions=train_actions,
+        eval_actions=eval_actions,
+        summary_manager=self._summary_manager
+        if hasattr(self, '_summary_manager')
+        else None,
+        eval_summary_manager=self._eval_summary_manager
+        if hasattr(self, '_eval_summary_manager')
+        else None,
+    )
+    return controller
+  def run(self) -> Tuple[tf_keras.Model, Mapping[str, Any]]:
+    """Run experiments by mode.
+    Returns:
+      A 2-tuple of (model, eval_logs).
+        model: `tf_keras.Model` instance.
+        eval_logs: returns eval metrics logs when run_post_eval is set to True,
+          otherwise, returns {}.
+    """
+    mode = self._mode
+    params = self.params
+    logging.info('Starts to execute mode: %s', mode)
+    with self.strategy.scope():
+      if mode == 'train' or mode == 'train_and_post_eval':
+        self.controller.train(steps=params.trainer.train_steps)
+      elif mode == 'train_and_eval':
+        self.controller.train_and_evaluate(
+            train_steps=params.trainer.train_steps,
+            eval_steps=params.trainer.validation_steps,
+            eval_interval=params.trainer.validation_interval)
+      elif mode == 'eval':
+        self.controller.evaluate(steps=params.trainer.validation_steps)
+      elif mode == 'continuous_eval':
+        def timeout_fn():
+          if self.trainer.global_step.numpy() >= params.trainer.train_steps:
+            return True
+          return False
+        self.controller.evaluate_continuously(
+            steps=params.trainer.validation_steps,
+            timeout=params.trainer.continuous_eval_timeout,
+            timeout_fn=timeout_fn)
+      else:
+        raise NotImplementedError('The mode is not implemented: %s' % mode)
+    num_params = train_utils.try_count_params(self.trainer.model)
+    if num_params is not None:
+      logging.info('Number of trainable params in model: %f Millions.',
+                   num_params / 10.**6)
+    flops = train_utils.try_count_flops(self.trainer.model)
+    if flops is not None:
+      logging.info('FLOPs (multi-adds) in model: %f Billions.',
+                   flops / 10.**9 / 2)
+    if self._run_post_eval or mode == 'train_and_post_eval':
+      with self.strategy.scope():
+        return self.trainer.model, self.controller.evaluate(
+            steps=params.trainer.validation_steps)
+    else:
+      return self.trainer.model, {}
+def run_experiment(
+    distribution_strategy: tf.distribute.Strategy,
+    task: base_task.Task,
+    mode: str,
+    params: config_definitions.ExperimentConfig,
+    model_dir: str,
+    run_post_eval: bool = False,
+    save_summary: bool = True,
+    train_actions: Optional[List[orbit.Action]] = None,
+    eval_actions: Optional[List[orbit.Action]] = None,
+    trainer: Optional[base_trainer.Trainer] = None,
+    controller_cls=orbit.Controller,
+    summary_manager: Optional[orbit.utils.SummaryManager] = None,
+    eval_summary_manager: Optional[orbit.utils.SummaryManager] = None,
+    enable_async_checkpointing: bool = False,
+) -> Tuple[tf_keras.Model, Mapping[str, Any]]:
+  """Runs train/eval configured by the experiment params.
+  Args:
+    distribution_strategy: A distribution strategy.
+    task: A Task instance.
+    mode: A 'str', specifying the mode. Can be 'train', 'eval', 'train_and_eval',
+      'train_and_post_eval' or 'continuous_eval'.
+    params: ExperimentConfig instance.
+    model_dir: A 'str', a path to store model checkpoints and summaries.
+    run_post_eval: Whether to run post eval once after training, metrics logs
+      are returned.
+    save_summary: Whether to save train and validation summary.
+    train_actions: Optional list of Orbit train actions.
+    eval_actions: Optional list of Orbit eval actions.
+    trainer: the base_trainer.Trainer instance. It should be created within the
+      strategy.scope().
+    controller_cls: The controller class to manage the train and eval process.
+      Must be an orbit.Controller subclass.
+    summary_manager: Instance of the summary manager to override default summary
+      manager.
+    eval_summary_manager: Instance of the eval summary manager to override
+      default eval summary manager.
+    enable_async_checkpointing: Optional boolean indicating whether to enable
+      async checkpoint saving.
+  Returns:
+    A 2-tuple of (model, eval_logs).
+      model: `tf_keras.Model` instance.
+      eval_logs: returns eval metrics logs when run_post_eval is set to True,
+        otherwise, returns {}.
+  """
+  runner = OrbitExperimentRunner(
+      distribution_strategy=distribution_strategy,
+      task=task,
+      mode=mode,
+      params=params,
+      model_dir=model_dir,
+      run_post_eval=run_post_eval,
+      save_summary=save_summary,
+      train_actions=train_actions,
+      eval_actions=eval_actions,
+      trainer=trainer,
+      controller_cls=controller_cls,
+      summary_manager=summary_manager,
+      eval_summary_manager=eval_summary_manager,
+      enable_async_checkpointing=enable_async_checkpointing,
+  )
+  return runner.run()

modeling/official/core/train_lib_test.py ADDED

	@@ -0,0 +1,280 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tests for train_lib."""
+import json
+import os
+from absl import flags
+from absl.testing import flagsaver
+from absl.testing import parameterized
+import numpy as np
+import tensorflow as tf, tf_keras
+from tensorflow.python.distribute import combinations
+from tensorflow.python.distribute import strategy_combinations
+from official.common import flags as tfm_flags
+# pylint: disable=unused-import
+from official.common import registry_imports
+# pylint: enable=unused-import
+from official.core import task_factory
+from official.core import train_lib
+from official.core import train_utils
+from official.utils.testing import mock_task
+FLAGS = flags.FLAGS
+tfm_flags.define_flags()
+class TrainTest(tf.test.TestCase, parameterized.TestCase):
+  def setUp(self):
+    super(TrainTest, self).setUp()
+    self._test_config = {
+        'trainer': {
+            'checkpoint_interval': 10,
+            'steps_per_loop': 10,
+            'summary_interval': 10,
+            'train_steps': 10,
+            'validation_steps': 5,
+            'validation_interval': 10,
+            'continuous_eval_timeout': 1,
+            'validation_summary_subdir': 'validation',
+            'optimizer_config': {
+                'optimizer': {
+                    'type': 'sgd',
+                },
+                'learning_rate': {
+                    'type': 'constant'
+                }
+            }
+        },
+    }
+  @combinations.generate(
+      combinations.combine(
+          distribution_strategy=[
+              strategy_combinations.default_strategy,
+              strategy_combinations.cloud_tpu_strategy,
+              strategy_combinations.one_device_strategy_gpu,
+          ],
+          flag_mode=['train', 'eval', 'train_and_eval'],
+          run_post_eval=[True, False]))
+  def test_end_to_end(self, distribution_strategy, flag_mode, run_post_eval):
+    model_dir = self.get_temp_dir()
+    flags_dict = dict(
+        experiment='mock',
+        mode=flag_mode,
+        model_dir=model_dir,
+        params_override=json.dumps(self._test_config))
+    with flagsaver.flagsaver(**flags_dict):
+      params = train_utils.parse_configuration(flags.FLAGS)
+      train_utils.serialize_config(params, model_dir)
+      with distribution_strategy.scope():
+        task = task_factory.get_task(params.task, logging_dir=model_dir)
+      _, logs = train_lib.run_experiment(
+          distribution_strategy=distribution_strategy,
+          task=task,
+          mode=flag_mode,
+          params=params,
+          model_dir=model_dir,
+          run_post_eval=run_post_eval)
+    if 'eval' in flag_mode:
+      self.assertTrue(
+          tf.io.gfile.exists(
+              os.path.join(model_dir,
+                           params.trainer.validation_summary_subdir)))
+    if run_post_eval:
+      self.assertNotEmpty(logs)
+    else:
+      self.assertEmpty(logs)
+    self.assertNotEmpty(
+        tf.io.gfile.glob(os.path.join(model_dir, 'params.yaml')))
+    if flag_mode == 'eval':
+      return
+    self.assertNotEmpty(
+        tf.io.gfile.glob(os.path.join(model_dir, 'checkpoint')))
+    # Tests continuous evaluation.
+    _, logs = train_lib.run_experiment(
+        distribution_strategy=distribution_strategy,
+        task=task,
+        mode='continuous_eval',
+        params=params,
+        model_dir=model_dir,
+        run_post_eval=run_post_eval)
+  @combinations.generate(
+      combinations.combine(
+          distribution_strategy=[
+              strategy_combinations.default_strategy,
+              strategy_combinations.cloud_tpu_strategy,
+              strategy_combinations.one_device_strategy_gpu,
+          ],
+          flag_mode=['train', 'eval', 'train_and_eval'],
+          run_post_eval=[True, False]))
+  def test_end_to_end_class(self, distribution_strategy, flag_mode,
+                            run_post_eval):
+    model_dir = self.get_temp_dir()
+    flags_dict = dict(
+        experiment='mock',
+        mode=flag_mode,
+        model_dir=model_dir,
+        params_override=json.dumps(self._test_config))
+    with flagsaver.flagsaver(**flags_dict):
+      params = train_utils.parse_configuration(flags.FLAGS)
+      train_utils.serialize_config(params, model_dir)
+      with distribution_strategy.scope():
+        task = task_factory.get_task(params.task, logging_dir=model_dir)
+      _, logs = train_lib.OrbitExperimentRunner(
+          distribution_strategy=distribution_strategy,
+          task=task,
+          mode=flag_mode,
+          params=params,
+          model_dir=model_dir,
+          run_post_eval=run_post_eval).run()
+    if 'eval' in flag_mode:
+      self.assertTrue(
+          tf.io.gfile.exists(
+              os.path.join(model_dir,
+                           params.trainer.validation_summary_subdir)))
+    if run_post_eval:
+      self.assertNotEmpty(logs)
+    else:
+      self.assertEmpty(logs)
+    self.assertNotEmpty(
+        tf.io.gfile.glob(os.path.join(model_dir, 'params.yaml')))
+    if flag_mode == 'eval':
+      return
+    self.assertNotEmpty(
+        tf.io.gfile.glob(os.path.join(model_dir, 'checkpoint')))
+    # Tests continuous evaluation.
+    _, logs = train_lib.OrbitExperimentRunner(
+        distribution_strategy=distribution_strategy,
+        task=task,
+        mode='continuous_eval',
+        params=params,
+        model_dir=model_dir,
+        run_post_eval=run_post_eval).run()
+  @combinations.generate(
+      combinations.combine(
+          distribution_strategy=[
+              strategy_combinations.default_strategy,
+              strategy_combinations.cloud_tpu_strategy,
+              strategy_combinations.one_device_strategy_gpu,
+          ],
+          flag_mode=['train', 'train_and_eval'],
+      ))
+  def test_recovery_nan_error(self, distribution_strategy, flag_mode):
+    model_dir = self.get_temp_dir()
+    flags_dict = dict(
+        experiment='mock',
+        mode=flag_mode,
+        model_dir=model_dir,
+        params_override=json.dumps(self._test_config))
+    with flagsaver.flagsaver(**flags_dict):
+      params = train_utils.parse_configuration(flags.FLAGS)
+      train_utils.serialize_config(params, model_dir)
+      with distribution_strategy.scope():
+        # task = task_factory.get_task(params.task, logging_dir=model_dir)
+        task = mock_task.MockTask(params.task, logging_dir=model_dir)
+        # Set the loss to NaN to trigger RuntimeError.
+        def build_losses(labels, model_outputs, aux_losses=None):
+          del labels, model_outputs
+          return tf.constant([np.nan], tf.float32) + aux_losses
+        task.build_losses = build_losses
+      with self.assertRaises(RuntimeError):
+        train_lib.OrbitExperimentRunner(
+            distribution_strategy=distribution_strategy,
+            task=task,
+            mode=flag_mode,
+            params=params,
+            model_dir=model_dir).run()
+  @combinations.generate(
+      combinations.combine(
+          distribution_strategy=[
+              strategy_combinations.default_strategy,
+              strategy_combinations.cloud_tpu_strategy,
+              strategy_combinations.one_device_strategy_gpu,
+          ],
+          flag_mode=['train'],
+      ))
+  def test_recovery(self, distribution_strategy, flag_mode):
+    loss_threshold = 1.0
+    model_dir = self.get_temp_dir()
+    flags_dict = dict(
+        experiment='mock',
+        mode=flag_mode,
+        model_dir=model_dir,
+        params_override=json.dumps(self._test_config))
+    with flagsaver.flagsaver(**flags_dict):
+      params = train_utils.parse_configuration(flags.FLAGS)
+      params.trainer.loss_upper_bound = loss_threshold
+      params.trainer.recovery_max_trials = 1
+      train_utils.serialize_config(params, model_dir)
+      with distribution_strategy.scope():
+        task = task_factory.get_task(params.task, logging_dir=model_dir)
+      # Saves a checkpoint for reference.
+      model = task.build_model()
+      checkpoint = tf.train.Checkpoint(model=model)
+      checkpoint_manager = tf.train.CheckpointManager(
+          checkpoint, self.get_temp_dir(), max_to_keep=2)
+      checkpoint_manager.save()
+      before_weights = model.get_weights()
+      def build_losses(labels, model_outputs, aux_losses=None):
+        del labels, model_outputs
+        return tf.constant([loss_threshold], tf.float32) + aux_losses
+      task.build_losses = build_losses
+      model, _ = train_lib.OrbitExperimentRunner(
+          distribution_strategy=distribution_strategy,
+          task=task,
+          mode=flag_mode,
+          params=params,
+          model_dir=model_dir).run()
+      after_weights = model.get_weights()
+      for left, right in zip(before_weights, after_weights):
+        self.assertAllEqual(left, right)
+  def test_parse_configuration(self):
+    model_dir = self.get_temp_dir()
+    flags_dict = dict(
+        experiment='mock',
+        mode='train',
+        model_dir=model_dir,
+        params_override=json.dumps(self._test_config))
+    with flagsaver.flagsaver(**flags_dict):
+      params = train_utils.parse_configuration(flags.FLAGS, lock_return=True)
+      with self.assertRaises(ValueError):
+        params.override({'task': {'init_checkpoint': 'Foo'}})
+      params = train_utils.parse_configuration(flags.FLAGS, lock_return=False)
+      params.override({'task': {'init_checkpoint': 'Bar'}})
+      self.assertEqual(params.task.init_checkpoint, 'Bar')
+if __name__ == '__main__':
+  tf.test.main()

modeling/official/core/train_utils.py ADDED

	@@ -0,0 +1,610 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Training utils."""
+import dataclasses
+import inspect
+import json
+import os
+import pprint
+from typing import Any, Callable, Dict, List, Optional, Union
+from absl import logging
+import gin
+import numpy as np
+import orbit
+import tensorflow as tf, tf_keras
+# pylint: disable=g-direct-tensorflow-import
+from tensorflow.python.framework import ops
+from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2_as_graph
+# pylint: enable=g-direct-tensorflow-import
+from official.core import base_task
+from official.core import base_trainer
+from official.core import config_definitions
+from official.core import exp_factory
+from official.modeling import hyperparams
+BEST_CHECKPOINT_NAME = 'best_ckpt'
+def get_leaf_nested_dict(d: Dict[str, Any], keys: List[str]) -> Dict[str, Any]:
+  """Get leaf from a dictionary with arbitrary depth with a list of keys.
+  Args:
+    d: The dictionary to extract value from.
+    keys: The list of keys to extract values recursively.
+  Returns:
+    The value of the leaf.
+  Raises:
+    KeyError: If the path given by keys does not exist, or if the value
+      extracted is a dictionary.
+  """
+  leaf = d
+  for k in keys:
+    if not isinstance(leaf, dict) or k not in leaf:
+      raise KeyError(
+          'Path not exist while traversing the dictionary: d with keys'
+          ': %s.' % keys)
+    leaf = leaf[k]
+  if isinstance(leaf, dict):
+    raise KeyError('The value extracted with keys: %s is not a leaf of the '
+                   'dictionary: %s.' % (keys, d))
+  return leaf
+def cast_leaf_nested_dict(d: Dict[str, Any],
+                          cast_fn: Callable[[Any], Any]) -> Dict[str, Any]:
+  """Cast the leaves of a dictionary with arbitrary depth in place.
+  Args:
+    d: The dictionary whose leaves are cast.
+    cast_fn: The casting function.
+  Returns:
+    A dictionary with the same structure as d.
+  """
+  for key, value in d.items():
+    if isinstance(value, dict):
+      d[key] = cast_leaf_nested_dict(value, cast_fn)
+    else:
+      d[key] = cast_fn(value)
+  return d
+def _filter_leaf_nested_dict(
+    d: Dict[str, Any], predicate: Callable[[Any], bool]
+) -> Dict[str, Any]:
+  """Filters the leaves of a dictionary with arbitrary depth.
+  Args:
+    d: The dictionary whose leaves are filtered. It is not modified.
+    predicate: A function that will be called on every leaf item. When the
+      function returns True the leaf will be kept. Otherwise the leaf will be
+      dropped.
+  Returns:
+    A new dictionary with the filtered result.
+  """
+  result = {}
+  for key, value in d.items():
+    if isinstance(value, dict):
+      result[key] = _filter_leaf_nested_dict(value, predicate)
+    elif predicate(value):
+      result[key] = value
+  return result
+def maybe_create_best_ckpt_exporter(params: config_definitions.ExperimentConfig,
+                                    data_dir: str) -> Any:
+  """Maybe create a BestCheckpointExporter object, according to the config."""
+  export_subdir = params.trainer.best_checkpoint_export_subdir
+  metric_name = params.trainer.best_checkpoint_eval_metric
+  metric_comp = params.trainer.best_checkpoint_metric_comp
+  if data_dir and export_subdir and metric_name:
+    best_ckpt_dir = os.path.join(data_dir, export_subdir)
+    best_ckpt_exporter = BestCheckpointExporter(best_ckpt_dir, metric_name,
+                                                metric_comp)
+    logging.info(
+        'Created the best checkpoint exporter. '
+        'data_dir: %s, export_subdir: %s, metric_name: %s', data_dir,
+        export_subdir, metric_name)
+  else:
+    best_ckpt_exporter = None
+  return best_ckpt_exporter
+class BestCheckpointExporter:
+  """Keeps track of the best result, and saves its checkpoint.
+  Orbit will support an API for checkpoint exporter. This class will be used
+  together with orbit once this functionality is ready.
+  """
+  def __init__(self, export_dir: str, metric_name: str, metric_comp: str):
+    """Initialization.
+    Args:
+      export_dir: The directory that will contain exported checkpoints.
+      metric_name: Indicates which metric to look at, when determining which
+        result is better. If eval_logs being passed to maybe_export_checkpoint
+        is a nested dictionary, use `|` as a separator for different layers.
+      metric_comp: Indicates how to compare results. Either `lower` or `higher`.
+    """
+    self._export_dir = export_dir
+    self._metric_name = metric_name.split('|')
+    self._metric_comp = metric_comp
+    if self._metric_comp not in ('lower', 'higher'):
+      raise ValueError('best checkpoint metric comp must be one of '
+                       'higher, lower. Got: {}'.format(self._metric_comp))
+    tf.io.gfile.makedirs(os.path.dirname(self.best_ckpt_logs_path))
+    self._best_ckpt_logs = self._maybe_load_best_eval_metric()
+    self._checkpoint_manager = None
+  def _get_checkpoint_manager(self, checkpoint):
+    """Gets an existing checkpoint manager or creates a new one."""
+    if self._checkpoint_manager is None or (self._checkpoint_manager.checkpoint
+                                            != checkpoint):
+      logging.info('Creates a new checkpoint manager.')
+      self._checkpoint_manager = tf.train.CheckpointManager(
+          checkpoint,
+          directory=self._export_dir,
+          max_to_keep=1,
+          checkpoint_name=BEST_CHECKPOINT_NAME)
+    return self._checkpoint_manager
+  def maybe_export_checkpoint(
+      self, checkpoint, eval_logs, global_step, write_logs=True) -> bool:
+    """Compare eval_logs with past eval_logs and export checkpoint if better."""
+    logging.info('[BestCheckpointExporter] received eval_logs: %s, at step: %d',
+                 eval_logs, global_step)
+    if self._best_ckpt_logs is None or self._new_metric_is_better(
+        self._best_ckpt_logs, eval_logs):
+      self._best_ckpt_logs = eval_logs
+      if write_logs:
+        self.export_best_eval_metric(self._best_ckpt_logs, global_step)
+      self._get_checkpoint_manager(checkpoint).save()
+      return True
+    return False
+  def _maybe_load_best_eval_metric(self):
+    if not tf.io.gfile.exists(self.best_ckpt_logs_path):
+      return None
+    with tf.io.gfile.GFile(self.best_ckpt_logs_path, 'r') as reader:
+      return json.loads(reader.read())
+  def _new_metric_is_better(self, old_logs, new_logs):
+    """Check if the metric in new_logs is better than the metric in old_logs."""
+    old_value = float(
+        orbit.utils.get_value(
+            get_leaf_nested_dict(old_logs, self._metric_name)))
+    new_value = float(
+        orbit.utils.get_value(
+            get_leaf_nested_dict(new_logs, self._metric_name)))
+    logging.info('[BestCheckpointExporter] comparing results. old: %f, new: %f',
+                 old_value, new_value)
+    if self._metric_comp == 'higher':
+      if new_value > old_value:
+        logging.info('[BestCheckpointExporter] '
+                     'the new number is better since it is higher.')
+        return True
+    else:  # self._metric_comp == 'lower':
+      if new_value < old_value:
+        logging.info('[BestCheckpointExporter] '
+                     'the new number is better since it is lower.')
+        return True
+    return False
+  def export_best_eval_metric(self, eval_logs, global_step):
+    """Export evaluation results of the best checkpoint into a json file."""
+    # eval_logs may contain non-scalar tensors, such as image data when
+    # `allow_image_summary` is True. Here we keep only tensors of rank <= 1.
+    eval_logs_ext = _filter_leaf_nested_dict(
+        eval_logs, lambda x: tf.rank(x) <= 1
+    )
+    eval_logs_ext['best_ckpt_global_step'] = global_step
+    eval_logs_ext = cast_leaf_nested_dict(
+        eval_logs_ext, lambda x: float(orbit.utils.get_value(x)))
+    # Saving json file is very fast.
+    with tf.io.gfile.GFile(self.best_ckpt_logs_path, 'w') as writer:
+      writer.write(json.dumps(eval_logs_ext, indent=4) + '\n')
+  @property
+  def best_ckpt_logs(self):
+    return self._best_ckpt_logs
+  @property
+  def best_ckpt_logs_path(self):
+    return os.path.join(self._export_dir, 'info.json')
+  @property
+  def best_ckpt_path(self):
+    """Returns the best ckpt path or None if there is no ckpt yet."""
+    return tf.train.latest_checkpoint(self._export_dir)
+def create_optimizer(task: base_task.Task,
+                     params: config_definitions.ExperimentConfig
+                     ) -> tf_keras.optimizers.Optimizer:
+  """Creates an optimizer, staying backward compatible with new args."""
+  if 'dp_config' in inspect.signature(task.create_optimizer).parameters:
+    dp_config = None
+    if hasattr(params.task, 'differential_privacy_config'):
+      dp_config = params.task.differential_privacy_config
+    optimizer = task.create_optimizer(
+        params.trainer.optimizer_config, params.runtime,
+        dp_config=dp_config)
+  else:
+    if hasattr(params.task, 'differential_privacy_config'
+              ) and params.task.differential_privacy_config is not None:
+      raise ValueError('Differential privacy config is specified but '
+                       'task.create_optimizer api does not accept it.')
+    optimizer = task.create_optimizer(
+        params.trainer.optimizer_config,
+        params.runtime)
+  return optimizer
+@gin.configurable
+def create_trainer(params: config_definitions.ExperimentConfig,
+                   task: base_task.Task,
+                   train: bool,
+                   evaluate: bool,
+                   checkpoint_exporter: Optional[BestCheckpointExporter] = None,
+                   trainer_cls=base_trainer.Trainer) -> base_trainer.Trainer:
+  """Create trainer."""
+  logging.info('Running default trainer.')
+  model = task.build_model()
+  optimizer = create_optimizer(task, params)
+  return trainer_cls(
+      params,
+      task,
+      model=model,
+      optimizer=optimizer,
+      train=train,
+      evaluate=evaluate,
+      checkpoint_exporter=checkpoint_exporter)
+@dataclasses.dataclass
+class ParseConfigOptions:
+  """Use this dataclass instead of FLAGS to customize parse_configuration()."""
+  experiment: str
+  config_file: List[str]
+  tpu: str = ''
+  tf_data_service: str = ''
+  params_override: str = ''
+  def __contains__(self, name):
+    return name in dataclasses.asdict(self)
+class ExperimentParser:
+  """Constructs the Experiment config from Flags or equivalent object.
+  In most cases, users only need to call the `parse()` function:
+  ```
+  builder = ExperimentParser(FLAGS)
+  params = builder.parse()
+  ```
+  Advanced users can modify the flow by calling the parse_*() functions
+  separately.
+  """
+  def __init__(self, flags_obj):
+    self._flags_obj = flags_obj
+  def parse(self):
+    """Overall process of constructing Experiment config."""
+    params = self.base_experiment()
+    params = self.parse_config_file(params)
+    params = self.parse_runtime(params)
+    params = self.parse_data_service(params)
+    params = self.parse_params_override(params)
+    return params
+  def base_experiment(self):
+    """Get the base experiment config from --experiment field."""
+    if self._flags_obj.experiment is None:
+      raise ValueError('The flag --experiment must be specified.')
+    return exp_factory.get_exp_config(self._flags_obj.experiment)
+  def parse_config_file(self, params):
+    """Override the configs of params from the config_file."""
+    for config_file in self._flags_obj.config_file or []:
+      params = hyperparams.override_params_dict(
+          params, config_file, is_strict=True)
+    return params
+  def parse_runtime(self, params):
+    """Override the runtime configs of params from flags."""
+    # Override the TPU address.
+    params.override({
+        'runtime': {
+            'tpu': self._flags_obj.tpu,
+        },
+    })
+    return params
+  def parse_data_service(self, params):
+    """Override the data service configs of params from flags."""
+    if ('tf_data_service' in self._flags_obj and
+        self._flags_obj.tf_data_service and
+        isinstance(params.task, config_definitions.TaskConfig)):
+      params.override({
+          'task': {
+              'train_data': {
+                  'tf_data_service_address': self._flags_obj.tf_data_service,
+              },
+              'validation_data': {
+                  'tf_data_service_address': self._flags_obj.tf_data_service,
+              }
+          }
+      })
+    return params
+  def parse_params_override(self, params):
+    # Get the second level of override from `--params_override`.
+    # `--params_override` is typically used as a further override over the
+    # template. For example, one may define a particular template for training
+    # ResNet50 on ImageNet in a config file and pass it via `--config_file`,
+    # then define different learning rates and pass it via `--params_override`.
+    if self._flags_obj.params_override:
+      params = hyperparams.override_params_dict(
+          params, self._flags_obj.params_override, is_strict=True)
+    return params
+def parse_configuration(flags_obj, lock_return=True, print_return=True):
+  """Parses ExperimentConfig from flags."""
+  params = ExperimentParser(flags_obj).parse()
+  params.validate()
+  if lock_return:
+    params.lock()
+  if print_return:
+    pp = pprint.PrettyPrinter()
+    logging.info('Final experiment parameters:\n%s',
+                 pp.pformat(params.as_dict()))
+  return params
+def serialize_config(params: config_definitions.ExperimentConfig,
+                     model_dir: str):
+  """Serializes and saves the experiment config."""
+  if model_dir is None:
+    raise ValueError('model_dir must be specified, but got None')
+  params_save_path = os.path.join(model_dir, 'params.yaml')
+  logging.info('Saving experiment configuration to %s', params_save_path)
+  tf.io.gfile.makedirs(model_dir)
+  hyperparams.save_params_dict_to_yaml(params, params_save_path)
+def save_gin_config(filename_suffix: str, model_dir: str):
+  """Serializes and saves the operative gin config."""
+  gin_save_path = os.path.join(
+      model_dir, 'operative_config.{}.gin'.format(filename_suffix))
+  logging.info('Saving gin configurations to %s', gin_save_path)
+  tf.io.gfile.makedirs(model_dir)
+  with tf.io.gfile.GFile(gin_save_path, 'w') as f:
+    f.write(gin.operative_config_str())
+def read_global_step_from_checkpoint(ckpt_file_path):
+  """Reads global step from checkpoint; raises ValueError if it is absent."""
+  global_step = tf.Variable(-1, dtype=tf.int64)
+  ckpt = tf.train.Checkpoint(global_step=global_step)
+  try:
+    ckpt.restore(ckpt_file_path).expect_partial()
+    global_step_maybe_restored = global_step.numpy()
+  except tf.errors.InvalidArgumentError:
+    global_step_maybe_restored = -1
+  if global_step_maybe_restored == -1:
+    raise ValueError('global_step not found in checkpoint {}. '
+                     'If you want to run finetune eval jobs, you need to '
+                     'make sure that your pretrain model writes '
+                     'global_step in its checkpoints.'.format(ckpt_file_path))
+  global_step_restored = global_step.numpy()
+  logging.info('get global_step %d from checkpoint %s', global_step_restored,
+               ckpt_file_path)
+  return global_step_restored
+def write_json_summary(log_dir, global_step, eval_metrics):
+  """Dump evaluation metrics to json file."""
+  serializable_dict = {}
+  for name, value in eval_metrics.items():
+    if hasattr(value, 'numpy'):
+      serializable_dict[name] = str(value.numpy())
+    else:
+      serializable_dict[name] = str(value)
+  output_json = os.path.join(log_dir, 'metrics-{}.json'.format(global_step))
+  logging.info('Evaluation results at pretrain step %d: %s', global_step,
+               serializable_dict)
+  with tf.io.gfile.GFile(output_json, 'w') as writer:
+    writer.write(json.dumps(serializable_dict, indent=4) + '\n')
+def write_summary(summary_writer, global_step, eval_metrics):
+  """Write evaluation metrics to TF summary."""
+  numeric_dict = {}
+  for name, value in eval_metrics.items():
+    numeric_dict[name] = float(orbit.utils.get_value(value))
+  with summary_writer.as_default():
+    for name, value in numeric_dict.items():
+      tf.summary.scalar(name, value, step=global_step)
+    summary_writer.flush()
+def remove_ckpts(model_dir):
+  """Remove model checkpoints, so we can restart."""
+  ckpts = os.path.join(model_dir, 'ckpt-*')
+  logging.info('removing checkpoint files %s', ckpts)
+  for file_to_remove in tf.io.gfile.glob(ckpts):
+    tf.io.gfile.rmtree(file_to_remove)
+  file_to_remove = os.path.join(model_dir, 'checkpoint')
+  if tf.io.gfile.exists(file_to_remove):
+    tf.io.gfile.remove(file_to_remove)
+def write_model_params(model: Union[tf.Module, tf_keras.Model],
+                       output_path: str) -> None:
+  """Writes the model parameters and shapes to a file.
+  Args:
+    model: A model instance.
+    output_path: Output file path.
+  """
+  with tf.io.gfile.GFile(output_path, 'w') as f:
+    total_params = 0
+    for var in model.variables:
+      shape = tf.shape(var)
+      total_params += tf.math.reduce_prod(shape).numpy()
+      f.write(f'{var.name} {shape.numpy().tolist()}\n')
+    f.write(f'\nTotal params: {total_params}\n')
+def try_count_params(
+    model: Union[tf.Module, tf_keras.Model],
+    trainable_only: bool = False):
+  """Counts the number of parameters in the model, if possible.
+  Args:
+    model: Try to count the number of params in this model.
+    trainable_only: Whether to calculate trainable params only. This flag is
+      not used when the model has `count_params` attribute.
+  Returns:
+    The number of parameters or None.
+  """
+  if hasattr(model, 'count_params'):
+    try:
+      return model.count_params()
+    except ValueError:
+      logging.info('Number of trainable params unknown, because the build() '
+                   'methods in keras layers were not called. This is probably '
+                   'because the model was not fed any input, e.g., the max '
+                   'train step already reached before this run.')
+      return None
+  else:
+    total_params = 0
+    variables = model.trainable_variables if trainable_only else model.variables
+    for var in variables:
+      shape = tf.shape(var)
+      total_params += tf.math.reduce_prod(shape).numpy()
+  return total_params
+def try_count_flops(model: Union[tf.Module, tf_keras.Model],
+                    inputs_kwargs: Optional[Dict[str, Any]] = None,
+                    output_path: Optional[str] = None):
+  """Counts and returns model FLOPs.
+  Args:
+    model: A model instance.
+    inputs_kwargs: An optional dictionary of keyword arguments specifying the
+      inputs' shape specifications, used to get the concrete function.
+    output_path: A file path to write the profiling results to.
+  Returns:
+    The model's FLOPs.
+  """
+  if hasattr(model, 'inputs'):
+    try:
+      # Get input shape and set batch size to 1.
+      if model.inputs:
+        inputs = [
+            tf.TensorSpec([1] + input.shape[1:], input.dtype)
+            for input in model.inputs
+        ]
+        concrete_func = tf.function(model).get_concrete_function(inputs)
+      # If model.inputs is invalid, try to use the input to get concrete
+      # function for model.call (subclass model).
+      else:
+        concrete_func = tf.function(model.call).get_concrete_function(
+            **inputs_kwargs)
+      frozen_func, _ = convert_variables_to_constants_v2_as_graph(concrete_func)
+      # Calculate FLOPs.
+      run_meta = tf.compat.v1.RunMetadata()
+      opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
+      if output_path is not None:
+        opts['output'] = f'file:outfile={output_path}'
+      else:
+        opts['output'] = 'none'
+      flops = tf.compat.v1.profiler.profile(
+          graph=frozen_func.graph, run_meta=run_meta, options=opts)
+      return flops.total_float_ops
+    except Exception as e:  # pylint: disable=broad-except
+      logging.info(
+          'Failed to count model FLOPs with error %s, because the build() '
+          'methods in keras layers were not called. This is probably because '
+          'the model was not fed any input, e.g., the max train step already '
+          'reached before this run.', e)
+      return None
+  return None
+@ops.RegisterStatistics('Einsum', 'flops')
+def _einsum_flops(graph, node):
+  """Calculates the compute resources needed for Einsum."""
+  assert len(node.input) == 2
+  x_shape = tf.compat.v1.graph_util.tensor_shape_from_node_def_name(
+      graph, node.input[0])
+  y_shape = tf.compat.v1.graph_util.tensor_shape_from_node_def_name(
+      graph, node.input[1])
+  x_shape.assert_is_fully_defined()
+  y_shape.assert_is_fully_defined()
+  x_shape = x_shape.as_list()
+  y_shape = y_shape.as_list()
+  equation = str(node.attr['equation'])
+  equation = (
+      equation.replace('s:', '')
+      .replace('"', '')
+      .replace(' ', '')
+      .replace('\n', '')
+  )
+  x_str = equation.split(',')[0]
+  y_r_str = equation.split(',')[1]
+  y_str = y_r_str.split('->')[0]
+  r_str = y_r_str.split('->')[1]
+  shape_dic = {}
+  contracted = set()
+  for indice in x_str + y_str:
+    if indice in x_str:
+      indice_dim = x_shape[x_str.find(indice)]
+    elif indice in y_str:
+      indice_dim = y_shape[y_str.find(indice)]
+    else:
+      raise ValueError('indice {} not found in inputs'.format(indice))
+    shape_dic[indice] = indice_dim
+    if indice not in r_str:
+      contracted.add(indice)
+  madds = np.prod([shape_dic[indice] for indice in r_str]) * (
+      np.prod([shape_dic[indice] for indice in contracted]))
+  flops = 2 * madds
+  return ops.OpStats('flops', flops)
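
The FLOPs estimate registered for `Einsum` above is twice the number of multiply-adds, where the multiply-adds are the product of the output dimensions times the product of the contracted dimensions. The following is a minimal pure-Python sketch of that arithmetic, not the registered TensorFlow function: the name `einsum_flops` is hypothetical, and it assumes a two-operand equation with an explicit `->`, no ellipsis, and no repeated labels within one operand.

```python
import math
from typing import Dict, Sequence


def einsum_flops(equation: str, x_shape: Sequence[int],
                 y_shape: Sequence[int]) -> int:
  """Estimates FLOPs of a two-operand einsum as 2 * multiply-adds."""
  inputs, result = equation.replace(' ', '').split('->')
  x_str, y_str = inputs.split(',')
  # Map each index label to its dimension size.
  dims: Dict[str, int] = {}
  for labels, shape in ((x_str, x_shape), (y_str, y_shape)):
    for label, size in zip(labels, shape):
      dims[label] = size
  # Labels that do not appear in the result are summed over (contracted).
  contracted = {label for label in dims if label not in result}
  madds = math.prod(dims[label] for label in result) * math.prod(
      dims[label] for label in contracted)
  return 2 * madds


# A [2, 3] x [3, 4] matrix product: 2 * (2 * 4) * 3 = 48.
print(einsum_flops('ij,jk->ik', [2, 3], [3, 4]))
```

For a plain matrix product this reduces to the familiar `2 * m * n * k` count; batch dimensions that survive into the result simply multiply the total.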

modeling/official/core/train_utils_test.py ADDED

	@@ -0,0 +1,215 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tests for official.core.train_utils."""
+import json
+import os
+import pprint
+import numpy as np
+import tensorflow as tf, tf_keras
+from official.core import exp_factory
+from official.core import test_utils
+from official.core import train_utils
+from official.modeling import hyperparams
+@exp_factory.register_config_factory('foo')
+def foo():
+  """Multitask experiment for test."""
+  experiment_config = hyperparams.Config(
+      default_params={
+          'runtime': {
+              'tpu': 'fake',
+          },
+          'task': {
+              'model': {
+                  'model_id': 'bar',
+              },
+          },
+          'trainer': {
+              'train_steps': -1,
+              'validation_steps': -1,
+          },
+      })
+  return experiment_config
+class TrainUtilsTest(tf.test.TestCase):
+  def test_get_leaf_nested_dict(self):
+    d = {'a': {'i': {'x': 5}}}
+    self.assertEqual(train_utils.get_leaf_nested_dict(d, ['a', 'i', 'x']), 5)
+  def test_get_leaf_nested_dict_not_leaf(self):
+    with self.assertRaisesRegex(KeyError, 'The value extracted with keys.*'):
+      d = {'a': {'i': {'x': 5}}}
+      train_utils.get_leaf_nested_dict(d, ['a', 'i'])
+  def test_get_leaf_nested_dict_path_not_exist_missing_key(self):
+    with self.assertRaisesRegex(KeyError, 'Path not exist while traversing .*'):
+      d = {'a': {'i': {'x': 5}}}
+      train_utils.get_leaf_nested_dict(d, ['a', 'i', 'y'])
+  def test_get_leaf_nested_dict_path_not_exist_out_of_range(self):
+    with self.assertRaisesRegex(KeyError, 'Path not exist while traversing .*'):
+      d = {'a': {'i': {'x': 5}}}
+      train_utils.get_leaf_nested_dict(d, ['a', 'i', 'z'])
+  def test_get_leaf_nested_dict_path_not_exist_meets_leaf(self):
+    with self.assertRaisesRegex(KeyError, 'Path not exist while traversing .*'):
+      d = {'a': {'i': 5}}
+      train_utils.get_leaf_nested_dict(d, ['a', 'i', 'z'])
+  def test_cast_leaf_nested_dict(self):
+    d = {'a': {'i': {'x': '123'}}, 'b': 456.5}
+    d = train_utils.cast_leaf_nested_dict(d, int)
+    self.assertEqual(d['a']['i']['x'], 123)
+    self.assertEqual(d['b'], 456)
+  def test_write_model_params_keras_model(self):
+    inputs = np.zeros([2, 3])
+    model = test_utils.FakeKerasModel()
+    model(inputs)  # Must do forward pass to build the model.
+    filepath = os.path.join(self.create_tempdir(), 'model_params.txt')
+    train_utils.write_model_params(model, filepath)
+    actual = tf.io.gfile.GFile(filepath, 'r').read().splitlines()
+    expected = [
+        'fake_keras_model/dense/kernel:0 [3, 4]',
+        'fake_keras_model/dense/bias:0 [4]',
+        'fake_keras_model/dense_1/kernel:0 [4, 4]',
+        'fake_keras_model/dense_1/bias:0 [4]',
+        '',
+        'Total params: 36',
+    ]
+    self.assertEqual(actual, expected)
+  def test_write_model_params_module(self):
+    inputs = np.zeros([2, 3], dtype=np.float32)
+    model = test_utils.FakeModule(3, name='fake_module')
+    model(inputs)  # Must do forward pass to build the model.
+    filepath = os.path.join(self.create_tempdir(), 'model_params.txt')
+    train_utils.write_model_params(model, filepath)
+    actual = tf.io.gfile.GFile(filepath, 'r').read().splitlines()
+    expected = [
+        'fake_module/dense/b:0 [4]',
+        'fake_module/dense/w:0 [3, 4]',
+        'fake_module/dense_1/b:0 [4]',
+        'fake_module/dense_1/w:0 [4, 4]',
+        '',
+        'Total params: 36',
+    ]
+    self.assertEqual(actual, expected)
+  def test_construct_experiment_from_flags(self):
+    options = train_utils.ParseConfigOptions(
+        experiment='foo',
+        config_file=[],
+        tpu='bar',
+        tf_data_service='',
+        params_override='task.model.model_id=new,'
+        'trainer.train_steps=10,'
+        'trainer.validation_steps=11')
+    builder = train_utils.ExperimentParser(options)
+    params_from_obj = builder.parse()
+    params_from_func = train_utils.parse_configuration(options)
+    pp = pprint.PrettyPrinter()
+    self.assertEqual(
+        pp.pformat(params_from_obj.as_dict()),
+        pp.pformat(params_from_func.as_dict()))
+    self.assertEqual(params_from_obj.runtime.tpu, 'bar')
+    self.assertEqual(params_from_obj.task.model.model_id, 'new')
+    self.assertEqual(params_from_obj.trainer.train_steps, 10)
+    self.assertEqual(params_from_obj.trainer.validation_steps, 11)
+class BestCheckpointExporterTest(tf.test.TestCase):
+  def test_maybe_export(self):
+    model_dir = self.create_tempdir().full_path
+    best_ckpt_path = os.path.join(model_dir, 'best_ckpt-1')
+    metric_name = 'test_metric|metric_1'
+    exporter = train_utils.BestCheckpointExporter(
+        model_dir, metric_name, 'higher')
+    v = tf.Variable(1.0)
+    checkpoint = tf.train.Checkpoint(v=v)
+    ret = exporter.maybe_export_checkpoint(
+        checkpoint, {'test_metric': {'metric_1': 5.0}}, 100)
+    with self.subTest(name='Successful first save.'):
+      self.assertEqual(ret, True)
+      v_2 = tf.Variable(2.0)
+      checkpoint_2 = tf.train.Checkpoint(v=v_2)
+      checkpoint_2.restore(best_ckpt_path)
+      self.assertEqual(v_2.numpy(), 1.0)
+    v = tf.Variable(3.0)
+    checkpoint = tf.train.Checkpoint(v=v)
+    ret = exporter.maybe_export_checkpoint(
+        checkpoint, {'test_metric': {'metric_1': 6.0}}, 200)
+    with self.subTest(name='Successful better metric save.'):
+      self.assertEqual(ret, True)
+      v_2 = tf.Variable(2.0)
+      checkpoint_2 = tf.train.Checkpoint(v=v_2)
+      checkpoint_2.restore(best_ckpt_path)
+      self.assertEqual(v_2.numpy(), 3.0)
+    v = tf.Variable(5.0)
+    checkpoint = tf.train.Checkpoint(v=v)
+    ret = exporter.maybe_export_checkpoint(
+        checkpoint, {'test_metric': {'metric_1': 1.0}}, 300)
+    with self.subTest(name='Worse metric no save.'):
+      self.assertEqual(ret, False)
+      v_2 = tf.Variable(2.0)
+      checkpoint_2 = tf.train.Checkpoint(v=v_2)
+      checkpoint_2.restore(best_ckpt_path)
+      self.assertEqual(v_2.numpy(), 3.0)
+  def test_export_best_eval_metric(self):
+    model_dir = self.create_tempdir().full_path
+    metric_name = 'test_metric|metric_1'
+    exporter = train_utils.BestCheckpointExporter(model_dir, metric_name,
+                                                  'higher')
+    exporter.export_best_eval_metric({'test_metric': {'metric_1': 5.0}}, 100)
+    with tf.io.gfile.GFile(os.path.join(model_dir, 'info.json'),
+                           'rb') as reader:
+      metric = json.loads(reader.read())
+      self.assertAllEqual(
+          metric,
+          {'test_metric': {'metric_1': 5.0}, 'best_ckpt_global_step': 100.0})
+  def test_export_best_eval_metric_skips_non_scalar_values(self):
+    model_dir = self.create_tempdir().full_path
+    metric_name = 'test_metric|metric_1'
+    exporter = train_utils.BestCheckpointExporter(model_dir, metric_name,
+                                                  'higher')
+    image = tf.zeros(shape=[16, 8, 1])
+    eval_logs = {'test_metric': {'metric_1': 5.0, 'image': image}}
+    exporter.export_best_eval_metric(eval_logs, 100)
+    with tf.io.gfile.GFile(os.path.join(model_dir, 'info.json'),
+                           'rb') as reader:
+      metric = json.loads(reader.read())
+      self.assertAllEqual(
+          metric,
+          {'test_metric': {'metric_1': 5.0}, 'best_ckpt_global_step': 100.0})
+if __name__ == '__main__':
+  tf.test.main()
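
The exporter tests above rely on two pieces of logic: resolving a `|`-separated metric name against nested eval logs, and comparing the new value to the stored one according to `higher` or `lower`. The sketch below reproduces that decision in plain Python, without TensorFlow or Orbit. The helper names `get_leaf` and `is_better` are hypothetical stand-ins, not part of `train_utils`.

```python
from typing import Any, Dict, Optional


def get_leaf(logs: Dict[str, Any], metric_name: str) -> float:
  """Resolves a `|`-separated metric name to a leaf value in nested logs."""
  leaf: Any = logs
  for key in metric_name.split('|'):
    if not isinstance(leaf, dict) or key not in leaf:
      raise KeyError('Path does not exist: %s' % metric_name)
    leaf = leaf[key]
  if isinstance(leaf, dict):
    raise KeyError('Value at %s is not a leaf.' % metric_name)
  return float(leaf)


def is_better(old_logs: Optional[Dict[str, Any]], new_logs: Dict[str, Any],
              metric_name: str, metric_comp: str = 'higher') -> bool:
  """Returns True if new_logs should replace old_logs as the best result."""
  if metric_comp not in ('higher', 'lower'):
    raise ValueError('metric_comp must be one of higher, lower.')
  if old_logs is None:  # Nothing stored yet, so the first result wins.
    return True
  old_value = get_leaf(old_logs, metric_name)
  new_value = get_leaf(new_logs, metric_name)
  if metric_comp == 'higher':
    return new_value > old_value
  return new_value < old_value


best = None
for logs in ({'test_metric': {'metric_1': 5.0}},
             {'test_metric': {'metric_1': 6.0}},
             {'test_metric': {'metric_1': 1.0}}):
  if is_better(best, logs, 'test_metric|metric_1'):
    best = logs
print(best)
```

The loop mirrors the three calls in `test_maybe_export`: the first two results are accepted and the third, lower one is rejected. Ties are rejected as well, since the comparison is strict.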

modeling/official/legacy/README.md ADDED

	@@ -0,0 +1,5 @@

+Models in this `legacy` directory are mainly used for benchmarking.
+Please note that the models in this `legacy` directory are not supported like
+the models in official/nlp and official/vision.

modeling/official/legacy/__init__.py ADDED

	@@ -0,0 +1,14 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

modeling/official/legacy/albert/README.md ADDED

	@@ -0,0 +1,4 @@

+# ALBERT (ALBERT: A Lite BERT for Self-supervised Learning of Language Representations)
+**WARNING**: This directory is deprecated.
+See `nlp/docs/MODEL_GARDEN.md` for the new ALBERT implementation.

modeling/official/legacy/albert/__init__.py ADDED

	@@ -0,0 +1,14 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

modeling/official/legacy/albert/configs.py ADDED

	@@ -0,0 +1,50 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""The ALBERT configurations."""
+import six
+from official.legacy.bert import configs
+class AlbertConfig(configs.BertConfig):
+  """Configuration for `ALBERT`."""
+  def __init__(self, num_hidden_groups=1, inner_group_num=1, **kwargs):
+    """Constructs AlbertConfig.
+    Args:
+      num_hidden_groups: Number of groups for the hidden layers; parameters in
+        the same group are shared. Note that this value and the following
+        'inner_group_num' have to be 1 for now, because all released ALBERT
+        models set them to 1. We may support arbitrary valid values in future.
+      inner_group_num: Number of inner repetitions of attention and ffn.
+      **kwargs: The remaining arguments are the same as above 'BertConfig'.
+    """
+    super(AlbertConfig, self).__init__(**kwargs)
+    # TODO(chendouble): 'inner_group_num' and 'num_hidden_groups' are always 1
+    # in the released ALBERT. Support other values in AlbertEncoder if needed.
+    if inner_group_num != 1 or num_hidden_groups != 1:
+      raise ValueError("We only support 'inner_group_num' and "
+                       "'num_hidden_groups' as 1.")
+  @classmethod
+  def from_dict(cls, json_object):
+    """Constructs an `AlbertConfig` from a Python dictionary of parameters."""
+    config = AlbertConfig(vocab_size=None)
+    for (key, value) in six.iteritems(json_object):
+      config.__dict__[key] = value
+    return config
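Reviewer note on `AlbertConfig`: the following is a minimal, dependency-free sketch of the same pattern. `BertConfigStub` and `AlbertConfigSketch` are hypothetical stand-ins (the real class extends `official.legacy.bert.configs.BertConfig`). It illustrates that the constructor rejects group values other than 1, while `from_dict` copies keys straight into `__dict__` and therefore skips that check.

```python
# Hypothetical stand-in for official.legacy.bert.configs.BertConfig,
# reduced to two fields so the sketch runs without TensorFlow.
class BertConfigStub:
    def __init__(self, vocab_size=None, hidden_size=768):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size


class AlbertConfigSketch(BertConfigStub):
    def __init__(self, num_hidden_groups=1, inner_group_num=1, **kwargs):
        super().__init__(**kwargs)
        # Same guard as in the diff: only the value 1 is accepted.
        if inner_group_num != 1 or num_hidden_groups != 1:
            raise ValueError("We only support 'inner_group_num' and "
                             "'num_hidden_groups' as 1.")

    @classmethod
    def from_dict(cls, json_object):
        config = cls(vocab_size=None)
        # Keys are written directly into __dict__, so the constructor
        # guard above is not re-applied to values loaded this way.
        for key, value in json_object.items():
            config.__dict__[key] = value
        return config


config = AlbertConfigSketch.from_dict(
    {"vocab_size": 30000, "embedding_size": 128, "num_hidden_groups": 2})
print(config.vocab_size, config.embedding_size, config.num_hidden_groups)

try:
    AlbertConfigSketch(num_hidden_groups=2)
    constructor_rejected = False
except ValueError:
    constructor_rejected = True
print(constructor_rejected)
```

If unsupported values in a JSON config should also be rejected, one option would be to validate inside `from_dict` after the loop; whether that is desirable for the legacy code is a judgment call.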

modeling/official/legacy/bert/README.md ADDED

	@@ -0,0 +1,395 @@

+# BERT (Bidirectional Encoder Representations from Transformers)
+**WARNING**: We are in the process of deprecating most of the code in this directory.
+Please see
+[this link](../g3doc/tutorials/bert_new.md)
+for the new tutorial and use the new code in `nlp/modeling`. This README is
+still correct for this legacy implementation.
+The academic paper which describes BERT in detail and provides full results on a
+number of tasks can be found here: https://arxiv.org/abs/1810.04805.
+This repository contains a TensorFlow 2.x implementation of BERT.
+## Contents
+  * [Contents](#contents)
+  * [Pre-trained Models](#pre-trained-models)
+    * [Restoring from Checkpoints](#restoring-from-checkpoints)
+  * [Set Up](#set-up)
+  * [Process Datasets](#process-datasets)
+  * [Fine-tuning with BERT](#fine-tuning-with-bert)
+    * [Cloud GPUs and TPUs](#cloud-gpus-and-tpus)
+    * [Sentence and Sentence-pair Classification Tasks](#sentence-and-sentence-pair-classification-tasks)
+    * [SQuAD 1.1](#squad-11)
+## Pre-trained Models
+We released both checkpoints and tf.hub modules as the pretrained models for
+fine-tuning. They are TF 2.x compatible and were converted from the checkpoints
+released in the official TF 1.x BERT repository
+[google-research/bert](https://github.com/google-research/bert)
+to stay consistent with the BERT paper.
+### Access to Pretrained Checkpoints
+Pretrained checkpoints can be found in the following links:
+**Note: We have switched BERT implementation
+to use Keras functional-style networks in [nlp/modeling](../modeling).
+The new checkpoints are:**
+*   **[`BERT-Large, Uncased (Whole Word Masking)`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/wwm_uncased_L-24_H-1024_A-16.tar.gz)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Large, Cased (Whole Word Masking)`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/wwm_cased_L-24_H-1024_A-16.tar.gz)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Base, Uncased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12.tar.gz)**:
+    12-layer, 768-hidden, 12-heads, 110M parameters
+*   **[`BERT-Large, Uncased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16.tar.gz)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Base, Cased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/cased_L-12_H-768_A-12.tar.gz)**:
+    12-layer, 768-hidden, 12-heads, 110M parameters
+*   **[`BERT-Large, Cased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/cased_L-24_H-1024_A-16.tar.gz)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Base, Multilingual Cased`](https://storage.googleapis.com/cloud-tpu-checkpoints/bert/keras_bert/multi_cased_L-12_H-768_A-12.tar.gz)**:
+    104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+We recommend hosting checkpoints on Google Cloud Storage buckets when you use
+Cloud GPU/TPU.
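The checkpoint archive names above encode the model size: in `uncased_L-24_H-1024_A-16`, the `L`, `H`, and `A` fields line up with the listed layer count, hidden size, and number of attention heads. A small sketch of reading those fields from a name follows; `parse_bert_dir_name` is a hypothetical helper, not part of this repository.

```python
import re


def parse_bert_dir_name(name):
    """Extracts model dimensions from a name such as 'uncased_L-24_H-1024_A-16'."""
    match = re.search(r"L-(\d+)_H-(\d+)_A-(\d+)", name)
    if match is None:
        raise ValueError(f"Unrecognized checkpoint name: {name}")
    layers, hidden, heads = (int(group) for group in match.groups())
    return {"num_layers": layers, "hidden_size": hidden, "num_heads": heads}


print(parse_bert_dir_name("wwm_uncased_L-24_H-1024_A-16"))
print(parse_bert_dir_name("multi_cased_L-12_H-768_A-12"))
```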
+### Restoring from Checkpoints
+`tf.train.Checkpoint` is used to manage model checkpoints in TF 2. To restore
+weights from provided pre-trained checkpoints, you can use the following code:
+```python
+import tensorflow as tf
+
+init_checkpoint = 'the pretrained model checkpoint path.'
+model = tf.keras.Model()  # Placeholder for the BERT pre-trained model used as feature extractor.
+checkpoint = tf.train.Checkpoint(model=model)
+checkpoint.restore(init_checkpoint)
+```
+Checkpoints featuring native serialized Keras models
+(i.e. model.load()/load_weights()) will be available soon.
+### Access to Pretrained Hub Modules
+Pretrained tf.hub modules in TF 2.x SavedModel format can be found in the
+following links:
+*   **[`BERT-Large, Uncased (Whole Word Masking)`](https://tfhub.dev/tensorflow/bert_en_wwm_uncased_L-24_H-1024_A-16/)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Large, Cased (Whole Word Masking)`](https://tfhub.dev/tensorflow/bert_en_wwm_cased_L-24_H-1024_A-16/)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Base, Uncased`](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/)**:
+    12-layer, 768-hidden, 12-heads, 110M parameters
+*   **[`BERT-Large, Uncased`](https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Base, Cased`](https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/)**:
+    12-layer, 768-hidden, 12-heads, 110M parameters
+*   **[`BERT-Large, Cased`](https://tfhub.dev/tensorflow/bert_en_cased_L-24_H-1024_A-16/)**:
+    24-layer, 1024-hidden, 16-heads, 340M parameters
+*   **[`BERT-Base, Multilingual Cased`](https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/)**:
+    104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+*   **[`BERT-Base, Chinese`](https://tfhub.dev/tensorflow/bert_zh_L-12_H-768_A-12/)**:
+    Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads,
+    110M parameters
+## Set Up
+```shell
+export PYTHONPATH="$PYTHONPATH:/path/to/models"
+```
+Install `tf-nightly` to get the latest updates:
+```shell
+pip install tf-nightly-gpu
+```
+With TPU, GPU support is not necessary. First, you need to create a `tf-nightly`
+TPU with the [ctpu tool](https://github.com/tensorflow/tpu/tree/master/tools/ctpu):
+```shell
+ctpu up --name <instance name> --tf-version="nightly"
+```
+Second, you need to install TF 2 `tf-nightly` on your VM:
+```shell
+pip install tf-nightly
+```
+## Process Datasets
+### Pre-training
+No changes are needed to generate pre-training data. Use the script
+[`../data/create_pretraining_data.py`](../data/create_pretraining_data.py),
+which is essentially branched from the [BERT research repo](https://github.com/google-research/bert)
+and adapted to TF2 symbols and Python 3, to produce the processed pre-training
+data.
+Running the script requires input and output locations as well as a vocab file. Note that `max_seq_length` must match the sequence length parameter you specify when you run pre-training.
+Example shell script to call `create_pretraining_data.py`:
+```
+export WORKING_DIR='local disk or cloud location'
+export BERT_DIR='local disk or cloud location'
+python models/official/nlp/data/create_pretraining_data.py \
+  --input_file=$WORKING_DIR/input/input.txt \
+  --output_file=$WORKING_DIR/output/tf_examples.tfrecord \
+  --vocab_file=$BERT_DIR/wwm_uncased_L-24_H-1024_A-16/vocab.txt \
+  --do_lower_case=True \
+  --max_seq_length=512 \
+  --max_predictions_per_seq=76 \
+  --masked_lm_prob=0.15 \
+  --random_seed=12345 \
+  --dupe_factor=5
+```
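The flag values in the command above are related: with `--max_seq_length=512` and `--masked_lm_prob=0.15`, the product is about 76.8, which lines up with `--max_predictions_per_seq=76`. The rule of thumb below (truncate the product) is an assumption inferred from those numbers, not something this README states; `suggested_max_predictions` is a hypothetical helper.

```python
def suggested_max_predictions(max_seq_length, masked_lm_prob):
    """Rule-of-thumb cap on masked positions per sequence (assumed, see note above)."""
    return int(max_seq_length * masked_lm_prob)


print(suggested_max_predictions(512, 0.15))
```

Whatever value you choose, the same `max_predictions_per_seq` has to be used again when the pre-training model is built, since it fixes the shape of the masked-LM inputs.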
+### Fine-tuning
+To prepare the fine-tuning data for final model training, use the
+[`../data/create_finetuning_data.py`](../data/create_finetuning_data.py) script.
+The resulting datasets in `tf_record` format and the training meta data are
+later passed to the training or evaluation scripts. The task-specific arguments
+are described in the following sections:
+* GLUE
+Users can download the
+[GLUE data](https://gluebenchmark.com/tasks) by running
+[this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e)
+and unpack it to some directory `$GLUE_DIR`.
+Users can also download a [Pretrained Checkpoint](#access-to-pretrained-checkpoints) and place it in a directory `$BERT_DIR` instead of using checkpoints on Google Cloud Storage.
+```shell
+export GLUE_DIR=~/glue
+export BERT_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
+export TASK_NAME=MNLI
+export OUTPUT_DIR=gs://some_bucket/datasets
+python ../data/create_finetuning_data.py \
+ --input_data_dir=${GLUE_DIR}/${TASK_NAME}/ \
+ --vocab_file=${BERT_DIR}/vocab.txt \
+ --train_data_output_path=${OUTPUT_DIR}/${TASK_NAME}_train.tf_record \
+ --eval_data_output_path=${OUTPUT_DIR}/${TASK_NAME}_eval.tf_record \
+ --meta_data_file_path=${OUTPUT_DIR}/${TASK_NAME}_meta_data \
+ --fine_tuning_task_type=classification --max_seq_length=128 \
+ --classification_task_name=${TASK_NAME}
+```
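The three output paths in the command above follow one naming pattern, and the same names reappear later as `--train_data_path`, `--eval_data_path`, and `--input_meta_data_path` for `run_classifier.py`. A small sketch that derives them in one place is shown here; `finetuning_data_paths` is a hypothetical helper and the bucket name is the placeholder used in this README.

```python
def finetuning_data_paths(output_dir, task_name):
    """Builds the output paths used by the GLUE data command above."""
    prefix = f"{output_dir.rstrip('/')}/{task_name}"
    return {
        "train_data_output_path": f"{prefix}_train.tf_record",
        "eval_data_output_path": f"{prefix}_eval.tf_record",
        "meta_data_file_path": f"{prefix}_meta_data",
    }


paths = finetuning_data_paths("gs://some_bucket/datasets", "MNLI")
for flag, value in paths.items():
    print(f"--{flag}={value}")
```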
+* SQuAD
+The [SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/) contains
+detailed information about the SQuAD datasets and evaluation.
+The necessary files can be found here:
+*   [train-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)
+*   [dev-v1.1.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)
+*   [evaluate-v1.1.py](https://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py)
+*   [train-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json)
+*   [dev-v2.0.json](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json)
+*   [evaluate-v2.0.py](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/)
+```shell
+export SQUAD_DIR=~/squad
+export SQUAD_VERSION=v1.1
+export BERT_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
+export OUTPUT_DIR=gs://some_bucket/datasets
+python ../data/create_finetuning_data.py \
+ --squad_data_file=${SQUAD_DIR}/train-${SQUAD_VERSION}.json \
+ --vocab_file=${BERT_DIR}/vocab.txt \
+ --train_data_output_path=${OUTPUT_DIR}/squad_${SQUAD_VERSION}_train.tf_record \
+ --meta_data_file_path=${OUTPUT_DIR}/squad_${SQUAD_VERSION}_meta_data \
+ --fine_tuning_task_type=squad --max_seq_length=384
+```
+Note: To create fine-tuning data with SQuAD 2.0, add the flag `--version_2_with_negative=True`.
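The difference between the two dataset versions is that SQuAD 2.0 adds questions without an answer, marked with `is_impossible` in the JSON. The sketch below walks the `data` / `paragraphs` / `qas` nesting of a SQuAD-style file and counts both kinds; the toy record is invented for illustration and `count_questions` is a hypothetical helper.

```python
def count_questions(squad_json):
    """Returns (answerable, unanswerable) question counts for a SQuAD-style dict."""
    answerable = 0
    unanswerable = 0
    for article in squad_json["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # v1.1 files have no 'is_impossible' key, so default to False.
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


# Invented toy example in the v2.0 layout.
toy = {
    "version": "v2.0",
    "data": [{
        "title": "Toy",
        "paragraphs": [{
            "context": "BERT was released in 2018.",
            "qas": [
                {"id": "q1", "question": "When was BERT released?",
                 "is_impossible": False,
                 "answers": [{"text": "2018", "answer_start": 21}]},
                {"id": "q2", "question": "Who funded it?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }],
}

print(count_questions(toy))
```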
+## Fine-tuning with BERT
+### Cloud GPUs and TPUs
+* Cloud Storage
+The unzipped pre-trained model files can also be found in the Google Cloud
+Storage folder `gs://cloud-tpu-checkpoints/bert/keras_bert`. For example:
+```shell
+export BERT_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
+export MODEL_DIR=gs://some_bucket/my_output_dir
+```
+Currently, users can access `tf-nightly` TPUs, and the following TPU
+script should run with `tf-nightly`.
+* GPU -> TPU
+Just add the following flags to `run_classifier.py` or `run_squad.py`:
+```shell
+  --distribution_strategy=tpu
+  --tpu=grpc://${TPU_IP_ADDRESS}:8470
+```
+### Sentence and Sentence-pair Classification Tasks
+This example code fine-tunes `BERT-Large` on the Microsoft Research Paraphrase
+Corpus (MRPC), which contains only 3,600 examples, so fine-tuning takes a
+few minutes on most GPUs.
+We use the `BERT-Large` (uncased_L-24_H-1024_A-16) as an example throughout the
+workflow.
+For GPU memory of 16GB or smaller, you may try to use `BERT-Base`
+(uncased_L-12_H-768_A-12).
+```shell
+export BERT_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
+export MODEL_DIR=gs://some_bucket/my_output_dir
+export GLUE_DIR=gs://some_bucket/datasets
+export TASK=MRPC
+python run_classifier.py \
+  --mode='train_and_eval' \
+  --input_meta_data_path=${GLUE_DIR}/${TASK}_meta_data \
+  --train_data_path=${GLUE_DIR}/${TASK}_train.tf_record \
+  --eval_data_path=${GLUE_DIR}/${TASK}_eval.tf_record \
+  --bert_config_file=${BERT_DIR}/bert_config.json \
+  --init_checkpoint=${BERT_DIR}/bert_model.ckpt \
+  --train_batch_size=4 \
+  --eval_batch_size=4 \
+  --steps_per_loop=1 \
+  --learning_rate=2e-5 \
+  --num_train_epochs=3 \
+  --model_dir=${MODEL_DIR} \
+  --distribution_strategy=mirrored
+```
+Alternatively, instead of specifying `init_checkpoint`, you can specify
+`hub_module_url` to employ a pre-trained BERT hub module, e.g.,
+`--hub_module_url=https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1`.
+After training a model, to get predictions from the classifier, set
+`--mode=predict` and pass the test set tfrecords to `--eval_data_path`.
+The output is written to a file called `test_results.tsv` in the output folder.
+Each line contains the output for one sample; the columns are the class
+probabilities.
+```shell
+python run_classifier.py \
+  --mode='predict' \
+  --input_meta_data_path=${GLUE_DIR}/${TASK}_meta_data \
+  --eval_data_path=${GLUE_DIR}/${TASK}_eval.tf_record \
+  --bert_config_file=${BERT_DIR}/bert_config.json \
+  --eval_batch_size=4 \
+  --model_dir=${MODEL_DIR} \
+  --distribution_strategy=mirrored
+```
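Since each line of `test_results.tsv` holds the class probabilities for one sample, the predicted label is the index of the largest column. A minimal sketch of that post-processing step follows. It assumes tab-separated floats with no header row; the sample values are invented, and `predicted_classes` is a hypothetical helper.

```python
import csv
import io


def predicted_classes(tsv_text):
    """Returns the argmax column index for every non-empty TSV row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    predictions = []
    for row in reader:
        if not row:
            continue
        probs = [float(value) for value in row]
        predictions.append(max(range(len(probs)), key=probs.__getitem__))
    return predictions


# Invented two-class output for two samples.
sample = "0.91\t0.09\n0.20\t0.80\n"
print(predicted_classes(sample))
```

To process a real file, read it with `open(path).read()` (or `tf.io.gfile` for `gs://` paths) and pass the text to the same function.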
+To use TPU, you only need to switch the distribution strategy type to `tpu` with TPU
+information and use remote storage for model checkpoints.
+```shell
+export BERT_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
+export TPU_IP_ADDRESS='???'
+export MODEL_DIR=gs://some_bucket/my_output_dir
+export GLUE_DIR=gs://some_bucket/datasets
+export TASK=MRPC
+python run_classifier.py \
+  --mode='train_and_eval' \
+  --input_meta_data_path=${GLUE_DIR}/${TASK}_meta_data \
+  --train_data_path=${GLUE_DIR}/${TASK}_train.tf_record \
+  --eval_data_path=${GLUE_DIR}/${TASK}_eval.tf_record \
+  --bert_config_file=${BERT_DIR}/bert_config.json \
+  --init_checkpoint=${BERT_DIR}/bert_model.ckpt \
+  --train_batch_size=32 \
+  --eval_batch_size=32 \
+  --steps_per_loop=1000 \
+  --learning_rate=2e-5 \
+  --num_train_epochs=3 \
+  --model_dir=${MODEL_DIR} \
+  --distribution_strategy=tpu \
+  --tpu=grpc://${TPU_IP_ADDRESS}:8470
+```
+Note that we specify `steps_per_loop=1000` for TPU, because running a loop of
+training steps inside a `tf.function` can significantly increase TPU utilization.
+Callbacks are not called inside the loop.
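To make the effect of `steps_per_loop` concrete, here is a simplified model of the control flow: the device runs a block of steps without returning to the host, and host-side callbacks can only run between blocks. This is a sketch of the idea, not the actual training loop; `callback_steps` is a hypothetical helper.

```python
def callback_steps(total_steps, steps_per_loop):
    """Returns the global steps at which host-side callbacks get a chance to run."""
    steps = []
    current = 0
    while current < total_steps:
        # One inner loop: these steps run back-to-back on the device.
        current += min(steps_per_loop, total_steps - current)
        # Control returns to the host only here.
        steps.append(current)
    return steps


print(callback_steps(2500, 1000))
print(callback_steps(3, 1))
```

With `steps_per_loop=1` (the GPU example) the host regains control after every step; with `steps_per_loop=1000` it does so only every thousand steps, which is why per-step callbacks are not observed.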
+### SQuAD 1.1
+The Stanford Question Answering Dataset (SQuAD) is a popular question answering
+benchmark dataset. See more on [SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/).
+We use the `BERT-Large` (uncased_L-24_H-1024_A-16) as an example throughout the
+workflow.
+For GPU memory of 16GB or smaller, you may try to use `BERT-Base`
+(uncased_L-12_H-768_A-12).
+```shell
+export BERT_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
+export SQUAD_DIR=gs://some_bucket/datasets
+export MODEL_DIR=gs://some_bucket/my_output_dir
+export SQUAD_VERSION=v1.1
+python run_squad.py \
+  --input_meta_data_path=${SQUAD_DIR}/squad_${SQUAD_VERSION}_meta_data \
+  --train_data_path=${SQUAD_DIR}/squad_${SQUAD_VERSION}_train.tf_record \
+  --predict_file=${SQUAD_DIR}/dev-v1.1.json \
+  --vocab_file=${BERT_DIR}/vocab.txt \
+  --bert_config_file=${BERT_DIR}/bert_config.json \
+  --init_checkpoint=${BERT_DIR}/bert_model.ckpt \
+  --train_batch_size=4 \
+  --predict_batch_size=4 \
+  --learning_rate=8e-5 \
+  --num_train_epochs=2 \
+  --model_dir=${MODEL_DIR} \
+  --distribution_strategy=mirrored
+```
+Similarly, you can replace the `init_checkpoint` flag with `hub_module_url` to
+specify a hub module path.
+`run_squad.py` writes the prediction for `--predict_file` by default. If you set
+`--mode=predict` and provide the SQuAD test data, the script generates
+the prediction json file.
+To use TPU, you need to switch the distribution strategy type to `tpu` with TPU
+information.
+```shell
+export BERT_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
+export TPU_IP_ADDRESS='???'
+export MODEL_DIR=gs://some_bucket/my_output_dir
+export SQUAD_DIR=gs://some_bucket/datasets
+export SQUAD_VERSION=v1.1
+python run_squad.py \
+  --input_meta_data_path=${SQUAD_DIR}/squad_${SQUAD_VERSION}_meta_data \
+  --train_data_path=${SQUAD_DIR}/squad_${SQUAD_VERSION}_train.tf_record \
+  --predict_file=${SQUAD_DIR}/dev-v1.1.json \
+  --vocab_file=${BERT_DIR}/vocab.txt \
+  --bert_config_file=${BERT_DIR}/bert_config.json \
+  --init_checkpoint=${BERT_DIR}/bert_model.ckpt \
+  --train_batch_size=32 \
+  --learning_rate=8e-5 \
+  --num_train_epochs=2 \
+  --model_dir=${MODEL_DIR} \
+  --distribution_strategy=tpu \
+  --tpu=grpc://${TPU_IP_ADDRESS}:8470
+```
+The dev set predictions will be saved into a file called predictions.json in the
+model_dir:
+```shell
+python $SQUAD_DIR/evaluate-v1.1.py $SQUAD_DIR/dev-v1.1.json ./squad/predictions.json
+```
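The evaluation script reports exact match and F1 over answer strings. The sketch below is modeled on how SQuAD v1.1 scoring is commonly implemented (lowercase, strip punctuation and articles, then compare tokens); it is an approximation for illustration, not the official `evaluate-v1.1.py`, and the example strings are invented.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercases, removes punctuation and articles, and collapses whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts tokens shared by both answers.
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the Eiffel Tower.", "Eiffel Tower"))
print(round(f1_score("Eiffel Tower in Paris", "the Eiffel Tower"), 4))
```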

modeling/official/legacy/bert/__init__.py ADDED

	@@ -0,0 +1,15 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

modeling/official/legacy/bert/bert_cloud_tpu.md ADDED

	@@ -0,0 +1,110 @@

+# BERT Fine-Tuning with Cloud TPU: Sentence and Sentence-Pair Classification Tasks (TF 2.1)
+This tutorial shows you how to train the Bidirectional Encoder Representations from Transformers (BERT) model on Cloud TPU.
+## Set up Cloud Storage and Compute Engine VM
+1. [Open a cloud shell window](https://console.cloud.google.com/?cloudshell=true&_ga=2.11844148.-1612541229.1552429951)
+2. Create a variable for the project's id:
+```
+export PROJECT_ID=your-project_id
+```
+3. Configure the `gcloud` command-line tool to use the project where you want to create the Cloud TPU.
+```
+gcloud config set project ${PROJECT_ID}
+```
+4. Create a Cloud Storage bucket using the following command:
+```
+gsutil mb -p ${PROJECT_ID} -c standard -l europe-west4 -b on gs://your-bucket-name
+```
+This Cloud Storage bucket stores the data you use to train your model and the training results.
+5. Launch a Compute Engine VM and Cloud TPU using the ctpu up command.
+```
+ctpu up --tpu-size=v3-8 \
+ --machine-type=n1-standard-8 \
+ --zone=europe-west4-a \
+ --tf-version=2.1 [optional flags: --project, --name]
+```
+6. The configuration you specified appears. Enter y to approve or n to cancel.
+7. When the ctpu up command has finished executing, verify that your shell prompt has changed from username@project to username@tpuname. This change shows that you are now logged into your Compute Engine VM.
+```
+gcloud compute ssh vm-name --zone=europe-west4-a
+(vm)$ export TPU_NAME=vm-name
+```
+As you continue these instructions, run each command that begins with `(vm)$` in your VM session window.
+## Prepare the Dataset
+1. From your Compute Engine virtual machine (VM), install requirements.txt.
+```
+(vm)$ cd /usr/share/models
+(vm)$ sudo pip3 install -r official/requirements.txt
+```
+2. Optional: download download_glue_data.py
+This tutorial uses the General Language Understanding Evaluation (GLUE) benchmark to evaluate and analyze the performance of the model. The GLUE data is provided for this tutorial at gs://cloud-tpu-checkpoints/bert/classification.
+## Define parameter values
+Next, define several parameter values that are required when you train and evaluate your model:
+```
+(vm)$ export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"
+(vm)$ export STORAGE_BUCKET=gs://your-bucket-name
+(vm)$ export BERT_BASE_DIR=gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-24_H-1024_A-16
+(vm)$ export MODEL_DIR=${STORAGE_BUCKET}/bert-output
+(vm)$ export GLUE_DIR=gs://cloud-tpu-checkpoints/bert/classification
+(vm)$ export TASK=mnli
+```
+## Train the model
+From your Compute Engine VM, run the following command.
+```
+(vm)$ python3 official/nlp/bert/run_classifier.py \
+  --mode='train_and_eval' \
+  --input_meta_data_path=${GLUE_DIR}/${TASK}_meta_data \
+  --train_data_path=${GLUE_DIR}/${TASK}_train.tf_record \
+  --eval_data_path=${GLUE_DIR}/${TASK}_eval.tf_record \
+  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
+  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
+  --train_batch_size=32 \
+  --eval_batch_size=32 \
+  --learning_rate=2e-5 \
+  --num_train_epochs=3 \
+  --model_dir=${MODEL_DIR} \
+  --distribution_strategy=tpu \
+  --tpu=${TPU_NAME}
+```
+## Verify your results
+The training takes approximately 1 hour on a v3-8 TPU. When the script completes, you should see results similar to the following:
+```
+Training Summary:
+{'train_loss': 0.28142181038856506,
+'last_train_metrics': 0.9467429518699646,
+'eval_metrics': 0.8599063158035278,
+'total_training_steps': 36813}
+```
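The `total_training_steps` value in the summary can be cross-checked against the flags used for training (`--train_batch_size=32`, `--num_train_epochs=3`). The sketch below assumes the MNLI training set has 392,702 examples and that steps per epoch are rounded down; neither figure is stated in this tutorial, so treat both as assumptions. `total_training_steps` here is a hypothetical helper.

```python
def total_training_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch are floored, then multiplied by the epoch count."""
    steps_per_epoch = num_examples // batch_size
    return steps_per_epoch * num_epochs


# 392702 is the assumed MNLI training-set size (not given in the tutorial).
print(total_training_steps(392702, 32, 3))
```

Under those assumptions the result is 36813, the same number reported in the training summary.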
+## Clean up
+To avoid incurring charges to your GCP account for the resources used in this topic:
+1. Disconnect from the Compute Engine VM:
+```
+(vm)$ exit
+```
+2. In your Cloud Shell, run ctpu delete with the --zone flag you used when you set up the Cloud TPU to delete your Compute Engine VM and your Cloud TPU:
+```
+$ ctpu delete --zone=your-zone
+```
+3. Run ctpu status specifying your zone to make sure you have no instances allocated to avoid unnecessary charges for TPU usage. The deletion might take several minutes. A response like the one below indicates there are no more allocated instances:
+```
+$ ctpu status --zone=your-zone
+```
+4. Run gsutil as shown, replacing your-bucket with the name of the Cloud Storage bucket you created for this tutorial:
+```
+$ gsutil rm -r gs://your-bucket
+```

modeling/official/legacy/bert/bert_models.py ADDED

	@@ -0,0 +1,365 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""BERT models that are compatible with TF 2.0."""
+import gin
+import tensorflow as tf, tf_keras
+import tensorflow_hub as hub
+from official.legacy.albert import configs as albert_configs
+from official.legacy.bert import configs
+from official.modeling import tf_utils
+from official.nlp.modeling import models
+from official.nlp.modeling import networks
+class BertPretrainLossAndMetricLayer(tf_keras.layers.Layer):
+  """Returns layer that computes custom loss and metrics for pretraining."""
+  def __init__(self, vocab_size, **kwargs):
+    super(BertPretrainLossAndMetricLayer, self).__init__(**kwargs)
+    self._vocab_size = vocab_size
+    self.config = {
+        'vocab_size': vocab_size,
+    }
+  def _add_metrics(self, lm_output, lm_labels, lm_label_weights,
+                   lm_example_loss, sentence_output, sentence_labels,
+                   next_sentence_loss):
+    """Adds metrics."""
+    masked_lm_accuracy = tf_keras.metrics.sparse_categorical_accuracy(
+        lm_labels, lm_output)
+    numerator = tf.reduce_sum(masked_lm_accuracy * lm_label_weights)
+    denominator = tf.reduce_sum(lm_label_weights) + 1e-5
+    masked_lm_accuracy = numerator / denominator
+    self.add_metric(
+        masked_lm_accuracy, name='masked_lm_accuracy', aggregation='mean')
+    self.add_metric(lm_example_loss, name='lm_example_loss', aggregation='mean')
+    if sentence_labels is not None:
+      next_sentence_accuracy = tf_keras.metrics.sparse_categorical_accuracy(
+          sentence_labels, sentence_output)
+      self.add_metric(
+          next_sentence_accuracy,
+          name='next_sentence_accuracy',
+          aggregation='mean')
+    if next_sentence_loss is not None:
+      self.add_metric(
+          next_sentence_loss, name='next_sentence_loss', aggregation='mean')
+  def call(self,
+           lm_output_logits,
+           sentence_output_logits,
+           lm_label_ids,
+           lm_label_weights,
+           sentence_labels=None):
+    """Implements call() for the layer."""
+    lm_label_weights = tf.cast(lm_label_weights, tf.float32)
+    lm_output_logits = tf.cast(lm_output_logits, tf.float32)
+    lm_prediction_losses = tf_keras.losses.sparse_categorical_crossentropy(
+        lm_label_ids, lm_output_logits, from_logits=True)
+    lm_numerator_loss = tf.reduce_sum(lm_prediction_losses * lm_label_weights)
+    lm_denominator_loss = tf.reduce_sum(lm_label_weights)
+    mask_label_loss = tf.math.divide_no_nan(lm_numerator_loss,
+                                            lm_denominator_loss)
+    if sentence_labels is not None:
+      sentence_output_logits = tf.cast(sentence_output_logits, tf.float32)
+      sentence_loss = tf_keras.losses.sparse_categorical_crossentropy(
+          sentence_labels, sentence_output_logits, from_logits=True)
+      sentence_loss = tf.reduce_mean(sentence_loss)
+      loss = mask_label_loss + sentence_loss
+    else:
+      sentence_loss = None
+      loss = mask_label_loss
+    batch_shape = tf.slice(tf.shape(lm_label_ids), [0], [1])
+    # TODO(hongkuny): Avoids the hack and switches add_loss.
+    final_loss = tf.fill(batch_shape, loss)
+    self._add_metrics(lm_output_logits, lm_label_ids, lm_label_weights,
+                      mask_label_loss, sentence_output_logits, sentence_labels,
+                      sentence_loss)
+    return final_loss
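Reviewer note on the masked-LM loss in `call()`: the per-token losses are averaged with `lm_label_weights`, so padded prediction slots (weight 0) do not contribute, and `tf.math.divide_no_nan` returns 0 when every weight is 0. A dependency-free sketch of that arithmetic is shown here; `masked_lm_loss` is a hypothetical plain-Python helper and the numbers are invented.

```python
def masked_lm_loss(per_token_losses, label_weights):
    """Weighted mean of token losses; returns 0.0 when all weights are zero."""
    numerator = sum(loss * weight
                    for loss, weight in zip(per_token_losses, label_weights))
    denominator = sum(label_weights)
    # Mirrors divide_no_nan: a zero denominator yields 0 instead of NaN.
    return numerator / denominator if denominator != 0 else 0.0


# Third slot is padding (weight 0), so its loss of 9.0 is ignored.
print(masked_lm_loss([2.0, 4.0, 9.0], [1.0, 1.0, 0.0]))
print(masked_lm_loss([2.0, 4.0], [0.0, 0.0]))
```

Note that the accuracy metric in `_add_metrics` guards the same division differently, by adding `1e-5` to the denominator.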
+@gin.configurable
+def get_transformer_encoder(bert_config,
+                            sequence_length=None,
+                            transformer_encoder_cls=None,
+                            output_range=None):
+  """Gets a 'TransformerEncoder' object.
+  Args:
+    bert_config: A 'modeling.BertConfig' or 'modeling.AlbertConfig' object.
+    sequence_length: [Deprecated].
+    transformer_encoder_cls: An EncoderScaffold class. If it is None, uses the
+      default BERT encoder implementation.
+    output_range: the sequence output range, [0, output_range). Default setting
+      is to return the entire sequence output.
+  Returns:
+    An encoder object.
+  """
+  del sequence_length
+  if transformer_encoder_cls is not None:
+    # TODO(hongkuny): evaluate if it is better to put cfg definition in gin.
+    embedding_cfg = dict(
+        vocab_size=bert_config.vocab_size,
+        type_vocab_size=bert_config.type_vocab_size,
+        hidden_size=bert_config.hidden_size,
+        max_seq_length=bert_config.max_position_embeddings,
+        initializer=tf_keras.initializers.TruncatedNormal(
+            stddev=bert_config.initializer_range),
+        dropout_rate=bert_config.hidden_dropout_prob,
+    )
+    hidden_cfg = dict(
+        num_attention_heads=bert_config.num_attention_heads,
+        intermediate_size=bert_config.intermediate_size,
+        intermediate_activation=tf_utils.get_activation(bert_config.hidden_act),
+        dropout_rate=bert_config.hidden_dropout_prob,
+        attention_dropout_rate=bert_config.attention_probs_dropout_prob,
+        kernel_initializer=tf_keras.initializers.TruncatedNormal(
+            stddev=bert_config.initializer_range),
+    )
+    kwargs = dict(
+        embedding_cfg=embedding_cfg,
+        hidden_cfg=hidden_cfg,
+        num_hidden_instances=bert_config.num_hidden_layers,
+        pooled_output_dim=bert_config.hidden_size,
+        pooler_layer_initializer=tf_keras.initializers.TruncatedNormal(
+            stddev=bert_config.initializer_range))
+    # Relies on gin configuration to define the Transformer encoder arguments.
+    return transformer_encoder_cls(**kwargs)
+  kwargs = dict(
+      vocab_size=bert_config.vocab_size,
+      hidden_size=bert_config.hidden_size,
+      num_layers=bert_config.num_hidden_layers,
+      num_attention_heads=bert_config.num_attention_heads,
+      intermediate_size=bert_config.intermediate_size,
+      activation=tf_utils.get_activation(bert_config.hidden_act),
+      dropout_rate=bert_config.hidden_dropout_prob,
+      attention_dropout_rate=bert_config.attention_probs_dropout_prob,
+      max_sequence_length=bert_config.max_position_embeddings,
+      type_vocab_size=bert_config.type_vocab_size,
+      embedding_width=bert_config.embedding_size,
+      initializer=tf_keras.initializers.TruncatedNormal(
+          stddev=bert_config.initializer_range))
+  if isinstance(bert_config, albert_configs.AlbertConfig):
+    return networks.AlbertEncoder(**kwargs)
+  else:
+    assert isinstance(bert_config, configs.BertConfig)
+    kwargs['output_range'] = output_range
+    return networks.BertEncoder(**kwargs)
+def pretrain_model(bert_config,
+                   seq_length,
+                   max_predictions_per_seq,
+                   initializer=None,
+                   use_next_sentence_label=True,
+                   return_core_pretrainer_model=False):
+  """Returns model to be used for pre-training.
+  Args:
+      bert_config: Configuration that defines the core BERT model.
+      seq_length: Maximum sequence length of the training data.
+      max_predictions_per_seq: Maximum number of tokens in sequence to mask out
+        and use for pretraining.
+      initializer: Initializer for weights in BertPretrainer.
+      use_next_sentence_label: Whether to use the next sentence label.
+      return_core_pretrainer_model: Whether to also return the `BertPretrainer`
+        object.
+  Returns:
+      A Tuple of (1) Pretraining model, (2) core BERT submodel from which to
+      save weights after pretraining, and (3) optional core `BertPretrainer`
+      object if argument `return_core_pretrainer_model` is True.
+  """
+  input_word_ids = tf_keras.layers.Input(
+      shape=(seq_length,), name='input_word_ids', dtype=tf.int32)
+  input_mask = tf_keras.layers.Input(
+      shape=(seq_length,), name='input_mask', dtype=tf.int32)
+  input_type_ids = tf_keras.layers.Input(
+      shape=(seq_length,), name='input_type_ids', dtype=tf.int32)
+  masked_lm_positions = tf_keras.layers.Input(
+      shape=(max_predictions_per_seq,),
+      name='masked_lm_positions',
+      dtype=tf.int32)
+  masked_lm_ids = tf_keras.layers.Input(
+      shape=(max_predictions_per_seq,), name='masked_lm_ids', dtype=tf.int32)
+  masked_lm_weights = tf_keras.layers.Input(
+      shape=(max_predictions_per_seq,),
+      name='masked_lm_weights',
+      dtype=tf.int32)
+  if use_next_sentence_label:
+    next_sentence_labels = tf_keras.layers.Input(
+        shape=(1,), name='next_sentence_labels', dtype=tf.int32)
+  else:
+    next_sentence_labels = None
+  transformer_encoder = get_transformer_encoder(bert_config, seq_length)
+  if initializer is None:
+    initializer = tf_keras.initializers.TruncatedNormal(
+        stddev=bert_config.initializer_range)
+  pretrainer_model = models.BertPretrainer(
+      network=transformer_encoder,
+      embedding_table=transformer_encoder.get_embedding_table(),
+      num_classes=2,  # The next sentence prediction label has two classes.
+      activation=tf_utils.get_activation(bert_config.hidden_act),
+      num_token_predictions=max_predictions_per_seq,
+      initializer=initializer,
+      output='logits')
+  outputs = pretrainer_model(
+      [input_word_ids, input_mask, input_type_ids, masked_lm_positions])
+  lm_output = outputs['masked_lm']
+  sentence_output = outputs['classification']
+  pretrain_loss_layer = BertPretrainLossAndMetricLayer(
+      vocab_size=bert_config.vocab_size)
+  output_loss = pretrain_loss_layer(lm_output, sentence_output, masked_lm_ids,
+                                    masked_lm_weights, next_sentence_labels)
+  inputs = {
+      'input_word_ids': input_word_ids,
+      'input_mask': input_mask,
+      'input_type_ids': input_type_ids,
+      'masked_lm_positions': masked_lm_positions,
+      'masked_lm_ids': masked_lm_ids,
+      'masked_lm_weights': masked_lm_weights,
+  }
+  if use_next_sentence_label:
+    inputs['next_sentence_labels'] = next_sentence_labels
+  keras_model = tf_keras.Model(inputs=inputs, outputs=output_loss)
+  if return_core_pretrainer_model:
+    return keras_model, transformer_encoder, pretrainer_model
+  else:
+    return keras_model, transformer_encoder
+def squad_model(bert_config,
+                max_seq_length,
+                initializer=None,
+                hub_module_url=None,
+                hub_module_trainable=True):
+  """Returns BERT Squad model along with core BERT model to import weights.
+  Args:
+    bert_config: BertConfig, the config defines the core Bert model.
+    max_seq_length: integer, the maximum input sequence length.
+    initializer: Initializer for the final dense layer in the span labeler.
+      Defaulted to TruncatedNormal initializer.
+    hub_module_url: TF-Hub path/url to Bert module.
+    hub_module_trainable: True to finetune layers in the hub module.
+  Returns:
+    A tuple of (1) keras model that outputs start logits and end logits and
+    (2) the core BERT transformer encoder.
+  """
+  if initializer is None:
+    initializer = tf_keras.initializers.TruncatedNormal(
+        stddev=bert_config.initializer_range)
+  if not hub_module_url:
+    bert_encoder = get_transformer_encoder(bert_config, max_seq_length)
+    return models.BertSpanLabeler(
+        network=bert_encoder, initializer=initializer), bert_encoder
+  input_word_ids = tf_keras.layers.Input(
+      shape=(max_seq_length,), dtype=tf.int32, name='input_word_ids')
+  input_mask = tf_keras.layers.Input(
+      shape=(max_seq_length,), dtype=tf.int32, name='input_mask')
+  input_type_ids = tf_keras.layers.Input(
+      shape=(max_seq_length,), dtype=tf.int32, name='input_type_ids')
+  core_model = hub.KerasLayer(hub_module_url, trainable=hub_module_trainable)
+  pooled_output, sequence_output = core_model(
+      [input_word_ids, input_mask, input_type_ids])
+  bert_encoder = tf_keras.Model(
+      inputs={
+          'input_word_ids': input_word_ids,
+          'input_mask': input_mask,
+          'input_type_ids': input_type_ids,
+      },
+      outputs=[sequence_output, pooled_output],
+      name='core_model')
+  return models.BertSpanLabeler(
+      network=bert_encoder, initializer=initializer), bert_encoder
+def classifier_model(bert_config,
+                     num_labels,
+                     max_seq_length=None,
+                     final_layer_initializer=None,
+                     hub_module_url=None,
+                     hub_module_trainable=True):
+  """BERT classifier model in functional API style.
+  Construct a Keras model for predicting `num_labels` outputs from an input with
+  maximum sequence length `max_seq_length`.
+  Args:
+    bert_config: BertConfig or AlbertConfig, the config defines the core BERT or
+      ALBERT model.
+    num_labels: integer, the number of classes.
+    max_seq_length: integer, the maximum input sequence length.
+    final_layer_initializer: Initializer for the final dense layer. Defaults
+      to a TruncatedNormal initializer.
+    hub_module_url: TF-Hub path/url to Bert module.
+    hub_module_trainable: True to finetune layers in the hub module.
+  Returns:
+    Combined prediction model (words, mask, type) -> (one-hot labels)
+    BERT sub-model (words, mask, type) -> (bert_outputs)
+  """
+  if final_layer_initializer is not None:
+    initializer = final_layer_initializer
+  else:
+    initializer = tf_keras.initializers.TruncatedNormal(
+        stddev=bert_config.initializer_range)
+  if not hub_module_url:
+    bert_encoder = get_transformer_encoder(
+        bert_config, max_seq_length, output_range=1)
+    return models.BertClassifier(
+        bert_encoder,
+        num_classes=num_labels,
+        dropout_rate=bert_config.hidden_dropout_prob,
+        initializer=initializer), bert_encoder
+  input_word_ids = tf_keras.layers.Input(
+      shape=(max_seq_length,), dtype=tf.int32, name='input_word_ids')
+  input_mask = tf_keras.layers.Input(
+      shape=(max_seq_length,), dtype=tf.int32, name='input_mask')
+  input_type_ids = tf_keras.layers.Input(
+      shape=(max_seq_length,), dtype=tf.int32, name='input_type_ids')
+  bert_model = hub.KerasLayer(hub_module_url, trainable=hub_module_trainable)
+  pooled_output, _ = bert_model([input_word_ids, input_mask, input_type_ids])
+  output = tf_keras.layers.Dropout(rate=bert_config.hidden_dropout_prob)(
+      pooled_output)
+  output = tf_keras.layers.Dense(
+      num_labels, kernel_initializer=initializer, name='output')(
+          output)
+  return tf_keras.Model(
+      inputs={
+          'input_word_ids': input_word_ids,
+          'input_mask': input_mask,
+          'input_type_ids': input_type_ids
+      },
+      outputs=output), bert_model
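
As a reading aid for `pretrain_model` above, the sketch below lists the named inputs that the returned Keras model expects, with their per-example shapes. It is plain Python with no TensorFlow dependency. The helper name `pretrain_input_shapes` is hypothetical and is not part of `bert_models`.

```python
def pretrain_input_shapes(seq_length, max_predictions_per_seq,
                          use_next_sentence_label=True):
  """Returns {input name: per-example shape} mirroring pretrain_model's inputs.

  Hypothetical helper for illustration only; not part of bert_models.
  """
  shapes = {
      'input_word_ids': (seq_length,),
      'input_mask': (seq_length,),
      'input_type_ids': (seq_length,),
      'masked_lm_positions': (max_predictions_per_seq,),
      'masked_lm_ids': (max_predictions_per_seq,),
      'masked_lm_weights': (max_predictions_per_seq,),
  }
  # The next-sentence label input only exists when the flag is enabled.
  if use_next_sentence_label:
    shapes['next_sentence_labels'] = (1,)
  return shapes


shapes = pretrain_input_shapes(seq_length=5, max_predictions_per_seq=2)
for name, shape in shapes.items():
  print(name, shape)
```

With `use_next_sentence_label=False` the `next_sentence_labels` entry is absent, which matches the conditional `inputs[...]` assignment in the function above.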

modeling/official/legacy/bert/bert_models_test.py ADDED

	@@ -0,0 +1,106 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import tensorflow as tf, tf_keras
+from official.legacy.bert import bert_models
+from official.legacy.bert import configs as bert_configs
+from official.nlp.modeling import networks
+class BertModelsTest(tf.test.TestCase):
+  def setUp(self):
+    super(BertModelsTest, self).setUp()
+    self._bert_test_config = bert_configs.BertConfig(
+        attention_probs_dropout_prob=0.0,
+        hidden_act='gelu',
+        hidden_dropout_prob=0.0,
+        hidden_size=16,
+        initializer_range=0.02,
+        intermediate_size=32,
+        max_position_embeddings=128,
+        num_attention_heads=2,
+        num_hidden_layers=2,
+        type_vocab_size=2,
+        vocab_size=30522)
+  def test_pretrain_model(self):
+    model, encoder = bert_models.pretrain_model(
+        self._bert_test_config,
+        seq_length=5,
+        max_predictions_per_seq=2,
+        initializer=None,
+        use_next_sentence_label=True)
+    self.assertIsInstance(model, tf_keras.Model)
+    self.assertIsInstance(encoder, networks.BertEncoder)
+    # model has one scalar output: loss value.
+    self.assertEqual(model.output.shape.as_list(), [
+        None,
+    ])
+    # Expect two outputs from encoder: sequence and classification output.
+    self.assertIsInstance(encoder.output, list)
+    self.assertLen(encoder.output, 2)
+    # shape should be [batch size, hidden_size]
+    self.assertEqual(encoder.output[1].shape.as_list(), [None, 16])
+  def test_squad_model(self):
+    model, core_model = bert_models.squad_model(
+        self._bert_test_config,
+        max_seq_length=5,
+        initializer=None,
+        hub_module_url=None,
+        hub_module_trainable=None)
+    self.assertIsInstance(model, tf_keras.Model)
+    self.assertIsInstance(core_model, tf_keras.Model)
+    # Expect two outputs from model: start positions and end positions.
+    self.assertIsInstance(model.output, list)
+    self.assertLen(model.output, 2)
+    # Expect two outputs from core_model: sequence and classification output.
+    self.assertIsInstance(core_model.output, list)
+    self.assertLen(core_model.output, 2)
+    # shape should be [batch size, None, hidden_size]
+    self.assertEqual(core_model.output[0].shape.as_list(), [None, None, 16])
+    # shape should be [batch size, hidden_size]
+    self.assertEqual(core_model.output[1].shape.as_list(), [None, 16])
+  def test_classifier_model(self):
+    model, core_model = bert_models.classifier_model(
+        self._bert_test_config,
+        num_labels=3,
+        max_seq_length=5,
+        final_layer_initializer=None,
+        hub_module_url=None,
+        hub_module_trainable=None)
+    self.assertIsInstance(model, tf_keras.Model)
+    self.assertIsInstance(core_model, tf_keras.Model)
+    # model has one classification output with num_labels=3.
+    self.assertEqual(model.output.shape.as_list(), [None, 3])
+    # Expect two outputs from core_model: sequence and classification output.
+    self.assertIsInstance(core_model.output, list)
+    self.assertLen(core_model.output, 2)
+    # shape should be [batch size, None, hidden_size]
+    self.assertEqual(core_model.output[0].shape.as_list(), [None, None, 16])
+    # shape should be [batch size, hidden_size]
+    self.assertEqual(core_model.output[1].shape.as_list(), [None, 16])
+if __name__ == '__main__':
+  tf.test.main()

modeling/official/legacy/bert/common_flags.py ADDED

	@@ -0,0 +1,125 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Defining common flags used across all BERT models/applications."""
+from absl import flags
+import tensorflow as tf, tf_keras
+from official.utils import hyperparams_flags
+from official.utils.flags import core as flags_core
+def define_common_bert_flags():
+  """Define common flags for BERT tasks."""
+  flags_core.define_base(
+      data_dir=False,
+      model_dir=True,
+      clean=False,
+      train_epochs=False,
+      epochs_between_evals=False,
+      stop_threshold=False,
+      batch_size=False,
+      num_gpu=True,
+      export_dir=False,
+      distribution_strategy=True,
+      run_eagerly=True)
+  flags_core.define_distribution()
+  flags.DEFINE_string('bert_config_file', None,
+                      'Bert configuration file to define core bert layers.')
+  flags.DEFINE_string(
+      'model_export_path', None,
+      'Path to the directory where the trained model will be '
+      'exported.')
+  flags.DEFINE_string('tpu', '', 'TPU address to connect to.')
+  flags.DEFINE_string(
+      'init_checkpoint', None,
+      'Initial checkpoint (usually from a pre-trained BERT model).')
+  flags.DEFINE_integer('num_train_epochs', 3,
+                       'Total number of training epochs to perform.')
+  flags.DEFINE_integer(
+      'steps_per_loop', None,
+      'Number of steps per graph-mode loop. Only training step '
+      'happens inside the loop. Callbacks will not be called '
+      'inside. If not set the value will be configured depending on the '
+      'devices available.')
+  flags.DEFINE_float('learning_rate', 5e-5,
+                     'The initial learning rate for Adam.')
+  flags.DEFINE_float('end_lr', 0.0,
+                     'The end learning rate for learning rate decay.')
+  flags.DEFINE_string('optimizer_type', 'adamw',
+                      'The type of optimizer to use for training (adamw|lamb)')
+  flags.DEFINE_boolean(
+      'scale_loss', False,
+      'Whether to divide the loss by the number of replicas inside the '
+      'per-replica loss function.')
+  flags.DEFINE_boolean(
+      'use_keras_compile_fit', False,
+      'If True, uses Keras compile/fit() API for training logic. Otherwise '
+      'use custom training loop.')
+  flags.DEFINE_string(
+      'hub_module_url', None, 'TF-Hub path/url to Bert module. '
+      'If specified, init_checkpoint flag should not be used.')
+  flags.DEFINE_bool('hub_module_trainable', True,
+                    'True to make keras layers in the hub module trainable.')
+  flags.DEFINE_string(
+      'sub_model_export_name', None,
+      'If set, `sub_model` checkpoints are exported into '
+      'FLAGS.model_dir/FLAGS.sub_model_export_name.')
+  flags.DEFINE_bool('explicit_allreduce', False,
+                    'True to use explicit allreduce instead of the implicit '
+                    'allreduce in optimizer.apply_gradients(). If fp16 mixed '
+                    'precision training is used, this also enables allreduce '
+                    'gradients in fp16.')
+  flags.DEFINE_integer('allreduce_bytes_per_pack', 0,
+                       'Number of bytes of a gradient pack for allreduce. '
+                       'Should be a non-negative integer. If set to 0, all '
+                       'gradients are in one pack. Breaking gradient into '
+                       'packs could enable overlap between allreduce and '
+                       'backprop computation. This flag only takes effect '
+                       'when explicit_allreduce is set to True.')
+  flags_core.define_log_steps()
+  # Adds flags for mixed precision and multi-worker training.
+  flags_core.define_performance(
+      num_parallel_calls=False,
+      inter_op=False,
+      intra_op=False,
+      synthetic_data=False,
+      max_train_steps=False,
+      dtype=True,
+      loss_scale=True,
+      all_reduce_alg=True,
+      num_packs=False,
+      tf_gpu_thread_mode=True,
+      datasets_num_private_threads=True,
+      enable_xla=True,
+      fp16_implementation=True,
+  )
+  # Adds gin configuration flags.
+  hyperparams_flags.define_gin_flags()
+def dtype():
+  return flags_core.get_tf_dtype(flags.FLAGS)
+def use_float16():
+  return flags_core.get_tf_dtype(flags.FLAGS) == tf.float16
+def get_loss_scale():
+  return flags_core.get_loss_scale(flags.FLAGS, default_for_fp16='dynamic')
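
The help text for `allreduce_bytes_per_pack` describes splitting gradients into packs, with `0` meaning a single pack. The sketch below illustrates that idea with a simple greedy grouping over gradient sizes. It is an assumption-laden illustration, not TensorFlow's packing logic; `pack_gradients` is a hypothetical name, and the greedy rule (including giving an oversized gradient its own pack) is chosen only for clarity.

```python
def pack_gradients(grad_sizes_in_bytes, bytes_per_pack=0):
  """Groups gradient indices into packs of at most `bytes_per_pack` bytes.

  Hypothetical illustration; the real packing is done inside TensorFlow.
  """
  if bytes_per_pack < 0:
    raise ValueError('bytes_per_pack must be non-negative.')
  indices = list(range(len(grad_sizes_in_bytes)))
  if bytes_per_pack == 0:
    # Mirrors the flag's documented meaning: everything goes into one pack.
    return [indices] if indices else []

  packs, current, current_bytes = [], [], 0
  for index, size in enumerate(grad_sizes_in_bytes):
    # Start a new pack when adding this gradient would exceed the limit.
    if current and current_bytes + size > bytes_per_pack:
      packs.append(current)
      current, current_bytes = [], 0
    current.append(index)
    current_bytes += size
  if current:
    packs.append(current)
  return packs


single_pack = pack_gradients([4, 4, 8, 2], bytes_per_pack=0)
several_packs = pack_gradients([4, 4, 8, 2], bytes_per_pack=8)
print(single_pack)
print(several_packs)
```

Smaller packs let communication for early packs start while later gradients are still being computed, which is the overlap the flag description refers to.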

modeling/official/legacy/bert/configs.py ADDED

	@@ -0,0 +1,104 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""The main BERT model and related functions."""
+import copy
+import json
+import six
+import tensorflow as tf, tf_keras
+class BertConfig(object):
+  """Configuration for `BertModel`."""
+  def __init__(self,
+               vocab_size,
+               hidden_size=768,
+               num_hidden_layers=12,
+               num_attention_heads=12,
+               intermediate_size=3072,
+               hidden_act="gelu",
+               hidden_dropout_prob=0.1,
+               attention_probs_dropout_prob=0.1,
+               max_position_embeddings=512,
+               type_vocab_size=16,
+               initializer_range=0.02,
+               embedding_size=None,
+               backward_compatible=True):
+    """Constructs BertConfig.
+    Args:
+      vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
+      hidden_size: Size of the encoder layers and the pooler layer.
+      num_hidden_layers: Number of hidden layers in the Transformer encoder.
+      num_attention_heads: Number of attention heads for each attention layer in
+        the Transformer encoder.
+      intermediate_size: The size of the "intermediate" (i.e., feed-forward)
+        layer in the Transformer encoder.
+      hidden_act: The non-linear activation function (function or string) in the
+        encoder and pooler.
+      hidden_dropout_prob: The dropout probability for all fully connected
+        layers in the embeddings, encoder, and pooler.
+      attention_probs_dropout_prob: The dropout ratio for the attention
+        probabilities.
+      max_position_embeddings: The maximum sequence length that this model might
+        ever be used with. Typically set this to something large just in case
+        (e.g., 512 or 1024 or 2048).
+      type_vocab_size: The vocabulary size of the `token_type_ids` passed into
+        `BertModel`.
+      initializer_range: The stdev of the truncated_normal_initializer for
+        initializing all weight matrices.
+      embedding_size: (Optional) width of the factorized word embeddings.
+      backward_compatible: Boolean, whether the variable shapes are compatible
+        with checkpoints converted from TF 1.x BERT.
+    """
+    self.vocab_size = vocab_size
+    self.hidden_size = hidden_size
+    self.num_hidden_layers = num_hidden_layers
+    self.num_attention_heads = num_attention_heads
+    self.hidden_act = hidden_act
+    self.intermediate_size = intermediate_size
+    self.hidden_dropout_prob = hidden_dropout_prob
+    self.attention_probs_dropout_prob = attention_probs_dropout_prob
+    self.max_position_embeddings = max_position_embeddings
+    self.type_vocab_size = type_vocab_size
+    self.initializer_range = initializer_range
+    self.embedding_size = embedding_size
+    self.backward_compatible = backward_compatible
+  @classmethod
+  def from_dict(cls, json_object):
+    """Constructs a `BertConfig` from a Python dictionary of parameters."""
+    config = BertConfig(vocab_size=None)
+    for (key, value) in six.iteritems(json_object):
+      config.__dict__[key] = value
+    return config
+  @classmethod
+  def from_json_file(cls, json_file):
+    """Constructs a `BertConfig` from a json file of parameters."""
+    with tf.io.gfile.GFile(json_file, "r") as reader:
+      text = reader.read()
+    return cls.from_dict(json.loads(text))
+  def to_dict(self):
+    """Serializes this instance to a Python dictionary."""
+    output = copy.deepcopy(self.__dict__)
+    return output
+  def to_json_string(self):
+    """Serializes this instance to a JSON string."""
+    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"
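
The dict and JSON helpers on `BertConfig` can be exercised without TensorFlow. The sketch below reproduces the same pattern in a small stand-in class and round-trips it through JSON. `MiniConfig` is a hypothetical name with only two fields; it is not the class defined above, and it uses `dict.items()` in place of `six.iteritems`.

```python
import copy
import json


class MiniConfig(object):
  """Hypothetical stand-in that mirrors BertConfig's dict/JSON helpers."""

  def __init__(self, vocab_size, hidden_size=768):
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size

  @classmethod
  def from_dict(cls, json_object):
    # Start from constructor defaults, then overwrite with the given keys.
    config = cls(vocab_size=None)
    for key, value in json_object.items():
      config.__dict__[key] = value
    return config

  def to_dict(self):
    return copy.deepcopy(self.__dict__)

  def to_json_string(self):
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"


original = MiniConfig(vocab_size=30522, hidden_size=16)
restored = MiniConfig.from_dict(json.loads(original.to_json_string()))

# Missing keys keep their defaults; unknown keys become attributes unchecked.
partial = MiniConfig.from_dict({"vocab_size": 100, "extra_key": 1})
```

Because `from_dict` writes straight into `__dict__`, a misspelled key in a config file is accepted silently instead of raising an error, so checking the loaded config against the expected field names can be worthwhile.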

modeling/official/legacy/bert/export_tfhub.py ADDED

	@@ -0,0 +1,139 @@

+# Copyright 2023 The TensorFlow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""A script to export BERT as a TF-Hub SavedModel.
+This script is **DEPRECATED** for exporting BERT encoder models;
+see the error message logged by main() for details.
+"""
+from typing import Text
+# Import libraries
+from absl import app
+from absl import flags
+from absl import logging
+import tensorflow as tf, tf_keras
+from official.legacy.bert import bert_models
+from official.legacy.bert import configs
+FLAGS = flags.FLAGS
+flags.DEFINE_string("bert_config_file", None,
+                    "Bert configuration file to define core bert layers.")
+flags.DEFINE_string("model_checkpoint_path", None,
+                    "File path to TF model checkpoint.")
+flags.DEFINE_string("export_path", None, "TF-Hub SavedModel destination path.")
+flags.DEFINE_string("vocab_file", None,
+                    "The vocabulary file that the BERT model was trained on.")
+flags.DEFINE_bool(
+    "do_lower_case", None, "Whether to lowercase. If None, "
+    "do_lower_case will be enabled if 'uncased' appears in the "
+    "name of --vocab_file")
+flags.DEFINE_enum("model_type", "encoder", ["encoder", "squad"],
+                  "What kind of BERT model to export.")
+def create_bert_model(bert_config: configs.BertConfig) -> tf_keras.Model:
+  """Creates a BERT keras core model from BERT configuration.
+  Args:
+    bert_config: A `BertConfig` to create the core model.
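+  Returns:
+    A tuple of (1) the keras core model that outputs `pooled_output` and
+    `sequence_output`, and (2) the underlying transformer encoder.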
+  """
+  # Adds input layers just as placeholders.
+  input_word_ids = tf_keras.layers.Input(
+      shape=(None,), dtype=tf.int32, name="input_word_ids")
+  input_mask = tf_keras.layers.Input(
+      shape=(None,), dtype=tf.int32, name="input_mask")
+  input_type_ids = tf_keras.layers.Input(
+      shape=(None,), dtype=tf.int32, name="input_type_ids")
+  transformer_encoder = bert_models.get_transformer_encoder(
+      bert_config, sequence_length=None)
+  sequence_output, pooled_output = transformer_encoder(
+      [input_word_ids, input_mask, input_type_ids])
+  # To keep consistent with legacy hub modules, the outputs are
+  # "pooled_output" and "sequence_output".
+  return tf_keras.Model(
+      inputs=[input_word_ids, input_mask, input_type_ids],
+      outputs=[pooled_output, sequence_output]), transformer_encoder
+def export_bert_tfhub(bert_config: configs.BertConfig,
+                      model_checkpoint_path: Text,
+                      hub_destination: Text,
+                      vocab_file: Text,
+                      do_lower_case: bool = None):
+  """Restores a tf_keras.Model and saves for TF-Hub."""
+  # If do_lower_case is not explicit, default to checking whether "uncased" is
+  # in the vocab file name
+  if do_lower_case is None:
+    do_lower_case = "uncased" in vocab_file
+    logging.info("Using do_lower_case=%s based on name of vocab_file=%s",
+                 do_lower_case, vocab_file)
+  core_model, encoder = create_bert_model(bert_config)
+  checkpoint = tf.train.Checkpoint(
+      model=encoder,  # Legacy checkpoints.
+      encoder=encoder)
+  checkpoint.restore(model_checkpoint_path).assert_existing_objects_matched()
+  core_model.vocab_file = tf.saved_model.Asset(vocab_file)
+  core_model.do_lower_case = tf.Variable(do_lower_case, trainable=False)
+  core_model.save(hub_destination, include_optimizer=False, save_format="tf")
+def export_bert_squad_tfhub(bert_config: configs.BertConfig,
+                            model_checkpoint_path: Text,
+                            hub_destination: Text,
+                            vocab_file: Text,
+                            do_lower_case: bool = None):
+  """Restores a tf_keras.Model for BERT with SQuAD and saves for TF-Hub."""
+  # If do_lower_case is not explicit, default to checking whether "uncased" is
+  # in the vocab file name
+  if do_lower_case is None:
+    do_lower_case = "uncased" in vocab_file
+    logging.info("Using do_lower_case=%s based on name of vocab_file=%s",
+                 do_lower_case, vocab_file)
+  span_labeling, _ = bert_models.squad_model(bert_config, max_seq_length=None)
+  checkpoint = tf.train.Checkpoint(model=span_labeling)
+  checkpoint.restore(model_checkpoint_path).assert_existing_objects_matched()
+  span_labeling.vocab_file = tf.saved_model.Asset(vocab_file)
+  span_labeling.do_lower_case = tf.Variable(do_lower_case, trainable=False)
+  span_labeling.save(hub_destination, include_optimizer=False, save_format="tf")
+def main(_):
+  bert_config = configs.BertConfig.from_json_file(FLAGS.bert_config_file)
+  if FLAGS.model_type == "encoder":
+    deprecation_note = (
+        "nlp/bert/export_tfhub is **DEPRECATED** for exporting BERT encoder "
+        "models. Please switch to nlp/tools/export_tfhub for exporting BERT "
+        "(and other) encoders with dict inputs/outputs conforming to "
+        "https://www.tensorflow.org/hub/common_saved_model_apis/text#transformer-encoders"
+    )
+    logging.error(deprecation_note)
+    print("\n\nNOTICE:", deprecation_note, "\n")
+    export_bert_tfhub(bert_config, FLAGS.model_checkpoint_path,
+                      FLAGS.export_path, FLAGS.vocab_file, FLAGS.do_lower_case)
+  elif FLAGS.model_type == "squad":
+    export_bert_squad_tfhub(bert_config, FLAGS.model_checkpoint_path,
+                            FLAGS.export_path, FLAGS.vocab_file,
+                            FLAGS.do_lower_case)
+  else:
+    raise ValueError("Unsupported model_type %s." % FLAGS.model_type)
+if __name__ == "__main__":
+  app.run(main)
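
Both export functions above infer `do_lower_case` from the vocabulary file name when the flag is left unset. The sketch below isolates that rule so its edge cases are easier to see. `resolve_do_lower_case` is a hypothetical helper name, and the file paths are made-up examples.

```python
def resolve_do_lower_case(vocab_file, do_lower_case=None):
  """Returns the explicit setting, or infers it from the vocab file name.

  Hypothetical helper that mirrors the `"uncased" in vocab_file` check.
  """
  if do_lower_case is None:
    # Case-sensitive substring match over the whole path, not just the name.
    return "uncased" in vocab_file
  return do_lower_case


inferred_uncased = resolve_do_lower_case("uncased_L-12_H-768_A-12/vocab.txt")
inferred_cased = resolve_do_lower_case("cased_L-12_H-768_A-12/vocab.txt")
explicit_override = resolve_do_lower_case("uncased/vocab.txt", False)
capitalized_dir = resolve_do_lower_case("Uncased/vocab.txt")
```

Since the match is case-sensitive and applies to the full path, a directory named `Uncased` does not trigger lowercasing, while any parent directory containing `uncased` does. Passing `--do_lower_case` explicitly avoids relying on the path.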