GitHub Repository: keras-team/keras-io
Path: blob/master/examples/keras_recipes/creating_tfrecords.py
"""
Title: Creating TFRecords
Author: [Dimitre Oliveira](https://www.linkedin.com/in/dimitre-oliveira-7a1a0113a/)
Date created: 2021/02/27
Last modified: 2023/12/20
Description: Converting data to the TFRecord format.
Accelerator: GPU
"""

"""
## Introduction

The TFRecord format is a simple format for storing a sequence of binary records.
Converting your data into TFRecord has many advantages, such as:

- **More efficient storage**: the TFRecord data can take up less space than the original
data; it can also be partitioned into multiple files.
- **Fast I/O**: the TFRecord format can be read with parallel I/O operations, which is
useful for [TPUs](https://www.tensorflow.org/guide/tpu) or multiple hosts.
- **Self-contained files**: the TFRecord data can be read from a single source; for
example, the [COCO2017](https://cocodataset.org/) dataset originally stores data in
two folders ("images" and "annotations").

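To make "a sequence of binary records" concrete, here is a minimal pure-Python sketch of
the on-disk framing, assuming the layout described in the TFRecord tutorial linked under
Reference: an 8-byte little-endian length, a 4-byte masked CRC-32C of that length, the
payload, and a 4-byte masked CRC-32C of the payload. The helper names (`crc32c`,
`masked_crc`, `frame_record`, `read_records`) are hypothetical, and this is only an
illustration; real code should use `tf.io.TFRecordWriter` and `tf.data.TFRecordDataset`.

```python
import io
import struct


def crc32c(data):
    # Bitwise CRC-32C (Castagnoli polynomial); slow but dependency-free.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def masked_crc(data):
    # Rotate the checksum right by 15 bits, then add a fixed constant (mod 2**32).
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload):
    # length | masked CRC of length | payload | masked CRC of payload
    length = struct.pack("<Q", len(payload))
    return (
        length
        + struct.pack("<I", masked_crc(length))
        + payload
        + struct.pack("<I", masked_crc(payload))
    )


def read_records(stream):
    # Yield payloads one by one, checking both checksums along the way.
    while True:
        header = stream.read(12)
        if len(header) < 12:
            return
        (length,) = struct.unpack("<Q", header[:8])
        (length_crc,) = struct.unpack("<I", header[8:])
        if length_crc != masked_crc(header[:8]):
            raise ValueError("corrupted record length")
        payload = stream.read(length)
        (payload_crc,) = struct.unpack("<I", stream.read(4))
        if payload_crc != masked_crc(payload):
            raise ValueError("corrupted record payload")
        yield payload


# Two made-up records written back to back, then read again.
buffer = io.BytesIO(frame_record(b"first record") + frame_record(b"second"))
records = list(read_records(buffer))
print(records)
```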
An important use case of the TFRecord data format is training on TPUs. First, TPUs are
fast enough to benefit from optimized I/O operations. In addition, TPUs require
data to be stored remotely (e.g. on Google Cloud Storage), and using the TFRecord format
makes it easier to load the data without batch-downloading.

Performance using the TFRecord format can be further improved if you also use
it with the [tf.data](https://www.tensorflow.org/guide/data) API.

In this example you will learn how to convert data of different types (image, text, and
numeric) into TFRecord.

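As a first, minimal sketch of what "different types" means here: a `tf.train.Example`
stores every feature as one of three list types, `BytesList`, `FloatList`, or
`Int64List`. The feature names and values below are made up for illustration and are
not the ones used for COCO later in this example.

```python
import tensorflow as tf

# Hypothetical features, chosen only to show one feature per list type.
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "caption": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=["a cat".encode()])
            ),
            "score": tf.train.Feature(float_list=tf.train.FloatList(value=[0.5])),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[3])),
        }
    )
)

# Serialize to bytes (this is what gets written to a TFRecord file), then parse back.
serialized = example.SerializeToString()
restored = tf.train.Example.FromString(serialized)

caption = restored.features.feature["caption"].bytes_list.value[0].decode()
score = restored.features.feature["score"].float_list.value[0]
label = restored.features.feature["label"].int64_list.value[0]
print(caption, score, label)
```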
**Reference**

- [TFRecord and tf.train.Example](https://www.tensorflow.org/tutorials/load_data/tfrecord)


## Dependencies
"""

import os

# The backend must be selected before Keras is imported.
os.environ["KERAS_BACKEND"] = "tensorflow"
import keras
import json
import pprint
import tensorflow as tf
import matplotlib.pyplot as plt

"""
## Download the COCO2017 dataset

We will be using the [COCO2017](https://cocodataset.org/) dataset, because it has many
different types of features, including images, floating point data, and lists.
It will serve as a good example of how to encode different features into the TFRecord
format.

This dataset has two sets of fields: images and annotation meta-data.

The images are a collection of JPG files and the meta-data are stored in a JSON file
which, according to the [official site](https://cocodataset.org/#format-data),
contains the following properties:

```
id: int,
image_id: int,
category_id: int,
segmentation: RLE or [polygon], object segmentation mask
bbox: [x,y,width,height], object bounding box coordinates
area: float, area of the bounding box
iscrowd: 0 or 1, is single object or a collection
```
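As a small illustration of the `bbox` layout listed above, the sketch below converts
`[x, y, width, height]` into corner coordinates. The annotation values are invented for
the example, and `bbox_to_corners` is a hypothetical helper, not part of any COCO tooling.

```python
# A made-up annotation that follows the field layout listed above.
annotation = {
    "id": 1,
    "image_id": 42,
    "category_id": 18,
    "bbox": [10.0, 20.0, 30.0, 40.0],  # [x, y, width, height]
    "iscrowd": 0,
}


def bbox_to_corners(bbox):
    # Convert [x, y, width, height] to (x_min, y_min, x_max, y_max).
    x, y, width, height = bbox
    return (x, y, x + width, y + height)


corners = bbox_to_corners(annotation["bbox"])
print(corners)  # (10.0, 20.0, 40.0, 60.0)
```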
"""

root_dir = "datasets"
tfrecords_dir = "tfrecords"
images_dir = os.path.join(root_dir, "val2017")
annotations_dir = os.path.join(root_dir, "annotations")
annotation_file = os.path.join(annotations_dir, "instances_val2017.json")
images_url = "http://images.cocodataset.org/zips/val2017.zip"
annotations_url = (
    "http://images.cocodataset.org/annotations/annotations_trainval2017.zip"
)

# Download image files
if not os.path.exists(images_dir):
    image_zip = keras.utils.get_file(
        "images.zip",
        cache_dir=os.path.abspath("."),
        origin=images_url,
        extract=True,
    )
    os.remove(image_zip)

# Download annotation files
if not os.path.exists(annotations_dir):
    annotation_zip = keras.utils.get_file(
        "captions.zip",
        cache_dir=os.path.abspath("."),
        origin=annotations_url,
        extract=True,
    )
    os.remove(annotation_zip)

print("The COCO dataset has been downloaded and extracted successfully.")

with open(annotation_file, "r") as f:
    annotations = json.load(f)["annotations"]

# Each entry describes one annotated object, so this counts annotations, not images.
print(f"Number of annotations: {len(annotations)}")

"""
### Contents of the COCO2017 dataset
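Each entry of `annotations` describes one annotated object, so several entries can share
the same `image_id`. A minimal sketch with made-up entries (not real COCO data) shows how
to count objects per image:

```python
from collections import Counter

# Made-up annotation entries; only the fields needed for counting are included.
sample_annotations = [
    {"id": 1, "image_id": 100, "category_id": 1},
    {"id": 2, "image_id": 100, "category_id": 3},
    {"id": 3, "image_id": 200, "category_id": 1},
]

# Map each image_id to the number of annotations that reference it.
objects_per_image = Counter(entry["image_id"] for entry in sample_annotations)

print(f"Annotations: {len(sample_annotations)}")  # Annotations: 3
print(f"Distinct images: {len(objects_per_image)}")  # Distinct images: 2
```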
"""

pprint.pprint(annotations[60])

"""
## Parameters

`num_samples` is the number of data samples in each TFRecord file.

`num_tfrecords` is the total number of TFRecord files that we will create.
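
The file count is a ceiling division of the number of annotations by `num_samples`. A
minimal sketch with made-up totals (the real value comes from `len(annotations)`);
`count_files` is a hypothetical helper used only here:

```python
import math

num_samples = 4096


def count_files(total, per_file):
    # Full files, plus one more if any samples are left over.
    files = total // per_file
    if total % per_file:
        files += 1
    return files


for total in (4096, 4097, 10000):  # hypothetical dataset sizes
    # The floor-division-plus-remainder form agrees with math.ceil.
    assert count_files(total, num_samples) == math.ceil(total / num_samples)
    print(total, "->", count_files(total, num_samples))
```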
"""

num_samples = 4096
num_tfrecords = len(annotations) // num_samples
if len(annotations) % num_samples:
    num_tfrecords += 1  # add one more file if there are any remaining samples

os.makedirs(tfrecords_dir, exist_ok=True)  # create the TFRecords output folder

"""
## Define TFRecords helper functions
"""


def image_feature(value):
    """Returns a bytes_list holding the JPEG-encoded image tensor."""
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[tf.io.encode_jpeg(value).numpy()])
    )


def bytes_feature(value):
    """Returns a bytes_list from a string (encoded to bytes)."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.encode()]))


def float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))


def int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def float_feature_list(value):
    """Returns a float_list from a list of floats / doubles."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))


def create_example(image, path, example):
    """Builds a tf.train.Example from an image tensor, its path, and one annotation."""
    feature = {
        "image": image_feature(image),
        "path": bytes_feature(path),
        "area": float_feature(example["area"]),
        "bbox": float_feature_list(example["bbox"]),
        "category_id": int64_feature(example["category_id"]),
        "id": int64_feature(example["id"]),
        "image_id": int64_feature(example["image_id"]),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def parse_tfrecord_fn(example):
    """Parses one serialized tf.train.Example into a dict of tensors."""
    feature_description = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "path": tf.io.FixedLenFeature([], tf.string),
        "area": tf.io.FixedLenFeature([], tf.float32),
        "bbox": tf.io.VarLenFeature(tf.float32),
        "category_id": tf.io.FixedLenFeature([], tf.int64),
        "id": tf.io.FixedLenFeature([], tf.int64),
        "image_id": tf.io.FixedLenFeature([], tf.int64),
    }
    example = tf.io.parse_single_example(example, feature_description)
    example["image"] = tf.io.decode_jpeg(example["image"], channels=3)
    # VarLenFeature yields a SparseTensor; convert it to a dense tensor.
    example["bbox"] = tf.sparse.to_dense(example["bbox"])
    return example

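"""
`parse_tfrecord_fn` reads `bbox` with `tf.io.VarLenFeature`, which yields a sparse tensor
that is then densified. As a hedged aside: because a COCO `bbox` always holds four
values, `tf.io.FixedLenFeature([4], tf.float32)` would return a dense tensor directly.
The sketch below compares the two on a made-up record; it is an illustration only, not a
change to the pipeline in this example.

```python
import tensorflow as tf

# A made-up record holding a single four-value "bbox" feature.
serialized = tf.train.Example(
    features=tf.train.Features(
        feature={
            "bbox": tf.train.Feature(
                float_list=tf.train.FloatList(value=[10.0, 20.0, 30.0, 40.0])
            ),
        }
    )
).SerializeToString()

# Variable-length spec: the parsed value is sparse and needs densifying.
var_len = tf.io.parse_single_example(
    serialized, {"bbox": tf.io.VarLenFeature(tf.float32)}
)
dense_from_sparse = tf.sparse.to_dense(var_len["bbox"])

# Fixed-length spec: the parsed value is already a dense tensor of shape (4,).
fixed = tf.io.parse_single_example(
    serialized, {"bbox": tf.io.FixedLenFeature([4], tf.float32)}
)

print(dense_from_sparse.numpy().tolist())
print(fixed["bbox"].numpy().tolist())
```
"""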
"""
## Generate data in the TFRecord format

Let's generate the COCO2017 data in the TFRecord format. Each file will be named
`file_{number}-{sample_count}.tfrec` (this is optional, but including the file number
and the sample count in the file names can make counting easier).
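
One way to use this naming scheme: when each file name ends with the number of samples
it holds (as in `file_00-4096.tfrec`), the dataset size can be recovered from the names
alone, without opening any file. A minimal sketch with hypothetical file names;
`count_samples` is an illustrative helper, not part of TensorFlow:

```python
import re

# Hypothetical file names following the "file_<index>-<count>.tfrec" pattern.
filenames = [
    "tfrecords/file_00-4096.tfrec",
    "tfrecords/file_01-4096.tfrec",
    "tfrecords/file_02-1808.tfrec",
]


def count_samples(names):
    # Sum the sample counts encoded just before the ".tfrec" extension.
    pattern = re.compile("-([0-9]+)[.]tfrec$")
    return sum(int(pattern.search(name).group(1)) for name in names)


print(count_samples(filenames))  # 10000
```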
"""

for tfrec_num in range(num_tfrecords):
    # Slice out the annotations that belong to this file.
    samples = annotations[(tfrec_num * num_samples) : ((tfrec_num + 1) * num_samples)]

    # The file name carries the file index and the number of samples it holds.
    tfrecord_path = f"{tfrecords_dir}/file_{tfrec_num:02d}-{len(samples)}.tfrec"
    with tf.io.TFRecordWriter(tfrecord_path) as writer:
        for sample in samples:
            image_path = f"{images_dir}/{sample['image_id']:012d}.jpg"
            image = tf.io.decode_jpeg(tf.io.read_file(image_path))
            example = create_example(image, image_path, sample)
            writer.write(example.SerializeToString())

"""
## Explore one sample from the generated TFRecord
"""

raw_dataset = tf.data.TFRecordDataset(f"{tfrecords_dir}/file_00-{num_samples}.tfrec")
parsed_dataset = raw_dataset.map(parse_tfrecord_fn)

for features in parsed_dataset.take(1):
    for key in features.keys():
        if key != "image":
            print(f"{key}: {features[key]}")

    print(f"Image shape: {features['image'].shape}")
    plt.figure(figsize=(7, 7))
    plt.imshow(features["image"].numpy())
    plt.show()

"""
## Train a simple model using the generated TFRecords

Another advantage of TFRecord is that you are able to add many features to it and later
use only a few of them. In this case, we are going to use only `image` and `category_id`.

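A minimal sketch of that idea, using a made-up record rather than the COCO files: the
feature description passed to `tf.io.parse_single_example` lists only the keys you need,
and the other features stored in the record are simply not parsed.

```python
import tensorflow as tf

# A made-up record with three features.
serialized = tf.train.Example(
    features=tf.train.Features(
        feature={
            "path": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[b"img.jpg"])
            ),
            "area": tf.train.Feature(float_list=tf.train.FloatList(value=[12.5])),
            "category_id": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[7])
            ),
        }
    )
).SerializeToString()

# Request a single feature; "path" and "area" stay in the record but are skipped.
subset_description = {"category_id": tf.io.FixedLenFeature([], tf.int64)}
parsed = tf.io.parse_single_example(serialized, subset_description)

print(sorted(parsed.keys()))
print(int(parsed["category_id"]))
```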
"""

"""
## Define dataset helper functions
"""


def prepare_sample(features):
    # Resize the image to the model's input size and keep only the label.
    image = keras.ops.image.resize(features["image"], size=(224, 224))
    return image, features["category_id"]


def get_dataset(filenames, batch_size):
    # Read the files in parallel, parse and prepare each record, then shuffle,
    # batch, and prefetch.
    dataset = (
        tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE)
        .map(parse_tfrecord_fn, num_parallel_calls=AUTOTUNE)
        .map(prepare_sample, num_parallel_calls=AUTOTUNE)
        .shuffle(batch_size * 10)
        .batch(batch_size)
        .prefetch(AUTOTUNE)
    )
    return dataset

train_filenames = tf.io.gfile.glob(f"{tfrecords_dir}/*.tfrec")
batch_size = 32
epochs = 1
steps_per_epoch = 50
# Referenced inside get_dataset; defined here, before that function is first called.
AUTOTUNE = tf.data.AUTOTUNE

input_tensor = keras.layers.Input(shape=(224, 224, 3), name="image")
model = keras.applications.EfficientNetB0(
    input_tensor=input_tensor, weights=None, classes=91
)


model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)


model.fit(
    x=get_dataset(train_filenames, batch_size),
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    verbose=1,
)

"""
## Conclusion

This example demonstrates that instead of reading images and annotations from different
sources you can have your data coming from a single source thanks to TFRecord.
This process can make storing and reading data simpler and more efficient.
For more information, you can go to the [TFRecord and
tf.train.Example](https://www.tensorflow.org/tutorials/load_data/tfrecord) tutorial.
"""