# Source: keras-team/keras-io, blob/master/examples/keras_recipes/creating_tfrecords.py
"""
Title: Creating TFRecords
Author: [Dimitre Oliveira](https://www.linkedin.com/in/dimitre-oliveira-7a1a0113a/)
Date created: 2021/02/27
Last modified: 2023/12/20
Description: Converting data to the TFRecord format.
Accelerator: GPU
"""

"""
## Introduction

The TFRecord format is a simple format for storing a sequence of binary records.
Converting your data into TFRecord has many advantages, such as:

- **More efficient storage**: the TFRecord data can take up less space than the original
data; it can also be partitioned into multiple files.
- **Fast I/O**: the TFRecord format can be read with parallel I/O operations, which is
useful for [TPUs](https://www.tensorflow.org/guide/tpu) or multiple hosts.
- **Self-contained files**: the TFRecord data can be read from a single source. For
example, the [COCO2017](https://cocodataset.org/) dataset originally stores data in
two folders ("images" and "annotations").

An important use case of the TFRecord data format is training on TPUs. First, TPUs are
fast enough to benefit from optimized I/O operations. In addition, TPUs require
data to be stored remotely (e.g. on Google Cloud Storage), and using the TFRecord format
makes it easier to load the data without batch-downloading.

Performance using the TFRecord format can be further improved if you also use
it with the [tf.data](https://www.tensorflow.org/guide/data) API.

In this example you will learn how to convert data of different types (image, text, and
numeric) into TFRecord.
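
As a rough illustration of what "a sequence of binary records" means, the sketch below
writes a few raw byte strings to a TFRecord file and reads them back with
`tf.data.TFRecordDataset`. The payloads and the `demo.tfrec` file name are made up for
this illustration and are not part of the COCO workflow that follows.

```python
import os
import tempfile

import tensorflow as tf

# Made-up payloads: any byte strings can be stored as records.
records = [b"first record", b"second record", b"third record"]

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.tfrec")

# Write each byte string as one record.
with tf.io.TFRecordWriter(path) as writer:
    for record in records:
        writer.write(record)

# Read the records back in the order they were written.
read_back = [record.numpy() for record in tf.data.TFRecordDataset(path)]
print(read_back)
```

Each record is an opaque byte string. The rest of this example stores serialized
`tf.train.Example` messages in place of these hand-written payloads.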

**Reference**

- [TFRecord and tf.train.Example](https://www.tensorflow.org/tutorials/load_data/tfrecord)


## Dependencies
"""

import os

os.environ["KERAS_BACKEND"] = "tensorflow"
import keras
import json
import pprint
import tensorflow as tf
import matplotlib.pyplot as plt

"""
## Download the COCO2017 dataset

We will be using the [COCO2017](https://cocodataset.org/) dataset, because it has many
different types of features, including images, floating point data, and lists.
It will serve as a good example of how to encode different features into the TFRecord
format.

This dataset has two sets of fields: images and annotation metadata.

The images are a collection of JPG files and the metadata is stored in a JSON file
which, according to the [official site](https://cocodataset.org/#format-data),
contains the following properties:

```
id: int,
image_id: int,
category_id: int,
segmentation: RLE or [polygon], object segmentation mask
bbox: [x,y,width,height], object bounding box coordinates
area: float, area of the object segmentation mask (in pixels)
iscrowd: 0 or 1, is single object or a collection
```
"""
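
"""
The `bbox` property listed above uses the `[x, y, width, height]` layout. As a small
sketch of what that means in practice, the helper below converts such a box into corner
coordinates. Note that `bbox_to_corners` is a hypothetical helper (it is not part of COCO
or Keras) and the annotation values are made up for illustration.

```python
def bbox_to_corners(bbox):
    # Convert [x, y, width, height] into [x_min, y_min, x_max, y_max].
    x, y, width, height = bbox
    return [x, y, x + width, y + height]


# A made-up annotation with the same properties as listed above.
annotation = {
    "id": 1,
    "image_id": 42,
    "category_id": 18,
    "bbox": [10.0, 20.0, 30.0, 40.0],
    "area": 950.0,
    "iscrowd": 0,
}

corners = bbox_to_corners(annotation["bbox"])
print(corners)  # [10.0, 20.0, 40.0, 60.0]
```
"""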

root_dir = "datasets"
tfrecords_dir = "tfrecords"
images_dir = os.path.join(root_dir, "val2017")
annotations_dir = os.path.join(root_dir, "annotations")
annotation_file = os.path.join(annotations_dir, "instances_val2017.json")
images_url = "http://images.cocodataset.org/zips/val2017.zip"
annotations_url = (
    "http://images.cocodataset.org/annotations/annotations_trainval2017.zip"
)

# Download image files
if not os.path.exists(images_dir):
    image_zip = keras.utils.get_file(
        "images.zip",
        cache_dir=os.path.abspath("."),
        origin=images_url,
        extract=True,
    )
    os.remove(image_zip)

# Download annotation files
if not os.path.exists(annotations_dir):
    annotation_zip = keras.utils.get_file(
        "captions.zip",
        cache_dir=os.path.abspath("."),
        origin=annotations_url,
        extract=True,
    )
    os.remove(annotation_zip)

print("The COCO dataset has been downloaded and extracted successfully.")

with open(annotation_file, "r") as f:
    annotations = json.load(f)["annotations"]

# Each entry is one object annotation, so several entries can share an image.
print(f"Number of annotations: {len(annotations)}")

"""
### Contents of the COCO2017 dataset
"""

pprint.pprint(annotations[60])

"""
## Parameters

`num_samples` is the number of data samples in each TFRecord file.

`num_tfrecords` is the total number of TFRecord files that we will create.
"""

num_samples = 4096
num_tfrecords = len(annotations) // num_samples
if len(annotations) % num_samples:
    num_tfrecords += 1  # add one file if there are any remaining samples

os.makedirs(tfrecords_dir, exist_ok=True)  # create the TFRecords output folder
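
"""
The file count computed above is a ceiling division. As a sketch with made-up sizes
(10,000 samples is an arbitrary number, not the size of COCO2017), the snippet below
checks that the floor-division-plus-remainder form matches `math.ceil` and shows how many
samples the last file would hold.

```python
import math

total_samples = 10_000  # made-up dataset size
samples_per_file = 4096

# Same logic as above: floor division, plus one file for any remainder.
num_files = total_samples // samples_per_file
if total_samples % samples_per_file:
    num_files += 1

assert num_files == math.ceil(total_samples / samples_per_file)

# The last file holds whatever is left after the full files.
last_file_size = total_samples - (num_files - 1) * samples_per_file
print(num_files, last_file_size)  # 3 1808
```
"""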

"""
## Define TFRecords helper functions
"""


def image_feature(value):
    """Returns a bytes_list from an image tensor, encoded as JPEG."""
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[tf.io.encode_jpeg(value).numpy()])
    )


def bytes_feature(value):
    """Returns a bytes_list from a string (encoded to bytes)."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value.encode()]))


def float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))


def int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def float_feature_list(value):
    """Returns a float_list from a list of floats / doubles."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))


def create_example(image, path, example):
    feature = {
        "image": image_feature(image),
        "path": bytes_feature(path),
        "area": float_feature(example["area"]),
        "bbox": float_feature_list(example["bbox"]),
        "category_id": int64_feature(example["category_id"]),
        "id": int64_feature(example["id"]),
        "image_id": int64_feature(example["image_id"]),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def parse_tfrecord_fn(example):
    feature_description = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "path": tf.io.FixedLenFeature([], tf.string),
        "area": tf.io.FixedLenFeature([], tf.float32),
        "bbox": tf.io.VarLenFeature(tf.float32),
        "category_id": tf.io.FixedLenFeature([], tf.int64),
        "id": tf.io.FixedLenFeature([], tf.int64),
        "image_id": tf.io.FixedLenFeature([], tf.int64),
    }
    example = tf.io.parse_single_example(example, feature_description)
    # Decode the JPEG bytes back into an image tensor
    example["image"] = tf.io.decode_jpeg(example["image"], channels=3)
    # VarLenFeature is parsed as a sparse tensor; convert it to a dense one
    example["bbox"] = tf.sparse.to_dense(example["bbox"])
    return example
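
"""
To see how the encoding and parsing helpers fit together, here is a minimal round-trip
sketch that does not need the COCO files. It builds one `tf.train.Example` from made-up
values (the path, area, box, and category below are illustrative only), serializes it,
and parses it back with a matching feature description.

```python
import tensorflow as tf

# Made-up values standing in for one COCO annotation.
feature = {
    "path": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"000000000042.jpg"])
    ),
    "area": tf.train.Feature(float_list=tf.train.FloatList(value=[950.0])),
    "bbox": tf.train.Feature(
        float_list=tf.train.FloatList(value=[10.0, 20.0, 30.0, 40.0])
    ),
    "category_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[18])),
}
serialized = tf.train.Example(
    features=tf.train.Features(feature=feature)
).SerializeToString()

# Scalars use FixedLenFeature; the box is read as a variable-length list.
feature_description = {
    "path": tf.io.FixedLenFeature([], tf.string),
    "area": tf.io.FixedLenFeature([], tf.float32),
    "bbox": tf.io.VarLenFeature(tf.float32),
    "category_id": tf.io.FixedLenFeature([], tf.int64),
}
parsed = tf.io.parse_single_example(serialized, feature_description)
bbox = tf.sparse.to_dense(parsed["bbox"])

print(parsed["path"].numpy(), parsed["category_id"].numpy(), bbox.numpy())
```

The feature description has to agree with how each value was written: a type mismatch or
a missing `FixedLenFeature` key without a default raises an error at parse time.
"""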


"""
## Generate data in the TFRecord format

Let's generate the COCO2017 data in the TFRecord format. The file names will follow the
pattern `file_{number}-{sample_count}.tfrec` (this is optional, but including the
sequence number and the sample count in the file names can make counting easier).
"""

for tfrec_num in range(num_tfrecords):
    # Slice out the annotations that belong to this file
    samples = annotations[(tfrec_num * num_samples) : ((tfrec_num + 1) * num_samples)]

    with tf.io.TFRecordWriter(
        tfrecords_dir + "/file_%.2i-%i.tfrec" % (tfrec_num, len(samples))
    ) as writer:
        for sample in samples:
            # COCO image file names are the image id zero-padded to 12 digits
            image_path = f"{images_dir}/{sample['image_id']:012d}.jpg"
            image = tf.io.decode_jpeg(tf.io.read_file(image_path))
            example = create_example(image, image_path, sample)
            writer.write(example.SerializeToString())
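
"""
The slicing and naming used in the loop above can be tried on a toy list. The sketch
below uses made-up sizes (10 items, 4 per file) and only builds the file names, without
writing any TFRecord data.

```python
items = list(range(10))  # made-up stand-in for the annotations list
items_per_file = 4

num_files = -(-len(items) // items_per_file)  # ceiling division

file_names = []
for file_num in range(num_files):
    chunk = items[file_num * items_per_file : (file_num + 1) * items_per_file]
    # "%.2i" pads the file number to at least two digits
    file_names.append("file_%.2i-%i.tfrec" % (file_num, len(chunk)))

print(file_names)
# ['file_00-4.tfrec', 'file_01-4.tfrec', 'file_02-2.tfrec']
```

Only the last file can hold fewer samples than the others, which is why its name ends
with a different count.
"""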

"""
## Explore one sample from the generated TFRecord
"""

raw_dataset = tf.data.TFRecordDataset(f"{tfrecords_dir}/file_00-{num_samples}.tfrec")
parsed_dataset = raw_dataset.map(parse_tfrecord_fn)

for features in parsed_dataset.take(1):
    for key in features.keys():
        if key != "image":
            print(f"{key}: {features[key]}")

    print(f"Image shape: {features['image'].shape}")
    plt.figure(figsize=(7, 7))
    plt.imshow(features["image"].numpy())
    plt.show()

"""
## Train a simple model using the generated TFRecords

Another advantage of TFRecord is that you are able to add many features to it and later
use only a few of them. In this case, we are going to use only `image` and `category_id`.
"""
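
"""
As a sketch of "use only a few of them": the feature description passed to
`tf.io.parse_single_example` decides which features are read, so a record that stores
several features can be parsed with a description that lists just one. The values below
are made up for illustration.

```python
import tensorflow as tf

# A record that stores three made-up features.
feature = {
    "area": tf.train.Feature(float_list=tf.train.FloatList(value=[950.0])),
    "category_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[18])),
    "image_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[42])),
}
serialized = tf.train.Example(
    features=tf.train.Features(feature=feature)
).SerializeToString()

# Only the listed feature is parsed; the others are ignored.
subset_description = {"category_id": tf.io.FixedLenFeature([], tf.int64)}
parsed = tf.io.parse_single_example(serialized, subset_description)

print(list(parsed.keys()), int(parsed["category_id"]))
```
"""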

"""
## Define dataset helper functions
"""


def prepare_sample(features):
    image = keras.ops.image.resize(features["image"], size=(224, 224))
    return image, features["category_id"]


def get_dataset(filenames, batch_size):
    dataset = (
        tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE)
        .map(parse_tfrecord_fn, num_parallel_calls=AUTOTUNE)
        .map(prepare_sample, num_parallel_calls=AUTOTUNE)
        .shuffle(batch_size * 10)
        .batch(batch_size)
        .prefetch(AUTOTUNE)
    )
    return dataset
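
"""
The same map, shuffle, batch, prefetch chain can be tried on a toy dataset to see how
batching behaves. In this sketch, `tf.data.Dataset.range(10)` is a made-up stand-in for
the parsed TFRecord samples, and the doubling `map` stands in for the parsing and
resizing steps.

```python
import tensorflow as tf

batch_size = 4

dataset = (
    tf.data.Dataset.range(10)  # stand-in for parsed TFRecord samples
    .map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(batch_size * 10)
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE)
)

# Shuffling changes the order of elements, not the number per batch.
batch_sizes = [int(batch.shape[0]) for batch in dataset]
print(batch_sizes)  # [4, 4, 2]
```

The final batch is smaller because 10 is not a multiple of the batch size. Passing
`drop_remainder=True` to `batch` would drop it instead.
"""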


train_filenames = tf.io.gfile.glob(f"{tfrecords_dir}/*.tfrec")
batch_size = 32
epochs = 1
steps_per_epoch = 50
AUTOTUNE = tf.data.AUTOTUNE

input_tensor = keras.layers.Input(shape=(224, 224, 3), name="image")
model = keras.applications.EfficientNetB0(
    input_tensor=input_tensor, weights=None, classes=91
)


model.compile(
    optimizer=keras.optimizers.Adam(),
    loss=keras.losses.SparseCategoricalCrossentropy(),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)


model.fit(
    x=get_dataset(train_filenames, batch_size),
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
    verbose=1,
)

"""
## Conclusion

This example demonstrates that, instead of reading images and annotations from different
sources, you can have your data coming from a single source thanks to TFRecord.
This process can make storing and reading data simpler and more efficient.
For more information, you can go to the [TFRecord and
tf.train.Example](https://www.tensorflow.org/tutorials/load_data/tfrecord) tutorial.
"""