"""
Title: Sequential retrieval [GRU4Rec]
Author: [Abheesht Sharma](https://github.com/abheesht17/), [Fabien Hertschuh](https://github.com/hertschuh/)
Date created: 2025/04/28
Last modified: 2025/04/28
Description: Recommend movies using a GRU-based sequential retrieval model.
Accelerator: GPU
"""

"""
## Introduction

In this example, we are going to build a sequential retrieval model. Sequential
recommendation is a popular task in which a model looks at a sequence of items
that users have interacted with previously and then predicts the next item.
Here, the order of the items within each sequence matters, so we are going to
use a recurrent neural network to model the sequential relationship. For more
details, please refer to the [GRU4Rec](https://arxiv.org/abs/1511.06939) paper.

Let's begin by choosing JAX as the backend we want to run on, and import all
the necessary libraries.
"""

"""shell
pip install -q keras-rs
"""

import os

os.environ["KERAS_BACKEND"] = "jax"  # `"tensorflow"`/`"torch"`

import collections
import random

import keras
import pandas as pd
import tensorflow as tf  # Needed only for the dataset

import keras_rs
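
"""
Before diving in, here is a quick, hedged illustration of why a GRU suits this
task: it consumes a batch of sequences of vectors and returns one fixed-size
vector per sequence, which we will later use as the query embedding. The layer
size and input shape below are arbitrary demo values, not part of the original
example.
"""

# A GRU maps (batch, timesteps, features) -> (batch, units).
_demo_gru = keras.layers.GRU(8)
print(_demo_gru(keras.ops.ones((2, 5, 4))).shape)  # (2, 8)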

"""
Let's also define all important variables/hyperparameters below.
"""

DATA_DIR = "./raw/data/"

# MovieLens-specific variables
MOVIELENS_1M_URL = "https://files.grouplens.org/datasets/movielens/ml-1m.zip"
MOVIELENS_ZIP_HASH = "a6898adb50b9ca05aa231689da44c217cb524e7ebd39d264c56e2832f2c54e20"

RATINGS_FILE_NAME = "ratings.dat"
MOVIES_FILE_NAME = "movies.dat"

# Data processing args
MAX_CONTEXT_LENGTH = 10
MIN_SEQUENCE_LENGTH = 3

RATINGS_DATA_COLUMNS = ["UserID", "MovieID", "Rating", "Timestamp"]
MOVIES_DATA_COLUMNS = ["MovieID", "Title", "Genres"]
MIN_RATING = 2

# Training/model args
BATCH_SIZE = 4096
TEST_BATCH_SIZE = 2048
EMBEDDING_DIM = 32
NUM_EPOCHS = 5
LEARNING_RATE = 0.005

"""
## Dataset

Next, we need to prepare our dataset. Like we did in the
[basic retrieval](/keras_rs/examples/basic_retrieval/)
example, we are going to use the MovieLens dataset.

The dataset preparation step is fairly involved. The original ratings dataset
contains `(user, movie ID, rating, timestamp)` tuples (among other columns,
which are not important for this example). Since we are dealing with sequential
retrieval, we need to create movie sequences for every user, where the sequences
are ordered by timestamp.

Let's start by downloading and reading the dataset.
"""

# Download the MovieLens dataset.
if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)

path_to_zip = keras.utils.get_file(
    fname="ml-1m.zip",
    origin=MOVIELENS_1M_URL,
    file_hash=MOVIELENS_ZIP_HASH,
    hash_algorithm="sha256",
    extract=True,
    cache_dir=DATA_DIR,
)
movielens_extracted_dir = os.path.join(
    os.path.dirname(path_to_zip),
    "ml-1m_extracted",
    "ml-1m",
)


# Read the dataset.
def read_data(data_directory, min_rating=None):
    """Read the MovieLens ratings.dat and movies.dat files
    into dataframes.
    """

    ratings_df = pd.read_csv(
        os.path.join(data_directory, RATINGS_FILE_NAME),
        sep="::",
        names=RATINGS_DATA_COLUMNS,
        encoding="unicode_escape",
        engine="python",
    )
    ratings_df["Timestamp"] = ratings_df["Timestamp"].apply(int)

    # Remove ratings below `min_rating`.
    if min_rating is not None:
        ratings_df = ratings_df[ratings_df["Rating"] >= min_rating]

    movies_df = pd.read_csv(
        os.path.join(data_directory, MOVIES_FILE_NAME),
        sep="::",
        names=MOVIES_DATA_COLUMNS,
        encoding="unicode_escape",
        engine="python",
    )
    return ratings_df, movies_df


ratings_df, movies_df = read_data(
    data_directory=movielens_extracted_dir, min_rating=MIN_RATING
)

# We need to know the number of movies to size the embedding layers.
# We use the maximum movie ID because MovieLens IDs start at 1 and may have gaps.
movies_count = movies_df["MovieID"].max()

"""
Now that we have read the dataset, let's create sequences of movies
for every user. Here is the function for doing just that.
"""


def get_movie_sequence_per_user(ratings_df):
    """Get movieID sequences for every user."""
    sequences = collections.defaultdict(list)

    for user_id, movie_id, rating, timestamp in ratings_df.values:
        sequences[user_id].append(
            {
                "movie_id": movie_id,
                "timestamp": timestamp,
                "rating": rating,
            }
        )

    # Sort movie sequences by timestamp for every user.
    for user_id, context in sequences.items():
        context.sort(key=lambda x: x["timestamp"])
        sequences[user_id] = context

    return sequences
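

"""
As a quick, hedged illustration (demo data we add here, not part of the
original pipeline): two ratings given out of timestamp order come back as a
single per-user sequence sorted by timestamp.
"""

_demo_ratings_df = pd.DataFrame(
    {"UserID": [1, 1], "MovieID": [10, 20], "Rating": [4, 5], "Timestamp": [200, 100]}
)
# Movie 20 (timestamp 100) should come before movie 10 (timestamp 200).
print(get_movie_sequence_per_user(_demo_ratings_df))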

"""
We need to do some filtering and processing before we proceed
with training the model:

1. Form sequences of all lengths up to
   `min(user_sequence_length, MAX_CONTEXT_LENGTH)`. So, every user
   will have multiple sequences corresponding to them.
2. Get labels, i.e., given a sequence of length `n`, the first
   `n-1` tokens will be fed to the model as input, and the label
   will be the last token.
3. Remove all user sequences with fewer than `MIN_SEQUENCE_LENGTH`
   movies.
4. Pad all sequences to `MAX_CONTEXT_LENGTH`.

An important point to note is how we form the train-test splits. We do not
form the entire dataset of sequences and then split it into train and test.
Instead, for every user, we take the last sequence to be part of the test set,
and all other sequences to be part of the train set. This prevents data
leakage.
"""


def generate_examples_from_user_sequences(sequences):
    """Generates sequences for all users, with padding, truncation, etc."""

    def generate_examples_from_user_sequence(sequence):
        """Generates examples for a single user sequence."""

        train_examples = []
        test_examples = []
        for label_idx in range(1, len(sequence)):
            start_idx = max(0, label_idx - MAX_CONTEXT_LENGTH)
            context = sequence[start_idx:label_idx]

            # Pad the context with dummy entries up to `MAX_CONTEXT_LENGTH`.
            while len(context) < MAX_CONTEXT_LENGTH:
                context.append(
                    {
                        "movie_id": 0,
                        "timestamp": 0,
                        "rating": 0.0,
                    }
                )

            label_movie_id = int(sequence[label_idx]["movie_id"])
            context_movie_id = [int(movie["movie_id"]) for movie in context]

            example = {
                "context_movie_id": context_movie_id,
                "label_movie_id": label_movie_id,
            }

            # The last example of every user sequence goes to the test set.
            if label_idx == len(sequence) - 1:
                test_examples.append(example)
            else:
                train_examples.append(example)

        return train_examples, test_examples

    all_train_examples = []
    all_test_examples = []
    for sequence in sequences.values():
        if len(sequence) < MIN_SEQUENCE_LENGTH:
            continue

        user_train_examples, user_test_examples = generate_examples_from_user_sequence(
            sequence
        )

        all_train_examples.extend(user_train_examples)
        all_test_examples.extend(user_test_examples)

    return all_train_examples, all_test_examples
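

"""
As a quick, hedged sanity check (synthetic demo data, not part of the original
pipeline): a single user with four interactions should yield two training
examples and one test example, with the last interaction reserved as the test
label.
"""

_demo_sequences = {
    1: [{"movie_id": i, "timestamp": i, "rating": 5.0} for i in range(1, 5)]
}
_demo_train, _demo_test = generate_examples_from_user_sequences(_demo_sequences)
print(f"Train examples: {_demo_train}")
print(f"Test examples: {_demo_test}")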

"""
Let's split the dataset into train and test sets. Also, we need to
change the format of the dataset dictionary so as to enable conversion
to a `tf.data.Dataset` object.
"""

sequences = get_movie_sequence_per_user(ratings_df)
train_examples, test_examples = generate_examples_from_user_sequences(sequences)


def list_of_dicts_to_dict_of_lists(list_of_dicts):
    """Convert a list of dictionaries to a dictionary of lists for
    `tf.data` conversion.
    """
    dict_of_lists = collections.defaultdict(list)
    for dictionary in list_of_dicts:
        for key, value in dictionary.items():
            dict_of_lists[key].append(value)
    return dict_of_lists
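

"""
A quick, hedged one-line illustration of the helper on toy data: two
single-key dictionaries collapse into one dictionary of lists.
"""

print(list_of_dicts_to_dict_of_lists([{"a": 1}, {"a": 2}]))  # contains {'a': [1, 2]}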

train_examples = list_of_dicts_to_dict_of_lists(train_examples)
test_examples = list_of_dicts_to_dict_of_lists(test_examples)

train_ds = tf.data.Dataset.from_tensor_slices(train_examples).map(
    lambda x: (x["context_movie_id"], x["label_movie_id"])
)
test_ds = tf.data.Dataset.from_tensor_slices(test_examples).map(
    lambda x: (x["context_movie_id"], x["label_movie_id"])
)

"""
We need to batch our datasets. We also use `cache()` and `prefetch()`
for better performance.
"""

train_ds = train_ds.batch(BATCH_SIZE).cache().prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.batch(TEST_BATCH_SIZE).cache().prefetch(tf.data.AUTOTUNE)

"""
Let's print out one batch.
"""

for sample in train_ds.take(1):
    print(sample)

"""
## Model and Training

In the basic retrieval example, we used a query tower for the
user and a candidate tower for the candidate movie. We are
going to use a two-tower architecture here as well. However,
here the query tower consists of a Gated Recurrent Unit (GRU) layer
that encodes the sequence of historical movies, while the
candidate tower remains the same embedding table for the candidate movie.

Note: Take a look at how the labels are defined. The label tensor
(of shape `(batch_size, batch_size)`) contains one-hot vectors. The idea
is: for every sample, consider movie IDs corresponding to other samples in
the batch as negatives.
"""


class SequentialRetrievalModel(keras.Model):
    """Create the sequential retrieval model.

    Args:
        movies_count: Total number of unique movies in the dataset.
        embedding_dimension: Output dimension for movie embedding tables.
    """

    def __init__(
        self,
        movies_count,
        embedding_dimension=128,
        **kwargs,
    ):
        super().__init__(**kwargs)
        # Our query tower, simply an embedding table followed by
        # a GRU unit. This encodes the sequence of historical movies.
        self.query_model = keras.Sequential(
            [
                keras.layers.Embedding(movies_count + 1, embedding_dimension),
                keras.layers.GRU(embedding_dimension),
            ]
        )

        # Our candidate tower, simply an embedding table.
        self.candidate_model = keras.layers.Embedding(
            movies_count + 1, embedding_dimension
        )

        # The layer that performs the retrieval.
        self.retrieval = keras_rs.layers.BruteForceRetrieval(k=10, return_scores=False)
        self.loss_fn = keras.losses.CategoricalCrossentropy(
            from_logits=True,
        )

    def build(self, input_shape):
        self.query_model.build(input_shape)
        self.candidate_model.build(input_shape)

        # In this case, the candidates are directly the movie embeddings.
        # We take a shortcut and directly reuse the variable.
        self.retrieval.candidate_embeddings = self.candidate_model.embeddings
        self.retrieval.build(input_shape)
        super().build(input_shape)

    def call(self, inputs, training=False):
        query_embeddings = self.query_model(inputs)
        result = {
            "query_embeddings": query_embeddings,
        }

        if not training:
            # Skip the retrieval of top movies during training as the
            # predictions are not used.
            result["predictions"] = self.retrieval(query_embeddings)
        return result

    def compute_loss(self, x, y, y_pred, sample_weight, training=True):
        candidate_id = y
        query_embeddings = y_pred["query_embeddings"]
        candidate_embeddings = self.candidate_model(candidate_id)

        num_queries = keras.ops.shape(query_embeddings)[0]
        num_candidates = keras.ops.shape(candidate_embeddings)[0]

        # One-hot vectors for labels.
        labels = keras.ops.eye(num_queries, num_candidates)

        # Compute the affinity score by multiplying the two embeddings.
        scores = keras.ops.matmul(
            query_embeddings, keras.ops.transpose(candidate_embeddings)
        )

        return self.loss_fn(labels, scores, sample_weight)


"""
Let's instantiate, compile, and train our model.
"""

model = SequentialRetrievalModel(
    movies_count=movies_count, embedding_dimension=EMBEDDING_DIM
)

# Compile.
model.compile(optimizer=keras.optimizers.AdamW(learning_rate=LEARNING_RATE))

# Train.
model.fit(
    train_ds,
    validation_data=test_ds,
    epochs=NUM_EPOCHS,
)

"""
## Making predictions

Now that we have a model, we would like to be able to make predictions.

So far, we have only handled movies by ID. Now is the time to create a mapping
keyed by movie IDs to be able to surface the titles.
"""

movie_id_to_movie_title = dict(zip(movies_df["MovieID"], movies_df["Title"]))
movie_id_to_movie_title[0] = ""  # ID 0 is used for padding and is not a real movie.

"""
We then simply use the Keras `model.predict()` method. Under the hood, it calls
the `BruteForceRetrieval` layer to perform the actual retrieval.

Note that this model can retrieve movies already watched by the user. We could
easily add logic to remove them if that is desirable; a sketch of one way to do
this follows the predictions below.
"""

print("\n==> Movies the user has watched:")
movie_sequence = test_ds.unbatch().take(1)
for element in movie_sequence:
    for movie_id in element[0][:-1]:
        print(movie_id_to_movie_title[movie_id.numpy()], end=", ")
    print(movie_id_to_movie_title[element[0][-1].numpy()])

predictions = model.predict(movie_sequence.batch(1))
predictions = keras.ops.convert_to_numpy(predictions["predictions"])

print("\n==> Recommended movies for the above sequence:")
for movie_id in predictions[0]:
    print(movie_id_to_movie_title[movie_id])
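

"""
As mentioned above, the model may recommend movies the user has already
watched. Here is a minimal, hedged sketch of how one could filter those out,
added for illustration and not part of the original example.
"""

# Collect the IDs of the movies in the user's context sequence
# (this also includes the padding ID 0, which is fine to exclude).
watched_movie_ids = set()
for element in movie_sequence:
    watched_movie_ids.update(int(movie_id) for movie_id in element[0])

print("\n==> Recommended movies, excluding already watched ones:")
for movie_id in predictions[0]:
    if int(movie_id) not in watched_movie_ids:
        print(movie_id_to_movie_title[movie_id])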