"""
Title: Sequential retrieval [GRU4Rec]
Author: [Abheesht Sharma](https://github.com/abheesht17/), [Fabien Hertschuh](https://github.com/hertschuh/)
Date created: 2025/04/28
Last modified: 2025/04/28
Description: Recommend movies using a GRU-based sequential retrieval model.
Accelerator: GPU
"""

"""
## Introduction

In this example, we are going to build a sequential retrieval model. Sequential
recommendation is a popular task in which a model looks at a sequence of items
that users have interacted with previously and then predicts the next item.
Here, the order of the items within each sequence matters, so we are going to
use a recurrent neural network to model the sequential relationship. For more
details, please refer to the [GRU4Rec](https://arxiv.org/abs/1511.06939) paper.

Let's begin by choosing JAX as the backend we want to run on, and import all
the necessary libraries.
"""

"""shell
pip install -q keras-rs
"""

import os

os.environ["KERAS_BACKEND"] = "jax"  # `"tensorflow"`/`"torch"`

import collections
import random

import keras
import pandas as pd
import tensorflow as tf  # Needed only for the dataset

import keras_rs
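
"""
Before diving in, here is a quick, hedged illustration of why a GRU suits this
task: it consumes a batch of sequences of vectors and returns one fixed-size
vector per sequence, which we will later use as the query embedding. The layer
size and input shape below are arbitrary demo values, not part of the original
example.
"""

# A GRU maps (batch, timesteps, features) -> (batch, units).
_demo_gru = keras.layers.GRU(8)
print(_demo_gru(keras.ops.ones((2, 5, 4))).shape)  # (2, 8)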

"""
Let's also define all important variables/hyperparameters below.
"""

DATA_DIR = "./raw/data/"

# MovieLens-specific variables
MOVIELENS_1M_URL = "https://files.grouplens.org/datasets/movielens/ml-1m.zip"
MOVIELENS_ZIP_HASH = "a6898adb50b9ca05aa231689da44c217cb524e7ebd39d264c56e2832f2c54e20"

RATINGS_FILE_NAME = "ratings.dat"
MOVIES_FILE_NAME = "movies.dat"

# Data processing args
MAX_CONTEXT_LENGTH = 10
MIN_SEQUENCE_LENGTH = 3

RATINGS_DATA_COLUMNS = ["UserID", "MovieID", "Rating", "Timestamp"]
MOVIES_DATA_COLUMNS = ["MovieID", "Title", "Genres"]
MIN_RATING = 2

# Training/model args
BATCH_SIZE = 4096
TEST_BATCH_SIZE = 2048
EMBEDDING_DIM = 32
NUM_EPOCHS = 5
LEARNING_RATE = 0.005

"""
## Dataset

Next, we need to prepare our dataset. Like we did in the
[basic retrieval](/keras_rs/examples/basic_retrieval/)
example, we are going to use the MovieLens dataset.

The dataset preparation step is fairly involved. The original ratings dataset
contains `(user, movie ID, rating, timestamp)` tuples (among other columns,
which are not important for this example). Since we are dealing with sequential
retrieval, we need to create movie sequences for every user, where the sequences
are ordered by timestamp.

Let's start by downloading and reading the dataset.
"""

# Download the MovieLens dataset.
if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)

path_to_zip = keras.utils.get_file(
    fname="ml-1m.zip",
    origin=MOVIELENS_1M_URL,
    file_hash=MOVIELENS_ZIP_HASH,
    hash_algorithm="sha256",
    extract=True,
    cache_dir=DATA_DIR,
)
movielens_extracted_dir = os.path.join(
    os.path.dirname(path_to_zip),
    "ml-1m_extracted",
    "ml-1m",
)


# Read the dataset.
def read_data(data_directory, min_rating=None):
    """Read the MovieLens ratings.dat and movies.dat files
    into dataframes.
    """

    ratings_df = pd.read_csv(
        os.path.join(data_directory, RATINGS_FILE_NAME),
        sep="::",
        names=RATINGS_DATA_COLUMNS,
        encoding="unicode_escape",
        engine="python",
    )
    ratings_df["Timestamp"] = ratings_df["Timestamp"].apply(int)

    # Remove ratings below `min_rating`.
    if min_rating is not None:
        ratings_df = ratings_df[ratings_df["Rating"] >= min_rating]

    movies_df = pd.read_csv(
        os.path.join(data_directory, MOVIES_FILE_NAME),
        sep="::",
        names=MOVIES_DATA_COLUMNS,
        encoding="unicode_escape",
        engine="python",
    )
    return ratings_df, movies_df


ratings_df, movies_df = read_data(
    data_directory=movielens_extracted_dir, min_rating=MIN_RATING
)

# We need to know the number of movies to size the embedding layers.
# We use the maximum movie ID because MovieLens IDs start at 1 and may have gaps.
movies_count = movies_df["MovieID"].max()

"""
Now that we have read the dataset, let's create sequences of movies
for every user. Here is the function for doing just that.
"""


def get_movie_sequence_per_user(ratings_df):
    """Get movieID sequences for every user."""
    sequences = collections.defaultdict(list)

    for user_id, movie_id, rating, timestamp in ratings_df.values:
        sequences[user_id].append(
            {
                "movie_id": movie_id,
                "timestamp": timestamp,
                "rating": rating,
            }
        )

    # Sort movie sequences by timestamp for every user.
    for user_id, context in sequences.items():
        context.sort(key=lambda x: x["timestamp"])
        sequences[user_id] = context

    return sequences
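

"""
As a quick, hedged illustration (demo data we add here, not part of the
original pipeline): two ratings given out of timestamp order come back as a
single per-user sequence sorted by timestamp.
"""

_demo_ratings_df = pd.DataFrame(
    {"UserID": [1, 1], "MovieID": [10, 20], "Rating": [4, 5], "Timestamp": [200, 100]}
)
# Movie 20 (timestamp 100) should come before movie 10 (timestamp 200).
print(get_movie_sequence_per_user(_demo_ratings_df))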

"""
We need to do some filtering and processing before we proceed
with training the model:

1. Form sequences of all lengths up to
   `min(user_sequence_length, MAX_CONTEXT_LENGTH)`. So, every user
   will have multiple sequences corresponding to them.
2. Get labels, i.e., given a sequence of length `n`, the first
   `n-1` tokens will be fed to the model as input, and the label
   will be the last token.
3. Remove all user sequences with fewer than `MIN_SEQUENCE_LENGTH`
   movies.
4. Pad all sequences to `MAX_CONTEXT_LENGTH`.

An important point to note is how we form the train-test splits. We do not
form the entire dataset of sequences and then split it into train and test.
Instead, for every user, we take the last sequence to be part of the test set,
and all other sequences to be part of the train set. This prevents data
leakage.
"""


def generate_examples_from_user_sequences(sequences):
    """Generates sequences for all users, with padding, truncation, etc."""

    def generate_examples_from_user_sequence(sequence):
        """Generates examples for a single user sequence."""

        train_examples = []
        test_examples = []
        for label_idx in range(1, len(sequence)):
            start_idx = max(0, label_idx - MAX_CONTEXT_LENGTH)
            context = sequence[start_idx:label_idx]

            # Pad the context with dummy entries up to `MAX_CONTEXT_LENGTH`.
            while len(context) < MAX_CONTEXT_LENGTH:
                context.append(
                    {
                        "movie_id": 0,
                        "timestamp": 0,
                        "rating": 0.0,
                    }
                )

            label_movie_id = int(sequence[label_idx]["movie_id"])
            context_movie_id = [int(movie["movie_id"]) for movie in context]

            example = {
                "context_movie_id": context_movie_id,
                "label_movie_id": label_movie_id,
            }

            # The last example of every user sequence goes to the test set.
            if label_idx == len(sequence) - 1:
                test_examples.append(example)
            else:
                train_examples.append(example)

        return train_examples, test_examples

    all_train_examples = []
    all_test_examples = []
    for sequence in sequences.values():
        if len(sequence) < MIN_SEQUENCE_LENGTH:
            continue

        user_train_examples, user_test_examples = generate_examples_from_user_sequence(
            sequence
        )

        all_train_examples.extend(user_train_examples)
        all_test_examples.extend(user_test_examples)

    return all_train_examples, all_test_examples
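

"""
As a quick, hedged sanity check (synthetic demo data, not part of the original
pipeline): a single user with four interactions should yield two training
examples and one test example, with the last interaction reserved as the test
label.
"""

_demo_sequences = {
    1: [{"movie_id": i, "timestamp": i, "rating": 5.0} for i in range(1, 5)]
}
_demo_train, _demo_test = generate_examples_from_user_sequences(_demo_sequences)
print(f"Train examples: {_demo_train}")
print(f"Test examples: {_demo_test}")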

"""
Let's split the dataset into train and test sets. Also, we need to
change the format of the dataset dictionary so as to enable conversion
to a `tf.data.Dataset` object.
"""

sequences = get_movie_sequence_per_user(ratings_df)
train_examples, test_examples = generate_examples_from_user_sequences(sequences)


def list_of_dicts_to_dict_of_lists(list_of_dicts):
    """Convert a list of dictionaries to a dictionary of lists for
    `tf.data` conversion.
    """
    dict_of_lists = collections.defaultdict(list)
    for dictionary in list_of_dicts:
        for key, value in dictionary.items():
            dict_of_lists[key].append(value)
    return dict_of_lists
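

"""
A quick, hedged one-line illustration of the helper on toy data: two
single-key dictionaries collapse into one dictionary of lists.
"""

print(list_of_dicts_to_dict_of_lists([{"a": 1}, {"a": 2}]))  # contains {'a': [1, 2]}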

train_examples = list_of_dicts_to_dict_of_lists(train_examples)
test_examples = list_of_dicts_to_dict_of_lists(test_examples)

train_ds = tf.data.Dataset.from_tensor_slices(train_examples).map(
    lambda x: (x["context_movie_id"], x["label_movie_id"])
)
test_ds = tf.data.Dataset.from_tensor_slices(test_examples).map(
    lambda x: (x["context_movie_id"], x["label_movie_id"])
)

"""
We need to batch our datasets. We also use `cache()` and `prefetch()`
for better performance.
"""

train_ds = train_ds.batch(BATCH_SIZE).cache().prefetch(tf.data.AUTOTUNE)
test_ds = test_ds.batch(TEST_BATCH_SIZE).cache().prefetch(tf.data.AUTOTUNE)

"""
Let's print out one batch.
"""

for sample in train_ds.take(1):
    print(sample)

"""
## Model and Training

In the basic retrieval example, we used a query tower for the
user and a candidate tower for the candidate movie. We are
going to use a two-tower architecture here as well. However,
here the query tower consists of a Gated Recurrent Unit (GRU) layer
that encodes the sequence of historical movies, while the
candidate tower remains the same embedding table for the candidate movie.

Note: Take a look at how the labels are defined. The label tensor
(of shape `(batch_size, batch_size)`) contains one-hot vectors. The idea
is: for every sample, consider movie IDs corresponding to other samples in
the batch as negatives.
"""


class SequentialRetrievalModel(keras.Model):
    """Create the sequential retrieval model.

    Args:
        movies_count: Total number of unique movies in the dataset.
        embedding_dimension: Output dimension for movie embedding tables.
    """

    def __init__(
        self,
        movies_count,
        embedding_dimension=128,
        **kwargs,
    ):
        super().__init__(**kwargs)
        # Our query tower, simply an embedding table followed by
        # a GRU unit. This encodes the sequence of historical movies.
        self.query_model = keras.Sequential(
            [
                keras.layers.Embedding(movies_count + 1, embedding_dimension),
                keras.layers.GRU(embedding_dimension),
            ]
        )

        # Our candidate tower, simply an embedding table.
        self.candidate_model = keras.layers.Embedding(
            movies_count + 1, embedding_dimension
        )

        # The layer that performs the retrieval.
        self.retrieval = keras_rs.layers.BruteForceRetrieval(k=10, return_scores=False)
        self.loss_fn = keras.losses.CategoricalCrossentropy(
            from_logits=True,
        )

    def build(self, input_shape):
        self.query_model.build(input_shape)
        self.candidate_model.build(input_shape)

        # In this case, the candidates are directly the movie embeddings.
        # We take a shortcut and directly reuse the variable.
        self.retrieval.candidate_embeddings = self.candidate_model.embeddings
        self.retrieval.build(input_shape)
        super().build(input_shape)

    def call(self, inputs, training=False):
        query_embeddings = self.query_model(inputs)
        result = {
            "query_embeddings": query_embeddings,
        }

        if not training:
            # Skip the retrieval of top movies during training as the
            # predictions are not used.
            result["predictions"] = self.retrieval(query_embeddings)
        return result

    def compute_loss(self, x, y, y_pred, sample_weight, training=True):
        candidate_id = y
        query_embeddings = y_pred["query_embeddings"]
        candidate_embeddings = self.candidate_model(candidate_id)

        num_queries = keras.ops.shape(query_embeddings)[0]
        num_candidates = keras.ops.shape(candidate_embeddings)[0]

        # One-hot vectors for labels.
        labels = keras.ops.eye(num_queries, num_candidates)

        # Compute the affinity score by multiplying the two embeddings.
        scores = keras.ops.matmul(
            query_embeddings, keras.ops.transpose(candidate_embeddings)
        )

        return self.loss_fn(labels, scores, sample_weight)


"""
Let's instantiate, compile, and train our model.
"""

model = SequentialRetrievalModel(
    movies_count=movies_count, embedding_dimension=EMBEDDING_DIM
)

# Compile.
model.compile(optimizer=keras.optimizers.AdamW(learning_rate=LEARNING_RATE))

# Train.
model.fit(
    train_ds,
    validation_data=test_ds,
    epochs=NUM_EPOCHS,
)

"""
## Making predictions

Now that we have a model, we would like to be able to make predictions.

So far, we have only handled movies by ID. Now is the time to create a mapping
keyed by movie IDs to be able to surface the titles.
"""

movie_id_to_movie_title = dict(zip(movies_df["MovieID"], movies_df["Title"]))
movie_id_to_movie_title[0] = ""  # ID 0 is used for padding and is not a real movie.

"""
We then simply use the Keras `model.predict()` method. Under the hood, it calls
the `BruteForceRetrieval` layer to perform the actual retrieval.

Note that this model can retrieve movies already watched by the user. We could
easily add logic to remove them if that is desirable; a sketch of one way to do
this follows the predictions below.
"""

print("\n==> Movies the user has watched:")
movie_sequence = test_ds.unbatch().take(1)
for element in movie_sequence:
    for movie_id in element[0][:-1]:
        print(movie_id_to_movie_title[movie_id.numpy()], end=", ")
    print(movie_id_to_movie_title[element[0][-1].numpy()])

predictions = model.predict(movie_sequence.batch(1))
predictions = keras.ops.convert_to_numpy(predictions["predictions"])

print("\n==> Recommended movies for the above sequence:")
for movie_id in predictions[0]:
    print(movie_id_to_movie_title[movie_id])
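

"""
As mentioned above, the model may recommend movies the user has already
watched. Here is a minimal, hedged sketch of how one could filter those out,
added for illustration and not part of the original example.
"""

# Collect the IDs of the movies in the user's context sequence
# (this also includes the padding ID 0, which is fine to exclude).
watched_movie_ids = set()
for element in movie_sequence:
    watched_movie_ids.update(int(movie_id) for movie_id in element[0])

print("\n==> Recommended movies, excluding already watched ones:")
for movie_id in predictions[0]:
    if int(movie_id) not in watched_movie_ids:
        print(movie_id_to_movie_title[movie_id])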