60 Minute Project
Introduction
For each book summary I publish, I recommend three similar book summaries. Creating these horizontal paths through content catalogs has long been considered best practice. Initially, I selected recommendations manually, choosing relevant books from the back catalog. However, this approach has limitations. Early on, recommendations were challenging due to a lack of options. Now, with more books under my belt, it's becoming difficult to make the best selections. Time to implement an automated recommendation system to improve this process…
Theory
Embeddings can turn text (book summaries in this case) into numerical representations that capture their meaning. Each embedding is a vector: a long list of numbers that places the text as a point in a high-dimensional space. In this semantic vector space, texts with similar meanings end up close together, even if they use different words. These vectors are created by machine learning models trained on vast amounts of text, which learn to represent language in a way that reflects concepts, tone, and context. I used OpenAI's 'text-embedding-3-small' model.
Cosine similarity helps determine which texts are similar based on the embeddings. It measures how closely two vectors point in the same direction in the embedding vector space. A cosine similarity close to 1 means the summaries are very similar; closer to 0 means they're unrelated. This approach lets us recommend similar content based on meaning, not just keywords (if you liked this, you'll probably like that).
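To make the cosine similarity idea concrete, here's a minimal sketch in plain NumPy. The three vectors are made-up toy examples with 3 dimensions, not real embeddings (text-embedding-3-small vectors are much longer):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (purely illustrative values)
v1 = [0.9, 0.1, 0.0]   # e.g. a business book
v2 = [0.8, 0.2, 0.1]   # a similar business book
v3 = [0.0, 0.1, 0.9]   # an unrelated novel

print(cos_sim(v1, v2))  # close to 1: similar direction
print(cos_sim(v1, v3))  # close to 0: nearly orthogonal
```

Because the measure only compares direction, not magnitude, two summaries of very different lengths can still score as near-identical in meaning.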
Code
The code is pretty straightforward. I suspect most decent LLMs could zero-shot this…
Generate Embeddings
import numpy as np
import pandas as pd
from openai import OpenAI
from tqdm import tqdm

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def embed(text, model="text-embedding-3-small", max_chars=3000):
    text = text[:max_chars]  # Truncate long summaries
    resp = client.embeddings.create(input=[text], model=model)
    return resp.data[0].embedding

tqdm.pandas()  # Enables progress_apply, which shows a progress bar
df["embedding"] = df["Book Summary"].progress_apply(embed)
Compute Cosine Similarity & Print Results
from sklearn.metrics.pairwise import cosine_similarity

emb_matrix = np.vstack(df["embedding"].values)
sim = cosine_similarity(emb_matrix)  # n x n pairwise similarity matrix

recs = []
for idx, row in df.iterrows():  # Assumes a default 0..n-1 integer index
    top = np.argsort(sim[idx])[::-1][1:4]  # Sort descending, skip self-match
    recs.append({
        "Book Title": row["Book Title"],
        "Recommendation 1": df.iloc[top[0]]["Book Title"],
        "Recommendation 2": df.iloc[top[1]]["Book Title"],
        "Recommendation 3": df.iloc[top[2]]["Book Title"],
    })

rec_df = pd.DataFrame(recs)
rec_df.head()
Data Visualised
Lastly, I had a feeling the recommendation data would look good visualised, and it was fun to build. I simplified the implementation to a single file so it was easy to deploy with Vercel.