60 Minute Project
Introduction
For each book summary I publish, I recommend three similar book summaries. Creating these horizontal paths through content catalogs has long been considered best practice. Initially, I selected recommendations manually, choosing relevant books from the back catalog. However, this approach has limitations. Early on, recommendations were challenging due to a lack of options. Now, with more books under my belt, it's becoming difficult to make the best selections. Time to implement an automated recommendation system to improve this process…
Theory
Embeddings can turn text (book summaries in this case) into numerical representations that capture their meaning. Each embedding is a vector: a long list of numbers that places the text as a point in a high-dimensional space. In this semantic vector space, texts with similar meanings end up close together, even if they use different words. These vectors are created by machine learning models trained on vast amounts of text, which learn to represent language in a way that reflects concepts, tone, and context. I used OpenAI's 'text-embedding-3-small' model.
Cosine similarity helps determine which texts are similar based on the embeddings. It measures how closely two vectors point in the same direction in the embedding vector space. A cosine similarity close to 1 means the summaries are very similar; closer to 0 means they're unrelated. This approach lets us recommend similar content based on meaning, not just keywords (if you liked this, you'll probably like that).
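To make the cosine similarity idea concrete, here's a minimal sketch in plain NumPy. The three vectors are made-up toy examples with 3 dimensions, not real embeddings (text-embedding-3-small vectors are much longer):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (purely illustrative values)
v1 = [0.9, 0.1, 0.0]   # e.g. a business book
v2 = [0.8, 0.2, 0.1]   # a similar business book
v3 = [0.0, 0.1, 0.9]   # an unrelated novel

print(cos_sim(v1, v2))  # close to 1: similar direction
print(cos_sim(v1, v3))  # close to 0: nearly orthogonal
```

Because the measure only compares direction, not magnitude, two summaries of very different lengths can still score as near-identical in meaning.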
Code
The code is pretty straightforward. I suspect most decent LLMs could zero-shot this…
Generate Embeddings
import numpy as np
import pandas as pd
from openai import OpenAI
from tqdm import tqdm

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def embed(text, model="text-embedding-3-small", max_chars=3000):
    text = text[:max_chars]  # Truncate long summaries
    resp = client.embeddings.create(input=[text], model=model)
    return resp.data[0].embedding

tqdm.pandas()  # Enables progress_apply, which shows a progress bar
df["embedding"] = df["Book Summary"].progress_apply(embed)
Compute Cosine Similarity & Print Results
from sklearn.metrics.pairwise import cosine_similarity

emb_matrix = np.vstack(df["embedding"].values)
sim = cosine_similarity(emb_matrix)  # n x n pairwise similarity matrix

recs = []
for idx, row in df.iterrows():  # Assumes a default 0..n-1 integer index
    top = np.argsort(sim[idx])[::-1][1:4]  # Sort descending, skip self-match
    recs.append({
        "Book Title": row["Book Title"],
        "Recommendation 1": df.iloc[top[0]]["Book Title"],
        "Recommendation 2": df.iloc[top[1]]["Book Title"],
        "Recommendation 3": df.iloc[top[2]]["Book Title"],
    })

rec_df = pd.DataFrame(recs)
rec_df.head()
Data Visualised
Lastly, I had a feeling the recommendation data would look good visualised, and it was fun to build. I simplified the implementation to a single file so it was easy to deploy with Vercel.