Netflix: recommending movies to retain customers
Objectives
- understand the problem of recommending movies
- develop an algorithm for the subproblem of distance function
- implement the algorithm in a language
- practice implementation with loops and arrays
Prerequisites
- basic math operators, loops, single-dimensional array
Reading
Materials
- Titles of 10 movies that students might have watched
- Activity before this lesson: ask the students to rate the movies (1 to 5, 0 if no rating)
- Collect the ratings into movieRatings.txt: each line has a student name--one word, followed by 10 ratings, separated by a space(s) (sample input file)
- Program with the distance function not defined: Netflix.java [sample solution--complete program]
Activities
- Discuss the problem of recommending movies: netflix.com (predicting the rating of a movie that a customer has not rated and recommending movies with the highest predicted ratings)
- Discuss the importance to netflix: netflixprize.com (if movie recommendation is accurate, the customers have a better experience and are more likely to stay with the company)
- Discuss the abstraction of using only moving ratings (1-5) for recommendation (no other information such as titles, genre, actors/actresses)
- Form small groups (3-4 students each) and ask each group to discuss how they would solve the problem
- Discuss one algorithm called "nearest neighbor" (find the most similar or least different customer and predict with the rating he/she has)
- Discuss how to define the distance (or similarity) between two customers in terms of ratings as a function ( distance(ratings1, ratings2) )
- Ask what the desirable properties are for the distance function:
- distance is larger when ratings are more different for the same movie
- distance is larger when more ratings are different for all movies
- Ask what the different cases are for one movie and two customers:
- if a movie is rated by both customers ... (easy case: distance is *absolute* difference)
- if a movie is rated by only one customer ... (assume the unknown rating is a 3, ie neutral)
- if a movie is not rated by both customers ... (ignore, zero)
- Ask how to handle multiple movies (add the distance for each movie to get the total, ignoring the one we're trying to predict)
- Ask them to implement the distance function
- Using different input values, check if their output matches the output from the sample solution.
Assessment
- Besides titles, genre, actors/actresses, what other details did the abstraction ignore? (directors, length, MPAA ratings, age/gender/... of customer...)
- Besides data for movies, how can the algorithm be applied to other kinds of data and recommendation? (products on amazon/walmart, toys, books, songs, ...)
- What could be another distance function between two customers in terms of ratings? (Many possibilities: Euclidean distance: sqrt(sum of squared distance)--each movie is one dimension, ...)
- The nearest neighbor algorithm can be generalized to k-nearest neighbor algorithm, how you combine ratings from k neighbors? (simple average, weighted average by distance)
- Implement combineRatings(ratings, distances, k), where ratings and distances are arrays of size k.