Collaborative Filtering

Collaborative filtering is a way recommendation systems filter information by using the preferences of other people. It uses the assumption that if person A has similar preferences to person B on items they have both reviewed, then person A is likely to have a similar preference to person B on an item only person B has reviewed.

Alice Bob Chris Devin

Justin Bieber + - + +

Red Hot Chili Peppers - + + +

The Beatles + + + +

One Direction + + - ?

Above is a table representing the music tastes of a group of people ("+" represents like and "-" represents dislike). We are trying to determine whether Devin likes One Direction or not. A collaborative filtering system might determine that Devin's preferences most closely resemble Chris' and determines that Devin probably does not like One Direction.

Collaborative filtering is used by many recommendation systems in various fields, including music, shopping, financial data, and social networks and by various services (YouTube, Reddit, Last.fm). Any service that uses a recommendation system most likely employs collaborative filtering.

Overview

With the growth of the internet and faster computers, large amounts of data are being stored by companies all over the world. This has resulted in the rise of a subfield within computer science called big data, which focuses on the problems of storing and analyzing vast troves of data. Collaborative filtering is a way of extracting useful information from this data, in a general process called information filtering. The algorithm compares a user with other similar users (in terms of preferences) and recommends a specific product or action based on these similarities.

Specifically, a collaborative filtering scheme uses the following steps:

A user expresses preferences of items, usually by rating them.
The algorithm finds other users with similar tastes to the given user.
The system recommends items that the user has not yet rated (thus, likely being new to the user) and that are highly rated by users similar to the given user.

A caveat in this process includes active participation by the users of the service. In order to be recommended an item, the user must have liked or disliked items in the past; otherwise, the system will not provide good recommendations.

Differences in Implementation

There are two main ways of implementing collaborative filtering - a memory-based approach and a model-based approach.

The memory-based approach calculates similarity ( $s_{i,j}$ ) between users $i$ and $j$ in the following manner:

$s_{i,j} = \sum\limits_{p \in P} w_p \operatorname{simil}(p_i, p_j)$

$p$ is a given preference in the set of preferences $P$ .
$w_p$ is a weight associated with a given preference to determine its relevance. For example, two people rating a popular rock song highly could be more relevant than rating an obscure song highly, and thus could be assigned a higher weight.
$\operatorname{simil}$ is a user-defined similarity function that takes in two user preferences ( $p_i$ and $p_j$ ) and determines how similar the preferences are to each other (higher means more similar). For example, a simple but effective similarity function is one that returns 1 if the preferences are the same, and returns 0 if they are not.

The similarity calculation as a whole acts as a variant of the k-nearest neighbors algorithm by determining which users are the closest neighbors to the given user by preferences. The benefits of the memory-based approach are that it is easy to implement and very effective. However, some issues include data sparsity, which occurs when there is not enough data to make good recommendations, and scalability, since with a large dataset more computation is required to determine similarity.

The model-based approach uses machine learning techniques to find patterns in the dataset. For example, neural networks can be used to find trends among item preferences. Advantages to this approach include easy scalability, but can lead to expensive model building for the neural network.

Many collaborative filtering systems use a hybrid approach, which is a combination of the memory-based and model-based approaches. Though such systems are expensive and complex to implement, they overcome the shortcomings of each of the above approaches.

Animation showing a memory-based approach to collaborative filtering^[1]

Challenges

There are various known challenges regarding collaborative filtering. The following are some notable problems.

Lack of data - In order to start recommending, the system must first obtain a sufficient number of a user's past preferences. This delays the usefulness of the recommendation system until this number is met. Additionally, new products must be rated by a sufficient number of people before being recommended. These issues cause a delay in the usefulness of collaborative filtering to users.
Scalability - With more and more people and more and more preferences, the collaborative filtering algorithm becomes computationally intensive and can take a lot of time to return a result. There are many ways to combat this issue, from using subsets of data to returning a result after a certain similarity threshold is reached.
Synonyms - Collaborative filtering may not be able to distinguish two products that are the same (for example, different names for the same item on Amazon or a song and a fan cover of the same song). In this instance, the algorithm may recommend the same product unknowingly and will unnecessarily perform extra computation in processing that item rather than avoiding it.
Shilling attacks - Users can rate their own products highly and rate competitors' products poorly - this can cause the filtering algorithm to prefer one product over its competitors', a huge problem for services that guarantee fairness to its users. Additionally, recommendations within a community will prefer one point of view over another. Examples of this include Reddit, Facebook, and Buzzfeed - where the top-rated links will be biased towards the community's preferences.
Preferring old products - Because old products have more data associated with them over new products, the algorithm might recommend these old products, rather than the new products users are looking for. This defeats the purpose for some applications, where the user might use the service to find new content.
Gray sheep - Some users have preferences that do not consistently agree with any group of people (gray sheep). These users do not find collaborative filtering particularly helpful when determining their wants.

Implementation

The following is an implementation of a simple collaborative filtering algorithm for the first example on this page.

# List of people mapped to their preferences
# 1 represents "like" and 0 represents "dislike"
# -1 represents an unknown preference
person_to_preferences = {
    "Alice" : [1, 0, 1, 1],
    "Bob": [0, 1, 1, 1],
    "Chris": [1, 1, 1, 0],
    "Devin": [1, 1, 1, -1]
}

# Person and preference index to determine
person_to_recommend = "Devin"
preference_index = 3

def collaborative_filtering():
    # Keep track of similarity score for each person
    person_to_score = {}

    person_to_recommend_preferences = person_to_preferences[person_to_recommend]

    for person in person_to_preferences:
        # Continue if the person is the same as the person to recommend to
        if person == person_to_recommend:
            continue

        # Initialize similarity score to 0
        person_to_score[person] = 0

        preferences = person_to_preferences[person]

        for i in range(len(preferences)):
            # Continue if this is the preference we are trying to recommend
            if person_to_recommend_preferences[i] == -1:
                continue

            # Add a "point" if the person shares similar tastes with the person to recommend to
            if preferences[i] == person_to_recommend_preferences[i]:
                person_to_score[person] += 1

    # Determine the person with the most similar preferences
    max_score = -1
    max_person = ""

    for person in person_to_score:
        if person_to_score[person] > max_score:
            max_score = person_to_score[person]
            max_person = person

    return person_to_preferences[max_person][preference_index]

References

Moshanin, . Collaborative_filtering.gif. Retrieved June 2, 2016, from https://commons.wikimedia.org/wiki/File:Collaborative_filtering.gif

Contents