PageRank

A simple implementation of Google's PageRank algorithm. Uses a NodeJS server to store the adjacency list state and Python to calculate the rank of each page.

DISCLAIMER: I do not claim to be an expert in PageRank, its use cases, etc. This is a project that I pursued for fun to see the PageRank algorithm in action. This README is intended as a cursory discussion about the history of PageRank, the algorithm itself, and its pros and cons.

What is PageRank?

  Whenever you do a Google search, and you see the results appear on the page, do you ever wonder how/why the results are ordered the way they are? This is, in part, due to Google's PageRank algorithm! The algorithm, to put it simply, analyzes the number of incoming links to each page on the web, and assigns each page a rank determined by the rank of each page linking to it. PageRank follows a probabilistic model of a web surfer, who visits a certain page, and then clicks a link on that page at random. In essence, the rank of each page represents the probability that any web surfer will land on it. Although it is conceptually simple, PageRank is widely considered to be one of the most important algorithms ever invented in the tech industry.

The History of PageRank

  Although PageRank is popularly associated with Larry Page and Sergey Brin, the founders of Google, an early version of the eigenvalue problem was explored in 1976 by Gabriel Pinski and Francis Narin in their attempt to rank scientific journals. Thomas Saaty explored the same problem in 1977 in his Analytic Hierarchy Process, a framework for weighting decisions, and Bradley Love and Steven Sloman used it in 1995 as a cognitive model for concepts.

  However, it wasn't until 1996 that Larry Page and Sergey Brin would adopt the algorithm for use in a search engine. As part of a Stanford research project on building a new kind of search engine, Sergey Brin conceptualized the idea of ranking a web page by the number of links pointing to it. Their first paper on the subject was published in 1998 and included the initial prototype of the Google search engine, and Page and Brin also patented the process (Stanford University holds the patent, but granted Google exclusive licensing rights in exchange for 1.8 million of its shares). Since then, the algorithm has remained an integral component of Google's search engine, although it is not the only criterion Google uses to rank its search results (in part due to the many vulnerabilities of the algorithm, which you can read about below).

  As an algorithm applicable to any graph in general, PageRank has been used in many other applications besides search engines, including analysis of road networks, neuroscience, bibliometrics, literature, and even sports! Within the tech industry, variants of the PageRank algorithm remain in use today; Twitter uses it to suggest users to follow, and Facebook implements it behind the scenes as part of its suggested friends algorithm.

The Algorithm Explained

  Let's start by visualizing the World Wide Web as a series of vertices and edges in a directed graph. Each vertex represents a webpage, and each edge represents a link from that page to another. Each edge also carries a weight which is influenced by its origin page's rank.

For this example, imagine the World Wide Web has only five pages, with their links shown below (Page 0 links to Page 1; Page 1 links to Page 4; Page 2 links to Pages 0, 1, and 3; Page 3 has no outgoing links; and Page 4 links to Page 1):

  To calculate each page's rank, the weight of each incoming link must be computed from the rank of the page it comes from. But in order to do that, we have to calculate each of those pages' ranks, then the ranks of the pages linking to them, and so on. When you consider that links can be cyclical, too, this task seems practically impossible!

  Fortunately, we do have a very effective and relatively simple way of doing this calculation! This is where linear algebra comes to the rescue. We can create a matrix representation of this graph and use iterative vector multiplication to achieve our result.

Here is the example graph represented as an adjacency list:

[[1],[4],[0,1,3],[],[1]]

Which corresponds to the adjacency matrix below (We will use 0-based indexing for all the matrices):

[[0,0,1,0,0],
 [1,0,1,0,1],
 [0,0,0,0,0],
 [0,0,1,0,0],
 [0,1,0,0,0]]

  Each entry matrix[i][j] represents a link from page j to page i. If the link exists, it is given a value of 1. If not, it is given a value of 0.
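
For reference, here is a short sketch (not a function from the repo) that builds this adjacency matrix from the adjacency list, following the matrix[i][j] convention above:

import numpy as np

def create_adjacency(adj_list):
  n = len(adj_list)
  matrix = np.zeros((n, n)) #n x n matrix of zeroes
  for col, line in enumerate(adj_list):
    for row in line:
      matrix[row][col] = 1 #link from page col to page row
  return matrix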

From this, we can create the stochastic matrix:

[[0, 0, 0.33, 0.2, 0],
 [1, 0, 0.33, 0.2, 1],
 [0, 0, 0.00, 0.2, 0],
 [0, 0, 0.33, 0.2, 0],
 [0, 1, 0.00, 0.2, 0]]

  What is a stochastic matrix? It is a matrix in which the entries of each row (or of each column) add up to 1. In this case, the entries of each column add up to 1, so this matrix is column-stochastic. This type of matrix represents a probability distribution, in which each entry matrix[i][j] represents the probability that a web surfer who is currently on page j will visit page i.

  You may be wondering why column 3 has 0.2 for all its entries, even though Page 3 has no outgoing links! This is because in the PageRank algorithm, a page with no outgoing links is assumed to distribute its importance equally among all the existing pages in the network. Effectively, this is the same as saying that the page links to every page in the network, including itself. The justification for this reasoning is that, under the web surfer model, a user who visits a page with no outbound links will type the URL of any of the existing webpages with equal probability. In this example, since there are five pages total in the network, a user on Page 3 has a 1/5 chance of visiting each page next, hence the 0.2.

  To save computational time, I create the stochastic matrix directly from the adjacency list. I also use Python's NumPy library for all matrix operations (you can see the full example code at server/example_pagerank.py). Here is the function in my Python code that creates the stochastic matrix:

import numpy as np

def create_stochastic(adj_list):
  n = len(adj_list)
  matrix = np.zeros((n, n)) #initialize a matrix of zeroes (np is shorthand for NumPy)
  for col, line in enumerate(adj_list):
    if len(line) > 0:
      #case where page has existing links
      for index in line:
        matrix[index][col] = 1/(len(line))
    else:
      #case where page does not have any links
      for index in range(n):
        matrix[index][col] = 1/n
  return matrix
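
As a quick sanity check (a sketch, not part of the repo; it assumes the function above is already defined), running it on the example adjacency list reproduces the stochastic matrix shown earlier, and every column sums to 1:

adj_list = [[1], [4], [0, 1, 3], [], [1]] #the example graph
S = create_stochastic(adj_list)
print(S) #matches the stochastic matrix above
print(S.sum(axis=0)) #each column sums to 1 (up to floating point error)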

  After we create the stochastic matrix, we must now consider the damping factor. According to Page and Brin, the PageRank model assumes that at some point the user will grow bored and eventually stop clicking. The damping factor is a constant between 0 and 1 that represents the probability that the user will continue clicking at any moment. In their paper, Page and Brin used a damping factor of 0.85, which is what I use in my implementation, too. Here's a good explanation of the purpose of the damping factor.

The pagerank calculation of any page A is as follows:

PR(A) = (1 - d)/N + d * ( PR(T1)/L(T1) + PR(T2)/L(T2) + ... + PR(Tn)/L(Tn) )

  Where PR = pagerank, N = number of pages in the web, d = damping factor, T1, ..., Tn = the pages with links to A, and L(Ti) = number of outbound links from page Ti.
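
For a quick hypothetical example (the numbers here are made up, not taken from the project): if page A's only inbound link comes from a page T with PR(T) = 0.4 and 3 outbound links, then with d = 0.85 and N = 5, PR(A) = 0.15/5 + 0.85 * (0.4/3) ≈ 0.03 + 0.113 ≈ 0.14.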

We can then adapt this formula, using the stochastic matrix we created earlier, to create the transition matrix:

A = ((1 - d)/N) * E + d * S

Where A = transition matrix, E = NxN matrix of 1's, and S = stochastic matrix

Here is my Python function for creating the transition matrix:

def create_transitional(stoch_matrix):
  n = len(stoch_matrix)
  d = 0.85 #damping factor
  E = np.ones((n, n)) #create matrix of 1's
  part1 = np.multiply(((1-d)/n), E)
  part2 = np.multiply(d, stoch_matrix)
  transition_matrix = np.add(part1, part2)
  print(transition_matrix)
  print('\n')
  return transition_matrix

The resulting output for this example:
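
Each entry is (1-d)/N = 0.03 plus d = 0.85 times the corresponding entry of the stochastic matrix, so, rounded to three decimal places, the transition matrix works out to roughly:

[[0.03, 0.03, 0.313, 0.2, 0.03],
 [0.88, 0.03, 0.313, 0.2, 0.88],
 [0.03, 0.03, 0.03,  0.2, 0.03],
 [0.03, 0.03, 0.313, 0.2, 0.03],
 [0.03, 0.88, 0.03,  0.2, 0.03]]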

  With this transition matrix (notice that it is still column-stochastic!), we can now multiply it by an initial column vector p0, where p0 has length N and each entry p0[i] has value 1/N, for a total sum of 1. Repeating this multiplication makes the vector converge to the unique eigenvector of the transition matrix with eigenvalue 1, which is exactly the pagerank vector we are looking for. You can read about eigenvectors and eigenvalues here.

  The result of this multiplication will be a new column vector p1, since multiplying an N x N matrix by an N x 1 vector yields another N x 1 vector. We can multiply the transition matrix by this new vector, and then repeat the process, which can be described with the recursive formula below:

p(n+1) = A * p(n)

The entries of each successive vector change by a smaller amount with every iteration. We can stop the process once the difference between p(n)[i] and p(n-1)[i] is within a set error bound for every entry i.

Here is my Python code that computes and counts the iterations:

def calculate_pagerank(transition_matrix):
  n = len(transition_matrix)
  err_bound = 0.005 #personal error bound that I set for this project, not sure what Page and Brin used for their implementation
  v1 = np.ones(n)
  v1 = np.multiply((1/n), v1)
  v2 = np.matmul(transition_matrix, v1)
  count = 1
  print(v2)
  print('\n')
  while not within_err_bound(v1, v2, err_bound): 
    #keep iterating multiplication until difference between v1 and v2 for all entries is under err bound
    v1 = v2
    v2 = np.matmul(transition_matrix, v1)
    count += 1
    print(v2)
    print('\n')
  return {'vector': v2.tolist(), 'iterations': count}

def within_err_bound(v1, v2, err_bound):
  diff_vector = np.subtract(v2, v1)
  for diff in diff_vector:
    if abs(diff) > err_bound:
      return False
  return True
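
To tie the three functions together, here is a minimal driver sketch (not a file from the repo; it assumes the functions above are in scope):

adj_list = [[1], [4], [0, 1, 3], [], [1]] #the example graph from above

stoch_matrix = create_stochastic(adj_list)
transition_matrix = create_transitional(stoch_matrix)
result = calculate_pagerank(transition_matrix)

print(result['vector']) #final pagerank vector
print(result['iterations']) #number of iterations until convergence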

  You can see part of the console output for this example here, since it's a bit too long to fit on this page. Even in the first few iterations, you can see the values changing by a smaller amount each time. This particular example takes 22 iterations until the differences fall below the error bound. The last vector output in this iterative process is the final pagerank:
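
For reference, solving A * p = p exactly for this example (with the entries of p summing to 1) gives approximately p ≈ [0.049, 0.446, 0.038, 0.049, 0.417] for Pages 0 through 4, and the iterated vectors settle near these values.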

  Given this vector, we can see that Page 1 has the highest rank. This is expected, because Page 1 has the most inbound links of any page. But notice that Page 4 has a very high rank, too, despite having only one inbound link! This is because that single inbound link comes from Page 1, which happens to be the most important page. Since this link is Page 1's sole outbound link, Page 1 transfers all of its influence through it (if you were looking for a new job, one recommendation from Larry Page himself would be more impactful than ten recommendations from your coworkers!). Remember that not only does the number of inbound links matter, but so does the weight of each link!

The Downsides of PageRank

  Though PageRank is a brilliant algorithm for ranking webpages, it is not an entirely reliable method on its own due to its weaknesses. Here are a few of them:

Bias Towards Older Pages

  Since pages that were indexed earlier in the network have likely accumulated more inbound links than newly registered pages, these older pages will have higher PageRanks than the newer ones. This is problematic for search engines because many users look for newer content (searching news articles on a certain topic, for example). Newer pages will be obscured by the older, more established ones, and the algorithm may yield less relevant results for a query.

Link Farms

  A link farm is a group of websites that link to every other site in the group. Doing so helps increase the PageRank of all the member pages of the group. This artificial inflation of member pages' PageRanks allows the entity that owns the link farm to generate higher revenue from advertising on the pages. Google did target private blog networks in September 2014 with manual action ranking penalties. Link farms constitute a form of spamdexing, which is the deliberate manipulation of search engine indexing.

Spoofing

  The PageRank shown in the Google Toolbar was easily manipulable, because a redirect from one page to another via an HTTP 302 response would cause the redirecting page to be credited with the PageRank of the destination page. By redirecting to a well-known website, any page could thus acquire a large PageRank. This technique is known as spoofing. In March 2016, Google pulled the PageRank feature from the Toolbar and shut down the API.

Sources

https://en.wikipedia.org/wiki/PageRank

https://arxiv.org/pdf/1407.5107v1.pdf

http://home.ie.cuhk.edu.hk/~wkshum/papers/pagerank.pdf

http://www.stat.cmu.edu/~ryantibs/datamining/lectures/03-pr.pdf

https://en.wikipedia.org/wiki/Link_farm
