Sunday 7 February 2016

An insight on PageRank.

PAGERANK


PageRank is an algorithm used by Google Search to rank websites in their search engine results. It is a way of measuring the importance of website pages. According to Google:   PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other website.You enter a search query. Google retrieves a list of links that match the query in an arbitrary order.

How is PageRank calculated?

Each web page has 2 attributes, Authority and Hub.
  • Authority of a page A is a count of the number of pages that point to, or cite the page A.
  • Hub of a page A is a count of the number of pages that A cites or points to. PageRank algorithm relies on the votes cast to each page to rank the pages.


How are votes cast to a web-page?



If a page B cites or has a hyperlink to page A, it is considered as a vote from page B to page A. However, PageRank also takes into account the rank of the page B that casts the vote to page A. The votes cast by a high-ranked page has more weightage than the votes cast by a low-ranked page. Also, since a page may point to many other pages, it's vote must be shared among all those pages.


PageRank is based on incoming links, but not just on the number of them – relevance and quality are important (in terms of the PageRank of sites, which link to a given site).

Equation :

PR(A) = (1-d) + d(PR(t1)/C(t1) + … + PR(tn)/C(tn)).
In the equation ‘t1 – tn’ are pages linking to page A, ‘C’ is the number of outbound links that a page has and ‘d’ is a damping factor, usually set to 0.85.

Explaining the Terms :

  1. PR(Tn) - Each page has a notion of its own self-importance. That’s “PR(T1)”for the first page in the web all the way up to “PR(Tn)” for the last page
  2. C(Tn) - Each page spreads its vote out evenly amongst all of it’s outgoing links. The count, or number, of outgoing links for page 1 is “C(T1)”, “C(Tn)” for page n, and so on for all pages.
  3. PR(Tn)/C(Tn) - so if our page (page A) has a backlink from page “n” the share of the vote page A will get is “PR(Tn)/C(Tn)”
  4. d(). - All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is “damped down” by multiplying it by d.
  5. (1 - d) - The (1 – d) bit at the beginning is a bit of probability math magic so the “sum of all web pages' PageRanks will be one”.It also means that if a page has no links to it (no backlinks) even then it will still get a small PR of 0.15 (i.e. 1 – 0.85).
  • Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
  • PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web.
We can think of it in a simpler way: a page’s PageRank = 0.15 + 0.85 * (a “share” of the PageRank of every page that links to it). “share” = the linking page’s PageRank divided by the number of outbound links on the page. A page “votes” an amount of PageRank onto each page that it links to. The amount of PageRank that it has to vote with is a little less than its own PageRank value (its own value * 0.85). This value is shared equally between all the pages that it links to.
Not all links weight the same when it comes to PR.
If you had a web page with a PR8 and had 1 link on it, the site linked to would get a fair amount of PR value. But, if you had 100 links on that page, each individual link would only get a fraction of the value.
If the PageRank value differences between PR1, PR2,.....PR10 were equal then that conclusion would hold up, But the values between PR1 and PR10 (the maximum) are set on a logarithmic scale, If so, it means that it takes a lot more additional PageRank for a page to move up to the next PageRank level that it did to move up from the previous PageRank level. The result is that it reverses the previous conclusion, so that a link from a PR8 page that has lots of outbound links is worth more than a link from a PR4 page that has only a few outbound links.

Lets take the simplest example network: two pages, each pointing to the other:

d
= 0.85
PR(A)
= (1 – d) + d(PR(B)/1)
PR(B)
= (1 – d) + d(PR(A)/1)
i.e.
PR(A)
= 0.15 + 0.85 * 1
= 1
PR(B)
= 0.15 + 0.85 * 1
= 1

This was just after one iteration and luckily it converged. Now lets take another example with three pages.

Link page A to both B and C. Also link pages B and C to A. Starting with PR1 all round, after 1 iteration the results are:- 
pagerank, page rankPR(A) = 1.85
PR(B) = 0.575
PR(C) = 0.575
and after 100 iterations, the results are:- 
PR(A) = 1.459459
PR(B) = 0.7702703
PR(C) = 0.7702703




Finally

PageRank is, in fact, very simple. But when a simple calculation is applied hundreds
(or billions) of times over the results can seem complicated.
PageRank is also only part of the story about what results get displayed high up in a 
Google listing. For example there’s some evidence to suggest that Google is paying a
lot of attention these days to the text in a link’s anchor when deciding the 
relevance of a target page – perhaps more so than the page’s PR.

Links :

  1. https://en.wikipedia.org/wiki/PageRank
  2. http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
  3. http://www.webworkshop.net/pagerank.html


2 comments: