Near-Duplicate Content: Useful Sources Discovered

Whilst researching for my recent Master’s, I came across a number of useful resources on the subject of near-duplicates and how search engines might handle them when crawling.

Making search engines play ‘spot the difference’ can have dire consequences for websites if not handled properly. It’s an area that has intrigued me for some time, so my dissertation research was a welcome opportunity to root around in academic and practitioner writings and dig up as much as I could find.

Whilst it is something of a random list on its own, there are plenty of interesting pieces here if anyone wants to venture down the rabbit hole.

Rather than listing these in author order, which would be the normal Harvard referencing style, I’ve listed these in chronological order so we can hopefully see how these have evolved over time.  Where there are two or more papers, blog posts or other resources for a year I have then listed the references in alphabetical order of author surname.

Taken together, the list tells something of the story of how researchers and search engines have handled the challenge of an exploding web of content and URLs over time. The aim has been to keep the cost per query manageable when crawling, indexing and serving results to search engine users, whilst also preventing users from being annoyed at seeing multiple URLs with the same content ranking together in search engine results.

Some of the authors are search engine researchers and some are academic researchers, so there is a good mix of studies from both the relevance (industry) and rigour (academia) sides of the fence.

There are also some blog posts and sections from Google’s webmaster support pages, designed to help with the issues that near-duplicate or duplicate content can cause on websites where functionality such as faceted navigation generates many URL parameters.
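To give a feel for how quickly faceted navigation can multiply URLs, here is a minimal Python sketch. The domain, path and facet names are made up for illustration, and sorting the query parameters is just one simple normalisation I have assumed for the example, not a description of what any search engine actually does.

```python
from itertools import permutations
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical facets a shopper might apply on a category page.
facets = {"colour": "blue", "size": "m", "sort": "price"}

# The same three filters applied in every possible order give distinct URLs.
variants = [
    "https://www.example.com/shirts?" + urlencode(list(order))
    for order in permutations(facets.items())
]


def normalise(url: str) -> str:
    """Sort the query parameters so equivalent facet URLs compare equal."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(len(set(variants)))                    # 6 distinct URLs...
print(len({normalise(u) for u in variants}))  # ...for 1 page of content
```

Three facets already produce six URLs for one page of content; every extra facet multiplies that further, which is where the crawling cost comes from.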

REFERENCES TO PAPERS RELEVANT TO NEAR-DUPLICATE CONTENT

Year: 1981

Fingerprinting by random polynomials

Rabin, M.O., 1981. Fingerprinting by random polynomials (pp. 15-18). Center for Research in Computing Techn., Aiken Computation Laboratory, Univ..

Summary: To follow

Peer reviewed: Yes

Year: 1993

Some applications of Rabin’s fingerprinting method

Broder, A.Z., 1993. Some applications of Rabin’s fingerprinting method. In Sequences II (pp. 143-152). Springer New York.

Summary: To follow

Peer reviewed: Yes

Year: 1997

On the resemblance and containment of documents

Broder, A.Z., 1997, June. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings (pp. 21-29). IEEE.

Summary: To follow

Peer reviewed: Yes

Syntactic clustering of the web

Broder, A.Z., Glassman, S.C., Manasse, M.S. and Zweig, G., 1997. Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8-13), pp.1157-1166.

Summary: To follow

Peer reviewed: Yes

Optimal robot scheduling for web search engines

Coffman, E.G., Liu, Z. and Weber, R.R., 1997. Optimal robot scheduling for web search engines (Doctoral dissertation, INRIA).

Summary: To follow

Peer reviewed: Yes

Year: 1998

Efficient crawling through URL ordering

Cho, J., Garcia-Molina, H. and Page, L., 1998. Efficient crawling through URL ordering.

Summary: To follow

Peer reviewed: Yes

Year: 1999

Copy detection mechanisms for digital documents

Brin, S., Davis, J. and Garcia-Molina, H., 1995, June. Copy detection mechanisms for digital documents. In ACM SIGMOD Record (Vol. 24, No. 2, pp. 398-409). ACM.

Summary: To follow

Peer reviewed: Yes

 

Computing Iceberg Queries Efficiently

Fang, M., Shivakumar, N., Garcia-Molina, H., Motwani, R. and Ullman, J.D., 1999, November. Computing Iceberg Queries Efficiently. In International Conference on Very Large Databases (VLDB’98), New York, August 1998. Stanford InfoLab.

Summary: To follow

Peer reviewed: Yes

 

A Knowledge-Based Approach to Organizing Retrieved Documents

Pratt, W., Hearst, M.A. and Fagan, L.M., 1999, July. A Knowledge-Based Approach to Organizing Retrieved Documents. In AAAI/IAAI (pp. 80-85).

Summary: To follow

Peer reviewed: Yes

Year: 2000

How dynamic is the Web?

Brewington, B.E. and Cybenko, G., 2000. How dynamic is the Web?. Computer Networks, 33(1), pp.257-276.

Summary: To follow

Peer reviewed: Yes

 

Keeping up with the changing web

Brewington, B.E. and Cybenko, G., 2000. Keeping up with the changing web. IEEE Computer, 33(5), pp.52-58.

Summary: To follow

Peer reviewed: Yes

 

Focused Crawling Using Context Graphs

Diligenti, M., Coetzee, F., Lawrence, S., Giles, C.L. and Gori, M., 2000, September. Focused Crawling Using Context Graphs. In VLDB (pp. 527-534).

Summary: To follow

Peer reviewed: Yes

Year: 2001

An adaptive model for optimizing performance of an incremental web crawler

Edwards, J., McCurley, K. and Tomlin, J., 2001, April. An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the 10th international conference on World Wide Web (pp. 106-113). ACM.

Summary: To follow

Peer reviewed: Yes

Year: 2002

 

Challenges in web search engines

Henzinger, M.R., Motwani, R. and Silverstein, C., 2002, September. Challenges in web search engines. In ACM SIGIR Forum (Vol. 36, No. 2, pp. 11-22). ACM.

Summary: To follow

Peer reviewed: Yes

 

Optimal crawling strategies for web search engines

Wolf, J.L., Squillante, M.S., Yu, P.S., Sethuraman, J. and Ozsen, L., 2002, May. Optimal crawling strategies for web search engines. In Proceedings of the 11th international conference on World Wide Web (pp. 136-147). ACM.

Summary: To follow

Peer reviewed: Yes

Year: 2003

 

Effective page refresh policies for web crawlers

Cho, J. and Garcia-Molina, H., 2003. Effective page refresh policies for web crawlers. ACM Transactions on Database Systems (TODS), 28(4), pp.390-426.

Summary: To follow

Peer reviewed: Yes

 

Online duplicate document detection: signature reliability in a dynamic retrieval environment

Conrad, J.G., Guo, X.S. and Schriber, C.P., 2003, November. Online duplicate document detection: signature reliability in a dynamic retrieval environment. In Proceedings of the twelfth international conference on Information and knowledge management (pp. 443-452). ACM.

Summary: To follow

Peer reviewed: Yes

 

A large-scale study of the evolution of web pages

Fetterly, D., Manasse, M., Najork, M. and Wiener, J., 2003, May. A large-scale study of the evolution of web pages. In Proceedings of the 12th international conference on World Wide Web (pp. 669-678). ACM.

Summary: To follow

Peer reviewed: Yes

 

On the evolution of clusters of near-duplicate web pages

Fetterly, D., Manasse, M. and Najork, M., 2003. On the evolution of clusters of near-duplicate web pages. Journal of Web Engineering, 2(4), pp.228-246.

Summary: To follow

Peer reviewed: Yes

 

Detecting duplicate and near-duplicate files

Pugh, W. and Henzinger, M.H., Google, Inc., 2003. Detecting duplicate and near-duplicate files. U.S. Patent 6,658,423.

Summary: To follow

Peer reviewed: No

Year: 2004

 

Spam, damn spam, and statistics

Fetterly, D., Manasse, M. and Najork, M., 2004, June. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004 (pp. 1-6). ACM.

Summary: To follow

Peer reviewed: Yes

 

Algorithmic challenges in web search engines

Henzinger, M.R., 2004. Algorithmic challenges in web search engines. Internet Mathematics, 1(1), pp.115-123.

Summary: To follow

Peer reviewed: Yes

Year: 2005

Crawling a country: better strategies than breadth-first for web page ordering

Baeza-Yates, R., Castillo, C., Marin, M. and Rodriguez, A., 2005, May. Crawling a country: better strategies than breadth-first for web page ordering. In Special interest tracks and posters of the 14th international conference on World Wide Web (pp. 864-872). ACM.

Summary: To follow

Peer reviewed: Yes

 

Detecting phrase-level duplication on the world wide web

Fetterly, D., Manasse, M. and Najork, M., 2005, August. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 170-177). ACM.

Summary: To follow

Peer reviewed: Yes

 

Web spam taxonomy

Gyongyi, Z. and Garcia-Molina, H., 2005. Web spam taxonomy. In First international workshop on adversarial information retrieval on the web (AIRWeb 2005).

Summary: To follow

Peer reviewed: Yes

Year: 2006

SEO advice: url canonicalization

Cutts, M., 2006. Gadgets, Google, and SEO Blog – SEO advice: url canonicalization. [ONLINE] Available at: https://www.mattcutts.com/blog/seo-advice-url-canonicalization/. [Accessed 27 March 2017].

Summary: To follow

Peer reviewed: No

 

Detecting spam web pages through content analysis

Ntoulas, A., Najork, M., Manasse, M. and Fetterly, D., 2006, May. Detecting spam web pages through content analysis. In Proceedings of the 15th international conference on World Wide Web (pp. 83-92). ACM.

Summary: To follow

Peer reviewed: Yes

 

Solving Different URLs with Similar Text (DUST)

Slawski, B., 2006. Solving Different URLs with Similar Text (DUST) – SEO by the Sea. [ONLINE] Available at: http://www.seobythesea.com/2006/09/solving-different-urls-with-similar-text-dust/. [Accessed 04 February 2017].

Summary: To follow

Peer reviewed: No

Year: 2007

 

Scaling up all pairs similarity search

Bayardo, R.J., Ma, Y. and Srikant, R., 2007, May. Scaling up all pairs similarity search. In Proceedings of the 16th international conference on World Wide Web (pp. 131-140). ACM.

Summary: To follow

Peer reviewed: Yes

 

Detecting near-duplicates for web crawling

Manku, G.S., Jain, A. and Das Sarma, A., 2007, May. Detecting near-duplicates for web crawling. In Proceedings of the 16th international conference on World Wide Web (pp. 141-150). ACM.

Summary: To follow

Peer reviewed: Yes

 

Joint optimization of wrapper generation and template detection

Zheng, S., Song, R., Wen, J.R. and Wu, D., 2007, August. Joint optimization of wrapper generation and template detection. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 894-902). ACM.

Summary: To follow

Peer reviewed: Yes

Year: 2008

 

We knew the web was big…

Alpert, J., 2008. Official Google Blog: We knew the web was big… [ONLINE] Available at: https://googleblog.blogspot.co.uk/2008/07/we-knew-web-was-big.html. [Accessed 02 January 2017].

Summary: To follow

Peer reviewed: No

 

iRobot: An intelligent crawler for Web forums

Cai, R., Yang, J.M., Lai, W., Wang, Y. and Zhang, L., 2008, April. iRobot: An intelligent crawler for Web forums. In Proceedings of the 17th international conference on World Wide Web (pp. 447-456). ACM.

Summary: To follow

Peer reviewed: Yes

 

Detecting duplicate and near-duplicate files

Pugh, W. and Henzinger, M.H., Google, Inc., 2008. Detecting duplicate and near-duplicate files. U.S. Patent 7,366,718.

Summary: To follow

Peer reviewed: No

 

Recrawl scheduling based on information longevity

Olston, C. and Pandey, S., 2008, April. Recrawl scheduling based on information longevity. In Proceedings of the 17th international conference on World Wide Web (pp. 437-446). ACM.

Year: 2009

Do not crawl in the dust: different urls with similar text

Bar-Yossef, Z., Keidar, I. and Schonfeld, U., 2009. Do not crawl in the dust: different urls with similar text. ACM Transactions on the Web (TWEB), 3(1), p.3.

Summary: To follow

Peer reviewed: Yes

Incorporating site-level knowledge for incremental crawling of web forums

Yang, J.M., Cai, R., Wang, C., Huang, H., Zhang, L. and Ma, W.Y., 2009, June. Incorporating site-level knowledge for incremental crawling of web forums: A list-wise strategy. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1375-1384). ACM.

Summary: To follow

Peer reviewed: Yes

Google, Yahoo & Microsoft Unite On “Canonical Tag” To Reduce Duplicate Content Clutter

Fox, V., 2009. Search Engine Land: Google, Yahoo & Microsoft Unite On “Canonical Tag” To Reduce Duplicate Content Clutter. [ONLINE] Available at: http://searchengineland.com/canonical-tag-16537. [Accessed 11 June 2017].

Summary

Search Engine Land’s Vanessa Fox reports on the collaboration between Yahoo, Google and Microsoft to create a common ‘canonical tag’ as a means of combating the issues caused by duplicate and near-duplicate content. This is significant because the three major search engines had joined forces only twice before: once to agree on the common use of XML sitemaps, and once to agree on the use of robots.txt (although not all robots directives are supported by all three search engines).

Peer reviewed: No

 

Specify Your Canonical

Kupke, J. and Ohye, M., 2009, February. Official Google Webmaster Central Blog: Specify your canonical. [ONLINE] Available at: https://webmasters.googleblog.com/2009/02/specify-your-canonical.html. [Accessed 11 June 2017].

Summary

Google announces the canonical tag as a means of indicating the preferred URI amongst several that output the same content. The post explains how webmasters can use it to tell search engines that several versions of the same content exist, and to give a strong hint that all signals (including any value from links) should be passed to the one preferred URI.
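As a rough illustration of what the canonical tag looks like in practice, the sketch below reads the preferred URI out of a page’s markup using only Python’s standard library. The sample page, the example.com URL and the names `CanonicalLinkParser` and `find_canonical` are my own inventions for the example; it makes no claim about how any search engine actually processes the tag.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel is a space-separated list of tokens, compared case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonical = attr_map["href"]


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URI declared in the markup, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


# A made-up parameterised page that points at its preferred version.
page = """
<html><head>
<title>Blue widgets, sorted by price</title>
<link rel="canonical" href="https://www.example.com/widgets/blue">
</head><body>...</body></html>
"""

print(find_canonical(page))
```

Several parameterised versions of a page would each carry the same `href`, which is how the one preferred URI gets signalled.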

Peer reviewed: No

Year: 2010

Web crawler scheduler that utilizes sitemaps from websites

Brawer, S.B., Ibel, M., Keller, R.M. and Shivakumar, N., Google Inc., 2010. Web crawler scheduler that utilizes sitemaps from websites. U.S. Patent 7,769,742.

Matt Cutts Interviewed by Eric Enge

Cutts, M., 2010. Stone Temple Consulting: Matt Cutts Interviewed by Eric Enge. [ONLINE] Available at: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/. [Accessed 26 October 2016].

Joint optimization of wrapper generation and template detection

Wen, J.R., Wan, M., Song, R., Ma, W.Y. and Zeng, S., Microsoft Corporation, 2010. Joint optimization of wrapper generation and template detection. U.S. Patent 7,660,804.

Year: 2011

 

 

View-all in search results

Li, B. and Kupke, J., 2011. Official Google Webmaster Central Blog: View-all in search results. [ONLINE] Available at: https://webmasters.googleblog.com/2011/09/view-all-in-search-results.html. [Accessed 06 February 2017].

Summary: To follow

Peer reviewed: No

Indicate paginated content

Google, n.d. Indicate paginated content – Search Console Help. [ONLINE] Available at: https://support.google.com/webmasters/answer/1663744?hl=en. [Accessed 27 March 2017]. (Publication date unknown but estimated at 2011)

Summary: To follow

Peer reviewed: No

Detecting duplicate and near-duplicate files

Henzinger, M.H., Google Inc., 2011. Detecting duplicate and near-duplicate files. U.S. Patent 8,015,162.

Summary: To follow

Peer reviewed: No

Demystifying the “duplicate content penalty”

Moskwa, S., 2008. Official Google Webmaster Central Blog: Demystifying the “duplicate content penalty”. [ONLINE] Available at: https://webmasters.googleblog.com/2008/09/demystifying-duplicate-content-penalty.html. [Accessed 26 February 2017].

Summary: To follow

Peer reviewed: No

 

Detecting duplicate and near-duplicate files

Pugh, W. and Henzinger, M.H., Henzinger Monika H, 2011. Detecting duplicate and near-duplicate files. U.S. Patent Application 13/313,913.

Summary: To follow

Peer reviewed: No

 

Efficient similarity joins for near-duplicate detection

Xiao, C., Wang, W., Lin, X., Yu, J.X. and Wang, G., 2011. Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3), p.15.

Summary: To follow

Peer reviewed: Yes

Year: 2012

RFC 6596 – The Canonical Link Relation

Ohye, M. and Kupke, J., 2012. RFC 6596 – The Canonical Link Relation. [ONLINE] Available at: https://tools.ietf.org/html/rfc6596. [Accessed 10 June 2017].

Summary: Request for Comments submitted to the Internet Engineering Task Force by Google’s M. Ohye and J. Kupke.

Peer reviewed: No

 

Use canonical URLs

Google, n.d. Use canonical URLs – Search Console Help. [ONLINE] Available at: https://support.google.com/webmasters/answer/139066?hl=en. [Accessed 04 March 2017]. (First publication date unknown but estimated at 2012, following the submission of ‘The Canonical Link Relation’ as RFC 6596 (Request for Comments) to the Internet Engineering Task Force by Google’s M. Ohye and J. Kupke.)

 

Novel approaches to crawling important pages early

Alam, M.H., Ha, J. and Lee, S., 2012. Novel approaches to crawling important pages early. Knowledge and Information Systems, 33(3), pp.707-734.

Summary: To follow

Peer reviewed: Yes

 

Focused crawling: a new approach to topic-specific Web resource discovery

Chakrabarti, S., Van den Berg, M. and Dom, B., 1999. Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks, 31(11), pp.1623-1640.

Summary: To follow

Peer reviewed: Yes

Pagination with rel=“next” and rel=“prev”

Google, 2012. Official Google Webmaster Central Blog: Video about pagination with rel=“next” and rel=“prev”. [ONLINE] Available at: https://webmasters.googleblog.com/2012/03/video-about-pagination-with-relnext-and.html. [Accessed 26 October 2016].

Summary: To follow

Peer reviewed: No

Detecting Duplicate and Near-Duplicate Files

Henzinger, M.H., Google Inc., 2012. Detecting Duplicate and Near-Duplicate Files. U.S. Patent Application 13/225,342.

Summary: To follow

Peer reviewed: No

Year: 2013

 

How does Google handle duplicate content?

Cutts, M., 2013. How does Google handle duplicate content? – YouTube. [ONLINE] Available at: https://www.youtube.com/watch?v=mQZY7EmjbMA. [Accessed 26 February 2017].

Summary: To follow

Peer reviewed: No

 

Google’s Matt Cutts: 25-30% Of The Web’s Content Is Duplicate Content & That’s Okay

Schwartz, B., 2013. Search Engine Land: Google’s Matt Cutts: 25-30% Of The Web’s Content Is Duplicate Content & That’s Okay. [ONLINE] Available at: http://searchengineland.com/googles-matt-cutts-25-30-of-the-webs-content-is-duplicate-content-thats-okay-180063. [Accessed 26 February 2017].

Summary: To follow

Peer reviewed: No

 

Inside Search: Billions of times a day in the blink of an eye

Google, 2013. Inside Search: Billions of times a day in the blink of an eye. [ONLINE] Available at: https://search.googleblog.com/2013/03/billions-of-times-day-in-blink-of-eye.html. [Accessed 29 January 2017].

Summary: To follow

Peer reviewed: No

Crawl Optimization: You Are What Googlebot Eats

Kohn, A.J., 2013. Blind Five Year Old: Crawl Optimization: You Are What Googlebot Eats. [ONLINE] Available at: http://www.blindfiveyearold.com/crawl-optimization. [Accessed 26 October 2016].

Summary: To follow

Peer reviewed: No

Scheduler for Search Engine Crawler

Randall, K.H., Google Inc., 2014. Scheduler for Search Engine Crawler. U.S. Patent Application 14/325,211.

Summary: To follow

Peer reviewed: No

 

5 common mistakes with rel=canonical

Scott, A., 2013. Official Google Webmaster Central Blog: 5 common mistakes with rel=canonical. [ONLINE] Available at: https://webmasters.googleblog.com/2013/04/5-common-mistakes-with-relcanonical.html. [Accessed 04 March 2017].

Summary: To follow

Peer reviewed: No

Current challenges in web crawling

Shestakov, D., 2013, July. Current challenges in web crawling. In International Conference on Web Engineering (pp. 518-521). Springer Berlin Heidelberg.

Summary: To follow

Peer reviewed: Yes

 

Year: 2014

 

Faceted navigation best (and 5 of the worst) practices

Google, 2014. Official Google Webmaster Central Blog: Faceted navigation best (and 5 of the worst) practices. [ONLINE] Available at: https://webmasters.googleblog.com/2014/02/faceted-navigation-best-and-5-of-worst.html. [Accessed 26 October 2016].

Summary: To follow

Peer reviewed: No

Year: 2015

What’s Wrong with my Site?

Raventools, 2015. What’s Wrong with my Site? [ONLINE] Available at: https://raventools.com/studies/onpageseo/#duplicate. [Accessed 26 February 2017].

Summary: To follow

Peer reviewed: No

How Google May Use Schema Vocabulary to Reduce Duplicate Content

Slawski, B., 2015. How Google May Use Schema Vocabulary to Reduce Duplicate Content in Search Results. [ONLINE] Available at: http://www.seobythesea.com/2015/10/how-google-may-use-schema-vocabulary-to-reduce-duplicate-content-in-search-results/. [Accessed 26 October 2016].

Summary: To follow

Peer reviewed: No

Year: 2016

Going Deep with SEO Tags

Illyes, G., 2016. Virtual Keynote 2: Gary Illyes & Eric Enge – Going Deep with SEO Tags – YouTube. [ONLINE] Available at: https://www.youtube.com/watch?v=GVKcMU7YNOQ. [Accessed 26 February 2017].

Summary: To follow

Peer reviewed: No

Detecting Duplicate and Near-Duplicate Files

Pugh, W. and Henzinger, M.H., Google Inc., 2016. Detecting duplicate and near-duplicate files. U.S. Patent 9,275,143.

Summary: To follow

Peer reviewed: No

Google Says Any 30x Redirect Passes PageRank But 301s Help With Canonicalization

Schwartz, B., 2016. Seroundtable.com: Google Says Any 30x Redirect Passes PageRank But 301s Help With Canonicalization. [ONLINE] Available at: https://www.seroundtable.com/google-300-redirect-pagerank-301-canonicalization-22500.html. [Accessed 27 March 2017].

Summary: To follow

Peer reviewed: No

Google Sitemaps Should Contain URLs of Pages You Want Indexed, Not Variations

Schwartz, B., 2016. Seroundtable.com: Google Sitemaps Should Contain URLs Of Pages You Want Indexed, Not Variations. [ONLINE] Available at: https://www.seroundtable.com/google-sitemaps-no-variations-23125.html. [Accessed 28 March 2017].

Summary: To follow

Peer reviewed: No

Google’s Crawl Budget Works Differently Than You & I Think

Schwartz, B., 2016. Seroundtable.com: Google’s Crawl Budget Works Differently Than You & I Think. [ONLINE] Available at: https://www.seroundtable.com/google-crawl-budget-different-23108.html. [Accessed 26 February 2017].

Summary: To follow

Peer reviewed: No

Infrequent Google Crawling Is a Sign of a Low Quality Site

Slegg, J., 2016. The SEM Post: Infrequent Google Crawling Is a Sign of a Low Quality Site. [ONLINE] Available at: http://www.thesempost.com/infrequent-crawling-sign-low-quality-site/. [Accessed 03 February 2017].

Summary: To follow

Peer reviewed: No

The Myth of the Duplicate Content Penalty

Stox, P., 2016. Search Engine Land: The myth of the duplicate content penalty. [ONLINE] Available at: http://searchengineland.com/myth-duplicate-content-penalty-259657. [Accessed 26 February 2017].

Summary: To follow

Peer reviewed: No

Year: 2017

How Search Works – The Story – Inside Search

Google US, 2017. How Search Works – The Story – Inside Search – Google. [ONLINE] Available at: https://www.google.com/insidesearch/howsearchworks/thestory/. [Accessed 29 January 2017].

Summary: To follow

Peer reviewed: No

Google Knows of 130 Trillion Pages On The Web

Schwartz, B., 2017. Seroundtable.com: Google Knows of 130 Trillion Pages On The Web. [ONLINE] Available at: https://www.seroundtable.com/google-130-trillion-pages-22985.html. [Accessed 29 January 2017].

Summary: To follow

Peer reviewed: No