Olaf Behrendt

Estimating the Size of the German Bing Index

Abstract

How many pages does Microsoft's search engine Bing.com hold in its index? Following the idea of Maurice de Kunder, we can roughly estimate the size of Bing's German index at about 300 million pages.

Approach

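Following Maurice de Kunder's approach, we perform queries for a set of query terms, once against a representative reference collection and once against the search engine whose index size is unknown, and record the term frequency reported by each.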

Next we compare the two frequencies and estimate the size of the unknown collection. If the assumption holds that the distributional properties of the two collections are similar, our estimates should be fairly accurate. For example, if the term 'der' (the German masculine definite article, corresponding to 'the' in English) occurs in 80% of the pages in the representative collection and Bing shows 210 million results, then we can estimate Bing's index size to be about 210 / 80% = 262.5 million pages. To get a more reliable estimate, we use not only high-frequency terms but also terms with lower frequency, and average the results.
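The arithmetic of this step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the script used for this page: the function names estimate_index_size and average_estimate are hypothetical, and only the 'der' figures (210 million results, 80%) come from the text.

```python
from statistics import mean


def estimate_index_size(result_count: float, reference_fraction: float) -> float:
    """Estimate the index size from a single term.

    result_count: number of results the search engine reports for the term.
    reference_fraction: share of pages in the reference collection that
        contain the term (0 < reference_fraction <= 1).
    """
    if not 0 < reference_fraction <= 1:
        raise ValueError("reference_fraction must be in (0, 1]")
    return result_count / reference_fraction


def average_estimate(observations):
    """Average the per-term estimates over (result_count, reference_fraction) pairs."""
    return mean(estimate_index_size(count, frac) for count, frac in observations)


# The 'der' example from the text: 210 million results, 80% of reference pages.
der_estimate = estimate_index_size(210_000_000, 0.80)
print(f"{der_estimate / 1e6:.1f} million pages")  # 262.5 million pages
```

Dividing by the fraction (rather than multiplying) is what scales the observed result count up to the full collection; average_estimate then smooths over terms whose frequency differs between the two collections.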

Limitations

There are some shortcomings compared to Maurice de Kunder's method:

Example Queries

An example of my queries, using Bing and, as the reference collection, de.wikipedia.org (queried through Google search with the special command site:de.wikipedia.org):

[Figure: koennen_wiki]

[Figure: koennen_bing]

Experimental Results

Here are the query terms and their frequencies:

Term       Bing (in thousands)   de.wikipedia.org   100     200     300
hoch       33300                 128.0              38%     77%     115%
negativ    2960                  19.6               66%     132%    199%
positiv    5280                  25.7               49%     97%     146%
unter      51300                 626.0              122%    244%    366%
falsch     5910                  24.8               42%     84%     126%
drei       21000                 479                228%    456%    684%
sieben     5440                  113                208%    415%    623%
können     51100                 504.0              99%     197%    296%
kräftig    2270                  6.2                27%     55%     82%
sehr       29600                 230.0              78%     155%    233%
gehen      16600                 81                 49%     98%     146%
gegangen   1510                  15.4               102%    204%    306%
darunter   3110                  115                370%    740%    1109%
dahinter   1170                  16.7               143%    285%    428%
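The three percentage columns can be reproduced from the first two data columns. The sketch below assumes the formula wiki * S / bing with S = 100, 200, 300; this formula is inferred from the numbers in the table and is not stated in the text, and the names ROWS, ASSUMED_SIZES and ratio_percent are hypothetical. Decimal commas in the table are read as decimal points.

```python
# Rows copied from the table:
# term -> (Bing count in thousands, de.wikipedia.org value).
ROWS = {
    "hoch": (33300, 128.0),
    "negativ": (2960, 19.6),
    "positiv": (5280, 25.7),
    "unter": (51300, 626.0),
    "falsch": (5910, 24.8),
    "drei": (21000, 479),
    "sieben": (5440, 113),
    "können": (51100, 504.0),
    "kräftig": (2270, 6.2),
    "sehr": (29600, 230.0),
    "gehen": (16600, 81),
    "gegangen": (1510, 15.4),
    "darunter": (3110, 115),
    "dahinter": (1170, 16.7),
}

ASSUMED_SIZES = (100, 200, 300)  # the last three column headers of the table


def ratio_percent(bing: float, wiki: float, size: float) -> int:
    """Inferred formula: wiki * size / bing, expressed as a rounded percentage."""
    return round(100 * wiki * size / bing)


for term, (bing, wiki) in ROWS.items():
    cells = "  ".join(f"{ratio_percent(bing, wiki, s):>5}%" for s in ASSUMED_SIZES)
    print(f"{term:<10}{cells}")
```

Because the ratio is linear in S, the 200 and 300 columns are simply two and three times the 100 column before rounding.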

Finally, a net chart visualizes the results from the table above. As you can see, 300 million seems to be a good estimate for the German index of Bing, provided that my assumptions hold (in the 300 column all terms are above 100%, with the only exception of "kräftig"):

Net chart of query terms

Created 2014-02-16 Olaf Behrendt Last modified: 2018-04-16