Estimating the Size of the German Bing Index

Abstract

How many pages does Microsoft's search engine Bing.com hold in its index? Following the idea of Maurice de Kunder, we can roughly estimate the size of Bing's German index at about 300 million pages.

Approach

Following Maurice de Kunder's approach, we run queries for a set of query terms,

first on a representative collection of web pages whose size is known, and
second on the Bing search engine, to retrieve the number of pages in which each query term appears.

Next we compare the two frequencies and estimate the size of the unknown collection. If the assumption holds that the distributional properties of the two collections are similar, our estimates should be reasonably accurate. For example, if the term 'der' (the German masculine definite article, corresponding to 'the' in English) occurs in 80% of the pages in the representative collection and Bing shows 210 million results, then we can estimate Bing's index size at about 210 / 0.80 = 262.5 million pages. To get a more reliable estimate we use not only high-frequency terms but also terms with lower frequency, and average the results.
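The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for this article; the function names `estimate_index_size` and `average_estimate` are made up for the example.

```python
from statistics import mean


def estimate_index_size(hits_in_target: float, fraction_in_reference: float) -> float:
    """Estimate the target index size from one term.

    hits_in_target: number of results the target engine reports for the term.
    fraction_in_reference: share of reference pages containing the term (0..1].
    """
    if not 0 < fraction_in_reference <= 1:
        raise ValueError("fraction_in_reference must be in (0, 1]")
    # If the term appears in the same share of pages in both collections,
    # then hits = fraction * size, so size = hits / fraction.
    return hits_in_target / fraction_in_reference


def average_estimate(observations):
    """Average the per-term estimates; observations are (hits, fraction) pairs."""
    return mean(estimate_index_size(hits, frac) for hits, frac in observations)


# The 'der' example from the text: 210 million hits, present in 80% of pages.
der_estimate = estimate_index_size(210e6, 0.80)
print(f"{der_estimate / 1e6:.1f} million pages")
```

Dividing by the fraction, rather than multiplying, is what turns a hit count into a collection size: a term found on 80% of pages implies the index is 1.25 times larger than the hit count.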

Limitations

There are some shortcomings compared to Maurice's method:

Number of terms: Since I only want to get a rough estimate of the size of Bing's index, I picked only 14 terms more or less at random, trying to get a mixture of different frequencies. In contrast, Maurice used 50 terms "selected evenly across logarithmic frequency intervals"; for details see "How is the size of the World Wide Web (The Internet) estimated?".
Choice of representative collection: Another limitation of my estimation procedure is probably that I use the German Wikipedia as the representative collection. The reason for my choice is simple: I can easily find the number of pages containing a term in Wikipedia with the following Google query: "der" site:de.wikipedia.org, which gives 1,260,000 results. Maurice, by contrast, used a representative sample of the web, and Wikipedia may well have different distributional properties.
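For reference, the site-restricted query described above can be built programmatically. The sketch below only constructs the query string and a search URL; it does not fetch or parse result counts. The helper names are hypothetical, and the use of `https://www.google.com/search` with a `q` parameter is an assumption about how one would submit the query by hand in a browser.

```python
from urllib.parse import urlencode

REFERENCE_SITE = "de.wikipedia.org"


def site_query(term: str, site: str = REFERENCE_SITE) -> str:
    """Return the quoted term restricted to one site, e.g. '"der" site:de.wikipedia.org'."""
    return f'"{term}" site:{site}'


def google_search_url(term: str, site: str = REFERENCE_SITE) -> str:
    """Return a search URL for the site-restricted query (URL-encoded)."""
    return "https://www.google.com/search?" + urlencode({"q": site_query(term, site)})


for term in ("der", "können"):
    print(site_query(term))
    print(google_search_url(term))
```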

Example Queries

An example of my queries, using Bing as the target and de.wikipedia.org as the reference collection (queried through Google search with the site:de.wikipedia.org operator):

Screenshot: Google results for "können" on de.wikipedia.org

Screenshot: Bing results for "können"

Experimental Results

Here are the query terms and their result counts. The last three columns correspond to assumed Bing index sizes of 100, 200 and 300 million pages.

Term	Bing (in thousands)	de.wikipedia.org (in thousands)	at 100 million	at 200 million	at 300 million
hoch	33300	128.0	38%	77%	115%
negativ	2960	19.6	66%	132%	199%
positiv	5280	25.7	49%	97%	146%
unter	51300	626.0	122%	244%	366%
falsch	5910	24.8	42%	84%	126%
drei	21000	479.0	228%	456%	684%
sieben	5440	113.0	208%	415%	623%
können	51100	504.0	99%	197%	296%
kräftig	2270	6.2	27%	55%	82%
sehr	29600	230.0	78%	155%	233%
gehen	16600	81.0	49%	98%	146%
gegangen	1510	15.4	102%	204%	306%
darunter	3110	115.0	370%	740%	1109%
dahinter	1170	16.7	143%	285%	428%
dahinter	1170	16,7	143%	285%	428%

And finally, a net chart to visualize the results from the table above. As you can see, 300 million seems to be a good estimate for Bing's German index, provided that my assumptions hold (at 300 million, nearly all terms are well above 100%, the only exception being "kräftig"):

Figure: Net chart of query terms
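
The reading of the chart can be checked against the table by counting, for each assumed index size, how many terms reach 100% or more. The sketch below simply copies the percentage values from the table; the names `PERCENTAGES` and `terms_at_or_above` are made up for the example.

```python
# term: percentages from the table at 100, 200 and 300 million pages.
PERCENTAGES = {
    "hoch": (38, 77, 115), "negativ": (66, 132, 199), "positiv": (49, 97, 146),
    "unter": (122, 244, 366), "falsch": (42, 84, 126), "drei": (228, 456, 684),
    "sieben": (208, 415, 623), "können": (99, 197, 296), "kräftig": (27, 55, 82),
    "sehr": (78, 155, 233), "gehen": (49, 98, 146), "gegangen": (102, 204, 306),
    "darunter": (370, 740, 1109), "dahinter": (143, 285, 428),
}
ASSUMED_SIZES = (100, 200, 300)  # millions of pages


def terms_at_or_above(threshold: int = 100) -> dict:
    """Map each assumed size to the list of terms whose percentage meets the threshold."""
    return {
        size: [term for term, values in PERCENTAGES.items() if values[i] >= threshold]
        for i, size in enumerate(ASSUMED_SIZES)
    }


result = terms_at_or_above()
for size, terms in result.items():
    print(f"{size} million: {len(terms)} of {len(PERCENTAGES)} terms at or above 100%")

# Terms that stay below 100% even at 300 million pages.
below_at_300 = [t for t in PERCENTAGES if t not in result[300]]
print("below 100% at 300 million:", below_at_300)
```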

Created 2014-02-16 Olaf Behrendt Last modified: 2018-04-16