Googlefighting as a statistics teaching example?

There was a lively discussion of my entry with googlefights between Clinton and Bush, so I thought it might be worth saying how this could be used for a project in an intro stats class.

Mark points out that the ratio of hits for “bush hitler” compared to “clinton hitler” is lower than the ratio of hits for “bush” compared to “clinton.” But Boris says, “I think any fair-minded review of the ‘bush hitler’ and ‘clinton hitler’ searches would show that there are far more credible ones on the former.”

Sampling to the rescue

How to evaluate Boris’s claim? We certainly can’t go around checking 2.3 million websites! The simplest approach would be to take simple random samples of the “bush hitler” and “clinton hitler” hits and score each of them for “credibility” (in Boris’s words) in some way. And similarly take simple random samples of the “bush” and “clinton” hits.
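Once the sampled hits are scored, the comparison itself is short. Here is a minimal Python sketch, assuming each sampled hit has already been scored by hand as 1 (credible) or 0 (not). The function names are made up for illustration, and the standard errors use the usual simple-random-sampling formula, ignoring the finite-population correction since the sample would be tiny compared to millions of hits.

```python
import math


def proportion_with_se(scores):
    """Estimate the proportion credible and its standard error.

    `scores` is a list of 0/1 values, one per sampled hit
    (hand-scored; the scoring rule is up to the students).
    """
    n = len(scores)
    p_hat = sum(scores) / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se


def difference_with_se(scores_a, scores_b):
    """Difference in proportions between two independent samples."""
    p_a, se_a = proportion_with_se(scores_a)
    p_b, se_b = proportion_with_se(scores_b)
    return p_a - p_b, math.sqrt(se_a ** 2 + se_b ** 2)
```

Students would call `difference_with_se` with the scores for the two searches and see whether the estimated difference is large relative to its standard error.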

But is simple random sampling an option in Google? Does Google provide a list of all 2.3 million hits to sample from? It certainly provides the first 100 right away; actually, you can set preferences to see 100 hits per page, so quick clicking will give the first 1000. So students could start with an SRS from these. The next step would be to see how the probability of a hit being “credible” changes with position on the list. There's really a lot to do here for an enterprising group of students.
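As a rough sketch of how that might start (in Python; the function names, the sample size of 50, and the block size of 100 are all just assumptions for illustration): draw a simple random sample of list positions from the 1000 accessible hits, score those hits by hand, and then tabulate the proportion credible within blocks of positions.

```python
import random
from collections import defaultdict


def sample_ranks(n_accessible=1000, sample_size=50, seed=None):
    """Simple random sample (without replacement) of list positions 1..n_accessible."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_accessible + 1), sample_size))


def credible_rate_by_block(scored, block_size=100):
    """Tabulate credibility by position block.

    `scored` is an iterable of (rank, is_credible) pairs, where
    is_credible is a hand-assigned 0/1 score. Returns a dict mapping
    the first rank of each block to (number sampled, proportion credible).
    """
    counts = defaultdict(lambda: [0, 0])  # block start -> [n, n_credible]
    for rank, credible in scored:
        start = ((rank - 1) // block_size) * block_size + 1
        counts[start][0] += 1
        counts[start][1] += int(credible)
    return {start: (n, k / n) for start, (n, k) in sorted(counts.items())}
```

With only 50 sampled hits spread over ten blocks the block-level proportions will be noisy, which is itself a useful lesson; a larger sample or a regression on position would be the natural follow-up.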

In general, I’d rather have students collect their own data than use the web (see Section 11.4 of our book), but here they have to do a little work, what with the sampling and the content evaluation, so it seems that it could be reasonable.

If anyone tries it out with their students, I’d be curious to hear how it works out.

7 thoughts on “Googlefighting as a statistics teaching example?”

  1. Unfortunately, there's nothing random about the way Google posts their results. The way their PageRank (tm?) algorithm works, the links that appear first are the ones that relate the most to the words you typed in and the ones that are most often clicked when you type those words in. In fact, I would imagine your last two posts have changed those rankings a little, if people have been clicking.

    So, because of that, it seems that the results are returned sorted from the most relevant to the least, according to Google, instead of randomly.

  2. Mark,

    That's where the next step comes in, of modeling the probability that a hit is credible as a function of position on the list. But then one would have to be able to go deeply into that list.

    Playing around with Google, I found that it appears to give only 1000 accessible links. I didn't realize this; I'd just assumed that one could skip thru, 1000 at a time, and go as deeply in the link list as desired. Is there a way of getting more than 1000 total links from a Google search?

  3. Hi Andrew,

    Here's what I tried:

    If you look at the query string in the address bar, you'll see a piece that says "start=xxx". I typed 9900 in that and got the following error:

    "Sorry, Google does not serve more than 1000 results for any query. (You asked for results starting from 9900.)"

    So, no, there isn't. But, even in looking at the first 1,000, you should be aware that the total number of hits isn't inclusive of multiple hits on the same page, which you probably knew. So, by including those, you change the makeup of the dataset.

    I guess the question becomes, do multiple hits on a site mean that site is a more "legitimate" source of info on a given topic than another? For instance, when including "multiples" on the query of the name Bush, news sites make up the bulk of the hits because of multiple stories about Bush. The same would probably not be the case for, say, the word apple.

  4. I wonder if we could ask Google directly. There are so many geeks there, I'm sure they'd be interested in chatting with us, as long as they don't have to reveal the proprietary PageRank secrets. Hell, the two founders used to be PhD students…

    By the way, has anyone noticed how badly the relevance of Google results degrades further down the list for typical searches? The top bunch are usually spot on, but the 50th or 100th seem wildly off. Just intuition, of course.

  5. It seems computationally laborious to play this Google research game. But I like the more general idea of Google research games. I do a lot of Google research. I am teaching a big undergraduate research methods class now and have not talked at all about using the internet, especially Google, for research.

    What would an Intro to Google from a Research Methods/Statistics 101 perspective look like?

  6. Google only gives you less than 0.0001% of the search results for queries on the biggest and most interesting search terms, like searching on "Google" itself. This promotes groupthink (or should that be google-think?) where everyone sees the same small subset of information on a subject, and Google won't let anyone validate its counts.
    Yes, those with hours on their hands can try to tweak search queries with obscure words to try to peek into the "tail" of the results but where is the service in that?

    The 1000-result limitation is a great way for Google to manage its image but a terrible way to serve its self-stated mission "to organize the world's information and make it universally accessible and useful", unless they think letting people access less than 0.0001% of the pages on a subject is a pass mark for accessibility!

    Google is now an advertising engine, and this 1000-result limitation just helps limit results to the select few that are likely to be more positive about their subjects and more pleasing to Google's advertisers. Google has since added behavioural targeting to its ads, where your search history helps its advertisers target your wallet.

    The world will be better when Google stops pretending, admits to being the world's biggest marketer and lets some other company or collective give the public what they most wanted in the first place – a good search tool that works for them, not against them.
