Thursday, October 29, 2009
The Odds of the Curse, Part II.
I computed some odds in this post that ended up very wrong. So I modified the post accordingly.
So what is needed is the odds that a randomly picked word starts with a given letter. At the Bay Guardian, commenter Frouglas picked a probability that a word starts with a given letter, but his choice assumes all words are equally likely. Of course, some words are more likely than others. Commenter Spike counted the words in a file and computed the frequency of each letter starting a word that way. His method is good, but the file would be too small to give a really meaningful result.
So here's what I tried to do.
Assumption 1: the distribution of the words in the English language follows Zipf's law. This is a classical model for this distribution, so that's a reasonable assumption.
Assumption 2: we take 10,000 words in our dictionary. You could take 100,000 if you wanted to; it would not make much difference. But you need to have a finite dictionary.
Now, I took the 418 most frequent words in English (all the words that appear more than 200,000 times per billion). I did it only for these 418 words because the data is not formatted nicely for processing; otherwise I would have done it for all 10,000 words. For those 418, I computed the frequency of all the words that start with the letters CFKOUY. For the rest of the words, I just assumed that the distribution Frouglas applied (that is, the probability that any word starts with a given letter) was valid. I then weighted those two distributions according to the Zipf distribution to get my probabilities that a randomly selected word starts with one of the letters CFKOUY. Namely, there is a 67% chance that such a word belongs to the Top 418, with one distribution, and a 33% chance that it belongs to the rest, with the other distribution.
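Here is a minimal Python sketch of that weighting step, not the exact computation I ran. It assumes a Zipf exponent of 1 (word frequency proportional to 1/rank), and the function names `zipf_top_mass` and `blended_letter_probability` are made up for illustration. The per-letter inputs `p_top` and `p_rest` are left as parameters, since the actual per-letter frequencies are not listed in this post.

```python
from math import fsum

def zipf_top_mass(top_k, dict_size, exponent=1.0):
    """Share of word occurrences covered by the top_k most frequent words,
    assuming frequency is proportional to 1 / rank**exponent (Zipf's law)
    over a finite dictionary of dict_size words."""
    weights = [1.0 / rank**exponent for rank in range(1, dict_size + 1)]
    return fsum(weights[:top_k]) / fsum(weights)

def blended_letter_probability(p_top, p_rest, top_mass):
    """Mix two first-letter probabilities for the same letter:
    p_top is measured on the frequent words, p_rest is assumed for the rest,
    and top_mass is the Zipf share of the frequent words."""
    return top_mass * p_top + (1.0 - top_mass) * p_rest

# Share of the Top 418 in a 10,000-word dictionary (exponent 1 is an assumption).
top_mass = zipf_top_mass(418, 10_000)
print(f"Top 418 share: {top_mass:.1%}")
```

With an exponent of 1, the Top 418 share for a 10,000-word dictionary falls between 67% and 68%, in line with the 67/33 split above. Passing a larger `dict_size` lowers that share, which shifts weight toward the second distribution.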
And at the end, my estimate of the probability that the FUCK YOU is coincidental (accounting for the different anagrams) is 1 in 500,000 billion.
If I take a larger dictionary, I'm giving more weight to Frouglas's distribution, but the order of magnitude stays the same: 1 in 300,000 billion.