Thursday, October 29, 2009
The Odds of the Curse, Part II.
I computed some odds in this post that ended up very wrong. So I modified the post accordingly.
So what is needed is the odds that a randomly picked word starts with a given letter. At the Bay Guardian, commenter Frouglas took the fraction of dictionary words that start with a given letter as that probability, but that assumes all words are equally likely. Of course, some words are more likely than others. Commenter Spike counted the words in a file and computed the frequency of each letter starting a word this way. His method is good, but the file is too small to give a really meaningful result.
So here's what I tried to do.
Assumption 1: the distribution of the words in the English language follows Zipf's law. This is a classical model for this distribution, so that's a reasonable assumption.
Assumption 2: we take 10,000 words in our dictionary. You could take 100,000 if you wanted to; it would not make much difference. But you need to have a finite dictionary.
Now, I took the 418 most frequent words in English (all the words that appear more than 200,000 times per billion). I did it only for these 418 words because the data is not formatted nicely for processing; otherwise I would have done it for all 10,000 words. For those 418, I computed the frequency of all the words that start with the letters CFKOUY. For the rest of the words, I just assumed that the distribution Frouglas applied (that is, the probability that any word in the dictionary starts with a given letter) was valid. I then weighted those two distributions according to the Zipf distribution to get my probabilities that a randomly selected word starts with one of the letters CFKOUY. Namely, there is a 67% chance that such a word belongs to the Top 418, with one distribution, and a 33% chance that it belongs to the rest, with the other distribution.
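The 67% / 33% split can be roughly checked with a few lines of Python. This is only a sketch under my own assumptions: word frequency exactly proportional to 1/rank (Zipf's law with exponent 1), and function names (harmonic, zipf_top_share, mixed_probability) that I made up for the illustration. The per-letter frequencies of the Top 418 are not reproduced here, so the mixing function is shown without real inputs.

```python
def harmonic(n: int) -> float:
    """Return the n-th harmonic number, 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / rank for rank in range(1, n + 1))


def zipf_top_share(top: int, dictionary_size: int) -> float:
    """Probability that a random word is among the `top` most frequent words,
    assuming frequency is proportional to 1/rank (an assumption of this sketch)."""
    return harmonic(top) / harmonic(dictionary_size)


def mixed_probability(p_top: float, p_rest: float, share_top: float) -> float:
    """Weight the letter probability measured on the top words against the
    dictionary-based probability used for all the other words."""
    return share_top * p_top + (1.0 - share_top) * p_rest


share_10k = zipf_top_share(418, 10_000)
share_100k = zipf_top_share(418, 100_000)
print(f"Top 418 share, 10,000-word dictionary:  {share_10k:.1%}")
print(f"Top 418 share, 100,000-word dictionary: {share_100k:.1%}")
```

With these assumptions the Top 418 carry roughly two thirds of the probability mass for a 10,000-word dictionary, and noticeably less for a 100,000-word one, which is why a larger dictionary shifts weight toward the dictionary-based distribution.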
And at the end, I get my estimate of the probability that the FUCK YOU is coincidental (accounting for the different anagrams) to be: 1 in 500,000 billion.
If I take a larger dictionary, I'm giving more weight to Frouglas's distribution, but the order of magnitude stays the same: 1 in 300,000 billion.
The Odds of the Curse
[updated: I goofed in my numbers, so I've modified significantly] So Arnold Schwarzenegger has embedded a big "fuck you" to Ammiano in the text explaining why he vetoed an otherwise uncontroversial measure. This was first uncovered by the SF Bay Guardian.
In a follow-up post, they discuss the odds of this being innocent, rather than carefully planned.
Of course, they got their number from a politician, so it's wrong:
At any rate, Supervisor David Chiu has done the math and concludes that it's highly unlikely this was a mistake:
"Assuming it was real, I calculated the probability that this is pure chance. Assuming it's a 1/26 chance for each particular letter, the probability that this is random is one out of 8,031,810,176."
OK, that's 1 in 8 billion.
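For the record, that number is just 26 to the 7th power. A quick check in Python, assuming seven letters and a uniform 1/26 chance for each (the variable names are mine):

```python
from fractions import Fraction

LETTERS_IN_ALPHABET = 26
ACROSTIC = "FUCKYOU"  # seven first letters

# Every letter is assumed equally likely, so the chance of one specific
# seven-letter sequence is (1/26) multiplied by itself seven times.
p_uniform = Fraction(1, LETTERS_IN_ALPHABET) ** len(ACROSTIC)
print(f"1 in {p_uniform.denominator:,}")  # 1 in 8,031,810,176
```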
Well, it's quite obvious that letters are not equiprobable. More words start with A than with Z, of course, so those odds are wrong. A commenter at the Bay Guardian came up with his own formula:
using the 2of12 list from the 12dicts file found at http://wordlist.sourceforge.net/, I calculated the probability of a word starting with the following letters as follows:
f = 4.40%; u = 3.59%; c = 9.30%; k = 0.66%; y = 0.29%; o = 2.66%; u = 3.59%
for an overall probability of 2.39E-12, or approximately 1 in 370,855,495,993. so a much lower probability than that calculated by Supervisor Chiu.
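The commenter's product can be redone from the rounded percentages quoted above. A small sketch in Python (the dictionary name and variable names are mine, and the inputs are the rounded figures, not the original word list):

```python
import math

# Share of dictionary words starting with each letter, as quoted in the comment.
first_letter_share = {
    "f": 0.0440,
    "u": 0.0359,
    "c": 0.0930,
    "k": 0.0066,
    "y": 0.0029,
    "o": 0.0266,
}

ACROSTIC = "fuckyou"

# Multiply the share of each of the seven letters, in order.
p_dictionary = math.prod(first_letter_share[letter] for letter in ACROSTIC)
print(f"probability: {p_dictionary:.3e}")
print(f"about 1 in {1 / p_dictionary:,.0f}")
```

With these rounded inputs the product comes out near 2.7E-12, which agrees with the quoted "1 in 370 billion" better than with the quoted 2.39E-12. That difference may simply be rounding or a typo in the original comment.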
That's 1 in 370 billion. But again, that is wrong. Indeed, if you could keep only ONE word starting with, for instance, the letter T in the English language (and re-assign all other T-starting words to other letters), the probability of finding a word starting with the letter T in a text would depend on what that unique word is. If it's "THE", then it would be quite likely, since it's the most frequent word in the English language; if it's "THEREMIN", not as much. What matters is not how many words start with this letter, but how frequent the words that start with a given letter are in the language.
Another commenter had a better idea: "spike" counted how often these letters appear as the first letter of a word in some big text and came up with odds of 1 in 600 billion. This captures both how many words start with each letter and how frequent those words are in the language. However, his text has 30,000 words, and there are over 100,000 words in the English dictionary, so his result is not statistically valid. Still, that's a good ballpark estimate.
So far (I'm not sure how spike did it), everyone has calculated the odds of finding the letters in FUCKYOU. But these letters also spell CFKOUUY, FCUKYOU, YOUUFCK, and plenty of other anagrams. So the odds above cover many more sequences than the one we care about, and you need to disambiguate. If you treat the two "unnecessary" as different, one starting with U1 and the other with U2, then there are two orderings of the letters C, F, K, O, U1, U2, Y that spell the right result: FU1CKYOU2 and FU2CKYOU1. There are 7*6*5*4*3*2*1 = 7! = 5040 possible orderings of these letters. So there is a 2/5040 chance that, if the first letters of the words are taken from the set {CFKOUUY}, they spell out FUCKYOU.
So if we take the value found by Spike (1 in 600 billion), and multiply by 2/5040, then we get one in 1.5 million billion.
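The counting in this step can be checked by brute force. A minimal sketch in Python, using the U1/U2 labels from the text (the variable and function names are mine) and taking spike's 1 in 600 billion as given:

```python
from itertools import permutations

# The seven first letters, with the two U's labeled so they count as distinct.
LABELED_LETTERS = ["C", "F", "K", "O", "U1", "U2", "Y"]
TARGET = "FUCKYOU"


def spelled(ordering) -> str:
    """Drop the 1/2 labels and join the letters into a single string."""
    return "".join(label[0] for label in ordering)


orderings = list(permutations(LABELED_LETTERS))
matching = [o for o in orderings if spelled(o) == TARGET]

print(len(orderings))  # 5040
print(len(matching))   # 2

# Apply the 2/5040 factor to spike's "1 in 600 billion".
spike_one_in = 600_000_000_000
adjusted_one_in = spike_one_in * len(orderings) // len(matching)
print(f"1 in {adjusted_one_in:,}")  # 1 in 1,512,000,000,000,000
```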
No matter how you slice it, the odds are much much much lower than all the numbers suggested by the commenters at the Bay Guardian (1 in 8 billion for Chiu, one in 370 billion by frouglas, one in 600 billion computed by spike).
The odds of OJ being not guilty according to the DNA evidence are 1 in 170 million.
The Chron: "We'll Mislead You!"
Oh, Carolyn Lochhead, why does Hearst pay you, and not the GOP directly? Here she goes:
Senate Majority Leader Harry Reid's gambit to include a government-run insurance option in health care legislation has given a fresh tailwind to the idea despite opposition from conservatives.
But lost amid the ideological battle for or against a public option is a key overlooked fact: The vast majority of Americans would have no access to a public option even under its most expansive versions.
That's because the vast majority of Americans is already covered by insurance.
Even seven years into an overhaul, an estimated 90 percent of Americans, including nearly everyone who has employer-based coverage now, would be shut out of a public option.
Left unsaid is that they would not need a public option plan, since they are already covered!
Those currently in other government programs, such as Medicare and the Veterans Administration, also would be excluded.
Because they already have government-provided insurance. Duh!
This is the most absurd coverage of the day. In other news, owners of a 2009 Lexus are not covered by the cash-for-clunkers program. What a scandal! Most Americans are ineligible for unemployment benefits. Stop the presses!
Remember: Carolyn Lochhead has in the past taken the stand that Social Security, Medicare, and Medicaid should be dismantled. So she'll grasp at any straw to whip up opposition to health care reform.
Also left unsaid in the article: the fact that the public option is limited is good for fiscal reasons! Providing insurance to those who don't have it is expensive, so it's a good thing it's not for 100% of Americans. I wonder why she would not bring it up.
Finally, the gambit in this article is called: "what's in it for me?" It's a common reflex with Republicans, who don't want to pay taxes for services they won't benefit from directly and will benefit undeserving others (with "others" being tinged with racism, as per Saint Reagan's Cadillac Driving Welfare Moms.)
"When you ask people in a poll, 'Are you in favor of a public option that would be available to everybody,' they say, 'Yes,' " Wyden said. "I don't think they're going to feel the same way about a public option available to only 10 percent of the population."
So Lochhead won't say that, actually, 100% of Americans might lose their jobs and become eligible for a public option. Or 100% of Americans might decide to create a startup and become eligible for a public option.
Sunday, October 04, 2009
Why you should not read Debra
I did not read Debra, I assure you, but here's what's on the front page of SFGate (I put in the headline and the subheadline, but no link, because you should not read it).
Traffic, trash, soda, cigarettes & the city
Nanny Newsom is picking on personal behaviors when he should be focusing on real problems. Debra Saunders.
Since when is blocking city streets, trashing the city, littering with cigarette butts, or giving people second-hand lung cancer a personal behavior?