Recently, Taylor Shobe posted a problem based on Marilyn vos Savant's Sunday column on this subject.

The question seems straightforward, and can be phrased as follows:

*A die is thrown 20 times, and a report is made of the recorded outcomes. Which is more likely to be the report, A or B?*

*A=11111111111111111111*

*B= 66234441536125563152*

The phrasing of this problem turns out to be critical. As originally phrased, Taylor's problem raised objections from a number of people, including Calvin Lin, but as it stands now, it should **almost** be correctly worded for the answer "B is more likely", which is the common sense answer. Of course a die thrown randomly 20 times is going to generate a result like B, and nobody would expect A. And yet, how is it mathematically decided that B is more likely than A? If the question had been

*Which sequence A or B is the more likely outcome?*

then the correct answer would be "equally likely", since the probability of either happening is \(1\) in \(6^{20}\).
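To make the "equally likely" claim concrete, here is a minimal sketch in Python, assuming a fair six-sided die and independent throws. The function name `sequence_probability` is just an illustrative choice.

```python
from fractions import Fraction

A = "11111111111111111111"
B = "66234441536125563152"

def sequence_probability(seq: str, sides: int = 6) -> Fraction:
    """Probability that fair, independent throws produce exactly `seq`."""
    # Each throw matches its target digit with probability 1/sides,
    # so only the length of the sequence matters, not its contents.
    return Fraction(1, sides) ** len(seq)

print(sequence_probability(A))                             # exact value of 1 / 6^20
print(sequence_probability(A) == sequence_probability(B))  # True
```

Nothing about the digits themselves enters the calculation, which is exactly why any argument for "B is more likely" has to come from somewhere else.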

Can anyone throw some light on this subject? If the "common sense answer" is "B is more likely", how is that decided mathematically? Making the argument that "there's a lot more random sequences of digits than there are sequences where all the digits are the same" is not sufficient, since it involves opinions about what is random and what is ordered. For instance, given C and D

C=12123123412345123456

D=12312423532143215432

why are we likely to say that D is more likely than C? We can *claim* that C is "more ordered", but try quantifying that mathematically.

## Comments

I avoided answering Taylor's question because my objective and subjective thought processes reached a stalemate.

"Common sense" can be so subjective. If we were dealing with shorter strings, say \(A = 11, B = 25,\) then I think that most people's "common sense" would lead them to conclude that the strings are equally likely, even though string \(A\) appears more "ordered" than string \(B.\) (I may be wrong about what the "common sense" of others might be here, though.) Once we get to strings of length \(3\), say \(A = 111, B = 254,\) then "common sense" would probably start to veer to string \(B\) as being more likely, even though they each have a chance of \(1\) in \(216\) of occurring. This trend becomes more ingrained as the strings get longer.

When I looked at the original two strings, my first thought was that, since the probability that none of the numbers \(2\) through \(6\) appear after 20 throws is so unbelievably remote, string \(B\) just had to be more likely. But I also knew that, from scratch, each string had a \(1\) in \(6^{20}\) chance of occurring. But then I thought, as you did, of strings that had the same number of occurrences of each number, but with one string possessing a more "ordered" sequence of these numbers than the other. Then what? It almost seems to be a trait of the human mind to see order as a sign of deliberate manipulation, and thus less "naturally" possible, leaving apparent "randomness" as the more likely natural state of being. So to mathematically quantify these observations, I suppose we would need to quantify this intrinsic perception of "randomness", i.e., assign some measure that takes into account, in this case, the frequency of occurrences of any substring of length \(1\) on up, and the distribution of occurrences of these same strings. But all this would still fly in the face of the purely objective observation that each string, no matter how apparently ordered, is just as likely as any other. At the moment I can't see any easy way out of this quandary.

Thanks for jumping right in. You've correctly identified the difficulties of assessing "precise" relative probabilities of "different kinds" of integer sequences. If we used specific criteria, such as "all strings that are composed of only digits 0 and 1", then we could compute the probabilities. But there's a mind-boggling range of different kinds of criteria we could apply to such sequences.

I need to be out of town (again!) for about a day and I will get back on this subject. I welcome your thoughts on this.

Edit: The other day, someone mentioned "Surreal Numbers", which are numbers that are defined by "turns taken in the tree of all possible [surreal] numbers to reach particular ones". An analogy would be that instead of using addresses to locate a place in a city, we use a sequence of L and R for left and right turns from some origin to get to the place, and that defines that location. I just wonder if we could work out a similar scheme, using various criteria, in which to get a grip on all the different kinds of sequences, from random to not-so-random to ordered? Is it possible to mathematicize this?

Wow, surreal numbers are pretty trippy, and while I can see how they might be useful here, I'm not comfortable enough with them yet, so I'll stick to more mundane stuff for now.

I was thinking of trying something more along the lines of the Kolmogorov complexity, at least informally. We could first convert the string to binary, where 0 indicates an even digit and 1 an odd digit, and use established probability "rules" applicable to binary strings. To then factor in the diversity of evens and odds, we could have a function \(f(x_{1}, x_{2}, x_{3}, x_{4}, x_{5}, x_{6}),\) where \(x_{k}\) is the counter for the number of occurrences of the number \(k\) in the original string. We could then convert \(f\) for a given string to a value based on the variation, (or lack of such), amongst the \(x_{k}\)'s. This combined, (weighted average?), with the even/odd distribution measure would make for a good start in determining the "randomness" of a given string, and thus its (subjective) relative likelihood.
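As a rough sketch of the two ingredients described above (the even/odd conversion and the counters \(x_{k}\)), something like the following Python could serve as a starting point. The particular choice of \(f\) here, the largest count minus the smallest, is only an assumption for illustration, as are all the function names.

```python
from collections import Counter

A = "11111111111111111111"
B = "66234441536125563152"
C = "12123123412345123456"
D = "12312423532143215432"

def parity_string(seq: str) -> str:
    """Binary version of the report: 0 for an even digit, 1 for an odd digit."""
    return "".join(str(int(ch) % 2) for ch in seq)

def digit_counts(seq: str) -> list:
    """The counters x_1, ..., x_6: how often each face appears in the report."""
    counts = Counter(seq)
    return [counts[str(k)] for k in range(1, 7)]

def count_spread(seq: str) -> int:
    """Hypothetical f: largest counter minus smallest counter."""
    x = digit_counts(seq)
    return max(x) - min(x)

for name, seq in [("A", A), ("B", B), ("C", C), ("D", D)]:
    print(name, parity_string(seq), digit_counts(seq), count_spread(seq))
```

How to weight the spread against the even/odd measure is left open, just as it is in the description above.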

We could also throw in a factor that counts the number of distinct \(k\)-digit substrings present within a given \(n\)-digit sequence, for \(k = 2,3, ..., n-1.\) We might also want to introduce a function indicative of the distribution of each digit, (and possibly substrings), within the given string to watch out for the possibility of the occurrences of a digit being "bunched" in one portion of the string, as the even/odd distribution measure would only partially take care of this possibility. Adding any further measures would make the calculation inordinately complicated.
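The distinct-substring factor is simple enough to sketch directly. The Python below counts the distinct \(k\)-digit substrings of a string for \(k = 2, 3, \ldots, n-1\); the names `distinct_substrings` and `substring_profile` are made up for this illustration, and how to fold the counts into a single score is not settled here.

```python
A = "11111111111111111111"
B = "66234441536125563152"

def distinct_substrings(seq: str, k: int) -> int:
    """Number of distinct contiguous substrings of length k in seq."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

def substring_profile(seq: str) -> dict:
    """Distinct-substring counts for k = 2, 3, ..., n-1, keyed by k."""
    n = len(seq)
    return {k: distinct_substrings(seq, k) for k in range(2, n)}

print(substring_profile(A))  # every count is 1, since A only repeats "1"
print(substring_profile(B))
```

A string like \(A\) sits at the minimum of this measure for every \(k\), while a string with few repeats sits near the maximum of \(n - k + 1\).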

Another approach would be to apply some kind of compression algorithm to a given string. Some possible steps for \(20\)-digit strings could include: (i) the third and any successive occurrences of a digit are eliminated, (ii) the third and any further occurrences of a particular \(2\)-digit substring are eliminated, (iii) the second and any further occurrences of a particular substring of \(3\) or more digits are eliminated. Using these steps, the strings \(A,B,C,D\) become

\(A' = 11, B' = 662344153612556152, C' = 121234, D' = 12312423532145432.\)

With the length of the compressed string indicating its likelihood, the results are more in line with "common sense", although \(D'\) still "seems" too long since there is no occurrence of the digit \(6,\) and since there still appears to be a rough pattern to the compressed string with an almost too even distribution of the digits, particularly \(1\) through \(3.\) In other words, \(D'\) still seems to be a string partially manufactured to simulate randomness, but done poorly so as to betray its creator's intent. There are too many ascending and descending substrings to appear random, so perhaps we could add a fourth step that eliminates some of these "offending" substrings, e.g., (iv) eliminate the third and any further occurrence of an ascending or descending substring of length \(\ge 3\) of the already once compressed string. This would result in the revised compressed string \(D'' = 1231242353214,\) which now seems more appropriately evaluated compared to \(B'.\) We could possibly also employ some of the steps from the first method I outlined to further "out" such pseudorandom strings. For example, \(D''\) would be further evaluated downward for its lack of containing any \(6\)'s, but would be marked neutral by the even/odd distribution measure.

The details of the compression algorithm would have to be adjusted for the length of strings we are evaluating, but this is the gist of things at present. One last thought; I would be suspicious of a (20-digit) string that was not affected at all by this compression algorithm, as it would almost seem too "perfect", and thus more likely manufactured.

Well, in regards to A vs. B, I would have said equally likely, since each roll is a separate occurrence and the previous result has no bearing on the next roll. It's similar to a roulette wheel: just because it hit black 10 times doesn't mean that it HAS to land on red. But because there is only one 1 on the die and 5 other options that it could land on, B is viable as well because of probability in that sense. I definitely could make an argument either way.

A and B, exactly as they are, occur with equal likelihood. But if a friend told you that he rolled 10 consecutive pairs of 1's with dice at the casino, would you believe him? It would have been about a hundred times more likely for him to have drawn a Royal Flush in a 5 card poker game.

I would ask how much they were betting when hitting so many hard ways! LOL

Great question, and I am going to study on it. BTW it is Marilyn VOS Savant, not von....

Thanks, fixed von \(\rightarrow\) vos.

I'm a little late to this - but I believe the problem lies in the statement, which should be, "Your friend rolls a die twenty times. He reports that he found sequence A or sequence B. Which is more likely?" Sequences A and B are equally likely, but they belong to types of sequences (streaks and mixed streaks) that are not - the friend might as well have said, "I rolled a 20-long streak" or "I didn't roll a 20-long streak," right?

I do think there are some problems with this though. If he'd said, "I rolled a sequence that started with a 6" vs. "I rolled a sequence that didn't start with a 6," we'd reach the opposite conclusion. I don't know how Marilyn got past that part.
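Here is a small Python sketch of the two ways of grouping the sequences mentioned in this comment, assuming a fair die and independent rolls. It only computes the probability of each "type"; the variable names are illustrative.

```python
from fractions import Fraction

N = 20
total = 6 ** N  # number of possible 20-roll sequences

# Grouping 1: "20-long streak" (111...1 through 666...6) vs. everything else.
p_streak = Fraction(6, total)
p_not_streak = 1 - p_streak

# Grouping 2: "starts with a 6" vs. "doesn't start with a 6".
p_starts_with_6 = Fraction(6 ** (N - 1), total)
p_not_starts_with_6 = 1 - p_starts_with_6

print(float(p_streak), float(p_not_streak))
print(p_starts_with_6, p_not_starts_with_6)  # 1/6 5/6
```

Under the first grouping, B's type is overwhelmingly more likely than A's. Under the second, B starts with a 6 and A does not, so A's type is the more likely one, even though the two individual sequences are equally likely throughout.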

It's all the things left unsaid that matter. When Marilyn claimed that B was more likely, she was really saying, "a sequence of random digits is more likely than a sequence of just one digit". In other words, a sequence that "looks random" is more probable than a sequence of all 1's. Which is true. But that's not exactly how the problem is worded, or this part was left unsaid.

In fact, when attempting to compare a sequence A and another sequence B, based on "patterns" in them that we [think we] see, it can get incredibly difficult to assess probabilities. As a matter of fact, this is related to the SETI project. How do we tell if some signal from outer space is "not random"? And does a signal that "looks random" necessarily mean it's truly just a random signal? Even here on Earth, we routinely compress data for more efficient transmission, and it turns out that the more we compress, the more the resulting signal looks random!

Personally, my problem is that the problem is missing a crucial piece of information: how are the sequences generated? However, this leads to more questions...

How are the sequences generated?

How is the second sequence generated?

What are the "types"?
