What is a Random Variable? An Idea Integral to Statistics and Many Other Fields.
My own explanation of what a random variable is: deriving it and explaining what it truly means beyond what is taught in typical intro/applied statistics courses.
Hey everyone, this is going to be my first proper statistics-focused blog post. I am going to break down and derive a topic that I felt wasn’t explained well to me when I was first getting into statistics: the random variable.
Most introductory statistics classes describe a random variable as “a variable whose value is subject to variations due to chance.” While that’s fine for calculating the mean of a die roll, it’s actually a bit of a lie. In reality, a random variable is neither random nor a variable. It’s a deterministic function.
The Concept of an Experiment
The Experiment and the Sample Space
Before we can even begin to define the variable, we need the world it exists in. We call this a random experiment: a procedure whose outcome we can’t predict with certainty, such as checking blood pressure or playing a video game like Overwatch.
The set of all possible outcomes is the Sample Space ($\Omega$).
- If we flip a coin twice, $\Omega = \{HH, HT, TH, TT\}$.
- If we are measuring latency, $\Omega = [0, \infty)$.
The Deterministic Function
Here is our main derivation: A random variable is a function that maps outcomes from the sample space to the real numbers:
$$X: \Omega \to \mathbb{R}$$
Why do we do this? Because you can’t add a Head and a Tail. By mapping these outcomes to numbers (H = 1, T = 0), we can perform arithmetic, calculate expectations, and build models.
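To make the mapping concrete, here is a minimal Python sketch for the two-flip experiment. The names `omega`, `X`, and `pmf`, the choice of counting heads, and the assumption that the coin is fair (so all four outcomes are equally likely) are mine, purely for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for two coin flips: HH, HT, TH, TT.
omega = ["".join(flips) for flips in product("HT", repeat=2)]

def X(outcome: str) -> int:
    """A random variable: a plain deterministic function from outcomes to numbers."""
    # H = 1, T = 0, summed over both flips, i.e. the number of heads.
    return sum(1 if flip == "H" else 0 for flip in outcome)

# Assumption: a fair coin, so each of the four outcomes has probability 1/4.
p_outcome = Fraction(1, len(omega))

# Distribution of X: add up the probability of every outcome mapping to each value.
counts = Counter(X(w) for w in omega)
pmf = {value: count * p_outcome for value, count in counts.items()}

# The expectation is only computable because the outcomes are now numbers.
expected_value = sum(X(w) * p_outcome for w in omega)

print(pmf)
print(expected_value)
```

Notice that `X("HT")` returns 1 every single time you call it. All of the randomness lives in which outcome the experiment produces, not in `X` itself.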
Measurability
If you stop at “it’s a function,” you’re missing the requirement that matters for engineering. For $X$ to be a valid random variable, it must be measurable.
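This means that for any interval on the number line, the set of outcomes that map into that interval must be an “event” we actually have a probability for. Mathematically, for every interval $I \subseteq \mathbb{R}$, this can be shown as:
$$X^{-1}(I) = \{\omega \in \Omega : X(\omega) \in I\} \in \mathcal{F}$$
Where $\mathcal{F}$ is our $\sigma$-algebra. If this condition isn’t met, the probability $P(X \in I)$ is just undefined.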
Think of the $\sigma$-algebra as the collection of all ‘askable questions’ about our experiment. Measurability ensures that for every numerical interval we care about, there is a corresponding ‘askable question’ in our sample space.
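On a finite sample space you can check measurability by brute force, which I find makes the idea much less abstract. The sketch below is my own toy setup, not a standard library routine: `first_flip_only` is a hypothetical collection of events where we can only ask about the first of two coin flips, and `is_measurable` is a helper I made up for this post.

```python
from itertools import chain, combinations

omega = frozenset({"HH", "HT", "TH", "TT"})

def power_set(s):
    """Every subset of s: the largest possible collection of events on a finite set."""
    items = list(s)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return {frozenset(sub) for sub in subsets}

# A coarser collection of events: we can only "ask" about the first flip.
first_flip_only = {
    frozenset(),
    omega,
    frozenset({"HH", "HT"}),  # first flip was heads
    frozenset({"TH", "TT"}),  # first flip was tails
}

def is_measurable(X, events):
    """Check that {w : X(w) <= x} is an event for every value x that X attains.

    On a finite sample space these sets are enough: the preimage of any other
    interval can be built from them with complements, unions and intersections.
    """
    for x in {X(w) for w in omega}:
        preimage = frozenset(w for w in omega if X(w) <= x)
        if preimage not in events:
            return False
    return True

def num_heads(w):
    return w.count("H")

def first_is_heads(w):
    return 1 if w[0] == "H" else 0

print(is_measurable(num_heads, power_set(omega)))      # True
print(is_measurable(num_heads, first_flip_only))       # False
print(is_measurable(first_is_heads, first_flip_only))  # True
```

When only the first flip is observable, “zero heads” corresponds to the set {TT}, which is not an askable question, so the head count fails the check while the first-flip indicator passes.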
Realizations vs The Variable
In statistics we must distinguish between two cases:
- $X$: The function/process itself.
- $x$: A realization or observed data point.
When we collect a dataset, we are looking at a collection of realizations. As we collect more, the histogram of these values begins to resemble the Probability Density Function (PDF) of the underlying $X$.
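Here is a rough way to watch that happen using only the Python standard library. I am assuming a standard normal random variable purely as an example, and the helper names (`histogram_density`, `max_gap_to_pdf`), the bin width, and the seed are arbitrary choices of mine.

```python
import math
import random

def standard_normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def histogram_density(samples, lo=-4.0, hi=4.0, bin_width=0.5):
    """Return (bin_midpoint, density) pairs, where density = fraction of samples / bin width."""
    n_bins = int(round((hi - lo) / bin_width))
    counts = [0] * n_bins
    for s in samples:
        if lo <= s < hi:
            counts[int((s - lo) // bin_width)] += 1
    n = len(samples)
    return [(lo + (i + 0.5) * bin_width, c / (n * bin_width)) for i, c in enumerate(counts)]

def max_gap_to_pdf(n: int, seed: int = 0) -> float:
    """Largest gap between the histogram heights and the PDF at the bin midpoints."""
    rng = random.Random(seed)
    realizations = [rng.gauss(0.0, 1.0) for _ in range(n)]  # n observed values x
    return max(abs(d - standard_normal_pdf(m)) for m, d in histogram_density(realizations))

for n in (100, 10_000, 100_000):
    print(n, round(max_gap_to_pdf(n), 4))
```

The gap should generally shrink as `n` grows, although it never reaches exactly zero here because a histogram with fixed-width bins is only a coarse approximation of the density.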
Why this Matters
Why go through all of this trouble? Among the use cases that appear in Data Science, there are two that I think are particularly relevant:
- High-Dimensional Inference: When we move beyond single numbers to vectors, we are mapping $X: \Omega \to \mathbb{R}^n$. The same rules of measurability apply to ensure our high-dimensional models (like Bayesian priors) are mathematically sound.
- Convergence: Understanding $X$ as a function allows us to use the Law of Large Numbers. It states that as our sample size grows, our empirical averages “converge” to the true expected value $E[X]$.
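The convergence point is easy to see in a quick simulation. This is just a sketch assuming a fair six-sided die, where the true expected value is 3.5; the function name `running_means`, the seed, and the checkpoints are my own choices.

```python
import random

def running_means(n_rolls: int, seed: int = 0):
    """Yield the empirical average after each roll of a fair six-sided die."""
    rng = random.Random(seed)
    total = 0
    for i in range(1, n_rolls + 1):
        total += rng.randint(1, 6)  # one realization x of the die roll X
        yield total / i

true_mean = sum(range(1, 7)) / 6  # E[X] for a fair die

checkpoints = {10, 1_000, 100_000}
for i, mean in enumerate(running_means(100_000), start=1):
    if i in checkpoints:
        print(f"n={i:>6}  average={mean:.4f}  gap={abs(mean - true_mean):.4f}")
```

Early averages can wander quite a bit, but by the last checkpoint the empirical average should sit very close to 3.5.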
I hope that this explanation of what a Random Variable is was helpful. I am going to write some more posts in a similar fashion to this one on topics that I felt weren’t covered properly when I was first learning about statistics.