Beyond the Point: An Introduction to Functional Data Analysis

What is, in my opinion, the superior method of extracting insights from data.

Hey everyone, for this post, I’m gonna be talking about an area of stats that’s had my attention for the past 4ish months: Functional Data Analysis (FDA). I hope in the future I’ll be able to conduct research on this topic and get more in-depth with you guys, but for now I’ll explain what it actually is and why it’s so interesting. So enjoy!

Introduction: Shifting from Points to Curves

Traditionally, statistics has treated data as discrete points. But as technology has advanced with things like sensors, medical imaging, IoT, and wearables, we now generate vast volumes of high-frequency data across a variety of fields. The goals of FDA, such as descriptive statistics, classification, and regression, are the same as for statistical analysis of multivariate data, and many related methods have been transferred to function-valued data. FDA does bring additional challenges, though, including the high dimensionality of observations and parameters. In contrast to simpler methods that reduce functional observations to scalar summary values, FDA retains the important information by using the functional observations directly in the analysis.

To give a brief background on FDA, the roots of this topic go back to the 1940s and 1950s with the work of Ulf Grenander and Kari Karhunen. They considered the decomposition of a square-integrable continuous-time stochastic process into eigencomponents, known today as the Karhunen-Loève decomposition. Rigorous analysis of functional principal component analysis was done in the 1970s by Kleffe, Dauxois, and Pousse, including results about the asymptotic distribution of the eigenvalues. More recently, the field has focused more on applications and on understanding the effects of dense and sparse observation schemes.

In essence, FDA goes beyond the limitations of traditional analysis by working with the functional structure of the data, giving researchers tools for deeper insights and solutions to complex problems.

Why Use FDA?

FDA provides a powerful set of methods to model, analyze, and interpret data that exhibits continuous variation, allowing researchers and professionals to reach more accurate conclusions and make informed decisions based on the inherent functional nature of the data. This versatility is what makes FDA a valuable approach in a wide range of scientific and practical domains. Drawing on fields like linear algebra, functional analysis, probability, and statistics, FDA represents data as observations of random variables in a function space. This makes operations like differentiation, integration, and smoothing available, which helps in exploring the structure and variation of the data. By treating data as functions, FDA can uncover hidden patterns, relationships, and trends that would be challenging to discern using traditional statistical methods.

Functional Data Representation

It’s possible to convert data from a discrete set $x_1, x_2, \ldots, x_n$ to a functional form. This process is known as basis smoothing. In FDA, a basis is a set of functions that are used to represent a continuous curve or function. Basis smoothing expresses a statistical unit $x_i$ as a linear combination of the basis functions $\phi_s$ with coefficients $c_{is}$, as shown in the equation below.

$$x_i(t) \approx \sum_{s=1}^{S} c_{is} \phi_s(t)$$
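To make the expansion a bit more concrete, here is a minimal sketch of basis smoothing with ordinary least squares. Everything in it is an assumption on my part for illustration: the Fourier basis, the helper names `fourier_basis` and `fit_coefficients`, the simulated noisy curve, and the choice of S = 5. It is not a reference implementation, just one way the coefficients could be estimated.

```python
import numpy as np

def fourier_basis(t, n_basis, period=1.0):
    """Evaluate a Fourier basis at points t; returns shape (len(t), n_basis).

    Column order (an arbitrary convention chosen here):
    1, sin(w t), cos(w t), sin(2 w t), cos(2 w t), ...
    """
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    k = 1
    while len(cols) < n_basis:
        w = 2.0 * np.pi * k / period
        cols.append(np.sin(w * t))
        if len(cols) < n_basis:
            cols.append(np.cos(w * t))
        k += 1
    return np.column_stack(cols)

def fit_coefficients(t, x, n_basis):
    """Least-squares estimate of the coefficients c_i1..c_iS for one curve."""
    Phi = fourier_basis(t, n_basis)
    coef, *_ = np.linalg.lstsq(Phi, x, rcond=None)
    return coef

# Simulated discrete observations of one curve (hypothetical data).
rng = np.random.default_rng(0)
t_obs = np.linspace(0.0, 1.0, 50, endpoint=False)
signal = 2.0 + np.sin(2 * np.pi * t_obs) + 0.5 * np.cos(4 * np.pi * t_obs)
x_obs = signal + rng.normal(scale=0.1, size=t_obs.size)

S = 5
c = fit_coefficients(t_obs, x_obs, S)

# The fitted function can now be evaluated anywhere, not just at t_obs.
t_new = np.linspace(0.0, 1.0, 200)
x_hat = fourier_basis(t_new, S) @ c
```

The 50 noisy samples end up summarized by just five numbers in `c`, and `x_hat` is the smoothed curve evaluated on a finer grid than the one it was observed on.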

Basis Types

Different types of bases can be used depending on the nature of the data and the specific goals of the analysis. Some common types include the Fourier basis, polynomial basis, spline basis, and wavelet basis, each with its own properties. For example, a Fourier basis suits periodic data, while splines are a flexible choice for non-periodic curves. I’m going to briefly go over the B-spline basis as that’s what I have the most experience with.

B-Spline Basis

A B-spline basis is a basis constructed from B-spline functions, which are piecewise polynomial functions commonly used in computer graphics and numerical analysis. The functions are arranged along the domain so that, given enough knots, their linear combinations can approximate any continuous curve or function on that domain.

Components of a B-spline Basis:

  • Degree (p): Defines the degree of the polynomial segments.

  • Knots (t): A non-decreasing sequence of values that divide the domain.

  • Control Points ($P_i$): Points that control the shape; the curve is defined as $C(u) = \sum_{i} N_{i,p}(u) P_i$, where $N_{i,p}$ is the $i$-th B-spline basis function of degree $p$.

A couple of key characteristics of a B-spline basis are:

  • Local Support: Each basis function is non-zero over only a small segment of the curve domain, so moving a single control point changes the curve only locally rather than everywhere.

  • Knot Dependence: The basis is determined by the knot sequence, which, together with the degree p, sets the continuity at the joins and the shape of the resulting curve.
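The components and properties above can be sketched with the Cox-de Boor recursion, which is the standard way to evaluate B-spline basis functions. The knot vector, the scalar control points, and the function names `bspline_basis` and `curve` are all made up for illustration, and the recursive version is written for readability rather than speed.

```python
def bspline_basis(i, p, knots, u):
    """Value of the i-th B-spline basis function of degree p at u (Cox-de Boor)."""
    if p == 0:
        # Degree 0: indicator of the half-open knot span [knots[i], knots[i+1]).
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    denom_left = knots[i + p] - knots[i]
    if denom_left > 0:  # skip terms with repeated knots (0/0 treated as 0)
        left = (u - knots[i]) / denom_left * bspline_basis(i, p - 1, knots, u)
    denom_right = knots[i + p + 1] - knots[i + 1]
    if denom_right > 0:
        right = (knots[i + p + 1] - u) / denom_right * bspline_basis(i + 1, p - 1, knots, u)
    return left + right

p = 3                                        # cubic segments
knots = [0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4]    # non-decreasing, clamped at both ends
n_basis = len(knots) - p - 1                 # 7 basis functions

# Local support: at u = 1.5 only p + 1 = 4 basis functions are non-zero,
# and together they sum to 1.
u = 1.5
values = [bspline_basis(i, p, knots, u) for i in range(n_basis)]
n_nonzero = sum(v > 0 for v in values)
total = sum(values)

def curve(u, control_points):
    """C(u) = sum_i N_{i,p}(u) * P_i, with scalar control points for simplicity."""
    return sum(bspline_basis(i, p, knots, u) * P for i, P in enumerate(control_points))

P = [0.0, 1.0, 3.0, 2.0, 4.0, 1.0, 0.0]
P_moved = list(P)
P_moved[6] = 10.0                            # move only the last control point

unchanged_far_away = curve(1.5, P) == curve(1.5, P_moved)   # outside its support
changed_nearby = curve(3.5, P) != curve(3.5, P_moved)       # inside its support
```

Because the degree-0 pieces use half-open intervals, this sketch is only valid for u in [0, 4) with this knot vector. A production library would handle the right endpoint separately.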

Curse of Dimensionality

Another advantage of using FDA is how it approaches the Curse of Dimensionality. This refers to the difficulties that arise when dealing with high-dimensional data. As the number of dimensions (or features) in a dataset increases, the amount of data required to accurately understand and learn the relationships between the features and the target variable grows exponentially. This leads to issues when training ML models on high-dimensional datasets.

A second reason why the curse of dimensionality is a problem in machine learning is that it can lead to overfitting. When working with high-dimensional data, it’s easy to include irrelevant or redundant features that don’t contribute to the predictive power of the model. This leads to poor generalization on unseen data. A basis representation helps here: a curve sampled at hundreds of points is summarized by a much smaller set of basis coefficients.
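The exponential growth mentioned in this section is easy to see with a bit of arithmetic. As a rough illustration (the function name and the choice of 10 points per axis are my own assumptions), suppose we want to cover each feature's range with a fixed number of grid points:

```python
def grid_points_needed(points_per_axis, n_dims):
    """Number of grid cells needed to cover n_dims features at a fixed resolution."""
    return points_per_axis ** n_dims

# Keeping the same resolution of 10 points per feature:
for d in (1, 2, 3, 10):
    print(d, grid_points_needed(10, d))
```

Going from 3 features to 10 takes the count from a thousand to ten billion, which is why treating every sampled time point of a curve as its own feature gets out of hand quickly.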

FDA vs Multivariate Statistics

Is it even worth our time to use continuous representations, or are we adding extra complexity? Typically, discrete data is needed for computational tasks. What benefits do we even get from using a continuous representation in our analysis? While discrete data might be convenient from a computational standpoint, there are a few advantages to working with continuous representations that make it worthwhile. By looking at objects as functions, curves, or surfaces, we can use more powerful analysis techniques that yield better practical results and solutions. Let us consider how continuous representations improve our understanding and analysis of data.

If data is sampled from an underlying function and the time points are synchronized across observations, so that only the heights (the measured values) differ, then analysis can be done using a vector $x = (x_1, x_2, \ldots, x_n)^T$. If the time points also hold significance, then it’s necessary to keep them along with the height data: $((t_1, x_1), (t_2, x_2), \ldots, (t_n, x_n))^T$. With continuous functions, we can interpolate and resample at arbitrary points and compare observations with different time points as elements of a function space.
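Here is a small sketch of that last idea: two curves observed at different, unsynchronized time points are resampled onto a shared grid and then compared with an approximate L2 distance. Linear interpolation via `np.interp` stands in for a proper basis fit, and the function name `l2_distance` and the example curves are assumptions for illustration only.

```python
import numpy as np

def l2_distance(t_a, x_a, t_b, x_b, n_grid=200):
    """Approximate L2 distance between two curves observed at different time points."""
    # Compare only on the interval where both curves were observed.
    lo = max(t_a[0], t_b[0])
    hi = min(t_a[-1], t_b[-1])
    grid = np.linspace(lo, hi, n_grid)
    # Resample both curves onto the common grid (linear interpolation).
    f_a = np.interp(grid, t_a, x_a)
    f_b = np.interp(grid, t_b, x_b)
    # Trapezoidal rule for the integral of the squared difference.
    sq = (f_a - f_b) ** 2
    integral = np.sum((sq[1:] + sq[:-1]) / 2.0 * np.diff(grid))
    return float(np.sqrt(integral))

# Curve 1: 30 evenly spaced time points.
t1 = np.linspace(0.0, 1.0, 30)
x1 = np.sin(2 * np.pi * t1)

# Curve 2: 45 unevenly spaced time points, same shape shifted up by 1.
t2 = np.linspace(0.0, 1.0, 45) ** 2
x2 = np.sin(2 * np.pi * t2) + 1.0

d = l2_distance(t1, x1, t2, x2)
```

As plain vectors, `x1` and `x2` can’t even be subtracted since they have different lengths. As functions, the comparison is straightforward, and `d` comes out close to 1, the vertical shift between the two curves.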

In more traditional data analysis, we might work with data points in a table where each row represents an observation and each column represents a variable. In FDA, each observation is instead treated as a function that maps a continuous variable (like time or wavelength) to a measured value. These functions represent how the data changes over that domain. Understanding these functions requires a strong grasp of the structures lying beneath them. The FDA approach lets researchers study that underlying structure directly, deepening the understanding of functional data across diverse fields.

This is it for now

I’m not gonna get into the mathematical foundations of FDA, but this brief overview should paint a decent picture. Not sure what the next post will be on, but I hope you stay looking out for it.