Collecting and Analyzing Data in Statistics

These are my notes on collecting and analyzing data in statistics.

Part of becoming a problem solver and user of statistics is developing an ability to appraise the quality of measurements. When you encounter data, consider whether the concept under study is adequately reflected by the proposed measurements, is the data measured accurately, and is there a sufficient quantity of the data to draw a reasonable conclusion.


Measurement and data are an integral part of science. Methods have been developed to solve research problems. Gather information about the phenomenon being studied. On the basis of the data, formulate a preliminary generalization or hypothesis. Collect further data to test the hypothesis. If the data and other subsequent experiments support the hypothesis, it becomes a law.


There are two ways to obtain data, observation and controlled experiments. In a statistical analysis, it is usually not possible to recover from poorly measured concepts or badly collected measurements. 


A response variable measures the outcome of interest in a study. An explanatory variable causes or explains changes in a response variable. Isolating the effects of one variable on another means anticipating potentially confounding variables and designing a controlled experiment to produce data in which the values of the confounding variable are regulated.


Observational data comes about from measuring things. They can be extremely valuable. 


Much of the statistical information presented to us is in the form of surveys. So, it is important to understand them and how they are done. In some cases, the purpose of a survey is purely descriptive. However, in many cases the researcher is interested in discovering a relationship.


Data in which the observations are restricted to a set of values that possess gaps is called discrete. Data that can take on any value within some interval is called continuous. The quality of data is referred to as its level of measurement. When analyzing data, you must be exceedingly conscious of the data’s level of measurement because many statistical analyses can only be applied to data that possess a certain level of measurement. 


Data that represents whether a variable possesses some characteristic is called nominal. Ordinal data represents categories that have some associated order. Note that ordinal data is also nominal, but it also possesses the additional property of ordinality. 


If the data can be ordered and the arithmetic difference is meaningful, the data is interval. An example of interval data is temperature. Interval data is numerical data that possesses both the property of ordinality and the interval property. Ratio data is similar to interval data, except that it has a meaningful zero point and the ratio of two data points is meaningful. 


Qualitative data is data measured on a nominal or ordinal scale. Quantitative data is measured on an interval or ratio scale. 


Time series data originates as measurements usually taken from some process over equally spaced intervals of time. Time series data originate from processes. Processes can be divided into two categories: stationary and nonstationary. All time series that are interesting vary, and the nature of the variability determines how the process is characterized. In a stationary process the time series varies around some central value and has approximately the same variation over the series. In a nonstationary process, the time series possess a trend, the tendency for the series to either increase or decrease over time. 


Cross-sectional data are measurements created at approximately the same period of time.