Data Services: Research Reproducibility

Reproducibility and Replication

To determine whether the results of a research project are valid, researchers attempt to replicate the project. Beginning from scratch, an independent researcher attempts to recreate the research project, including data collection, and arrive at similar results.

Because data collection is time-consuming and introduces additional randomness, researchers may instead attempt to reproduce only the data analysis process of the original research.

Replication includes the data collection phase of research; reproduction focuses on repeating the analysis phases.

Beginning with the raw data collected in the original research, reproduction researchers go through all the steps of data cleaning, analysis and characterization of the results to see if they reach the same conclusion as the original researchers.  This process can fail at many stages. 

If the original researcher hasn’t released the raw data, others can’t begin to reproduce the analysis. Even though data sharing is required for most federally funded research, raw data is often not made available. If data is made available, it must also include metadata that accurately and completely describes the data. A column for “temperature” is useless if it doesn’t include the units (Centigrade/Celsius, Fahrenheit, Rankine, etc.).
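As a rough illustration of metadata that travels with the data, the sketch below keeps a small data dictionary and checks that every numeric column states its units. The column names, descriptions, and the `numeric_columns_missing_units` helper are hypothetical and are not taken from any particular project or standard.

```python
# Hypothetical data dictionary: one entry per column in the shared dataset.
DATA_DICTIONARY = {
    "site_id": {
        "description": "Sampling site identifier",
        "type": "string",
        "units": None,  # identifiers have no units
    },
    "temperature": {
        "description": "Air temperature at time of sampling",
        "type": "float",
        "units": "degrees Celsius",
    },
    "depth": {
        "description": "Sample depth below surface",
        "type": "float",
        "units": "",  # units were never recorded
    },
}


def numeric_columns_missing_units(dictionary):
    """Return the names of numeric columns whose units are empty or absent."""
    return [
        name
        for name, meta in dictionary.items()
        if meta["type"] in ("float", "int") and not meta.get("units")
    ]


missing = numeric_columns_missing_units(DATA_DICTIONARY)
print(missing)  # ['depth']
```

A check like this can be run before a dataset is deposited, so that gaps in the metadata are caught while the original researcher still remembers the answer.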

Data usually need to be cleaned in some fashion. If the variable being collected tends to range from 14.1 to 27.8 and a data point of 1.85 has been recorded, there is a strong possibility that the decimal point was misrecorded and the actual value was 18.5. Researchers can modify the value to what they assume was the actual value, drop the value from analysis, or include the possibly incorrect value in the analysis. Not all possibly incorrect values are obvious. For example, 15.8 and 18.5 could result from the transposition of two digits, or they could both be accurate recordings of values. The researcher will need to decide for each recorded measurement whether it should be included in the analysis. The original researcher’s decision-making process for including or excluding data should be well described in the methods section of the published results so that other researchers can reproduce this step of the data analysis.
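One way to make such cleaning decisions reproducible is to express the rule in code and keep a record of every value it affects. The sketch below uses the 14.1 to 27.8 range from the example above; the readings themselves and the `flag_out_of_range` helper are made up for illustration.

```python
EXPECTED_RANGE = (14.1, 27.8)  # plausible range from the example in the text


def flag_out_of_range(values, low, high):
    """Split values into kept and flagged lists without altering any value."""
    kept, flagged = [], []
    for index, value in enumerate(values):
        if low <= value <= high:
            kept.append(value)
        else:
            # Record where the value was, what it was, and which rule caught it.
            flagged.append(
                {"index": index, "value": value, "rule": f"outside [{low}, {high}]"}
            )
    return kept, flagged


readings = [15.8, 18.5, 1.85, 22.4, 27.1]  # hypothetical measurements
kept, flagged = flag_out_of_range(readings, *EXPECTED_RANGE)

print(kept)     # [15.8, 18.5, 22.4, 27.1]
print(flagged)  # [{'index': 2, 'value': 1.85, 'rule': 'outside [14.1, 27.8]'}]
```

Note that the rule only catches the obvious case. It cannot tell whether 15.8 and 18.5 are transposed, so that judgment still has to be described in the methods.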

The next step of data analysis includes a sequence and variety of mathematical operations including statistical analysis.  Any summarizing, typifying or gathering of the data should be well characterized.  The types of tests and parameters used should be recorded in the published methods. 
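A minimal sketch of recording an analysis step together with its parameters is shown below. The `summarize` function, the `trim` parameter, and the sample values are assumptions made for this example; the point is only that the parameters and software version are saved next to the result.

```python
import json
import platform
import statistics


def summarize(values, trim=0):
    """Mean and sample standard deviation after dropping `trim` values from each end."""
    ordered = sorted(values)
    used = ordered[trim:len(ordered) - trim] if trim else ordered
    return {
        "n": len(used),
        "mean": statistics.mean(used),
        "stdev": statistics.stdev(used),
    }


params = {"trim": 1}  # hypothetical analysis parameter
values = [15.8, 18.5, 22.4, 27.1, 16.9, 21.3]  # hypothetical cleaned data

# Keep the parameters and the software version alongside the result.
record = {
    "step": "summarize",
    "parameters": params,
    "python_version": platform.python_version(),
    "result": summarize(values, **params),
}
print(json.dumps(record, indent=2))
```

Saving a record like this for each step gives the published methods section something concrete to draw on.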

Even if the precise methods and tools used are recorded, some processes are non-deterministic or introduce random variability. For example, when using a clustering algorithm to slice and gather a set of data into clusters with similar characteristics, the result can depend on a random seed value, which determines the initial center point of each cluster.
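The effect of a seed can be seen in a toy example. The sketch below is a deliberately small one-dimensional k-means written for this guide, not the implementation of any particular library; the data values and the `kmeans_1d` function are hypothetical. Rerunning with the same seed gives the same centers, while different seeds may settle on different clusters.

```python
import random


def kmeans_1d(values, k, seed, iterations=20):
    """Tiny 1-D k-means whose initial centers are drawn using `seed`."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # the seed decides the starting centers
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Recompute each center as the mean of its members; keep empty ones as is.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


# Three clumps of values that must be forced into two clusters.
data = [1.0, 1.2, 1.1, 5.0, 5.2, 5.1, 9.0, 9.3, 9.1, 9.2]

run_a = kmeans_1d(data, k=2, seed=42)
run_b = kmeans_1d(data, k=2, seed=42)
print(run_a == run_b)  # True: same seed, same centers

# Different seeds may or may not agree with each other.
for seed in range(5):
    print(seed, kmeans_1d(data, k=2, seed=seed))
```

Reporting the seed along with the algorithm and its parameters lets another researcher repeat the exact run.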

Final steps in the process of analysis include characterization and conclusions.  This is when the analytical results with patterns and correlations are used to synthesize knowledge.  The researcher determines general patterns that can be seen in the data.  From those patterns, supported by the data and analysis, the researcher will draw conclusions that directly answer a basic research question.

The design of the research question, the methods of data collection, and the analysis should be decided before the research begins. It is tempting to collect and analyze data, then develop a question that is answered by that set of data. One problem, though, is that every set of data (even totally random points) will appear to have patterns. Pre-registering a research project, along with peer review of the project design, mitigates the problem of fitting the research question to a dataset.
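The claim that random data appear to contain patterns can be checked with a short experiment. The sketch below generates 30 columns of purely random numbers, 10 values each, and reports the strongest correlation found between any pair of columns. The sizes, the seed, and the `pearson` helper are choices made for this illustration.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


rng = random.Random(0)  # fixed seed so the experiment can be repeated
columns = [[rng.random() for _ in range(10)] for _ in range(30)]

# Search all 435 pairs of columns for the strongest apparent relationship.
strongest = max(abs(pearson(a, b)) for a, b in combinations(columns, 2))
print(round(strongest, 2))
```

With this many pairs and so few values per column, the strongest correlation is usually large even though no real relationship exists. A researcher who picks the question after seeing the data is in effect running this search.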

Journals with pre-registration: https://docs.google.com/spreadsheets/d/1D4_k-8C_UENTRtbPzXfhjEyu3BfLxdOsn9j-otrO870/edit#gid=0  From Center for Open Science, cos.io/rr

 

For a research project’s results to be reproducible, the original data must be made available, accurate and thorough descriptions of the data need to be recorded as metadata, and every step and parameter of the analysis must be included in the methods reporting. 

Unfortunately, many projects don’t meet these requirements.  The Center for Digital Scholarship can help you create truly reproducible (and verifiable) research projects.

Clustering - sensitivity to initial conditions

One way that a clustering process can work is to begin with cluster center points and then bring into each cluster anything that is more similar to that cluster than another.  As the cluster gains members, the center points are recalculated and the process is repeated until all objects have been clustered.  The resulting clusters can be highly dependent on the initial choice of center points and the sequence of objects added to clusters.

As an example, consider a set of objects (an American football, a baseball, an almond, and a marshmallow confection) that are to be placed into two clusters.

If the two initial clusters are centered on the American football and the baseball, almonds would be added to the football cluster because both are brown and oblong.  The marshmallow confection would be added to the baseball cluster because both are round and white.  Usage (sports or food) is split with neither cluster representing use.

If the two initial clusters are instead centered on the football and the marshmallow, the baseball would join the football cluster, strengthening its sports characteristic, while the almonds would join the marshmallow cluster, strengthening its food characteristic. Color and shape are not strong characteristics of these clusters.

Where would a kiwi fruit or canoe paddle be placed in each of the sets of clusters?

 

 

| Object | Color | Shape | Use |
| --- | --- | --- | --- |
| American Football | Brown | Oblong | Sports |
| Baseball | White | Round | Sports |
| Kiwi fruit | Brown | Oblong | Food |
| Banana | Yellow to brown | Oblong | Food |
| Almond | Brown | Oblong | Food |
| Canoe paddle | Varies | Oblong | Sports |
| Marshmallow confection | White | Round | Food |

 

First clustering attempt

| Cluster-1 | Cluster-2 |
| --- | --- |
| American Football | Baseball |
| Almonds | Marshmallow confection |
| (brown, oblong) | (white, round) |

 

Second clustering attempt

| Cluster-1b | Cluster-2b |
| --- | --- |
| American Football | Marshmallow confection |
| Baseball | Almonds |
| (sports) | (food) |
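The two clustering attempts can be sketched in a few lines of code. This is a simplified illustration, not a full clustering algorithm: each object is assigned once to the most similar starting object, and the centers are not recalculated. The feature values follow the prose description above. The weights are an assumption that does not come from the original example: use is given more weight than color or shape so that the second attempt groups by use as described, and other weightings give other groupings.

```python
# Features for each object, following the prose description above.
OBJECTS = {
    "football":    {"color": "brown", "shape": "oblong", "use": "sports"},
    "baseball":    {"color": "white", "shape": "round",  "use": "sports"},
    "almond":      {"color": "brown", "shape": "oblong", "use": "food"},
    "marshmallow": {"color": "white", "shape": "round",  "use": "food"},
}

# Assumed weights (not from the original): use counts more than color or shape.
WEIGHTS = {"color": 1, "shape": 1, "use": 3}


def similarity(a, b):
    """Weighted count of the features on which two objects agree."""
    return sum(w for f, w in WEIGHTS.items() if OBJECTS[a][f] == OBJECTS[b][f])


def cluster(seeds):
    """Assign every non-seed object to the most similar seed in a single pass."""
    clusters = {seed: [seed] for seed in seeds}
    for name in OBJECTS:
        if name in seeds:
            continue
        best = max(seeds, key=lambda seed: similarity(name, seed))
        clusters[best].append(name)
    return clusters


first = cluster(["football", "baseball"])
second = cluster(["football", "marshmallow"])

print(first)   # {'football': ['football', 'almond'], 'baseball': ['baseball', 'marshmallow']}
print(second)  # {'football': ['football', 'baseball'], 'marshmallow': ['marshmallow', 'almond']}
```

The same four objects and the same similarity rule produce different clusters depending only on which objects are chosen as starting points. Changing the weights changes the result again, which is one more parameter that would need to be reported for the clustering to be reproducible.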