Merge Program

From Intl Surface Temp Initiative

Revision as of 12:59, 14 October 2011

This page describes the program that will be used to turn Stage 2 data into a consolidated master database (Stage 3). It is currently a work in progress, and feedback is greatly appreciated.

Source Hierarchy

Before the program is run, a hierarchy needs to be established to give preference to certain sources and elements. Some examples of criteria that may give a source a higher preference are:

  • Use of TMAX / TMIN instead of TAVG
  • More stations
  • Longer period of record
  • Data closer to original source (as raw as possible)

The hierarchy can be defined differently by different organizations (e.g. the creation of GHCN-M might use a different preference of sources than another dataset). Nonetheless, a hierarchy needs to be established for version 1.0.0 of the Databank. Another idea is to create an ensemble of results by randomly selecting the source order 100 times and running the merge process each time.
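The two ways of ordering sources described above can be sketched as follows. This is only an illustration: the dictionary keys (`has_tmax_tmin`, `n_stations`, `record_years`, `is_raw`), the function names, and the relative priority of the criteria are assumptions, not part of the program itself.

```python
import random

def rank_sources(sources):
    """Order sources from most to least preferred.

    Each source is a dict with hypothetical keys. The priority of the
    criteria (the order of the tuple below) is an assumption.
    """
    return sorted(
        sources,
        key=lambda s: (
            s["has_tmax_tmin"],  # TMAX / TMIN preferred over TAVG only
            s["n_stations"],     # more stations
            s["record_years"],   # longer period of record
            s["is_raw"],         # data closer to the original source
        ),
        reverse=True,
    )

def random_orders(sources, n_orders=100, seed=0):
    """Return n_orders random permutations of the source list."""
    rng = random.Random(seed)  # fixed seed so the ensemble is reproducible
    orders = []
    for _ in range(n_orders):
        order = list(sources)
        rng.shuffle(order)
        orders.append(order)
    return orders
```

Each ordering returned by `random_orders` would then be fed to the merge process once, giving 100 candidate master datasets.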

Multi-Elemental Merge

Once a hierarchy is established, the merge process proceeds source by source (i.e. Source 1 vs Source 2, then merged Source 1/2 vs Source 3, etc.), while maintaining all three elements (TMAX, TMIN, TAVG). Multiple metrics can be calculated to determine whether two stations are the same:

  • METADATA METRICS
    • Geographical distance between 2 stations
    • Height (elevation) difference between 2 stations
    • Name of station (using comparison metric such as Jaccard Index)
  • DATA METRICS
    • Number of common months (i.e. months with non-missing data for both stations)
    • Ratio of common months
      • Number of common months in which the data of the two stations are within +/- 1.1 of each other, divided by the total number of common months
    • Compare the mean and standard deviations of the 2 stations
      • Possibly using the F-Test / T-Test
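A minimal sketch of how these metrics could be computed is given below. Several details are assumptions made for illustration: the haversine formula for geographical distance (in km), character bigrams as the sets compared by the Jaccard index, and monthly series stored as dictionaries keyed by `(year, month)` with `None` for missing values. The F-Test / T-Test is not implemented here; only the raw differences in mean and standard deviation are returned.

```python
import math
import statistics

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def _bigrams(name):
    s = name.strip().upper()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard_name(name_a, name_b):
    """Jaccard index of the character bigrams of two station names."""
    sa, sb = _bigrams(name_a), _bigrams(name_b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def common_months(a, b):
    """Months with non-missing data in both series."""
    return [m for m in a if a[m] is not None and b.get(m) is not None]

def common_month_ratio(a, b, tol=1.1):
    """Fraction of common months whose values agree within +/- tol.

    Returns None (ratio missing) when there are no common months.
    """
    months = common_months(a, b)
    if not months:
        return None
    close = sum(1 for m in months if abs(a[m] - b[m]) <= tol)
    return close / len(months)

def mean_std_difference(a, b):
    """Differences in mean and sample stdev of the non-missing values."""
    va = [v for v in a.values() if v is not None]
    vb = [v for v in b.values() if v is not None]
    if len(va) < 2 or len(vb) < 2:
        return None  # stdev needs at least two values per station
    return (statistics.mean(va) - statistics.mean(vb),
            statistics.stdev(va) - statistics.stdev(vb))
```

For example, two series with two common months, only one of which agrees within 1.1, give a ratio of 0.5.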

Here are some example Boolean tests that can be used to declare a station match:

  • Geographical distance = 0, AND name is exactly the same
  • Distance = 0, AND name is not exactly the same, AND ratio of common months is greater than 0.4
  • Distance > 0, AND name is exactly the same, AND ratio of common months is greater than 0.4
  • Distance >= 0, AND ratio of common months is >= 0.99
  • 0 < Distance < 20, AND one name is contained within the other
  • 0 < Distance < 20, AND ratio of common months is missing, AND name is exactly the same, AND difference in mean is within +/- 0.5, AND difference in stdev is within +/- 0.5
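The example tests above can be written as a single function that returns true when any one of them holds. This is a sketch only: the function name and arguments are hypothetical, names are compared case-insensitively (an assumption), and a missing ratio is represented by `None`.

```python
def is_station_match(distance, name_a, name_b, ratio=None,
                     mean_diff=None, std_diff=None):
    """Return True if any of the six example tests is satisfied."""
    a, b = name_a.strip().upper(), name_b.strip().upper()
    same_name = a == b
    # Both names must be non-empty, since "" is a substring of anything.
    partial_name = bool(a) and bool(b) and (a in b or b in a)
    near = 0 < distance < 20
    has_ratio = ratio is not None
    stats_close = (mean_diff is not None and std_diff is not None
                   and abs(mean_diff) <= 0.5 and abs(std_diff) <= 0.5)
    rules = [
        distance == 0 and same_name,
        distance == 0 and not same_name and has_ratio and ratio > 0.4,
        distance > 0 and same_name and has_ratio and ratio > 0.4,
        distance >= 0 and has_ratio and ratio >= 0.99,
        near and partial_name,
        near and not has_ratio and same_name and stats_close,
    ]
    return any(rules)
```

Keeping the tests in a list makes it straightforward to add, remove, or re-tune individual thresholds while the criteria are still being discussed.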

If a station match is found, the program then checks whether there are any non-common months. If so, a merge is performed. For common months, the non-missing data from the source that is higher in the hierarchy is given preference. If a station does not match any station in the master dataset, it is considered unique and is added to the master dataset.
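The merge step described above might look like the sketch below. It assumes each station is a dictionary with a `data` entry mapping `(year, month)` to a value (`None` if missing), handles a single element for brevity (the same logic would be applied to TMAX, TMIN and TAVG), and takes the matching test as a hypothetical `is_match` callable. The master dataset is treated as the higher-preference side because it was built from sources earlier in the hierarchy.

```python
def merge_series(preferred, other):
    """Merge two monthly series, keeping the preferred value where present."""
    merged = dict(preferred)
    for month, value in other.items():
        # Fill only months that are absent or missing in the preferred series.
        if merged.get(month) is None and value is not None:
            merged[month] = value
    return merged

def merge_source(master, source, is_match):
    """Merge one source (a list of stations) into the master list in place."""
    for station in source:
        for existing in master:
            if is_match(existing, station):
                existing["data"] = merge_series(existing["data"], station["data"])
                break
        else:
            # No match found: the station is unique and joins the master.
            master.append(station)
    return master
```

Running `merge_source` once per source, in hierarchy order, reproduces the sequence Source 1 vs Source 2, then merged Source 1/2 vs Source 3, and so on. Note that in this sketch a station appended from the current source can itself be matched by a later station from the same source, which may or may not be desired.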
