Merge Program

From Intl Surface Temp Initiative

Revision as of 17:07, 6 January 2012

The following is a description of the proposed process that will be used to turn Stage 2 data into a consolidated master database (Stage 3). Currently this is a work in progress, and feedback is greatly appreciated. The plan is to release the underlying code at the time of the databank release so that it is fully open and transparent.

Source Hierarchy

Before the program is run, a hierarchy needs to be established to give preference to certain sources / elements. Some examples of properties that may give a source a higher preference are:

  • Use of TMAX / TMIN instead of TAVG, recognizing that TMAX and TMIN biases in the record tend to be distinct.
  • More stations
  • Longer period of record
  • Data closer to original source (as raw as possible)

The hierarchy can be defined differently by different organizations (e.g. the creation of GHCN-M might use a different preference of sources than another dataset). Nonetheless, a hierarchy needs to be established for version 1.0.0 of the Databank. Another idea is to create an ensemble of results by randomly selecting the order 100 times and running the merge process each time.
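
The ensemble idea can be sketched as below. This is only a minimal illustration: the function name ensemble_orderings, the fixed seed, and the source names other than ghcnd-raw are assumptions made for the example, and the merge program itself is not shown.

```python
import random

def ensemble_orderings(sources, n_members=100, seed=0):
    """Return n_members random permutations of the source list."""
    rng = random.Random(seed)  # fixed seed so the ensemble can be reproduced
    orderings = []
    for _ in range(n_members):
        order = list(sources)  # copy so the input list is left untouched
        rng.shuffle(order)
        orderings.append(order)
    return orderings

# Illustrative source names only; "source-b" and "source-c" are placeholders.
sources = ["ghcnd-raw", "source-b", "source-c"]
orderings = ensemble_orderings(sources)
# Each ordering would then be handed to the merge program in turn.
```

Each ensemble member contains every source exactly once; only the order differs.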

Straw Man Proposal of Hierarchy

The following information was written by Peter Thorne and gives an overview of the current hierarchy being considered for the merge. It is not final, and discussion is encouraged.

  • It is proposed that the priority follow the nine over-arching classes given below. The information necessary to assign each source deck to a given classification (1-9) should be readily available from the stage 2 metadata flags:
    • 1. Daily databank stage 3 (GHCN-D raw) (alternative: GHCN-D QC’ed) – this provides a backbone of max/min values that is analyzed regularly and curated carefully on an ongoing basis with regular updates and ‘stable’ resource support.
    • 2. Data sources which contain max / min data, have had no QC / homogenization applied and have known provenance
    • 3. Data sources which contain max / min data, have had no QC / homogenization applied with poorly known provenance
    • 4. Data sources that have no QC / homogenization applied but only available as Tavg and have known provenance
    • 5. Data sources that have no QC / homogenization applied but only available as Tavg and with poorly known provenance
    • 6. Data sources with QC applied and max/min data
    • 7. Data sources with QC applied and Tavg data only
    • 8. Data sources with homogenization that have max/min data
    • 9. Data sources with homogenization that have Tavg data only
  • Within classes 2-9 the following set of criteria would be used to differentiate between the sources, in the priority order with which they should be merged:
    • I. Whether the monthly data was calculated from dailies held in the databank (give this priority as it means an investigator can dig back to data within a given month)
    • II. Whether the data arises from World Weather Records / national holdings
    • III. Average length of station record in the data deck
    • IV. Oldest station record start date / average station record start date with priority given to those with earlier start dates
    • V. Number of stations in the data deck
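
One possible way to turn the straw man proposal into a merge order is a sort key. The sketch below assumes the criteria I-V are applied lexicographically (each one only breaks ties left by the previous one), that longer average records and more stations are preferred, and that criterion IV uses the oldest start date. The SourceDeck fields and all the values are invented for illustration and do not describe real stage 2 metadata.

```python
from dataclasses import dataclass

@dataclass
class SourceDeck:
    name: str
    priority_class: int          # over-arching class 1-9, lower merges first
    from_databank_dailies: bool  # criterion I
    from_wwr_or_national: bool   # criterion II
    mean_record_years: float     # criterion III
    earliest_start_year: int     # criterion IV
    n_stations: int              # criterion V

def merge_order_key(deck):
    """Sort key: smaller tuples are merged earlier."""
    return (
        deck.priority_class,
        not deck.from_databank_dailies,  # False sorts before True
        not deck.from_wwr_or_national,
        -deck.mean_record_years,         # longer records first (assumed)
        deck.earliest_start_year,        # earlier start dates first
        -deck.n_stations,                # more stations first (assumed)
    )

# Made-up decks, purely to exercise the key.
decks = [
    SourceDeck("deck-a", 2, False, True, 40.0, 1900, 500),
    SourceDeck("deck-b", 2, True, False, 30.0, 1950, 200),
    SourceDeck("ghcnd-raw", 1, True, False, 50.0, 1850, 1000),
]
ordered = sorted(decks, key=merge_order_key)
```

With these values the class 1 deck comes first, and deck-b precedes deck-a because criterion I is checked before criterion II.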

Description of Merge (20120106)

1) The merge program loops through 29 different "sources". To save time/resources, we are only focusing on TMAX at this time.

2) The ordering of the source list matters (see source hierarchy above): each source encountered in the source list has priority over a previously listed source.

3) The program begins by reading in the first source, ghcnd-raw. After that, the program iterates from candidate source #2 through source #29, merging stations where appropriate and adding unique stations where appropriate.

4) For any particular merge, the candidate stations are processed one by one, and each is compared to every station within the already merged data set. In particular, three metadata metrics are calculated:

  • distance between stations
  • elevation difference between stations
  • Jaccard naming similarity index

These indices are each scaled to a probability, and the threshold on the total probability for a potential merge is currently set at 0.7 (approximately 0.9*0.9*0.9). More specifically, for any one candidate station, these metrics are first calculated against all existing master or merged stations. Then, if any stations are over the 0.7 total probability, the station with the highest probability is chosen for the merge. That station is then removed from any possibility of being merged with any other candidates in the current candidate set.
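
The matching step described above can be sketched as follows. Several choices here are assumptions and not the merge program's actual method: the scaling of distance and elevation to probabilities (exponential decay with hypothetical scale constants), the use of character bigrams for the Jaccard index, and the product of the three probabilities as the total (suggested by the 0.9*0.9*0.9 remark). Station values are invented.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def jaccard(name_a, name_b):
    """Jaccard index on character bigrams (tokenization is an assumption)."""
    def bigrams(s):
        s = s.upper().strip()
        return {s[i:i + 2] for i in range(len(s) - 1)}
    a, b = bigrams(name_a), bigrams(name_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def match_probability(cand, master, dist_scale_km=10.0, elev_scale_m=100.0):
    """Product of three probabilities; scales are hypothetical."""
    d = haversine_km(cand["lat"], cand["lon"], master["lat"], master["lon"])
    dz = abs(cand["elev"] - master["elev"])
    return (math.exp(-d / dist_scale_km)
            * math.exp(-dz / elev_scale_m)
            * jaccard(cand["name"], master["name"]))

def assign(candidates, masters, threshold=0.7):
    """For each candidate, index of its best master above threshold, or None."""
    available = set(range(len(masters)))
    result = []
    for cand in candidates:
        best_idx, best_p = None, threshold
        for i in sorted(available):
            p = match_probability(cand, masters[i])
            if p > best_p:
                best_idx, best_p = i, p
        if best_idx is not None:
            available.discard(best_idx)  # a master is merged at most once per candidate set
        result.append(best_idx)
    return result

# Invented stations for illustration.
masters = [
    {"name": "BOULDER", "lat": 40.0, "lon": -105.25, "elev": 1650.0},
    {"name": "DENVER", "lat": 39.75, "lon": -105.0, "elev": 1600.0},
]
candidates = [dict(masters[0]), dict(masters[0])]
matches = assign(candidates, masters)
```

The second candidate is identical to the first, but it cannot be merged with the same master station because that station has already been claimed within the current candidate set.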

To Do List for the Merge Program (20120106)

1) Incorporate data metrics. We already have some under consideration, such as mean and variance; however, we have to handle both the case where data records overlap and the case where they do not. Our first task is overlapping data, where we can calculate a simple metric such as RMSD and then normalize it to a probability between 0 and 1. Non-overlapping data will be next, where we may have to consider comparisons of seasonal cycles.
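
As a rough sketch of the overlapping-data case, the RMSD can be computed over the months where both records have data and then mapped to the range 0 to 1. The normalization is not yet decided; the exponential form and the scale parameter below are assumptions chosen only so that an RMSD of zero gives a probability of 1.

```python
import math

def overlap_rmsd(series_a, series_b):
    """RMSD over months where both series have data (None marks missing)."""
    sq_diffs = [(a - b) ** 2 for a, b in zip(series_a, series_b)
                if a is not None and b is not None]
    if not sq_diffs:
        return None  # no overlap, so this metric does not apply
    return math.sqrt(sum(sq_diffs) / len(sq_diffs))

def rmsd_to_probability(rmsd, scale=1.0):
    """Map RMSD >= 0 to (0, 1]; the form and scale are hypothetical."""
    return math.exp(-rmsd / scale)

rmsd = overlap_rmsd([1.0, None, 3.0], [1.0, 2.0, 5.0])  # two overlapping months
prob = rmsd_to_probability(rmsd)
```

A return value of None signals the non-overlapping case, which would need a different comparison such as seasonal cycles.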

2) Include all three elements in the merge (TMAX, TMIN, and TAVG).

Description of Multi-Elemental Merge (20111014)

Once a hierarchy is established, we begin the merge process source by source (i.e. Source 1 vs Source 2, then Merged Source 1/2 vs Source 3, etc.), while maintaining all three elements (TMAX, TMIN, TAVG). Effectively, the same piece of pairwise comparison code is run nsources-1 times.
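
The source-by-source structure is a left fold over the ordered source list. In the sketch below, merge_pair is a toy stand-in for the pairwise comparison code (it simply takes the union of station IDs), and the station IDs are placeholders.

```python
from functools import reduce

calls = []

def merge_pair(master, source):
    """Toy pairwise merge: add station IDs not already in the master."""
    calls.append(source)
    return master + [s for s in source if s not in master]

def merge_all(sources):
    """Merge Source 1 with Source 2, then the result with Source 3, etc."""
    return reduce(merge_pair, sources)

# Four toy sources, each a list of placeholder station IDs.
sources = [["a"], ["a", "b"], ["c"], ["b", "d"]]
master = merge_all(sources)
n_calls = len(calls)  # equals len(sources) - 1
```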

Multiple metrics can be calculated to determine whether two stations are the same:

  • METADATA METRICS
    • Geographical distance between 2 stations
    • Height distance between 2 stations
    • Name of station (using comparison metric such as Jaccard Index)
  • DATA METRICS
    • Compare the number of common months (i.e. non-missing data for both stations for a respective month)
    • Ratio of common months
      • Number of times data for common months are within +/- 1.1 of each other over the total number of common months
    • Compare the mean and standard deviations of the 2 stations
      • Possibly using the F-Test / T-Test
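
The ratio of common months can be sketched as below, assuming the two series are already aligned month by month, that None marks a missing month, and that the +/- 1.1 bound is inclusive. These conventions are assumptions for the example.

```python
def common_month_ratio(series_a, series_b, tolerance=1.1):
    """Fraction of common months whose values agree within the tolerance."""
    common = [(a, b) for a, b in zip(series_a, series_b)
              if a is not None and b is not None]
    if not common:
        return None  # ratio is missing when there are no common months
    close = sum(1 for a, b in common if abs(a - b) <= tolerance)
    return close / len(common)

# Two common months (the first two); only the first agrees within 1.1.
ratio = common_month_ratio([10.0, 12.0, None, 15.0], [10.5, 14.0, 9.0, None])
```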

Here are some example Boolean rules that can be used to declare a station match:

  • Geographical distance = 0, AND name is exactly the same
  • Distance = 0, AND name is not exactly the same, AND ratio of common months is greater than 0.4
  • Distance > 0, AND name is exactly the same, AND ratio of common months is greater than 0.4
  • Distance >= 0, AND ratio of common months is >= 0.99
  • 0 < Distance < 20, AND part of the name is contained within each other
  • 0 < Distance < 20, AND ratio of common months is missing, AND Name is exactly the same, AND difference in mean is +/- 0.5 AND difference in stdev is +/- 0.5
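
The example rules above can be written as one function. This sketch assumes the rules are alternatives (any single rule is enough for a match) and keeps distance in whatever unit the rules intend, since none is stated. The function and argument names are illustrative.

```python
def is_station_match(distance, same_name, partial_name, ratio,
                     mean_diff=None, stdev_diff=None):
    """Return True if any of the example rules is satisfied.

    ratio is the ratio of common months, or None when it is missing.
    """
    has_ratio = ratio is not None
    if distance == 0 and same_name:
        return True
    if distance == 0 and not same_name and has_ratio and ratio > 0.4:
        return True
    if distance > 0 and same_name and has_ratio and ratio > 0.4:
        return True
    if distance >= 0 and has_ratio and ratio >= 0.99:
        return True
    if 0 < distance < 20 and partial_name:
        return True
    if (0 < distance < 20 and not has_ratio and same_name
            and mean_diff is not None and stdev_diff is not None
            and abs(mean_diff) <= 0.5 and abs(stdev_diff) <= 0.5):
        return True
    return False
```

For example, is_station_match(0, True, True, None) is a match by the first rule, while is_station_match(50, False, False, 0.5) satisfies none of them.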

Alternatively, to avoid hard-wired decisions, such checks could be coded in an explicitly Bayesian framework whereby each test is run and forms a suitably weighted 'prior', and all such priors are recombined to form a posterior probability of a station match. This is intuitively appealing, as most of these comparison statistics are in reality a continuum (e.g. a station whose reported latitude and longitude match to within 1 second should have more weight than one reported with one minute of separation) and are not well suited to ad hoc binary inclusion criteria.
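
One simple way to make this concrete is a naive Bayes combination in log-odds form: start from a prior probability of a match and multiply the odds by one likelihood ratio per test. This is a swapped-in illustration, not a settled design; it assumes the tests are conditionally independent, and the prior and likelihood ratios used below are invented.

```python
import math

def posterior_match_probability(prior, likelihood_ratios):
    """Combine a prior (0 < prior < 1) with per-test likelihood ratios.

    Each ratio is P(evidence | same station) / P(evidence | different
    stations); values above 1 favor a match, values below 1 count against.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for lr in likelihood_ratios:
        log_odds += math.log(lr)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Invented numbers: an even prior and a single test favoring a match 4 to 1.
posterior = posterior_match_probability(0.5, [4.0])  # about 0.8
```

Because each test contributes a continuous weight, a very close coordinate match can count for more than a marginal one without any hard cut-off.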

If a station match is found, the program then checks whether there are any non-common months. If so, a merge is performed. For common months, the non-missing data from the source higher in the hierarchy is given preference [Q: Are we going to mingle sources or simply leave missing mask untouched in the higher priority set?]. If a station does not match any station in the master dataset, it is considered unique and added to the master dataset.
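
The month-by-month merge can be sketched as below. It takes the "mingle sources" reading of the open question above, which is an assumption: months missing in the higher priority record are filled from the lower priority one. None marks a missing month, and the series are assumed to be aligned.

```python
def merge_station(higher, lower):
    """Merge two aligned monthly series, preferring the higher priority one."""
    return [h if h is not None else l for h, l in zip(higher, lower)]

# Month 1 is common (higher priority wins), month 2 is filled from the
# lower priority record, and month 3 stays missing.
merged = merge_station([1.0, None, None], [2.0, 3.0, None])
```

Under the other reading of the open question, the higher priority record would be returned unchanged.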