Vetting data: Difference between revisions

From Pardee Wiki
Jump to navigation Jump to search
No edit summary
No edit summary
Line 1: Line 1:
Vetting is the process of checking that the data that we import to IFs are of good quality. Even when we import the data through an automated application there could be errors or missing data in the imported series.
'''Vetting Data'''


One of the major reason an entire country (or a number of countries) could be missed is the mismatch in country concordance table. Make sure the IFs country concordance list (i.e., the relevant column in the '''Country '''Translation table in IFs.mdb) is correct and up to date. Make sure none of the year columns are missed. Sometime the year column is missed because the data is formatted as text and not as numbers. One more general issue is around the units. Make sure you perform any unit conversion required before you import the data. Comparing the new data with the current IFs series will give you an idea on any unit conversion that might be needed.
Vetting is the process of checking to assure the quality of the data we import into IFs. Even when data is imported through an automated application there could be errors or missing data in the imported series. Vetting the new data you are bringing into IFs requires you to compare the new series to the existing historical series in IFs.


It is not recommended that you vet every single country. You can ensure data quality by following certain procedures:
There are a number of reasons for vetting data. One reason is that a country (or a number of countries) could be missed when importing due to not properly concording countries with the proper country concordance table. Make sure the IFs country concordance list (i.e., the relevant column in the '''Country '''Translation table in IFs.mdb) is correct and up to date and you are using the correct concordance list as listed in the DataDict. Additionally, make sure there are no missing years in the new data. Sometimes the year column is missed because the data is formatted as text and not as numbers, or the source simply did not provide an update for that year.  If this is the case, be sure to blend in the missing years in the IFs vetting tool. One more general issue is around the units. Make sure you perform any unit conversion required before you import the data. Comparing the new data with the current IFs series will give you an idea on any unit conversion that might be needed.


#For an updated table, open the existing and updated table side by side. Most socio-economic data do not change very fast and we usually update data on a regular basis. If you see big difference on the most recent field you should be concerned. Sort the "MostRecent" column ascending and descending and compare the countries that are highest and lowest in the two tables.
 
#Both for updated and new data, check the countries that you are familiar with or for a country for which we can make a guess. For example, the 'ranking' comparison described above can be done as a 'ranking'  check for new data (rank the "Most Recent" or "Earliest") for data series on issue areas (like poverty or governance or education) for which we have a good sense of global ranking.
#For any series, check large and important countries like USA, China or Germany. Check a couple of very poor and very small countries. Check any other country that you are very familiar with.
#Check countries with similar name (like the two Congos and two Koreas)
#If you are using template spreadsheet and manual copy and paste, check for countries at the bottom and top of the spreadsheet (don't miss Zimbabwe).
#Check for 0's. We take no data as an empty cell. If we have 0's that need to be data and must be a feasible value (e.g., GDP cannot be 0).
#Do a reality check on variable definition: Make sure percentages are below 100 (there are cases when percentages can be above 100, e.g., gross enrollment rate).
#More reality check on value: GDP growth rate of more than 10% should raise a flag (check to make sure); GDP growth rate of more than 5% for a rich country; calorie per capita of more than 3000 for a poor country; calorie per capita of more than 4000?
#When data is in Excel, line graphs for a country is a way to quickly check for transients
#If you use ImportXLS do a scatter plot on most recent (or some other years) column for old and new data


= Vetting Tool: Compare Imported Data Form in IFs =
'''Vetting Data''' '''Checklist:'''


IFs now has a new feature to do some basic initial vetting, checking things like: are there any zeroes or negatives in the imported data, are there big spikes in the extracted data, are any of the updated data value wildly different from existing data.
*Data in Access file and source Excel file match
*Use IFs vetting tool to compare new data to historical data
*Large discrepancies between new and old data are documented
*Check to see if there are big spikes in the new, imported data
*Blend countries/years with missing data in IFs vetting tool
*Zeroes in Access file are actually zeroes and not null values
*No missing countries or years
*DataDict contains no errors, Original Source has website name and Name in Source contains variable source
*Initials added to both Access and Excel file names, as well as in the DataDict notes
*Send Word file with vetting notes, along with original files to puller and project leads


This feature (screen) can also be used to merge year columns between updated and existing Access tables, a blending requirement.
'''Process:'''


The feature can be reached through IFs menu options:
#After the data series are imported into IFs, the IFsDataImport.mdb file is passed to another RA (assigned as a vetter) who brings the data from the IFsDataImport.mdb file into the model via the "Vet Imported Data" feature in the Extended Features menu
#The vetter uses the vetting tool to determine if there are any remarkable inconsistencies between the old and new data. This is a bit subjective, but necessarily so as the threshold for concern will be different depending on which series from which source is being vetted
#If significant errors are found that need to be corrected, and these errors had occurred during the import process, these are documented in the RA's vetting notes and the files are passed back to the original data puller to correct and re-send. 
#If no errors are found, the vetter blends columns for new years and preserves historical data points as appropriate, and saves the changes in the IFsDataImport file. 
#The vetter passes the IFsDataImport file (renamed to reflect series/batch update name and completion date) back to the project lead and data team supervisor
#Data team supervisor stores data import file for consolidation process


IFs -> MainMenu->Extended Features -> Manage Country Data->Historic Data File Processing-> Compare Imported Data
 


The form (screen) that opens is shown in the figure below. The process starts by opening the Access (.Mdb) file containing data series tables and a dictionary table with a list of new and/or updated data variables. "File" menu option in this form can be used to open a .mdb file, e.g. IFsDataImport.mdb or IFsDataImportBatch.mdb or any such import files passed from data extraction and renamed for the sake of filing convenience.
'''You can ensure data quality by following certain procedures''':


As long as the opened .mdb contains a bunch of data tables and a datadict with a list of those tables, the top grid will fill up with the list from the datadict. The second column in the top grid will be filled using master datadict and the cells in this column will remain empty for new extractions.
#For an updated table, open the existing and updated table side by side. Compare data points for all years and all countries to make sure there are no, or very small, differences.
#For all series, be sure to check large and important countries like USA, China or Germany.
#Check countries with similar name (like the two Congos and two Koreas), as these can sometimes get mixed up.  
#Check for zeros and make sure they are actually zeros in the source data. We take no data as an empty cell. If we have zeros that need to be data and must be a feasible value (e.g., GDP cannot be 0).
#Check the variable definition and make sure it makes sense and is in fact the right definition.
#Make sure percentages are below 100 (there are cases when percentages can be above 100, e.g., gross enrollment rate).
#Check values: GDP growth rate of more than 10% should raise a flag. For instances like this, check against the source data.
#Creating line graphs for the countries in a series is a great way to quickly check for transients. This can easily be done in Excel or Tableau.


Once you double click on any of the rows in the top grid, data for the new and the existing table (for updates) will show up in the middle grids. Any zero or negative will be marked red. You can use buttons below the grids to clear out zeroes and/or negatives from the table containing extracted data.
 


[[File:VetDataCompareWizard.png|RTENOTITLE]]
'''Using the IFs Vetting Tool'''


IFs has a feature to do some basic initial vetting and is a tool that should be used in every vetting process. The vetting tool can be found by the following:


'''IFs -> MainMenu->Extended Features -> Manage Country Data->Historic Data File Processing-> Vet Imported Data'''.


The button titled Mark Transients marks any spike found across columns (for the same country in the same table). The marking algorithm uses a default threshold of 10% (up or down) that can be changed by using the Marking Threshold box.
 


Another button titled Mark Inconsistencies mark any big discrepancy found between same data points (same country, same year) in the two tables, imported and existing. This marking algorithm also uses a default threshold of 10% (up or down) that can be changed by using the Marking Threshold box.


You should use the Refresh Display button as you switch from one kind (or threshold) of marking to another.


"Bring Years From Existing Table to New Table" button serves the blending purpose of merging complete year columns that are missing in the newly imported table (this button can, in fact, should be used post extraction prior to vetting).
The Compare Imported Table With Existing Table screen is then opened.


"Update Data Dictionary for Imported Table" button on the bottom left takes to a data dictionary form and allows you to vet/edit the data dictionary info filled during data extraction.
[[File:|624x351px|A screenshot of a social media post


The big button on the bottom right titled "Merge Imported Data to Master Database" <u>should not</u> be used during this stage of vetting. This button adds (or overwrites) imported tables to IFs\Data\IFsHistSeries.Mdb and merges datadict rows to the master copy of datadict (IFs\Data\Datadict.mdb). This button should only be used by the person who is in charge of merging the data to IFs database.
Description automatically generated]]
 
Here, go to '''File>Open Database with Imported Tables, '''and open the Access file with the data series that you want to import into IFs. This Access file contains data series tables and a dictionary table with a list of new and/or updated data variables. This will then populate the top grid with every series within that Access file.
 
[[File:|624x351px|A screenshot of a cell phone
 
Description automatically generated]]
 
Next, double click on one of the series (rows) in the top grid to display the data. This will then populate the new data (the data you are importing) and the old, historical data that is in IFs. The old data is in the first grid and the new data is in the second grid, as shown above.
 
Here you have a lot of different options to compare the new data with the old. The vetting tool will automatically mark any zeros in the table. You will want to double check to see if, in fact, these are suppose to be zeros or if they are just null values. If they are just null values, you can click the button, '''Delete Zeros.'''
 
Next, you need to click the button, '''Mark Differences between New and Old Data for same country-year. '''This will allow you to easily observe any differences, as the vetting tool will highlight them. You can also change the threshold that it will mark. The default is that it will mark any more than 10% difference. As you can see in the screenshot below, the values for Afghanistan in 2014 and 2015 are marked. You will want to go through all of the countries in each series you are importing to look at these marked differences and write down any that are very large.
 
[[File:|624x351px|A screenshot of a social media post
 
Description automatically generated]]
 
You can also click the button '''Mark''' '''Year to Year Jumps in New Data, '''which is helpful to observe any large changes year-to-year.
 
'''Blending'''
 
This vetting tool is also used to merge country and year columns between updated and existing Access tables. This is a necessity when blending to preserve the existing historical IFs values when there is no corresponding value in the new data series. For example, often times there will be an update from a source but it is missing multiple countries we already have data for. In this case, you can use the blending tool to blend in any historical country values that are not in the new data. The same can be done for missing years in the new data. &nbsp;
 
Once you have thoroughly vetted a data series, you will want to blend the new and historical series together. This ensures that we do not lose any values from the historical series. To do this for the Country values, you will want to check the '''Select All '''box in the bottom right hand corner of the screen. This will automatically check all of the countries, as seen below. Once you have check this box, press the '''Blend Data Points '''button, and click '''Yes. '''This will update the new Access file with missing Country observations.
 
You will also want to do the same process for the '''Blend Columns (Years) '''button, if there are any missing years. This ensures that we retain the historical values.
 
[[File:|624x351px|A screenshot of a social media post
 
Description automatically generated]]
 
&nbsp;

Revision as of 16:44, 25 July 2019

Vetting Data

Vetting is the process of checking to assure the quality of the data we import into IFs. Even when data is imported through an automated application there could be errors or missing data in the imported series. Vetting the new data you are bringing into IFs requires you to compare the new series to the existing historical series in IFs.

There are a number of reasons for vetting data. One reason is that a country (or a number of countries) could be missed when importing due to not properly concording countries with the proper country concordance table. Make sure the IFs country concordance list (i.e., the relevant column in the Country Translation table in IFs.mdb) is correct and up to date and you are using the correct concordance list as listed in the DataDict. Additionally, make sure there are no missing years in the new data. Sometimes the year column is missed because the data is formatted as text and not as numbers, or the source simply did not provide an update for that year.  If this is the case, be sure to blend in the missing years in the IFs vetting tool. One more general issue is around the units. Make sure you perform any unit conversion required before you import the data. Comparing the new data with the current IFs series will give you an idea on any unit conversion that might be needed.

 

Vetting Data Checklist:

  • Data in Access file and source Excel file match
  • Use IFs vetting tool to compare new data to historical data
  • Large discrepancies between new and old data are documented
  • Check to see if there are big spikes in the new, imported data
  • Blend countries/years with missing data in IFs vetting tool
  • Zeroes in Access file are actually zeroes and not null values
  • No missing countries or years
  • DataDict contains no errors, Original Source has website name and Name in Source contains variable source
  • Initials added to both Access and Excel file names, as well as in the DataDict notes
  • Send Word file with vetting notes, along with original files to puller and project leads

Process:

  1. After the data series are imported into IFs, the IFsDataImport.mdb file is passed to another RA (assigned as a vetter) who brings the data from the IFsDataImport.mdb file into the model via the "Vet Imported Data" feature in the Extended Features menu
  2. The vetter uses the vetting tool to determine if there are any remarkable inconsistencies between the old and new data. This is a bit subjective, but necessarily so as the threshold for concern will be different depending on which series from which source is being vetted
  3. If significant errors are found that need to be corrected, and these errors had occurred during the import process, these are documented in the RA's vetting notes and the files are passed back to the original data puller to correct and re-send. 
  4. If no errors are found, the vetter blends columns for new years and preserves historical data points as appropriate, and saves the changes in the IFsDataImport file. 
  5. The vetter passes the IFsDataImport file (renamed to reflect series/batch update name and completion date) back to the project lead and data team supervisor
  6. Data team supervisor stores data import file for consolidation process

 

You can ensure data quality by following certain procedures:

  1. For an updated table, open the existing and updated table side by side. Compare data points for all years and all countries to make sure there are no, or very small, differences.
  2. For all series, be sure to check large and important countries like USA, China or Germany.
  3. Check countries with similar name (like the two Congos and two Koreas), as these can sometimes get mixed up.  
  4. Check for zeros and make sure they are actually zeros in the source data. We take no data as an empty cell. If we have zeros that need to be data and must be a feasible value (e.g., GDP cannot be 0).
  5. Check the variable definition and make sure it makes sense and is in fact the right definition.
  6. Make sure percentages are below 100 (there are cases when percentages can be above 100, e.g., gross enrollment rate).
  7. Check values: GDP growth rate of more than 10% should raise a flag. For instances like this, check against the source data.
  8. Creating line graphs for the countries in a series is a great way to quickly check for transients. This can easily be done in Excel or Tableau.

 

Using the IFs Vetting Tool

IFs has a feature to do some basic initial vetting and is a tool that should be used in every vetting process. The vetting tool can be found by the following:

IFs -> MainMenu->Extended Features -> Manage Country Data->Historic Data File Processing-> Vet Imported Data.

 


The Compare Imported Table With Existing Table screen is then opened.

[[File:|624x351px|A screenshot of a social media post

Description automatically generated]]

Here, go to File>Open Database with Imported Tables, and open the Access file with the data series that you want to import into IFs. This Access file contains data series tables and a dictionary table with a list of new and/or updated data variables. This will then populate the top grid with every series within that Access file.

[[File:|624x351px|A screenshot of a cell phone

Description automatically generated]]

Next, double click on one of the series (rows) in the top grid to display the data. This will then populate the new data (the data you are importing) and the old, historical data that is in IFs. The old data is in the first grid and the new data is in the second grid, as shown above.

Here you have a lot of different options to compare the new data with the old. The vetting tool will automatically mark any zeros in the table. You will want to double check to see if, in fact, these are suppose to be zeros or if they are just null values. If they are just null values, you can click the button, Delete Zeros.

Next, you need to click the button, Mark Differences between New and Old Data for same country-year. This will allow you to easily observe any differences, as the vetting tool will highlight them. You can also change the threshold that it will mark. The default is that it will mark any more than 10% difference. As you can see in the screenshot below, the values for Afghanistan in 2014 and 2015 are marked. You will want to go through all of the countries in each series you are importing to look at these marked differences and write down any that are very large.

[[File:|624x351px|A screenshot of a social media post

Description automatically generated]]

You can also click the button Mark Year to Year Jumps in New Data, which is helpful to observe any large changes year-to-year.

Blending

This vetting tool is also used to merge country and year columns between updated and existing Access tables. This is a necessity when blending to preserve the existing historical IFs values when there is no corresponding value in the new data series. For example, often times there will be an update from a source but it is missing multiple countries we already have data for. In this case, you can use the blending tool to blend in any historical country values that are not in the new data. The same can be done for missing years in the new data.  

Once you have thoroughly vetted a data series, you will want to blend the new and historical series together. This ensures that we do not lose any values from the historical series. To do this for the Country values, you will want to check the Select All box in the bottom right hand corner of the screen. This will automatically check all of the countries, as seen below. Once you have check this box, press the Blend Data Points button, and click Yes. This will update the new Access file with missing Country observations.

You will also want to do the same process for the Blend Columns (Years) button, if there are any missing years. This ensures that we retain the historical values.

[[File:|624x351px|A screenshot of a social media post

Description automatically generated]]