GEODE: file formatting instructions

File format conversions (SPSS and Stata examples)

This page explains how to translate data files to and from the 'CSV' (tab-delimited) format used by most GEODE facilities.

Introduction

GEODE offers a facility that allows users to link their own datasets (e.g. micro-social survey records) with appropriate occupational information files (further help is available on the GEODE facilities for matching occupational data).

The data matching procedures used in GEODE operate on plain text data files. These are known as 'tab delimited' files or 'csv' files, for 'comma separated values' (the latter is a misnomer here, in that 'csv' is used to refer to the generic class of plain text data files). Both the user's data and the occupational information file (e.g. a classification index) need to be stored in this format during the processing stage.

Of course, most social scientists work with their data in the format of a data analysis package, popular examples being the proprietary packages SPSS and Stata. A translation procedure may therefore be needed to convert the data between the social scientist's preferred format and the plain text format suited to the GEODE processor. At present [Jan 2007] users are required to perform this simple translation procedure manually (though we hope to be able to implement an automatic procedure for this in the future).

The text below gives instructions on how to implement this formatting translation in SPSS and in Stata. There are two processes involved:

translating the original data into csv format

translating the new data, produced by GEODE in csv format, back into the favoured package

These translations are routine in SPSS and Stata and in most other analysis packages.

To reiterate, the data files used by most researchers typically look something like:

[Image: example data file in SPSS format]

[Image: example data file in Stata format]

A csv format of the equivalent data file would look something like:

[Image: the equivalent data file in CSV format]

(Note that csv files, when they include variable names, may not show a good alignment between the variable names and the numeric data when viewed as plain text.)
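As a rough illustration of the layout described above, the following Python sketch writes a small tab-delimited file with variable names in the first row. It is independent of SPSS and Stata, and the variable names and values (pid, isco88, age) are hypothetical, chosen only for illustration.

```python
import csv
import io

# Hypothetical survey records: a case identifier, an ISCO-88 occupation code, and age.
rows = [
    ["pid", "isco88", "age"],
    ["1", "2411", "34"],
    ["2", "9132", "58"],
]

# Write the rows as tab-delimited text, with the variable names in the first row.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()

print(text)        # tabs render as variable-width gaps, so columns may look misaligned
print(repr(text))  # the raw form shows the single tab character between fields
```

The second print shows why the alignment can look poor: each field is separated by exactly one tab character, whatever the width of the variable name or value.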

SPSS examples

The key commands in SPSS are as follows:

To save from an SPSS format file into a new plain text format file:

get file="c:\temp\occ_survey_data.sav".
save translate /outfile="c:\temp\occ_data1.dat" /type=tab /fieldnames /replace .

To read in a plain text file (with variable names in top row) and save in SPSS format:

get translate /file="c:\temp\occ_data2.dat" /type=tab /fieldnames.
save outfile="c:\temp\occ_survey_data2.sav".

These commands would correspond to a process whereby:

the user has an initial SPSS format data file called occ_survey_data.sav

this file is saved out into a new plain text file called occ_data1.dat

the GEODE portal processor is deployed to add in some occupational data to occ_data1.dat, producing a new file called occ_data2.dat

the new data plain text file occ_data2.dat is read into SPSS then saved in SPSS format

This SPSS syntax file gives a longer version of the same commands.

Further issues in making file format translations when working with SPSS:

Loss of value labels / other 'dictionary' information. When you translate a file from SPSS to plain text, as described above, none of the 'dictionary' information associated with the file (e.g. variable and value labels) is preserved. There are several ways to avoid this loss. We suggest that the simplest is to extract only a subset of your data, along with a case identifier variable, perform the GEODE linkage on that subset, and then link the data back together. This method is illustrated in the SPSS syntax file mentioned above.

Missing data storage. A problem can arise if your original micro-social data file includes SPSS system missing values. It occurs because system missing values are exported as blanks by the conversion methods described above for creating a csv format file; it does not occur if numeric values are exported instead of blanks. We therefore recommend that all system missing values, on any variables that are to be exported to the plain text csv file, be converted to a numeric missing value indicator. In SPSS this may be achieved with the 'recode' and 'missing values' commands, e.g. 'recode all (missing,sysmis=-9).' followed by 'missing values all (-9).'

The problem (which is shared by Stata implementations) arises if there are system missing values on any variables in the target matching file. When this occurs, the initial file matching can still be processed on the plain text files, producing a new plain text file with the additional outputs. However, when 'get translate' is used to read the new plain text file back into SPSS, the blanks are read as 'string' records, which may force the whole variable into string format even where this is not intended. Aside from avoiding system missing values in the input (using 'recode' and 'missing values' as described above), another solution in SPSS is to forcibly declare the new variables to be numeric format.

Matching data from multiple variables. If you wish to process GEODE matches on more than one occupational variable (e.g. own occupation, then spouse's occupation), the plain text outputs can come in a format that is problematic for SPSS. This is described further on our page giving guidance on matching occupational data using GEODE.

Reading newly created files in SPSS. Sometimes the new csv files created by a matching process cannot be opened immediately by SPSS. This will occur if a previous procedure is still running on them. The main example is the Java processor: if it is still active and hasn't been closed, SPSS will treat the data file as open in another application and will not allow access to it.
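The missing data issue described above can be checked before a file is passed to GEODE. The following Python sketch is a package-independent illustration, not part of GEODE: it counts blank fields per variable in a tab-delimited file with variable names in the first row. The function name blank_counts and the sample values are hypothetical.

```python
import csv
import io

def blank_counts(handle):
    """Count blank fields per column in a tab-delimited file with a header row."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    counts = {name: 0 for name in header}
    for record in reader:
        # zip stops at the shorter of header and record, so short rows are tolerated
        for name, value in zip(header, record):
            if value.strip() == "":
                counts[name] += 1
    return counts

# Hypothetical export: case 2 has a system missing isco88, exported as a blank.
sample = io.StringIO("pid\tisco88\tage\n1\t2411\t34\n2\t\t58\n")
print(blank_counts(sample))  # prints {'pid': 0, 'isco88': 1, 'age': 0}
```

To check a real export, the same function could be given a file handle instead, e.g. one opened with open("c:/temp/occ_data1.dat", newline=""). Any variable with a non-zero count is a candidate for the recoding described above.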

Link - Help with SPSS syntax and data management

Stata examples

The key commands in Stata are as follows:

To save from a Stata format file into a new plain text format file:

use "c:\temp\occ_survey_data.dta"
outsheet using "c:\temp\occ_data1.dat", nolabel replace

To read in a plain text file (with variable names in top row) and save in Stata format:

insheet using "c:\temp\occ_data2.dat", clear
save "c:\temp\occ_survey_data2.dta", replace

These commands would correspond to a process whereby:

the user has an initial Stata format data file called occ_survey_data.dta

this file is saved out into a new plain text file called occ_data1.dat

the GEODE portal processor is deployed to add in some occupational data to occ_data1.dat, producing a new file called occ_data2.dat

the new data plain text file occ_data2.dat is read into Stata and then saved in Stata format

This Stata do file gives a longer version of the same commands.

Further issues in making file format translations when working with Stata:

Loss of value labels and other 'dictionary' information. When you translate a file from Stata to plain text as described above, none of the 'dictionary' information associated with the file (e.g. variable and value labels) is preserved. There are several ways to avoid this loss. We suggest that the simplest is to extract only a subset of your data, along with a case identifier variable, perform the GEODE linkage on that subset, and then link the data back together. This method is illustrated in the Stata do file mentioned above.

Missing data storage. Two problems can arise if your original micro-social data file includes Stata defined system missing values. In both cases the problem occurs because system missing values are exported as blanks when using the conversion methods described above for creating a csv format file. The problems don't occur if other numeric values are exported instead of blanks. Therefore we recommend as a solution that all system missing values, on any variables, which are to be exported to the plain text csv file, should be converted to a numeric missing value indicator. In Stata this may be achieved by the 'mvencode' command, e.g. 'mvencode isco88, mv(-9)'.

The first problem arises if there are system missing values on any of the key linking variables (e.g. isco88; soc90; employment status). The issue here is ultimately that such blanks are not valid values within the standard category definition of the key linking variables (see our notes on occupational index units), although other values outside the standard category range would be acceptable so long as they had a physical manifestation in the file. Because Stata exports missing records to text files with no entries at all, the plain text file seems, to the GEODE processor, to be in the wrong format, and it cannot be opened during the 'load data' stage of the Java occupational matching implementation.

The second problem (which is shared by SPSS implementations) arises if there are system missing values on any other variables within the target matching file. When this occurs, the initial file matching can still be processed on the plain text files, producing a new plain text file with the additional outputs. However, when 'insheet' is used to read the new plain text file back into Stata, the blanks are read as 'string' records, which may force the whole variable into string format even where this is not intended. Aside from avoiding system missing values in the input (using 'mvencode' as described above), another solution in Stata is to forcibly 'destring' the new variables, e.g. 'destring voting, replace force'.

Matching data from multiple variables. If you wish to process GEODE matches on more than one occupational variable (e.g. own occupation, then spouse's occupation), the plain text outputs can come in a format that is problematic for Stata. This is described further on our page giving guidance on matching occupational data using GEODE.

Reading newly created files in Stata. Sometimes the new csv files created by a matching process cannot be opened immediately by Stata. This will occur if a previous procedure is still running on them. The main example is the Java processor: if it is still active and hasn't been closed, Stata will treat the data file as open in another application and will not allow access to it.
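Where blanks have already reached a plain text file, one possible workaround for the missing data problems described above is to fill them in the text file itself before reading it into the analysis package. The following Python sketch is an assumption-laden illustration rather than a GEODE feature: the function name fill_blanks, the indicator -9 and the sample variables (pid, voting, camsis) are all hypothetical.

```python
import csv
import io

def fill_blanks(source, target, code="-9"):
    """Copy a tab-delimited file, writing `code` wherever a data field is blank."""
    reader = csv.reader(source, delimiter="\t")
    writer = csv.writer(target, delimiter="\t", lineterminator="\n")
    writer.writerow(next(reader))  # copy the variable names unchanged
    for record in reader:
        writer.writerow([value if value.strip() else code for value in record])

# Hypothetical matched output in which 'voting' is blank for case 2.
source = io.StringIO("pid\tvoting\tcamsis\n1\t3\t52.1\n2\t\t47.8\n")
target = io.StringIO()
fill_blanks(source, target)
print(target.getvalue())
```

After the filled file is read in, the indicator would still need to be declared as missing in the analysis package (in Stata, for instance, with 'mvdecode voting, mv(-9)'), and it should be a value that cannot occur as a genuine code on the variables concerned.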

Link - Help with Stata command files and data management

Last modified 11 January 2007
This document is maintained by Paul Lambert (paul.lambert@stirling.ac.uk)

