Abstract:
The Sociometrics Social Science Electronic
Data Library (SSEDL) in CD-ROM and Web formats includes over 300 data sets
from exemplary studies in seven health and social science fields:
adolescent pregnancy, the American family, social gerontology, maternal
drug abuse, AIDS and sexually transmitted diseases, disability, and
contextual influences on behavior. Design elements of this electronic
library are described, including: quality data, indexing at the variable
level, quality documentation, search and retrieval software, linked images
of original questionnaire item and page, and data extract software.
Resource packaging and dissemination strategies aimed at meeting the needs
of diverse research, teaching, and library target markets are also
discussed.
Keywords:
social science, health, data library,
preservation, dissemination, CD-ROM, Web
Increases in microcomputer processing
speed and hard disk storage space, coupled with decreases in the amount of
federal funding available for primary data collection, have made secondary
analysis of existing databases an attractive option for research and
teaching. Sociometrics has pioneered in making exemplary social science
data resources readily available, easy to use, and widely disseminated
through the establishment of topically-focused data archives in a number
of important health and social science areas:
- the Data Archive on Adolescent
Pregnancy and Pregnancy Prevention (150 studies comprising 234 data
sets and over 60,000 variables),
- the American Family Data Archive (20
studies comprising 122 data sets and over 70,000 variables),
- the Data Archive of Social Research on
Aging (3 studies comprising 22 data sets and over 19,000 variables),
- the Maternal Drug Abuse Data Archive (7
studies comprising 13 data sets and over 5,000 variables),
- the AIDS/STD Data Archive (11 studies
comprising 20 data sets and over 14,000 variables),
- the Research Archive on Disability in
the United States (19 studies comprising 40 data sets and over 23,000
variables), and
- the Contextual Data Archive (13 data
sets compiled from over 29 sources and over 20,000 variables).
Design A
previous article described the bootstrapping process that Sociometrics has
successfully employed to advance the field of data sharing in the social
sciences (Card, 1996). Each successive data archive has contributed to the
substantive advancement of its research field, by placing in the
public domain the "best-of-the-lot" data in the field. In
addition, each successive archive has contributed to the advancement of
the data sharing field, by enhancing standards for documentation of
public use social science data files (Table 1).
Quality Data.
Each data set in the Data Library has been selected for inclusion by a
National Advisory Panel of experts in the topical focus of the archive.
Selection has been based on strict scientific criteria of technical
quality, substantive utility, policy relevance, and potential for
secondary data analysis.
Indexing at the Variable Level.
Each variable in each data archive is indexed according to a set of
approximately 60 archive-relevant Topics that characterize the substance
of the variable and approximately 15 Types that characterize the kind of
measure (e.g., "Attitude," "Behavior,"
"Status"). This Topic and Type classification affords users a
powerful method of quickly searching for, and then extracting, variables
of interest both within and across data sets in an archive.
Quality Documentation.
Each data set is made publicly available with a standard set of five
machine-readable data and documentation files: (File 1) a raw data file;
(Files 2 and 3) machine-readable SPSS and SAS program statements that
fully document the variables and values in the data file; (File 4) an SPSS
data dictionary; and (File 5) SPSS frequencies. Each data set is
also accompanied by a printed User's Guide (provided in machine-readable
form, in addition to printed form, for the more recent archives) comprised
of a standard set of sections and subsections. The provision of standard
machine-readable and printed documentation assists users in
familiarizing themselves with the Sociometrics data sets. Once a user has
worked with one Sociometrics-packaged data set, it is easy for him or her
to work with any of the others. The original instrument and codebook are
offered as optional, supplementary documentation for each data set,
when available. For the more recent archives, the original instrument is
distributed in machine-readable form along with the data, as a set of
graphics files (page images).
Search and Retrieval Software.
Powerful search & retrieval software accompanies each data archive.
This software allows a user to search an entire topically-focused
collection, a customized group of data sets created explicitly for a given
user, or a single data set; to identify variables of interest across this
designated search space; and to save located variables as a search set.
Users can conduct: (1) full-text keyword searches, including variable
names, words in variable labels (question descriptors), and words in value
labels (response descriptors); (2) searches by assigned Topic and Type
codes; and (3) searches by study name or assigned data set number.
Standard Boolean operators (i.e., "and," "or,"
"not") can be used to combine search sets.
Linked Images of Original Questionnaire
Item and Page. An important
innovation achieved by the most recent data archives is the inclusion of
linked, electronic images of the original data collection instruments that
correspond to the archived data sets. This electronic link between the
variables and instruments allows users to obtain a better understanding of
actual variable content by viewing, for any variable of interest, the page
of the original data collection instrument containing the corresponding
item as asked of respondents. The instrument-variable link allows analysts
to examine questionnaire skip patterns and item context on-screen, a
process which enhances the variable selection process and reduces the need
for paper copies of instruments. In addition, users can also browse entire
original instruments or individual subsections of interest through a
feature that organizes the instrument around a topical table of contents.
Data Extract Software.
Finally, Data Extract software allows users of CD-ROM versions of archived
data sets to create customized SPSS or SAS program files containing only
those variables of interest to them. This capability permits analyses of
subsets of large data sets to be conducted quickly (with rapid
turn-around) on most microcomputers. It also saves users significant
program development time writing and re-writing SPSS and SAS program
statements to define variables used in a given analysis.
Dissemination
Having achieved what we believe to be a
close-to-optimal, cost-effective way to select and prepare data sets for
the public domain, we have turned our attention to innovative ways to
encourage use of this valuable data resource. The present report focuses
on advances in dissemination and user outreach that have taken place over
the last three years.
A public resource is only beneficial if it
is used appropriately. But use cannot occur without potential users being
aware of the existence, organization, contents, and capabilities of the
resource. Therefore, from the Data Library’s inception 15 years ago, we
have publicized its contents to individual researchers, professors and
students who could potentially use it. We have used a variety of methods
to reach potential users, including distribution of a thrice-yearly
newsletter, seeding of a complimentary data catalog, circulation of direct
mail fliers, placement of ads in professional journals, presentations of
papers in professional conferences, demonstrations of products at exhibit
booths at professional conferences, posting of resource announcements to
relevant Internet lists, and publication of papers in relevant scientific
journals.
More recent dissemination efforts have
turned to three new challenges: first, how to package the entire Library
of 300+ data sets from seven topically-focused collections in
cost-effective fashion; second, how to take advantage of the burgeoning
universality of a new technology: the Internet; and, third, how to meet
the needs of an important, growing constituency of non-social scientists:
data librarians.
In talking to our customers, we discovered
that what end users¾ researchers, professors, and students¾ appreciate
the most is quick access to high quality data. In contrast, librarians are
primarily concerned with archival preservation of these important
resources. To meet the differing needs of both these constituencies, we
created a new package consisting of all the data sets from all of our data
archives, the Sociometrics Social Science Electronic Data Library (SSEDL).
We put together three versions of SSEDL: two Internet versions and a
CD-ROM version.
Our Internet server (www.socio.com)
hosts a couple of SSEDL suites. The first suite allows all Internet users
to download SSEDL’s data sets upon provision of a credit card cybercash
payment. The second Internet suite allows faculty members and students of
SSEDL Data Consortium member institutions to download SSEDL’s data sets
for free. Membership in the SSEDL Data Consortium is obtained by the
institution’s library purchasing the CD-ROM version of SSEDL, for a
fraction of what the data sets would have cost separately (under $10, as
opposed to $225, per data set). This way, both the end user’s need for
quick access to high quality data and the librarian’s need for
preservation of the same data are simultaneously met in cost-effective
fashion.
We have supplemented our ongoing
dissemination efforts with several innovative ways of reaching our target
constituencies. First, we have begun teaming with professional
associations of social scientists and librarians to co-disseminate SSEDL
to their members at a discounted price. Second, we have developed
multimedia descriptions and demonstrations of SSEDL, both on CD-ROM and on
our Web site (http://www.socio.com/edl.htm).
Third, we are offering members of the SSEDL Data Consortium an opportunity
to keep their collection up to date by means of low-cost subscriptions to
SSEDL. Subscribers are provided with annual updates to the collection on
CD-ROM as well as ongoing access to the free data set-download area of the
SSEDL Internet suite.
Peering into the Future
We will continue expanding the content and
capabilities of our data set collections. We will continue the vigorous
dissemination of this valuable resource both through our direct efforts
and through collaborations with professional associations of scientists
and librarians.
We are currently developing two products
related to SSEDL. BSRI, the Behavioral Science Research Instruments
Archive, will contain searchable, edit-ready, and print-ready
machine-readable versions of the demographic, behavioral, and health
science instruments¾ questionnaires, medical forms, interview protocols¾
used to collect the data in SSEDL. Norms for the scales and items
comprising the BSRI instruments will be included in the archive in the
form of scale means and standard deviations, item frequencies or response
distributions, and item crosstabulations with age, race/ethnicity, and
gender, obtained from the linked SSEDL data archives. BSRI will also
contain a link to the corresponding SSEDL data file that will allow a
researcher to select variables for a fully documented SPSS or SAS analysis
extract file from BSRI’s variable listing, original instrument, and/or
item statistics.
MIDAS, the Multivariate Interactive Data
Analysis System, will allow online analysis of the data in SSEDL. Online
data analytic procedures will include weighted and unweighted frequencies,
percentiles, and measures of dispersion and central tendency, as well as
two-way and n-way tables with measures of association, comparison
of means (2-group and ANOVA) and correlations, and the calculation of
complex variance estimations. Users will be able to define case subsets,
recodes, or aggregations for analysis, and then produce output which can
be downloaded or printed. Custom dataset download will also be available.
Notes
- Additional information on the 300+ data
sets comprising the Sociometrics Social Science Electronic Data
Library (SSEDL) can be obtained from http://www.socio.com/edl.htm.
- Josefina J. Card, Ph.D. is the
President of Sociometrics Corporation. Comments or questions can be
addressed to her by mail: Sociometrics Corporation, 170 State Street,
Suite 260, Los Altos, CA 94022; by telephone: 650-949-3282 ext. 211;
or e-mail: jjcard@socio.com. The author wishes to thank colleagues
Eric Lang and Michael Carley for their assistance with compiling some
of the data set information cited in this report.
References
Card, J. J. (1996). Development of the
Sociometrics Data Library on Families, Aging, Substance Abuse, and AIDS. Social
Science Computer Review, 14(3), 305-309.
|