Identifying Relationships between Scientific Datasets

Submitted by Abdussalam Alawini on Thu, 08/25/2016 - 10:12

Title: Identifying Relationships between Scientific Datasets
Publication Type: Thesis
Year of Publication: 2016
Authors: Alawini A
Advisor: Maier D
Academic Department: Computer Science
Number of Pages: 149
Date Published: 03/2016
University: Portland State University
Thesis Type: Dissertation
Keywords: Bloom filters, conditional random fields (CRFs), data extraction, data profiling, schema matching, scientific data management, spreadsheets, support vector machines (SVMs)

Scientific datasets associated with a research project can proliferate over time as a result of activities such as sharing datasets among collaborators, extending existing datasets with new measurements, and extracting subsets of data for analysis. As such datasets begin to accumulate, it becomes increasingly difficult for a scientist to keep track of their derivation history, which complicates data sharing, provenance tracking, and scientific reproducibility. Understanding what relationships exist between datasets can help scientists recall their original derivation history. For instance, if dataset A is contained in dataset B, then the connection between A and B could be that A was extended to create B.
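As a rough illustration of the containment example above, the following Python sketch checks whether every row of one dataset also appears in another. It is not the dissertation's algorithm; the function name `is_row_contained` and the representation of a dataset as a list of row tuples are assumptions made only for this example.

```python
from collections import Counter


def is_row_contained(a_rows, b_rows):
    """Return True if every row of A appears in B at least as often as in A.

    Hypothetical helper for illustration: rows are compared as exact tuples,
    so column order and value formatting must already agree.
    """
    a_counts = Counter(tuple(row) for row in a_rows)
    b_counts = Counter(tuple(row) for row in b_rows)
    return all(b_counts[row] >= count for row, count in a_counts.items())


# Dataset B extends dataset A with one additional measurement.
dataset_a = [(1, 2.5), (2, 3.1)]
dataset_b = dataset_a + [(3, 4.0)]

a_in_b = is_row_contained(dataset_a, dataset_b)
b_in_a = is_row_contained(dataset_b, dataset_a)
```

Here `a_in_b` is true and `b_in_a` is false, which is consistent with A having been extended to create B. Exact row matching is a deliberate simplification; real spreadsheets may differ in column order, headers, or formatting.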
We present a relationship-identification methodology as a solution to this problem. To examine the feasibility of our approach, we articulated a set of relevant relationships, developed algorithms for efficient discovery of these relationships, and organized these algorithms into a new system called ReConnect to assist scientists in relationship discovery. We also evaluated existing alternative approaches that rely on flagging differences between two spreadsheets and found that they were impractical for many relationship-discovery tasks. Additionally, we conducted a user study, which showed that relationships do occur in real-world spreadsheets, and that ReConnect can improve scientists' ability to detect such relationships between datasets.
The promising results of ReConnect's evaluation encouraged us to explore a more automated approach for relationship discovery. In this dissertation, we introduce an automated end-to-end prototype system, ReDiscover, that identifies, from a collection of datasets, the pairs that are most likely related, and the relationship between them. Our experimental results demonstrate the overall effectiveness of ReDiscover in predicting relationships in the dataset collections of an individual scientist or a small group of researchers, and the sensitivity of the overall system to the performance of its various components.