|Title||Identifying Relationships between Scientific Datasets|
|Year of Publication||2016|
|Academic Department||Computer Science|
|Number of Pages||149|
|University||Portland State University|
|Keywords||Bloom filters, conditional random fields (CRFs), data extraction, data profiling, schema matching, scientific data management, spreadsheets, support vector machines (SVMs)|
Scientific datasets associated with a research project can proliferate over time as a result of activities such as sharing datasets among collaborators, extending existing datasets with new measurements, and extracting subsets of data for analysis. As such datasets begin to accumulate, it becomes increasingly difficult for a scientist to keep track of their derivation history, which complicates data sharing, provenance tracking, and scientific reproducibility. Understanding what relationships exist between datasets can help scientists recall their original derivation history. For instance, if dataset A is contained in dataset B, then the connection between A and B could be that A was extended to create B.
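The containment example can be made concrete with a minimal sketch. It assumes each dataset is a list of rows whose columns are in the same order, and it compares rows exactly. It is an illustration only, not the method developed in the thesis, and the function names `is_contained` and `relationship` are hypothetical.

```python
from collections import Counter


def is_contained(a_rows, b_rows):
    """Return True if every row of A also appears in B, respecting duplicates.

    Assumption: rows are sequences of hashable values in the same column order.
    """
    a_counts = Counter(map(tuple, a_rows))
    b_counts = Counter(map(tuple, b_rows))
    return all(b_counts[row] >= n for row, n in a_counts.items())


def relationship(a_rows, b_rows):
    """Classify the row-level relationship between two datasets."""
    a_in_b = is_contained(a_rows, b_rows)
    b_in_a = is_contained(b_rows, a_rows)
    if a_in_b and b_in_a:
        return "equal"
    if a_in_b:
        return "A contained in B"
    if b_in_a:
        return "B contained in A"
    return "no containment"


# Hypothetical data: B is A extended with one new measurement.
dataset_a = [("site1", 1.0), ("site2", 2.5)]
dataset_b = dataset_a + [("site3", 3.1)]
print(relationship(dataset_a, dataset_b))
```

A result of "A contained in B" is consistent with A having been extended to create B, but it is equally consistent with A having been extracted as a subset of B, so containment alone suggests a derivation rather than proving one.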