Instructor: Amarnath Gupta

Dates: 9/30, 10/13, 10/28, 11/3, 11/17, 12/8

Time: 9:00 AM - 4:30 PM

Objectives

Any modern organization has its information spread over several information hosts, such as multiple databases and the Web. Many applications, however, need to access more than one information source and combine the results so that end users can work with the combined information. Data integration refers to the formal process of combining information from disparate, heterogeneous sources and developing the mechanisms to query and analyze the integrated information. The goal of this course is to understand the nature of information heterogeneity, the techniques for relating information from different sources, and the machinery required to achieve the integration. We will approach data integration from both a system’s viewpoint and a user’s viewpoint. Toward the end of the course, we will briefly cover how data integration and the domain of so-called “big data” come together.
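To make the definition concrete, here is a minimal Python sketch of combining two heterogeneous sources and then querying the integrated result. Everything in it is an assumption made for illustration and is not taken from the course materials: the two sources (`hr_db`, `web_directory`), their fields, the sample records, and the name-matching rule are all hypothetical.

```python
# Hypothetical source 1: rows from a relational database.
hr_db = [
    {"emp_id": 1, "full_name": "Ada Lovelace", "dept": "ENG"},
    {"emp_id": 2, "full_name": "Alan Turing", "dept": "RES"},
]

# Hypothetical source 2: a Web directory with a different schema and name format.
web_directory = [
    {"name": "LOVELACE, ADA", "phone": "555-0100"},
    {"name": "TURING, ALAN", "phone": "555-0101"},
    {"name": "HOPPER, GRACE", "phone": "555-0102"},
]


def normalize_db_name(full_name):
    """Map 'First Last' to a lowercase, single-spaced matching key."""
    return " ".join(full_name.lower().split())


def normalize_web_name(name):
    """Map 'LAST, FIRST' to the same 'first last' matching key."""
    last, first = [part.strip() for part in name.split(",", 1)]
    return f"{first} {last}".lower()


def integrate(db_rows, web_rows):
    """Build one integrated view: each database row plus its phone, if matched."""
    phones = {normalize_web_name(r["name"]): r["phone"] for r in web_rows}
    integrated = []
    for row in db_rows:
        key = normalize_db_name(row["full_name"])
        integrated.append(
            {
                "name": row["full_name"],
                "dept": row["dept"],
                "phone": phones.get(key),  # None when no Web entry matches
            }
        )
    return integrated


def phones_for_department(view, dept):
    """A query that neither source can answer alone."""
    return [(r["name"], r["phone"]) for r in view if r["dept"] == dept]


integrated_view = integrate(hr_db, web_directory)
print(phones_for_department(integrated_view, "ENG"))
```

The sketch resolves only one kind of heterogeneity, the differing name formats, with a hand-written rule. Real sources usually also differ in schema, units, and semantics, which is where the techniques covered in the course come in.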

Guidelines for grading the class project

  1. Is the target of data integration clearly defined? The target can be defined by the queries that the data integration process enables.
  2. What are the data sources? What are their schemas, content, and semantics? Here, semantics refers to the constraints and domain knowledge that apply to the data and are useful for integration.
  3. Describe in detail the integration techniques/mechanisms used.
  4. Describe in detail how the integration is achieved.
  5. Present the results of integration and how they match the targets defined.
  6. What are the lessons learnt?

Class Piazza and GitHub:

Piazza sign-up link. We will use Piazza for all further announcements and for sharing lecture slides.

The class GitHub repository contains all the scripts, data, and IPython notebooks that we will use.