Analyzing Big Data using Workflows: from fighting wildfires to helping patients

Ilkay Altintas / San Diego Supercomputer Center

We will look at the scope of Data Science as a field, Big Data and Big Compute as disciplines, their respective trends, and the new era of Data Science. This new era brings unique challenges involving factors such as volume, velocity, and variety, as well as new tools to address them. Covering people, purpose, process, platforms, and programmability, we will look at how to navigate Big Data as a discipline and a process using workflows, and at applications of workflows to big data analysis such as the WIFIRE project.

The WIFIRE Project


Wildfires are critical for ecosystems in many geographical regions. However, our current urbanized existence in these environments is pushing the ecological balance toward a different dynamic, leading to the biggest fires in history. Wildfire wind speeds and directions change in an instant, and first responders can only be effective if they take action as quickly as the conditions change. What is lacking in disaster management today is a system that integrates real-time sensor networks, satellite imagery, near-real-time data management tools, wildfire simulation tools, and connectivity to emergency command centers before, during, and after a wildfire. As a first example of such an integrated system, the WIFIRE project is building an end-to-end cyberinfrastructure for real-time, data-driven simulation, prediction, and visualization of wildfire behavior. This paper summarizes the approach and early results of the WIFIRE project to integrate networked observations (e.g., heterogeneous satellite data and real-time remote sensor data) with computational techniques in signal processing, visualization, modeling, and data assimilation, providing a scalable, technological, and educational solution that monitors weather patterns in order to predict a wildfire’s Rate of Spread.

About Ilkay Altintas de Callafon

In addition to being a co-director of the Data Science & Engineering (DSE) program, Ilkay Altintas is the Chief Data Science Officer at the San Diego Supercomputer Center (SDSC), UC San Diego, where she is also the Founder and Director for the Workflows for Data Science Center of Excellence. Since joining SDSC in 2001, she has worked on different aspects of scientific workflows as a principal investigator and in other leadership roles across a wide range of cross-disciplinary NSF, DOE, NIH and Moore Foundation projects. She is a co-initiator of and an active contributor to the popular open-source Kepler Scientific Workflow System, and the co-author of publications related to computational data science and e-Sciences at the intersection of scientific workflows, provenance, distributed computing, bioinformatics, observatory systems, conceptual data querying, and software modeling. Ilkay is the recipient of the first SDSC Pi Person of the Year in 2014, and the IEEE TCSC Award for Excellence in Scalable Computing for Early Career Researchers in 2015. Ilkay Altintas received her Ph.D. degree from the University of Amsterdam in the Netherlands with an emphasis on provenance of workflow-driven collaborative science.

Slides in pdf

John Hildebrand / Scripps Oceanography.

Recent advances in digital data storage capacity and low power electronics have made it possible to collect long-term continuous broadband passive acoustic data and thereby capture the full range of marine mammal sound production in an ocean setting.

Advances are also required in data curation, search, analysis, and visualization. We have developed methods to analyze and manage passive acoustic monitoring data that we are now acquiring at a rate of approximately 25 Tb/month.

Initial stages of data processing include converting the data from the internal instrument format to a standard audio file format with metadata extensions, and preparing both working and archival copies of the data. Standardized spectra are calculated for all data using a minimum of three frequency bands: (1) high frequency up to 160 kHz, (2) mid frequency up to 5 kHz, and (3) low frequency up to 1 kHz.

A set of 16 CPUs is operated in parallel for timely initial data processing. A number of automatic detectors are run on the data, including spectrogram correlation for blue whale call detection, energy detection for fin whale calls and anthropogenic sounds, a power-law detector for humpback whale units, Teager-energy-based echolocation click detection, and an expert system for beaked whale echolocation clicks.
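To illustrate the idea behind a Teager-energy click detector, here is a minimal sketch in Python. It uses the standard discrete Teager-Kaiser energy operator with a fixed threshold. The function names, the threshold value, and the synthetic signal are assumptions made for illustration; this is not the detector used in the pipeline described above.

```python
def teager_energy(samples):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [
        samples[n] ** 2 - samples[n - 1] * samples[n + 1]
        for n in range(1, len(samples) - 1)
    ]


def detect_clicks(samples, threshold):
    """Return sample indices whose Teager energy exceeds a fixed threshold.

    The fixed threshold is a simplifying assumption for this sketch.
    """
    energy = teager_energy(samples)
    # energy[i] corresponds to samples[i + 1], so shift indices by one.
    return [i + 1 for i, e in enumerate(energy) if e > threshold]


# Synthetic example (hypothetical data): quiet background, one sharp transient.
signal = [0.0, 0.1, -0.1, 0.1, -0.1, 2.0, -0.1, 0.1, -0.1, 0.0]
clicks = detect_clicks(signal, threshold=1.0)
```

The operator responds strongly to short, high-amplitude transients, which is why it suits impulsive signals such as echolocation clicks. A practical detector would likely band-pass filter the audio first and adapt the threshold to the background noise.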

Software (Triton) has been developed for efficient manual scanning and signal discovery. The key feature of Triton is the capability to display spectrograms on virtually any time scale and to provide an index between long-term spectral averages (minutes to days) and short-term spectrograms (seconds to milliseconds). Analysis effort is also standardized using a detection logger feature, allowing multiple analysts to contribute to the same dataset with uniform coverage. Detections are aggregated into a database (Tethys) that allows multiple datasets to be combined and associated with environmental or other data.
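The relationship between long-term spectral averages and short-term spectrograms can be sketched as follows. This is a simplified illustration, not Triton's implementation: the function name and the toy data are hypothetical, and spectra are assumed to be in linear power units so that a plain arithmetic mean is meaningful.

```python
def long_term_spectral_average(spectra, frames_per_average):
    """Average consecutive short-term spectra into long-term columns.

    Returns (ltsa, index), where index[k] is the (start, stop) range of
    short-term frames that were averaged into ltsa[k].
    """
    ltsa, index = [], []
    for start in range(0, len(spectra), frames_per_average):
        block = spectra[start:start + frames_per_average]
        n_bins = len(block[0])
        # Mean over the frames in this block, one value per frequency bin.
        mean = [sum(frame[b] for frame in block) / len(block) for b in range(n_bins)]
        ltsa.append(mean)
        index.append((start, start + len(block)))
    return ltsa, index


# Four hypothetical short-term spectra with two frequency bins each.
short_term = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
ltsa, index = long_term_spectral_average(short_term, frames_per_average=2)
```

Keeping the frame range alongside each averaged column is what lets an analyst jump from a feature seen in the long-term view to the short-term frames that produced it.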

Javier R. Movellan / UCSD and Emotient

The human brain could be described as a computer designed to operate with hands and faces. The hands and the face take about 80% of the sensory motor areas of the human brain. The hands specialize in interaction with the physical world and the face in interaction with the social world. Computers can do complex things with the information we provide them with our hands via keyboard and mouse, but until recently they have been blind to the wealth of information we provide with our faces. For the last 20 years the Machine Perception Laboratory at UCSD has been pursuing the development of technology for automatic recognition of facial expressions. In this talk I will present the progress we have made, from the early proof-of-concept prototypes, to the development of the first commercial smile detector embedded in digital cameras, to the implementation of large-scale real-time expression recognition systems.

Massimo Mascaro and Joe Cessna / Intuit.

Intuit is transforming itself from a product-oriented company into a data company, leveraging the large amounts of data about its customers to produce improved and personalized experiences and to advance into new business areas. In the first part of the talk we will cover a broad range of topics related to the Tax Business where data science is being applied in an impactful way. In the second part we will drill down into the methods we’re using to rank tax topics and questions, and into how we are using novelty detection to monitor both the system’s performance and the health of our business.

Massimo Mascaro, PhD

Massimo Mascaro is a Sr. Data Scientist in the Intuit Consumer Tax Group, where he leads the Data Science & Data Architecture team, overseeing data science projects spanning both online and offline analytics.

Prior to Intuit, Massimo worked for The Intellisis Corporation, leading the R&D team, where he developed and patented algorithms for robust speech segmentation. Prior to that, Massimo was at Microsoft, in the Bing Core Ranking team, where he led the data science team responsible for personalized web ranking. Before Bing, Massimo was a Technical Program Manager and Architect in the Technical Computing division of Microsoft, where he worked on .NET Framework Parallel and Distributed programming extensions. Ahead of his Microsoft tenure, Massimo founded and led a small startup in Italy that specialized in OCR for large financial customers.

In his early career Massimo was a PostDoc and Lecturer at the University of Chicago, doing research on Recurrent Neural Networks, Computer Vision, and biological models of the brain’s visual cortex. Massimo has a PhD in Neuroscience and a Master’s in Theoretical Physics, both from the University of Rome, Italy.

Joe Cessna, PhD

Joe Cessna is a Data Scientist, working for Intuit’s Consumer Tax Group (TurboTax) here in San Diego. His current work is focused around the processing and understanding of the vast amounts of analytics data continually produced by our core products. This includes automatic segmentation and unsupervised anomaly detection across numerous, disparate metrics and business KPIs.

Prior to Intuit, Joe worked as the Program Director and Technical Lead for the Intelligence, Surveillance, and Reconnaissance (ISR) Business Unit at Numerica Corporation in Colorado. During his time at Numerica, Joe led programs with the Air Force, Navy, National Security Agency (NSA), and National Reconnaissance Office (NRO) working on electronic intelligence (ELINT) interception, multi-sensor data fusion, classification fusion, non-cooperative target recognition, and target anomaly detection.

Joe received his M.S. (in Engineering Physics) and Ph.D. (in Computational Science, Applied Math, and Engineering) from UCSD in 2008 and 2010, respectively. His thesis developed novel algorithms for data assimilation and estimation of high-dimensional chaotic systems, as well as efficient computational techniques for implementing the algorithms on switchless, distributed spherical grids. Prior to moving to San Diego, Joe earned a B.S. in Engineering Mechanics/Astronautics and a B.S. in Mathematics from the University of Wisconsin, Madison.

What really matters in Data Science

The learning algorithms in widespread use in companies nowadays include linear methods for classification and regression, nonlinear methods for the same tasks, clustering techniques, topic models, and recommendation methods. I’ll outline what each of these methods is, and discuss how successful, or not, it tends to be in practice. Then I will explain unsolved issues that arise repeatedly across applications.

Charles Elkan, PhD

Charles Elkan is the first Amazon Fellow, on leave from being a professor of computer science at the University of California, San Diego. In the past, he has been a visiting associate professor at Harvard and a researcher at MIT. His published research has been mainly in machine learning, data science, and computational biology. The MEME algorithm that he developed with Ph.D. students has been used in over 3000 published research projects in biology and computer science. He is fortunate to have had inspiring undergraduate and graduate students who now hold leadership positions, such as vice president at Google.

Slides as pdf

Mitchell International’s Journey to Business Intelligence & Analytics

Irene Clepper is a data enthusiast with over 20 years of experience in delivering business intelligence solutions for medical informatics, electronic medical records, and the Property and Casualty industry. Since 2001 Irene has served in various engineering leadership roles at Mitchell International. Currently Senior Director of Enterprise Business Intelligence and Analytics, she leads a corporate initiative to build an enterprise analytics platform that will leverage the depth and breadth of Mitchell’s data assets. Passionate about creating an inspired workplace, Irene co-founded Mitchell’s first diversity group: Women (m)Power Network. Its mission is to propel talented women and men to leadership positions, developing a strong pipeline of talent.

Before coming to Mitchell, Irene worked at Oracle Corporation and Science Applications International Corporation (SAIC) in a variety of software engineering positions. Irene holds a Bachelor of Arts degree in Economics and earned a Master of Science degree in Computer Science from the University of California, Davis. She is a Certified Oracle Professional. Outside of work, Irene has served on the board of the San Diego Chinese Culture Association, a non-profit organization promoting Chinese culture and language learning. She is a member of the Society of Women Engineers (SWE) and Athena San Diego.

Slides as pdf