Abstract: Dr. Chaudhuri will present an overview of some of the open challenges and opportunities for structured data management that are especially relevant for today’s world of Big Data and the Cloud. In the second half of the talk, he will discuss in depth one of the opportunities - approximate query processing - and reflect on why this technology is not mainstream in today’s data platforms.
Bio: Surajit Chaudhuri is a Distinguished Scientist at Microsoft Research and leads the Data Management, Exploration and Mining group. He also acts as the liaison between Microsoft Research and the Leadership Team of Microsoft’s Cloud and Enterprise Division. Surajit’s current areas of interest are data analytics for Big Data platforms, self-manageability, and cloud database services. Working with his colleagues in Microsoft Research, he helped incorporate the Index Tuning Wizard (and subsequently Database Engine Tuning Advisor) and Data Cleaning technology in Microsoft SQL Server. Surajit is an ACM Fellow, a recipient of the ACM SIGMOD Edgar F. Codd Innovations Award, ACM SIGMOD Contributions Award, a VLDB 10-year Best Paper Award, and an IEEE Data Engineering Influential Paper Award. Surajit received his Ph.D. from Stanford University.
Modern machine learning involves deep neural network architectures which yield state-of-the-art performance in multiple domains such as computer vision, natural language processing and speech recognition. As the data and models scale, it becomes necessary to have multiple processing units for both training and inference. Apache MXNet is an open-source framework developed for distributed deep learning. I will describe the underlying lightweight hierarchical parameter server architecture that results in high efficiency in distributed learning.
Pushing the current boundaries of deep learning requires using multiple dimensions and modalities. These can be encoded into tensors, which are natural extensions of matrices. We present new deep learning architectures that preserve the multi-dimensional structures end-to-end. We show that tensor contractions and regression layers are an effective replacement for fully connected layers in deep learning architectures. They result in significant space savings with negligible performance degradation. These functionalities are available in the TensorLy package with an MXNet backend interface for large-scale efficient learning.
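To make the space-savings claim concrete, here is a minimal sketch of a tensor contraction layer written in plain NumPy. It is not the TensorLy or MXNet API: the function names `mode_dot` and `tensor_contraction_layer`, the activation shape, and the factor sizes are all assumptions chosen for illustration.

```python
import numpy as np

def mode_dot(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` along axis `mode` (the n-mode product).

    `matrix` has shape (new_size, old_size), where old_size == tensor.shape[mode].
    """
    moved = np.moveaxis(tensor, mode, -1)   # bring the contracted axis to the end
    result = moved @ matrix.T               # (..., old_size) @ (old_size, new_size)
    return np.moveaxis(result, -1, mode)    # put the new axis back in place

def tensor_contraction_layer(x, factors):
    """Contract every non-batch mode of `x` with its own small factor matrix."""
    out = x
    for mode, factor in enumerate(factors, start=1):  # axis 0 is the batch axis
        out = mode_dot(out, factor, mode)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32, 7, 7))      # hypothetical batch of activations
factors = [
    rng.standard_normal((16, 32)),          # channels: 32 -> 16
    rng.standard_normal((4, 7)),            # height:    7 -> 4
    rng.standard_normal((4, 7)),            # width:     7 -> 4
]
y = tensor_contraction_layer(x, factors)

# Parameters needed by the contraction versus a dense layer that maps the
# flattened input (32*7*7) to the flattened output (16*4*4).
tcl_params = sum(f.size for f in factors)
dense_params = (32 * 7 * 7) * (16 * 4 * 4)
print(y.shape, tcl_params, dense_params)
```

In this toy configuration the contraction keeps the multi-dimensional structure of the activations and uses a few hundred parameters where the equivalent fully connected layer needs several hundred thousand. Whether accuracy is preserved depends on the model and data, which the sketch does not address.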
Dr. Varma will discuss extremely large and extremely small scale machine learning in this talk. He will start by introducing extreme classification – a new area of research focussing on multi-class & multi-label problems involving millions of labels. Extreme classification has opened up a new paradigm for thinking about key applications in our industry. Dr. Varma will discuss algorithms for some of these applications and present results on tagging on Wikipedia, product recommendation on Amazon and search and advertising on the Bing search engine. More details can be found on The Extreme Classification Repository.
In the second part of the talk, he will propose an alternative paradigm for the Internet of Things (IoT) where machine learning algorithms run locally on extremely resource-constrained edge and endpoint devices without necessarily needing cloud connectivity. This enables many scenarios beyond the pale of the traditional paradigm including low-latency brain implants, precision agriculture on disconnected farms, privacy-preserving smart spectacles, etc. Towards this end, Dr. Varma will discuss developing novel machine learning algorithms that can run on cheap and extremely energy-efficient microcontrollers smaller than a grain of rice, having just 2 KB of RAM and no hardware support for floating point operations. Source code for these algorithms is publicly available as part of Microsoft’s Edge Machine Learning library from https://github.com/Microsoft/EdgeML
Bio: Manik Varma is a researcher at Microsoft Research India and an adjunct professor of computer science at the Indian Institute of Technology (IIT) Delhi. His research interests lie in the areas of machine learning, computational advertising and computer vision. Classifiers that he has developed have been deployed on millions of devices around the world and have protected them from viruses and malware. His algorithms are also generating millions of dollars on the Bing search engine (up to sign ambiguity). In 2013, he and John Langford coined the term extreme classification and found that they had inadvertently started a new area in machine learning. Today, by happenstance, extreme classification is thriving in both academia and industry with Manik’s classifiers being used in various Microsoft products as well as in the wider tech sector. Manik recently proclaimed “2 KB (RAM) ought to be enough for everybody” prompting the media in the US, India, China, France, Belgium and Singapore to cover his research and compare him to Bill Gates (unfair, Manik’s more handsome!). Manik has been awarded the Microsoft Gold Star award, the Microsoft Achievement award, won the PASCAL VOC Object Detection Challenge and stood first in chicken chess tournaments and Pepsi drinking competitions. He has served as an area chair/senior PC member for machine learning, artificial intelligence and computer vision conferences such as AAAI, CVPR, ICCV, ICML, IJCAI and NIPS and is serving as an associate editor of the IEEE PAMI journal. Manik is also a failed physicist (BSc St. Stephen's College, David Raja Ram Prize), theoretician (BA Oxford, Rhodes Scholar), engineer (DPhil Oxford, University Scholar) and mathematician (MSRI Berkeley, Post-doctoral Fellow).
Penalizing a model by the L1 norm, also known as the LASSO penalty, has proven to be a powerful tool for learning sparse models in several settings including linear regression. However, in linear regression LASSO fails to recover the true model if the predictors are correlated. This is an important open problem and has sparked new interest in investigating alternatives to LASSO which can provably guarantee the discovery of the true model. In this talk we will discuss the Ordered Weighted L1 norm (OWL) and show how it can discover the true model even in the presence of strong correlation.
Joint work with Raman Sankaran and Francis Bach.
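For readers unfamiliar with the penalty itself, the sketch below computes the OWL norm under its standard definition: sort the absolute values of the coefficients in decreasing order and take their weighted sum with non-negative, non-increasing weights. The function name `owl_norm` and the example vectors are illustrative assumptions. The code shows only the norm, not the recovery results discussed in the talk.

```python
def owl_norm(x, weights):
    """Ordered weighted L1 norm: sum_i weights[i] * |x|_(i).

    |x|_(1) >= |x|_(2) >= ... are the absolute values of x sorted in
    decreasing order, and weights must be non-negative and non-increasing.
    """
    if len(x) != len(weights):
        raise ValueError("x and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if any(a < b for a, b in zip(weights, weights[1:])):
        raise ValueError("weights must be non-increasing")
    magnitudes = sorted((abs(v) for v in x), reverse=True)
    return sum(w * m for w, m in zip(weights, magnitudes))

x = [3, -1, 2]
print(owl_norm(x, [1, 1, 1]))  # equal weights reduce to the L1 norm
print(owl_norm(x, [1, 0, 0]))  # a single leading weight gives the max norm
print(owl_norm(x, [3, 2, 1]))  # strictly decreasing weights penalize large entries more
```

The first two calls show that the L1 norm and the max norm are special cases, obtained purely through the choice of weights.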
Abstract: Online Social Media (OSM) like Twitter, Facebook and WhatsApp are increasingly being used to gather real-time information during disaster or emergency events, such as earthquakes, floods, epidemics, and terror attacks. During such events, OSM are frequently used by the affected population for reporting situational information, as well as by the agencies responding to the disaster for coordinating the relief efforts. In this talk, we will discuss various challenges in utilizing OSM during disasters, which include retrieving critical situational information from a large volume of conversational content, classifying and summarizing the information, guarding against misinformation, and so on. These tasks become all the more challenging due to the noisy, informal nature of the user-generated content posted on OSM. We will also discuss some recent research studies which attempt to address these challenges.
Abstract: Abstractive summarization aims to generate a shorter version of the document covering all the salient points in a compact and coherent fashion. On the other hand, query-based summarization highlights those points that are relevant in the context of a given query. The encode-attend-decode paradigm has achieved notable success in machine translation, extractive summarization, dialog systems, etc. But it suffers from the drawback of generating repeated phrases. In this work we propose a model for the query-based summarization task based on the encode-attend-decode paradigm with two key additions: (i) a query attention model (in addition to a document attention model) which learns to focus on different portions of the query at different time steps (instead of using a static representation for the query) and (ii) a new diversity-based attention model which aims to alleviate the problem of repeating phrases in the summary. In order to enable the testing of this model we introduce a new query-based summarization dataset building on Debatepedia. Our experiments show that with these two additions the proposed model clearly outperforms vanilla encode-attend-decode models with a gain of 28% (absolute) in ROUGE-L scores.
Bio: Mitesh has been working as an Assistant Professor in the Department of Computer Science and Engineering for the past year. While at IIT Madras he plans to pursue his interests in the areas of Deep Learning, Multimodal Multilingual Processing, Conversational Systems and Question Answering. He teaches graduate and undergraduate level courses on Deep Learning and Machine Learning. Prior to joining IITM he worked as a Researcher at IBM Research India. During the four and a half years that he spent at IBM he worked on several interesting problems in the areas of Statistical Machine Translation, Cross Language Learning, Multimodal Learning, Argument Mining and Deep Learning. This work led to publications in top conferences in the areas of Computational Linguistics and Machine Learning such as ACL, NAACL, EMNLP, NIPS, AAAI, etc. Prior to IBM, he completed his PhD and M.Tech from IIT Bombay in Jan 2012 and July 2008 respectively. His PhD thesis dealt with the important problem of reusing resources for multilingual computation.
Abstract: "Big Data" comes in various forms, and brings different types of challenges depending on the type of the data. It can be very large graphs, e.g., several hundred million to about a billion nodes and edges, or a large amount of text data, e.g., Wikipedia, the web itself, or a large amount of non-textual media data, e.g., images, video, audio files etc. In this talk the speaker will first give an overview of the old problems and solutions for handling the structured form of the data, popularly known as relational data. Then she will talk about the new generation of big data problems and their relevant solutions (after the evolution of Web 2.0). This will also include how even the old data management techniques come in handy. Notably, these storage and query optimization techniques form the back-end of cutting-edge analytics and machine learning applications, and help these applications scale. In the latter part of the talk, the speaker will give an overview of some specific problems targeting semi-structured data (graphs), and a system – BitMat – that she has developed for storage, indexing, and querying of very large graphs.
Abstract: New computing interfaces that use "natural" modes of interaction, such as multitouch and gestures, are rapidly becoming more popular than traditional keyboard-based interaction. These devices are being used to consume and directly interact with data in a wide range of contexts, from business intelligence to data-driven sciences. Applications for such devices are highly interactive, and impose a fundamentally different set of expectations on the underlying data infrastructure. In this talk, we rethink various aspects of the data stack, from the query specification process to distributed query execution, to address these interactive workloads. We explore the impact of including interactivity as a first-class concept, and show that our methods result in experiences that are not only fluid, but also more intuitive for the end-user.
Bio: Arnab Nandi is an Associate Professor in the Computer Science and Engineering department at The Ohio State University. Arnab’s research is in the area of database systems, focusing on challenges in large-scale data analytics and human-in-the-loop data exploration. Arnab is also a founder of The STEAM Factory, a collaborative interdisciplinary research and public outreach initiative, and faculty director of the OHI/O Informal Learning Program. Arnab is a recipient of the US National Science Foundation’s CAREER Award, a Google Faculty Research Award, and the Ohio State College of Engineering Lumley Research Award. He is also the 2016 recipient of the IEEE TCDE Early Career Award for his contributions towards user-focused data interaction.
Abstract: Crowdsourcing has grown hugely in importance in the last decade. While an individual worker may not always be accurate, consensus over several workers often achieves high labeling accuracy. It is common for researchers and practitioners to crowdsource training data for machine learning classifiers. But, when gathering training data with a limited budget, a novel tradeoff arises. Should one create a smaller but higher quality training set by issuing the same data point several times, or should one label a larger set but at lower quality by reducing the level of redundancy? What factors does this choice depend on? In this talk, Dr. Mausam will demonstrate that the right level of redundancy depends upon inductive bias, worker accuracy, and available budget. Dr. Mausam will then describe Re-Active learning, our extension of active learning for the case of crowdsourcing. Re-Active learning allows a data point to be re-labeled to improve the quality of its label. The talk will end with the speaker's current work on applying these principles to domains with large class imbalance, which necessitate significantly different strategies for gathering training data.
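The budget tradeoff described above can be illustrated with a small calculation. This is a simplified sketch and not the analysis presented in the talk. It assumes binary labels, workers who are independent and equally accurate, and simple majority voting. The names `majority_accuracy` and `plan`, the budget of 3000 labels, and the worker accuracy of 0.7 are all hypothetical.

```python
from math import comb

def majority_accuracy(p, k):
    """Probability that a majority vote over k workers yields the correct label.

    Assumes each worker is independently correct with probability p and that
    labels are binary; k must be odd so that ties cannot occur.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd to avoid ties")
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i)
               for i in range(k // 2 + 1, k + 1))

def plan(budget, p, k):
    """Return (number of distinct examples, per-example label accuracy)."""
    return budget // k, majority_accuracy(p, k)

for k in (1, 3, 5):
    examples, accuracy = plan(budget=3000, p=0.7, k=k)
    print(f"redundancy {k}: {examples} examples, label accuracy {accuracy:.3f}")
```

Under these assumptions, moving from one to three votes per item raises label accuracy from 0.7 to 0.784 but cuts the training set to a third of its size. Which side of that trade wins depends on the classifier, as the abstract notes.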
Grading of student SQL queries is usually done by executing the queries on sample datasets, which may be unable to catch many errors, and/or manually checking the queries, which can be tedious and error prone. The XData system developed at IIT Bombay helps detect errors in SQL queries by generating multiple datasets tailored to show the difference between a correct query and an erroneous query that is a mutation of a correct query. Datasets generated by the system can be used to test correctness of SQL queries written by programmers, as well as to detect errors in student SQL queries by comparing the results with those of a correct query. The job of data generation is achieved by creating appropriately defined constraints, and feeding them to an SMT solver to get the required datasets.
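The idea of a dataset that distinguishes a correct query from a mutant can be shown with a small, hand-built example. The sketch below does not use XData or an SMT solver: the `course` table, both queries, and both datasets are hypothetical and written by hand. It uses Python's built-in `sqlite3` module only to run the queries.

```python
import sqlite3

# A correct query and a mutant that differs only in the comparison operator.
CORRECT = "SELECT title FROM course WHERE credits > 3 ORDER BY title"
MUTANT = "SELECT title FROM course WHERE credits >= 3 ORDER BY title"

def run(query, rows):
    """Load `rows` into a fresh in-memory table and return the query result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE course (title TEXT, credits INTEGER)")
        conn.executemany("INSERT INTO course VALUES (?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()

def kills_mutant(rows):
    """True if the dataset makes the correct query and the mutant disagree."""
    return run(CORRECT, rows) != run(MUTANT, rows)

# No row sits on the boundary value 3, so both queries return the same result.
weak_dataset = [("Databases", 4), ("Seminar", 1)]
# A row with credits = 3 exposes the difference between > and >=.
boundary_dataset = [("Databases", 4), ("Algorithms", 3)]

print(kills_mutant(weak_dataset))
print(kills_mutant(boundary_dataset))
```

A sample dataset chosen without the mutation in mind can easily resemble `weak_dataset`, which is why targeted data generation catches errors that ad hoc test data misses.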
For grading SQL queries, however, just finding errors is not sufficient, since students need to be given partial marks based on how close to correct their query is. We have therefore added support in the XData system for partial marking, using a series of canonicalization steps to remove irrelevant syntactic differences between queries before comparing them.
The XData system, with comprehensive Web-based interfaces for students and instructors, and integration with LMSs such as Moodle, is available for free download. In addition to grading, the system supports a learning mode for students, where it provides immediate feedback. To learn more about the XData system, and to download a copy, visit http://www.cse.iitb.ac.in/infolab/xdata.
Bio: S. Sudarshan is a Professor in the Dept. of Computer Science and Engg. at IIT Bombay. His research interests are broadly in the area of database systems, with a focus on query processing and query optimization. Current projects include the XData project, the DBridge project for holistic optimization of database applications, and the PyroJ extensible query optimizer designed for optimizing queries on parallel database systems. He is co-author, with Silberschatz and Korth, of a widely used textbook, Database System Concepts, now in its 6th edition.
Abstract: Current deep learning architectures are growing larger to learn from complex datasets. The quest for a unified machine learning algorithm which can simultaneously generalize from diverse sources of information (or transfer learning) has made it imperative to train astronomically sized neural networks with enormous computational and
In this talk, we will show a novel set of algorithms to deal with the computational, energy and memory challenges associated with massive networks. We will discuss our recent success in developing a novel hashing-based scalable and sustainable technique to drastically reduce the computations associated with the backpropagation algorithm. Utilizing the magic that made search algorithms over the web faster, we will demonstrate how our algorithms only need 5% of the total computations, and they can still manage to keep on average within 1% of the accuracy of the classical backpropagation. Approximate proximity search, based on hash tables and other data structures, already sits on more than three decades of research from the systems and database community. Thus, superior utilization of parallelism and systems resources is automatically assured with our proposal.
A unique property of the proposed hashing-based backpropagation is that the updates are always sparse. Due to the sparse gradient updates, our algorithm is ideally suited for asynchronous and parallel training, leading to near-linear speedup with an increasing number of cores. We demonstrate the scalability and sustainability (energy efficiency) of the newly proposed algorithm over several real datasets.
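As a rough illustration of how hashing can restrict computation to a few neurons, the sketch below uses signed random projections (SimHash) with a single hash table. It is an assumption-laden toy and not the speaker's implementation: the function names, layer sizes, and the choice of one table with 8 hash bits are hypothetical, and it omits re-hashing after weight updates.

```python
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def make_hyperplanes(num_bits, dim, rng):
    """Draw random Gaussian hyperplanes; each contributes one bit of the hash."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def simhash(vector, hyperplanes):
    """Signed random projection: record which side of each hyperplane we fall on."""
    return tuple(dot(plane, vector) >= 0 for plane in hyperplanes)

def build_table(weights, hyperplanes):
    """Index every neuron's weight vector by its hash code."""
    table = {}
    for index, weight in enumerate(weights):
        table.setdefault(simhash(weight, hyperplanes), []).append(index)
    return table

def sparse_forward(x, weights, table, hyperplanes):
    """Compute ReLU activations only for neurons in the input's hash bucket."""
    active = table.get(simhash(x, hyperplanes), [])
    return {i: max(0.0, dot(weights[i], x)) for i in active}

rng = random.Random(0)
dim, num_neurons, num_bits = 16, 1000, 8
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_neurons)]
weights[42] = list(x)  # a neuron perfectly aligned with the input

hyperplanes = make_hyperplanes(num_bits, dim, rng)
table = build_table(weights, hyperplanes)
activations = sparse_forward(x, weights, table, hyperplanes)
print(len(activations), "of", num_neurons, "neurons evaluated")
```

Vectors pointing in similar directions tend to share a hash code, so the bucket lookup favors neurons with large activations for the given input. Only those neurons would take part in the forward and backward pass, which is why the resulting updates are sparse.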
In the end, if time permits, we will show a simple hashing algorithm for reducing the memory requirements associated with extreme-scale classification. Using our algorithms, we can train 100,000 classes with 400,000 features on a single GPU while only needing 125x less memory than usual logistic regression (1.2 GB model size instead of
Bio: Anshumali Shrivastava is an assistant professor in the computer science department at Rice University. His broad research interests include large-scale machine learning, randomized algorithms for big data systems and graph mining. He is a recipient of the 2017 NSF CAREER Award. His research on hashing inner products won the Best Paper Award at NIPS 2014, while his work on representing graphs won the Best Paper Award at IEEE/ACM ASONAM 2014. His work on how hashing can slash 95% or more computations in deep learning got picked up by several media outlets and received significant social media attention.