
CS 4T1: Algorithms for Big Data

Credit: 3-0-0-0 (9)


Course Description


(A) Objectives


Today, applications such as machine learning and numerical linear algebra work with massive data. In this course we will cover algorithmic techniques, models, and lower bounds for handling such data. A common theme is the use of randomized methods, such as sketching and sampling, to provide dimensionality reduction. In the context of optimization problems, dimensionality reduction leads to faster algorithms, and we will see examples of this in least squares regression and low rank approximation of matrices and tensors, as well as in robust variants of these problems. In the context of distributed algorithms it leads to communication-efficient protocols, while in the context of data stream algorithms it leads to memory-efficient algorithms. We will study some of the above problems, such as low rank approximation, in these models, and will also consider a variety of classical streaming problems such as counting distinct elements, finding frequent items, and estimating norms. Finally, we will study lower bound methods showing that many of the algorithms we cover are optimal or near-optimal. Such methods are often based on communication complexity and information-theoretic arguments.
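As a taste of the sketch-and-solve paradigm mentioned above, the following is a minimal illustration (not official course material) of least squares regression via a random Gaussian sketch; all dimensions and constants are illustrative assumptions, and the course will treat faster sketches (e.g., CountSketch) and rigorous guarantees.

```python
import numpy as np

# Sketch-and-solve for least squares: compress a tall n x d problem
# to m rows with a random Gaussian sketch, then solve the small problem.
rng = np.random.default_rng(0)
n, d, m = 2000, 10, 200          # n >> d; sketch dimension m << n

A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)   # noisy observations

# Exact least squares solution, for comparison.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

# Gaussian sketch S (m x n): solving min ||S A x - S b|| on the sketched
# problem gives a near-optimal residual with high probability.
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

res_exact = np.linalg.norm(A @ x_star - b)
res_sketch = np.linalg.norm(A @ x_sketch - b)
print(res_sketch / res_exact)    # close to 1, i.e., near-optimal residual
```

The point of the sketch is that the expensive solve now involves only an m x d matrix rather than an n x d one, at the price of a small approximation error.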


(B) Contents

SNo | Broad Title | Topics | No. of Lectures
1 | Sketching and Dimensionality Reduction | Estimating the count of distinct elements (F_0) in a data stream, estimating the l_2 norm, the Johnson-Lindenstrauss (J-L) Lemma with multiple proofs, Fast J-L (Subsampled Randomized Hadamard Transform) and Sparse J-L (CountSketch) constructions, l_p norm estimation for p ∈ (0, 2) and p > 2, l_2 heavy hitters via CountSketch and a deterministic algorithm. | 10
2 | Lower Bounds | Lower bounds for F_0, the Johnson-Lindenstrauss lower bound, information theory, distances between distributions, Indexing, streaming lower bounds for norms. | 6
3 | Approximate Linear Algebra via Dimensionality Reduction | Least squares regression, subspace embeddings, the epsilon-net argument, the matrix Chernoff bound, affine embeddings, approximate matrix product, low rank approximation, sketching-based preconditioning, leverage score sampling, distributed low rank approximation, weighted low rank approximation, M-estimators. | 16-18
4 | Compressive Sensing | Compressive sensing, RIP, l_1 minimization, sparse recovery using sparse matrices, RIP-1. | 4
5 | Sparse Fourier | The sparse Fourier transform. | 3
6 | Computational Geometry and Graphs | Streaming algorithms for geometric problems, streaming algorithms for certain graph problems. | 4
Total | | | 43-45
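To make the first streaming topic concrete, here is a small sketch of one standard distinct-elements (F_0) estimator, the k-minimum-values sketch; the course may present different estimators, and the function names and parameters below are illustrative assumptions.

```python
import bisect
import hashlib

def _hash01(item):
    """Deterministic hash of an item into [0, 1)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def kmv_estimate(stream, k=64):
    """Estimate the number of distinct items from the k smallest hash
    values seen: with v_k the k-th smallest hash, return (k - 1) / v_k."""
    smallest = []                      # sorted list of at most k distinct hashes
    for item in stream:
        v = _hash01(item)
        i = bisect.bisect_left(smallest, v)
        if i < len(smallest) and smallest[i] == v:
            continue                   # duplicate item, already accounted for
        if len(smallest) < k:
            bisect.insort(smallest, v)
        elif v < smallest[-1]:
            bisect.insort(smallest, v)
            smallest.pop()             # keep only the k smallest
    if len(smallest) < k:
        return float(len(smallest))    # fewer than k distinct items: exact
    return (k - 1) / smallest[-1]

# A stream of 5000 elements containing only 1000 distinct values.
stream = [i % 1000 for i in range(5000)]
print(kmv_estimate(stream))            # roughly 1000
```

The memory used is O(k) hash values regardless of stream length, with relative error on the order of 1/sqrt(k), which is the kind of space/accuracy trade-off the lower-bound topics later make precise.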


(C) Pre-requisites, if any:


Solid understanding of Probability (CS203M, MSO201 or equivalent) and Linear Algebra (MTH201A or equivalent).
 

(D) Short summary for including in the Courses of Study Booklet


This is a mathematically rigorous course on algorithms for processing massive data, based on random sampling and sketching, dimensionality reduction, compressive sensing, and the sparse Fourier transform. Topics: estimating F_0 (equivalently l_0, the number of distinct items) in a data stream; estimating l_2 and the Johnson-Lindenstrauss (J-L) Lemma; estimating l_p using p-stable distributions for p ∈ (0, 2); estimating l_p for p > 2 and closely related problems; Fast J-L and Sparse J-L constructions; applications to numerical linear algebra: linear regression and subspace embeddings, affine embeddings, approximate matrix product, low rank approximation, leverage score sampling; compressive sensing: RIP, l_1 minimization, sparse recovery using sparse matrices; the sparse Fourier transform; streaming algorithms for geometric problems and graph problems; lower bounds for F_0, the Johnson-Lindenstrauss lower bound, information theory, Indexing, and lower bounds for estimating norms in the streaming model.

 

Recommended books

Sketching as a Tool for Numerical Linear Algebra by David Woodruff, arXiv:1411.4357.

Other useful resources: materials from the following related courses overlap significantly and may be useful in various parts of the course:

  1. Algorithms for Big Data (2025): David Woodruff (CMU).
  2. Algorithms for Big Data (2015): Jelani Nelson and Piotr Indyk (Harvard and MIT).
  3. Algorithmic Techniques for Massive Data: Alexandr Andoni (Columbia).
  4. Algorithms for Big Data: Chandra Chekuri (UIUC).
  5. Algorithms for Big Data: Grigory Yaroslavtsev (UPenn).
  6. Algorithms for Modern Data Models: Ashish Goel (Stanford).
  7. Data Streams Algorithms: Andrew McGregor (UMass Amherst).
  8. Data Stream Algorithms: Amit Chakrabarti (Dartmouth).
  9. Dealing with Massive Data: Sergei Vassilvitskii (Columbia).
  10. Randomized Algorithms for Matrices and Data: Michael Mahoney (UC Berkeley).
  11. Sublinear Algorithms: Piotr Indyk and Ronitt Rubinfeld (MIT).
  12. Sublinear and Streaming Algorithms: Paul Beame (University of Washington).
  13. The Modern Algorithmic Toolbox: Tim Roughgarden and Gregory Valiant (Stanford).
  14. Sublinear Algorithms for Big Data: Qin Zhang (Indiana University Bloomington).