
CS 4T1: Algorithms for Big Data

Credit: 3-0-0-0 (9)


Course Description


(A) Objectives


Today, applications such as machine learning and numerical linear algebra work with massive data. In this course we will cover algorithmic techniques, models, and lower bounds for handling such data. A common theme is the use of randomized methods, such as sketching and sampling, to provide dimensionality reduction. In the context of optimization problems, dimensionality reduction leads to faster algorithms, and we will see examples of this in least squares regression and low rank approximation of matrices and tensors, as well as in robust variants of these problems. In the context of distributed algorithms it leads to communication-efficient protocols, while in the context of data stream algorithms it leads to memory-efficient algorithms. We will study some of the above problems, such as low rank approximation, in these models, and will also consider a variety of classical streaming problems such as counting distinct elements, finding frequent items, and estimating norms. Finally, we will study lower bound methods showing that many of the algorithms we cover are optimal or near-optimal. Such methods are often based on communication complexity and information-theoretic arguments.
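As a taste of the sketch-and-solve paradigm mentioned above, the following is a minimal illustration (not official course material) of least squares regression via a random Gaussian sketch; all dimensions and constants are illustrative assumptions, and the course will treat faster sketches (e.g., CountSketch) and rigorous guarantees.

```python
import numpy as np

# Sketch-and-solve for least squares: compress a tall n x d problem
# to m rows with a random Gaussian sketch, then solve the small problem.
rng = np.random.default_rng(0)
n, d, m = 2000, 10, 200          # n >> d; sketch dimension m << n

A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.1 * rng.standard_normal(n)   # noisy observations

# Exact least squares solution, for comparison.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

# Gaussian sketch S (m x n): solving min ||S A x - S b|| on the sketched
# problem gives a near-optimal residual with high probability.
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

res_exact = np.linalg.norm(A @ x_star - b)
res_sketch = np.linalg.norm(A @ x_sketch - b)
print(res_sketch / res_exact)    # close to 1, i.e., near-optimal residual
```

The point of the sketch is that the expensive solve now involves only an m x d matrix rather than an n x d one, at the price of a small approximation error.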


(B) Contents

SNo | Broad Title | Topics | No. of Lectures
1 | Sketching and Dimensionality Reduction | Estimating the count of distinct elements (F_0) in a data stream, estimating the l_2 norm, the Johnson-Lindenstrauss (J-L) Lemma with multiple proofs, Fast J-L (Subsampled Randomized Hadamard Transform) and Sparse J-L (CountSketch) constructions, l_p norm estimation for p ∈ (0, 2) and p > 2, l_2 heavy hitters via CountSketch and a deterministic algorithm. | 10
2 | Lower Bounds | Lower bounds for F_0, the Johnson-Lindenstrauss lower bound, information theory, distances between distributions, Indexing, streaming lower bounds for norms. | 6
3 | Approximate Linear Algebra via Dimensionality Reduction | Least squares regression, subspace embeddings, the epsilon-net argument, the matrix Chernoff bound, affine embeddings, approximate matrix product, low rank approximation, sketching-based preconditioning, leverage score sampling, distributed low rank approximation, weighted low rank approximation, M-estimators. | 16-18
4 | Compressive Sensing | Compressive sensing, RIP, l_1 minimization, sparse recovery using sparse matrices, RIP-1. | 4
5 | Sparse Fourier | The sparse Fourier transform. | 3
6 | Computational Geometry and Graphs | Streaming algorithms for geometric problems, streaming algorithms for certain graph problems. | 4
Total | | | 43-45
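To make the first streaming topic concrete, here is a small sketch of one standard distinct-elements (F_0) estimator, the k-minimum-values sketch; the course may present different estimators, and the function names and parameters below are illustrative assumptions.

```python
import bisect
import hashlib

def _hash01(item):
    """Deterministic hash of an item into [0, 1)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def kmv_estimate(stream, k=64):
    """Estimate the number of distinct items from the k smallest hash
    values seen: with v_k the k-th smallest hash, return (k - 1) / v_k."""
    smallest = []                      # sorted list of at most k distinct hashes
    for item in stream:
        v = _hash01(item)
        i = bisect.bisect_left(smallest, v)
        if i < len(smallest) and smallest[i] == v:
            continue                   # duplicate item, already accounted for
        if len(smallest) < k:
            bisect.insort(smallest, v)
        elif v < smallest[-1]:
            bisect.insort(smallest, v)
            smallest.pop()             # keep only the k smallest
    if len(smallest) < k:
        return float(len(smallest))    # fewer than k distinct items: exact
    return (k - 1) / smallest[-1]

# A stream of 5000 elements containing only 1000 distinct values.
stream = [i % 1000 for i in range(5000)]
print(kmv_estimate(stream))            # roughly 1000
```

The memory used is O(k) hash values regardless of stream length, with relative error on the order of 1/sqrt(k), which is the kind of space/accuracy trade-off the lower-bound topics later make precise.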


(C) Pre-requisites, if any:


Solid understanding of Probability (CS203M, MSO201 or equivalent) and Linear Algebra (MTH201A or equivalent).
 

(D) Short summary for including in the Courses of Study Booklet


This is a mathematically rigorous course on algorithms for processing massive data, based on random sampling and sketching, dimensionality reduction, compressive sensing, and the sparse Fourier transform. Topics: estimating F_0 (equivalently l_0, the number of distinct items) in a data stream; estimating l_2 and the Johnson-Lindenstrauss (J-L) Lemma; estimating l_p using p-stable distributions for p ∈ (0, 2); estimating l_p for p > 2 and closely related problems; Fast J-L and Sparse J-L constructions; applications to numerical linear algebra: linear regression and subspace embeddings, affine embeddings, approximate matrix product, low rank approximation, leverage score sampling; compressive sensing: RIP, l_1 minimization, sparse recovery using sparse matrices; the sparse Fourier transform; streaming algorithms for geometric problems and graph problems; lower bounds for F_0, the Johnson-Lindenstrauss lower bound, information theory, Indexing, and lower bounds for estimating norms in the streaming model.

 

Recommended books

Sketching as a Tool for Numerical Linear Algebra by David Woodruff, arXiv:1411.4357.

Other useful resources: materials from the following related courses overlap significantly and may be useful in various parts of the course:

  1. Algorithms for Big Data (2025): David Woodruff (CMU).
  2. Algorithms for Big Data (2015): Jelani Nelson and Piotr Indyk (Harvard and MIT).
  3. Algorithmic Techniques for Massive Data: Alexandr Andoni (Columbia).
  4. Algorithms for Big Data: Chandra Chekuri (UIUC).
  5. Algorithms for Big Data: Grigory Yaroslavtsev (UPenn).
  6. Algorithms for Modern Data Models: Ashish Goel (Stanford).
  7. Data Streams Algorithms: Andrew McGregor (UMass Amherst).
  8. Data Stream Algorithms: Amit Chakrabarti (Dartmouth).
  9. Dealing with Massive Data: Sergei Vassilvitskii (Columbia).
  10. Randomized Algorithms for Matrices and Data: Michael Mahoney (UC Berkeley).
  11. Sublinear Algorithms: Piotr Indyk and Ronitt Rubinfeld (MIT).
  12. Sublinear and Streaming Algorithms: Paul Beame (University of Washington).
  13. The Modern Algorithmic Toolbox: Tim Roughgarden and Gregory Valiant (Stanford).
  14. Sublinear Algorithms for Big Data: Qin Zhang (Indiana University Bloomington).