PAssignment 2: Heavy-hitters and adversarial streams

Getting started

In this assignment you will implement the CountSketch and CountMin algorithm we covered in class. The assignment will be autograded on gradescope, and you will be able to see your grades after you submit. As in Programming Assignment 1, You are allowed up to 44 submissions.

To make things more fun, we are also running a public scoreboard! The scoreboard will rank by the amount of memory you use in your solutions (you can be anonymous in this scoreboard). The top teams on the scoreboard will receive bonus credit at the end of the quarter. At this time we are still deciding on the criteria for “top” teams and the amount of bonus credit.

The starter template is here. To set up the coding environment, you’ll need to install python 2.7 along with packages mmh3, numpy, and sortedcontainers. With python installed, you can install these packages with the command (executed in bash for Linux and cmd for Windows):

python -m pip install mmh3 numpy sortedcontainers

After you install these packages, verify that you can run the stream.py python file. You should see the output (some of the numbers may be different):

Memory units (exact): 10000
[0, 12, 11, 12, 10, 9, 9, 7, 5, 11]

Implementing the assignment

To complete the assignment, you will have to fill out the implementations of CMSketcher, CSSketcher, generate_cs_adversarial, and generate_cm_adversarial in stream.py. No other files need to be modified. Also, please do not change any of the file names.

In Part I, you’ll implement the CountSketch and CountMin algorithms outlined in class. These algorithms takes in a stream of numbers and their frequencies (x_i, f_i) and reports back for each unique number an estimate of their true frequency in the stream. To implement these algorithms, you’ll have to implement the build and query methods of the CSSketcher and CMSketcher classes. The build method builds the sketch from a given stream as well as an error tolerance eps. The query method gets an x_i of the stream as input and returns a frequency estimate. The frequency estimate is judged as correct if it is within additive error eps * F1.

In Part II, you’ll investigate the trade-off between CountSketch and CountMin. You are asked to implement two functions generate_cs_adversarial and generate_cm_adversarial. These two functions must produces two streams. Each stream must have at least 10000 distinct elements. In the first stream, the frequencies of the elements must be chosen so that CountSketch uses more memory than CountMin. In the second stream, the frequencies must be chosen so that CountMin uses more memory than CountSketch.

To compute the memory your sketches use, we provide a compute_best_memory function. This function takes in a stream, a sketch, and an eps. It then rebuilds the sketch many times with various memory limits to find the smallest amount of memory that can produce the given eps for the given stream and sketch. The function test_adversarial shows you an example of how to use compute_best_memory.

Part II will be graded manually (at the end of the quarter) once you submit to GradeScope. You will receive full marks as long as test_adversarial produces the correct output locally. This part is only graded manually because the compute_best_memory function takes quite a long time to run, and may clog up the online judge.

Using `cs368lib`

Since we are working in the streaming model, your algorithms must be extremely memory efficient. To track the amount of memory your program is using, you must use the functions provided to you in cs368lib.py. cs368lib.py must not be modified. The cs368lib file provides to you the following functions/classes that you may find useful:

class tracked_sortedlist: A dynamic list that retains sorted order. Any operations on this list run in logarithmic time. Any python list operations (such as indexing, iteration, or slicing) will work on this list.
class tracked_list: A normal dynamic list. Appending or popping from the end is constant time. Any python list operations (such as indexing, iteration, or slicing) will work on this list.
tracked_int and tracked_double: Use these functions when you want to create an integer or a double variable.
hash and hash_double: These functions take in an integer elem and an integer seed. hash will map the integer elem to a random 64-bit integer. hash_double will map the integer to a random value in [0, 1]. To get different hash functions, use different seed values.

Any memory you use must be tracked using cs368lib. Failure to do so will result in penalization.

Submissions, Grading, and ⚝Scoreboard⚝

Scoreboard names

When you submit to gradescope, please only upload stream.py. Upon submission, you will get a prompt for filling out your scoreboard name. Feel free to choose any name you wish.

Time limits

The judge will run your estimators on a variety of input distributions under a some-what tight time limit. This means that the time complexity of your solution is important as well. Each test case will get 1-10 seconds of run time. If you timeout on a test case, no credit is given for that test case.

Memory limits

The grader also grades you based on the amount of memory you use. If your submission uses too much memory on a test case, no credit is given. The accepted level of memory is 30/eps^2 (since the memory of the hash functions is not counted). You are encouraged to optimize your solution to use even less memory.

Accepted level of error

If your solution does not return something within an additive eps F1 error of the exact solution, no credit is given.

Lastly, please do not submit excessively. Each submission takes a few minutes to judge and we only have 1 machine. Instead, test things locally when possible.

Final words

Start early! Please do not leave the programming assignments to the last minute.
For testing your solution, it may help to write random input generators and test locally. To help you test, the default implementation of the estimators returns the exact solution.
Always keep in mind the time complexity of your solution. It must run fast enough to process a stream of size 100k in less than 10 seconds.
If you have suggestions for new problems or improvements to current problems, let me know!