PAssignment 1: F0 and F2 estimation
Getting started
In this assignment you will implement some of the F0 and F2 estimators we covered in class. The assignment will be autograded on gradescope, and you will be able to see your grades after you submit. You are also allowed up to 44 submissions (that’s 1 per day from the time of release until June 1).
To make things more fun, we are also running a public scoreboard! The scoreboard will rank by the amount of memory you use in your solutions (you can be anonymous in this scoreboard). The top teams on the scoreboard will receive bonus credit at the end of the quarter. At this time we are still deciding on the criteria for “top” teams and the amount of bonus credit.
The starter template is here. To set up the coding environment, you’ll need to install python 2.7 along with packages mmh3
, numpy
, and sortedcontainers
. With python installed, you can install these packages with the command (executed in bash
for Linux and cmd
for Windows):
python -m pip install mmh3 numpy sortedcontainers
After you install these packages, verify that you can run the stream.py
python file. You should see the output (some of the numbers may be different):
F0 estimate: 1234
F2 estimate: 91088
Total memory units used: 30005
Implementing the assignment
To complete the assignment, you will have to fill out the implementations of f0_estimator
and f2_estimator
in stream.py
. No other files need to be modified. Also, please do not change any of the file names.
The estimators you implement can be any of the algorithms we’ve seen so far in class (or other low-memory ones you are aware of). The functions you’ll implement must take in a stream
of integers (a normal python list) and an error parameter eps
. The output is a floating point value for the estimated quantity. Please do not modify the input stream
variable (such as sorting it in place). This is a violation of the streaming model.
Using cs368lib
Since we are working in the streaming model, your algorithms must be extremely memory efficient. To track the amount of memory your program is using, you must use the functions provided to you in cs368lib.py
. cs368lib.py
must not be modified. The cs368lib
file provides to you the following functions/classes that you may find useful:
class tracked_sortedlist
: A dynamic list that retains sorted order. Any operations on this list run in logarithmic time. Any python list operations (such as indexing, iteration, or slicing) will work on this list.class tracked_list
: A normal dynamic list. Appending or popping from the end is constant time. Any python list operations (such as indexing, iteration, or slicing) will work on this list.tracked_int
andtracked_double
: Use these functions when you want to create an integer or a double variable.hash
andhash_double
: These functions take in an integerelem
and an integerseed
.hash
will map the integerelem
to a random 64-bit integer.hash_double
will map the integer to a random value in [0, 1]. To get different hash functions, use differentseed
values.1
Any memory you use must be tracked using cs368lib
. Failure to do so will result in penalization.
Submissions, Grading, and ⚝Scoreboard⚝
Scoreboard names
When you submit to gradescope, please only upload stream.py
. Upon submission, you will get a prompt for filling out your scoreboard name. Feel free to choose any name you wish.
Time limits
The judge will run your estimators on a variety of input distributions under a some-what tight time limit. This means that the time complexity of your solution is important as well. Each test case will get 1-10 seconds of run time. If you timeout on a test case, no credit is given for that test case.
Memory limits
The grader also grades you based on the amount of memory you use. If your submission uses too much memory on a test case, no credit is given. The accepted level of memory is min(3n, 20 \log n/eps^2)
where n
is the length of the stream. You are encouraged to optimize your solution to use even less memory.
Accepted level of error
If your solution does not return something within a (1 ± eps) factor of the exact solution, no credit is given.
Lastly, please do not submit excessively. Each submission takes a few minutes to judge and we only have 1 machine. Instead, test things locally when possible.
Final words
- Start early! Please do not leave the programming assignments to the last minute.
- For testing your solution, it may help to write random input generators and test locally. To help you test, the default implementation of the estimators returns the exact solution.
- Always keep in mind the time complexity of your solution. It must run fast enough to process 10-million integers in less than 10 seconds.
- If you have suggestions for new problems or improvements to current problems, let us know!
-
These hash functions use MurmurHash underneath the hood, which is not technically a k-wise independent hash function. However it is fast, and will work in practice. ↩︎