Week 8

This week, along with the package upgrade tasks, Dr. Battle asked me to create a dataset generator. She shared links to some well-known dataset generators ([Macau by Zhao 2017](https://github.com/zheguang/macau/blob/master/data_generator.py) and ssb-dbgen) and asked me to try generating a bigger dataset by feeding a small car dataset (~300 objects) to each of them. I initially thought that generating a new dataset with a similar distribution would be simple, but it turned out to be much more complicated than I expected. As a matter of fact, I was able to get only one of the three projects the professor shared with me running successfully: the generator used in the CrossFilter Benchmark by Battle et al. I looked through the generator's script and noticed that it relies on several mathematical and statistical concepts, including the inverse CDF and standard deviation, to generate a dataset. With that script, I was able to generate a new dataset of 50,000 rows from a small dataset of 315 rows. The larger generated dataset will be used for benchmarking, since scalability is at the core of the project I am currently working on.
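To make the inverse CDF idea a bit more concrete, here is a minimal sketch of how a small numeric column could be scaled up by inverting its empirical CDF. This is not the benchmark's actual generator script; the function name `inverse_cdf_sample` and the synthetic `small` column are my own stand-ins for illustration, and it only handles a single numeric column.

```python
import random
import statistics


def inverse_cdf_sample(values, n, seed=0):
    """Draw n synthetic values by inverting the empirical CDF of `values`.

    Hypothetical helper for illustration only, not taken from any of the
    generators mentioned above.
    """
    rng = random.Random(seed)
    ordered = sorted(values)
    last = len(ordered) - 1
    samples = []
    for _ in range(n):
        u = rng.random()      # uniform draw in [0, 1)
        pos = u * last        # fractional position along the sorted sample
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        # Linear interpolation between neighbouring sorted values.
        samples.append(ordered[lo] + frac * (ordered[hi] - ordered[lo]))
    return samples


# Assumed stand-in for one numeric column of the 315-row car dataset.
rng = random.Random(42)
small = [rng.gauss(25.0, 6.0) for _ in range(315)]
big = inverse_cdf_sample(small, 50_000)

# Compare size, mean, and standard deviation of the two columns.
print(len(big))
print(statistics.mean(small), statistics.mean(big))
print(statistics.stdev(small), statistics.stdev(big))
```

Because every new value is interpolated between existing ones, the generated column stays within the original range and roughly keeps its mean and standard deviation. A real generator also has to deal with categorical columns and correlations between columns, which this sketch ignores.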

Written on July 20, 2021