My weekly blog was written in an informal tone, as if it were my diary. If you would like to see my work in more detail, in a professional tone, please refer to the final report.

Week 10

We ran the benchmarking tests this week, comparing Vega's performance against PostgreSQL and DuckDB. For the tests, we used multiple generated datasets of differing sizes, from 50,000 to 1 million rows. As a side note, the input dataset for the dataset generator had only 300 rows, so it is striking that it produced datasets up to roughly 3,000 times larger than the original. Back to the point, we were very pleased to find that Vega outperformed the others overall. DuckDB was the clear victor for the Extent operator across all sizes, and it tended to perform well as the data size increased, but Vega performed noticeably better in all other cases (across every other operator type and dataset size). Because we empirically demonstrated Vega's strength through benchmarking, I think our team achieved one of our main milestones. If you are interested in seeing a comparison graph, please check the final report.

Week 9

This week, I worked on creating a GitHub repository that lets people replicate our work conveniently. It also stores most of my progress from the internship, as my internship is nearing its end. Along with many scripts I created during the internship, I added the crossfilter data generator script that I used last week to the repository. In fact, instead of just storing the raw script, I created a bash script that runs the whole dataset-generation pipeline in the correct order for the user's convenience. All the user needs to do is specify the input path, the metadata path, and the output path in the bash file. I also wrote a README with instructions on how to generate the data the same way our team did. In addition, I created a generator that extracts certain statistical information from the data as a whole, such as its standard deviation, mean, etc., because this is required later for generating normalized arguments to operations. Because a similar process exists in the crossfilter data generator script, I decided to reuse some of the code written by Battle et al. This generator has three options: i) returning the output exactly in the format needed by the benchmark tester (not really human-readable); ii) returning the output in a more human-readable form; iii) saving the result as JSON, since the crossfilter data generator is written in Python while our benchmarker is written in JavaScript.
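To give a rough sense of what the statistics generator does, here is a minimal Python sketch. The function names, fields, and output modes are hypothetical stand-ins, not the actual script:

```python
import json
import statistics

def summarize(rows, fields):
    """Compute per-field summary statistics, later used to normalize
    operator arguments (hypothetical names; not the real script)."""
    summary = {}
    for field in fields:
        values = [row[field] for row in rows]
        summary[field] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "stdev": statistics.pstdev(values),
        }
    return summary

def emit(summary, mode="benchmark"):
    """mode 'benchmark': compact JSON for the JS benchmark tester;
    mode 'readable': pretty-printed JSON for humans."""
    if mode == "benchmark":
        return json.dumps(summary, separators=(",", ":"))
    return json.dumps(summary, indent=2)

# Toy input standing in for a real dataset.
rows = [{"mpg": 18.0}, {"mpg": 15.0}, {"mpg": 24.0}]
stats = summarize(rows, ["mpg"])
```

JSON works as the exchange format here precisely because the generator side is Python while the benchmarker side is JavaScript.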

Week 8

This week, along with package-upgrading tasks, Dr. Battle assigned me the task of creating a dataset generator. She shared links to some well-known dataset generators (e.g., Macau by Zhao 2017 and ssb-dbgen) and asked me to try generating a bigger dataset by feeding a small car dataset (~300 objects) to each of them. Although I initially assumed that generating a new dataset with a similar distribution would be simple, it turned out to be very complicated. In fact, I was able to get only one of the three projects the professor shared with me running successfully: the generator used in the CrossFilter Benchmark by Battle et al. Looking through the generator's script, I noticed how it used a number of mathematical and statistical concepts, including the inverse CDF and standard deviation, to generate a dataset. With that script, I successfully generated a new dataset of 50,000 rows from a small dataset of 315 rows. The new, larger generated dataset will be used for benchmarking, as scalability is at the core of the project I am currently working on.
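The core idea behind inverse-CDF sampling can be sketched in a few lines of Python. This is a simplified stand-in for a single numeric column, not the actual generator by Battle et al.:

```python
import random

def empirical_inverse_cdf(values):
    """Return a function mapping u in [0, 1) to a value drawn from the
    empirical distribution of `values` (a basic inverse-CDF sampler)."""
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    def inv_cdf(u):
        # u selects a quantile; index into the sorted sample.
        return sorted_vals[min(int(u * n), n - 1)]
    return inv_cdf

def upsample(values, target_size, seed=0):
    """Generate `target_size` samples following the input's distribution."""
    rng = random.Random(seed)
    inv_cdf = empirical_inverse_cdf(values)
    return [inv_cdf(rng.random()) for _ in range(target_size)]

small = [1, 2, 2, 3, 5]        # stand-in for a 315-row column
big = upsample(small, 50000)   # scaled up, same empirical distribution
```

Feeding uniform random numbers through the inverse CDF is what lets the output, however large, follow the distribution of the small input.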

Week 7

This week, I worked on creating a random operator (function) generator as well as a random argument generator. What does that mean? Now that I had finished writing the class (refer to last week's post, or the final report, for details), it was time to create an application that randomly generates an operator for benchmarking purposes, instead of having a user hard-code sample operators. Because we need to test the operators across a variety of cases, randomness plays an important role here. In addition, as mentioned last week, I needed a tool that passes a random yet appropriate value to those operator functions. Therefore, I created these two tools so that they can be used in tandem with ease: create a function sample, then pass it a random argument of the appropriate type. For more convenience, I designed it so that the user only needs to pass in the number of samples needed for each operator type. Once done, I shared the tools with Junran.
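The pairing of the two tools can be sketched in Python (the real generators target our JavaScript operator class, and every name and argument schema below is made up for illustration):

```python
import random

# Illustrative per-operator argument "types"; not the real schema.
OPERATOR_ARG_SPECS = {
    "filter":  {"expr": "string"},
    "bin":     {"maxbins": "int"},
    "collect": {"descending": "bool"},
}

def random_argument(arg_type, rng):
    """Random argument generator: produce a value of the requested type."""
    if arg_type == "int":
        return rng.randint(1, 100)
    if arg_type == "bool":
        return rng.choice([True, False])
    return "datum.value > " + str(rng.randint(0, 50))  # toy filter expression

def sample_operators(op_type, n, seed=0):
    """Random operator generator: n randomly parameterized instances
    of one operator type, ready for the benchmark harness."""
    rng = random.Random(seed)
    spec = OPERATOR_ARG_SPECS[op_type]
    return [{name: random_argument(t, rng) for name, t in spec.items()}
            for _ in range(n)]

samples = sample_operators("bin", 3)
```

The point of the design is that the user only chooses an operator type and a sample count; the appropriate argument types follow from the spec.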

Week 6

This week, I worked on patching up the mistakes I made last week. This involved creating a new class in JavaScript in which each transformation operator is represented by a static function. The static functions in the class differ from one another in their parameters. For example, some take an integer as an argument, while others take a string. To be more specific, the JavaScript class has the following seven static functions: filter, aggregate, project, collect, bin, extent, and stack. In each static function, I wrote that operator's unique operation logic. It is important to note that an operation is distinct from an operator: an operation (e.g., sort in descending order) is a particular way an operator (e.g., collect) can be used. In this example, I can pass the boolean true to the parameter descending of the operation function to get a Vega-formatted query command. However, I cannot necessarily pass true to every operation function, because not every operation function has the same parameters. Therefore, referring to the Vega documentation, I created each operator function with its own appropriate parameters.
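The actual class is written in JavaScript, but the shape of the idea can be sketched in Python: static functions that each build a Vega-style transform spec. The parameter names follow the Vega transform documentation; the class name and method selection here are illustrative:

```python
class VegaTransforms:
    """Sketch of the operator class: each static method returns one
    Vega transform spec as a dict (the real class is JavaScript)."""

    @staticmethod
    def collect(field, descending=False):
        # 'collect' sorts the data stream; only it takes a sort order.
        order = "descending" if descending else "ascending"
        return {"type": "collect", "sort": {"field": field, "order": order}}

    @staticmethod
    def bin(field, extent, maxbins=10):
        # 'bin' takes numeric parameters instead of a boolean.
        return {"type": "bin", "field": field,
                "extent": extent, "maxbins": maxbins}

    @staticmethod
    def extent(field, signal="ext"):
        return {"type": "extent", "field": field, "signal": signal}

spec = VegaTransforms.collect("mpg", descending=True)
```

Comparing `collect` and `bin` shows why a single argument like `true` cannot be passed to every operation function: each operator exposes different parameters.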

Week 5

This week, I worked on microbenchmarking preparation. More specifically, my job was to create a script to be used for benchmarking our tool. Using the script, we will test the filter, aggregate, bin, extent, stack, collect, and project operators and measure how long each takes to execute end-to-end. During a small meeting with Junran, the graduate student leading the project under Dr. Battle, she showed me how to approach the problem with an example using the car.json data. However, I mistakenly understood that I had to write specific benchmarking test cases, so when I presented my work, Dr. Battle gave me another week to patch it up and turn it into general benchmarking tests. This week, I was also assigned dependency upgrades, as in the first few weeks.
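The end-to-end timing idea can be sketched as a generic harness (this is not our actual benchmarking script, and the operator below is a toy stand-in):

```python
import time

def benchmark(fn, data, repeats=5):
    """Run fn(data) `repeats` times end-to-end and return the median
    wall-clock time in seconds (median resists one-off spikes)."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]

# Toy stand-in for one operator under test (a filter-style scan).
rows = list(range(100000))
median_s = benchmark(lambda d: [x for x in d if x > 50000], rows)
```

The same harness works for any operator: swap in a different function, keep the measurement code identical, which is what makes the tests general rather than case-specific.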

Week 4

Every Tuesday, our team has a separate small-group meeting, in the absence of Dr. Battle, to brainstorm and work on group tasks together. This week, the team was assigned the job of creating a table for the Voyager optimizer. So, during the small-group meeting, we created a table with three columns: Vega DF Operator, Vega Parameters to Test, and Other Parameters to Test. Next week, based on what we have filled in, we will run a microbenchmarking test. For the microbenchmarking test, we decided to test the filter, aggregate, bin, extent, stack, collect, and project operators. Then, for each operator, we filled in the second and third columns with the parameters to test.

Along with this, I have to wrap up the pull request I made last week, because the code reviewer pointed out a couple of issues that need to be fixed. Since they are minor, I think I can have the PR finished and successfully merged by the next meeting.

Week 3

Every week, Dr. Battle assigns a paper for the Voyager team to read. After reading the paper, we brainstorm and write a summary of its key takeaways, and we review them together in the following meeting. This week, we were assigned a paper on a database optimization tool dubbed Khameleon.

In general, interactive data visualization and exploration (DVE) applications suffer heavily from network delays, so a technique called prefetching is commonly used. Prefetching means caching the responses for predicted requests ahead of time. Khameleon is also a framework that uses a prefetching mechanism, but it "continuously and aggressively hedges across a large set of potential requests" and "shields developers from joint optimization problem," according to Mohammed et al.
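To illustrate the general prefetching idea (a toy sketch of the concept, not Khameleon's actual design):

```python
class PrefetchCache:
    """Toy prefetcher: fetch predicted requests ahead of time so that a
    later lookup is a cache hit and skips the network round trip."""

    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn   # stands in for a slow server call
        self.cache = {}

    def prefetch(self, predicted_requests):
        # Cache responses for requests we *predict* the user will make.
        for req in predicted_requests:
            if req not in self.cache:
                self.cache[req] = self.fetch_fn(req)

    def get(self, req):
        if req in self.cache:        # hit: answered locally
            return self.cache[req]
        return self.fetch_fn(req)    # miss: pay the full latency

server = PrefetchCache(lambda q: "result:" + q)
server.prefetch(["q1", "q2"])        # done during idle time
```

The hard part, which Khameleon addresses, is deciding which requests to predict and how to spend limited bandwidth across many candidates; this sketch simply fetches everything it is told to.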

Also, all of the dependencies that I have updated need to be merged into the master branch of the Voyager repository, so I was asked to make a pull request. Because the process of updating outdated dependencies is very complicated, there are not that many dependencies that I have been able to bring up to the latest version. I might discuss this dependency work as one of my achievements during the DREU program, since it also taught me more about Git and GitHub, such as GitHub Actions.

Week 2

I spent some time looking into the history of Vega this week. Vega appears to be very interesting; unlike other visualization tools and libraries I have used, such as Tableau, which is driven by direct human-visualization interaction, and d3.js, which is driven by programming, Vega operates off of just one JSON file. Once a JSON file packed with information about a visualization is parsed by Vega, it is transformed into an interactive visualization.
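For a sense of what such a JSON file looks like, here is a minimal bar-chart spec in the style of the Vega documentation's examples (abridged and hand-written for illustration, not taken from the project):

```json
{
  "$schema": "https://vega.github.io/schema/vega/v5.json",
  "width": 200,
  "height": 120,
  "data": [
    {
      "name": "table",
      "values": [
        {"category": "A", "amount": 28},
        {"category": "B", "amount": 55}
      ]
    }
  ],
  "scales": [
    {
      "name": "x", "type": "band", "range": "width", "padding": 0.1,
      "domain": {"data": "table", "field": "category"}
    },
    {
      "name": "y", "type": "linear", "range": "height", "nice": true,
      "domain": {"data": "table", "field": "amount"}
    }
  ],
  "marks": [
    {
      "type": "rect",
      "from": {"data": "table"},
      "encode": {
        "enter": {
          "x": {"scale": "x", "field": "category"},
          "width": {"scale": "x", "band": 1},
          "y": {"scale": "y", "field": "amount"},
          "y2": {"scale": "y", "value": 0}
        }
      }
    }
  ]
}
```

Everything about the chart, including the data, the scales, and the marks, lives in this one declarative file; Vega parses it and renders the interactive result.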

Week 1

In the first week, I was assigned the task of upgrading the dependencies of the Voyager repo to the latest versions. Because the current project I am working on, dubbed Scalable Voyager, is an enhancement of the original Voyager project, the original code needs refinement first. Surprisingly, the original repo is quite old (created around 2016), and many of its dependencies are currently outdated, so I think it will take a lot of time to patch them all up.
