Tuesday, January 15, 2013

USC CSCI 685(Advanced Topics in Database systems) - Fall 2012 Course review

I am writing this post as an expanded version of many of my responses on facebook forums to my fellow university-mates at USC about the review of the course.

TL;DR: Skip to last two paragraph.

Advanced Topics in Database systems, CSCI 685, taught by Prof. Shahram Ghandeharizadeh, is a rigorous course structured into a seminar format(more like a book club that meets twice a week, you will be expected to participate actively), focusing on reading research papers representing influential milestones that helped shape up the current marketplace of database technologies. Papers including, but not limited to, RAID[1],  Beyond Relational Databases[2], Bigtable[3], Column Stores vs. Row-Stores[4], The Gamma Database Machine Project[5], R-Trees[6], etc were discussed. There was variety in topic and it aimed at giving us a overall knowledge of these systems. Each class opened up with students commenting about the paper for that day followed by lecture/discussion of the paper facilitated by Professor. I would highly encourage that the papers be read before the class to make the best use of the classes.

Assignments were related to and building on top of BG, benchmarking tool to evaluate performance of a data store for interactive social networking actions and sessions[7]. Each student was assigned a datastore(from a list that included MySQL, DB2, Riak, Cassandra, CouchDB, VoltDB, etc) and the assignment involved two parts: writing DB interfaces to communicate with the datastore to store in it the data representing the social network domain that BG dealt with and to gather benchmark metrics running different workloads. It was predominantly in written in Java. Assignment questions were discussed in BG's google group. Prof. Ghandeharizadeh was very accommodating when the class wasn't making as quick a progress as he expected that we would and structured the assignment differently to adjust the pace to make sure everybody is upto speed. I think it was possible because of the small class size(15 students). I would imagine the assignment would have been simpler to begin with in a larger class size. An extra credit assignment on Berkeley DB was announced but once the class started working on the project, it was hard to begin this optional assignment. Assignments were completed by first half of the semester.

We were free to propose our ideas for class project and we were also given interesting ideas. Projects were done in a group of 2 and each group worked on a separate topic. It was done in Java and/or C depending on the topic. A few of the topics included benchmarking vertical scaling and horizontal scaling of different SQL and NoSQL datastores, implementing Gumball[8] on Memcached, implementing elasticity on Memcached, etc. There were in-class presentations during important milestones. The projects were meant to be very intense and the scope evolved over time as the students became comfortable with the problem at hand. Professor pointed out to relavent literature when necessary and facilitated discussions to get us thinking about the solution. This occupied most of the second half of the semester.

Two closed book exams were pretty easy to crack with participative attendance during the paper lectures and exam review lectures. Extra credits for midterm were given for reviewing a few PhD student's papers and for providing feedback on them.

One of the questions that I was asked several times by many of my fellow university-mates was how useful the course would be from an industry perspective. The variety of papers with the theoretical nature of discussions exposes us on a very high level about different kinds of problems that experts have encountered so far and the type of solutions that have been explored for them. Although this is not as directly applicable as the learnings from assignments or projects, I feel that it would equip students with a high level knowledge in wide variety of database technologies(and on generic topics such as indexing techniques, scalable systems, etc) that could be directly applied whenever we encounter similar problems while building systems.

Another popular question is how is CSCI 685 different from CSCI 599 - NewSQL Database Management Systems. CSCI 685 is broad and covers papers from a longer duration of time ranging from classical to contemporary systems. CSCI 599 - NewSQL is smaller in scope and covers extensively about the recent hype around NoSQL and more recent NewSQL products that have been bubbling up rapidly. 

Hit me up with any questions you may have. Good luck.

References

  1. D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). ACM SIGMOD, 1988
  2. M. Seltzer. Beyond Relational Databases. Communications of the ACM, July 2008, Vol. 51, No. 7.
  3. F. Chang et al. Bigtable: A Distributed Storage System for Structured Data. In OSDI 2006.
  4. D. J. Abadi et al. Column Stores vs. Row-Stores: How Different Are They Really? ACM SIGMOD 2008.
  5. D. DeWitt et al. The Gamma Database Machine Project. IEEE Transactions on Knowledge and Data Engineering, Vol. 2, 1990.
  6. Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. In ACM SIGMOD 1984
  7. S. Barahmand, S. Ghandeharizadeh. BG: A Benchmark to Evaluate Interactive Social Networking Actions. CIDR 2013.
  8. S. Ghandeharizadeh, J. Yap. Gumball: A Race Condition Prevention Technique for Cache Augmented SQL Database Management Systems. Database Laboratory Technical Report 2012-01, Computer Science Department, USC