The following notes are quick thoughts on and summaries of various talks at the Data@Scale conference I attended today. The full schedule of events, along with videos of some of the presentations, can be found here.

Lessons and Observations Scaling a Timeseries Database by Ryan Betts, Director of Platform Engineering at InfluxData

  • Timeseries Database
  • LSM
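
To make the LSM bullet concrete for myself, here is a minimal sketch of a log-structured merge store. It is not InfluxData's storage engine: the class name `TinyLSM` and the `memtable_limit` threshold are hypothetical, everything lives in memory, and compaction is left out entirely. The idea it shows is that writes land in a small mutable table, full tables are flushed as immutable sorted runs, and reads check the newest data first.

```python
import bisect


class TinyLSM:
    """Toy log-structured merge store: in-memory only, no compaction."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}                    # mutable, absorbs all writes
        self.runs = []                        # immutable sorted runs, newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a sorted run (stands in for an on-disk file)."""
        if not self.memtable:
            return
        items = sorted(self.memtable.items())
        keys = [k for k, _ in items]
        values = [v for _, v in items]
        self.runs.append((keys, values))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: memtable first, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in reversed(self.runs):
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None
```

A real engine would also merge old runs in the background (compaction) so that reads do not have to scan an ever-growing list of runs.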

Leveraging Sampling to Reduce Data Warehouse Resource Consumption by Gabriela Jacques Da Silva, Software Engineer at Facebook; and Donghui Zhang, Software Engineer at Facebook

  • Various research papers and a closed form error estimate
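
As a reminder of what a closed form error estimate can look like, here is a small sketch that estimates a column total from a uniform random sample and attaches a standard error. I am assuming the textbook estimator for simple random sampling without replacement; the talk's own estimators may well differ. The function name `estimate_sum` and its parameters are mine.

```python
import math
import statistics


def estimate_sum(population_size, sample):
    """Estimate the total of a column from a uniform random sample.

    Assumes simple random sampling without replacement and at least
    two sampled values. Returns (estimated_total, standard_error).
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    stdev = statistics.stdev(sample)           # sample standard deviation (n - 1)
    fpc = math.sqrt(1 - n / population_size)   # finite population correction
    estimate = population_size * mean
    std_error = population_size * stdev / math.sqrt(n) * fpc
    return estimate, std_error
```

The appeal for a data warehouse is that the error bound comes from a formula over the sample itself, so no second pass over the full table is needed to know how much to trust the answer.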

Voting with Witnesses the Apache Cassandra Way by Ariel Weisberg, PMC Member at Apache Cassandra

  • Quorums
  • Merkle Tree
  • Consistent Hashing (Hash Rings)
  • Visible / Witness
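
Two of these terms are easy to sketch together: a hash ring that picks the replicas for a key, and the quorum size for a given replication factor. This is an illustration under my own assumptions rather than Cassandra's code. The `HashRing` class, the `vnodes` parameter, and the use of MD5 as the token function are all stand-ins chosen to keep the example in the standard library. The quorum formula, floor(RF / 2) + 1, is the usual majority rule.

```python
import bisect
import hashlib


def _token(value: str) -> int:
    # MD5 is only a convenient stdlib stand-in for a token function.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent hash ring with virtual nodes (hypothetical API)."""

    def __init__(self, nodes, vnodes=8):
        self.ring = sorted(
            (_token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, key, n):
        """Walk clockwise from the key's token, collecting up to n distinct nodes."""
        start = bisect.bisect_right(self.tokens, _token(key))
        found = []
        for offset in range(len(self.ring)):
            node = self.ring[(start + offset) % len(self.ring)][1]
            if node not in found:
                found.append(node)
            if len(found) == n:
                break
        return found


def quorum(replication_factor):
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1
```

The property that makes the ring "consistent" is that removing a node only moves the keys that node owned; every other key keeps its primary replica.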

Deleting Data @ Scale by Ben Strahs, Software Engineer, Privacy & Data Use at Facebook

  • Schemas
  • Widespread testing
  • Restoration (Continuous)

Scaling Data Plumbing at Wayfair by Ben Clark, Chief Architect at Wayfair

  • Sliding Window on Pipeline
  • Leaky Bucket
  • ETL
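
For reference, a leaky bucket fits in a few lines. This is a generic sketch of the algorithm, not Wayfair's implementation; the `LeakyBucket` class and its parameters are hypothetical, and the caller passes timestamps explicitly so the behaviour is easy to follow.

```python
class LeakyBucket:
    """Toy leaky bucket: admits work only while the bucket has room."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity    # maximum units the bucket can hold
        self.leak_rate = leak_rate  # units drained per second
        self.level = 0.0            # current fill level
        self.last = 0.0             # timestamp of the previous call

    def allow(self, now, amount=1.0):
        """Return True if `amount` units fit at time `now` (in seconds)."""
        # Drain whatever leaked out since the last call.
        elapsed = now - self.last
        self.level = max(0.0, self.level - elapsed * self.leak_rate)
        self.last = now
        if self.level + amount <= self.capacity:
            self.level += amount
            return True
        return False
```

With `LeakyBucket(capacity=2, leak_rate=1)`, two events at the same instant are admitted, a third is rejected, and one more slot opens up a second later.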

Presto: Pursuit of Performance by Andrii Rosa, Software Engineer at Facebook and Matt Fuller, VP of Engineering at Starburst

  • Cost-based optimizer
  • Fast SQL querying
  • Uses coefficients to weight three cost components: storage usage, CPU usage, and complexity
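
The coefficient idea can be sketched as a weighted sum over the three components noted above. This is not Presto's optimizer: the `PlanEstimate` type, the `WEIGHTS` values, and the plan numbers are all made up for illustration. The only point is that each candidate plan gets a single comparable number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    name: str
    storage: float
    cpu: float
    complexity: float


# Hypothetical coefficients; a real optimizer would tune or configure these.
WEIGHTS = {"storage": 1.0, "cpu": 2.0, "complexity": 0.5}


def plan_cost(plan, weights=WEIGHTS):
    """Collapse the three estimates into one scalar cost."""
    return (
        weights["storage"] * plan.storage
        + weights["cpu"] * plan.cpu
        + weights["complexity"] * plan.complexity
    )


def cheapest(plans, weights=WEIGHTS):
    """Pick the candidate plan with the lowest weighted cost."""
    return min(plans, key=lambda p: plan_cost(p, weights))
```

For example, comparing `PlanEstimate("broadcast_join", 10, 4, 2)` against `PlanEstimate("partitioned_join", 4, 6, 4)` with the weights above gives costs of 19 and 18, so the partitioned plan is chosen.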

Building Highly Reliable Data Pipelines at Datadog by Jeremy Karn, Staff Data Engineer at Datadog

  • Spot Instances
  • On-demand instances
  • Isolate issues by running multiple pipelines as a preventive measure