Notes on Data@Scale 2018
The following notes are quick thoughts on and summaries of various talks at the Data@Scale conference I attended today. The full schedule of events, along with videos of some of the presentations, can be found here.
Lessons and Observations Scaling a Timeseries Database by Ryan Betts, Director of Platform Engineering at InfluxData
- Timeseries Database
- LSM
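For readers unfamiliar with the LSM (log-structured merge) idea noted above, here is a minimal sketch in Python. It is not InfluxData's storage engine; the class name `TinyLSM` and the `memtable_limit` parameter are made up for illustration, and a real LSM tree would also need a write-ahead log and compaction, which are left out here.

```python
import bisect


class TinyLSM:
    """Toy LSM store: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}
        self.runs = []  # list of (sorted_keys, values), oldest first

    def put(self, key, value):
        # Writes only touch the memtable, which keeps them cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into a sorted, immutable run.
        if not self.memtable:
            return
        keys = sorted(self.memtable)
        self.runs.append((keys, [self.memtable[k] for k in keys]))
        self.memtable = {}

    def get(self, key):
        # Check the newest data first: memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in reversed(self.runs):
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None
```

Reads get slower as runs pile up, which is why real implementations merge runs in the background.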
Leveraging Sampling to Reduce Data Warehouse Resource Consumption by Gabriela Jacques Da Silva, Software Engineer at Facebook; and Donghui Zhang, Software Engineer at Facebook
- Various research papers and a closed form error estimate
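To give a feel for what a closed-form error estimate looks like, here is a small sketch in Python. It is not the method from the talk; it assumes a simple random sample drawn without replacement and uses the textbook standard error of a scaled-up sum. The function name `estimate_sum` is hypothetical.

```python
import math


def estimate_sum(sample, population_size):
    """Estimate a population SUM from a simple random sample.

    Returns (estimate, standard_error). Assumes uniform sampling without
    replacement and at least two sampled values.
    """
    n = len(sample)
    mean = sum(sample) / n
    # Unbiased sample variance (n - 1 in the denominator).
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    # Finite population correction: the error shrinks to zero as the
    # sample approaches the whole population.
    fpc = 1 - n / population_size
    std_error = population_size * math.sqrt(variance / n * fpc)
    return population_size * mean, std_error
```

The appeal is that the error bound comes from the sample itself, so a query can report an approximate answer and its uncertainty without scanning the full table.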
Voting with Witnesses the Apache Cassandra Way by Ariel Weisberg, PMC Member at Apache Cassandra
- Quorums
- Merkle Tree
- Consistent Hashing (Hash Rings)
- Visible / Witness
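As a refresher on two of the building blocks listed above, here is a rough Python sketch of a hash ring and a majority quorum. It is not Cassandra's implementation: the names `HashRing`, `replicas`, and `quorum` are invented for this example, MD5 is used only as a convenient stand-in hash, and each node gets a single token rather than many virtual nodes.

```python
import bisect
import hashlib


def _token(key):
    # Stand-in hash for illustration only.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        # Place each node on the ring at the position given by its token.
        self._ring = sorted((_token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key, n):
        """Return up to n distinct nodes walking clockwise from the key."""
        start = bisect.bisect_right(self._tokens, _token(key))
        result = []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in result:
                result.append(node)
            if len(result) == n:
                break
        return result


def quorum(replication_factor):
    # A majority of replicas; any two majorities share at least one replica.
    return replication_factor // 2 + 1
```

With a replication factor of 3, `quorum(3)` is 2, so a quorum write and a quorum read always overlap on at least one replica (2 + 2 > 3).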
Deleting Data @ Scale by Ben Strahs, Software Engineer, Privacy & Data Use at Facebook
- Schemas
- Widespread testing
- Restoration (Continuous)
Scaling Data Plumbing at Wayfair by Ben Clark, Chief Architect at Wayfair
- Sliding Window on Pipeline
- Leaky Bucket
- ETL
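The leaky bucket mentioned above can be sketched in a few lines of Python. This is a generic version of the technique, not Wayfair's pipeline code; the class name `LeakyBucket` and its parameters are assumptions, and the caller passes in the current time so the behavior is easy to follow.

```python
class LeakyBucket:
    """Admit work only while the bucket has room; the bucket drains steadily."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity    # maximum amount the bucket can hold
        self.leak_rate = leak_rate  # amount drained per second
        self.level = 0.0
        self.last = 0.0             # time of the previous call, in seconds

    def offer(self, now, amount=1.0):
        # Drain whatever leaked out since the last call.
        elapsed = now - self.last
        self.level = max(0.0, self.level - elapsed * self.leak_rate)
        self.last = now
        # Reject the work if it would overflow the bucket.
        if self.level + amount > self.capacity:
            return False
        self.level += amount
        return True
```

A producer that gets `False` back can retry later or drop the item, which keeps a burst upstream from overwhelming the stages downstream.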
Presto: Pursuit of Performance by Andrii Rosa, Software Engineer at Facebook and Matt Fuller, VP of Engineering at Starburst
- Cost-based optimizer
- Fast SQL querying
- Uses coefficients to weight three cost components: storage usage, CPU usage, and complexity
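One way to picture the coefficient idea above is a weighted sum over the cost components, with the optimizer picking the plan that scores lowest. The Python below is only an illustration under that assumption and does not reflect Presto's actual code; `PlanCost`, `weighted_cost`, `cheapest`, the plan names, and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanCost:
    # The three components named in the notes above; units are arbitrary.
    storage: float
    cpu: float
    complexity: float


def weighted_cost(cost, weights=(1.0, 1.0, 1.0)):
    # Collapse the components into one comparable number.
    w_storage, w_cpu, w_complexity = weights
    return (w_storage * cost.storage
            + w_cpu * cost.cpu
            + w_complexity * cost.complexity)


def cheapest(plans, weights=(1.0, 1.0, 1.0)):
    # plans maps a plan name to its estimated PlanCost.
    return min(plans, key=lambda name: weighted_cost(plans[name], weights))


plans = {
    "broadcast_join": PlanCost(storage=10, cpu=50, complexity=2),
    "partitioned_join": PlanCost(storage=30, cpu=20, complexity=3),
}
```

Changing the weights changes the winner: with equal weights `partitioned_join` scores lower, while weighting storage five times as heavily makes `broadcast_join` the cheaper plan.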
Building Highly Reliable Data Pipelines at Datadog by Jeremy Karn, Staff Data Engineer at Datadog
- Spot Instances
- On-demand instances
- Isolate issues by preventatively running multiple pipelines