Blogs on Justin Jaffray

https://justinjaffray.com/blog/
Recent content in Blogs on Justin Jaffray
Thu, 26 Jan 2023 00:00:00 +0000

A Charming Algorithm for Count-Distinct
https://justinjaffray.com/a-charming-algorithm-for-count-distinct/
Thu, 26 Jan 2023 00:00:00 +0000
I recently came across a paper called Distinct Elements in Streams: An Algorithm for the (Text) Book by Chakraborty, Vinodchandran, and Meel. The usage of the phrase “from the book” is of course a reference to Erdős, who often referred to a “book” within which God kept the best proofs of any given theorem. Thus, for something to be “from the book” is for it to be particularly elegant. I have to say, I agree with their assessment.

Functional Dependencies and Orders
https://justinjaffray.com/functional-dependencies-and-orders/
Sun, 25 Dec 2022 00:00:00 +0000
Say you’re a query optimizer. Is the following transformation “valid?”
select * from t where x > 5 => select * from t where false
Well…I doubt it. The first version permits strictly more rows than the second. But, what if I told you that we recently collected statistics on our data, and found that no more than 0.01% of our data has a value of x that’s larger than 5?

JOIN: The Ultimate Projection
https://justinjaffray.com/join-the-ultimate-projection/
Mon, 13 Jun 2022 00:00:00 +0000
Or: the core idea of how database query planners decorrelate subqueries.
Joins
In a relational database, to compute the join of \(R\) and \(S\) on \(p\), for each row \(r\) in \(R\), find all the rows \(s\) in \(S\) where \(p(r, s)\) is true, and emit the concatenation of \(r\) and \(s\). Readers of this blog might note this is a different definition than we’ve used in the past, but this one is well-suited to the way we’ll be using joins today.
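The join definition in the excerpt above can be read almost directly as a nested loop. What follows is a minimal sketch of that reading, not code from the post: rows are modeled as Python tuples, and the names `nested_loop_join`, `R`, `S`, and `p` are assumptions chosen to mirror the notation in the text.

```python
from typing import Callable, Iterator, Sequence, Tuple

Row = Tuple  # assumption for this sketch: a row is a plain tuple of values


def nested_loop_join(
    R: Sequence[Row],
    S: Sequence[Row],
    p: Callable[[Row, Row], bool],
) -> Iterator[Row]:
    """Join R and S on predicate p, following the definition literally."""
    for r in R:              # for each row r in R...
        for s in S:          # ...find all the rows s in S...
            if p(r, s):      # ...where p(r, s) is true...
                yield r + s  # ...and emit the concatenation of r and s


# Hypothetical example data: join on the first column of each row.
R = [(1, "a"), (2, "b")]
S = [(1, "x"), (1, "y"), (3, "z")]
joined = list(nested_loop_join(R, S, lambda r, s: r[0] == s[0]))
```

Here `joined` holds the two rows built from `(1, "a")`, since no row of `S` matches `(2, "b")`. A real query engine would usually pick a hash or merge join when the predicate allows it; the nested loop is just the most direct transcription of the definition.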

A Gentle(-ish) Introduction to Worst-Case Optimal Joins
https://justinjaffray.com/a-gentle-ish-introduction-to-worst-case-optimal-joins/
Mon, 30 May 2022 00:00:00 +0000
If you’ve been following databases in the past couple years, you’ve probably encountered the term “worst-case optimal joins.” These are supposedly a big deal, since joins have been studied for a long time, and the prospect of a big shift in the way they’re thought about is very exciting to a lot of people. The literature around them, however, is primarily aimed at theorists. This post is as gentle and bottom-up as I can manage of an introduction to the main ideas behind them.

Durability and Redo Logging
https://justinjaffray.com/durability-and-redo-logging/
Mon, 24 Jan 2022 00:00:00 +0000
The most fundamental property a database can provide is durability. That is, once I’ve told you that your write has been accepted, if a mouse chews through the power cord for the server rack, the write will not be lost. This obviously is only possible to a degree. If someone goes into your SSD with a magnetized needle and a steady hand and tweaks their bank balance, then (short of replication) the best you can probably do is detect that it’s been changed via a checksum, unless, of course, they had the foresight to update that as well.

Compaction
https://justinjaffray.com/compaction/
Wed, 26 May 2021 00:00:00 +0000
You hear the word “log” a lot when talking about databases. This is because databases are fundamentally comprised of sin. (Shared mutable state.)
Logs are natural in a setting where you want to be principled about your treatment of mutation because one way to think about mutation is how a variable starts from an initial value and undergoes several “transformations.” This sequence of operations:
let x = 1; x = x + 3; let y = 7; x = x * 2;
can be written to a log as a history of what operations were performed:

Query Engines: Push vs. Pull
https://justinjaffray.com/query-engines-push-vs.-pull/
Mon, 26 Apr 2021 00:00:00 +0000
People talk a lot about “pull” vs. “push” based query engines, and it’s pretty obvious what that means colloquially, but some of the details can be a bit hard to figure out. Important people clearly have thought hard about this distinction, judging by this paragraph from Snowflake’s SIGMOD paper:
Push-based execution refers to the fact that relational operators push their results to their downstream operators, rather than waiting for these operators to pull data (classic Volcano-style model).

Deduplicating Decklists
https://justinjaffray.com/deduplicating-decklists/
Thu, 17 Sep 2020 00:00:00 +0000
This is not going to be my normal kind of post, it’s not very focused, and going to be a bit rambley, as I talk about a problem I thought about one day. Magic: The Gathering is a card game where players construct decks, typically of 60 cards plus a 15 card sideboard, for 75 cards total. Periodically, the company that makes the game, Wizards of the Coast (WotC) publishes a list of decks that did well recently.

Branch and Bound
https://justinjaffray.com/branch-and-bound/
Thu, 11 Jun 2020 00:00:00 +0000
Sometimes I write posts because I think I have a fresh perspective on something, and sometimes I write posts because for whatever reason I think every explanation of a particular concept that I’ve seen is bad. This is the latter!
This post is about a lovely technique for discrete optimization called Branch and Bound. Optimization problems can generally be viewed as a search for an object with the lowest “cost.” For instance, given an instance of the Traveling Salesman Problem (TSP), the “objects” we’re searching for are “tours,” or cycles in the graph that visit every “city” exactly once.

Understanding Cost Models
https://justinjaffray.com/understanding-cost-models/
Mon, 11 May 2020 00:00:00 +0000
The dominant way modern query planners decide what algorithms to use for a given query is via a cost model. Effectively, they enumerate the set of possible plans for a query, and then run their cost model over each one, eventually returning whichever one had the lowest cost according to the model. Analytic queries can become complex, horrifically so, in fact, and assessing their costs follows suit. To alleviate this somewhat, we’re going to explore some tools we can use to attempt to travel up the ladder of abstraction for query planners.

Timely Dataflow and Total Order
https://justinjaffray.com/timely-dataflow-and-total-order/
Mon, 06 Apr 2020 00:00:00 +0000
In anticipation of starting at Materialize, I’ve been reading through some of the fundamental literature that the product is built on. The first paper I’ve read through is Naiad: A Timely Dataflow System. I enlisted the help of friends Forte Shinko and Ilia Chtcherbakov to help me work through it, and we ended up finding an interesting question that I’m not aware of a proof online for, so I’m going to share it today.

What is a Query Optimizer for?
https://justinjaffray.com/what-is-a-query-optimizer-for/
Tue, 11 Feb 2020 00:00:00 +0000
You could be forgiven for thinking that a “query optimizer” is a component of a database that takes a query plan and makes it better, hence, the typical programmer definition of “optimize.” This is not really how the term is used in practice, and “query optimizer” is really pretty synonymous with “query planner.” I think “query planner” is a better term so I’m going to use that. SQL is fundamentally designed such that users express what data they want rather than an algorithm for retrieving it, and this property gives us, I think, the most basic definition of a query planner’s job:

Join Ordering: The IKKBZ Algorithm
https://justinjaffray.com/join-ordering-the-ikkbz-algorithm/
Thu, 05 Sep 2019 00:00:00 +0000
This post was originally posted on the Cockroach Labs Blog.
Even in the 80’s, before Facebook knew everything there was to know about us, we as an industry had vast reams of data we needed to be able to answer questions about. To deal with this, data analysts were starting to flex their JOIN muscles in increasingly creative ways. But back in that day and age, we had neither machine learning nor rooms full of underpaid Excel-proficient interns to save us from problems we didn’t understand; we were on our own.

An Introduction to Join Ordering
https://justinjaffray.com/an-introduction-to-join-ordering/
Tue, 23 Oct 2018 00:00:00 +0000
This post was originally posted on the Cockroach Labs Blog.
The development of the relational model heralded a big step forward for the world of databases.
The introduction of SQL meant that analysts could construct a new report without having to interact with those eggheads in engineering, but more importantly, the existence of complex join queries meant that theoreticians had an interesting new NP-complete problem to fawn over for the next five decades.

Why Consensus?
https://justinjaffray.com/why-consensus/
Tue, 15 May 2018 00:00:00 +0000
Depending on which computer scientists and Greek lawmakers you listen to, Paxos could either be taught to a 3-year-old, or requires a team of PhDs “almost a year” to fully grasp. I think the discrepancy here comes from a poorly motivated setting. I couldn’t have provided a coherent explanation of what consensus was a year ago, but in the time since then I’ve slowly pieced together an understanding of it thanks to coworkers who put up with my questions.

A Proof of Correctness for CASPaxos
https://justinjaffray.com/a-proof-of-correctness-for-caspaxos/
Tue, 10 Apr 2018 00:00:00 +0000
This post was co-written with Ilia Chtcherbakov.
CASPaxos is a recent and interesting consensus algorithm invented by Denis Rystsov. The paper describing CASPaxos provides a proof of correctness in its appendix; provided here is an alternative proof. First, let’s review the protocol. There are several participants.
Clients
A client submits a transformation function to a proposer. The proposer might reply with a success message consisting of the output of the function applied to the previous value of the register, along with the claim that the value returned was placed into the register.

What Does Write Skew Look Like?
https://justinjaffray.com/what-does-write-skew-look-like/
Wed, 28 Mar 2018 00:00:00 +0000
This post is about gaining intuition for Write Skew, and, by extension, Snapshot Isolation.
Snapshot Isolation is billed as a transaction isolation level that offers a good mix between performance and correctness, but the precise meaning of “correctness” here is often vague. In this post I want to break down and capture exactly when the thing called “write skew” can happen.
A quick primer on transactions
The unit of execution in a database is a transaction.

Reflections on Stacking Stacks
https://justinjaffray.com/reflections-on-stacking-stacks/
Tue, 13 Mar 2018 00:00:00 +0000
I’ve been on both ends of several variations of the following conversation:
FRIEND: Check out this algorithm. It checks that a string of parentheses, brackets, and braces is balanced properly. It goes like this: You iterate through the string while keeping a stack. When you see an opening paren/bracket/brace you push it on. When you see a closing paren, if it’s the same kind as the one on top of the stack, you pop the stack.

Making a PICO-8 TAS
https://justinjaffray.com/making-a-pico-8-tas/
Fri, 09 Sep 2016 00:06:15 -0400
Speedrunners turn the process of beating a single-player video game into a competitive activity, trying to finish games as quickly as humanly possible. This has become increasingly mainstream (though still niche) in the past few years with the rise in popularity of events like Games Done Quick. However, there’s another, smaller community interested in beating games quickly but from a different perspective: how quickly can a game be beaten if humans are taken out of the equation?