We believe that at the heart of every data or analytics team are the data assets that team owns and maintains. The dashboards, reports, tables, queries -- they are everything. And a good team knows that to reduce its backlog and keep business stakeholders (and the team itself) happy, it should increase the ROI of existing assets instead of creating new ones.
This is why we’re building Rupert—to organize all your data assets in the most engaging and efficient way, with minimum work from your side and maximum utilization by you, your colleagues, and business stakeholders.
To do so, we are investing a lot of time in ways to automatically find, extract, parse, cluster, and document your data assets. We run experiments with different tools and methods, and are eager for feedback that will help us build the first data operations knowledge hub designed with data analysts in mind. Today we would like to share one such experiment with you.
Just as with code, no one likes using someone else's SQL queries. When not properly documented, queries can become extremely hard to follow, so most of the time people rewrite them and throw the old work away. Moreover, every client we talked to had automated queries running in their systems, along with poorly maintained relics of former analysts and employees still in use across different pipelines and products. You can see where we're going with this.
Describing a SQL query in a human-readable way
Today we wanted to give you a glimpse of our work to add a human-readable description to all these “uncharted” SQL queries. Rupert's SQL-to-Text feature is based on the well-known Transformer architecture, a neural network consisting of an encoder and a decoder with multiple attention layers. As with state-of-the-art sequence-to-sequence models (machine translation, summarization, etc.), we generate text in an auto-regressive manner (meaning that the prediction of each token depends on the previously predicted tokens) and use beam search to produce the output text at inference time. We use the pre-trained T5 model as the basis of both the encoder and the decoder, but since T5 was not originally trained on SQL queries, we further pre-train it on a large corpus (hundreds of thousands) of SQL queries in a self-supervised manner. This pre-training improves the ability of our encoder to represent SQL queries with all the subtle variations that arise from different dialects, analysts, business terms, and schema representations (tables and columns).
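To make the decoding step concrete, here is a minimal sketch of auto-regressive generation with beam search using the open-source Hugging Face transformers library. The checkpoint, task prefix, and input query are illustrative stand-ins, not Rupert's internal model:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Illustrative checkpoint only -- Rupert's model is further pre-trained
# on SQL and fine-tuned, which a stock T5 checkpoint is not.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

sql = "SELECT name, COUNT(*) FROM orders GROUP BY name ORDER BY COUNT(*) DESC LIMIT 5"
# The task prefix is a hypothetical tag, in the style of T5's text prefixes.
inputs = tokenizer("translate SQL to text: " + sql, return_tensors="pt")

# Beam search decodes auto-regressively: each step conditions on the tokens
# generated so far and keeps the num_beams highest-scoring partial outputs.
output_ids = model.generate(
    inputs.input_ids,
    num_beams=4,
    max_length=64,
    early_stopping=True,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```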
To train our model on the SQL-to-Text task, we perform an additional fine-tuning step using our internal data and Spider, a well-known Text-to-SQL dataset, which we pre-process specifically to fit the SQL-to-Text task. Finally, to increase the amount of training data even further, we automatically synthesize query-description pairs.
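Since each Spider example pairs a natural-language question with a gold SQL query, a minimal version of the pre-processing simply swaps input and target. The sketch below assumes Spider's published JSON field names and an illustrative file path; our actual pre-processing is more involved:

```python
import json

# Spider ships JSON files (e.g. train_spider.json) in which each example
# pairs a natural-language "question" with its gold SQL "query".
with open("spider/train_spider.json") as f:
    spider = json.load(f)

# For SQL-to-Text, flip the direction: the query becomes the source and
# the question becomes the description the model learns to generate.
pairs = [{"source": ex["query"], "target": ex["question"]} for ex in spider]
print(pairs[0])
```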
To select the best model, we evaluate it on a test set of unseen SQL queries (and schemas) using the BLEU metric, computed between the generated predictions and the gold-label descriptions. Our current model achieves a BLEU score of 28.49, an improvement of over 5 points compared to a vanilla T5 model without any of our adaptations.
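For reference, a corpus-level BLEU score can be computed along these lines; the open-source sacrebleu package and the toy predictions and references below are illustrative assumptions, not our evaluation harness:

```python
import sacrebleu

# Toy generated descriptions and their gold labels, for illustration only.
predictions = [
    "find the five customers with the most orders",
    "list all employees hired after 2019",
]
references = [
    "return the 5 customers who placed the most orders",
    "show all employees hired after 2019",
]

# corpus_bleu takes a list of hypotheses and a list of reference lists
# (one inner list per reference set).
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")
```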
Enter our Sandbox
We’ve built a small tool to demonstrate this capability, and we’d love for you to give it a shot. Share your feedback and tell us how you would use it to reduce your backlog! Reach out at coffee@hirupert.com to get access.
Written by: Ben Bogin, Moshe Hazoom & Yoni Steinmetz