It feels like every conference talk, blog post, podcast or marketing message that I get these days is touting the amazing benefits of Data Science, Machine Learning (ML) or Artificial Intelligence (AI) (or if you are old school, you probably recall the marketing and promotion of many similar techniques as “Statistical Modeling”).

And you know what… all of these publications and forums are right… there’s gold in ‘dem hills! Huge gains in efficiency or capability can accrue to businesses and users that implement this type of intelligence in their digital products, services and applications.

So training yourself or your team to take advantage of this opportunity is key! But as I’ve thought about all of the data science or ML courses I’ve taken in college, online or in certification programs, I’ve noticed a gap. The examples and lessons that focus on the modeling techniques typically offer sample data sets that contain nicely aggregated and transformed “features” for those models. I’ll see “number of purchases” in those data sets rather than a granular file of messy and hard-to-interpret point-of-sale (POS) transactions. These “shortcuts” are great in a classroom for helping students of these new disciplines focus on the math and modeling techniques. But when those students meet data in the real world… in their day jobs, they almost always need to turn to SQL to get access to data and to help refine it for use in modeling (or in more traditional uses such as BI dashboards or Excel reports).

And here is where an underappreciated risk (and its corresponding opportunity) currently waits… an often overlooked but critically important part of the data science process: the retrieval and prep of data using SQL.

The reality is that the SQL required to access and transform the data needed for advanced analytics has probably already been written. Every day your BI (reporting) developers, ad-hoc data analysts, business analysts or your data scientists write the kind of SQL needed to transform raw POS records (or medical insurance claims, retail banking transactions, stock trades, etc.) into “features” or variables with real predictive power. And even better, these analysts live down in “the weeds” with these data sets whereas your data scientists do not. So when your typical SQL-proficient analyst accesses this data, they do so with the correct business rules embedded in those queries, such as WHERE clauses that filter the data set for only valid records and CASE statements that transform obscure system codes into meaningful labels. So while there may be “gold” in the advanced analytic outputs of your PhD scientists, there is also a lot of unmined “gold” already present in the SQL that other teams use, SQL that can become the input to these models.
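To make that concrete, here is a minimal sketch of the kind of query an analyst might already have on hand. It uses Python's built-in sqlite3 module so it runs anywhere. The table, columns and codes (`pos_transactions`, `status_code`, `channel_code`, 'C', 'V', 'S', 'W') are all made up for illustration and do not come from any real system.

```python
import sqlite3

# Hypothetical raw point-of-sale rows; every name and code here is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pos_transactions (
    customer_id  INTEGER,
    status_code  TEXT,   -- assumed codes: 'C' = completed, 'V' = voided
    channel_code TEXT,   -- assumed codes: 'S' = store, 'W' = web
    amount       REAL
);
INSERT INTO pos_transactions VALUES
    (1, 'C', 'S', 20.0),
    (1, 'C', 'W', 35.0),
    (1, 'V', 'S', 99.0),
    (2, 'C', 'W', 10.0);
""")

FEATURE_SQL = """
WITH valid_sales AS (
    SELECT
        customer_id,
        amount,
        -- CASE turns an obscure system code into a meaningful label
        CASE channel_code
            WHEN 'S' THEN 'store'
            WHEN 'W' THEN 'web'
            ELSE 'unknown'
        END AS channel
    FROM pos_transactions
    -- WHERE keeps only valid records (here, completed sales)
    WHERE status_code = 'C'
)
SELECT
    customer_id,
    COUNT(*)    AS number_of_purchases,
    SUM(amount) AS total_spend,
    SUM(CASE WHEN channel = 'web' THEN 1 ELSE 0 END) AS web_purchases
FROM valid_sales
GROUP BY customer_id
ORDER BY customer_id
"""

# One row per customer: the aggregated "features" a model would consume.
features = conn.execute(FEATURE_SQL).fetchall()
for row in features:
    print(row)
```

The voided transaction never reaches the feature set, which is exactly the sort of business rule a data scientist working from the raw file could easily miss.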

So ideally, a strategy that enables mining your current SQL into the fuel needed to power your AI, ML and advanced analytics should do a few things:

  1. Capture the SQL that is already powering your company with descriptive (backwards-looking) analytics whether they come from BI or ad-hoc analysts.
  2. Allow for the “best in class” SQL to be managed and promoted, making it visible not just to your data science users, but also to your least-knowledgeable analysts… giving them visibility into the correct data and business rules that drive your business.
  3. Enable your end users to do this naturally as part of their workflow and not as a documentation or “top-down” governance exercise.

Fortunately, Aginity has been providing SQL authoring capability for the leading analytics data platforms (Netezza, Hive, Redshift, etc.) for over 10 years, and the next-generation offerings of this capability recognize these needs. We think of these innovations as helping to shift a traditional “Data & Tools” Mindset towards an “Analytics Outcomes” Mindset, prompting you to govern and manage your analytics as assets.

It’s not hard to shift towards behaviors that allow you to capture the “gold” in your existing SQL efforts.  Your SQL analysts can immediately download Aginity Pro for free and start getting in the habit of cataloging and reusing their SQL.  This will result in consistent and more efficient results at the individual level.

But we all know a team beats individual efforts.  And by making the individual SQL catalogs visible, searchable and secure in a team or collaborative environment, you are bringing that hidden value into plain sight, like gold nuggets in a stream-bed.

So when you’re ready to take it to the next level, you can bring individual efforts into Aginity Premium.  Your team can maintain one definition of calculations and leverage each other’s skills to be more productive and consistent.
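As a generic illustration of what “one definition of calculations” buys you, here is a small sketch in plain Python and SQLite. It is not a depiction of how Aginity Pro or Aginity Premium works; the `SQL_CATALOG` dictionary, the `from_catalog` helper and the table and column names are all hypothetical.

```python
import sqlite3

# Hypothetical shared catalog: one vetted definition per business concept.
SQL_CATALOG = {
    "valid_sales": (
        "SELECT customer_id, amount "
        "FROM pos_transactions "
        "WHERE status_code = 'C'"   # assumed rule: 'C' marks a completed sale
    ),
}


def from_catalog(name: str) -> str:
    """Return a cataloged query wrapped as a named subquery."""
    return f"({SQL_CATALOG[name]}) AS {name}"


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pos_transactions (customer_id INTEGER, status_code TEXT, amount REAL);
INSERT INTO pos_transactions VALUES
    (1, 'C', 20.0),
    (1, 'C', 35.0),
    (1, 'V', 99.0),
    (2, 'C', 10.0);
""")

# A BI-style report: total revenue from valid sales.
report_total = conn.execute(
    f"SELECT SUM(amount) FROM {from_catalog('valid_sales')}"
).fetchone()[0]

# A data-science-style feature: spend per customer, built on the same definition.
feature_rows = conn.execute(
    f"SELECT customer_id, SUM(amount) FROM {from_catalog('valid_sales')} "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()

# Both queries share one rule, so the features reconcile to the report.
features_total = sum(spend for _, spend in feature_rows)
print(report_total, feature_rows, features_total)
```

Because the report and the model features both build on the same cataloged rule, a change to what counts as a valid sale only has to be made once.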

By putting these practices into place, you set the foundation to truly operationalize artificial intelligence and machine learning. You eliminate the redundant and often error-prone SQL written as part of the data science process and instead make your existing, best-in-class SQL assets the starting point for your next-level analytics:

  1. Your data scientists will have a rich set of governed and consistently calculated features for their modeling efforts without any additional effort.
  2. Your AI and ML models will be aligned with other analytics in your enterprise, making their predictions more accurate and increasing the trust your analysts have in their findings.
  3. Your development will move faster, with less re-work in the testing/hardening phase of data science deployment, because the models use already-vetted inputs.

If you’re ready to start capturing this untapped resource with us, schedule a product demo with our team today.
