Why Github Fails for Analytics Teams

As strong advocates for analytic code sharing and reuse, we’re often asked why we don’t integrate with GitHub or other Git repositories. The simple answer is that Git isn’t often a great fit for sharing SQL code. This post explains why, and suggests a better alternative.

It’s tedious to rewrite the same SQL over again. For analytic projects, however, the enemy isn’t just the inefficiency of writing the same code over and over again, it’s the inconsistency.  Rewrites rarely yield the same SQL with reproducible results. Your time is cheap compared to the crippling business cost of out-of-sync, orphan analytics.Discrepancies in results erode data and analytic confidence and will delay — or worse, misguide — key business decisions.

You likely have your own system for finding and reusing your own code snippets and other SQL assets.  Maybe it’s as basic as searching your SQL history to copy & paste code into your current project.  Or perhaps you have a running text document where you paste in useful code snippets for future reuse.

Eclectic, informal code catalogs may be fine when you’re working alone, but are wholly deficient for enterprise analytics which demand collaborative, scalable, consistent, reproducible results.  You need to implement a process to manage your analytics development lifecycle. This is the only way to successfully scale your analytics.

So what’s the solution to share, reuse, and ultimately scale SQL across teams, departments, and the enterprise? 

For engineers with a programming background, Git often comes to mind.  Though programming IDEs commonly integrate with popular cloud-based Git repositories such as GitHub and Bitbucket, there are three primary challenges driving the reason we believe Git repositories are not the best solution for analytic organizations to manage and share code.

Why Git Fails to Deliver for Data and Analytic Teams

Lack of contextual discoverability

Before SQL code can be reused, it must be found. A plain text search through files in a Git repository lacks necessary context such as how and where the code should be used. In analytics, this context is critical. Git provides no native mechanism to showcase robust insights such as on which data platform was a query last used, who ran it last, how many times it has been used, or if it has any downstream dependencies.

Lack of granular permission controls

Git permissions are set at the repository level. For proper data security, your shared SQL catalog permissions should align with your internal or established data governance policies. Given your analytic code could expose your data model, like where PII could be found, Git’s broad access control structure is unacceptable.

Excess technical complexity

Git incorporates technical capabilities such as branching, merging, and forking which aren’t typically necessary for analytic organizations, introduces needless complexity, and increases management overhead. To maximize the likelihood of SQL code sharing and reproducibility, you should use a system where both engineers and analysts of different technical fluency can hit the ground running with a minimal learning curve.

So how can analytics teams share & reuse their SQL in a way which encourages collaboration and consistency?

Envisioning The Ideal Shared SQL Repository

The right solution should incorporate a high degree of context around entities, robust governance, and easy versioning, all integrated in the place where analysts work.

IDE Integration

Avoid the pivot problem of switching tools and contexts. If you want to encourage sharing & reuse, analysts shouldn’t be expected to launch a separate software application.  Ideally, you’ll want the ability to save new items to a shared code catalog without interrupting your normal SQL workflow.  The same goes for searching through a catalog of reusable code.

Governance

You should be able to grant fine-grained permissions for the discoverability and editing of catalog entities so your analytic stewards can securely promote the most useful, efficient assets to your wider analyst community.

Variable/Parameter Functionality

Even the most reusable SQL statements often warrant filter modifications before execution. An ideal solution should incorporate IDE-processed syntax for variable parameters which can be edited at runtime.

Simple Versioning Capabilities

The one place we agree with Git is that code should be versioned. It’s important to know who changed what, when, and you should have the choice whether to update to the latest version or use a previous version of the code, depending on your use case. Doing so should be easy though, with minimal technical complexities.

Bonus: Inherited Consistency

The well-known problem with lots of traditional code sharing solutions is that you end up with dozens of isolated copies of the same code.  What happens when an important calculation needs to change?   A shared catalog substantially misses the mark if it’s merely a pretty UI for copying and pasting code snippets.  The right solution should allow SQL code to be referenced as an object which inherits any changes made to the master catalog entry.

A shared repository with the above capabilities ensures consistency and empowers collaboration.  Add the ability to schedule such objects, and your shared catalog becomes a central hub within your analytics pipeline.

We couldn’t find any option with such capabilities, so we built all of this into Coginiti Team!  Coginiti’s collaborative analytics catalog integrated directly into our highly capable SQL query and analysis workspace with deep database object support so data engineers and analysts can easily discover, reuse, and schedule curated analytic assets. Get a taste of this power by downloading your free trial of Coginiti Pro, our highly functional SQL Management software for individual analysts, and a steppingstone to Coginiti Premium, our full collaborative analytics platform.

Why Git Fails for Analytics Teams

Why Git Fails to Deliver for Data and Analytic Teams

Lack of contextual discoverability

Lack of granular permission controls

Excess technical complexity

Envisioning The Ideal Shared SQL Repository

IDE Integration

Governance

Variable/Parameter Functionality

Simple Versioning Capabilities

Bonus: Inherited Consistency

Footer1

Footer2

Footer3

Footer4