Introducing BigQuery ML for building predictive models with SQL | Computing
One key to efficient data analysis of big data is to do the computations where the data lives. In some cases, that means running R, Python, Java, or Scala programs in a database such as SQL Server or in a big data environment such as Spark. But that takes some fairly technical programming and data science skills not often found among business analysts or SQL programmers. In addition, if you have to extract, transform, and load your datasets from your data warehouse to another data store for machine learning, you introduce delays in the process.
To that end, Google has announced a beta release of BigQuery ML, a SQL-based extension to its enterprise data warehouse service, for building and deploying machine learning models. This is important because it lowers the barriers to training certain kinds of models and deploying predictive analytic services, both reducing the time required and bringing it within reach of data analysts and statisticians.
One SQL-style statement and six functions have been added to BigQuery’s SQL dialect to support machine learning: the CREATE MODEL
statement and the functions ML.EVALUATE
, ML.ROC_CURVE
, ML.PREDICT
, ML.TRAINING_INFO
, ML.FEATURE_INFO
, and ML.WEIGHTS
. Google has produced animated GIFs that show how to create a model with the CREATE MODEL
statement and generate predictions from the model using the ML.PREDICT
function in a SELECT
query.
You might wonder how all the formidable complexity of machine learning can be encapsulated in a SQL statement. The short answer is that it can’t, at least not now. BigQuery ML is nota general-purpose neural network modeling tool on the order of TensorFlow.
There are only two supported types of models: linear regression for forecasting (such as “What will next month’s sales be?”) and binary logistic regression for classification (such as “Will the credit applicant default?”). BigQuery ML can’t even handle multiclass logistic regression, although there’s a hint in the data structures used that multiclass logistic classification might be supported in a future release.