Google BigQuery Optimization Strategies
I am querying data from Google Analytics Premium using Google BigQuery. At the moment, I have one single query that I use to calculate metrics (like total visits or conversion rate). The query contains several nested JOIN clauses and nested SELECTs. While querying one table I am getting the error:
Error: Resources exceeded during query execution.
Using GROUP EACH BY and JOIN EACH does not seem to solve the issue.
One solution to be adopted in the future involves extracting the relevant data needed by the query and exporting it to a separate table (which would then be queried). That strategy works in principle, and I already have a working prototype for it.
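For reference, here is a minimal sketch of that staging-table prototype, assuming the google-cloud-bigquery Python client; the project, dataset, table and column names below are placeholders rather than my real ones.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Extract only the columns the metrics query actually needs into a smaller
# staging table, which is then queried instead of the original table.
# All names below are placeholders.
staging_table = "my-project.my_dataset.ga_sessions_staging"

extract_sql = """
    SELECT fullVisitorId, visitId, totals.visits, totals.transactions
    FROM `my-project.my_dataset.ga_sessions_20170101`
"""

job_config = bigquery.QueryJobConfig(
    destination=staging_table,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Run the extraction and wait for it to finish.
client.query(extract_sql, job_config=job_config).result()
```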
However, I would like to explore additional optimization strategies so that the query works on the original table.
In the presentation "You might be paying too much for BigQuery", some of them are suggested, namely:
- Narrowing the scan (already doing it)
- Using the query cache (does not apply)
the book "google bigquery analytics" mentions adjusting query features, namely:
- GROUP BY clauses generating a large number of distinct groups (already did this)
- Aggregation functions requiring memory proportional to the number of input values (probably does not apply)
- JOIN operations generating a greater number of outputs than inputs (does not seem to apply)
Another alternative is splitting the query by composing sub-queries, but at the moment I cannot opt for that strategy.
What else can I do to optimize this query?
Why does BigQuery have these errors?
BigQuery is a shared and distributed resource and, as such, it is expected that jobs will fail at some point in time. That is why the recommended solution is to retry the job with exponential backoff. As a golden rule, jobs should be retried a minimum of 5 times, and as long as a job is not unable to complete for more than 15 minutes, the service is within its SLA [1].
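As an illustration, here is a minimal retry sketch, assuming the google-cloud-bigquery Python client; the number of attempts matches the rule above, but the backoff base and the broad exception handling are simplifications (in practice you would only retry errors that are actually transient).

```python
import time

from google.api_core.exceptions import GoogleAPIError
from google.cloud import bigquery

client = bigquery.Client()


def run_with_backoff(sql, max_attempts=5, base_delay=1.0):
    """Run a query, retrying failed attempts with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            # result() waits for the job to finish and raises if it failed.
            return client.query(sql).result()
        except GoogleAPIError:
            if attempt == max_attempts - 1:
                raise  # out of retries, surface the error to the caller
            # Sleep 1s, 2s, 4s, 8s, ... before the next attempt.
            time.sleep(base_delay * 2 ** attempt)


rows = run_with_backoff("SELECT 1")  # placeholder query
```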
What can be the causes?
I can think of 2 causes that can be affecting your queries:
- Data skewing [2]
- Unoptimized queries
Data skewing
Regarding the first situation, it happens when data is not evenly distributed. Because the inner mechanics of BigQuery use a version of MapReduce, this means that if you have, for example, a music or video file with millions of hits, the workers doing the data aggregation for it will have their resources exhausted, while the other workers won't be doing much at all, because the aggregations for the videos or musics they are processing have little to no hits.
If this is the case, the recommendation is to uniformly distribute your data.
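As a sanity check, the sketch below (Python client again) counts rows per key and prints the heaviest ones; the table and the video_id column are hypothetical, so substitute whatever key your aggregations group on.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Count rows per key; a handful of keys holding most of the rows means the
# aggregation work will pile up on a few workers. Table and column names
# are hypothetical.
skew_check_sql = """
    SELECT video_id, COUNT(*) AS hits
    FROM `my-project.my_dataset.video_hits`
    GROUP BY video_id
    ORDER BY hits DESC
    LIMIT 10
"""

for row in client.query(skew_check_sql).result():
    print(row.video_id, row.hits)
```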
Unoptimized queries
If you don't have access to modify the data, the only solution left is to optimize your queries. Optimized queries follow these general rules (a small sketch applying several of them is shown after the list):
- When using SELECT, make sure you select strictly the columns you need, as this diminishes the cardinality of the requests (avoid using SELECT *, for example)
- Avoid using ORDER BY clauses on large sets of data
- Avoid using GROUP BY clauses, as they create a barrier to parallelism
- Avoid using JOINs, as these are extremely heavy on the workers' memory and may cause resource starvation and resource errors (as in not enough memory)
- Avoid using analytical functions [3]
- If possible, run your queries on partitioned tables [4]
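Here is a small sketch applying several of these rules at once, again assuming the Python client: only the columns the metric needs are selected, the scan is narrowed with a filter on the partition pseudo column, and there is no ORDER BY on the result. The table is a hypothetical ingestion-time partitioned copy of the Google Analytics export.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Select only the columns the metrics need and narrow the scan with a
# partition filter instead of reading the whole table (no SELECT *,
# no ORDER BY on the full result set). Names are hypothetical.
metrics_sql = """
    SELECT
        COUNT(DISTINCT fullVisitorId) AS visitors,
        SUM(totals.transactions) / COUNT(*) AS conversion_rate
    FROM `my-project.my_dataset.ga_sessions`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01')
                             AND TIMESTAMP('2017-01-31')
"""

for row in client.query(metrics_sql).result():
    print(row.visitors, row.conversion_rate)
```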
Following some of these strategies should help your queries have fewer errors and improve their overall running time.
Additional
You can't fully understand BigQuery unless you understand MapReduce first. For that reason I recommend having a look at Hadoop tutorials, like the one at TutorialsPoint.
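To make that intuition concrete without any Hadoop setup, here is a toy, purely in-memory map/shuffle/reduce sketch in plain Python: every value for a key lands on the same reducer, so a single hot key concentrates almost all the work on one worker, which is exactly the skew situation described above.

```python
from collections import defaultdict

# Map phase: emit (key, 1) for every hit. One key is "hot".
hits = ["video_A"] * 1_000_000 + ["video_B"] * 3 + ["video_C"] * 5
mapped = ((video_id, 1) for video_id in hits)

# Shuffle phase: group values by key. Every value for a given key goes to
# the same reducer, so "video_A" alone carries a million values.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: the reducer handling the hot key does nearly all the work.
totals = {key: sum(values) for key, values in groups.items()}
print(totals)
```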
For a version similar to BigQuery that is open source (and less optimized in every single way), you can check Apache Hive. If you understand why Apache Hive fails, you will understand why BigQuery fails.
[1] https://cloud.google.com/bigquery/sla
[2] https://www.mathsisfun.com/data/skewness.html
[4] https://cloud.google.com/bigquery/docs/partitioned-tables