Backup and Replication GuideΒΆ

The general flow is that you do a Managed Export of your Datastore entities to Google Cloud Storage. Then you load that data into Google BigQuery via a load job and do all further exporting and analysis from there.
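
Conceptually, the first step amounts to calling the Cloud Datastore Admin API's export method. The following is a minimal sketch of such a trigger on App Engine standard (first generation); the actual BackupHandler may well do this differently, and the function name and bucket argument here are assumptions for illustration:

import json

from google.appengine.api import app_identity
from google.appengine.api import urlfetch


def trigger_managed_export(bucket):
    """Start a managed export of all entities to gs://<bucket>.

    Sketch only; the real handler may filter kinds and record
    state. `bucket` would come from GAETK2_BACKUP_BUCKET.
    """
    project_id = app_identity.get_application_id()
    token, _ = app_identity.get_access_token(
        ['https://www.googleapis.com/auth/datastore'])
    return urlfetch.fetch(
        url='https://datastore.googleapis.com/v1/projects/%s:export' % project_id,
        payload=json.dumps({'outputUrlPrefix': 'gs://%s' % bucket}),
        method=urlfetch.POST,
        headers={'Content-Type': 'application/json',
                 'Authorization': 'Bearer %s' % token})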

This replaces gaetk_replication <https://github.com/hudora/gaetk_replication>, which was able to export directly to MySQL and to JSON on S3, although unreliably.

The following parameters in gaetk2_config.py (see gaetk2.config) define the behaviour of the managed export and the loading into BigQuery:

GAETK2_BACKUP_BUCKET defines where in Cloud Storage the backup should be saved. Defaults to google.appengine.api.app_identity.get_default_gcs_bucket_name().

GAETK2_BACKUP_QUEUE defines the TaskQueue to use for backup. Defaults to default.

GAETK2_BIGQUERY_PROJECT is the BigQuery project to load the data into. If not set, no data loading will happen.

GAETK2_BIGQUERY_DATASET is the dataset to use for the load job. If not set, google.appengine.api.app_identity.get_default_gcs_bucket_name() is used.
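
For illustration, a gaetk2_config.py using all four settings could look like this (every value below is a placeholder):

# gaetk2_config.py -- illustrative, all values are placeholders
GAETK2_BACKUP_BUCKET = 'example-app-backups'    # Cloud Storage bucket for exports
GAETK2_BACKUP_QUEUE = 'backup'                  # TaskQueue used for backup tasks
GAETK2_BIGQUERY_PROJECT = 'example-bq-project'  # omit to disable BigQuery loading
GAETK2_BIGQUERY_DATASET = 'datastore_mirror'    # dataset targeted by the load jobs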

To use the functionality, you have to add two entries to cron.yaml:

cron:
- description: Scheduled Backup and Source for BigQuery
  url: /gaetk2/backup/
  schedule:  every day 03:01
  timezone: Europe/Berlin
- description: Backup loading into BigQuery
  url: /gaetk2/load_into_bigquery
  schedule:  every day 05:01
  timezone: Europe/Berlin

See gaetk2.views.backup.BackupHandler and gaetk2.views.load_into_bigquery.BqReplication for the actual implementation.
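
The loading step in turn boils down to submitting a BigQuery load job with sourceFormat DATASTORE_BACKUP, pointed at the per-kind .export_metadata file the export wrote. Below is a minimal sketch via the BigQuery REST API; the helper name and the write disposition are assumptions, not necessarily what BqReplication does:

import json

from google.appengine.api import app_identity
from google.appengine.api import urlfetch


def start_bigquery_load(bq_project, dataset, kind, metadata_uri):
    """Submit a load job for one exported kind.

    `metadata_uri` is the per-kind file written by the export, e.g.
    gs://<bucket>/<timestamp>/default_namespace/kind_<Kind>/
    default_namespace_kind_<Kind>.export_metadata
    """
    token, _ = app_identity.get_access_token(
        ['https://www.googleapis.com/auth/bigquery'])
    job = {'configuration': {'load': {
        'sourceFormat': 'DATASTORE_BACKUP',
        'sourceUris': [metadata_uri],
        'writeDisposition': 'WRITE_TRUNCATE',  # replace the previous table
        'destinationTable': {'projectId': bq_project,
                             'datasetId': dataset,
                             'tableId': kind},
    }}}
    return urlfetch.fetch(
        url='https://www.googleapis.com/bigquery/v2/projects/%s/jobs' % bq_project,
        payload=json.dumps(job),
        method=urlfetch.POST,
        headers={'Content-Type': 'application/json',
                 'Authorization': 'Bearer %s' % token})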