Backup and Replication Guide
The general flow is that you do a Managed Export of your datastore entities to Google Cloud Storage. Then you load that data into Google BigQuery via a load job and do all further exporting and analysis from there.
This replaces gaetk_replication <https://github.com/hudora/gaetk_replication>, which was able to export directly to MySQL and to JSON on S3, although unreliably.
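The hand-over between the export and the load job described above is a Cloud Storage URI: for each exported kind, a BigQuery load job is pointed at that kind's .export_metadata file. The following is a minimal sketch of how such a URI could be built. The helper name export_metadata_uri and the bucket, prefix and kind values are hypothetical, and the all_namespaces path layout is an assumption that applies to exports made without a namespace filter. It is not taken from the gaetk2 implementation.

```python
def export_metadata_uri(bucket, export_prefix, kind):
    """Return the Cloud Storage URI of the export metadata file for one kind.

    Hypothetical helper: assumes the managed export was written without a
    namespace filter, so files are laid out below ``all_namespaces/``.
    """
    # Strip stray slashes so the prefix can be joined safely.
    prefix = export_prefix.strip('/')
    return (
        'gs://{bucket}/{prefix}/all_namespaces/kind_{kind}/'
        'all_namespaces_kind_{kind}.export_metadata'
    ).format(bucket=bucket, prefix=prefix, kind=kind)


# Placeholder values for illustration only.
uri = export_metadata_uri('my-app-backups', '2018-01-01', 'Order')
print(uri)
```

A load job would then use this URI as its source, with the source format set to Datastore backup.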
The following parameters in gaetk2_config.py (see gaetk2.config) define the behaviour of the managed export and of loading into BigQuery.
GAETK2_BACKUP_BUCKET defines where in Cloud Storage the backup should be saved. Defaults to google.appengine.api.app_identity.get_default_gcs_bucket_name().
GAETK2_BACKUP_QUEUE defines the TaskQueue to use for backup. Defaults to default.
GAETK2_BIGQUERY_PROJECT is the BigQuery project to load data into. If not set, no data loading will happen.
GAETK2_BIGQUERY_DATASET is the dataset to use for the load job. If not set, google.appengine.api.app_identity.get_default_gcs_bucket_name() is used.
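As an illustration, a gaetk2_config.py that sets all four parameters could look like the sketch below. The bucket, project and dataset names are placeholders and not values from any real project. Any parameter left out falls back to the default described above.

```python
# gaetk2_config.py: sketch with placeholder values only.

# Cloud Storage bucket the managed export is written to.
GAETK2_BACKUP_BUCKET = 'my-app-backups'

# TaskQueue used for the backup tasks.
GAETK2_BACKUP_QUEUE = 'default'

# BigQuery project to load into; leaving this unset disables loading.
GAETK2_BIGQUERY_PROJECT = 'my-bigquery-project'

# BigQuery dataset targeted by the load job.
GAETK2_BIGQUERY_DATASET = 'datastore_backup'
```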
To use the functionality, you have to add two handlers to cron.yaml:
cron:
- description: Scheduled Backup and Source for BigQuery
  url: /gaetk2/backup/
  schedule: every day 03:01
  timezone: Europe/Berlin
- description: Backup loading into BigQuery
  url: /gaetk2/load_into_bigquery
  schedule: every day 05:01
  timezone: Europe/Berlin
See gaetk2.views.backup.BackupHandler and gaetk2.views.load_into_bigquery.BqReplication for the actual implementation.