Backup and Replication GuideΒΆ
The general flow is that you do a Managed Export of your datastore entities to Google Cloud Storage, then load that data into Google BigQuery via a load job and do all further exporting and analysis from there.
This replaces `gaetk_replication <https://github.com/hudora/gaetk_replication>`_, which was able to export to MySQL and to JSON on S3 directly, although unreliably.
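The export step boils down to a call to the Datastore Admin API's ``export`` method, which writes the entities to Cloud Storage. A minimal sketch of building such a request (the function name and the ``backups`` path prefix are illustrative; the gaetk2 handlers wrap this for you):

```python
def build_export_request(project_id, bucket, kinds=None):
    """Build URL and body for a Datastore managed export.

    The Datastore Admin API expects a POST to
    ``projects/{project_id}:export`` with an ``outputUrlPrefix``
    pointing into Cloud Storage.
    """
    url = 'https://datastore.googleapis.com/v1/projects/%s:export' % project_id
    body = {'outputUrlPrefix': 'gs://%s/backups' % bucket}
    if kinds:
        # Omitting entityFilter exports every kind.
        body['entityFilter'] = {'kinds': kinds}
    return url, body
```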
The following parameters in gaetk2_config.py (see gaetk2.config) define the behaviour of managed export and loading into BigQuery:
GAETK2_BACKUP_BUCKET
    defines where in Cloud Storage the backup should be saved. Defaults to ``google.appengine.api.app_identity.get_default_gcs_bucket_name()``.
GAETK2_BACKUP_QUEUE
    defines the TaskQueue to use for the backup. Defaults to ``default``.
GAETK2_BIGQUERY_PROJECT
    is the BigQuery project to load data into. If not set, no data loading will happen.
GAETK2_BIGQUERY_DATASET
    is the dataset to use for the load job. If not set, ``google.appengine.api.app_identity.get_default_gcs_bucket_name()`` is used.
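Taken together, the defaults can be resolved roughly like this (a sketch; the function name and the plain dict stand in for gaetk2's actual config machinery, and ``default_bucket`` stands in for the value of ``get_default_gcs_bucket_name()``):

```python
def resolve_backup_config(config, default_bucket):
    """Resolve backup/replication settings, applying the documented defaults.

    `config` stands in for the values set in gaetk2_config.py.
    """
    return {
        'bucket': config.get('GAETK2_BACKUP_BUCKET', default_bucket),
        'queue': config.get('GAETK2_BACKUP_QUEUE', 'default'),
        # No project configured means the BigQuery load step is skipped.
        'bq_project': config.get('GAETK2_BIGQUERY_PROJECT'),
        'bq_dataset': config.get('GAETK2_BIGQUERY_DATASET', default_bucket),
    }
```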
To use the functionality, you have to add two entries to cron.yaml:
cron:
- description: Scheduled Backup and Source for BigQuery
  url: /gaetk2/backup/
  schedule: every day 03:01
  timezone: Europe/Berlin
- description: Backup loading into BigQuery
  url: /gaetk2/load_into_bigquery
  schedule: every day 05:01
  timezone: Europe/Berlin
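Behind the second cron entry, the exported data is handed to BigQuery as a load job. A sketch of the REST ``jobs.insert`` configuration such a load would use — the function name and URI value are illustrative, but ``DATASTORE_BACKUP`` is the source format BigQuery expects for Datastore export files:

```python
def build_load_job_config(project, dataset, kind, export_metadata_uri):
    """Build a BigQuery jobs.insert body that loads one exported
    kind from Cloud Storage into a table named after the kind."""
    return {
        'configuration': {
            'load': {
                # Managed exports are loaded via their export_metadata file.
                'sourceUris': [export_metadata_uri],
                'sourceFormat': 'DATASTORE_BACKUP',
                'destinationTable': {
                    'projectId': project,
                    'datasetId': dataset,
                    'tableId': kind,
                },
                # Replace the previous snapshot instead of appending.
                'writeDisposition': 'WRITE_TRUNCATE',
            }
        }
    }
```

With daily cron runs, ``WRITE_TRUNCATE`` keeps exactly one current snapshot per kind in BigQuery; appending instead would accumulate duplicates.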
See gaetk2.views.backup.BackupHandler and gaetk2.views.load_into_bigquery.BqReplication for the actual implementation.