
Practical guide to migrating between data centers in GCP

6 min read

Why the migration?

In this post I’ll describe how I managed our data center migration within GCP. At my $dayjob we had to migrate data centers because our clients require customer data to be stored and processed in a certain region. Since we were in the default GCP region (us-central1), we had to migrate everything to the new data center.

Our setup

We keep the stack rather simple and GCP-centric - we have a bunch of Cloud Functions (CF) (well, actually Cloud Functions for Firebase) and Cloud SQL (Postgres), we use Firebase Hosting to serve our frontend, and we use rewrites to reference our backend (a Cloud Function). We also use Cloud Tasks, Pub/Sub, and Cloud Storage (GCS). As you can see it’s nothing fancy, and it works well for us at our size.

We are currently not using any Infrastructure as Code (IaC) solution (e.g. Terraform) - it was never set up and, in my opinion, is not really needed. As a small team of 3 developers without experience in maintaining Terraform deployments, we’ve managed without it so far. I think it’d make sense to introduce an IaC solution once we grow the dev team. Cloud Functions are defined in code with the runWith Firebase syntax and we deploy using the firebase CLI - the firebase deploy command.
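
For context, a minimal sketch of how such a function is defined - the function name and options here are hypothetical, not our actual config:

import * as functions from "firebase-functions";

// A hypothetical HTTP function defined with the runWith syntax.
export const api = functions
  .runWith({ memory: "512MB", timeoutSeconds: 60 })
  .https.onRequest((req, res) => {
    res.send("ok");
  });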

Now let’s look at how we migrated the individual services.

Google Cloud Storage

A bucket’s region cannot be changed after creation, and since GCS bucket names are globally unique, the only way to keep the same name is to delete the bucket and re-create it in the desired region - effectively “swapping” regions:

  1. Create a temporary bucket
  2. Pour all files into the temp bucket (I used Transfer Service for that)
  3. Delete the original bucket
  4. Recreate the bucket in the new region with the same name (see the sketch after this list)
  5. Transfer the files back from the temp bucket
  • we did it in off-hours so no new uploads would be lost (though if any were lost, the user would just be prompted to upload again)
  • a potential issue is someone snatching the bucket name between the delete and the recreate, but our window was only about a minute
  • I wanted to do it through CLI commands using gcloud, but getting Transfer Service to work that way was a PITA - the UI auto-adds some IAM policies that you’d otherwise need to reverse engineer, so it was just easier to click through the UI
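
A minimal sketch of the delete-and-recreate step using the @google-cloud/storage client - the bucket name and region are placeholders, and as noted above we ended up clicking through the UI for the transfers:

import { Storage } from "@google-cloud/storage";

const storage = new Storage();

// Delete the (already emptied) original bucket, then immediately
// recreate it with the same name in the new region.
async function swapBucketRegion(name: string, newRegion: string) {
  await storage.bucket(name).delete();
  await storage.createBucket(name, { location: newRegion });
}

// Usage (after all files were moved to the temp bucket):
// await swapBucketRegion("my-app-uploads", "your-new-region");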

Webhook migration

Before we started using the Firebase Hosting rewrites, we had all webhooks set up to go directly to the URL of the Cloud Function - e.g. something like my-gcp-project.us-central1.cloudfunctions.net/api/webhooks/stripe. After you migrate the Cloud Function to another region there’ll be a new URL - my-gcp-project.your-new-region.cloudfunctions.net/api/webhooks/stripe. I needed to migrate all external services to go through the Firebase Hosting rewrite instead, which means the underlying URL can change and you only need to update the rewrite.
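
To make the idea concrete, here is the before/after with placeholder names - the Hosting domain is region-independent, so it survives the migration:

// Before: the webhook URL encodes the function's region and breaks on migration.
const directFunctionUrl =
  "https://my-gcp-project.us-central1.cloudfunctions.net/api/webhooks/stripe";

// After: the webhook goes through Firebase Hosting; the rewrite in
// firebase.json decides which function (and region) actually serves it.
const hostingUrl = "https://my-gcp-project.web.app/api/webhooks/stripe";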

Cloud SQL & Cloud Functions

Migrating the DB was crucial for us (and probably is for any business). We knew we’d have some downtime and had to make sure we wouldn’t lose any data during it.

  1. Create a DB replica in the new region
  2. Stop all background processes (for us that was Cloud Scheduler (cron jobs) & Cloud Tasks)
  3. Delete all Cloud Functions - a somewhat ridiculous step, and it means there’ll be some downtime until you redeploy them
  • after step 3 there should not be any new writes to the DB
  4. Wait until the Replication Lag is at 0 (see Cloud Insights)
  5. Promote the replica
  6. Update Cloud Functions to point to the new database - e.g. by updating the .runtimeconfig.json
  7. Update the rewrites section in firebase.json - when you change the region of the underlying Cloud Function, you need to reference the new region. TODO: Documentation
  "rewrites": [
   {
    "source": "/api/**",
    "function": {
     "functionId": "api",
     "region": "your-new-region"
    }
   },
  ]
  8. Deploy the Cloud Functions - you probably only need the functions that talk to your frontend first, but we deployed all at once
  • we verified we didn’t lose any new writes using our audit log table - we track writes to all tables and almost all columns (I absolutely love this feature)
  • we then deployed the Cloud Functions to the new region, adding the region to the runWith setup as runWith({}).region('your-new-region') - see the sketch after this list
    • it’s now needed to properly reference the Cloud Tasks queues, because Firebase assumes they are in the same region as the project - reference them as locations/${region}/functions/${queueName}
  • our total downtime (user app not working, API not working) was about 10 minutes
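
A minimal sketch of that region change in the function definition (same hypothetical function and options as in the setup sketch above):

import * as functions from "firebase-functions";

// The same hypothetical function, now pinned to the new region.
export const api = functions
  .runWith({ memory: "512MB", timeoutSeconds: 60 })
  .region("your-new-region")
  .https.onRequest((req, res) => {
    res.send("ok");
  });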

Gotchas

  • After promoting the replica, make sure to ANALYZE all your tables. Replicas do not keep statistics, so the query planner might get confused and choose suboptimal execution plans (a sketch of running ANALYZE follows after this list).
    • we found out about this in Cloud Insights, where we didn’t see the correct number of rows in the new DB (we saw 1 million rows instead of 60 million)
  • When changing regions, Cloud Tasks queues need to be referenced using locations/${region}/functions/${queueName} instead of just queueName.
    • we now have the helpers below to reference the new region & task queue
import { getFunctions } from "firebase-admin/functions";

export const getCloudFunctionRegion = () => "your-new-region";
export const getTaskQueueName = (queueName: string) =>
  `locations/${getCloudFunctionRegion()}/functions/${queueName}`;

// then in code
getFunctions().taskQueue(getTaskQueueName("my-queue")).enqueue({ userId });
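
And the promised ANALYZE sketch - a minimal version assuming the node-postgres (pg) client; the connection string is a placeholder:

import { Client } from "pg";

// Recompute planner statistics on the promoted replica.
async function refreshStatistics() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  await client.query("ANALYZE;"); // covers all tables in the current database
  await client.end();
}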

Lessons Learned and Some Smol Advice

  1. Have a demo environment where you can test this out
  • I first migrated our demo env, where I learned about some of the gotchas
  • migrating demo first also allowed me to prepare a detailed step-by-step plan for the production migration
  2. Migrate incrementally if possible. I started with the GCS migration because it doesn’t depend on the other services
  3. If possible, migrate in off-hours - you’ll have some downtime when switching the DB (well, and not having any API live)
  • you could also deploy a custom “We are under maintenance” page using Firebase Hosting, but we didn’t need that
  4. (Nice to have) - do not do the production migration alone! We had a meeting with the full engineering team (sounds like a lot but it’s 4 people). I did the migration and the others double-checked me and kept an eye on the system / cloud services. More eyes see more.
  5. Have a way to check that you did not lose any data between turning off the previous master and the promotion of the replica. As I said above, we have an audit log table.
  6. While we don’t utilize IaC, I think it would have been useful at the very least for enumerating all the services we utilize

All in all, this was a cool challenge that you won’t get to do often. I hope this guide is at least a bit useful for somebody - if so, do not hesitate to let me know! I’d love to hear your feedback.

Thanks for coming to my TED Talk 🐱