Rollback for microservices with Ansible and Jenkins

Imagine your project consists of 4 microservices (3 backends, 1 frontend). Yesterday you introduced several new features and made a release. Unfortunately, your users have just reported a bug: some important old features are not working. You need to roll back all services. If only it could be done with one button.

Actually it can be. In this article I’ll show you how.

Tech stack:

Jenkins for rollback automation
Ansible + Python for rollback script
Docker registry for storing release images
DC/OS for running apps

Overview

We will have a Python script, called via Ansible from Jenkins, as described in this article. The only difference is that we need two different tags to run: the first one gathers all available images, the second runs the rollback.

The get algorithm:

Request all images from the Docker registry. Filter them by environment, sort by date and take the last 10 for every repository.
Form JSON with repositories, images and dates and write it to the file system
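
To make the hand-off between the two steps concrete, here is a minimal sketch of what the dumped file might contain. The structure (an `all` list plus a `rollbacks` map from date to `{tag: repository}` entries) follows the get function shown later in this article; the repository names, tags and dates are made up for illustration.

```python
import json

# Hypothetical data: repository names, tags and dates are illustrative only.
dump = {
    "all": ["repo1", "repo2", "repo3"],
    "rollbacks": {
        "2019-03-01": [{"41-master": "repo2"}],
        "2019-03-02": [{"57-master": "repo1"}, {"12-master": "repo3"}],
    },
}

# Round-trip through JSON, as the get and run steps would do via the file system.
text = json.dumps(dump, indent=2)
restored = json.loads(text)

# The dates are the keys that later feed the Jenkins dropdown.
dates = sorted(restored["rollbacks"])
print(dates)
```

Each date holds a list because several repositories can have an image built on the same day.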

The run algorithm:

Read the JSON written by the second get step and create a Jenkins input
Take all available images for the selected date and do a rollback

The rollback itself:

Modify the docker image section in marathon json config
Start a deploy with modified config

Special case

Imagine a service which didn’t change in this release. It means there won’t be any rollback image available for it. But you still need to roll it back because of compatibility issues. The picture below shows an example of this situation.

If you select Today-1, only Repo1 and Repo3 will be rolled back, as there are no images for Repo2. Perhaps it wasn’t changed.

The problem here is that the N-1 versions of Repo1 or Repo3 could be incompatible with the latest version of Repo2. So you need to find the most recent version of Repo2 before the rollback date. It is the Today-2 version.
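
The selection rule can be sketched on toy data. This is only an illustration under assumptions: the dates, tags and the helper name `nearest_earlier` are hypothetical, and the data layout (date mapped to a list of `{tag: repository}` entries) mirrors the JSON used elsewhere in this article.

```python
from typing import Optional

# Hypothetical data: date -> list of {tag: repository}; all names are illustrative.
ROLLBACKS = {
    "2019-03-01": [{"40-master": "repo1"}, {"41-master": "repo2"}],
    "2019-03-02": [{"57-master": "repo1"}, {"12-master": "repo3"}],
}

def nearest_earlier(repo: str, date: str, rollbacks: dict) -> Optional[dict]:
    """Return the newest {tag: repo} entry for `repo` dated strictly before `date`."""
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    for d in sorted(rollbacks, reverse=True):
        if d >= date:
            continue
        for entry in rollbacks[d]:
            for tag, name in entry.items():
                if name == repo:
                    return {tag: name}
    return None

selected = list(ROLLBACKS["2019-03-02"])  # repo1 and repo3 have images on this date
fallback = nearest_earlier("repo2", "2019-03-02", ROLLBACKS)
if fallback is not None:
    selected.append(fallback)  # repo2 falls back to its image from the day before
print(selected)
```

If no earlier image exists for a repository, the helper returns None and that service is left as it is.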

Get rollbacks

We will have two actions for a rollback:

We gather all rollback dates and images available for the current environment.
The user selects a date and we perform a rollback.

Ansible side

Ansible changes are minor. Just add two tags for common steps (like requirements installation):

- name: "Copy requirements.txt"
  copy:
    src: "requirements.txt"
    dest: "/tmp/{{ role_name }}/"
  tags:
    - get
    - run

Don’t forget to add a tag to the always step, or your clean-up will be skipped when the playbook runs with tags. Tagging it with run only is preferred.

It is useful to register the output of the get step and print it with debug. This way you can use Ansible even without Jenkins.

- name: "Get rollbacks"
  shell: "source activate /tmp/{{ role_name }}/{{ conda_env }} ; {{ item }}"
  with_items:
    - pip install -r /tmp/{{ role_name }}/requirements.txt
    - "python /tmp/{{ role_name }}/rollback.py get
      --repo={{ repo }}
      --dump={{ dump_path }}
      --env={{ env }}"
  args:
    executable: /bin/bash
  tags:
    - get
  register: rollbacks

- debug:
    var: rollbacks.results[1].stdout
  tags:
    - get

Python side

With docopt you can use a single entry point with two commands, one for `get` and one for `run`.

Usage: 
  rollback.py get --repo=<r> --env=<e> [--dump=<dr>] 
  rollback.py run --date=<d> --env=<e> --slack=<s> --user=<u> --pwd=<p> [--dump=<dr>]

The fork itself:

if arguments['get']:
    return get(repo, env, dump)
if arguments['run']:
    return run(date, env, slack, user, pwd, dump)

To get rollbacks you need to call your Docker registry’s API first.
I assume that you use this image naming schema:
<private-docker-registry-host:port>/service-name:build-number-branch

You need to get all tags for the current repo, filter them by environment, then sort by date and return the last 10.

import requests

def get_rollbacks(repo: str, env: str):
    # list all tags of the repository
    r = requests.get(f'{DOCKER_REGISTRY}/v2/{repo}/tags/list', verify=False)
    if r.status_code != 200:
        raise Exception(f"Failed to fetch tags {r.status_code}")
    # keep only tags built from the current branch
    releases = list(filter(lambda x: x.endswith(env), r.json()['tags']))
    # map the creation date of each of the last 10 tags to {tag: repo}
    all_rollbacks = [(get_manifest(repo, tag), {tag: repo}) for tag in releases[-10:]]
    return dict(all_rollbacks)

Where repo is your `service-name` and env is the current branch.
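
One caveat: the order of the tags returned by the registry is not necessarily chronological, so slicing the last 10 entries may not give the 10 newest builds. Below is a minimal sketch of one way around this, assuming the `build-number-branch` tag schema above with a numeric build number that grows over time. The helper name `newest_tags` is hypothetical and not part of the original script.

```python
def newest_tags(tags, env, limit=10):
    """Keep tags for branch `env` and return the `limit` highest build numbers, oldest first."""
    suffix = f"-{env}"
    builds = []
    for tag in tags:
        if not tag.endswith(suffix):
            continue
        number = tag[: -len(suffix)]
        if number.isdigit():  # skip tags that do not follow the assumed schema
            builds.append((int(number), tag))
    # Sort numerically, so that build 100 comes after build 10 and not before build 9.
    return [tag for _, tag in sorted(builds)[-limit:]]

tags = ["9-master", "10-master", "100-master", "11-develop", "latest"]
print(newest_tags(tags, "master", limit=2))
```

Matching on the `-branch` suffix instead of the bare branch name also avoids picking up tags of another branch whose name merely ends with the same letters.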

Sorting by date is a bit complex. The date is not included in the tag information. The only way to get it is to fetch the manifest and check its history.

def get_manifest(repo, tag):
    r = requests.get(f'{DOCKER_REGISTRY}/v2/{repo}/manifests/{tag}', verify=False)
    if r.status_code != 200:
        raise Exception(f"Failed to fetch manifest {r.status_code}")
    history = r.json()['history']
    # newest 'created' timestamp among all history entries
    sort = sorted([json.loads(h['v1Compatibility'])['created'] for h in history])
    return sort[-1][:10]  # keep only the YYYY-MM-DD part
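
The date extraction can be tried without a registry. The sketch below relies on the schema 1 manifest format, which contains a `history` list with JSON strings under `v1Compatibility`; schema 2 manifests do not have this field. The manifest here is a hypothetical, heavily trimmed example and the helper name `latest_created` is illustrative.

```python
import json

def latest_created(manifest: dict) -> str:
    """Return the newest `created` date (YYYY-MM-DD) found in a schema 1 manifest."""
    created = [json.loads(h["v1Compatibility"])["created"] for h in manifest["history"]]
    # ISO 8601 timestamps in the same timezone sort chronologically as strings.
    return max(created)[:10]

# Hypothetical, heavily trimmed manifest; real ones carry many more fields.
manifest = {
    "history": [
        {"v1Compatibility": json.dumps({"created": "2019-03-02T10:15:30.5Z"})},
        {"v1Compatibility": json.dumps({"created": "2019-01-15T08:00:00.0Z"})},
    ]
}
print(latest_created(manifest))
```

Older history entries usually belong to base image layers, which is why only the newest timestamp is taken as the build date.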

The full get function:

def get(repo: str, env: str, dump: str):
    rollbacks = {}
    repos = repo.split(',')
    for r in repos:
        # group {tag: repo} entries of all repositories by date
        for date, rb in get_rollbacks(r, env).items():
            if date in rollbacks:
                rollbacks[date] += [rb]
            else:
                rollbacks[date] = [rb]
    print(rollbacks)
    if dump is not None:
        with open(path.join(dump, "rollback.json"), mode='w') as rb:
            json.dump({'all': repos, 'rollbacks': rollbacks}, rb)
    return rollbacks.keys()

Where repo is a comma-separated list of your service names, for example repo1,repo2,repo3. You also need to print rollbacks for Ansible debug.

Jenkins side

Let’s start the Jenkins pipeline with an environment input.

parameters {
   choice(choices: 'dev\nstage\nprod', description: 'Which environment should I rollback?', name: 'environment') 
}

If you use a master environment instead of prod, you don’t need to do anything. Otherwise you need to create a static variable rollback_env outside of the pipeline and fill it during the first step.

script {
    // need this as env names don't match each other. develop/master/stage in docker vs dev/stage/prod in marathon
    if (params.environment == 'prod') {
        rollback_env = "master"
    } else if (params.environment == 'stage') {
        rollback_env = "stage"
    } else {
        rollback_env = "develop"
    }
}

Then just check out your git repo with the Ansible playbook and run it.

git branch: 'master',
    credentialsId: '<your git user credentials id>',
    url: "<your ansible repo>"
ansiblePlaybook(
    playbook: "${env.PLAYBOOK_ROOT}/rollback_service.yaml",
    inventory: "inventories/dev/hosts.ini",
    credentialsId: '<your git user credentials id>',
    // rollback_env is the variable filled in the script block of the first step
    extras: '-e "repo=' + "${env.REPOS}" + ' env=' + "${rollback_env}" + ' slack=' + "${env.SLACK_CALLBACK}" + ' dump_path=' + "/tmp" + '" -v',
    tags: "get")

Please pay attention to dump_path. It tells the Python script to create the JSON directly in /tmp, so that we can read it from Jenkins. Let’s do it.

import groovy.json.JsonSlurper

def gather_rollback_dates() {
    def inputFile = readFile("/tmp/rollback.json")
    def InputJSON = new JsonSlurper().parseText(inputFile)
    return InputJSON['rollbacks'].keySet().join("\n")
}

This function reads the rollback file, gets all dates and joins them into a string with the \n separator. This is required to generate an input with a dropdown.

stage('Select rollback date') {
    steps {
        script {
            def userInput = false
            try {
                timeout(time: 120, unit: 'SECONDS') {
                    userInput = input(id: 'userInput',
                                      message: 'Select a date to rollback',
                                      parameters: [
                                          choice(name: 'rollback_date',
                                                 choices: gather_rollback_dates(),
                                                 description: 'One or more services have rollback at this date')])
                }
            } catch(err) {
                // timeout or abort: userInput stays false and the rollback is skipped
            }
            if (userInput) {
                print('Performing rollback')
                env.DATE = userInput
            } else {
                print('Skip rollback')
            }
        }
    }
}

It looks like this:

Perform a rollback

We have 5 actions for a rollback:

Read json from previous step
Find missing images for the selected date
Get marathon service ids from docker ids
Change marathon app’s config
Update app in marathon

Ansible side

Nothing special here. Just call python.

- name: "Perform rollbacks"
  shell: "source activate /tmp/{{ role_name }}/{{ conda_env }} ; {{ item }}"
  with_items:
    - pip install -r /tmp/{{ role_name }}/requirements.txt
    - "python /tmp/{{ role_name }}/rollback.py run
      --date={{ date }}
      --env={{ env }}
      --slack={{ slack }}
      --user={{ dcos_user }}
      --dump={{ dump_path }}
      --pwd={{ dcos_password }}"
  args:
    executable: /bin/bash
  tags:
    - run

Python side

Let’s start with the run method.

Read the JSON and select all available images for the selected date.

def run(date, env, slack, user, pwd, dump):
    json_data = read_rollbacks(dump)
    # sort by date so that older rollbacks can be searched in order
    all_rollbacks = OrderedDict(sorted(json_data['rollbacks'].items(), key=lambda x: x[0]))
    repos = json_data['all']
    images = all_rollbacks[date]
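
The read_rollbacks helper is not shown in this article. A minimal sketch, assuming it simply loads the rollback.json file that the get step wrote into the dump directory, could look like this (the sample data is illustrative):

```python
import json
import tempfile
from collections import OrderedDict
from os import path

def read_rollbacks(dump: str) -> dict:
    """Load rollback.json written by the get step from the `dump` directory."""
    with open(path.join(dump, "rollback.json")) as f:
        return json.load(f)

# Usage with a throwaway directory and illustrative data.
with tempfile.TemporaryDirectory() as tmp:
    with open(path.join(tmp, "rollback.json"), "w") as f:
        json.dump({"all": ["repo1"], "rollbacks": {"2019-03-02": [], "2019-03-01": []}}, f)
    data = read_rollbacks(tmp)

# Same ordering step as in run: dates sorted ascending.
ordered = OrderedDict(sorted(data["rollbacks"].items()))
print(list(ordered))
```

Sorting after loading matters because the order of the keys in the file depends on the order in which the repositories were processed.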

If images for some repos are missing, we need to find their older versions. Add this to your run method:

if len(repos) > 1 and len(repos) > len(images):
    get_missing_images(date, repos, all_rollbacks)

Where get_missing_images just goes through all_rollbacks and selects the image with the nearest earlier date for each missing service.

def get_missing_images(date, repos, all_rollbacks):
    images = all_rollbacks[date]  # select available images
    found_services = [list(rb.values())[0] for rb in images]  # get services from images
    missing = list(set(repos) - set(found_services))  # subtract to get missing
    for service in missing:  # populate images with rollback for every missing
        rollback = get_nearest_date(service, date, all_rollbacks)
        if rollback is None:
            print(f"Previous rollback for {service} not found")
        else:
            images += [rollback]  # extends the list stored in all_rollbacks in place

def get_nearest_date(repo, date, all_rollbacks):
    for d, images in reversed(all_rollbacks.items()):  # newest date first
        if d < date:
            for rb in images:  # check every image of that date, not only the first
                for rollback, image in rb.items():
                    if image == repo:
                        return {rollback: image}
    return None

After we have our images populated, we need to get Marathon service ids. Our Marathon ids use the standard /<department>/<environment>/<project>/<service-name> format. At this step we have only the service-name, so we should create a binding to the Marathon id.

We can do it by listing all applications running in Marathon and filtering them by environment and service name (I haven’t found a better solution).

def get_service_ids(env: str, images: list, user: str, pwd: str) -> dict:
    ids_only = get_marathon_ids_for_env(env, user, pwd)  # all running services for env
    services = {}
    for rollback in images:
        tag = list(rollback.keys())[0]
        id_part = rollback[tag]
        real_id = list(filter(lambda x: x.endswith(id_part), ids_only))  # filter by service-name
        if not real_id:
            raise Exception(f"Id {id_part} not found")
        services[real_id[0]] = tag
    return services

def get_marathon_ids_for_env(env: str, user: str, pwd: str):
    res = call_with_output(f'dcos auth login --username={user} --password={pwd}')
    if res.decode().strip() != 'Login successful!':
        raise Exception("Can't login to dcos cli")
    all_services = call_with_output('dcos marathon app list')
    matched = list(filter(lambda x: x.startswith(f"/ds/{env}"),
                          all_services.decode().split('\n')))
    return [m.split(' ')[0] for m in matched]
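
One thing to watch with the suffix filter: if one service name is the ending of another, endswith matches both and the first hit wins. A possible stricter variant, sketched here with hypothetical ids and a hypothetical helper name `match_service_id`, compares only the last path segment:

```python
def match_service_id(service: str, marathon_ids: list) -> str:
    """Return the single Marathon id whose last path segment equals `service`."""
    matches = [i for i in marathon_ids if i.rsplit("/", 1)[-1] == service]
    if len(matches) != 1:
        raise LookupError(f"Expected one id for {service}, found {matches}")
    return matches[0]

# Hypothetical ids following /<department>/<environment>/<project>/<service-name>.
ids = ["/ds/dev/shop/api", "/ds/dev/shop/public-api", "/ds/dev/shop/frontend"]

print([i for i in ids if i.endswith("api")])  # the suffix match is ambiguous here
print(match_service_id("api", ids))           # the segment match is not
```

Raising on zero or several matches keeps a rollback from silently going to the wrong application.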

After we have service ids we can iterate through them and do a rollback for each. Add this to your run method:

services = get_service_ids(env, images, user, pwd)
for service_id, service_tag in services.items():
    if slack is not None:
        notify_slack(slack, f"Rollback {service_id}: {service_tag}")
    print(do_deploy(service_id, service_tag))
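
The do_deploy function is not shown in this article. The sketch below covers only its first half, the change of the Docker image section in the Marathon JSON config, and assumes the app definition keeps the image under container.docker.image. The helper name `with_image_tag`, the registry host and the ids are hypothetical; how the modified config is then submitted to Marathon is left out.

```python
import copy

def with_image_tag(app_config: dict, tag: str) -> dict:
    """Return a copy of a Marathon app definition whose Docker image uses `tag`."""
    updated = copy.deepcopy(app_config)
    image = updated["container"]["docker"]["image"]
    # Treat the last ':' as the tag separator only if it comes after the last '/',
    # so a registry port (host:5000/...) is not mistaken for a tag.
    slash = image.rfind("/")
    colon = image.rfind(":")
    base = image[:colon] if colon > slash else image
    updated["container"]["docker"]["image"] = f"{base}:{tag}"
    return updated

# Hypothetical app definition; registry host and ids are illustrative.
app = {
    "id": "/ds/dev/shop/api",
    "container": {"type": "DOCKER", "docker": {"image": "registry.local:5000/api:58-develop"}},
}
rolled_back = with_image_tag(app, "57-develop")
print(rolled_back["container"]["docker"]["image"])
```

Working on a copy keeps the currently deployed definition untouched, which is handy for logging what changed.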

Well, that’s all. Don’t forget to add slack notifications for the rollback.

Jenkins side

The Python part was the most complex. On the Jenkins side you just need to call Ansible with the run tag and the selected date.

stage('Rollback') {
    when {
        expression {
            return env.DATE != null
        }
    }
    steps {
        ansiblePlaybook(
            playbook: "${env.PLAYBOOK_ROOT}/rollback_service.yaml",
            inventory: "inventories/dev/hosts.ini",
            credentialsId: '<your git user credentials id>',
            extras: '-e "date=' + "${env.DATE}" + ' env=' + "${params.environment}" + ' slack=' + "${env.SLACK_CALLBACK}" + ' dump_path=' + "/tmp" + '" -v',
            tags: "run")
    }
}

Summing up

The current solution is quite complex, but it allows you to run rollbacks both from Ansible via the CLI and from Jenkins. The latter is preferred, as you can see which user approved the rollback.

Happy firefighting, and I hope you’ll never need a rollback!