docker-local-dev, a developer-friendly solution for local development in a microservice world

About Vida

Vida Health is a virtual care company intentionally designed to treat mental and physical conditions together. Vida’s clinically validated approach combines an AI-powered, personalized mobile app experience with a national network of high-quality providers who work in a high-touch, care team model that maximizes engagement, outcomes and savings. Vida’s app offers video sessions, asynchronous messaging, and digital content and programs to help people prevent, manage, and reverse chronic conditions — like diabetes and hypertension — and the mental conditions that accompany them — like stress, depression, and anxiety. Some of America’s largest employers and health plans trust Vida’s whole health offering.

The problem we're trying to solve

The architecture at Vida has followed an evolutionary path similar to that of many other fast-growing startups.  Beginning with a single backend monolith, we have grown over time, adding infrastructure components like redis or rabbitmq and breaking the monolith up into multiple backend services.  While this evolution has brought benefits, it has also added complexity for our engineers.  Running the “backend” locally during development is no longer as simple as starting up a single Django app.  Instead, you need to deploy a number of Flask and Django apps, databases, and infrastructure components.

Here is a very simplified view of our architecture.  We run this architecture in a pre-production/staging environment as well as our production environment.  All releases are validated in staging before deployment to production.

Simplified architecture
  1. A proxy layer (running ingress-nginx when we run in the cloud)

  2. webserver, our primary backend (the original monolith).  This also uses celery.

  3. a variety of microservices.  The majority of these microservices need to make API calls to webserver or other microservices as part of their own implementation

  4. infrastructure components

    1. redis

    2. rabbitmq

    3. snowplow

    4. datadog

    5. split-synchronizer

    6. (many others not listed)

Our developer productivity had become severely impacted by this complexity.  The mental overhead necessary to understand every system deeply enough that you can tweak its configuration and debug it locally left little room for thinking about the actual problem you were working on.  We needed a development environment that 

  • supports any engineer (frontend or backend) getting a running system locally with minimal setup, ideally on their first day on the job

  • supports running or actively debugging only a subset of the system, to improve performance on developer laptops

  • becomes the primary development environment for backend engineers.  To do this, it must support live debugging and easy execution of code that is under active development.  By being the primary development environment, we can ensure that the toolset we build does not become stale

  • can be used to replace most of the testing currently done in our staging environment.  If we can ensure the development environment is close enough to our staging and production environments, local testing can replace the testing of in-development feature branches that we used to conduct in the staging environment

Different challenges for different types of engineers

If we want to build a development environment that supports our entire engineering organization, we need to consider the different needs and challenges that our frontend and backend engineers face.  Not only do they work on different systems, but their familiarity with different command line tools and concepts can vary significantly.

Frontend

We summarize the challenge for frontend engineers with this question: How can I deploy a working backend without having to individually set up each service and infrastructure component?

Each of our backend services lives in its own codebase.  Deploying one locally requires installing the correct version of Python, creating a Python virtual environment, and finding the correct startup command in the readme or project documentation.  Even the infrastructure pieces like redis or rabbitmq have varied methods of installation (brew, docker images, bash scripts).  Finding the different instructions in the wiki or documentation can be time consuming, and troubleshooting is very difficult if something goes wrong.

Our mobile engineers face an additional challenge: how to route traffic from the app on a physical test device to backends running on their laptop.  When running through the simulator or emulator this is straightforward (the emulator can be configured to use "localhost"), but when testing from an external device, they need a way to specify a url that will route traffic to their development environment.  If possible, this url should also be consistent over time (as opposed to using the laptop's IP address).

Backend

In addition to the complexity challenge they share with frontend engineers, our backend engineers face another problem: How can I reduce my reliance on testing in the staging environment?

Our main goal for backend engineering is to improve the stability of our shared staging environment.  Previously, individual developers would create a feature branch in whichever system they were modifying and deploy that branch to staging.  This allowed them to test changes to a single component while using all the other systems already deployed in staging.  For example, if you were testing a change in webserver, you would deploy just your branch, and use the proxy, database, rabbitmq, and other microservices already running in staging to do your testing.  While you were testing, however, no one else could deploy different webserver code, or your changes would be overwritten.  This blocked not only other engineers from testing their own feature branches, but also any pre-production verification of the next release.

If we can easily replicate enough of our service architecture and data into our local environment, there will only be a few, rare use cases where developers will need to use staging for testing.  Those cases will mostly be for systems that only exist in staging (kubernetes cluster, DNS, networking, etc.).

At the same time, we need to make sure the new environment continues to support critical developer workflows, such as deploying services under active development from their preferred IDE (PyCharm) and using all of the debugging and tracing capabilities of that IDE.

The approach

We decided to base our new development environment around a single GitHub repository that every engineer could check out onto their laptop and start up with a few simple commands.  We called this repo "docker-local-dev": the project is based around a docker-compose file that can start all of the services and infrastructure components and link them to each other appropriately.

For access from mobile devices, or for sharing the environment with other developers, we use ngrok to create a tunnel to the local environment that is accessible over the internet.  This tunnel url is unique to the developer, and can be consistently used for their development environment.

For seeding the environment with test data, we have scripts that copy database dumps from our staging environment.

Working with the repository

There are a few one-time steps that need to be taken when using the project for the first time.

We created a script "./setup/scripts/first-time-setup.sh", which takes care of these:

  1. authenticating to google cloud (where our docker images and database dumps are stored)

  2. downloading dumps of the staging databases (creating new dumps if existing dumps are stale or have been cleaned up)

  3. setting the developer's user name in a few configuration files that will be used later

For day to day development, a user modifies "./environment-config.yaml" to choose which services they want to deploy in their local docker environment, which they will deploy with PyCharm, and which they do not want to deploy at all.  Then, they run "./environment-update.sh" to apply those changes to all the necessary files, and finally "./up.sh" to restart the services with the new configuration.
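In shell terms, the day-to-day loop looks roughly like this (a sketch of the commands described above, with the edit step shown as a placeholder):

    # 1. choose which services run in docker, which run in PyCharm, and which are skipped
    $EDITOR environment-config.yaml

    # 2. regenerate the files derived from that config (docker-compose.yaml, .env, etc.)
    ./environment-update.sh

    # 3. restart the environment with the new configuration
    ./up.sh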

Using jinja templates to support selective deployment

Originally, the project used a plain docker-compose file (instead of a jinja template) and did not have the "environment-config.yaml" to control that compose file.  This meant that if a developer did not want to run a particular service, they needed to manually modify the docker-compose.yaml file to remove that service.  There are a few downsides to this workflow:

  • If a dev had modified the docker-compose file locally and we then modified it in the repository (adding support for an additional service, updating an image version, etc.), they would hit a git conflict when pulling the latest changes.  This can lead to a lot of wasted time manually resolving these conflicts.

  • We did not want to require all engineers to learn the ins and outs of docker-compose files, especially the nuances around how to configure networking for services to communicate with each other when some are not deployed with docker.

  • When deploying some services locally with PyCharm instead of docker-compose, multiple files need to be modified, not just the docker-compose.yaml.  By automating this with a script, we can ensure that all modifications happen at the same time, so all files are consistent with each other.  See the "Supporting services deployed with PyCharm" section for more info about what needed to be modified.

To address these downsides, we enhanced the project by turning several files (including docker-compose.yaml) into Jinja templates (see ./setup/jinja-templates).  These are configured using the "environment-config.yaml" file, and are generated by the "environment-update.sh" script.  See the "Overview of the files" section later for more details on the purpose and function of each file.
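To give a feel for the approach, here is a minimal sketch of what a fragment of the docker-compose template might look like.  This is illustrative only: the template variable names, image path, and tag are hypothetical, not copied from our repository.

    services:
    {% if "microservice" in in_docker %}
      # included only when "microservice" is listed under in_docker
      # in environment-config.yaml
      microservice:
        image: gcr.io/example-project/microservice:staging-1234   # hypothetical image/tag
        env_file: .env
    {% endif %}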

Supporting services deployed with PyCharm

Running code through docker-compose works great when you already have a docker image of that code.  As part of our deployment pipelines for our services, we build and upload docker images to Google Cloud Platform.  Our frontend engineers typically want to use the same version as is deployed in staging, so they can use these pre-built images.

However, for a backend developer actively developing a service, running through docker can be a significant overhead.  First, getting your code to run inside of docker requires either building a docker image every time you modify the code (slow) or mounting your code into the container.  Second, it is difficult to use the PyCharm debugger on code running inside docker.  While technically possible (see JetBrains docs), in our testing the performance was an issue and connectivity was sometimes spotty.

We wanted to ensure that developers could use PyCharm, the tool they were already comfortable with, with as few changes to their workflow as possible.  To do this, we

  • added PyCharm runtime configurations to each project with the correct configuration to connect to any other services running in docker-local-dev (see example)

  • added the "outside_of_docker" section to the environment-config.yaml, so that a developer can denote they will deploy the service locally with PyCharm instead of docker.

  • enhanced the logic behind the nginx vida-proxy configuration to handle a service correctly whether it is "in_docker" or "outside_of_docker"

How it all works

Network topology

We realized early on that most of the challenge of building this system was the network topology, and correctly routing traffic to a service whether it was deployed with docker or with PyCharm.

We would have liked to assign each service a consistent port to always be used, whether it was deployed through docker or through PyCharm, and configure services with "localhost:DESTINATION_SERVICE_PORT" to make calls to other services.  For example, this would have allowed the dockerized postgres container to be exposed on port 5432, and therefore look exactly the same as if the engineer had installed postgres through a more traditional route like "homebrew".  A webserver container running through docker or through PyCharm would access that postgres container at "localhost:5432".

We were not able to take this simple route for two reasons.  First, we use mac laptops for our developers, and docker for mac does not support the "bridge" network (see Docker documentation).  Second, we wanted to make our development environment more similar to our staging and production environments.  In staging and production, all traffic between services is routed through our vida-proxy nginx instance.  This allows for consistent logging, rate limiting, security validation, etc. of all requests.  We wanted to do the same thing locally: have all traffic between services be routed through nginx.

Since all traffic is routed through nginx, we now have only one system that needs to understand whether a service is deployed through docker-compose or PyCharm.  Correctly configuring nginx to handle this was a two-step process.

The .env file

The .env file is created from a jinja template (.env.jinja).  It contains all configuration that needs to be shared by multiple docker containers.  For network routing, we care about the properties that look like MICROSERVICE_HOST and MICROSERVICE_INTERNAL_PORT.  Jinja is used to set the host and port differently based on whether the service is deployed through docker-compose or through PyCharm.  When deployed through docker-compose, the host name used by the proxy to reach that service is the docker service name, and the port is the port that the Dockerfile of that service exposes.  If a service is deployed through PyCharm however, the host name used by the proxy is always "host.docker.internal" (which is a special DNS name that always resolves to the host on which docker is running), and the port is the port which matches the PyCharm run configuration.
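A minimal sketch of the idea, assuming a hypothetical microservice and illustrative port numbers (only the MICROSERVICE_HOST/MICROSERVICE_INTERNAL_PORT naming pattern comes from our actual setup):

    {% if "microservice" in outside_of_docker %}
    # deployed with PyCharm: the proxy reaches it on the host machine,
    # on the port from the PyCharm run configuration
    MICROSERVICE_HOST=host.docker.internal
    MICROSERVICE_INTERNAL_PORT=8001
    {% else %}
    # deployed with docker-compose: the proxy uses the docker service name
    # and the port exposed by the service's Dockerfile
    MICROSERVICE_HOST=microservice
    MICROSERVICE_INTERNAL_PORT=8000
    {% endif %}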

The vida-proxy.conf.template file

The vida-proxy.conf.template file is also created from a jinja template (vida-proxy.conf.template.jinja).  Jinja is used to remove the nginx configuration for services that are not deployed, as nginx will not start up properly if an upstream is not available.  This also uses the environment variables from .env to know which hostname and port to use for every backend service.  See the section titled "Using environment variables in nginx configuration" in the official docker image docs for more details about how this environment based templating works: it is a feature of the nginx docker image, but not a normal feature of nginx.
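Putting the two layers together, a hedged sketch of one upstream block in vida-proxy.conf.template.jinja might look like this (names are illustrative):

    {% if "microservice" not in not_deployed %}
    # The jinja condition removes the block entirely when the service is not
    # deployed, since nginx refuses to start with an unresolvable upstream.
    # The ${...} placeholders are substituted from the environment (.env) by the
    # nginx docker image's templating feature when the container starts.
    upstream microservice {
        server ${MICROSERVICE_HOST}:${MICROSERVICE_INTERNAL_PORT};
    }
    {% endif %}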

After a developer updates their environment-config.yaml, both the .env and vida-proxy.conf.template need to be regenerated with the new hostnames and ports so that service traffic is routed correctly.

This is an example of how the network layout and configuration would appear if a developer was running the webserver through docker-compose and the microservice through PyCharm.

Network layout


Editor's note: While writing this blog post, I realized that our solution with the .env jinja template might be more complicated than necessary.  When initially testing this project, we wanted developers to be able to test it without interrupting their existing workflow.  This meant that we could not use the same ports for any of the databases, infrastructure components, or services.  Now that the company has aligned around docker-local-dev as the only development environment, it could be simplified.  We could update every service to use a consistent port whether it is running through PyCharm or through docker.  Then, we could simplify the vida-proxy configuration to always use "host.docker.internal" to address every service, instead of sometimes using the internal docker service name.  This would probably add some latency with the extra hop through localhost, but the simpler configuration might be worth it.

Overview of the main files

  • environment-config.yaml

    • Used for: The main source of configuration for your environment

    • Generated by: ./setup/scripts/first-time-setup.sh. You shouldn’t need to re-run it.

    • Should be updated when: Any time you change which services you want to deploy. You might remove a service completely, or switch to deploying with PyCharm

  • ./environment-update.sh

    • Used for: Re-generating the docker-compose.yaml and .env after you modify environment-config.yaml

    • Generated by: static, checked in to GitHub

    • Should be updated when: Not Applicable

  • docker-compose.yaml and .env

    • Used for: Configuration files for docker-compose

    • Generated by: running ./environment-update.sh

    • Should be updated when: You should not manually update these, but anytime you modify environment-config.yaml, you should re-run ./environment-update.sh to refresh these two files

  • ./up.sh and ./down.sh

    • Used for: starting and stopping the docker-compose environment

    • Generated by: static, checked in to GitHub

    • Should be updated when: N/A

  • ./ngrok-tunnel.sh

    • Used for: Running your ngrok tunnel

    • Generated by: ./environment-update.sh

    • Should be updated when: It should never need updating, as you shouldn’t be changing the ngrok_subdomain in environment-config.yaml

  • ./setup/scripts/first-time-setup.sh

    • Used for: Initial setup of your environment

    • Generated by: static, checked in to GitHub

    • Should be updated when: N/A

environment-config.yaml

There are three main sections of the environment config (an illustrative example follows the list):

  • "in_docker": This lists all the services and infrastructure pieces you would like to be deployed with docker-compose

  • "outside_of_docker": This lists the services you will deploy with PyCharm

  • "not_deployed": This lists the services you do not want deployed at all, to help save resources

What does a service need to do to support local dev

There are two steps for adding support for docker-local-dev to an existing codebase

First, you need to modify the application to read all network and database configuration from environment variables (see example of how we read database configuration).  Recall that in the "Network Topology" section above, the url of a downstream service depends on whether it is deployed through PyCharm or docker-compose.  By reading these urls from the environment, we can use the jinja templated .env file to set these values correctly for all services that are being run through docker-compose.  

Second, it helps everyone to add a PyCharm runtime configuration and commit it to the repository (see example).  This allows anyone to start the service up immediately after checking out the project.

When reading config from environment variables, it is common to choose a default value in case the environment variable is not present.  For our purposes, we established the pattern of having the default value be the correct value when the service is run with PyCharm (not through docker-compose).  This allows us to keep the PyCharm runtime configuration very minimal.  If we had defaulted to the docker-compose values, we would have had to specify the correct value for every environment variable in the PyCharm runtime configuration.  It also would not have reduced the configuration in the docker-compose file much, as most of these variables are already set in the .env file, so they are very easy to reference in the docker-compose.yaml template.
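A minimal Python sketch of this pattern (the variable names and default values are illustrative, not taken from our services):

    import os

    # Defaults match the PyCharm (outside_of_docker) case, so the PyCharm run
    # configuration can stay minimal.  When the service runs through
    # docker-compose, these variables are overridden via the generated .env file.
    DATABASE_HOST = os.environ.get("DATABASE_HOST", "localhost")
    DATABASE_PORT = int(os.environ.get("DATABASE_PORT", "5432"))
    WEBSERVER_URL = os.environ.get("WEBSERVER_URL", "http://localhost:9000")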

Testing and troubleshooting

Here are some common scenarios you may encounter

  • the proxy (localhost:9000) returns a 502.  nginx returns a 502 response when it cannot connect to the upstream backend.  

    • First, confirm the proxy error message by checking the proxy logs ("docker-compose logs -f vida-proxy").

    • An upstream being completely unreachable is usually caused by it failing to start up.  Check its logs with either docker or PyCharm, depending on how it was deployed.

      • This is often caused by incorrect configuration of a service that the upstream depends on, or by that dependency not being up and running.

      • For services deployed with PyCharm, I have often simply forgotten to restart them

    • If the upstream service is running but still unreachable, the proxy configuration may be out of date.  Run "./environment-update.sh" and "./up.sh" to ensure it has the latest configuration.

    • nginx only resolves its upstreams when it starts up, so restarting the proxy ("docker-compose restart vida-proxy") can sometimes fix this

  • 404 for a url that should work

    • This is often caused by the base path configuration of a service not matching the proxy configuration.  When adding a new service to your local-dev, make sure that these agree.  In our example, this might occur if you tried to access localhost:9000/msvc instead of localhost:9000/microservice, or if the microservice was configured to expect it was running on localhost:9000/micro

    • This can also be caused by the trailing slash issue mentioned in Appendix A below

  • Missing static assets when loading a web page

    • This is typically caused by the trailing slash issue mentioned below in Appendix A.

    • Use the Inspector in your browser's Developer Tools to confirm which url is being used for the static assets.  Make sure that the url has the correct base path for the service (e.g. localhost:9000/microservice), and is not just using localhost:9000

Further Resources

Appendix A: More details about the nginx configuration

We learned some interesting things about nginx configuration while working on this project

upstream and depends_on

In order to transparently forward requests from the nginx proxy to the backend using the "proxy_pass" directive, we define each of our backends using the upstream module.  This also allows us to define the backends using their dns names (either the docker service name or host.docker.internal) instead of needing to figure out their IP addresses.  This comes with a downside, however: the services must be up and running before the nginx proxy is started, because nginx will fail to start if it cannot resolve any of its upstreams.  To solve this problem, we used the docker-compose "depends_on" option to make the proxy depend on all other services (see docker-compose template).  This ensures that docker-compose always starts the other services first, so their ip addresses are available to the nginx proxy when it starts.
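In the docker-compose template this looks roughly like the following sketch (the service list is abbreviated and the image tag is illustrative):

    services:
      vida-proxy:
        image: nginx:1.25
        depends_on:
          # started (and therefore resolvable) before the proxy
          - webserver
          - microservice
          - rabbitmq
          - redis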

trailing slashes with proxy_pass

Some applications (including Django) expect to be accessed from the root url of a domain (e.g. test.com or webserver.test.com).  The configuration necessary for them to work from a specific path on that domain (e.g. test.com/webserver) can be difficult (see Stack Overflow example) or impossible.


In our local development environment, we want to use path-based routing (test.com/webserver) for two reasons.  First, it matches our production configuration (where we try to keep only a small number of active DNS domains).  Second, we couldn't figure out how to easily configure multiple DNS names to route to localhost.


Luckily, nginx does have the ability to modify the request it forwards to the backend to make it appear as if it were being accessed at the root of the domain.  The proxy_pass directive has fairly complicated rules about what request url it passes to the proxied backend (see nginx docs).  We have found that if we do not include a trailing slash, the request is forwarded without modification, which is great if the backend supports configuration for path-based routing.  If the backend does not support that configuration, then we make sure to include a trailing slash in the proxy_pass.  This causes nginx to strip the location from the url, so the backend receives a request that looks as if it was sent to the root of the domain.


In our example project, you can see that both the microservice and snowplow needed to be configured with the trailing slash.  We configured webserver without a trailing slash since its location was already the root of the domain.
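A sketch of the two cases (the location paths and upstream names are illustrative):

    # Trailing slash on proxy_pass: nginx replaces the matched location prefix
    # with "/", so the backend sees the request as if it arrived at the root of
    # the domain (what microservice and snowplow needed).
    location /microservice/ {
        proxy_pass http://microservice/;
    }

    # No URI on proxy_pass: the request is forwarded unmodified, which works for
    # webserver since it is served from the root of the domain.
    location / {
        proxy_pass http://webserver;
    }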

Appendix B: Supporting multiple rabbitmq users and vhosts

Adding support for multiple rabbitmq users and vhosts on a single rabbitmq instance was a little bit tricky.  The docker image only officially supports setting users, permissions, and vhosts through a configuration file, and the passwords in that file are hashed, not plaintext.  We considered having a script that ran after the rabbitmq container started to create the additional users and vhosts using rabbitmqctl commands, but this would have slowed down the startup process every time.

So, we created the file we could load at startup (see rabbitmq docs about loading on boot and docs about the file format) with the following steps:

  1. Start up a rabbitmq docker container

  2. Run "add_vhost", "add_user", and "set_permissions" commands inside of that container to create the users and vhosts you will need to use

  3. Run "export_definitions" to dump the definitions to a json file, and copy this file out of the docker container.

  4. Remove all the configurations we don't want to set.  In this case, we want to keep only "permissions", "users", and "vhosts"

  5. Create a rabbitmq.conf file that configures rabbit to load the definitions.json file at startup

  6. Mount both the rabbitmq.conf and definitions.json files into the docker container

You can see in our example project the commands we ran, and how the configuration files are mounted.
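For reference, the steps map to commands roughly like these (the vhost, user, and password are illustrative, and the load_definitions key assumes a reasonably recent rabbitmq version):

    # inside a temporary rabbitmq container
    rabbitmqctl add_vhost microservice
    rabbitmqctl add_user microservice some-password
    rabbitmqctl set_permissions -p microservice microservice ".*" ".*" ".*"
    rabbitmqctl export_definitions /tmp/definitions.json

    # rabbitmq.conf, mounted into the container alongside the trimmed definitions.json
    load_definitions = /etc/rabbitmq/definitions.json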

Useful tools

http-https-echo

We made great use of the mendhak/http-https-echo docker image, which displays all components of the HTTP request it receives.  This was especially helpful when figuring out the nginx configuration, in particular the trailing slash needed for certain upstream backends.
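Adding it to the environment is as simple as another compose service along these lines (the port mapping is illustrative; the image listens on 8080 for HTTP by default):

    services:
      echo:
        image: mendhak/http-https-echo
        ports:
          - "8080:8080"

Pointing an nginx location at this container instead of a real backend shows exactly which url and headers the upstream would have received.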

What comes next

There are a few areas to improve upon

  • Secret Management:  There are some secrets we do not want to commit to the repository (e.g. datadog and split api keys).  Currently, developers set these as environment variables using their shell, but we would like to figure out a way to share these more easily and securely.  We will likely try to use GCP Secret Manager and have docker or the services load the secrets at startup.

  • Keeping images up to date: We do not use the "latest" tags on our docker images, instead preferring specific tags that are created when we deploy to our staging or production environments.  

    • This allows us to avoid two issues with the latest tag: we would need to add a "docker-compose pull" step to our process, which would potentially slow things down, and a developer might unintentionally update an image when all they meant to do was restart their containers.

    • However, it comes with the downside that we need to constantly update the images referenced within the project.  We currently do this manually on a monthly schedule, but automating it as part of deployments would be better.

  • Reducing resource usage: Running enough of the systems to have a mostly working backend can take significant resources on the developer's laptop.  For backend developers this has been mostly fine, but mobile developers have reported some issues.  Android Studio and Xcode both consume a significant amount of resources themselves, leaving less for the docker services, which can really slow down development.  We are considering how we might run some of these services in a remote environment within google cloud, but want to avoid both the cost of running the services more than we need to (as devs may forget to shut them down at night) and the cost of building automation to manage those environments and shut them down automatically.

Interested in more?

Are you looking to make a real impact with your work?  Be part of the team that’s eradicating chronic illness, both physical and mental. At Vida, we empower individuals to overcome chronic mental and physical health conditions, once and for all. Our people make it possible. Check out our careers page at https://www.vida.com/careers/.
