Trainline Europe is an online train and bus ticket booking platform, and the search API is a cornerstone of our product. It is responsible for fetching and aggregating results from the carriers’ APIs. Without it, we couldn’t display train schedules, and we wouldn’t be able to sell tickets. We are always trying to make it perform better, and we always make sure it highlights the most practical results for our customers.
One of my missions when I joined Trainline was to make sure the connection to external services (like carriers’ APIs) wouldn’t put our entire search API at risk. The goal was to take the carriers’ code out of the main application and run it in small workers that consume fewer resources. These workers must be efficient at what they do, and we should be able to start as many of them as we need. The code connecting to external services should also be isolated: if anything were to happen, we would rather see a carrier worker fail than the whole application.
A tour of our infrastructure
We use a Rails application to back an EmberJS frontend. So far, nothing unusual. The Rails app is connected to RabbitMQ: user searches are sent to another application built with EventMachine, which consumes each payload and returns the search results to the Rails app. This blog post is about that second application, the Ruby EventMachine search API.
Setting timeouts: A primitive form of control on external services
Sometimes the services we need to contact are down or very slow, and when one of these services is not accessible, we don’t want to compromise the whole application.
For example, if the user searches for Paris -> Madrid, we need to contact both SNCF and Renfe. If one of these APIs is not working, we need to display a comprehensible error asking the user to try again later. We can’t have the application hanging while user searches accumulate.
Setting timeouts in the HTTP requests is important, but we realized early on that we needed more control over each of these services.
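To make the idea of per-request timeouts concrete, here is a minimal sketch using Ruby’s standard `Net::HTTP`. It is only an illustration: the helper names (`build_carrier_client`, `with_carrier_timeouts`), the URL and the timeout values are assumptions, and the HTTP client we actually run inside EventMachine is not shown here.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds an HTTP client for a carrier endpoint with
# explicit connection and read timeouts (values are illustrative).
def build_carrier_client(base_url, open_timeout: 2, read_timeout: 5)
  uri = URI.parse(base_url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = open_timeout # seconds allowed to open the connection
  http.read_timeout = read_timeout # seconds allowed to wait for a response
  http
end

# Hypothetical wrapper: runs the block and turns a timeout into an
# "unavailable" result instead of letting the exception propagate.
def with_carrier_timeouts(carrier)
  { carrier: carrier, status: :ok, body: yield }
rescue Net::OpenTimeout, Net::ReadTimeout => e
  { carrier: carrier, status: :unavailable, error: e.class.name }
end
```

A caller could then wrap each carrier request, for instance `with_carrier_timeouts(:renfe) { client.get("/search").body }`, and show the “try again later” message whenever the status is `:unavailable`. This bounds how long one request can hang, but it does not isolate the carrier code from the rest of the process.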
Splitting the application
We decided to extract the code belonging to each carrier and run it standalone, meaning we can have instances of the app running only one carrier’s code. These instances consume less RAM: around 280MB vs. 450MB for the whole app. Since we know how many searches we have for each country, we can set the number of instances of the application accordingly: if 50% of searches have a French origin and destination, we know we’ll need 50% of our workers to be SNCF workers.
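The proportional sizing described above can be sketched in a few lines of Ruby. This is an illustration under assumptions, not our actual tooling: `allocate_workers` is a hypothetical helper, the search counts are made up, and the rounding strategy (largest remainder) is one reasonable choice among several.

```ruby
# Hypothetical helper: splits a fixed number of workers between carriers
# in proportion to their search volume. Integer arithmetic only, with the
# leftover workers given to the carriers with the largest remainders.
def allocate_workers(searches_per_carrier, total_workers)
  total_searches = searches_per_carrier.values.sum
  raise ArgumentError, "no searches recorded" if total_searches.zero?

  # For each carrier: [whole workers, remainder] of its proportional share.
  shares = searches_per_carrier.transform_values do |searches|
    (searches * total_workers).divmod(total_searches)
  end

  counts = shares.transform_values(&:first)
  leftover = total_workers - counts.values.sum

  # Hand out the remaining workers, largest remainder first.
  shares.sort_by { |_carrier, share| -share.last }
        .first(leftover)
        .each { |carrier, _share| counts[carrier] += 1 }

  counts
end

# Illustrative numbers only: half of the searches are French.
allocate_workers({ sncf: 500, renfe: 300, db: 200 }, 10)
# => {:sncf=>5, :renfe=>3, :db=>2}
```

Multiplying the resulting counts by the roughly 280MB a carrier instance uses gives a quick estimate of the RAM a server needs for a given mix of workers.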
We achieved that with Ansible and Monit. On each deploy, the Monit configuration is regenerated from an Ansible template, so we just have to update the Ansible YAML files to set the number of workers on the servers.
Example of Ansible deployment settings:
Going live: what could possibly go wrong
Before launching the split of our search application for a specific carrier, we were all happy, and we pressed the “deploy” button. Clients started to receive search responses handled by our split app, and it was great. However, something unexpected happened.
We use RabbitMQ to communicate between the applications, and by creating a new layer of service-specific instances of the search API, we increased the number of messages exchanged. Channels and queues were not recycled properly, so the number of TCP connections to RabbitMQ exploded and it crashed.
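One common way to keep the number of connections under control is to open a single connection per worker process and reuse channels instead of creating new ones for every message. The sketch below illustrates that idea only. It is not the fix we shipped: `BrokerRegistry` is a hypothetical class, and it assumes a connection object that responds to `create_channel` and `close`.

```ruby
# Hypothetical registry: one broker connection per worker process and one
# channel per carrier, both created lazily and then reused.
class BrokerRegistry
  # The block opens a connection. Assumption: the returned object responds
  # to #create_channel and #close.
  def initialize(&connect)
    @connect = connect
    @connection = nil
    @channels = {}
    @lock = Mutex.new
  end

  # Returns the channel for a carrier, creating the connection and the
  # channel only on first use.
  def channel_for(carrier)
    @lock.synchronize do
      @connection ||= @connect.call
      @channels[carrier] ||= @connection.create_channel
    end
  end

  # Closes the shared connection and forgets the cached channels, so the
  # next call to #channel_for reconnects.
  def close
    @lock.synchronize do
      @connection&.close
      @connection = nil
      @channels.clear
    end
  end
end
```

With a client such as the Bunny gem, the block could be something like `BrokerRegistry.new { Bunny.new(url).tap(&:start) }`, where `url` is a placeholder. The point of the design is that the number of TCP connections grows with the number of worker processes rather than with the number of messages.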
Setting timeouts is very important when contacting HTTP services, but completely isolating the code was the real answer for us. With this setup we can adjust the number of workers for each service according to our traffic. We can also dispatch the workers on servers with varying degrees of performance. We also know that if anything happens, the main application won’t be compromised. In short, it’s an invisible step for you, our customer, but a big step forward for our search API.