Captain Train allows to book train tickets across Europe by automatically combining multiple carriers. All you need is a few clicks on your computer or a few taps on your phone and we do the rest.
Let’s assume you’re traveling from London St. Pancras to Roma Termini. Here’s what your journey looks like on our website.
This journey is made up of three sections and two transfers. Each section of this journey must be booked with a different carrier (Eurostar, SNCF and Trenitalia). A previous blog post explains in greater detail and in French how we combine those carriers.
How do we find the best ticket for you then? In order to pick the best carrier combination, the first step is to compute multiple routes. A route is the way taken by a train in getting from London to Roma. Today we’ll focus on the technical details of our internal routing engine.
Routing is no rocket science
Computing routes for transportations systems that are on timetables (like trains or buses) is nothing new. It has been thoroughly studied by academics and implemented multiple times. A broad overview of the history of routing algorithms can be found in this blog post (in English).
We use the Connection Scan Algorithm published in 2013. It is an astonishingly simple algorithm that suits our needs.
As its name suggests, this algorithm deals with connections. A connection is the smallest step or leap a train can make going from one point to another without any stop. In other words, a connection models a train departing at a certain train station at a certain time and arriving at another train station at a certain time without intermediate halt. For instance, a journey starting from Paris at 17:56 and arriving in Lyon at 19:56 with no intermediate stop is a connection.
If the train was to stop somewhere along the way to let passengers enter or exit the train, the journey will amount to two connections with a stop in between. When it takes several connections to go from one train station to another, we call that a trip. For example, going from Paris to Lyon with a stop at Le Creusot is a trip made out of two connections (Paris-Le Creusot and Le Creusot-Lyon).
Sometimes a journey with several connections will be faster than a journey with only one connection (direct one). For the algorithm to come up with the optimal route between two points, many connections must therefore be scanned including the ones with stops in between. The most direct is not always the fastest.
Adjusting the algorithm to our needs
The authors of the original algorithm only considered one day of data, in other words one day of timetables, thereby leaving aside many issues. Since trains are also scheduled to run tomorrow and the day after tomorrow and next week and even in the next few months, we need to be able to compute a route up to 4 months ahead. How can we make up for this lack of data over time? A common pattern is to store a bitset stating which day a train runs.
first_day = "2015-01-01"
validity_pattern = 0b001101
if (validity_pattern) // if the train runs on January 4th
// Do something
That approach is frugal on memory use but causes many troubles with journeys that pass midnight or daylight saving. We therefore decided to simply repeat every connection as many times as a train runs. The extra memory consumption is largely compensated by the simplification of the algorithm which leads to an overall gain of performances since less memory lookups are required.
As a booking website, we want to suggest a set of various alternatives for our clients to chose from: the fastest journey, the journey with the least transfers, something in between, journeys with different travel times…
We obtain all those alternatives with only one run of our algorithm by searching all Pareto-optimal solutions according to the criteria (departure time, arrival time, number of transfers).
Such an approach computes multiples routes during the day. Thanks to that, not only will the algorithm return the route leaving at 8:00, but also the one leaving at 9:30 as well as the one leaving at 10:23. In some publications, this is called a range search.
Multiple departure and destinations
Big European cities have multiple stations. When the user enters Paris as its origin and Rome as the destination, we need to consider routes from any of the Parisian stations to any station in Rome.
Compute routes from one to all stations
Just like many other routing algorithms, the CSA computes all the routes from one station to all the others. While we do not use that property to combine carriers, it makes it possible to know what stations are reachable from a given station like Paris Gare du Nord.
Thanks to the multiple-criteria search, we compute in just one search every possible route all over one single day going from one station to each and every European station. We also include all the compromises between travel time and number of transfers. And it takes 100ms on average.
Here’s how it goes when you search for a train on our website:
1. We search the best route between train station A and train station B;
2. We ask availability and fares to every carrier operating on the route we just found.
In other words, the route search happens before asking the availability and fare to every carrier, not simultaneously. Therefore we have to make sure that the route computation is very fast—since the search doesn’t stop at that point and we still need to ask for fares and availability before being able to showing some search results to our clients.
In our simplest implementation to get the earliest arrival time, the computation for a request took roughly below two milliseconds. However, due to the multiple-criteria optimizations stated before, the actual computation time is higher. On our production server and over many millions of requests, 50% of the routes are computed in less than 30ms, and 95% of the routes are computed in less than 80ms.
The multiple-criteria approach is the cause of this significant overhead. The first reason is that it requires more complex data structure to hold a unknown number of possible routes to a station. The second reason is that the number of valid routes can grow rapidly. From a purely theoretical point of view, there can be an exponential number of solutions, but in practice there are rarely more than 10 valid connections between two stations in a day.
Prune bad solution early
To get those performances, we had to extend the algorithm for it to be able to prune some solutions, which means getting rid of the worst ones. Indeed, as we search for many results over the day, we scan the connections over a large span of time and the algorithm computes routes to stations very far away from the actual destination. Suboptimal solutions could easily pile up as a consequence.
To avoid considering connections that are not interesting, we build a static graph of the railway network. The cost on each edge is the shortest duration possible considering all the trains on that edge.
Before each search, we do a one-to-all Dijkstra search from the destination. This gives us a lower bound at any stations of the duration to reach the destination. This bound allows us to know that an intermediate solution will never be able to improve an existing solution reducing the search tree.
Dijkstra’s pruning technique must not be confused with A* that changes the order in which the nodes are considered. Even with Dijkstra’s pruning, we consider all the connections in the same sequence.
We are quite happy with the way our routing engine works. It allows us to deliver on our promise, which has always been to connect Europe by train. Of course we faced some challenges.
Handling time with great care
Time-related issues are a not a funny running gag. Unfortunately, our algorithm doesn’t really care about tomorrow. But trains do and so do we. For us to expand the time and take the future into account, we have to assign a Unix timestamp to all our connection times. By doing so, we avoids many issues with journeys that start today and run after midnight, that is to say tomorrow.
However, we’re not done with time yet. We still need to manage time zones for the Eurostar (Paris and London don’t belong to the same time zone) and daylight saving for night trains. Despite all our algorithmic efforts, the manipulation of raw data and time remains a bigger pain than the computing of the route itself.
Bellman condition is not respected
In a perfect world, every sub path of an optimal route is optimal. This is however not always the case with trains as we also want to minimize the number of transfers.
To travel from London to Venezia, the solution with the least transfers consists in taking the Eurostar from London to Paris and then the direct night train Thello from Paris to Venezia. A sub path of that route is London → Dijon. However, it is possible to get earlier to Dijon, leaving at the same time from London by taking a TGV from Paris to Dijon. The intermediate solution Eurostar + Thello is strictly dominated by the Eurostar + TGV solution and would normally be pruned.
A similar problem occurs with twin trains: two trains in a multiple unit run together for a few stations and then split their routes. We need to keep both as intermediate solutions even if they are strictly equal.
Join our CSA Challenge
Since the minimal algorithm is so simple we created a small challenge so that everybody can make their own implementation of the algorithm and compare them. Go ahead and try to be creative. And if you want to join us, we are hiring.