When bringing machine learning to ShippingEasy last year, decisions were made to implement the predictive aspects in Python, to make it a microservice, and to package and deploy it in all environments as a Docker container. This is the second of two posts detailing the thinking behind these decisions, the hurdles encountered, and some thoughts on microservices and Linux containers in light of the experience. You can read part 1 here.

Microservice Hurdles

While deciding on a microservice approach solved many issues, it gave rise to others. We now had to host and operate two applications, written in two languages, collaborating in distributed fashion, with all of the fragility that entails. And even though our engineers didn’t have to run the prediction app to develop the customer-facing aspects of the prediction feature, it would become necessary once we had to support the feature in the wild, reproducing production issues and validating their fixes.

Enter Docker

Here is where Docker and Linux containers (LXC) came into the mix. If you are not familiar with these technologies, you might want to read those links for context. An oversimplified description of Docker is that it is a way to bundle an application with its Linux runtime environment – the packages, software, libraries and resources needed for the application to do its job – into a container. The container-packaged application runs within its own process/user/filesystem sandbox on a host machine, with its own network interface. Containers are similar to virtual machines, but much lighter because they share the host’s kernel. Containers can be built from other containers to cut down on the amount of repeated provisioning tasks. Docker Hub makes it easy to push/pull containers and otherwise transport them around your environments and hosting services.
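For a concrete flavor of what that bundling looks like, a Python service could be containerized with a Dockerfile along these lines; the base image, packages and entry point here are purely illustrative, not our actual build:

```dockerfile
# Illustrative sketch only -- versions, packages and paths are hypothetical.
FROM python:2.7
RUN pip install flask scikit-learn redis elasticsearch
WORKDIR /app
COPY . /app
EXPOSE 5000
CMD ["python", "app.py"]
```

Everything the application needs is baked in at build time, so the image runs identically on a laptop or a production host.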

For our scenario, Docker offered many benefits:

  • We could build a container with the Python environment set up to run the prediction application.
  • Developers could run the fully integrated app locally using the prediction app as a Docker container without having to build out their own Python environment.
  • We would not have to make, build or install dependencies for Python and SciKit Learn in staging or production environments, either. Again, we did this once, in the container, and never had to repeat the work.
  • This would be a great opportunity to dip our toes into the container waters. If things worked well, we could leverage the lessons learned for future improvements to our infrastructure.
  • By learning how to build, deploy and host Docker containers in a production environment, a uniform and repeatable process could be developed. This would work not just for Python, SciKit Learn and the other dependencies for our little prediction application, but for any technology we might want to use in the future as the best tool for a particular job. It would simply have to be able to communicate using the same mechanism (HTTP).

Let that last bullet point sink in a bit, because I think this is where the hype of containers could be realized. If an application and all of its dependencies could be bundled into one or more containers, and any container can be deployed and hosted in the same uniform way, then you suddenly have seriously mitigated the operational overhead of embracing a polyglot approach to programming or persistence.

Docker in Development

Docker-Compose is used in development to manage dependencies for the main app. This includes all middleware and the prediction application. Our docker-compose.yml file looks something like this (omitting things like ports, volumes, environment vars, etc…).

# Persistence & middleware
postgres:
  image: postgres:9.4.1
redis:
  image: redis:2.8.19
memcached:
  image: memcached:1.4.24
elasticsearch:
  image: barnybug/elasticsearch:1.5.0

# Microservice app
predictor:
  image: se/autoship:1.0.0
  links:
    - redis
    - elasticsearch

This allows us to use Docker to manage and run all of our app’s dependencies locally – including the predictor microservice application. Simply by pulling the predictor container from Docker Hub, the app can be run in all of its distributed-but-collaborating glory. One downside of microservices mitigated.

There is another benefit to this setup. With persistence and middleware dependencies like Postgres, Elasticsearch, Redis and Memcached managed by Docker, we can easily switch versions in our dev environments and stay in sync with what is running in production. If you come from the Ruby/Python world, Docker acts like RVM/VirtualEnv, but for all of the infrastructure dependencies of your application. Docker thus also brings one closer to the realization of a 12-factor app and lessens the time it takes to get a dev environment up and running.

Docker in Production

Realizing the benefits of Docker in a production environment where you would need to span multiple hosts is not as easy as development (me add docker-compose.yml file to project, me smart). There are many cloud and self-managed options for this in various stages of development and readiness, including Amazon Container Service, Google Container Engine, Kubernetes, CoreOS/Fleet/Etcd, Docker Swarm, Deis, Flynn, Registrator, Weave and probably others that have leaked out of my head since I last looked.

Unfortunately, at the time we decided to take all of this on, none of the above efforts were ready for production use, or they would have been serious investments when we had no trust yet in Docker as a technology we wanted to commit to. So we rolled up our sleeves and came up with our own relatively simple approach, influenced by this blog post from Century Link Labs. Haproxy balances the load to the microservice cluster and Hashicorp’s Serf handles hosts joining or leaving the cluster. Hosts are provisioned with Chef, which does little more than install Docker, ensure host networking is set up correctly, and set up SSHD. The topology of our microservice infrastructure in production looks like this:

  • On the gateway host, we run an Haproxy load balancer container. It is exposed to our internal network segment via port 80. This is the predictor endpoint that our main application communicates with. Haproxy receives the requests and round-robins them to any application containers that have registered with the load balancer.
  • The load balancer container also runs a Serf agent so that it will receive member-join and member-leave events as hosts join the cluster.
  • On every host in the cluster, we run N predictor applications as Docker containers. These are exposed via Docker’s port forwarding. The ports on the host machine are within a predictable range, starting at 8000 and incrementing by 1 for each application container. Thus if we were running 12 application containers on a host, haproxy could forward requests to ports 8000-8011.
  • On every host, we run a Serf agent container that has links to each of the predictor applications being run on the host. The linking ensures that the Serf agent container is brought up last, after all of the application containers are ready to serve requests.
  • The Serf agent container on an application host joins the load balancer’s Serf agent, triggering a member-join event. The load balancer’s Serf agent reacts by rewriting Haproxy’s config to have backend server entries for each of the predictable ports on the new host (8000-8011). Haproxy is reloaded, and the new host and all of its containers are brought into rotation. The reverse happens with member-leave events.
  • Zero downtime deployments are accomplished through container versioning and deploying a new version of the container serially a host at a time.
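The config rewriting in the member-join step can be sketched as follows. This is an illustrative reconstruction in Python, not our actual Serf event handler, and the naming scheme is hypothetical:

```python
# Sketch of the backend-rewriting step a Serf member-join handler performs.
# Illustrative only -- function and server names are hypothetical.

BASE_PORT = 8000

def render_backends(members, containers_per_host):
    """Build haproxy 'server' lines for every predictor container on every
    known host. `members` is a dict of {hostname: ip} from Serf events."""
    lines = []
    for hostname, ip in sorted(members.items()):
        for i in range(containers_per_host):
            port = BASE_PORT + i
            lines.append("    server %s-%d %s:%d check" % (hostname, port, ip, port))
    return "\n".join(lines)

# On member-join, re-render the backend stanza and reload haproxy.
print(render_backends({"predictor-1": "10.0.0.11"}, 3))
```

The handler only needs the member list and the per-host container count; everything else about the topology stays in the predictable port convention.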

This gives us a modestly elastic ability to scale the predictor application independently of our primary application. The primary application only ever communicates via HTTP with the gateway host on port 80 (the Haproxy load balancer). Behind that, we can bring new hosts up and down to scale as needed, and they are brought into and removed from rotation via Serf’s magic.

Returning to the promise of containers: if we were to adopt more microservices that exposed themselves as web services, this infrastructure would be repeatable as long as the microservice could be communicated with over HTTP. A Java app using Neo4J as a data store? A NodeJS app with MongoDB? A Scala app using Riak? No problem.

I am hesitant to recommend our precise infrastructure to others. There has been an awful lot of commercial and community effort by very smart people put into the various container-serving technologies mentioned previously. They are maturing, and many have been ready for production use for a while. When/if we look at serving other applications (perhaps our main application) with Docker, I want to revisit our infrastructure. It is relatively simple and works, but the promise of these other technologies could make scaling and coordinating different Docker-served applications in a production environment even easier.


The microservice model is a way to use smaller-scale, distributed-but-collaborating applications to manage complexity that would grow in factorial fashion within a monolithic application. Microservices are particularly applicable to large organizations that would otherwise collapse under their own weight, but they are also useful in smaller organizations where the standard stack is a poor choice for a problem or where operational requirements dictate a split. There are many tradeoffs to microservices and distributed applications, not the least of which is operational complexity. Linux containers, however, hold great promise to lessen this operational burden. Microservices and container technologies like Docker complement each other – each increasing the other’s viability and/or value.

When bringing machine learning to ShippingEasy last year, decisions were made to implement the predictive aspects in Python, to make it a microservice, and to package and deploy it in all environments as a Docker container. This is the first of two posts detailing the thinking behind these decisions, the hurdles encountered, and some thoughts on microservices and Linux containers in light of the experience.

The Problem and its Solution

I’ve detailed the problem elsewhere, so I won’t spend too much time expanding on it here. We needed to use machine learning algorithms to predict how our customers would ship new orders, given a set of their past orders and how they were shipped. Ruby is an amazing, elegant language, and Rails is a great tool for bootstrapping a product and getting to market. But neither is a tool for scientific computing. Hardly anything exists for Ruby in the machine learning realm, and what does exist is not mature and lacks a large community of experts behind it.

The opposite is true of Python, however. It is widely used for scientific computing and has a great machine learning library in SciKit Learn. It proved itself through investigation into our problem, providing good results in a proof-of-concept. Python and SciKit Learn were the right tools for the job and gave us a solution to our domain problem.

The Problem with the Solution

As a lean start-up development organization, we are heavily coupled to Ruby/Rails. Our application is generally a monolithic Ruby on Rails app. We use Resque for background processing, and have some integrations split into engines, but by and large we are a monolith. The same Ruby/Rails code is deployed to web servers as is deployed to worker servers.

Our development team, while populated with some of the best engineers I’ve ever worked with, is by and large made up of Ruby/Rails developers first and foremost. A few of us are polyglots, but all of us have had at least our recent experience dominated by Ruby/Rails. It’s what we knew how to write, test, deploy, operate, maintain and support.

Bringing Python into the mix thus presented many problems. Our development team would need to have Python environments set up locally to run the app in its entirety. We would need to deploy and run this code written in a foreign language in staging and production. It would need to interact with our Ruby/Rails code. And once everything was working together, we would need to be able to support it operationally and as a product.

Enter Microservices

Martin Fowler and others at ThoughtWorks have organized their observations of how large tech organizations manage disparate teams working with different technologies so that the whole is at least the sum of its parts. They call the resulting ideas Microservices. I wish they had chosen a different name, simply because “service” is such an overloaded term; most of the time I hear someone discussing microservices, I think they misunderstand what it means, at least as Martin Fowler would define it. Or perhaps I am the one who misunderstands!

At any rate, our problems seemed to fit the case for a microservice. There was a natural dividing line around the business capability of predicting shipments and the rest of our app. That dividing line had an easy interface to design – given an order, predict a shipment. This functionality needed to be delivered using different technology than our primary stack. The app as a whole did not need to know anything about how the prediction was made, it just needed to ask for and receive an accurate prediction. Likewise, the predictive component needed only a small fraction of the data from our main application to make its predictions. Thus there was an easy partitioning of data for the two applications.

On the human front, our team as a whole didn’t need to know how the prediction application worked. To work on the customer-facing aspects of the feature, they just needed predictions to show up within the main application. So though we are not a large organization, splitting this functionality into its own application had benefits on the team front. Only a couple of people had to be burdened with how things worked internally within the prediction app. Everyone else could remain blissfully ignorant.

To me, the case for a microservice architecture emerged from the depths of our problem. There was no way we were going to remain solely a Rails application if we were going to deliver this feature. The appropriate tools to solve the business problem dictated the architecture, not the other way around. And it has clearly proven to be the right decision.

Move Over, Rube Goldberg

Now is a good time to pause and take a look at how the microservice architecture was shaping up. We decided on a web service rather than messaging for reasons I won’t elaborate on here. On the side of our main application, the gross architecture looks like this…

  • Orders flow into our system, either from a direct API call to us (push) or by us fetching the order from a store API (pull). The order is persisted in our primary database and a prediction is requested from a PredictionService within the main application.
  • The PredictionService asks a PredictionProxy for a prediction for the order.
  • The proxy is what actually talks to the Python microservice application. It takes the order, marshals it to JSON, makes the web request to the microservice, unmarshals the response and hands it back to the service.
  • The PredictionService takes the prediction, validates the data, builds a Prediction object in the main application and persists it associated with the order.
  • Within our main application interface, customers can see which orders can be shipped using our validated predictions. They then send the orders to an intermediary screen for review of the predicted choices of carrier, service, packaging and so on. From there they can purchase and print the shipping labels using the predicted choices in bulk.

On the Prediction application side, the gross architecture looks like this:

  • A web request for a prediction is received.
  • We attempt to look up a cached customer data model from Redis. This is a trained algorithm using recent data for the customer. We cache the customer data models for up to 24 hours as training the algorithm with a customer’s data set is an expensive operation.
  • If we have no cached customer model, we fetch the customer’s recent shipping data from Elasticsearch, build a trained model, and cache it in Redis for up to 24 hours.
  • We take the trained model for a customer and make a prediction for the order passed in with the prediction request.
  • The prediction is marshaled to JSON and returned as the response to the web request.
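The steps above can be sketched as a single lookup-train-cache function. This is a simplified reconstruction: a dict stands in for Redis, and the fetch/train callables stand in for the Elasticsearch query and the expensive scikit-learn fit; the function names are hypothetical:

```python
import time

DAY_SECONDS = 24 * 60 * 60

def get_trained_model(customer_id, cache, fetch_history, train, now=time.time):
    """Return a trained model for the customer, training and caching a new
    one for up to 24 hours if no fresh cached copy exists. `cache` is any
    dict-like store of (model, trained_at) pairs -- Redis in production."""
    entry = cache.get(customer_id)
    if entry is not None:
        model, trained_at = entry
        if now() - trained_at < DAY_SECONDS:
            return model
    history = fetch_history(customer_id)   # Elasticsearch query in production
    model = train(history)                 # expensive model training
    cache[customer_id] = (model, now())
    return model
```

Because training dominates the cost, the 24-hour cache turns most prediction requests into a single cache read plus a cheap model evaluation.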

Dividing the application functionality on this line allowed us to use the right tools for the job and to clearly separate the concerns of our app proper and the prediction component. In development environments, we can use a fake Prediction Proxy within our app to return canned responses to prediction requests. For developers, at least, our app can still run as it always has – as a single process Rails application. So far, so good.

The oft-cited downsides to microservices were about to rear their heads, however. Part 2 in this series will detail how Docker and Linux containers helped remedy them. Click here to continue on…

Last fall when I took on ShippingEasy’s machine learning problem, I had no practical experience in the field. Getting such a task put on my plate was somewhat terrifying, and even more so as we started to wade into the waters of machine learning. Ultimately, we overcame those obstacles and delivered a solution that allowed us to automate our customers’ actions with greater than 95% accuracy. Here are some of the challenges that we experienced when applying machine learning to the shipping & fulfillment domain, and how we broke through them.

Lost in Translation

Machine learning is a subfield of computer science stemming from research into artificial intelligence. It has strong ties to statistics and mathematical optimization, which deliver methods, theory and application domains to the field.

So sayeth the Wikipedia. These roots are where the lexicon of machine learning stems from. If you have not been working directly in machine learning, statistics, math or AI, or perhaps your exposure to them is long past, a discussion about machine learning will be hard to follow. Often it is taken for granted that you know what classification, regression, clustering, supervised, unsupervised, feature vector, sample, over-fitting, binning, banding, density and a host of other terms mean.

As a result, you will be somewhat lost until you can get familiar with this language. Getting a good book will help. I would recommend Machine Learning, a short course. It clocks in at less than 200 pages, and so is something that a working professional can consume. Even if you can’t follow everything in the book, reading through it will give you a foundation that will allow you to make use of all the other resources you may find online.

To give you a starting point to building your vocabulary, I will offer a few terms here that will help determine what type of machine learning problem you are dealing with.

Supervised vs Unsupervised Learning: Supervised learning is where you have a set of input data with known outcomes by which you wish to predict the outcome of future inputs. Our problem at ShippingEasy was of this type. We had past orders and shipments and needed to predict shipments given future orders. Unsupervised learning is where you have input data, but no known outcomes. You are searching for what features have meaning within a set of data. If graphed, the data will form clusters around the patterns of meaningful features.

Classification vs Regression: Within supervised learning, there are problems of classification and regression. Classification is where you wish to determine the class (output) of an input. For instance, predicting what shirt color a person may wear on a given day based on data about what shirts they have worn in the past. The different shirt colors are the classes that you are attempting to predict.

Regression is where you wish to determine a numeric value given other numeric inputs describing the sample. For instance, predicting an engineer’s salary based on age and years in the industry. Given enough past data, you could arrive at a statistically relevant salary figure given an arbitrary age and years in the industry (assuming true relationships between age, years in industry and salary).
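The salary example can be sketched as a one-feature least-squares regression. This is a toy illustration with made-up numbers, written in plain Python rather than a real ML library:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (one feature)."""
    n = float(len(xs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x

# Hypothetical training data: years in industry -> salary (in $1000s)
years = [1, 3, 5, 10]
salary = [60, 70, 80, 105]
a, b = fit_line(years, salary)
print(round(a * 7 + b, 1))  # predicted salary for 7 years -> 90.0
```

Classification would instead return one of a fixed set of labels; regression, as here, returns a number anywhere on the fitted line.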

Algorithmic Obsession

Once you have a foundation of concepts and language, you can start looking into all of the amazing resources on the web for machine learning. Stanford’s machine learning videos are great, as are mathematicalmonk’s youtube videos.

These are fantastic resources for learning how to write machine learning algorithms. But they turned out not to be of much use to me. Not because they are not great, but because the practical application of machine learning is about solving a domain problem, not writing machine learning algorithms. To make the point, consider this portion of a machine learning algorithm expressed in mathematical notation (which is yet another barrier to the uninitiated):
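A representative fragment of that sort of notation (chosen here for illustration, not reproduced from the original figure) is the gradient-descent weight update for regularized logistic regression:

```latex
% Representative example only: gradient-descent update for
% regularized logistic regression.
w_j \leftarrow w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m}
  \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
  + \frac{\lambda}{m} w_j \right]
```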

What does this have to do with the problem you are trying to solve? Absolutely nothing. This is a description of one portion of an algorithm that may be fed arbitrary data to produce statistically relevant results. It has been implemented by someone smarter than you in an open source library, or possibly a service offering, that has been well exercised by a large audience. You could implement it perfectly, and it could produce great results or really bad results. It all depends on the relevancy of the data you feed to it, which brings me to my last point.

It’s the Data, Stupid

While there are a tremendous number of resources for how to write machine learning algorithms, there are not many dealing with how to find relevant data within a domain that will allow an algorithm to produce accurate results. This is where you will find that you have spent most of your time, effort and creativity at the end of an applied machine learning project if you were smart enough to use a good machine learning library or service.

That algorithms dominate the resources for machine learning makes a certain amount of sense. Algorithms are generic and have applicability to many different scenarios. The K-Nearest Neighbor algorithm may be able to predict what movies you would like to watch on Netflix, or it might be able to predict which sex offenders are at high risk for recidivism. These different applications of K-Nearest Neighbor would need very different data surfaced from their respective domains and fed to them, however.
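To see how generic such an algorithm is, here is a toy k-nearest-neighbor classifier in plain Python. The data is invented, and the function knows nothing about what its numbers mean; only the features you choose give it domain meaning:

```python
from collections import Counter

def knn_predict(samples, labels, query, k=3):
    """Classify `query` by majority vote among the k nearest samples
    (squared Euclidean distance). Fully generic: nothing here knows
    whether the numbers describe movies, criminal records or shipments."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(s, query)), label)
        for s, label in zip(samples, labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up two-feature samples with two classes.
samples = [(1.0, 1.0), (1.2, 0.9), (8.0, 9.0), (9.0, 8.5)]
labels = ["likes", "likes", "dislikes", "dislikes"]
print(knn_predict(samples, labels, (1.1, 1.0)))  # -> likes
```

Swap in different feature vectors and labels and the identical code answers a completely different domain question.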

There exists an area of machine learning geared towards feature detection, and I won’t dismiss its validity. I will say, however, that if someone understands the domains of movie consumption and purchasing dynamics or criminal behavior, justice and rehabilitation, they have a leg up in practically applying machine learning to those domains. For even if there is a statistical correlation between day of the week and movie choices, it does not mean that there is a causative relationship between them. Understanding the domain can help you ascertain if it does.

Some of the data will be obvious. It winds up being a value in a column of a row in the database, and it screams its pertinence. Some will be much less obvious and will need to be inferred. For instance, for the sex offender recidivism problem, there are probably a number of criminal incidents, each with a timestamp for when it occurred. For any given person, the amount of time that has passed since their last criminal event, in days, might need to be calculated and included with the data sent to the algorithm. This ‘freshness’ of their criminal activity needs to be inferred from your data, and it may be a key to predicting future likelihood of behavior. Or it might not.
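Deriving such a ‘freshness’ feature is trivial once you know to compute it; a sketch with invented timestamps (the function name and fields are hypothetical):

```python
from datetime import datetime

def days_since_last_event(event_timestamps, as_of):
    """Derive a 'freshness' feature: days since the most recent event.
    Returns None when there is no history to infer from."""
    if not event_timestamps:
        return None
    latest = max(event_timestamps)
    return (as_of - latest).days

events = [datetime(2014, 1, 5), datetime(2014, 6, 1)]
print(days_since_last_event(events, datetime(2014, 6, 30)))  # -> 29
```

The hard part is not the arithmetic; it is the domain hunch that recency matters at all.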

I think the moral of the story here is that to really apply machine learning in a practical way, being a mathematical or statistical wizard is not the most important element of success. What I feel is more important is an understanding of the domain, to know what data is relevant, and an explorer’s curiosity: meaningful hunches and a willingness to explore and vet them. You will need to be comfortable employing something resembling the scientific method – ensuring accuracy is measurable, quantifying the effects of change, and meticulously exploring isolated changes to discover what data affects the system.

In conclusion

Employing machine learning to solve domain problems can provide huge value to a company or the public at large. Learning machine learning and how to properly apply it to a domain, however, can be challenging. You will need to develop a knowledge of the fundamentals of machine learning, but do not need to be a computer science, math or statistics guru to employ it. Leverage existing libs or services, and then focus your efforts on finding the meaningful data within the domain, both obvious and obscure, that will allow satisfactory results to be achieved.

At ShippingEasy, we take customers’ orders from various online storefronts and allow the customer to easily generate shipping labels to fulfill those orders at a reduced cost. To make life even easier on our customers, we wanted to automate the decision making involved in purchasing a label. After all, we had a large example set of data for them – the orders we’ve received and the various shipping choices that were made from them. Given that data, couldn’t we use machine learning to infer what actions a customer would take when confronted with an order in our UI?

The answer was yes. We developed a system dubbed AutoShip that allows us to predict customers’ shipping choices with great accuracy. We even made a snazzy marketing video with really soothing music!

Getting there was not so easy, however. Our application is written in Ruby on Rails. While our team is full of great software engineers and web development gurus, none of us were data scientists by trade. Here are 3 lessons learned or relearned on the journey to bring machine learning to our product.

1. Stand on the Shoulders of Giants

Ever hear that truism? Well, it was hammered home on this project. I love Ruby as a language and feel Rails is a pretty good framework for building web applications. But they are not exactly paragons of scientific or statistical computing. Looking for a machine learning lib in Ruby? Good luck. There are libraries in other languages that are strong in this area of computer science, however. Python’s SciKit Learn library is one example that far outclasses anything found in Ruby. So we spiked a proof of concept using it and were up and running towards a final solution.

The moral here is you really should use the best tool for the job. Ruby is a square peg to the round hole of machine learning.

2. Play To Your Strengths

We had a workable solution; now we needed to bring it to the product. Should we port to Ruby? Should we stick with Python, the language that had brought us this far? We decided on the latter, implementing it as a small microservice web application using Flask.

Services, even if you attach a micro prefix to them, bring complexity to all environments – development, test/CI, production. To help simplify this, I packaged the microservice application in a Docker container. Thus it could be run in development without RoR developers having to set up a Python environment. In production, we deploy the containerized application behind an Haproxy load balancer, with Serf managing the cluster of microservice containers so they automagically add themselves to the load balancer. This gives us a scalable infrastructure that can grow easily as our needs increase. I plan to elaborate on this further in a future post.
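A Flask service of this shape can stay very small. The sketch below is illustrative only: the route, payload fields and canned predictor are hypothetical stand-ins for the real AutoShip lookup-and-predict logic:

```python
# Illustrative Flask microservice sketch -- route and fields are hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_shipment(order):
    # Stand-in for the real model lookup and prediction.
    return {"order_id": order.get("id"), "carrier": "USPS", "service": "Priority"}

@app.route("/predict", methods=["POST"])
def predict():
    order = request.get_json(force=True)
    return jsonify(predict_shipment(order))

# In production this runs behind the Haproxy load balancer, e.g.:
# app.run(host="0.0.0.0", port=5000)
```

The main application never sees any of this; it only knows the HTTP contract.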

The upshot of all this is that implementing the feature as a microservice written in Python was much easier, at least for me, than trying to rewrite the excellent machine learning algorithms found in SciKit Learn in Ruby. I am a systems engineer with devops skills, and coming up with a robust and scalable microservice infrastructure was more easily accomplished than suddenly reinventing myself as a data science uber geek.

3. Polyglot Persistence. Yes, it’s a thing.

We use Postgres as our main database. All of our order and shipment data lived there. At first glance, it would seem like we should just use the data as found in the relational database to back the system. But for a number of reasons, we decided to use Elasticsearch as the repository for the data that our system would use.

First, it stores unstructured documents. We weren’t exactly sure what data we would need, or how it might evolve over time. We could shove literally anything into it and get it back out again without having to migrate a schema. This was a very nice boon when exploring the data to determine exactly what might yield the results we were after.

Second, it’s fast and scalable. By denormalizing order, shipment and prediction data into the same document, we would not have to do any complex joins when pulling the large sets of order/shipment/prediction data used to train a machine learning algorithm. And being elastically scalable means Elasticsearch can grow as much as we need it to.
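To make the denormalization concrete, here is a sketch of one such flat document and the training-set extraction; every field name here is made up for illustration, not our actual mapping:

```python
# One flat document per shipped order -- no joins needed to build a
# training set. All field names are illustrative.
doc = {
    "order_id": 1234,
    "customer_id": 42,
    "item_count": 3,
    "total_weight_oz": 18.5,
    "destination_zone": 5,
    # shipment outcome, denormalized onto the same document:
    "carrier": "USPS",
    "service": "Priority",
    "packaging": "flat_rate_box",
}

def to_sample(doc):
    """Turn one denormalized document into (features, label) -- a single
    scan over documents replaces a multi-table join."""
    features = [doc["item_count"], doc["total_weight_oz"], doc["destination_zone"]]
    label = (doc["carrier"], doc["service"], doc["packaging"])
    return features, label

print(to_sample(doc))
```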

Lastly, it has Kibana, an amazing data visualization tool. I had already set up an Elasticsearch/Logstash/Kibana stack as outlined in a previous post. Pointing it at our order, shipment and prediction data has given us a tremendous ability to delve into that data. Without writing any code, we could easily visualize the answers to questions like “For customer A, what were the packaging choices made by our prediction service for the predictions that proved inaccurate?” It was invaluable during the exploratory phase, and is perhaps even more so as the feature moves into support mode post-rollout.

At the end of the day…

While we had some stumbles along the way, we were able to achieve our goal of providing > 95% accuracy in predicting shipments for a customer’s orders. In doing so, we are providing a great service that makes our customers’ jobs much easier. As for myself, I got to learn quite a bit about machine learning and using Linux containers in a production environment. It was challenging, but a tremendous amount of fun. Thanks for the opportunity, ShippingEasy!

I’ve had to wear my dev ops hat for a bit at ShippingEasy recently in setting up an ELK stack to provide log aggregation and operational analytics. That is Elasticsearch, Logstash & Kibana. We’ve become pretty dependent on the infrastructure, as it enables us to keep an eye on how things are running and delve into problems in production when support escalations dictate devs get involved. Here is a view of our production web dashboard showing metrics like average response time, unicorn workers & queue sizes.

Web Dashboard

When this was originally set up, it was done with Logstash-Forwarder on each app server forwarding log events to Logstash which munged them and indexed them into Elasticsearch. We could then visualize those log events with Kibana. This is a typical (possibly naive) setup that looks something like this:

logstash-forwarder  > logstash > elasticsearch < kibana

To get views into the Rails stack, we parsed the log files using multiline and grok filters with custom patterns in Logstash. We got around log events interleaving with each other by having each unicorn process write to its own numbered log file. This worked well for a while, but eventually we started to run into problems as traffic ramped up towards the holiday buying season. Things would work for a while, but then gaps would start to show up in the events in Kibana, which would slow to a trickle and eventually stop.

Thankfully, it was not that our application was dying. Logstash was. Digging in, it turned out we had two problems that exacerbated and masked each other:

  1. Logstash could not keep up with the demands of munging all of the log events we were sending to it.
  2. Logstash 1.4.1-2 has a bug in its TCP input that causes it to have a connection leak when clients connecting to it start to time out due to the previous issue.

We fixed the second problem first, patching our version of Logstash with the latest code that fixes the connection bloom problem. With that cleared up, we could look at what the bottleneck was within Logstash.

Logstash is written in Ruby and runs on jRuby, and its internals are described as a pipeline. Events are processed by input, filter (worker) and output threads that do the work set up in the input/filter/output stanzas of the configuration. Each of these stages is fronted by a queue that can hold 20 elements. The threads pull from their queue, do their work, pass the result on to the next queue (or out) and repeat. Out of the box, Logstash allocates one thread to each input, a single filter worker thread, and one thread for each output. This looks something like this:

input source --> input thread   filter thread   output thread --> output destination
                            \   /           \   /
                            queue           queue
                            /   \           /   \
input source --> input thread   filter thread   output thread --> output destination

Problems crop up when any of these stages cannot pull from its queue faster than the queue is filling up. Logstash as a system backs up, with varying symptoms depending on what your input is. In our case, the symptom was Logstash-Forwarder connection timeouts and subsequent connection leaking on attempts to reconnect. If Logstash were pulling from a redis list as a queue, it would be queue bloat in redis.
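The backpressure those 20-slot queues create can be sketched with Ruby's SizedQueue. This is a toy model, not Logstash code: a producer (input thread), one slow stage (the filter worker) and a consumer (output thread), each joined by a bounded queue. When a stage cannot drain its queue, everything upstream of it blocks:

```ruby
require 'thread'

# Bounded queues between stages, like Logstash's 20-element queues.
INPUT_QUEUE  = SizedQueue.new(20)
OUTPUT_QUEUE = SizedQueue.new(20)

# Input stage: pushes events; blocks when INPUT_QUEUE is full.
producer = Thread.new do
  100.times { |i| INPUT_QUEUE << "event-#{i}" }
  INPUT_QUEUE << :done
end

# Filter stage: the single worker thread that became our bottleneck.
filter = Thread.new do
  loop do
    event = INPUT_QUEUE.pop
    if event == :done
      OUTPUT_QUEUE << :done
      break
    end
    OUTPUT_QUEUE << event.upcase # stand-in for grok/multiline work
  end
end

# Output stage: drains the output queue.
results = []
consumer = Thread.new do
  loop do
    event = OUTPUT_QUEUE.pop
    break if event == :done
    results << event
  end
end

[producer, filter, consumer].each(&:join)
puts results.size # => 100
```

If you add a sleep to the filter thread, you can watch the producer stall as soon as INPUT_QUEUE hits 20 elements, which is exactly the behavior that surfaced as Logstash-Forwarder timeouts for us.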

Using a combination of top and Java thread dumps, we could see our bottleneck was the filter worker thread. The input and output threads had little CPU use and looked to be blocked on their empty queues at all times. The filter worker thread was pegging a CPU core, however. Easy enough, let's just up the number of worker threads in our Logstash deployment.

Wrong. Remember that multiline grok filtering I mentioned earlier? It turns out that Logstash's multiline filter is not thread safe, and when you use it you are limited to a single worker thread. Okay, then you simply move the multiline event collection into the input area of Logstash using a multiline codec. Nope, that won't work either. The multiline filter allows you to specify a stream_identity attribute that can be used to keep events separated by file name. The multiline input codec offers no such thing, which would mean all our efforts to keep Rails multiline log messages separate from each other would be out the window.

Now we had to step back and re-evaluate the infrastructure. Ultimately, we decided to do the following:

  1. Do the multi-line event roll-up on the app server side. This would become the responsibility of whatever was tailing the logs and shipping them to Logstash. We could then chuck the multiline filter from Logstash and scale out our filter workers within a single Logstash process.
  2. Use a redis list as a broker between the tailing daemon on the app servers and Logstash, giving us some event durability and the potential to scale out to multiple Logstash processes on multiple machines to munge through our log data.

Logstash-Forwarder supports neither multi-line event roll-up nor shipping to redis, so this meant we had to find another tailing daemon that did, or deploy Logstash itself to each app server. We really did not want to do the latter, as it introduced Java dependencies and seemed very heavy for what needed to be done.

Enter Beaver, a log tailing daemon written in Python that supports both of the above requirements. We did a quick proof of concept to make sure it would work, deployed it to one web server to see how it performed over 24 hours and then pushed it out across all our servers. Things have been working well for several days with no service interruptions. Now our infrastructure looks like this:

beaver > redis < logstash > elasticsearch < kibana

One Logstash instance is still enough for us after pushing multiline roll-up responsibilities to Beaver on the app servers and being able to use multiple threads/cores for filter processing in Logstash. But when growing log traffic again starts to overwhelm Logstash, we are better positioned to scale out to multiple instances munging the data pushed to redis:
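On the Logstash side, swapping the lumberjack input for a redis input is a small config change. A sketch of roughly what that looks like; the host and list key here are illustrative:

```
input {
  redis {
    host => "redis.internal"
    data_type => "list"   # BLPOP from a redis list acting as the broker
    key => "logstash"     # list key that Beaver pushes events onto
    codec => "json"
  }
}
```

Because any number of Logstash processes can pop from the same list, this is also what makes the multi-instance diagram below possible without any coordination between the instances.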

beaver           logstash
      \         /        \
beaver > redis <          > elasticsearch < kibana
      /         \        /
beaver           logstash

It was an interesting 3-4 days spent in Logstash-scale land. It is an amazing tool that really helps us deliver a quality experience to users of our application. As part of an ELK stack, it is the 80% of Splunk that you really want, at no cost. But without paid licensing, you have to roll up your sleeves and get to work in cases like these. Fortunately, there is a great community behind it and lots of help to be found on the web and in #logstash on Freenode.

At ShippingEasy, we use the Ruby Prawn gem to generate shipping label PDFs for our customers. This is where we make our money, so having this be a fast and pain-free experience is crucial to our business. Prawn has generally produced finished PDFs reliably, but its performance has not been what we want, so I have started looking into how we can speed up this process. Here are some early results of benchmarking some options, including upgrading Ruby, pure jRuby, and jRuby invoking Java.

One thing I did early on was to just collect some basic benchmarking numbers for Prawn and its rendering of images into PDFs. There were 4 test groups:

  1. Prawn with Ruby 2.0.0 (at the time our current setup)
  2. Prawn with Ruby 2.1.2 (an upgrade we were undergoing)
  3. Prawn with jRuby and JIT compilation (no code changes)
  4. Prawn with jRuby delegating the PDF work to a Java class using PDFBox

The benchmark code used was Prawn's png_type_6.rb (or a Java equivalent) and yielded some interesting results…

Components                      Time    Speed Increase
Ruby 2.0.0 + Prawn              6.65s   (baseline)
Ruby 2.1.2 + Prawn              5.10s   130%
jRuby 1.7.12 (JIT) + Prawn      4.02s   165%
jRuby 1.7.12 + Java/PDFBox      3.26s   204%

My takeaways from this are:

  1. Upgrade to Ruby 2.1.2. Performance boost + no code change = win.
  2. jRuby's JIT compilation option is no joke. Your code is compiled to bytecode once, and subsequent invocations run that compiled bytecode faster than MRI can interpret Ruby.
  3. The interoperability between jRuby and Java is a nice feature. I came up through the Java ranks, so being able to drop down to it (instead of C) when performance demands a lower level is handy.

We have only upgraded to Ruby 2.1.2 at this point, and I do not know if we'll wind up doing anything else here. Even so, it's nice to know we have additional options if we need to continue to improve performance in this area.

For the Java/PDFBox benchmark, I used the following code:

# encoding: utf-8
$LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))

require "benchmark"
require "java"

# Load the jar containing the PDFBox-based CreatePdf class
require "target/javapdf-1.0-SNAPSHOT-jar-with-dependencies.jar"
java_import com.shippingeasy.javapdf.CreatePdf

pdf_creator = CreatePdf.new
N = 100

Benchmark.bmbm do |x|
  x.report("PNG Type 6") do
    N.times { pdf_creator.generate }
  end
end
And the Java class it delegates to:

package com.shippingeasy.javapdf;

import java.io.File;
import java.awt.image.BufferedImage;
import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.edit.*;
import org.apache.pdfbox.pdmodel.graphics.xobject.*;

public class CreatePdf {
  public void generate() throws Exception {
    PDDocument doc = null;
    try {
      doc = new PDDocument();
      drawImage(doc);
      doc.save("dice.pdf");
    } finally {
      if (doc != null) {
        doc.close();
      }
    }
  }

  private void drawImage(PDDocument doc) throws Exception {
    PDPage page = new PDPage();
    doc.addPage(page);
    PDPageContentStream content = new PDPageContentStream(doc, page);
    content.drawImage(xImage(doc), 0, 0);
    content.close();
  }

  private PDXObjectImage xImage(PDDocument doc) throws Exception {
    BufferedImage img = ImageIO.read(new File("data/images/dice.png"));
    return new PDPixelMap(doc, img);
  }
}

Lance Woodson

Dev + Data + DevOps