In true Christmas tradition, I have to post a blog article during the Christmas holidays.
This year I will show how you can remote control the headless browser PhantomJS using the open standard WebDriver protocol (also used by the Selenium project). The PhantomJS browser will run inside a Docker container hosted on a Linux Azure Web App. Why is that useful, you might ask? Let me give a few use cases.
- UI Testing
- Build your own screenshot API
- Web scraping
UI Testing
PhantomJS (often in combination with Selenium) is quite popular for browser-based testing. If you run a build server, you can run your tests on the build server itself, but if you have many UI tests that can take a fair amount of time, so you might want more test runners to cut the time it takes to complete the tests. The typical approach in bigger companies is to have some sort of test farm. Most of these farms probably use Docker containers today, but they still have to run on a Docker orchestration platform that you need to maintain, or even worse, on actual virtual machines that need maintenance. Azure Web Apps are a lightweight alternative: you can spin up as many web apps as you desire (each hosting one container) in minutes, and kill them equally fast so they incur no further charges. This way you get much of the scalability of a Docker orchestration platform without having to know anything about Docker orchestration.
Build your own screenshot API
A frequently asked question is how to programmatically take a screenshot of a web page. Unfortunately there is no simple solution, as web pages can be quite complex (unless you go with one of the many SaaS solutions). One option is to remote control a browser and take the screenshots that way. PhantomJS can be used for that, and building an API on top of our Azure Web App hosted PhantomJS Docker container is a simple task that I will show in a follow-up blog post.
Web scraping
Ideally, when doing web scraping, we want to avoid having to remote control a browser. In some scenarios, though, it is simply not possible to get the information we want to scrape without something that looks and acts like a real browser, and in those scenarios PhantomJS can be a good fit.
The solution I’m going to present focuses on the infrastructure aspects. It consists of a Docker container hosting PhantomJS and an Azure Resource Manager template that creates one or more Azure Web Apps hosting the container. The main reason for using Azure Web Apps as the Docker host platform is to bring awareness to this, in my opinion nice, lightweight approach to hosting containers in the cloud.
In the next blog post I will show how to remote control the PhantomJS browser using the Selenium WebDriver from C#.
The Docker Container
I started out searching Docker Hub for a PhantomJS container that I could use, and found wernight/phantomjs, which is maintained and looked exactly like what I wanted. The container worked flawlessly on my Windows 10 machine, but I had trouble getting it to work in Azure Web Apps.
The first obstacle was that the container runs in user space, that is, as a dedicated non-root user. That is a good thing if you are concerned about security (as far as I understand), but unfortunately it made the container crash on startup when trying to run in Azure.
The Azure web app keeps log files for the Docker environment under /home/logfiles/docker. These files contain valuable information if your Docker container crashes. The error when starting the container was
failed to register layer: Error processing tar file(exit status 1): Container ID 72379 cannot be mapped to a host ID
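If the log folder holds many files, a small script can help locate the relevant lines. The following is only a sketch: the function name `find_log_lines` is my own, and it assumes the files under /home/logfiles/docker are plain text that can be read line by line.

```python
from pathlib import Path


def find_log_lines(log_dir, needle="cannot be mapped to a host ID"):
    """Return (file name, line number, text) for each log line containing needle.

    Hypothetical helper: assumes the Docker logs are plain-text files.
    """
    root = Path(log_dir)
    if not root.is_dir():
        # Nothing to scan, for example when run outside the web app.
        return []
    matches = []
    for path in sorted(root.iterdir()):
        if not path.is_file():
            continue
        # Replace undecodable bytes instead of failing on odd log content.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    matches.append((path.name, number, line.rstrip("\n")))
    return matches


if __name__ == "__main__":
    for name, number, text in find_log_lines("/home/logfiles/docker"):
        print(f"{name}:{number}: {text}")
```

Passing a different `needle` (for example "Error") widens the search to other startup failures.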
To avoid running in user space, I copied the Dockerfile and created my own image with that part removed. You can find it here https://hub.docker.com/r/sjkp/phantomjs/
Now the container wouldn’t crash on startup, but I was still unable to set up PhantomJS correctly so that I could connect to it.
So at first I tried supplying the following startup command (as that is how I would start the container on my Windows 10 host):
-d -p 8910:8910 sjkp/phantomjs phantomjs --webdriver=8910
But that didn’t work at all. After a lot of fiddling around trying different things, I found out two important things that I have not seen documented anywhere.
- You can only use ports 80 and 443. If your container is listening on another port, you can’t port map it; the container itself has to expose 80 and/or 443
- You can’t pass arguments, so the --webdriver=8910 that tells PhantomJS to run with WebDriver on the specified port will not work either; those arguments are simply ignored
The solution I found was to change the port of the Docker image and also its default parameters. That way I can run the PhantomJS Docker container with the startup command as shown in the picture. There might be other ways to do it, but as of right now there is literally no documentation for Docker containers on Azure Web Apps, so this was the only way I could get it to work.
Azure Resource Manager Template
With a working Docker container, we can very easily build an Azure Resource Manager template that deploys one or more instances of our container to a Linux Azure Web App.
I’m not going to post the template in its entirety here, but you can find it on GitHub, where I also put a nice deploy to Azure button that will allow you to spin it up quickly.
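To give a feel for the shape of such a template without reproducing the one on GitHub, here is a rough Python sketch that generates one Linux App Service plan plus one web app per name. It is not the template from this post: the function name, plan name, SKU and API version are my own assumptions, and the `linuxFxVersion` value of `DOCKER|<image>` reflects how I understand current ARM templates to reference a container image.

```python
import json


def build_template(app_names, image="sjkp/phantomjs", plan_name="phantomjs-plan"):
    """Build an ARM template dict: one Linux plan and one web app per name.

    Illustrative only; plan name, SKU and apiVersion are assumptions.
    """
    plan_id = f"[resourceId('Microsoft.Web/serverfarms', '{plan_name}')]"
    plan = {
        "type": "Microsoft.Web/serverfarms",
        "apiVersion": "2022-03-01",
        "name": plan_name,
        "location": "[resourceGroup().location]",
        "kind": "linux",
        "sku": {"name": "B1", "tier": "Basic"},
        "properties": {"reserved": True},  # reserved=true marks a Linux plan
    }
    sites = [
        {
            "type": "Microsoft.Web/sites",
            "apiVersion": "2022-03-01",
            "name": name,
            "location": "[resourceGroup().location]",
            "kind": "app,linux,container",
            "dependsOn": [plan_id],
            "properties": {
                "serverFarmId": plan_id,
                # Assumed convention for pointing the app at a Docker image.
                "siteConfig": {"linuxFxVersion": f"DOCKER|{image}"},
            },
        }
        for name in app_names
    ]
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [plan] + sites,
    }


if __name__ == "__main__":
    # Hypothetical app names; web app names must be globally unique in Azure.
    print(json.dumps(build_template(["phantom-1", "phantom-2"]), indent=2))
```

Generating the site resources in a loop is just one way to get "one or more instances"; an ARM `copy` loop inside the template itself would achieve the same thing.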
Once your web app is up and running, you can test that it works by sending a POST request to
With the Content-Type header being set to application/json and with a body of
Then you get a session object in return that looks like
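A request of this kind can also be sketched in Python using only the standard library. This is an illustration under assumptions, not the exact request from this post: I am assuming the standard WebDriver "new session" endpoint `/session` and a `desiredCapabilities` body as used by the JSON Wire Protocol, and the function names and any host name you pass in are placeholders.

```python
import json
import urllib.request


def build_session_request(base_url):
    """Build (without sending) a WebDriver 'new session' POST request.

    Assumes the /session endpoint and a desiredCapabilities payload.
    """
    payload = {"desiredCapabilities": {"browserName": "phantomjs"}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/session",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(base_url, timeout=30):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_session_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `create_session` with the URL of your own web app should return the session object as a dictionary, provided the container is listening on port 80 as described earlier.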
That concludes the infrastructure needed for getting up and running with PhantomJS in an Azure Web App hosted Docker container. I will do a follow-up post where we do some interesting things with what we just created.