I wanted to learn how the public cloud works. Specifically, how do I deploy an application on the internet in 2021? To learn how to do this, I decided to deploy my fork of the awesome arxiv-sanity-preserver. This fork serves papers from the arXiv category cs.DC (Distributed, Parallel, and Cluster Computing). Since AWS is the most popular cloud provider, I will deploy the application on AWS. Let's get started.
The project README provides fairly good documentation on how to run the project. I set up and ran the website on my personal computer. Since I was only testing that everything works, I downloaded just 100 recent papers by passing the --max-index argument.
Following all the steps, I can access the application on my local machine by browsing to http://localhost:5000.
Now, I want to deploy the application to AWS. Specifically, I want to access the app from any computer connected to the internet.
I created an EC2 instance running Ubuntu. It has 1 vCPU, 1 GB of RAM, and 8 GB of attached SSD storage (this is free in the AWS free tier). Instructions here.
When creating the instance, I allow access to the SSH port (22) and HTTP port (80). This is so we can SSH into the machine, and communicate with our app running on the machine via HTTP.
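The same two inbound rules can also be written down in code instead of being clicked together in the console. The sketch below only builds the rule list, in the shape that boto3's `authorize_security_group_ingress` call accepts for its `IpPermissions` parameter. The helper name `build_ingress_rules` and the open-to-everyone CIDR `0.0.0.0/0` are assumptions made for illustration, not part of the setup described here.

```python
def build_ingress_rules(ports, cidr="0.0.0.0/0"):
    """Build one inbound TCP rule per port, in the IpPermissions shape of the EC2 API."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            # Assumption: allow any source address; SSH is usually restricted further.
            "IpRanges": [{"CidrIp": cidr}],
        }
        for port in ports
    ]


# SSH (22) and HTTP (80), the two ports opened when creating the instance.
rules = build_ingress_rules([22, 80])
for rule in rules:
    print(rule["IpProtocol"], rule["FromPort"], rule["IpRanges"][0]["CidrIp"])
```

With boto3 installed and AWS credentials configured, the list could then be passed as `IpPermissions=rules` to `boto3.client("ec2").authorize_security_group_ingress(...)`, together with the `GroupId` of your own security group.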
After creating the AWS instance, I connected to it using SSH. Then, I installed the relevant dependencies for my app using apt:
$ sudo apt update
$ sudo apt-get install imagemagick poppler-utils python3-virtualenv sqlite3
and installed mongodb using:
# install mongodb
# https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
$ wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | sudo apt-key add -
$ echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/4.4 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.4.list
$ sudo apt-get update
$ sudo apt-get install -y mongodb-org
# To prevent unintended upgrades, pin the package at the currently installed version
$ echo "mongodb-org hold" | sudo dpkg --set-selections
$ echo "mongodb-org-server hold" | sudo dpkg --set-selections
$ echo "mongodb-org-shell hold" | sudo dpkg --set-selections
$ echo "mongodb-org-mongos hold" | sudo dpkg --set-selections
$ echo "mongodb-org-tools hold" | sudo dpkg --set-selections
Then, I installed the app dependencies as in 1. Run Locally, and ran the app. To start the application in production mode, I run serve.py with the --prod flag, i.e. python serve.py --prod. By default the app runs on port 5000. Since the storage on the instance is limited (8 GB), for now, I only fetched 400 papers. I will fix this in the future.
To check that the app is running, run the following in another terminal on the instance:
$ curl localhost:5000 | less
This shows the HTML response of the web server. The server logs (the output in the terminal running python serve.py --prod) will show which HTTP resource was accessed and the response status (200 OK, if everything works correctly).
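The same check can be scripted, which is handy once it needs to be repeated after every restart. The following is a minimal sketch using only the Python standard library; `check_app` is a hypothetical helper name, and the default URL simply mirrors the port the app listens on.

```python
import urllib.error
import urllib.request


def check_app(url="http://localhost:5000", timeout=5.0):
    """Return (status_code, first_bytes) for url, or (None, reason) if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Read only the start of the page; enough to see that HTML came back.
            return response.status, response.read(200)
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code, b""
    except (urllib.error.URLError, OSError) as err:
        # Nothing is listening on that port, or the connection timed out.
        return None, str(err)
```

Calling `check_app()` on the instance should return a status of 200 while the server is up, and `None` with a reason string when it is not running.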
Now, the application is running on port 5000, but the instance only allows public access on HTTP port 80. So, navigating to the public IPv4 address of the instance (i.e. http://instance-ip-address) will not load anything. There are 2 options now.
The first is to also allow access to port 5000 and visit http://instance-ip-address:5000. That is clunky and bad.
The second is to redirect the traffic arriving on port 80 to port 5000. To redirect the traffic, I use the Linux iptables command (reference):
$ sudo iptables -t nat -A PREROUTING -i eth0 -p tcp --dport $srcPortNumber -j REDIRECT --to-port $dstPortNumber
where $srcPortNumber is replaced with 80 and $dstPortNumber with 5000. Now, on navigating to http://instance-ip-address, I can see the app running!
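However, testing this by running $ curl localhost | less on the instance itself would not work, because the PREROUTING chain only applies to packets arriving from outside, not to traffic generated on the instance itself. To enable the redirect for local traffic as well, I invoke the iptables command as follows (reference):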
$ sudo iptables -t nat -A OUTPUT -o lo -p tcp --dport 80 -j REDIRECT --to-port 5000
Now, the HTTP response from the instance can be checked using curl as well.
At this stage, the application is running on an AWS EC2 instance, and is publicly accessible by visiting http://instance-ip-address!
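To confirm from a different machine which ports actually accept connections, a plain TCP check is enough. This is a small sketch using the standard library; `port_is_open` is a hypothetical helper, and `instance-ip-address` in the note below stands for the instance's public IPv4 address, as in the rest of this post.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False
```

Run from a laptop, `port_is_open("instance-ip-address", 80)` should return True once the redirect is in place, while the same call for port 5000 should return False, since that port was never opened to the public.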
In this exercise, I learned the following:
The application works right now, but a lot of improvements should be made before calling this a production-level deployment. I will tackle this step by step. The immediate next steps are:
EC2 instances are ephemeral, i.e. they can be stopped and terminated. Once an instance is terminated (accidentally, because AWS goes down, or whatnot), the data stored in the storage attached to that instance is lost.
A better strategy is to store the application data in a database (like DynamoDB) or a stable data store (like Amazon S3). In our application, the data is the txt files that are created by the PDF-to-text conversion.
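As a rough idea of what the S3 route could look like, the sketch below copies every txt file from a directory into a bucket. It takes the S3 client as an argument and relies only on the boto3 client method `upload_file(Filename, Bucket, Key)`. The function name `upload_txt_files`, the `txt/` key prefix, and the directory and bucket names in the note below are all assumptions for illustration.

```python
from pathlib import Path


def upload_txt_files(s3_client, txt_dir, bucket, prefix="txt/"):
    """Upload every .txt file in txt_dir to the bucket and return the object keys used."""
    keys = []
    for path in sorted(Path(txt_dir).glob("*.txt")):
        key = prefix + path.name
        # upload_file(Filename, Bucket, Key) is provided by the boto3 S3 client.
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys
```

With boto3 installed and credentials configured, a call might look like `upload_txt_files(boto3.client("s3"), "data/txt", "my-arxiv-bucket")`, where both the directory and the bucket name are placeholders.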
In 2.4, I used iptables to redirect traffic from port 80 to 5000. This configuration will need to be recreated every time a new instance is created. A more robust strategy may be to use a load balancer that receives public connections on port 80 and internally forwards all the traffic to the EC2 instance on port 5000. This is more complex to set up, as we need to pay more attention to the security group settings. For a more complex application, this also allows serving more concurrent HTTP requests by scaling out to multiple running instances. I will explore this in the future.
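Until a load balancer is in place, the two iptables rules can at least be scripted so that they are easy to reapply on a fresh instance. The sketch below rebuilds the two commands from this post as argument lists and, by default, only prints them. The function names and the `dry_run` switch are assumptions, and the interface name `eth0` is taken from the command above and may differ on other instance types.

```python
import subprocess


def redirect_rules(src_port=80, dst_port=5000, interface="eth0"):
    """Return the two iptables commands from this post as argument lists."""
    return [
        # Redirect traffic arriving from outside on the given interface.
        ["iptables", "-t", "nat", "-A", "PREROUTING", "-i", interface,
         "-p", "tcp", "--dport", str(src_port),
         "-j", "REDIRECT", "--to-port", str(dst_port)],
        # Redirect traffic generated on the instance itself (loopback).
        ["iptables", "-t", "nat", "-A", "OUTPUT", "-o", "lo",
         "-p", "tcp", "--dport", str(src_port),
         "-j", "REDIRECT", "--to-port", str(dst_port)],
    ]


def apply_rules(rules, dry_run=True):
    """Print each command; when dry_run is False, also execute it (requires root)."""
    for rule in rules:
        print(" ".join(rule))
        if not dry_run:
            subprocess.run(rule, check=True)


apply_rules(redirect_rules())  # dry run: only prints the two commands
```

Passing `dry_run=False` and running the script as root would actually add the rules. Since `-A` appends, running it twice would add each rule twice, so this is best done once per new instance.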