Posts Tagged ‘aws’

Testing Your AWS Elastic Load Balancer

Tuesday, July 27th, 2010

Vijay Ramachandran asked me, via Twitter, how to test whether an Amazon Elastic Load Balancer is really doing its job. Because 140 characters isn’t sufficient space for this answer, I’ve created this post. Feel free to use any of this in any of your environments.

First, I’ll assume you’ve covered some of the basics with ELB.

The default configuration you’ll end up with following my guides above is a stateless system that distributes requests more or less evenly across all configured servers. However, when you do it the first time, it’s nice to see that it’s actually doing what you think it should. The steps are simple:

  1. Verify each instance is working as expected
  2. Verify the load balancer is distributing the requests across multiple instances
  3. Verify the instances are working behind the load balancer

1. Verify each instance is working

This is far and away the easiest step. Access each machine by the Amazon-assigned address for that specific instance and ensure that it’s doing what you expect. The only potential issue is that you might jump from one machine to another if you are not watching your URL. For example, if you are on ec2-123-123-123-123.compute-1.amazonaws.com, access your application at that address and ensure it works as expected. If it jumps to a domain name because you’ve hard-coded a link somewhere, you may not be testing the new server at all.

2. Verify the load balancer is distributing the requests across multiple instances

To test that requests are being distributed across multiple machines, I use a test file. I generate my test file automatically by running the following script as part of the boot-up routine. This simply saves the instance-id from the metadata into a text file. If you are uncomfortable placing this information in the web root, you can optionally place it behind basic authentication, put it into a script that hashes it (md5 or sha1) or some other application based logic to access it.

/usr/local/bin/curl -s http://169.254.169.254/latest/meta-data/instance-id > /var/www/html/instance-id.txt

Check the path for curl and the web root for your local system and adjust accordingly. This should work from RedHat flavored distributions.

Once you’ve run this on each of your instances, you can tell that requests are being distributed to both machines by requesting the test file through your load balancer address several times and verifying that the instance id it returns changes. (Obviously replace the following request with the correct address for your load balancer.)

http://applicationservers-123456789.us-east-1.elb.amazonaws.com/instance-id.txt
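Refreshing a browser works, but a short loop makes the distribution easier to see. The following is a minimal sketch rather than part of the setup described above: it assumes curl is on the PATH, and the function names and the default of 20 requests are made up for illustration.

```shell
#!/bin/bash
# Tally instance ids read from standard input, one id per line.
# Blank lines (for example from failed requests) are ignored.
tally_instances() {
    awk 'NF' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Request URL a number of times (default 20) and print each response on
# its own line. The test file has no trailing newline, so echo adds one.
sample_instances() {
    local url=$1 count=${2:-20} i
    for ((i = 0; i < count; i++)); do
        curl -s "$url"
        echo
    done
}
```

Running `sample_instances http://applicationservers-123456789.us-east-1.elb.amazonaws.com/instance-id.txt | tally_instances` should print one line per instance id with the number of responses it served. If only one id ever appears, the load balancer is probably not distributing requests.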

3. Verify the instances are working behind the load balancer

Now for the final test. Confident that your requests are being distributed across both machines, test that your application works as expected. Test first under the Amazon assigned name, applicationservers-123456789.us-east-1.elb.amazonaws.com in this example, then under your CNAME’d alias.

If everything still works, you can assume all is good.

4. Bonus Check

If you really, really, really want to know… you can also verify using your access logs. Check in /var/log/httpd/access_log or wherever your web server logs are kept to see that requests are being distributed to each machine.
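One way to make the log check less tedious is to count the requests for the test file on each machine and compare the totals. This is only a sketch: the function name is hypothetical, and it assumes the common Apache log format where the request appears as "GET /path HTTP/1.x".

```shell
#!/bin/bash
# count_requests LOGFILE PATH
# Print how many GET requests for PATH appear in LOGFILE.
count_requests() {
    grep -c -F -- "\"GET $2 " "$1"
}
```

For example, `count_requests /var/log/httpd/access_log /instance-id.txt` run on each instance should give roughly similar numbers if requests are being spread evenly.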

DNS Tips:

1. Never use the real IP returned from dig or nslookup as an A record in DNS unless you automate checking it (and even then I wouldn’t), because the actual IP changes from time to time. Only use CNAME entries.

2. If you are using GoDaddy’s DNS tool, you can’t CNAME the root of a domain (i.e. example.com). For this case I use one instance as a permanent instance with an elastic IP and point the root A record for my domains to it. I then assign www. as a CNAME for the load balancer’s AWS assigned domain. Last but not least, I use .htaccess and mod_rewrite to ensure requests are redirected to www.example.com. This ensures traffic is being sent to the load balancer address.

Amazon Sweetens the Cloud Pot

Friday, March 26th, 2010

I was excited to learn that AWS has sweetened the pot for people who want to try out infrastructure as a service by eliminating bandwidth charges for the first 1 GB of outbound transfer each month. Furthermore, the aggregation of pricing across services means many smaller sites will get their bandwidth for FREE!

I strongly believe that if all you’re putting up for your website is static pages and a few photos, you could effectively use S3 as your lone hosting solution. Now that the bandwidth charge for these small sites would be zero, you pay only for storing the files, which would likely be less than $0.10 per month.

The email I received is below.

Dear AWS Customers,

Starting April 1, 2010, your Data Transfer Out pricing tier for a given Region will be based on your total Data Transfer Out usage within that Region for Amazon Simple Storage Service (Amazon S3), Amazon Elastic Compute Cloud (Amazon EC2), Amazon SimpleDB, Amazon Relational Database Service (Amazon RDS), Amazon Virtual Private Cloud (Amazon VPC), and Amazon Simple Queue Service (Amazon SQS). Until now, usage tiers have been calculated individually for each service, based on data transfer related to that service. Because AWS is now aggregating your total Data Transfer Out usage across multiple services, you can reach higher usage tiers and lower pricing more quickly. In addition, you’ll benefit from a complimentary tier which provides your first GB of outbound transfer in each Region each month at no charge.

The tiered pricing for Data Transfer Out is as follows for each Region:

First 1 GB of data transferred out per month is free
Remainder of first 10 TB per Month: $0.15 per GB
Next 40 TB per Month: $0.11 per GB
Next 100 TB per Month: $0.09 per GB
Over 150 TB per Month: $0.08 per GB

As you may know, all inbound data transfer is free of charge until June 30, 2010. All data transfer usage (both inbound and outbound) for participating Amazon Web Services now appears in aggregate in its own section of your AWS account activity page and monthly bill. As a bonus, you’ll notice that your first GB of outbound data transfer in each Region is now included free of charge.

As always, thank you for your support.

Sincerely,

The Amazon Web Services Team
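To see what the tiers in the email mean for an actual bill, the arithmetic can be sketched in a few lines of shell. This is an illustration only: the function name is made up, and it assumes 1 TB is counted as 1024 GB, which the email does not state.

```shell
#!/bin/bash
# transfer_out_cost GB
# Estimate the monthly Data Transfer Out charge in USD for one Region.
# Assumption: 1 TB is treated as 1024 GB.
transfer_out_cost() {
    awk -v gb="$1" 'BEGIN {
        gb += 0
        # Upper bound of each tier in GB and the rate charged inside it.
        split("1 10240 51200 153600", upper, " ")
        split("0 0.15 0.11 0.09", rate, " ")
        over_rate = 0.08    # everything above 150 TB
        cost = 0; prev = 0
        for (i = 1; i <= 4; i++) {
            top = upper[i] + 0
            if (gb > prev) {
                used = (gb < top ? gb : top) - prev
                cost += used * rate[i]
            }
            prev = top
        }
        if (gb > prev) cost += (gb - prev) * over_rate
        printf "%.2f\n", cost
    }'
}
```

Calling `transfer_out_cost 101`, for example, charges nothing for the first GB and $0.15 for each of the remaining 100 GB.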

Honesty Box: EBS Performance Revisited

Tuesday, March 2nd, 2010

As part of my work on Honesty Box, I’ve been reviewing EBS disk performance once again. This was a great opportunity to expand on the research from last year. After re-reading what I posted then, along with the wealth of data that has been compiled since, I realized I still didn’t have sufficient information to answer two key questions.

  1. How does the number of EBS volumes impact the performance of RAID 0?
  2. Does the instance size make a significant difference in the RAID performance?

As before I used Bonnie++ to measure the results. You can read about the full method I used below.

Results

  • RAID 0 performed better with an even number of EBS volumes.
  • RAID 0 performed best with 8 volumes for writes and random seek.
  • RAID 0 performed poorly for reads!
  • Larger instances perform significantly better than smaller instances.
  • The ephemeral store has very good overall performance.

Data

The titles of the Bonnie tests can be confusing for folks removed from the programming process. Be sure to read the full explanation of what each test is doing.

Sequential Output is a measure of the write performance to the drive. Higher bars are better. With RAID 0 it appears that an even number of drives performs significantly better than an odd number.

Sequential Create is a measure of the files created by Bonnie. Higher values are better. Tests that complete too quickly return no values. That is the cause of the missing bars for Read/sec above. You can safely consider that value too fast to measure.

Sequential Input is a measure of the read performance from the drive. Higher values are better. This is concerning because block read performance declines steadily as the number of volumes increases. This may have to do with the time of day that these tests were run and really warrants more investigation. It should also be noted that this is a measure of sequential performance, so unless you’re reading contiguous files off the disk, this number may be irrelevant to you.

Random Create measures how the files are created and deleted. Higher values are better. Again, tests that happen too quickly are discarded, which explains why the Read/sec result has no values.

Random Seeks should scale consistently with the number of EBS volumes added. Higher values are better. However, that did not appear to be the case and a limit appeared to be reached at 8.

Effect of CPU

To test the impact of the CPU units, I selected the 4 volume array and then compared it with the tests run last year. Both were using 4 volume EBS RAID 0 with XFS file systems. They both used the noop IO scheduler. The underlying OS did change from Fedora to Ubuntu and a year has passed.

Sequential Output: taller is better. Clearly the additional IO capacity in the larger instance does make a big difference in the performance of the volumes. I would expect smaller increments in CPU capacity to result in smaller differences.

Sequential Input: taller is better. Clearly the m1.large instance outperforms the smaller m1.small instance.

Thoughts and Next Steps

After reviewing the performance of the native ephemeral storage, I wonder if partitioning the ephemeral store and assembling a RAID array from there might not be the best route for high speed storage? Of course backup would be a potential issue, but snapshotting of XFS may be able to mitigate that. For future tests I would like to study the impact of using the -b flag which causes Bonnie++ to flush to disk. I also think larger volume sets as shown by these tests and different I/O schedulers may yield different results.

Method

As before, I used Bonnie++ to measure disk performance. Its limitations are fairly well understood, and it gives us a metric that can be compared with other metrics. You can read the full explanation of what each value actually means here. Armed with 16 EBS volumes attached to an unused m1.large instance, I began running tests. The process was as follows:

  1. Create a new RAID set using a chunk size of 256
  2. Use XFS to format the drives
  3. Mount the filesystem w/ Ubuntu defaults
  4. Capture Bonnie results
  5. Disassemble the RAID set
  6. Rinse and repeat

I did this for 2-10 volumes and then one additional test with 16 volumes. For comparison, I also ran the test with the ephemeral store and a single EBS volume. Those are the results represented in each of the graphs above. I reran the 6 volume test 3 times over the course of a day and took an average value for the graphs.
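The process above can be sketched as a function that prints the commands for one pass. It deliberately prints rather than runs them, because creating an array destroys whatever is on the volumes. The device names (/dev/sdf onward, /dev/md0), the mount point /mnt/raid, the exact Bonnie++ options and the reading of the chunk size as 256 KB are all assumptions, not the commands used for the tests.

```shell
#!/bin/bash
# raid_test_plan COUNT
# Print (do not run) the commands for one benchmark pass over COUNT volumes.
# Device names, mount point and bonnie++ options are assumptions.
raid_test_plan() {
    local count=$1
    local letters=(f g h i j k l m n o p q r s t u)
    local devices=()
    local i
    for ((i = 0; i < count; i++)); do
        devices+=("/dev/sd${letters[$i]}")
    done
    echo "mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=${count} ${devices[*]}"
    echo "mkfs.xfs -f /dev/md0"
    echo "mount /dev/md0 /mnt/raid"
    echo "bonnie++ -d /mnt/raid -u root"
    echo "umount /mnt/raid"
    echo "mdadm --stop /dev/md0"
}
```

Calling `raid_test_plan 4` prints the six commands for the 4 volume case, which can be reviewed before running anything by hand.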

Cloud Pricing Models

Monday, December 14th, 2009

Yesterday Amazon announced their Spot pricing model, effectively providing market driven pricing for instances on EC2. Depending on your product, this probably won’t impact you much, but it got me thinking about pricing of the cloud. Amazon Web Services was a game changer when it launched: buy the computing resources you need for only the time you need them. However, you’re stuck with a very limited set of instances and therefore you need to architect your systems around their pre-defined instance sizes. While they expanded their instance offering to include high CPU and more recently high memory instances, you’re still stuck with a fairly rigid set of boxes from which to run your systems.

A specific weak spot I’m having with the pre-defined box sizes is Memcached. It turns out that Memcached is fairly light on the processor and requires essentially no disk I/O. Really the processor is just a go-between for the memory and the network card. If you are looking at putting a 32 GB server online to manage the caching tier for your app, you’d need to buy the “High-Memory Double Extra Large Instance” for $1.20/hr (or $10,512/year). Wait… what?! Okay, obviously we should pre-pay this. The typical business model is to run the hardware over a 3 year cycle, so let’s pay the $4,900 up front and then enjoy a more comfortable $0.42/hr (or $3,679.20/year + $1,633.33/year for the pre-pay = $5,312.53 each year for 3 years). Obviously the $15,937.60 we pay over 3 years is easier to swallow than the $31,536 if we don’t pre-pay it.
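The yearly figures above come from multiplying the hourly rate by 8,760 hours and spreading any up-front fee over the term. A small helper makes it easy to redo the comparison with other rates. The function name and argument order are made up for this sketch.

```shell
#!/bin/bash
HOURS_PER_YEAR=8760   # 24 hours x 365 days

# yearly_cost HOURLY_RATE [UPFRONT_FEE] [TERM_YEARS]
# Print the cost per year in USD, with the up-front fee spread over the term.
yearly_cost() {
    awk -v rate="$1" -v upfront="${2:-0}" -v years="${3:-1}" -v hours="$HOURS_PER_YEAR" \
        'BEGIN { printf "%.2f\n", rate * hours + upfront / years }'
}
```

`yearly_cost 1.20` gives the on-demand figure, and `yearly_cost 0.42 4900 3` gives the pre-paid one.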

Now, if you’re running your infrastructure in the cloud and considering Memcached, you really can’t put a box in a rack somewhere else: the increased latency and unreliability mean you may not be able to get data from your cache in a cost effective way. So I’m not going to look at what buying a box with that kind of memory would cost. Besides, there is such variation in buying rack/ping/power that it would be too messy to calculate here.

This has me intrigued to see how other providers are doing their billing. I love the idea of a-la-carte servers paid by the hour. But what would really be great is being able to choose the CPU, memory, and I/O I need. This brings me to two smaller cloud providers who seem to have interesting offerings.

First up is 3Tera. 3Tera offers a completely different take on the cloud infrastructure model. The idea behind their offering is that you purchase hardware (or lease it) and then slice the box however you want. Basically, you run your own virtual cloud! You can consider different hardware options, including stuffing a ton of RAM into weaker boxes and so on. Ultimately the product is a resource allocation tool. The dark side is that you have to pay for all that hardware, even if you’re not using it, so really this isn’t a cost savings over EC2. It is an interesting idea, though, if your system resource needs shift significantly over time but are consistent enough to warrant buying or leasing hardware. I’m really interested in their technology, and they have an impressive list of partners running the software from whom you can lease virtual images.

The second provider is OpSource Cloud. OpSource charges a base fee for the VLAN service and then you build your infrastructure on top of that. The beauty is that it’s a-la-carte down to the CPU cycles and memory! Currently the memory footprint is limited to 8 GB and each machine needs between 1 and 4 CPUs. However, this pricing model is interesting because you can provision a single CPU with 8 GB of RAM, which comes out to roughly $0.24/hr (or $2,102.40/year). Starting 4 of these instances to hold the 32 GB of cache is only slightly cheaper than Amazon’s model, coming in at a whopping $8,409.60/year. There are some cost savings available if you buy a silver, gold or platinum pricing tier for a monthly pre-pay. The pricing for those starts at $500/month and goes up, so you really need to have some significant hardware running to justify those costs. Another gotcha with this plan is that you need to provision a network, which is $0.20/hr. I’m going to be keeping an eye on this provider. I think in the future they may have a winning solution.

Unfortunately, I don’t yet see a solution that fits my specific need. Perhaps I need to adjust my thinking and look at alternatives. It may be time to consider Amazon’s SimpleDB, which provides simple key/value storage like Memcached, although as a service. Is it the answer for putting large amounts of data into a non-RDBMS? I’ll consider that in another post.


Creative Commons Photo by ArcticNomad

Amazon Opening US-WEST-1

Wednesday, December 2nd, 2009

While I don’t have access to it yet, apparently Amazon has opened US-WEST-1 for EC2. Customers using enStratus have access already.

This is important because anyone leveraging platforms like Facebook or MySpace has just been put nearly 3000 miles closer to the key data centers these platforms run out of.

MySQL Relational Database Service on AWS

Tuesday, October 27th, 2009

The stable of services available through AWS is continuing to expand! Last night Amazon announced RDS (Relational Database Service), which looks a lot like EC2 instances running MySQL with EBS volumes, something I have a fair bit of experience with. However, these have the added benefit of being a service that can scale memory and processor both up and down with a single service call.

# rds-modify-db-instance mydbinstance --db-instance-class db.m1.xlarge -s 100

This flexibility comes with a downside, namely a 4 hour monthly service window during which patches, updates and those requested capacity changes are applied. You can choose to apply them immediately, but your application should be prepared to handle the downtime. Your database instance goes offline, and when it comes back it has all the changes you requested applied. So at best, you should expect uptime in the 99.4% range. Most applications can handle a 4 hour downtime if it’s planned for. Under more conventional MySQL builds, developers or system administrators mitigate these downtimes by first applying changes to the slaves, promoting one slave to master, and finally applying the changes to the original master. This sort of safety net gives applications smaller downtime windows (at most a few minutes each), allowing for a theoretical 99.999% uptime.
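The uptime percentages follow directly from the length of the outage and the period it is measured over. As a rough sketch, assuming a 30 day month (720 hours) and a 365 day year, and with function names invented for the example:

```shell
#!/bin/bash
# uptime_percent DOWNTIME_HOURS PERIOD_HOURS
# Print the percentage of the period during which the service is up.
uptime_percent() {
    awk -v down="$1" -v period="$2" \
        'BEGIN { printf "%.2f\n", (1 - down / period) * 100 }'
}

# downtime_minutes_per_year UPTIME_PERCENT
# Print how many minutes of downtime per 365 day year a target allows.
downtime_minutes_per_year() {
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", (1 - pct / 100) * 365 * 24 * 60 }'
}
```

`uptime_percent 4 720` corresponds to one 4 hour window per month, while `downtime_minutes_per_year 99.999` shows how little room a five nines target leaves.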

Transitioning to RDS may not be without pain either. Importing your data is done through a mysqldump (or other flat-file export) and then playing that file back into your AWS instance. Depending on the size of your dataset, a full mysqldump and re-import may take days (no, I’m not exaggerating). Also note that while mysqldump runs, your original database will hold a read lock for consistency. With some DBs I manage, I’ve stopped using mysqldump entirely because the dumps took more than 4 hours to complete on a dedicated slave. With the myriad of snapshotting technologies available, it’s much easier to grab a binary copy of the DB every few hours. One last limitation is that replication isn’t an option. I suspect AWS will be working on this soon as part of an HA (High Availability) release option.

Despite the limitations, I’m excited about this offering. This offloads much of the maintenance and management tasks which are usually the most tedious. I also hope that this means a higher IO disk subsystem may be coming to EBS soon.

Elastic Load Balancing in Multiple Zones

Thursday, July 30th, 2009

Ran into a problem this morning with Amazon’s Elastic Load Balancer. If you want to have multiple availability zones, say us-east-1a and us-east-1b, behind your elastic load balancer, be sure to have at least one healthy instance running in each. Otherwise, inbound requests will “dead end” and serve up 503 errors. This is because DNS resolves to every enabled zone regardless of the health checks, and only then is the request passed to the actual load balancer for that zone. In other words, the zones are unaware of the status of the machines in different zones. :(

From Paul@AWS on the Amazon Discussion Forum:

The output from your describe call shows that you have two zones enabled:

       <AvailabilityZones> 
          <member>us-east-1b</member> 
          <member>us-east-1a</member> 
        </AvailabilityZones>

but you only have instances behind one of them. Whenever your client happens to get directed to the empty zone (which happens at DNS resolution time), it will have a dead-end.

The solution is to either add instances in the additional zone or disable that extra zone.

You can read the full thread that tipped me off this morning to the issue I was experiencing.
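A quick sanity check for this situation is to compare the zones enabled on the load balancer with the zones your registered instances actually run in. The helper below is a hypothetical sketch: it takes both lists as space separated strings, however you collect them, and it only checks that an instance exists in each zone, not that it is healthy.

```shell
#!/bin/bash
# empty_zones "ENABLED_ZONES" "INSTANCE_ZONES"
# Both arguments are space separated lists of zone names.
# Prints every enabled zone that has no instance behind it.
empty_zones() {
    local zone
    for zone in $1; do
        case " $2 " in
            *" $zone "*) ;;          # at least one instance lives here
            *) echo "$zone" ;;       # enabled but empty: a dead end
        esac
    done
}
```

For the case in the forum reply, `empty_zones "us-east-1b us-east-1a" "us-east-1a us-east-1a"` reports the zone that should either get an instance or be disabled.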

3 Amazon Elastic Load Balancer Tips

Wednesday, July 15th, 2009

Getting running on Amazon’s Elastic Load Balancer is easy. Once you’re up, you’ll also need to monitor it and do some basic maintenance of your nodes. These tips should help you make the most of the Elastic Load Balancer and show you some simple ways to get the monitoring data you’ll need.

1. Configure Health Checks

If an instance is off or not responding, you will want the load balancer to stop sending requests to it ASAP. The following command will set up a check that polls your servers every 5 seconds. Be warned, this functionality is not enabled by default!

elb-configure-healthcheck  ApplicationServers  --target "TCP:80" --interval 5 --timeout 3 --unhealthy-threshold 2 --healthy-threshold 2

Once you have this set up, the load balancer will check that it can open a TCP connection to port 80 on each instance and will stop sending requests to any instance that fails the check. You can then check in on the status of your instances with the following command:

elb-describe-instance-health ApplicationServers
INSTANCE-ID  i-12345678  InService
INSTANCE-ID  i-23456789  InService
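If you want to act on that output from a script, for example from a cron job, a small filter is enough. This sketch assumes the three column layout shown above and treats any state other than InService as a problem. The function name is made up.

```shell
#!/bin/bash
# Read elb-describe-instance-health output on standard input and print
# the ids of instances whose state is anything other than InService.
unhealthy_instances() {
    awk '$1 == "INSTANCE-ID" && $3 != "InService" { print $2 }'
}
```

Used as `elb-describe-instance-health ApplicationServers | unhealthy_instances`, it prints nothing when every instance is healthy.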

2. Clear Out Old Instances

If you are using auto scaling to automatically add the instances to the load balancer, you can probably skip this one. But if you are like me and add instances to the load balancer only after completing the startup scripts, you’ll need to periodically clean out any invalid instances. After running with elastic load balancer for a few weeks, I found I had extra instances registered with the load balancer that were no longer running. When an instance that registers itself is shut down by auto scaling, the load balancer isn’t updated. This is VERY important, because after a week, the instance id will likely have cycled through to someone else!

elb-deregister-instances-from-lb ApplicationServers --instances i-23456789,i-34567890
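Finding the stale ids by eye gets old quickly. As a sketch, assuming you have saved the ids registered with the load balancer in one file and the ids of your running instances in another (one id per line, file names up to you), the difference can be computed like this:

```shell
#!/bin/bash
# stale_instances REGISTERED_FILE RUNNING_FILE
# Each file holds one instance id per line. Prints the ids that are
# registered but not running, joined with commas.
stale_instances() {
    sort -u "$1" | grep -vxF -f "$2" | paste -sd, -
}
```

The comma separated result is in the form the --instances option above expects, so it can be reviewed and then passed to elb-deregister-instances-from-lb.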

3. Check Monitoring Values

You already know about describing the instance health from when you set up the health check before. Now check out the CloudWatch monitoring tools. If you’re not using auto scaling, these values in combination with your own internal metrics will help determine when to add capacity. Spend some time with your log files and these metrics.

mon-list-metrics | grep "ApplicationServers"
 
HealthyHostCount    AWS/ELB  {LoadBalancerName=ApplicationServers}
Latency             AWS/ELB  {LoadBalancerName=ApplicationServers}
RequestCount        AWS/ELB  {LoadBalancerName=ApplicationServers}
UnHealthyHostCount  AWS/ELB  {LoadBalancerName=ApplicationServers}
 
mon-get-stats HealthyHostCount --statistics Average,Minimum,Maximum --dimensions "LoadBalancerName=ApplicationServers" --namespace "AWS/ELB" --period 600 --headers
 
Time                 Samples  Average  Minimum  Maximum  Unit
2009-07-16 03:47:00  98.0     2        2.0      2.0      Count
2009-07-16 03:57:00  103.0    2        2.0      2.0      Count
2009-07-16 04:07:00  98.0     2        2.0      2.0      Count
2009-07-16 04:17:00  98.0     2        2.0      2.0      Count
2009-07-16 04:27:00  99.0     2        2.0      2.0      Count
2009-07-16 04:37:00  98.0     2        2.0      2.0      Count
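The same output can be filtered to find the periods where the pool dipped below the size you expect. This is a sketch that assumes the column layout shown above (date, time, samples, average, minimum, maximum, unit). The function name is hypothetical.

```shell
#!/bin/bash
# low_host_periods EXPECTED
# Read mon-get-stats output on standard input and print the timestamp of
# every period whose Minimum column is below EXPECTED.
low_host_periods() {
    awk -v expected="$1" \
        '$1 ~ /^[0-9][0-9][0-9][0-9]-/ && $5 + 0 < expected + 0 { print $1, $2 }'
}
```

Piping the mon-get-stats command above into `low_host_periods 2` would list any ten minute window in which fewer than two hosts were healthy.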

Scaling Out with EC2, CloudWatch, Auto Scaling and Elastic Load Balancing

Thursday, July 9th, 2009

Earlier this year, Amazon launched a suite of new services that replaced the need to work with a product like Scalr or RightScale for building scalable applications on the EC2 platform. Those tools help you allocate more resources according to current application load. The key benefit of using a cloud based service is that you only pay for what you use. However, without one of the aforementioned providers, and their additional costs, you were left in the lurch designing a system that could detect the current load of your infrastructure and respond accordingly. Amazon has now made it very simple to create infrastructure that can expand AND contract, of course only paying for what you use.

The Tools

Elastic Load Balancer

Solutions for load balancing ranged from round robin DNS to running a load balancer on an instance (I’d been using Nginx on an m1.small instance at $0.10/hr). Nginx worked well, with an assigned elastic IP (static IP) that could move from machine to machine as needed and special scripts to manage the pool of servers (or you could do it manually). It worked, but was by no means efficient or even easy to maintain. Furthermore, the Nginx host is a single point of failure. Being proactive, it was possible to create a monitoring system to watch Nginx and, should it fail, bring up and configure a new server before redirecting the elastic IP to the new host. It was a hack and certainly not elegant!

Enter elastic load balancing. You create an elastic load balancer and then add the instances to the load balancer. That’s it! Amazon handles the redundancy, and the best part is that it’s only $0.025/hr. That’s a savings of $54/month over running a load balancer instance. There is of course a drawback. With Nginx and other load balancers, you have the option to do intelligent load balancing. Advanced functionality like sticky sessions and response rewriting isn’t available with the Amazon solution. However, with a well designed application, this should be irrelevant.

CloudWatch

Monitoring the cloud is VERY important. Amazon has issues with all sorts of things, from EBS stores going offline to instances being completely unavailable. Before CloudWatch, I used a mix of systems including SNMP monitoring and the 3rd party service Pingdom to keep tabs on my instances. The CloudWatch product doesn’t replace these, but rather supplements the data I gather from them. CloudWatch costs an additional $0.015 per server per hour above the default instance cost. It takes about 2 minutes to come online, and the statistics are available through the API almost immediately after that. CloudWatch provides access to each monitored instance’s CPU utilization, disk read bytes, disk read operations, disk write bytes, disk write operations, network in, and network out. I find for my needs, CPU utilization is an excellent indicator of server performance and I use that to determine when to add a new server or take one away.

You can gather these statistics grouped by AMI, Instance Id, instance type and even AutoScaling group. If you can reliably detect your need to add an additional server based on these statistics, you’ll be able to take advantage of Auto Scaling; more on that in a minute. If not, it’s very simple to write a script that determines if it’s time to start a new server up to help with processing and register it with the load balancer. Oh, and did I mention for the load balancer you also get access to healthy host count, latency, request count, and unhealthy host count? These could be helpful metrics for rolling your own scaling scripts or may be sufficient for knowing when you need an additional server.

Auto Scaling

This is the glue that brings it all together. Auto Scaling monitors your statistics from CloudWatch, starts new instances when needed, and turns them back off when they are no longer needed. Currently this is all offered for FREE if you are using CloudWatch! The setup is simple once you go through it the first time, but it took me a couple of tries to get it right. In my case, I monitor my application server pool and when I see that it’s stressed, I add another server. Because of the way it’s configured, I also have some safeguards in place that keep me from starting thousands of instances.

How To Do It

Background

This assumes you’ve installed all the Amazon CLI tools for Elastic Load Balancing, Auto Scaling and CloudWatch, you’re fairly comfortable at the command line, and you know how to make your own AMI. Now, you’ll need to determine the best way for you to publish your code to a new server. Some possible solutions are rsync, Subversion, an NFS mount, S3 or a mix of technologies. Some folks just bundle up the code in their AMI (works well if your codebase is static). Regardless, that’s a bit beyond the scope of this post. After you create your solution, you’ll create a server image (AMI) that can boot up and correctly get a copy of the code you’re running. If you already have that, you can of course just use that one. Once you create an instance that can be turned on and handle traffic…

The Process

  1. Create the Load Balancer
  2. Create the Auto Scale Launch Config
  3. Create the Auto Scale Group
  4. Create the Auto Scale Trigger(s)

Create the Load Balancer

The DNS-NAME that is returned is the point you’ll direct all traffic to. Add this as a CNAME in your DNS for your domain.

elb-create-lb ApplicationServer --availability-zones us-east-1a --listener "protocol=http,lb-port=80,instance-port=80"
DNS-NAME ApplicationServer-12345678.us-east-1.elb.amazonaws.com

Create the Auto Scale Launch Config

The AMI will of course be your AMI that knows how to come online and get a fresh copy of your code, and you may be using a different instance type. Definitely look over the documentation to ensure you are doing it all right.

as-create-launch-config AppServerConfig --image-id ami-12345678 --instance-type m1.small --group default

Create the Auto Scale Group

I use a nice long cooldown period here (10 minutes) so that the servers don’t come online or go offline too quickly. If you expect an occasional slashdotting, you might want this to be shorter. This also sets a boundary: there will always be at least 1 server and no more than 3. It also tells auto scaling that you want the new instance to join the load balancer.

as-create-auto-scaling-group AppServerGroup --launch-configuration AppServerConfig --availability-zones us-east-1a --min-size 1 --max-size 3 --cooldown 600 --load-balancers ApplicationServer

Create the Auto Scale Trigger(s)

You will likely spend a good bit of time working on this portion. What this basically does is if the average CPU utilization for my servers is above 70% for 10 minutes, bring a new server online. Then likewise, if it falls below 30% for 10 minutes, turn one off. The Auto Scaling Group we created ensures there is always at least 1 server online.

as-create-or-update-trigger AppServerTrigger --auto-scaling-group AppServerGroup --namespace "AWS/EC2" --measure CPUUtilization --statistic Average --dimensions "AutoScalingGroupName=AppServerGroup" --period 60 --lower-threshold 30 --upper-threshold 70 --lower-breach-increment=-1 --upper-breach-increment 1 --breach-duration 600
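To reason about what the trigger above will do before you turn it loose, it can help to model the decision on paper. The function below is only a toy model of the thresholds and group limits used in this post. It is not how Auto Scaling is implemented, it ignores the breach duration and cooldown, and it only accepts whole number CPU percentages.

```shell
#!/bin/bash
# Thresholds and limits taken from the trigger and group defined above.
LOWER=30 UPPER=70 MIN_SIZE=1 MAX_SIZE=3

# scale_decision AVG_CPU CURRENT_SIZE
# Print the group size this toy model would move to.
scale_decision() {
    local cpu=$1 size=$2
    if (( cpu > UPPER && size < MAX_SIZE )); then
        echo $((size + 1))      # upper breach: add one server
    elif (( cpu < LOWER && size > MIN_SIZE )); then
        echo $((size - 1))      # lower breach: remove one server
    else
        echo "$size"            # inside the band or already at a limit
    fi
}
```

Walking through a few values, such as `scale_decision 85 3` or `scale_decision 10 1`, shows how the minimum and maximum sizes keep the group bounded even when a threshold is breached.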

That is all there is to it! You now have a system that can grow your pool of application servers on demand! I hope this helps you build out an infrastructure that lets you scale up your next web application. You might want to look over the command line tool documentation before getting started.

Loading Data Into Bash Variables

Tuesday, July 7th, 2009

Go away or I will replace you with a very small shell script. For those unfamiliar with Unix and Linux environments, bash is the command line shell that is standard on many distributions. These examples grew out of challenges attempting to automate EC2 processes. These basic principles of course can be applied more generally as needed. My goal is to simply provide the options I need most often in a single place. As I continue to automate routine tasks, I might just be able to replace myself, or others, with a series of scripts!

The Gaps

The Solutions

Loading data from a URL

With Amazon EC2 many startup values are available via an HTTP request to the internal address instance-data.ec2.internal. If you want to know more about these values, the Developer Guide is a good resource.

#!/bin/bash
MY_INSTANCE_ID=`exec wget -q -O - http://instance-data.ec2.internal/latest/meta-data/instance-id`
echo $MY_INSTANCE_ID

This script grabs the instance id, puts it into a variable and prints it back out. It uses command substitution (the backticks) to capture the output of wget, run with the quiet flag (-q) and the output file set to standard output (-O -). The second dash is what sends the data to standard output, so don’t forget it! Now anywhere in our script we want the instance id for string comparison, logging or whatever, we have it!

Loading data from a file

What if the data we want to load is in a file on the disk? This method is not good for processing giant apache access logs, but with smaller text files, it will work just fine.

#!/bin/bash
# Split on newlines only, so each line of the file becomes one array element
IFS=$'\n'
FILE_DATA=( $( /bin/cat file_data.txt ) )
# Loop over the array indexes, starting at 0
for I in "${!FILE_DATA[@]}"
	do
		echo "$I ${FILE_DATA[$I]}"
	done

What’s going on? The file is loaded into a bash array called FILE_DATA. Setting IFS to a newline makes the shell split the file on line breaks rather than on every space, so each non-empty line becomes one element. The script then loops over the array indexes with a for loop and, inside the loop, prints the current index followed by the line stored there. This is roughly equivalent to running cat -n file_data.txt from the shell directly (although the index starts at 0 and blank lines are skipped), but obviously gives us the flexibility to do further processing with the string contained in the variable.
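For bigger files, a while loop with read handles one line at a time instead of holding the whole file in an array. This is a sketch of that alternative, not the approach used above; the sample file and its contents are invented so the example can run on its own.

```shell
#!/bin/bash
# Create a small sample file so the sketch is self-contained
SAMPLE=$(mktemp)
printf 'first line\nsecond line\nthird\n' > "$SAMPLE"

COUNT=0
# IFS= keeps leading and trailing whitespace; -r keeps backslashes literal
while IFS= read -r LINE
do
	COUNT=$((COUNT + 1))
	echo "$COUNT $LINE"
done < "$SAMPLE"
rm -f "$SAMPLE"
```

Because the loop reads from a redirect rather than a pipe, COUNT is still set after the loop ends.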

Loading data from the user

Obviously this is not ideal for a process that runs from a cron job. However, when a script is run by a user, they often need to tweak something about the way it runs that can’t be detected automatically. In this case, you’ll want the user to key the data directly into your script.

#!/bin/bash
read -p "Enter Something: " VARIABLE
echo $VARIABLE

This example uses read with the optional prompt (-p) flag. This causes the text in the quotes to be displayed in the user’s terminal window before input is read. Bash writes the prompt to standard error, and only when the input is coming from a terminal.
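A common extension is to offer a default that the user can accept by just pressing Enter. The sketch below assumes a made-up helper name, ask_with_default, and made-up example values.

```shell
#!/bin/bash
# Hypothetical helper: prompt for a value, falling back to a default on empty input.
# Usage: ask_with_default PROMPT DEFAULT
ask_with_default() {
	local prompt=$1
	local default=$2
	local answer
	# -r keeps backslashes in the answer literal
	read -r -p "$prompt [$default]: " answer
	# Use the default when nothing was typed
	echo "${answer:-$default}"
}
```

A script would use it as REGION=$(ask_with_default "Region" "us-east-1"). The prompt still reaches the terminal because it is not written to standard output, so only the answer ends up in the variable.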

Loading data from the command line

A step further is to let the user pass in data on the command line at run time. This of course can also be automated if needed. The following example leverages getopts to parse the parameters that were passed in.

#!/bin/bash
OPT_A=0
OPT_B='Undefined'
while getopts ":ab:" OPTION
do
	case $OPTION in
	a ) OPT_A=1 ;;
	b ) OPT_B=$OPTARG ;;
	esac
done
shift $(($OPTIND - 1))
echo $OPT_A $OPT_B

The example script takes two different parameters: a flag “-a”, which is a simple switch, and a flag “-b”, which expects data. Default values are provided for both, which has the effect of making all flags optional. Using the flag -a would likely toggle a specific behavior within your script, perhaps loading a specific configuration file instead of the default one. If you wanted to collect data for every flag, you simply add a colon “:” after each flag letter (‘a’ in this example) in the option string passed to getopts. You would then update the case statement to reflect your expectation of data being present in $OPTARG. See the modified script below for clarification.

#!/bin/bash
OPT_A=0
OPT_B='Undefined'
while getopts ":a:b:" OPTION
do
	case $OPTION in
	a ) OPT_A=$OPTARG ;;
	b ) OPT_B=$OPTARG ;;
	esac
done
shift $(($OPTIND - 1))
echo $OPT_A $OPT_B
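Both scripts above start the option string with a colon, which puts getopts into silent mode: it does not print its own error messages and instead reports problems through the ? and : cases. Neither script handles those cases, so here is a sketch that does. Wrapping the loop in a function called parse_args and collecting leftovers in REMAINING are my own choices for illustration.

```shell
#!/bin/bash
# Hypothetical wrapper so the parsing can be reused and called more than once
parse_args() {
	local OPTION OPTARG OPTIND=1
	OPT_A=0
	OPT_B='Undefined'
	while getopts ":ab:" OPTION
	do
		case $OPTION in
		a ) OPT_A=1 ;;
		b ) OPT_B=$OPTARG ;;
		# In silent mode OPTARG holds the offending option letter
		\? ) echo "Unknown option: -$OPTARG" >&2; return 1 ;;
		: ) echo "Option -$OPTARG requires a value" >&2; return 1 ;;
		esac
	done
	shift $((OPTIND - 1))
	# Whatever is left after the flags
	REMAINING=("$@")
}

parse_args "$@" || exit 1
echo "$OPT_A $OPT_B ${REMAINING[*]}"
```

Run with “-a -b hello extra”, this sets OPT_A to 1, OPT_B to hello and leaves extra in REMAINING. Run with “-b” alone or with an unknown flag such as “-x”, it prints a message to standard error and exits with status 1.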

But wait… there’s more!

There’s also a simple way to pass data in: bash stores each argument from the command line in the positional parameters $1, $2, $3, $4 and so on.

#!/bin/bash
echo $2 $1

The script above, when run as “./test_script hello world”, will output “world hello”. This method is handy for wrapping quick tasks that you always run with the same series of parameters, for example adding the flags “-la” to “ls” as demonstrated below.

#!/bin/bash
ls -la $1
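Positional parameters give no warning when an argument is missing, so it can help to check the count in $# first and to quote each parameter so that values containing spaces stay in one piece. A small sketch, using a made-up function name (functions receive positional parameters the same way scripts do):

```shell
#!/bin/bash
# Hypothetical helper: print two arguments in reverse order
swap_args() {
	# $# holds the number of arguments that were passed in
	if [ "$#" -ne 2 ]
	then
		echo "Usage: swap_args FIRST SECOND" >&2
		return 1
	fi
	echo "$2 $1"
}

swap_args hello world          # prints: world hello
swap_args "hello there" world  # prints: world hello there
```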

Script Configuration

So now that we can get different bits of data from all these different sources, what if all my scripts leverage the same data? Can’t I just have it as a single configuration file that I edit once? YES! This next example does just that. While it doesn’t technically load data into a variable, it does allow you to encapsulate your code, including a file full of variable assignments, into logical chunks. In my case, I was looking to avoid editing multiple scripts when configuration changes were needed.

First I created my configuration script, my_script.cfg, in the same directory from which I run the example script below.

# Comments are allowed
OPT_1='Ubuntu'
OPT_2='Linux'
OPT_3='64bit'

Now the script that uses the configuration file above.

#!/bin/bash
OPT_1='RedHat'
OPT_2='Linux'
OPT_3='i386'
if [ -f my_script.cfg ];then 
	. my_script.cfg
fi
echo $OPT_1 $OPT_2 $OPT_3

Dissecting the script, you’ll see that I first set some default values. Next the code checks for the existence of the configuration file. If found, it is included with the “.” (source) command. It’s important to note that the file is included rather than parsed, because this means any code within the configuration file is run in the current shell. An EC2 instance might, for example, place all of the calls to instance-data.ec2.internal for metadata into a configuration file that’s simply included by the scripts that use that information.
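To see that an included file really does run code, here is a self-contained sketch. It writes a throwaway configuration file to a temporary location (my own addition so the example can run anywhere) in which one value is a plain assignment and another is computed by a command.

```shell
#!/bin/bash
# Write a throwaway configuration file so the sketch is self-contained
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
# Plain assignment
OPT_1='Ubuntu'
# Code also runs when the file is included
OPT_3=$(echo '64bit' | tr 'a-z' 'A-Z')
EOF

# Defaults, used for anything the configuration file does not set
OPT_1='RedHat'
OPT_2='Linux'
OPT_3='i386'
if [ -f "$CFG" ]
then
	. "$CFG"
fi
echo "$OPT_1 $OPT_2 $OPT_3"   # prints: Ubuntu Linux 64BIT
rm -f "$CFG"
```

OPT_2 keeps its default because the configuration file never sets it. The same mechanism means a configuration file should only ever come from a source you trust.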

That’s it! Hope you find this resource helpful!

And for anyone looking to put those around you on alert, buy the t-shirt from Think Geek.

© 1998-2008 AF-Design, All rights reserved.