Archive for the ‘sql’ Category

Deleting Data From InnoDB

Saturday, October 24th, 2009

MySQL Logo Problem: We are given a large MySQL database table that no longer fits in your system’s working memory. You need to prune the data since a significant portion of this data is no longer relevant to keep in this table. Our expectation is ~75% of the data will remain in the table because of a uniform and random distribution of values in col1 and col2. How then, do we go about pruning this table as efficiently as possible?

The table structure is as follows:

describe big_table;
+-----------+---------------------+------+-----+---------+----------------+
| Field     | Type                | Null | Key | Default | Extra          |
+-----------+---------------------+------+-----+---------+----------------+
| id        | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment | 
| col1      | bigint(20) unsigned | NO   | MUL | NULL    |                | 
| col2      | bigint(20) unsigned | NO   | MUL | NULL    |                | 
| chardata  | varchar(35)         | YES  |     | NULL    |                | 
+-----------+---------------------+------+-----+---------+----------------+

Solution 1: Delete

We can simply run a query against the database that will delete all records that are irrelevant. The idea here is that we use a single query that’s easily readable to filter out the unwanted records. For this example we’ll assume we are sharding this data and want to keep all odd values for either col1 or col2 in this table. We can use MySQL’s MOD() function and check for even values and delete them. We expect this query to remove ~25% of our table.

DELETE FROM big_table WHERE MOD(col1,2) = 0 AND MOD(col2, 2) = 0;

Solution 2: Create -> Insert -> Rename -> Drop (CIRD)

Another solution is to create a new table identical to the original table then simply insert the records we want to keep into the new table. Once we have all the records we wish to keep, we simply rename the tables and then drop our original table. The difference in this query is to think about what we want to keep as opposed to what we want to get rid of. We expect this method to insert ~75% of the original table.

CREATE TABLE big_table_copy LIKE big_table;
INSERT INTO big_table_copy SELECT * FROM big_table WHERE MOD(col1, 2) = 1 OR MOD(col2, 2) = 1;
RENAME TABLE big_table TO big_table_old, big_table_copy TO big_table;
DROP TABLE big_table_old;

So which is faster?

It turns out it’s considerably faster to use solution 2. Solution 1 finished preparing our test table in just over 33 minutes. Solution 2 completed the inverse task in just over 5 1/2 minutes. The test data was 1.58 million rows of randomly generated data. The MySQL server was cripped to a 32Mb buffer pool to closely mimic our real world workload which would be heavily IO bound. The on disk file was about 260Mb. Using CIRD is about 6x faster than a straight delete. This probably isn’t surprising as it’s been reported for a long time that DELETE is ~10x slower than INSERT with InnoDB, but I wanted to validate these findings to determine if they still held true for newer versions (5.1.30) of MySQL.

A Smaller Database is a Faster Database

Tuesday, February 24th, 2009

You can probably save a considerable amount of storage by right sizing your columns. It’s been documented that right sizing data types for columns can return data faster and of course making the data smaller results in less wasted storage space on disk. So I decided to find out just how much.

Inserts

I created two user tables and attempted to insert 1,200,000 rows in each, getting some performance information along the way. The method was to insert 100,000 records in a table, check out how disk space had been used and how the performance of the inserts was impacted. The code to run a similar test on your system is available at the end of this post.

Query Insert Performance

As you can see on the graph, when the on-disk file reached ~45Mb, the performance of the inserts fell very quickly. This is likely due to the low innodb_buffer_pool_size set on this machine (32M) and the forced swapping for the table. Analysis of the table after the inserts showed a average row length of 80 bytes for User1 and 67 bytes for User2. Roughly 15% smaller, which lends credibility to the roughly 15% increase in number of rows inserted before suffering similar performance bottlenecks.

Interestingly, I expected a higher throughput for the total queries per second for the user2 table because of the smaller data segments. While a small 1% increase was observed, this could eassily have been due to fragmented disk or any other number of potential elements impacting the test machine. However, since MySQL uses pages and not individual file writes for each record, it is unlikely to make a large impact unless the difference in record size (currently a difference of 13 bytes per record) is significant.

Selects

Curious how significant the impact would be on select speed, I turned my attention to seeing how fast I could get the data back. So I modified my script below to use a select statement instead and hammered the DB again. As you can see below, the performance of the select is greatly impacted by the number of rows. I ran the test with 300K, 600K and 1.2M rows and saw a performance decrease with each increment, however, the smaller record size was consistently faster than the larger format.

Query Select Performance

With user1 at 1.2Million rows, I was able to select 1M random uid values at an average rate of 3,458.3 qps. My smaller table, user2, was able to select 1M random uid values at an average rate of 3,590.6 qps, about 3.8% faster than the larger table. This translated to 132.3 queries per second more. Again, I suspect the performance increase has to do with the number of records stored in memory at any given time. Increasing the density of records by decreasing their size effectively makes the ram more efficient.

Running the tests again after purging roughly 1/2 of the records had similar results as you can see in the graph. The improvement was 4.3% which is clearly even faster but as the random pages were fetched from disk, the majority of the data now fit in RAM.

Investigating further, I opted to trim the file to just 300K records and run the tests one last time. The difference was only 1.5% faster, similar to the increased performance for the inserts. Perhaps a future test to measure the increased performance as a measure of the decrease in record length would show that a 15% decrease in record length will result in 1/10 increase in read/write performance? Alas, that’s for another blog post.

Conclusion

There are serious benefits to keeping your data footprint smaller. Although this is not the silver bullet of database tuning, it will certainly help your system run more efficiently. The best way to increase query throughput for reads is to have faster disk I/O. However, as this test shows, you can stretch your performance dollars just a little bit further by using smaller data values that make more sense for the values you are actually storing.

You can get more information on the space required by column type from MySQL’s website and do your own back of the napkin performance calculations on how much of a lift you can hope to see… good luck, and happy tuning.

The Scripts

<?php
	$records = 100000;
	$conn = mysqli_connect("127.0.0.1","root","","test");
	$start = microtime(true);
	for($i=0;$i<$records;$i++){
		$query = "INSERT IGNORE INTO user1 (uid, age, gender, lastvisit) VALUES (" . rand(1000000, 100000000) . ", " . rand(13, 50) . ", " . rand(0,2) . ", NOW())";
		// $query = "INSERT IGNORE INTO user2 (uid, age, gender, lastvisit) VALUES (" . rand(1000000, 100000000) . ", " . rand(13, 50) . ", " . rand(0,2) . ", NOW())";
		// $query = "SELECT * FROM user1 WHERE uid = " . rand(1000000, 100000000);
		// $query = "SELECT * FROM user2 WHERE uid = " . rand(1000000, 100000000);
		$conn->query($query);
	}
	$stop = microtime(true);
	mysqli_close($conn);
	print round(($stop - $start), 4) . " seconds elapsed filling the table (~" . number_format($records/round($stop-$start,4),1) . " qps).\n";
?>
mysql> CREATE TABLE `user1` (`uid` bigint(20) NOT NULL, `age` int(11) DEFAULT NULL, `gender` int(11) DEFAULT NULL, `lastvisit` datetime DEFAULT NULL, PRIMARY KEY (`uid`)) ENGINE=InnoDB;
mysql> CREATE TABLE `user2` (`uid` bigint(20) unsigned NOT NULL, `age` tinyint(3) unsigned NOT NULL, `gender` tinyint(3) unsigned NOT NULL, `lastvisit` timestamp NULL DEFAULT NULL, PRIMARY KEY (`uid`)) ENGINE=InnoDB
mysql> describe user1;
+-----------+------------+------+-----+---------+-------+
| Field     | Type       | Null | Key | Default | Extra |
+-----------+------------+------+-----+---------+-------+
| uid       | bigint(20) | NO   | PRI | NULL    |       | 
| age       | int(11)    | YES  |     | NULL    |       | 
| gender    | int(11)    | YES  |     | NULL    |       | 
| lastvisit | datetime   | YES  |     | NULL    |       | 
+-----------+------------+------+-----+---------+-------+
 
mysql> describe user2;
+-----------+---------------------+------+-----+---------+-------+
| Field     | Type                | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+-------+
| uid       | bigint(20) unsigned | NO   | PRI | NULL    |       | 
| age       | tinyint(3) unsigned | NO   |     | NULL    |       | 
| gender    | tinyint(3) unsigned | NO   |     | NULL    |       | 
| lastvisit | timestamp           | YES  |     | NULL    |       | 
+-----------+---------------------+------+-----+---------+-------+

PHP MySQL vs MySQLi Database Access Metrics

Friday, January 30th, 2009

Last week I had a MySQL DBA/Developer tell me, “MySQLi sucks” when I asked why some code I was reviewing used the mysql_* extensions provided in PHP instead of mysqli_*. This really didn’t sit well with me. I turned to a trusted source for PHP information, and found they recommend mysqli_* for connecting to MySQL servers versions 4.1.3 and above. Although, I read many anecdotal stories about how mysqli_* was faster, I was unable to find any quantified proof of just how much faster. So in keeping with my performance measurement track I’ve been on this week, I decided to conduct my own test. I was surprised by the result. MySQL is faster in some cases.

The database I used is a pet project of mine that tracks some vehicle information. It uses a varchar primary key based on an automotive VIN number and the sample table I queried against has about 30K rows. The average row length is about 150bytes. I conducted 2 types of tests. The first was a completely unrealistic test measuring how fast a single row could be returned using each of the extension types. The other also returned a random sampling of rows. Since this isn’t a performance test of MySQL, I allowed each test to run one time completely to ensure the query cache was fully populated in MySQL. Then each test was run 3 subsequent times. Each test executed 1,000,000 queries to the server.

Single Hard Coded Query:

The first listing is the mysql_* calls. You’ll notice the

mysql_free_result()

function is commented out. I did test with this enabled and disabled and found that with this function enabled the queries returned at the same rate as with it enabled. So in the interest of fairness, I disabled it to more closely mimic the code executing with the mysqli_* calls.

<?php
	$tq = 1000000;
	$mt = microtime(true);
	$conn = mysql_connect("127.0.0.1","user","");
	mysql_select_db("test_schema", $conn);
	for($i=0; $i<$tq; $i++){
		$data = mysql_query("SELECT * FROM vehicles LIMIT 1", $conn);
		$data_object = mysql_fetch_assoc($data);
		// mysql_free_result($data);
	}
	mysql_close($conn);
	$et = microtime(true) - $mt;
	print "Elapsed time: " . number_format($et) . "s\n";
	print "QPS: " . ($tq / $et) . "\n";
?>

The next listing is the same code converted to leverage the mysqli_* calls. You’ll see that the result is not freed each time through the loop.

<?php
	$tq = 1000000;
	$mt = microtime(true);
	$conn = mysqli_connect("127.0.0.1","user","","test_schema");
	for($i=0; $i<$tq; $i++){
		$data = $conn->query("SELECT * FROM vehicles LIMIT 1");
		$data_obj = $data->fetch_assoc();
	}
	$conn->close();
	$et = microtime(true) - $mt;
	print "Elapsed time: " . number_format($et) . "s\n";
	print "QPS: " . ($tq / $et) . "\n";
?>

The results were surprising to me. The mysql_* extension actually ran considerably faster than the mysqli_* extenstion. In fact, the mysql_* extension ran 3.6% faster. Perhaps this DBA was on to something? I also tested this same methodology using prepared statements with MySQLi and found the results actually degraded. Discouraged, I devised a new test that would hopefully be more realistic. The values provided are queries per second.

Method Run #1 Run #2 Run #3 Avg
mysql_* 7,227.7 7,222.2 7,220.7 7,223.5
mysqli_* 6,963.0 6,974.0 6,974.4 6,970.5
mysqli prepared statements 4,849.6 5,005.3 4,978.7 4,944.5

Parameterized Queries:

More often than not, we do not run the same select query 1 million times, and we are actually more likely to run a query over a series of data sets. While this in itself is a terrible idea in production code because you should re-write the query to do it all as a single database call, it provides a better picture of how the database might be interacted with in PHP; especially if persistent connections are being utilized. However, if you find that you have some code that fits the patterns used here, it might be to your advantage to review the prepared statement code below.

The basic flow of the scripts is to first fetch 10 thousand (~1/3) of the table’s primary keys and populate an array. Then the code will select 1 million rows using a randomly selected primary key (vin). The data is still relatively small, and fits into the query cache after a warm up period. This shows how efficiently the query passing is instead of how efficient MySQL is at answering different questions, making it a better comparison of the two technologies. The three test scripts I ran are provided here for your reference.

Leveraging the mysql_* calls

 
<?php
	print "Fetching VIN array...\n";
	$vin_array = array();
	$conn = mysql_connect("127.0.0.1","user","");
	mysql_select_db("test_schema",$conn);
	$vins = mysql_query("SELECT vin FROM vehicles LIMIT 10000", $conn);
	while($vin = mysql_fetch_assoc($vins)){
		$vin_array[] = $vin['vin'];
	}
	mysql_close($conn);
	unset($conn);
 
	print "Starting test...\n";
	$tq = 1000000;
	$mt = microtime(true);
	$conn = mysql_connect("127.0.0.1","user","");
	mysql_select_db("test_schema", $conn);
	for($i=0; $i<$tq; $i++){
		$data = mysql_query("SELECT * FROM vehicles WHERE vin = '" . $vin_array[rand(0,count($vin_array)-1)] . "'", $conn);
		$data_object = mysql_fetch_assoc($data);
		// mysql_free_result($data);
	}
	mysql_close($conn);
	$et = microtime(true) - $mt;
	print "Elapsed time: " . number_format($et) . "s\n";
	print "QPS: " . ($tq / $et) . "\n";
?>

Leveraging the mysqli_* calls

<?php
	print "Fetching VIN Array...\n";
	$vin_array = array();
	$conn = mysqli_connect("127.0.0.1","user","","test_schema");
	$vins = $conn->query("SELECT vin FROM vehicles LIMIT 10000");
	while($vin = $vins->fetch_assoc()){
		$vin_array[] = $vin['vin'];
	}
	$conn->close();
	unset($conn);
 
	print "Starting test...\n";
	$tq = 1000000;
	$mt = microtime(true);
	$conn = mysqli_connect("127.0.0.1","user","","test_schema");
	for($i=0; $i<$tq; $i++){
		$data = $conn->query("SELECT * FROM vehicles WHERE vin = '" . $vin_array[rand(0,count($vin_array)-1)] . "'");
		$data_obj = $data->fetch_assoc();
	}
	$conn->close();
	$et = microtime(true) - $mt;
	print "Elapsed time: " . number_format($et) . "s\n";
	print "QPS: " . ($tq / $et) . "\n";
?>

Leveraging mysqli_ calls and prepared statements.

<?php
	print "Fetching VIN array...\n";
	$vin_array = array(); 
	$conn = mysqli_connect("127.0.0.1","user","","test_schema");
	$vins = $conn->query("SELECT vin FROM vehicles LIMIT 10000");
	while($vin = $vins->fetch_assoc()){
		$vin_array[] = $vin['vin'];
	}
	$conn->close();
	unset($conn);
 
	print "Running tests...\n";
	$tq = 1000000;
	$mt = microtime(true);
	$conn = mysqli_connect("127.0.0.1","user","","test_schema");
	$stmt = $conn->stmt_init();
	$stmt->prepare("SELECT * FROM vehicles WHERE VIN = ?");
	for($i=0; $i<$tq; $i++){
		$stmt->bind_param("s", $vin_array[rand(0,count($vin_array) - 1)]);
		$stmt->execute();
		$stmt->bind_result($a,$b,$c,$d,$e,$f,$g,$h,$z,$j,$k,$l,$m,$n);
		$stmt->fetch();
	}
	$conn->close();
	$et = microtime(true) - $mt;
	print "Elapsed time: " . number_format($et) . "s\n";
	print "QPS: " . ($tq / $et) . "\n";
?>

The performance of this “harder” task degraded for all technologies. Again, the mysql_* extension was about 3.6% faster, but the prepared statements were 9.7% faster than the old mysql_* calls and 13.7% faster than the mysqli_ calls. This improvement makes a good case for transitioning repetitive calls in your code to prepared statements leveraging the MySQLi extenstions. Again, the values are measured in queries per second.

Method Run #1 Run #2 Run #3 Avg
mysql_* 4,063.6 4,016.7 4,054.0 4,044.8
mysqli_* 3,902.3 3,905.8 3,896.0 3,901.4
mysqli prepared statements 4,443.8 4,421.8 4,445.7 4,437.1

Without testing stored procedures, it’s difficult to say if they would have continued to improve the performance of this test, however, it is clear that the speed advantages of mysqli_* extensions are only present if you are utilizing the advanced features of the more recent versions of the database.

Conclusion:

If your code is purely hard coded statements, the old school MySQL syntax may be your best bet for a performance lift. However, most likely your code would benefit more from prepared statements and some of the other performance enhancements provided by the MySQLi extension set. I wouldn’t say, “MySQLi Sucks” but it does appear there is a valid argument for continuing to use it if you are not leveraging any of the performance enhancements of the newer versions of MySQL.

System Configuration:

  • Mac OS 10.5.6
  • 2.16Ghz Intel Core Duo
  • 2 Gb RAM
  • PHP 5.2.6
  • MySQL Server 5.0.51b

Before You Buy that new DB

Monday, October 27th, 2008

Database Server Icon - Visio Database servers are tricky. Before jumping to the conclusion that more hardware is needed, make sure you’re getting the most out of the hardware you already have. Hiring a DBA is definitely the right way to go. They’ll quickly be able to point out where the pain points are in your configuration. If you can’t afford a DBA, you can use this quick guide to get a starting point on what might need adjustment.

At the system level, start by checking over the following items.

  • Does the machine have sufficient RAM?
  • How much idle CPU time is available?
  • Is the hard disk okay?
  • Is the networking speed sufficient?
  • Are there services that should run somewhere else?

Next investigate the application itself.

  • Is the RDBMS allowed to use all of the RAM available?
  • Are the tables indexed; properly?
  • Are the indexes up to date?
  • Are the statistics up to date?
  • Are there queries that can be offloaded; refactored; eliminated?

This, at minimum, will save time when your thinking about sticking more power in-front of the database. If you still decide to make the move, make sure you check over these values as soon as you are done or you may find your performance decreases when it should have increased.

Database Migration Tips

Saturday, October 25th, 2008

Database Server Icon - Visio Moving database servers is a nontrivial task that requires planning and forethought. This guide provides some tips for planning a migration of a small to midsize databases, but not yet data-warehousing in the 5Gb – 500Gb size. Of course the process will work for smaller databases, and even larger as well. Ultimately your mileage may vary. Depending on the software you’re running your database on, there may be alternate paths, such as replication, available to you. Spend time researching your options before embarking on this type of backup. They can save time and frustration.

This document will not help with sizing your systems, however, completing the “Prepwork” will get you thinking in the right direction as to what you’ll ultimately need to purchase. I highly recommend reading Refactoring SQL Applications by Stéphane Faroult before buying hardware for capacity reasons alone.

Prepwork

Possibly the most important thing you’ll do during the migration is the preparation work. Spend some time looking at what your moving and where you’re moving to up front, it will cause you to answer some questions earlier than later which results in not making rash decisions in the heat of the moment.

1. Inventory and Audit

It’s time to find all the dirty laundry. User’s with inappropriate access (like ’sa’ or ‘root’) for running queries are obvious examples. While it’s tempting to do all of your house cleaning up front, don’t. Document it all so it can be shopped around your organization.

  • Look at your user accounts on the RDBMS. Are all of them still currently used? Can any be removed?
  • Take a look at the schemas (databases). Are all of them still necessary? Can any be removed?
  • Take a look at the tables within each schema. Make note of large objects, you’ll want to do further investigation on these to ensure they’re really serving their intended purpose.
  • Make sure you’re backups are running with the intended frequency (and that they’re actually being generated).
  • Understand the underlying sub-system design and architecture you currently have.
  • Document the pain points you’re experiencing (assuming this is a capacity based move).
  • Pick a downtime window for the cutover and ensure you build in some extra time. The length of your downtime can be estimated using your Soggy run.

Armed with this data, you can begin identifying garbage that doesn’t need to move. An example from my experience is a ColdFusion client storage database running on Microsoft SQL Server. This database holds 2 tables CDATA and CGLOBAL each with ~680K rows. With transactional logs etc, the footprint of the database swelled to 7.5Gb. Let alone for a minute that keeping transactional logs on this table was overkill, we were also making significant changes to the application servers. All of this was irrelevant because of other infrastructure changes that were being made concurrently and the database was able to be dropped. This is a singular example of how some research early on can save time later. Moving chunks of data, to disk or over the wire, still takes time – as does the restoration process.

2. Plan the actual steps required to move

1Tb External iBook Hard Drive This is a page out of David Allen’s GTD. Obvious right? Backup the data and move it to the new server, restore it… done! Well, sure that’s one way to do it, but make sure you’ve got details hashed out on the “move it” bit. If you’re doing this change in a major metro area at rush hour, you’re opening some huge risks into your timeline. But, forgetting traffic; is it best for your environment and your technology? Can you handle the downtime? Consider the following options when flushing out your plan.

  • Replication, if available to you, is an ideal way to minimize the downtime. However, you’ll still need to push all data over the wire (or via disk) at some point to get the new equipment to the correct point in time for the log files to begin replication of data effectively.
  • Sending up to 20Gb of binary data over the wire is pretty quick, but after that, it may be faster to consider other options for moving data, including moving data via USB/Firewire/eSATA/(insert your favorite external storage method here) or even a whole server.
  • Physical data moving takes time and requires coordination, possibly at a datacenter (or multiple datacenters) and may require access be setup ahead of time.

Write down your execution plan, be optimized… you’ll run against this later to keep the noise out of your head.

When moving a large MySQL database between two geographically separated systems, you’re at the mercy of the public internet. Obviously, finding a time that the throughput for your route is best is ideal, but not always an option. After completing a full backup of an entire MySQL database, I pushed the 75Gb over the wire to the new equipment. That process took nearly 2 hours. Armed with the database footprint from your audit, you can use a download calculator to calculate a best case scenario for the transfer time. Keep in mind network congestion, route traffic and latency could increase this time – and those will all increase with distance.

Spring Cleaning

Grab a baseline of your performance. This is a great way to validate that the new solution will do what you expect it to. It’s always nice to see an increase in performance, but it’s better to have a quantified value of just how much increase you get per dollar spent. If you’ll be doing any re-factoring as part of the spring cleaning, be sure to benchmark after you finish that process again.

Begin your house cleaning list by truncating or dropping unnecessary data, tables and schemas. Clean out all of those old user accounts and get your house in order. Ensure you backup proccess has been generating good files and your ready to use those for a Soggy run.

If all has gone well with your process so far, you may be considering just keeping the equipment you’ve got! But if not, and a move is still in your future… get ready to do a Soggy run – note quite the real thing, but not a “dry” run either, because we want to time our process.

Soggy Run

Make note of the time… and get moving – stick to the script. Should you deviate (in order or adding additional steps) be sure to note it on the script so you’ll be prepared for the final run. Also note time deltas at intervals so you’ll know if somethings taking too long and your window might need to be extended. It’s possible if you’ll be using replication to manage the transfer, that you won’t need to execute this again, but if you do, it’ll be nice to have the actual steps you’ve taken.

  • Make your backups
  • Begin your plan for transfering the data
  • Ensure you’ve got all the tools, scripts, and equipment you need in the right place in the right order.
  • Restore the data
  • Validate that EVERYTHING worked. Don’t just look at numbers of databases, but spot check a few (if there are too many to check them all) and ensure the data you’re expecting to be there is there!
    If you are using replication, can you start it, does it work. Trying validating a database or two to ensure everything is right – after the migration is no time to realize you’ve missed data.
  • Try your environment against the database and ensure it performs as expected, not just from an accuracy standpoint, but from a performance one as well.
  • If your not using replication to handle the next steps, drop all databases from your server. The system will need to drop them when you re-create them and this takes time especially for large databases.
  • Make note of the final time. Add some padding to it and set that as your downtime window. Doubling it is safest, but 20% may be sufficient.

D-Day

It’s been rehearsed, the pitfalls discovered, and the process documented – it’s now down to simple execution steps. Ensure you have access, either via phone, or in person to people who can assist with the migration should something go wrong. It’s nice to have resources available to call on. Gather necessary materials and get going.

  • Make sure you’re well rested. Nothing is worse than trying to make good decisions when extremely tired.
  • Begin your planned downtime window with a clear head. Disable the necessary systems and put up any necessary messaging. Now is the time to ensure your data is perfect and not being altered.
  • Run the playbook you’ve created…
  • Bring the systems back online!
  • Once you’re done with the plan – validate that it’s all working and point your apps, users and the world at your newly completed creation.
  • Benchmark! Quantify the improvements you’ve made with similar load to what you’d been running before if possible.

Blissful Unions

Tuesday, October 14th, 2008

RGB Venn Diagram

RGB Venn Diagram

I’ve been sharpening the saw lately and decided to take a few minutes to clearly illustrate some SQL terminology that newbie developers might be troubled by. Nothing in here is magic or even comprehensive, but we all often overlook the inherent power of SQL engines when writing code.

If you’re unfamiliar with the RGB color space, it works basically like this. You add a bit of a color component by increasing one of the values. R=0, G=0, B=0 (or 0,0,0) is black, and R=15, G=15, B=15 (or 15,15,15) is white. The examples below are all created using the data table format included below.

UNION

Union takes two (or more) query results and provides you with the unique result. For example, if I had two queries 3, queries that each returned a segment of the color set with the color value = 15 and the other color values > 14, I could write this as three select statements with a form similar to the following, which selects the red subset SELECT r, g, b FROM colors WHERE r = 15 AND g >= 14 AND b >= 14. That gives me 4 rows, Set 1 below.

Now I’ll include another collection of colors: SELECT r, g, b FROM colors WHERE r BETWEEN 10 AND 12 AND g = 11 AND b BETWEEN 12 AND 14, labeled Set 2. And one more: SELECT r, g, b FROM colors WHERE r BETWEEN 9 AND 10 AND g >= 14 AND b BETWEEN 14 AND 16 The result is Set 3.

Set 1

r g b
15 14 14
15 14 15
15 15 14
15 15 15
Set 2

r g b
10 11 12
10 11 13
10 11 14
11 11 12
11 11 13
11 11 14
12 11 12
12 11 13
12 11 14
Set 3

r g b
9 11 14
9 11 15
10 11 14
10 11 15

And now all merged together using a UNION. Notice it removes the duplicate, highlighted in red above, saving us some time! The order by is just to make life easier when reading the newly merged results. As you can imagine, with a more complex dataset, this could be really handy!

SELECT r, g, b FROM colors WHERE r = 15 AND g >= 14 AND b >= 14
UNION
SELECT r, g, b FROM colors WHERE r BETWEEN 10 AND 12 AND g = 11 AND b BETWEEN 12 AND 14
UNION
SELECT r, g, b FROM colors WHERE r BETWEEN 9 AND 10 AND g = 11 AND b BETWEEN 14 AND 15
ORDER BY r,g,b
r g b
9 11 14
9 11 15
10 11 12
10 11 13
10 11 14
10 11 15
11 11 12
11 11 13
11 11 14
12 11 12
12 11 13
12 11 14
15 14 14
15 14 15
15 15 14
15 15 15

UNION ALL

As a quick example, using the same 3 queries and result sets from above, UNION ALL gives us the duplicate record.

SELECT r, g, b FROM colors WHERE r = 15 AND g >= 14 AND b >= 14
UNION ALL
SELECT r, g, b FROM colors WHERE r BETWEEN 10 AND 12 AND g = 11 AND b BETWEEN 12 AND 14
UNION ALL
SELECT r, g, b FROM colors WHERE r BETWEEN 9 AND 10 AND g = 11 AND b BETWEEN 14 AND 15
ORDER BY r,g,b
r g b
9 11 14
9 11 15
10 11 12
10 11 13
10 11 14
10 11 14
10 11 15
11 11 12
11 11 13
11 11 14
12 11 12
12 11 13
12 11 14
15 14 14
15 14 15
15 15 14
15 15 15

Tables

A table “colors” was created and into it I populated the full set of colors leveraging the ordinal int values of 0-15, resulting in 4,096 rows of colors. You can of course do this for all 256 values of RGB supported in the CSS color space, which would be more accurate for performance testing of your queries, but I digress…

CREATE TABLE `test`.`colors` (
  `r` INT(4),
  `g` INT(4),
  `b` INT(4),
  PRIMARY KEY (`r`, `g`, `b`)
)
CHARACTER SET utf8;

You can use this script to quickly populate your newly created table. There are lots of other ways to do it, but this was the fastest for me to write today.

$conn = mysqli_connect($server,$username,$password,$schema);
for($r=0; $r<16; $r++){
	for($g=0; $g<16; $g++){
		for($b=0;$b<16; $b++){
			$conn->query("INSERT INTO colors (r,g,b) VALUES ($r,$g,$b)");
		}
	}
}
$conn->close();

Sharpening the Saw

Thursday, October 2nd, 2008

Cutting through wood with a hand saw is a hard job. Pushing the blade back and forth through the wood slowly but surely making the cut deeper and deeper until it’s completed is a time consuming, labor intensive task. Once completed, there’s almost always another piece that needs to be cut.

Often in our daily tasks, we find ourselves pushing and pulling the saw blade back and forth, moving a project along towards it’s expected completion date. As soon as that project is completed, often before it’s completed, another one presents itself and requires our attention. This is very similar to cutting through the log. We often feel taking even the shortest break will put us off our course, the work will pile up, and we’ll never get it all done.

You’re wrong. Sharp tools cut faster.

Franklin Covey has taught us this already. Taking a 10-20 minute break to sharpen the saw can make the cutting process go faster. Increasing productivity and of course resulting in completing the tasks quicker. Taking the time to obtain the needed skills is a hard discipline, but it must be done and done now!

Think about your current and future task list. Now, take 10 minutes to think about your most mundane task and research to see if there’s a book, course, webinar or other resource you can tap to learn more about it. Even the most cursory overview will teach you some nuance that improves your productivity – sometimes – you’ll learn your understanding of a topic is really far more basic than you realize. The added time spent on yourself will help you to complete that mundane task perhaps a little more efficiently, giving you more time to complete the task at hand.

What saw am I sharpening?

I’ve been writing simple, and some not so simple, CRUD (CReate Update Delete) SQL for most of my professional career. I find it boring and tedious, but a necessary evil in the Web 2.0 world. This week I started reading Refactoring SQL Applications by Stéphane Faroult and Joe Celko’s Thinking in Sets. What I found was that many applications I’ve seen (and written) hardly tap the power of SQL and that there are some major mistakes that the PHP/Perl/ColdFusion/ASP.NET/C# programming manuals proliferate in the over simplification of relational database design. I’m now realizing that the majority of code I’ve seen and worked on is actually using SQL as a giant persistent hashmap!

These books are causing me to think about not just the CRUD statements differently, but how my applications interact with the persistent storage engines web apps interact with every day.

Take a few minutes today to sharpen your saw.

Data Gotcha’s or Learn to Love CAST()

Wednesday, October 1st, 2008

PHP Loves Microsoft SQL Server In the last post, I explained my work around to get data out of a Microsoft SQL Server using stored procedures. Since then, I’ve continued learning more about the limitations of the FreeTDS driver in PHP and hit another snafu when working with the data. Fields that are exported as VARCHAR are truncated at 256 characters! Whoa! big gotcha when today’s database tables might contain values in VARCHAR(1024) or even bigger! So what’s a PHP script to do? CAST the values!!!

 
-- original statement
SELECT myBigVarChar 
     FROM myTable 
     WHERE myId = 1;
 
-- revised statement
SELECT CAST(myBigVarChar AS text) AS myBigVarChar 
     FROM myTable 
     WHERE myId = 1;

Notice the statement’s a little more unruly to look at, but is functionally the same. The case to text allows the FreeTDS driver to properly convert the data. I also noticed this issue with UNIQUEIDENTIFIER type columns coming through as a BLOB. My fix was to CAST the identifier to a VARCHAR(36) to safely hold the string representation of the UUID.

Hope this saves you some head scratching!

Running a Stored Procedure from MS-SQL Server with PHP on Linux

Wednesday, October 1st, 2008

PHP Loves Microsoft SQL Server After some effort getting the MSSQL drivers in PHP, I was ready to hook into the RDBMS and use my nicely developed stored procedures. FreeTDS required minimal setup and configuration and I was able to establish a connection and run simple queries against tables in no-time.

However, when I attempted to execute stored procedures, they would fail. To make troubleshooting harder, the MSSQL functions do not return error messages — at least none that I could access in PHP 5.1. My work around to the proper methodology init(), bind(), and execute() process was to use query() and pass in the procedure and parameters as needed. It’s slightly less elegant but opens the data up to you.

// Establish the connection and run the procedure
$connection = mssql_connect("123.123.123.123","db_login","db_password");
mssql_select_db("db_schemaname",$connection);
 
// Run the procedure
$username = "user"; // sanitized!!!
$password = "secret_password"; // sanitized!!!
$result = mssql_query("CheckLogin '$username', '$password'", $connection);
 
// handle the result as needed

The standard disclaimers apply about validating, escaping and bullet proofing any data passed into Microsoft SQL Server using this method. Seriously, heed these warnings, sanitize any data, or you will find your site vulnerable to SQL Injection attacks which are serious business.

Not So Fast! MySQL Stored Procedures Are Not Panacea

Wednesday, May 21st, 2008

MySQL Logo MySQL 5.0 introduced stored procedures, but are you taking advantage of them? MySQL claims a 50% lift in performance just having an execution plan on hand, but what real world performance boost can you expect to see? I set out to discover just that and found the results unsatisfying.

Column Type Indexed
id int(11) unsigned Primary Key
content varchar(4096) No

I then conducted a test which would closely mimic a real world example to determine just that. I started with a table with a few million rows of variable length data in it. I then created a stored procedure that would as closely as possible match my hand coded SQL query. Then I created a simple PHP script to test it all out.

CREATE PROCEDURE sp_getdata (i INT(11))
BEGIN
	SELECT * FROM test_table WHERE id = t;
END;
$iterations = 100000;
$conn = mysqli_database($server, $user, $pass, $schema);
 
// find the extents of our random range and make sure the connection actually opens
$max = $conn->query("select max(id) as maxid from test_table");
$max = $max->fetch_assoc();
$max = $max['maxid'];
 
$sql = "SELECT * FROM test_table WHERE id = %d";
$procedure = "CALL sp_getmessage(%d)";
 
// run the stored procedure query
for($i=0;$i<$iterations; $i++){
     $query = sprintf($procedure, rand(1,$max));
     $conn->multi_query($query);
     $result = $conn->store_result();
}
print number_format(microtime(true)-$t2,2) . "s\n";
 
// run the straight sql query
print "Test SQL Query: ";
$t1 = microtime(true);
for($i=0; $i<$iterations; $i++){
     $query = sprintf($sql, rand(1,$max));
     $result = $conn->query($sql);
}
print number_format(microtime(true)-$t1,2) . "s\n";

I then ran the test 5 times to eliminate the so called warm-up period which MySQL would inevitably use to more accurately answer future questions. I then ran the test a number of times swapping the position of the two “tests” within the code so that neither received an unfair advantage. The results are below:

Run SQL Stored Procedure Order
1. 0.28s 0.39s Proc/SQL
2. 0.28s 0.37s Proc/SQL
3. 0.28s 0.41s Proc/SQL
4. 0.28s 0.38s Proc/SQL
5. 0.28s 0.39s Proc/SQL
6. 0.28s 0.80s SQL/Proc
7. 0.28s 0.82s SQL/Proc
8. 0.28s 0.81s SQL/Proc
9. 0.28s 0.82s SQL/Proc
10. 0.27s 0.80s SQL/Proc

I was taken by surprise by two obvious trends. First, the SQL statement was amazingly consistent regardless of when it was run. Second, not only was the stored procedure slower, it was significantly slower if a batch of plain old SQL statement executions were run first.

Next I decided to track the response time of each individual query that I ran to get a better look at the execution time costs for each. I expected to see a bell curve showing some queries running very fast and others running very slow with the majority falling in the middle somewhere. I had to increase the sensitivity of my timer millionths of a second to get a good picture of just how fast things were or were not running. Again I was surprised by the results. The basic query was still much faster in total elapsed time, but suddenly the performance of the stored procedure began showing itself. I ran the tests the same way as before and swapped the order in which they ran to eliminate preferential execution order. The following table gives results from one run of 100,000 so you can see the results first hand. The method was to insert the time value into an array and increment the value of that index by one each time a query executed in that time. So the first value you see is also the first query that gets sent to the server. This is important because for the stored procedure, you’ll notice it’s MUCH longer than all subsequent runs. Oddly, the query to get the maxium value had already been run so I’m not sure what causes the increase delay in runtime.

Stored Procedure Basic SQL Statement
    [0.00144] => 1
    [0.00005] => 2
    [0.00002] => 52
    [0.00001] => 99938
    [0.00003] => 6
    [0.00004] => 1
    [0.00063] => 1
    [0.00040] => 1
    [0.00047] => 16
    [0.00049] => 42172
    [0.00048] => 26
    [0.00299] => 21
    [0.00050] => 1735
    [0.00124] => 88
    [0.00074] => 517
    [0.00249] => 32
    [0.00224] => 38
    [0.00174] => 171
    [0.00149] => 138
    [0.00274] => 22
    [0.00099] => 139
    [0.00324] => 19
    [0.00399] => 12
    [0.00374] => 12
    [0.00349] => 12
    [0.00199] => 102
    [0.00054] => 1
    [0.00043] => 1
    [0.00075] => 35
    [0.00046] => 5
    [0.00250] => 2
    [0.00024] => 49970
    [0.00100] => 4
    [0.00025] => 4657
    [0.00125] => 6
    [0.00023] => 11
    [0.00175] => 2
    [0.00022] => 6
    [0.00449] => 4
    [0.00028] => 2
    [0.00052] => 1
    [0.00044] => 1
    [0.00200] => 1
    [0.00030] => 1
    [0.00041] => 1
    [0.00026] => 3
    [0.00027] => 4
    [0.00070] => 1
    [0.00499] => 1
    [0.00150] => 2
    [0.29807] => 1
    [0.00275] => 1
    [0.00424] => 2

Clearly the fluctuation in times we see in the first table are largely related to the first query we execute. After that initial procedure call, stored procedures run consistently fast, faster in fact than the plain old SQL query. But back to the “real world test”. How often do we create a single page that executes 100,000 queries at a time? More likely its 2-20 queries which are all very different than the others in the group. Given that variable, which is really the best solution? Ultimately it’s going to depend on your setup. This was a mostly idle app server communicating over a 100Mb connection to a moderately loaded DB server with ample RAM and disk performance.

Future exploration areas will be to research the most efficient way to execute stored procedures for maximum performance and deciding if a stored procedure is faster for more complex queries involving joins etc. If anyone has some good directions I can head in here, I’m all ears.

© 1998-2008 AF-Design, All rights reserved.