Posts Tagged ‘what works’

Deleting Data From InnoDB

Saturday, October 24th, 2009

MySQL Logo Problem: We are given a large MySQL database table that no longer fits in your system’s working memory. You need to prune the data since a significant portion of this data is no longer relevant to keep in this table. Our expectation is ~75% of the data will remain in the table because of a uniform and random distribution of values in col1 and col2. How then, do we go about pruning this table as efficiently as possible?

The table structure is as follows:

describe big_table;
+-----------+---------------------+------+-----+---------+----------------+
| Field     | Type                | Null | Key | Default | Extra          |
+-----------+---------------------+------+-----+---------+----------------+
| id        | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment | 
| col1      | bigint(20) unsigned | NO   | MUL | NULL    |                | 
| col2      | bigint(20) unsigned | NO   | MUL | NULL    |                | 
| chardata  | varchar(35)         | YES  |     | NULL    |                | 
+-----------+---------------------+------+-----+---------+----------------+

Solution 1: Delete

We can simply run a query against the database that will delete all records that are irrelevant. The idea here is that we use a single query that’s easily readable to filter out the unwanted records. For this example we’ll assume we are sharding this data and want to keep all odd values for either col1 or col2 in this table. We can use MySQL’s MOD() function and check for even values and delete them. We expect this query to remove ~25% of our table.

DELETE FROM big_table WHERE MOD(col1,2) = 0 AND MOD(col2, 2) = 0;

Solution 2: Create -> Insert -> Rename -> Drop (CIRD)

Another solution is to create a new table identical to the original table then simply insert the records we want to keep into the new table. Once we have all the records we wish to keep, we simply rename the tables and then drop our original table. The difference in this query is to think about what we want to keep as opposed to what we want to get rid of. We expect this method to insert ~75% of the original table.

CREATE TABLE big_table_copy LIKE big_table;
INSERT INTO big_table_copy SELECT * FROM big_table WHERE MOD(col1, 2) = 1 OR MOD(col2, 2) = 1;
RENAME TABLE big_table TO big_table_old, big_table_copy TO big_table;
DROP TABLE big_table_old;

So which is faster?

It turns out it’s considerably faster to use solution 2. Solution 1 finished preparing our test table in just over 33 minutes. Solution 2 completed the inverse task in just over 5 1/2 minutes. The test data was 1.58 million rows of randomly generated data. The MySQL server was cripped to a 32Mb buffer pool to closely mimic our real world workload which would be heavily IO bound. The on disk file was about 260Mb. Using CIRD is about 6x faster than a straight delete. This probably isn’t surprising as it’s been reported for a long time that DELETE is ~10x slower than INSERT with InnoDB, but I wanted to validate these findings to determine if they still held true for newer versions (5.1.30) of MySQL.

Fotosync

Friday, October 23rd, 2009

AF-Design is proud to announce the launch of Fotosync, a new Facebook application. Fotosync allows users to continue using their current photo sharing solutions while still getting the benefits of sharing on Facebook. Users simply install the application, provide their publicly accessibly username at the services they wish to sync and the application does the rest.


Fotosync

Fotosync currently supports Flickr, Photobucket and Picasa with other services coming soon. These services retain high-resolution photographs, great for cloud backup and a service not provided by Facebook – as well as offering a number of additional services. Fotosync offers a premium version and of course a free version.


premium_free_fotosync

API’s Should Be Free

Wednesday, October 21st, 2009

Nothing can stifle development faster than putting a price tag on the development tools for your product. A case in point is email provider iContact who for whatever reason requires developers to register for a $9.95 per month account to create an application on their platform. Imagine for a minute how different the Facebook ecosystem would have been if they had charged developers to integrate with Platform. Many of iContact’s competitors including those featured on the iContact website, Constant Contact and Bronto, provide easy access to their development API’s. I don’t know if there’s a correlation, but these providers have been continuing to grow while iContact’s growth has remained relatively flat over the last 6 months.

iContact vs. Constant Contact vs. Bronto

There are business reasons why it’s probably important to control what applications make it onto a platform. Quality control, security of client data, inability to handle scale are all good reasons to keep people out. If this is what you need to do – don’t offer a public API.

The spectrum of current solutions for controlling access is varied. Apple’s often criticized application review process is one way to ensure that only “quality” applications become available to clients. On the other extreme, also not without criticism, Facebook’s more capitalistic approach allows users to ultimately decide what applications to use and thus drives the ecosystem from the other direction. Interestingly, regardless of the application approval process issues, anyone can create a FREE developer accounts, explore and interact with the SDK and Facebook API and start building applications. In fact, according to Alexa, only 2 of the Top 20 US Web Properties (ESPN and Disney’s GO) don’t offer free access to their API.

I strongly advise iContact and any other parties who sell their software as a service to provide an environment that encourages developers to interact with your service or don’t bother at all. Allowing developers to add value to your clients will always result in a happier customer and may even give you a competitive advantage in the marketplace, especially one as heavily saturated as email marketing.

Redirect Clients While Processing Continues

Friday, October 16th, 2009

PHP LogoGenerally, if you need to have the browser see a page while you continue executing the request I would use a queue system like Gearman or Amazon’s SQS. However, in some rare cases running the code in the template requested by the user is just as fast as communicating with a remote queue. For those instances, redirecting the client while processing finishes makes sense.

<?php
// Redirect the client
ignore_user_abort(true);
set_time_limit(1);
ob_start();
header("Location: http://www.example.com/");
header("Connection: close");
header("Content-Length: " . ob_get_length());
while(@ob_end_flush());
flush();
 
// Continue processing
sleep(10);
?>

An example in action? Adding an entry to a database server on a redirect script. That DB server may be very busy and so take longer than we’d like to respond. Using this redirect code we can ensure the browser bounces to the next page as quickly as possible.

Using newFetchPersonAppDataRequest on MySpace

Wednesday, July 15th, 2009

OpenSocial Site Logo Storing application data on the OpenSocial host is a great way to offload some unnecessary database and application server load. Why request a preference such as a skin for a user profile from your servers if we can just let the container handle it? MySpace allows for ~1K of data storage per user per application. However, there is a bug with the method newFetchPersonAppDataRequest when added as the only item of a DataRequest. Calling send on the request doesn’t actually do anything! It returns a DataResponse object with no data. As a work around, grab some other information to ensure that the request actually is sent to the container’s server. I used the owner data in this example.

View Code JAVASCRIPT
// The function to load the application data
function getAppData(){
	var req = opensocial.newDataRequest();
	var owner = opensocial.newIdSpec({"userId":"OWNER"});
	req.add(req.newFetchPersonRequest(owner), "owner");
	req.add(req.newFetchPersonAppDataRequest(owner, "appdata"), "owner_appdata");
	req.send(getAppDataCallback);
	return;
}
 
// The callback for getAppData()
function getAppDataCallback(d){
	var owner = d.get("owner"); // if you need it, use it!
	var data = d.get("owner_appdata");
	if(data.hadError()){
		// handle the error appropriately
		return;
	}
	// do whatever your program needs to do with the data
}

2 Handy Utilities for OS X Developers

Wednesday, March 18th, 2009

Over the last few weeks I fought with getting my Mac OS X machine to mount a NFS share on Linux remotely. I’ve tried different port numbers, different security models, even SSH tunneling to no avail. I’ve read mailing lists that say it can’t be done, and others that say certain kernel versions work and others don’t. I’ve been able to actually edit files for 2-3 minutes before the connection would drop for no apparent reason. In other tests, I was able to view files, but due to some permissions errors, I was unable to make changes. After parsing through the miriad of freely given advice online, I began seeking alternatives.

The key requirements for me were simple:

  1. Be able to browse the directory structure using Mac OS X finder.
  2. Be able to save/copy/move/delete files without being prompted for my password.
  3. Be able to edit files using any editor of my choosing.

ExpanDrive

ExanDrive I have found the tool I’ve been looking for! ExpanDrive allows me to use sftp, basically ssh to manage my files remotely but leverage my super handy Mac tools to do the editing. This is a huge win for productivity. The publisher provides a 30 day free trial, but I knew instantly this was what I’ve been looking for. The package runs $39.95. I expect it will pay for itself in less than a week of development. There’s lots of great support through getsatisfaction.com for common questions like, “how do I access ec2?” which didn’t directly answer my question, but got me sufficient information to troubleshoot my own SSH connection issues.

I’m still a little disappointed I was unable to get NFS working as I had hoped, but this solution definitely makes my life easier.

Meerkat

Meerkat Another great tool that I discovered today is Meerkat. As you’ll recall I said I was trying to use SSH tunnels to access the NFS service. After trying 8 different tools (that were all junk) I found Meerkat. The software is dead simple and provides rich access to pretty much everything I could want. Using Meerkat you can leverage an SSH login into one machine to provide access to a port bound service somewhere else. I see myself using this to grab remote access to a MySQL server parked behind a firewall via an SSH tunnel to a server that can see it. Other tools may have more “options”, but Meerkat actually works. I haven’t bitten the bullet to buy this one yet, I want to see how much I actually use it first, but at $19.95 – it’s a bargain as well and may quickly pay for itself.

I also wanted to give credit to Alan Watson who’s blog post about using ExpanDrive even put them on the radar for me. He also blogged on Meerkat which is how I found him in the first place. Thank you Alan, you made my day!

Twitter Like Timestamps

Tuesday, February 10th, 2009

Time Strings on Autolists.com Following up on the post about Facebook Like Datelines, this post provides a simple function for creating Twitter like aged timestamps in natural language. Looking at a date time isn’t always helpful, for example “2/9/2004 19:30am” doesn’t give the reader a quick understanding if this is current or not. Twitter, Facebook and other sites uses the difference in times to provide a more meaningful expression of when an action took place which also provides relevance to current events. The great part about using this method is the need to handle time-zones goes away because the calculation of the time elapsed is relative regardless of what time-zone the data is tagged in. See it in action with the General Motors news tags on AutoLists.

The function below simply displays an English string that provides the difference in times. This is obviously in English, but can be internationalized very easily. Twitter stops at 24 hours, but this extends the natural language formatting out to about 1 week then reverts to a full precision time-stamp. As you can see by the very simple math and logic involved, you can quickly extend this as far out as you would like.

function time_offset($t, $f = 'h:ma M. j Y T'){
	$o = time() - $t;
	switch($o){
		case($o <= 1): return "just now"; break;
		case($o < 20): return $o . " seconds ago"; break;
		case($o < 40): return "half a minute ago"; break;
		case($o < 60): return "less than a minute ago"; break;
		case($o <= 90): return "1 minute ago"; break;
		case($o <= 59*60): return round($o / 60) . " minutes ago"; break;
		case($o <= 60*60*1.5): return "1 hour ago"; break;
		case($o <= 60*60*24): return round($o / 60 / 60) . " hours ago"; break;
		case($o <= 60*60*24*1.5): return "1 day ago"; break;
		case($o < 60*60*24*7): return round($o / 60 / 60 / 24) . " days ago"; break;
		case($o <= 60*60*24*9): return "1 week ago"; break;
		default: return date($f, $t);
	}
}

Facebook Datelines For Your Data

Tuesday, February 10th, 2009

Autolists.com Date String Example I really like the date strings in the Facebook Newsfeed. They somehow make the data feel more personal when they relate to me with more natural language “Today”. This in place of callous standardized formatting so many websites present such as “2/10/2009″ or worse yet “02/10/2004″ or even worse still “2009-02-10″. The following code illustrates how you can create a quick and easy formatting of data that’s ordered by date and output it in a meaningful way to your users. You can see an example of this in action on the top stories feed on AutoLists.com as well.

The one flaw with this method as presented here is it is tied to the webserver’s time zone and the developers regionalization preferences. This, however, can easily be extended to match the users time-zone and internationalization needs with a little more effort. In the interest of simplicity I am providing it as simple as possible to illustrate the logic. The dateline class follows the code listing.

// Query for your data that contains a datetime field
 
$date_str = today_yesterday(time());
while($record = $query_result->fetch_assoc()){
	if($date_str != today_yesterday(strtotime($record['date_published']))){
		$date_str = today_yesterday(strtotime($record['date_published']));
		print '<div class="dateline">" . $date_str . "</div>';
	}
 
	// Do whatever logic you need to with each record here...
	printf($record_format, $record['title'], $record['link'], $record['image']);
 
}
 
// Date String Formatting Function
function today_yesterday($t, $f = 'l, F j'){
	if(date("Ymd") == date("Ymd", $t)){
		return "Today";
	} elseif (date("Ymd", time()-(24*60*60)) == date("Ymd",$t)){
		return "Yesterday";
	} else {
		return date($f, $t);
	}
}
div.dateline { border-bottom:1px solid #ccc; color: #ccc; font-size:12pt; }

Memcached with PHP on Mac OS X

Saturday, October 18th, 2008

Nate Haug provides a great script for installing memcached along with some very detailed instructions on setting up a sandbox environment. I’m not using his MAMP sandbox, instead opting for the built in PHP / Apache install, so I needed to change a few things from his tutorial. My system is a fully updated Intel MacBook Pro running OS X 10.5.5 with the Xcode tools installed – YMMV. PHP is currently reporting version 5.2.6.

  1. I added my revised start script for memcached.
  2. The PHP version that ships with OS X doesn’t have PECL, so I downloaded the source and compiled manually.
    phpize; configure; make; sudo make install
  3. Edit to /etc/php.ini: Changed: extension_dir = /usr/lib/php/extensions/no-debug-non-zts-20060613/
  4. Edit to /etc/php.ini: Added: extension=memcache.so

You can skip his Apache scripts. Restart apache by restarting Web Sharing in the System Preferences.

The major changes I made from Nate’s memcached startup script were the singular instance and binding the service to localhost (127.0.0.1) only. This keeps memcached slightly more secure by only having it listen on the loopback adapter. If you need more space, just change the -m attribute to be higher; it’s measured in MB.

# /bin/sh
memcached -m 1 -l 127.0.0.1 -p 11211 -d

NOTE: As with any other service running on your system, opening a web server exposes your system to potential attack and worse. Be sure to keep production data away from your test environment. Someone at Starbucks, sharing your WiFi connection, may be surfing your development site too. Consider yourself warned!

Blissful Unions

Tuesday, October 14th, 2008

RGB Venn Diagram

RGB Venn Diagram

I’ve been sharpening the saw lately and decided to take a few minutes to clearly illustrate some SQL terminology that newbie developers might be troubled by. Nothing in here is magic or even comprehensive, but we all often overlook the inherent power of SQL engines when writing code.

If you’re unfamiliar with the RGB color space, it works basically like this. You add a bit of a color component by increasing one of the values. R=0, G=0, B=0 (or 0,0,0) is black, and R=15, G=15, B=15 (or 15,15,15) is white. The examples below are all created using the data table format included below.

UNION

Union takes two (or more) query results and provides you with the unique result. For example, if I had two queries 3, queries that each returned a segment of the color set with the color value = 15 and the other color values > 14, I could write this as three select statements with a form similar to the following, which selects the red subset SELECT r, g, b FROM colors WHERE r = 15 AND g >= 14 AND b >= 14. That gives me 4 rows, Set 1 below.

Now I’ll include another collection of colors: SELECT r, g, b FROM colors WHERE r BETWEEN 10 AND 12 AND g = 11 AND b BETWEEN 12 AND 14, labeled Set 2. And one more: SELECT r, g, b FROM colors WHERE r BETWEEN 9 AND 10 AND g >= 14 AND b BETWEEN 14 AND 16 The result is Set 3.

Set 1

r g b
15 14 14
15 14 15
15 15 14
15 15 15
Set 2

r g b
10 11 12
10 11 13
10 11 14
11 11 12
11 11 13
11 11 14
12 11 12
12 11 13
12 11 14
Set 3

r g b
9 11 14
9 11 15
10 11 14
10 11 15

And now all merged together using a UNION. Notice it removes the duplicate, highlighted in red above, saving us some time! The order by is just to make life easier when reading the newly merged results. As you can imagine, with a more complex dataset, this could be really handy!

SELECT r, g, b FROM colors WHERE r = 15 AND g >= 14 AND b >= 14
UNION
SELECT r, g, b FROM colors WHERE r BETWEEN 10 AND 12 AND g = 11 AND b BETWEEN 12 AND 14
UNION
SELECT r, g, b FROM colors WHERE r BETWEEN 9 AND 10 AND g = 11 AND b BETWEEN 14 AND 15
ORDER BY r,g,b
r g b
9 11 14
9 11 15
10 11 12
10 11 13
10 11 14
10 11 15
11 11 12
11 11 13
11 11 14
12 11 12
12 11 13
12 11 14
15 14 14
15 14 15
15 15 14
15 15 15

UNION ALL

As a quick example, using the same 3 queries and result sets from above, UNION ALL gives us the duplicate record.

SELECT r, g, b FROM colors WHERE r = 15 AND g >= 14 AND b >= 14
UNION ALL
SELECT r, g, b FROM colors WHERE r BETWEEN 10 AND 12 AND g = 11 AND b BETWEEN 12 AND 14
UNION ALL
SELECT r, g, b FROM colors WHERE r BETWEEN 9 AND 10 AND g = 11 AND b BETWEEN 14 AND 15
ORDER BY r,g,b
r g b
9 11 14
9 11 15
10 11 12
10 11 13
10 11 14
10 11 14
10 11 15
11 11 12
11 11 13
11 11 14
12 11 12
12 11 13
12 11 14
15 14 14
15 14 15
15 15 14
15 15 15

Tables

A table “colors” was created and into it I populated the full set of colors leveraging the ordinal int values of 0-15, resulting in 4,096 rows of colors. You can of course do this for all 256 values of RGB supported in the CSS color space, which would be more accurate for performance testing of your queries, but I digress…

CREATE TABLE `test`.`colors` (
  `r` INT(4),
  `g` INT(4),
  `b` INT(4),
  PRIMARY KEY (`r`, `g`, `b`)
)
CHARACTER SET utf8;

You can use this script to quickly populate your newly created table. There are lots of other ways to do it, but this was the fastest for me to write today.

$conn = mysqli_connect($server,$username,$password,$schema);
for($r=0; $r<16; $r++){
	for($g=0; $g<16; $g++){
		for($b=0;$b<16; $b++){
			$conn->query("INSERT INTO colors (r,g,b) VALUES ($r,$g,$b)");
		}
	}
}
$conn->close();
© 1998-2008 AF-Design, All rights reserved.