Evenly distributing data on a group of MySQL databases

I recently read an article about the way pinterest save data. After some rapid growth, they experimented with a variety of setups (which did not work too well), they devised a simple (relatively speaking) system to save data across an array of MySQL databases. The idea behind this being that you could initially put all the databases on a single server and as demand grows, move these databases out to their own servers.

Part of this approach involved saving data for a specific URL. This involved doing a hash and running a modulus operation on a URL that returns a number between a certain value (e.g. 1 – 4096) to determine where the data would be stored for that URL. So for example, a hash would be ran on the string ‘example.com’ and that would return 3 which would place it on db0003, ‘example.com/1’ would return 2034 which would place it on ‘db2034’ and so forth. The idea around this being that data would be evenly distributed between the databases.

I’ve been thinking about a project involving saving large amounts of data in a distributed fashion and was quite taken by this because of it’s simple approach to sharding. However, how evenly data were distributed was something I wondered about. After some initial poking around, it seemed that crc32 was a good candidate due to it not being particularly resource intensive compared to md5 and sha*.

However, two questions sprang to mind:

* How evenly was distributed between the shards.
* Are there no discrepancies between the hash function

I was pretty sure the answers to the questions would be ‘quite well’, and ‘there are no discrepancies’ respectively but did some tests anyway. After all, the last thing we want is for one implementation to give a different result from another.

The test strings were ran on the majestic millions top domains and what the code does is create an array of all the shard id’s with each one initialized at 0. Then go through that majestic millions text file line by line and does a hash (and modulus) to get the shard id. Once the shard id is retrieved, iterate the array with the shard id’s by 1. The graph below shows the result of those hashes and shows how many domains each bucket has.

The x axis shows the shard id and the y axis shows the number of records in each shard. Click to embiggen

What we can see from eyeballing the graph it is that data are roughly evenly distributed between the shards. A few outliers here and there but most shards have between 100 and 140 items. Another script which compares the shard result of each implementation demonstrated there were no discrepancies either.

All code can be found here. Any questions, feel free to comment and I’ll get back asap.

Move your projects out of the basement

Lets start off with keeping projects out of the *basement and by that I mean preventing projects from being hidden away and only accessible and used by yourself.Do read on if the following sounds familiar:
  • Start project
  • Finish it (get it to a state where it does what you want)
  • Start using it in some capacity for yourself.
  • Briefly contemplate promoting it but: make excuses that it needs more work, shrugging, hiding under your desk.
  • Start a completely new project

I have a few tips listed below which can help get the project out there. They’re all points I keep in mind personally too.

The project itself

Decent Readme

The first is to write a decent readme explaining what the software does and ideally has all the key stuff like how to contribute, how to ask for help and so forth.

Get them up and running as quickly

Ideally a standalone exe or a single command such as pip install example or apt-get install example. Go on the assumption that a user is mildly interested and may give an installation attempt a few minutes (if you’re lucky) and outline this with in a quickstart heading in the readme.

Getting it out in the open

Find forum posts where people are asking questions.

Think of what questions a user would ask that would lead them to your project. The more specific the better. For example, with the MRISA project I literally googled ‘reverse image search api’ found a user asking for one on stack overflow and popped. Basically all the traffic comes from that post and the project has around 80 stars and multiple contributions. Not bad for 5 mins of ‘marketing’!

The real trick though is to make sure what you are posting is genuinely helping. If you just randomly or insensitively plug your project, it’s going to upset people and backfire.

Submit your project to relevant sites

If your project uses an api, library or has any other dependancy, check the sites and see if they have a place to submit projects.For instance, digitalocean has a projects page which I recently had a project accepted on. The worse they can say is no (I say that as though it’s such a little thing!) and if they come back with suggestions it’s a win because you know what to improve.

I think those four points are particularly useful as they are quite fundamental, should be quickly and easily achievable but have the potential to provide high returns(plenty of visits and users) for minimal investment (takes relatively litte time to implement).

Agree/disagree, have something to add? Let me know in the comments section. Thanks for reading!

* I literally keep my projects on a 1u server in the basement running gogs. But I’m gradually starting to push things onto github

Monitoring your DNS records

Sometimes DNS servers just drop hostnames for a variety of reasons. Perhaps a customer forgets about a bill and their DNS records are wiped or there is an edit to a record, it’s not quite right and needs to be reverted and so on.

In an ideal world this would never happen but it does seem to every now and again. In most cases, this can be recovered but it can be a pain and also time consuming.This is where DNS Historian comes in – an applicaton which monitors and records DNS changes. It also provides an http api to new hostnames can be added and existing ones can be queried.  To find out more, check out: http://dns-historian.mage.me.uk/

Calculating the distance between a set of postcodes and ranking them in order.

Overview

There are sometimes instances where you have a set of locations and you need to rank them in order. A perfect example of this is allowing customers to find out the closest branch of a shop. The example here involves a set of ice cream shops and showing which one is closest to the postcode we specify.

strawberry_ice_cream-t2

You can see the code here in action at: charitiesneed.co.uk. As always any patches to the code are very welcome

Getting started with the example

The first step is to download a copy of the source code and sql script which you can do using git:

git clone https://github.com/vixre/postcodes_ordered_by_distance

or downloading it all in a zip file

Create a new database and import the table with the UK postcodes and their corresponding lat and lon (UK postcode database was provided by freemaptools.com).

mysql> create database example_db;

then exit back into your shell, cd to the source code directory and run:

mysql> mysql -u username -p -h localhost example_db < database.sql

That’s the database all set up so now take a look at the example (search.php).

<?php
	require_once('distance.php');
	date_default_timezone_set('Europe/London');
	// connect to database
	$db_server = 'localhost';
	$db_user = 'root';
	$db_pass = 'password';
	$db_name = 'example_db';
	$conn = new mysqli($db_server, $db_user, $db_pass, $db_name);</code>
	
	if ($conn->connect_error) {
	    trigger_error('Database connection failed: ' . $conn->connect_error, E_USER_ERROR);
	}
	
	// The origin post code
	$_REQUEST['post_code'] = 'YO31 8UB';
	
	// Remove all spaces in post code
	$post_code_search = str_replace(' ', '', $_REQUEST['post_code']);
	
	$page_number = (!empty($_REQUEST['page_number'])) ? $_REQUEST['page_number'] : $page_number = 1;
	
	$distance = new distance;
	
	$distance->entities_per_page = 10;
	$distance->offset = ($page_number - 1) * $distance->entities_per_page;
	
	// Initialize map object and set location origin
	$distance->origin = $post_code_search;
	
	$distance->populate_coordinates();
	$returned_array = $distance->build_result();
	
	// Encoded in json
	echo json_encode($returned_array);
?>

Essentially what this does is takes a postcode (YO31 8UB), looks up the corresponding lat and lon coordinates and runs a SQL query using the haversine formula and returns the ice cream shops in order of distance. So essentially the customer knows where the nearest place to get an ice cream is. Once you have entered your database credentials to search.php you should be able to run:

php search.php

which should return (distance in miles):

  Array
    (
     [id] => 1
     [name] => Archers Ice Cream
     [distance] => 29.00979938205737
    )
    Array
    (
     [id] => 7
     [name] => Brymor Ice Cream
     [distance] => 157.5704259332447
    )
    Array
    (
     [id] => 6
     [name] => Wensleydale Ice Cream
     [distance] => 204.28282924774888
    )
    Array
    (
     [id] => 2
     [name] => Walls factory
     [distance] => 214.04476274067244
    )

There you have it! We can now see that out of all the ice cream shops in the database (as the crow flies) ‘Archers Ice Cream’ is the nearest. Hopefully this example will help you gain an idea of how you can implement this functionality yourself into the projects you are working on.

Need help or not exactly what you are looking for?

Vixre is a consulting and software development company based in Northallerton, England. We’re happy to help you with implementing the distance calculation, data preparation, processing, server setup and such. Just let us know what you would like to achieve and we will see if we can be of service.

Possible improvements

Investigating the possibility of integration with the postcode.io API. This would likely involve porting the existing code (worth chatting with the postcode.io team before doing this).

Currently the class only provides an ‘as the crow flies‘ distance calculation. Using a path routing tool (e.g. OSRM) would be very useful and should allow for distance estimates that take into account roads and terrain.

Automated wordpress installation in under 30 seconds

Overview

Recently a client expressed his interest in teaching people how to use wordpress. The goal was to give each student in the class their own site so they could work on it throughout the day to make the lesson practical and give the students some hands on experience. It’s also great because the student has something they can take home with them and can continue to work on. I got through setting up 3 before I realized it had to be automated.

wordpress-logo-stacked-rgb

Why not just have each student sign up for a wordpress.com account?

WordPress.com is great, someone just has to sign up and poof, they have a site. The issue is flexibility and one of the major ones is lack of plugins. Currently wordpress.com has a list of plugins and as far as I’m aware you are not able to install any plugin you wish – only the ones they have on their set list. Considering the abundance of great plugins and that most users can benefit from them, it’s a major drawback.

Technical Overview

It’s a bash script that starts at the beginning by creating a user on the server, installs the database, creates relevant directories, downloads wordpress and assigns permissions and all the way to using the cloudflare DNS api to create the records and creating a welcome message. The idea was to just type one command and a site is ready for use like so:

./install.sh test123

which would result in a brand new wordpress site at: test123.dev.vixre.co.uk

It does just that and has saved me quite a lot of time. If you’re inclined to have a look at the code, please feel free to do so.

https://github.com/vixre/automatically_install_wordpress

If you have any questions or would like something changing or enhancing, we are available.

Server details: Debian stable, Apache2, MySQL/MariaDB, PHP, cURL, wget

Overall message

I think the key message is try to take time out and be aware of how you are working and whether it can be improved. If you are familiar with the process and find it is taking up more time than you would like, exploring how to automate the process is definitely something to look in to and something we can help with.

MySQL UK post code table with corresponding longitude and latitude

(Edit March 2016) Please note the freemaptools site now maintains a MySQL version so it would probably be best to use theirs instead.

One of the projects I have been working on recently involved the calculation of the distance between two postcodes. Google provides a decent API and retrieving the distance between two post codes with map routing is very simple. For example to calculate the distance between Fountains Abbey and Northallerton high street using their postcodes would only require:

https://maps.googleapis.com/maps/api/distancematrix/json?origins=HG43DY&destinations=DL78LW&sensor=false&units=imperial

The downside to this is that there are request limits imposed on this API and you can only make a certain amount before you need to hand over some cash to Google. Due to the nature of the project, I realized these limits would quickly be met and as a result found the Google API to be unsuitable.
An alternative would be to find a database with all the UK post codes and their longitude and latitude coordinates. You could then easily lookup the coordinates of a post code and pass them to a routing engine (e.g. Open Source Routing Machine) to calculate distances by road or potentially just a simple ‘as the crow flies’ distance calculation using trigonometry.Open Source Routing Machine Logo

It didn’t take long to find the fantastic freemaptools.com which has a csv containing all UK postcodes and their coordinates. Unfortunately they only had an MSSQL version so I whipped up a MySQL version which worked nicely and is just below for your perusal and use. The same goes for the output the script generates:

Output of the script: mysql_postcodes.zip (35M) Last updated 01/07/2014

The PHP script on github

Readme

PHP code for taking the csv found at http://www.freemaptools.com/download-uk-postcode-lat-lng.htm and importing it into a mysql database. You can use the script in two ways:

  1. Fill in the database details at the start of the file and run the queries as they are built.

    php import.php postcodes.csv

  2. Specify ‘echo’ as the 2nd argument and output the sql to a file.

    php import.php postcodes.csv echo > mysql_postcodes.sql