My system has a daily job to collect and aggregate the IP Blocks distributed by the five registrars into a single database, and then provide high-level WHOIS information for this data. If you want to try this yourself, the interface here.
On an extremely irregular basis, I aggregate the statistics from this data, and present it to the masses for the examination. I might actually automate this data someday!So, for August 2007 (as of August 21, 2007), here are the aggregated IP distribution statistics broken down by registrar and country.
Starting with the country data, I figured that a sane cutoff for values would be 1,000,000 IP addresses. The list here is mostly unchanged, except for the addition of the European Union (EU). The EU is a recent addition, as it was not listed in the Country Code lookup table I had been using, and had to be added manually on August 20, 2007.
Category: GrabIP
Geographic IP database using PERL, PHP and MySQL — UPDATE: September 16 2008
Updated September 16 2008 to reflect the numerous changes that have resulted since the original article was posted in 2005 – smp
Targeting Web site content to the specific visitors who view the site is a very important marketing advantage. Being able to track incoming visitors by the country that they originate from is an additional item that can assist companies in ensuring that visitors are presented with relevant content. This may seem like a daunting task, but it can be achieved with a high degree of accuracy using publicly accessible data, and Open Source software.
IP to Country Mapping
The idea for IP to Country mapping is one that has started to appear more frequently on the Internet in recent months. All GeoIP systems do warn users that they are not 100% accurate. The accuracy of GeoIP mapping can be affected by things such as large corporate or ISP networks where traffic is routed out a small number of public access points, regardless of the traffic’s point of origin.
Making IP Addresses Searchable
The first issue that needs to be addressed is how to determine if an IP Address is in one of the ranges that is defined as originating from a distinct country. The simplest way to range-match IP Addresses is to abandon the dotted-quad notation we are all familiar with, and convert the IP Address to an IP Number.
All IP Addresses can be converted into decimal numbers that fall into a known range between 0 and (2^32)-1 (4294967295). In reality, the range is even smaller than that, as public IP Addresses fall between 0.0.0.0 (IP Number: 0) and 223.255.255.255 (IP Number: 3758096383).
A quick search of the Web showed me that there is a way to create functions to convert IP Addresses to their numerical (IP Number) equivalent and reverse the process.
sub ip2long { return sprintf("%u",unpack("l*", pack("l*", unpack("N*", inet_aton(shift))))); } sub long2ip { return inet_ntoa(pack("N*", shift)); } <--snip--> if ($start_ip =~ /\d+\.\d+\.\d+\.\d+/) { ## my $ip_address = shift($start_ip); chomp($start_ip); $long_start = ip2long($start_ip); ## print "$ip_address converts to $ip_number\n"; }
If you are using PHP in your applications, this conversion process is made even easier by native function calls.
Convert IP Address to IP Number $ip_number = sprintf("%u",ip2long($ip_address));
IP Address Location Data
Now that we have settled on a format for the IP data to be used in the database, we now have to find IP data that allows us to map IP Addresses to countries. This is easier than it sounds, as this data is centrally held by the 5 regional IP Registries — ARIN, RIPE, APNIC, LACNIC, and AFRINIC. After poking around in the depths of their Web sites, I found that they actually provide text formatted versions of the allocated and assigned IP ranges that they are responsible for. All of the registries use the same format, which makes parsing the these files a simple process.
As of March, 2005, the distribution of IPV4 networks in the database by registry is:
+----------+--------------+ | registry | num_networks | +----------+--------------+ | arin | 36073 | | ripencc | 14813 | | apnic | 10474 | | lacnic | 1460 | | afrinic | 443 | +----------+--------------+ 5 rows in set (0.52 sec)
I chose to use the PERL module WWW::CURL to retrieve the files. You could re-write the application to use LWP or some other method on systems where cURL is not supported, as it is a simple file download over FTP. I update the data once a day, which may at first appear excessive. However, I have seen upwards of 40-50 new rows added to the database in a single day.
Some may ask why I chose to write the downloaded files to a file rather than immediately inserting them into the database. Using this two-step process gives me the ability to manually rollback to an older database if there is a problem retrieving one of the registry files. I have set an arbitrary limit of 75,000 lines for the entire aggregated file; if the file is less than that, the remainder of the process is aborted.
The data retrieved from the registries is in the following format.
Registry Raw Data Format
<snip> apnic|CN|ipv4|202.127.4.0|256|19950610|assigned apnic|BN|ipv4|202.160.0.0|2048|19950610|allocated apnic|NP|asn|4613|1|19950611|allocated apnic|LK|ipv4|203.143.0.0|1024|19950612|allocated apnic|MO|asn|4609|1|19950615|allocated apnic|KR|asn|4670|1|19950616|allocated apnic|SB|ipv4|202.63.254.0|512|19950618|assigned apnic|JP|ipv4|202.232.0.0|262144|19950618|allocated apnic|SG|ipv4|203.127.192.0|8192|19950618|allocated apnic|PK|asn|4615|1|19950629|allocated apnic|HK|asn|4614|1|19950704|allocated </snip>
The fields are all “|” (pipe-character) separated, and are described below.
COLUMN VALUES --------------------------------------------------------------------- REGISTRY: apnic,arin,ripencc,lacnic,iana COUNTRY_CODE: One of 240 unique 2-character country codes or "*" ADDRESS_TYPE: asn,ipv4,ipv6 ADDRESS: Either the starting IP Address or AS Number or "*" NUMBER: Number of IPs in range or "1" if ADDRESS_TYPE is "asn" DATE: Date IP range or AS Number was added to database or "*" RANGE_TYPE: "allocated" -> borrowed; "assigned" -> owned
Storing the Data in MySQL
To store the data, I created a two-table MySQL database named “ip_registry”, using the script below.
Database creation statement for ‘ip_registry’
CREATE DATABASE ip_registry;
Table structure for table ‘country_code’
CREATE TABLE ip_registry.country_code ( code char(2) default NULL, country varchar(50) default NULL, UNIQUE KEY code (code) );
Table structure for table ‘ip_map’
CREATE TABLE ip_registry.ip_map ( code char(2) default NULL, registry char(10) default NULL, ip_from double default NULL, ip_to double default NULL, UNIQUE KEY registry (registry,ip_from,ip_to) );
As of March 2005 September 2008, the “ip_map” data table for my system runs to 63,263 88,000+ rows. This value will change daily, and may decrease suddenly at times. The registries make an effort to aggregate as many IP networks as possible into the largest possible contiguous block, and this aggregation process will reduce the number of individual entries by 2,000 – 3,000 rows in a single day.
The recognized standard for country codes is ISO 3166. In this standard, each nation is assigned a unique, two-character code. The ONLY exception I found to this rule is that, for historical reasons, the IP registries have entries for the United Kingdom listed with two country codes (GB and UK). I could have corrected this in the Perl script by standardizing on a single country code, but I preferred the solution of adding another row to the “country_code” table.
From the raw Registry data, I determined that only four of the fields useful for the project that I was working on: REGISTRY, COUNTRY_CODE, ADDRESS, and NUMBER. I then wrote PERL code to read the raw IP Registry data from the data file I created previously, convert the starting IP address to a number, use this starting IP Number that to generate the end IP Number, and then insert the rows into a database.
PERL: IP Number conversion and database insert
<--snip--> if ($line_count >= 115000) { ## Establish Database Connection print "\n\nOpening database connection"; my $dbh = DBI->connect("DBI:mysql:host=[host];database=logger","[username]","[password]",{PrintError=>0}); ## Remove existing values my $sth = $dbh->do("TRUNCATE TABLE ip_map"); print " --> Data from ip_map table dropped"; $sth = $dbh->prepare("INSERT into ip_map values (?,?,?,?,?,?)"); $count = 0; print " --> Completed\n"; ## Insert Data Into Database print "Inserting data into the database"; open (PROCESS, "<$file"); while ($line = ) { chomp ($line); if (($line =~ m/\|ipv4\|/) and ($line !~ m/\|\*\|/)) { ($registrar,$country_code,$item_type,$start_ip,$num_ip,$entry_date,$registry_type) = split(/\|/, $line); $long_start = 0; if ($start_ip =~ /\d+\.\d+\.\d+\.\d+/) { ## my $ip_address = shift($start_ip); chomp($start_ip); $long_start = ip2long($start_ip); ## print "$ip_address converts to $ip_number\n"; $long_end = $long_start + ($num_ip-1); $count += $sth->execute($country_code,$registrar,$long_start,$long_end,$num_ip,$start_ip); } } } } <--snip-->
The “TRUNCATE” statement in the script has the affect of dropping the table and re-creating it using the column names and types defined in the initial create statement. It is easier to rebuild the data table each time new data is inserted to ensure that duplicates and overlaps do not enter into the database.
Why is the value of “$long_end” defined by “$long_start + ($num_ip-1)”? The IP address ranges delivered by the registries count the starting value as one of the items in the set — i.e. counting using ordinal numbers.
START_IP: 12.236.236.0 END_IP: 12.236.236.255 NUMBER_IP: 256
If cardinal numbering is used to calculate the address range, incorrect IP addresses will be generated.
IP Number Calculations
WRONG! 216853760 = 216853504 + 256 -> END_IP = 12.236.237.0
RIGHT! 216853759 = 216853504 + (256-1) -> END_IP = 12.236.236.255
I have also added a sanity-check that stops the insertion process if the number of lines in the data file is less than 75,0000 115,000. This would prevents the creation of a truncated database if one of the registries does not update their data files or the script is unable to retrieve the data files. The value of 75,000 115,000 appears high, but the data files that the data is extracted from also contain autonomous system and IPV6 data, as well as the IPV4 data that is inserted into the database. Currently, the data file runs to over 90,000 130,000 lines, so the 75,000 115,000 line barrier seems very reasonable to prevent inserting a broken dataset.
Querying the Database
Now that the database is constructed, we can start to run queries against it.
mysql> select ip.code,ip.registry, ip.ip_from, ip.ip_to, co.country -> from ip_map ip, country_code co -> where (ip.code = 'IS') and (ip.code = co.code); +------+----------+------------+------------+---------+ | code | registry | ip_from | ip_to | country | +------+----------+------------+------------+---------+ | IS | ripencc | 1049722880 | 1049731071 | ICELAND | | IS | ripencc | 1359937536 | 1359970303 | ICELAND | | IS | ripencc | 1383088128 | 1383096319 | ICELAND | | IS | ripencc | 1385447424 | 1385455615 | ICELAND | | IS | ripencc | 1390215168 | 1390280703 | ICELAND | | IS | ripencc | 1403846656 | 1403863039 | ICELAND | | IS | ripencc | 1433681920 | 1433690111 | ICELAND | | IS | ripencc | 1439023104 | 1439039487 | ICELAND | | IS | ripencc | 1440481280 | 1440514047 | ICELAND | | IS | ripencc | 2644312064 | 2644377599 | ICELAND | | IS | ripencc | 3238264832 | 3238330367 | ICELAND | | IS | ripencc | 3245150208 | 3245154303 | ICELAND | | IS | ripencc | 3261718528 | 3261726719 | ICELAND | | IS | ripencc | 3264217088 | 3264282623 | ICELAND | | IS | ripencc | 3556884480 | 3556892671 | ICELAND | | IS | ripencc | 3558785024 | 3558793215 | ICELAND | | IS | ripencc | 3565084672 | 3565092863 | ICELAND | | IS | ripencc | 3584524288 | 3584532479 | ICELAND | | IS | ripencc | 3585114112 | 3585122303 | ICELAND | | IS | ripencc | 3585433600 | 3585441791 | ICELAND | | IS | ripencc | 3586023424 | 3586031615 | ICELAND | | IS | ripencc | 3587538944 | 3587547135 | ICELAND | | IS | ripencc | 3587981312 | 3587997695 | ICELAND | | IS | ripencc | 3641278464 | 3641282559 | ICELAND | | IS | ripencc | 3642535936 | 3642540031 | ICELAND | | IS | ripencc | 3650592768 | 3650596863 | ICELAND | | IS | ripencc | 3650596864 | 3650600959 | ICELAND | | IS | ripencc | 3651915776 | 3651919871 | ICELAND | +------+----------+------------+------------+---------+ 28 rows in set (0.09 sec)
So the database structure is sound. It is important to build the file using all four five registries; even though Iceland is now covered by RIPE, older IP allocations and assignments were been handled by both RIPE and ARIN.
Having the registry information helps build in the flexibility to add a WHOIS functionality using this database, something that I have done on for my own IP Tracking and logging tool. This allows for further drilldowns on the data, beyond the scope of this article.
The main item that will be of interest to most Web site administrators is that they can now build dynamic pages using a data source which tracks their visitors’ announced IP address to the country of origin with a high degree of accuracy. This is particulary useful if you are attempting to distribute users to geographically diverse mirror sites. You can also do fun things, such as displaying the flag of the country that the visitor is coming from.
A generic copy of the registry retrieval and database insertion script can be yours just by contacting me.
GrabIP: This week’s IP Registrar Stats
Here are the IP Block and Address allocation by registrar, as well as the number of officially allocated IP Addresses in each country (with 65536 Hosts or more) as of August 17, 2005 00:30 UTC.
This data has been extracted from the GrabIP database
IP REGISTRAR BREAKDOWN
registry NUM_HOSTS NUM_BLOCK -------- ---------- --------- arin 1481414400 36355 ripencc 361039644 18864 apnic 285163264 11402 lacnic 38433536 1534 afrinic 10534912 473
NUMBER OF ALLOCATED IP ADDRESSES BY COUNTRY (65536 Hosts or more)**
COUNTRY COUNTRY_CODE NUM_HOSTS NUM_BLOCKS ------------------------------------------ ------------ ---------- ---------- UNITED STATES US 1313949696 31216 JAPAN JP 140817920 1502 EUROPEAN UNION EU 113810588 4595 CHINA CN 69473536 871 CANADA CA 65890048 4921 GERMANY DE 50366576 1651 FRANCE FR 41834624 1188 UNITED KINGDOM UK 39048024 1630 KOREA, REPUBLIC OF KR 37513472 350 UNITED KINGDOM GB 33562368 25 AUSTRALIA AU 26410240 4743 NETHERLANDS NL 16426144 995 SPAIN ES 16301856 417 ITALY IT 16213856 728 TAIWAN, PROVINCE OF CHINA TW 15157248 322 BRAZIL BR 15075072 261 SWEDEN SE 12033664 526 MEXICO MX 11030528 154 RUSSIAN FEDERATION RU 10389536 1150 SOUTH AFRICA ZA 8184064 251 POLAND PL 7318686 704 FINLAND FI 7100800 260 SWITZERLAND CH 6875584 754 DENMARK DK 6839072 274 HONG KONG HK 6124800 534 INDIA IN 5891072 352 AUSTRIA AT 5667232 383 NORWAY NO 5054240 242 BELGIUM BE 4309760 290 NEW ZEALAND NZ 4274944 1126 TURKEY TR 3664064 167 CHILE CL 3002112 314 ARGENTINA AR 2923520 237 IRELAND IE 2903200 114 THAILAND TH 2853376 295 CZECH REPUBLIC CZ 2804480 186 ROMANIA RO 2764160 274 ISRAEL IL 2726272 118 SINGAPORE SG 2677760 304 PORTUGAL PT 2604864 96 MALAYSIA MY 2569472 191 HUNGARY HU 2220288 131 GREECE GR 1766400 161 VENEZUELA VE 1753088 79 UKRAINE UA 1354458 618 BULGARIA BG 1352192 191 EGYPT EG 1280768 55 INDONESIA ID 1267200 224 COSTA RICA CR 1254144 12 PHILIPPINES PH 1136384 186 SLOVAKIA SK 975872 88 UNITED ARAB EMIRATES AE 964608 14 SAUDI ARABIA SA 902400 79 COLOMBIA CO 864768 102 LATVIA LV 798528 101 LITHUANIA LT 788352 66 IRAN, ISLAMIC REPUBLIC OF IR 778752 52 SLOVENIA SI 667392 100 PERU PE 646656 58 ESTONIA EE 645376 44 VIET NAM VN 625664 38 ICELAND IS 507904 28 PANAMA PA 467712 30 YUGOSLAVIA YU 347392 35 PAKISTAN PK 345600 77 CROATIA HR 338240 32 LUXEMBOURG LU 307968 50 KUWAIT KW 297728 24 CYPRUS CY 274720 58 BOLIVIA BO 263424 20 KAZAKHSTAN KZ 242688 27 ECUADOR EC 242432 125 MALTA MT 240384 31 MOROCCO MA 233728 9 PUERTO RICO PR 218880 29 URUGUAY UY 206592 25 JORDAN JO 191744 29 BANGLADESH BD 178688 52 QATAR QA 176128 6 PALESTINIAN TERRITORY, OCCUPIED PS 173824 17 BRUNEI DARUSSALAM BN 172544 11 SRI LANKA LK 167936 30 OMAN OM 163840 5 NIGERIA NG 159744 24 EL SALVADOR SV 156416 12 GUATEMALA GT 145152 19 MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF MK 143360 11 DOMINICAN REPUBLIC DO 128768 21 MACAO MO 127232 15 BOSNIA AND HERZEGOVINA BA 124928 23 BELARUS BY 119552 16 GEORGIA GE 118784 18 LEBANON LB 112640 16 KENYA KE 108800 23 TUNISIA TN 106752 4 CUBA CU 95232 12 AZERBAIJAN AZ 92160 11 ALGERIA DZ 90880 16 FIJI FJ 88832 10 ARMENIA AM 86016 10 NICARAGUA NI 81152 9 UZBEKISTAN UZ 77824 9 GHANA GH 74496 13 BERMUDA BM 66048 18 KYRGYZSTAN KG 65536 9
I will try and post these weekly.
**Please note that the total number of IP Addresses for the United Kingdom is the sum of totals for the two unique country codes used when registering IP Blocks — UK and GB.
IP-Country Database Information Offline Permanently
I have been getting a great deal of interest in the IP-to-Country data that I have been working on over the last few years. However, I have had to take this data down from my site.
- The bandwidth costs are starting to become noticeable
- My old hardware is starting to creak under the weight
- There are commercial sources for this data, and frankly, I say more power to them if they can make money of this; I sure haven’t
It’s been a depressing week here in the Newest Industry Factory, and we are starting to have to dim the lights in some corners, turn off some of the machines, and [gulp!] consider other drastic measures.
To paraphrase the Bare Naked Ladies: “All this food ain’t free”.
I apologize for the inconvenience, and hope that I can re-activate these pages in the future.
UPDATE: GrabIP is now back up, but limited to 5 requests in a 24-hour period.