Tuesday, December 17, 2013

Wordpress Import Blogs From Production to Development Environment

We have a wordpress blog environment for a school sports portal where every state, league, school and team has their own blog. We have almost 160000 blogs in the system so far and which is growing. Often for development and QA purposes, the development team needs to have one or two blogs in their local environment. At the begining, as a DBA, I had to take backup of the blog’s tables and then import those tables in the QA environment and also create records at wp_blogs table. Besides, we have a sports adminitrative system and in wordpress database we setup some link tables with that system with tables like -
admin_school - related to school info
admin_team - related to team info
admin_league - related to league info
admin_region - related to state info

When I found that, it has become a regular task to import blogs from latest backup of production to development environment, I created a script “import_blog.py to automate the task.
python import_blog.py -p <config_file> blog_1 blog 2 … blog_N

I created the script on python 2.6. It reads a configuration files as below:



hyperdb_factor = 1000

It connects with source Mysql DB Server where we keep our daily backups which consist of one mysqldump file for each blog and one large dump file for common tables like wp_blogs. It finds the latest backup using the function :

def find_backup_dir(ssh):    
 stdin, stdout, stderr = ssh.exec_command('cd /BACKUP; ls')
 cmd_out = stdout.readlines()
 cmd_out= ','.join(cmd_out)
 backup_dirs = re.findall(r'dsdb_\w+', cmd_out)
 backup_dir = backup_dirs[len(backup_dirs)-2]
 backup_dir=source_dbbackupdir + '/' + backup_dir
 #print backup_dir
 return backup_dir

Then it tries to get the given blogs backup files and the common dump files using the function
find_backup_file_for_blogs  which calls find_backup_file_for_blog. Then it makes a tar file of the blog backup files and copy that tar file in target db server using ssh. I used python paramiko library for executing ssh commands.

Then at the target server, the script extracts the copied tar file to “import_blog_dir” config value using function “extract_import_blogs_gz” and modify the url of the backup files contains from production to development using sed tool in the function “make_dumpfiles_for_target_domain”.

def make_dumpfiles_for_target_domain(import_blog_dir,target_domain,source_domain):
 dump_files =  os.listdir(import_blog_dir)
 for f in dump_files:
  cmd = "sed  -i 's/" + source_domain + "/" + target_domain + "/g' " + f
  #print cmd
  cmd = "sed  -i 's/photos." + target_domain + "/photos." + source_domain + "/g' " + f
  #print cmd
  cmd = "sed  -i 's/videos." + target_domain + "/videos." + source_domain + "/g' " + f
  #print cmd
  cmd = "sed  -i 's/media." + target_domain + "/media." + source_domain + "/g' " + f
  #print cmd
  return import_blog_dir

Then it connects to target mysqldb and import those files to the target database and also import the wp_blogs record for that blog.

The script is available at

Continue reading →

pgHA - Postgresql+Pgbouncer High Availability Tools

I was working on a project to setup easy failover system for Postgresql Master-Slave replication where all app servers were connected to the dbserver through pgbouncer. I had to configure such a infrastructure where we could easily promote Slave to master in case of failover.

To achieve this, I found a good tool named pgHA, developed by OpenSCG (https://bitbucket.org/openscg/pgha).

pgHA - working mechanism

pgHA is a perl based tool and it has a file called failoverd.pl which is actually the heart of the tool. Installation procedures described at README file.

pgHA works at the slave server. It reads some configuration parameters from pgha.conf and works as a background process in the server.
When we executes
perl failoverd.pl -c ../cfg/pgha.conf --start
it modifies the app servers pgbouncer config files with pgbouncer_failover_config and starts a background process. In order to modify the pgbouncer.ini files in the app servers, pgHA requires password less access to the app servers from slave db server.

When we executes
perl failoverd.pl -c pgha.conf --pushButton
it creates the trigger file in the slave which allows the slave to act as master and restarts the app servers' pgbouncer so that it reads pgbouncer.ini directing to new master.

pgHA also supports automatic failover.

It also reads some configuration parameters to kill all database connections through pgbouncer to the old master and then resume with the new master db server. These parameters are -

# Which databases should be part of failover
# pgBouncer command to pause / fence the db's
# pgBouncer command to reload the config file
# pgBouncer command to unlock the db's

Plug pgHA with Postgresql 8.4:

pgHA was designed to work with Postgresql Streaming Replication and pgbouncer infrastructure. In my end, we were running with Postgresql 8.4 Log Shipping Replication. As pgHA was executing queries in the Slave server to find out if the slave is alive or not, we were not able to implement it at our site. So I decided to modify the pgHA source a bit to work with Postgresql 8.4 Log Shipping.

In Postgresql 8.4, it was not possible to execute queries on slave as it had no hot_standby feature. So I added a function in the source of pgHA, which checked that if any service was running at the slave with the given port and if it would find yes, then it would read the pg_controldata and looked for 'in archive recoverymode'. With this function, I also added some parameters which were required for the execution of the function.

The function was:

# Sub: CheckPgDatabase
# Inputs:
#       dbport- Slave Postgres db port  
# Output:
#       isDBUp      - (Boolean) 1 if database is up, 0 if it is down
#       Check if Postgresql is running on that port using netstat command. The OS user that is running pgha should have the permission to
#       execute netstat command. If it founds the Postgresql is running on that port, to make sure it parses the value from pg_controldata and
#       check "Database cluster state". The value should be "in archive recovery" for a running Postgresql instance with pg_standby.
sub CheckSlaveStatusInArchiveRecoveryMode
my ( $dbport ) = @_;
  my $isDBUp=1;
  $logger->debug("Attempting connect: ".$connStr);  
  if ($isDBUp==1)
  #Checking if there is any service running on the dbport 
 my $cmd = 'netstat -a | grep '. $dbport;
 my $var = qx{$cmd};
 if ($var)
  #print "\n" . $cmd; 
   #print "\n\t Slave is running" ;
 else #if ($var)
  #print "\n\t Slave is NOT running" ; 
 } #if ($var)
  } #if ($isDBUp==1)
  if ($isDBUp==1)
  # if the above passed, then check what the pg_controldata says
 my $cmd = $v_slavePgengine . '/pg_controldata ' .  $v_slavePgdata . ' | grep "Database cluster state"';
 my $var = qx{$cmd};
 if ($var =~ /in archive recovery/)
   print "\n\t Database is in archive recovery mode\n";
  print "\n\t ". $cmd;
  print "\n\t Database in down\n";
  } #if ($isDBUp==1)
  return $isDBUp;
} #sub CheckSlaveStatusInArchiveRecoveryMode

The pgHA code was written so self-descriptive that I was able to push my desired function in the code. I communicated with the pgHA developer and he really liked the solution and ready to push it in his code. That was really inspirational for me.


Continue reading →
Tuesday, December 3, 2013

Postgresql online space reclaim using pg_repack

Working with the Postgresql databases, often we hear the term table bloat and index bloat. Postgresql uses MVCC technology, for which deleted or updated rows remain in the table and it creates index and table bloat. Earlier, we used to work with VACUUM FULL or CLUSTER operation, in order to, reclaim the unused space of a table. VACUUM FULL and CLUSTER both locks the table so that users cannot make any changes on it.
So in 24X7 production system, it was difficult to to execute those operation. Searching a bit, I found the Postgresql extension project pg_repack.

What is pg_repack:

Accoring to https://github.com/reorg/pg_repack,
pg_repack is a PostgreSQL extension which lets you remove bloat from tables and indexes, and optionally restore the physical order of clustered indexes. Unlike CLUSTER and VACUUM FULL it works online, without holding an exclusive lock on the processed tables during processing. pg_repack is efficient to boot, with performance comparable to using CLUSTER directly.
pg_repack 1.1.8 was released as a drop-in replacement for pg_reorg, addressing some of the shortcomings of the last pg_reorg version (such as support for PostgreSQL 9.2 and EXTENSION packaging) and known bugs.

Install pg_repack:

Download pg_repack from   http://pgxn.org/dist/pg_repack/.
Installation of pg_repack quite easy and simple. pg_repack looks for pg_config, so make sure pg_config is available.

cp -R pg_repack /source/postgresql-9.2.4/contrib/pg_repack
cd pg_repack
make install

Now at psql client

psql# select * from pg_available_extensions where name = 'pg_repack';
name              | pg_repack
default_version   | 1.1.8
installed_version | 1.1.8
comment           | Reorganize tables in PostgreSQL databases with minimal locks

pg_repack creates a schema "repack" in the database;

Tables in repack:

\dv repack.*
            List of relations
 Schema |     Name     | Type |  Owner
 repack | primary_keys | view | postgres
 repack | tables       | view | postgres
(2 rows)

Functions in repack:

\df repack.*
 Schema |           Name            |  Type
 repack | array_accum                | agg
 repack | conflicted_triggers        | normal
 repack | disable_autovacuum         | normal
 repack | get_assign                 | normal
 repack | get_columns_for_create_as  | normal
 repack | get_compare_pkey           | normal
 repack | get_create_index_type      | normal
 repack | get_create_trigger         | normal
 repack | get_drop_columns           | normal
 repack | get_enable_trigger         | normal
 repack | get_index_columns          | normal
 repack | get_index_keys             | normal
 repack | oid2text                   | normal
 repack | repack_apply               | normal
 repack | repack_drop                | normal
 repack | repack_indexdef            | normal
 repack | repack_swap                | normal
 repack | repack_trigger             | trigger
 repack | version                    | normal
 repack | version_sql                | normal
(20 rows)                                                                                                                                                 

Example with a dummy table:

Create a table:


Populate the table:

INSERT INTO t1 SELECT c , 'a'||c FROM generate_series(1,10000000) as c;
INSERT 0 10000000

Check table size: 

 select pg_size_pretty(pg_table_size('t1')); pg_size_pretty
 539 MB
(1 row)

Delete records from table:

DELETE FROM t1 WHERE i < 100000;
DELETE 99999
SELECT pg_size_pretty(pg_table_size('t1'));
 539 MB
(1 row)
DELETE FROM t1 WHERE i < 900000;
DELETE 400000
select pg_size_pretty(pg_table_size('t1'));
 539 MB
(1 row)

Update some records:
Update t1 SET nam = 'c'||i WHERE i < 100000; 

Check table size: 

SELECT pg_size_pretty(pg_table_size('t1'));
 828 MB
(1 row)

Execute pg_repack command:

pg_repack -d postgres -t t1 --no-order

Online update during repack:

update t1 set nam = 'c'||i where i = 1000000;

Check table size after repack:

select pg_size_pretty(pg_table_size('t1'));
490 MB
(1 row)

Moitoring locks:

Using the following query, we monitored the lock during repack

 SELECT locktype, mode, relation, c.relname
 FROM pg_locks as l
 INNER JOIN pg_class as c
   ON l.relation = c.oid;

 locktype |       mode       | relation |          relname

 relation | AccessShareLock  |    33608 | t1_pkey
 relation | AccessShareLock  |    33605 | t1
 relation | RowExclusiveLock |    33641 | log_33605_pkey
 relation | RowExclusiveLock |    33634 | log_33605
 relation | AccessShareLock  |    11069 | pg_locks
 relation | AccessShareLock  |     2663 | pg_class_relname_nsp_index
 relation | AccessShareLock  |     2662 | pg_class_oid_index
 relation | AccessShareLock  |     1259 | pg_class
 relation | AccessShareLock  |    11069 | pg_locks
 relation | SIReadLock       |    33605 | t1
 relation | SIReadLock       |    33634 | log_33605
(11 rows)

After repack operation, locks were -
SELECT locktype, mode, relation, c.relname
FROM pg_locks as l
INNER JOIN pg_class as c
  ON l.relation = c.oid;

 locktype |      mode       | relation |          relname

 relation | AccessShareLock |     2663 | pg_class_relname_nsp_index
 relation | AccessShareLock |     2662 | pg_class_oid_index
 relation | AccessShareLock |     1259 | pg_class
 relation | AccessShareLock |    11069 | pg_locks
(4 rows)

Internal procedures:

pg_repack follows the steps in order to reclaim the unused space with minimal locking

1. Setup workspaces and a trigger.                                                                                      
2. Copy tuples into temp table.                                                                                        
3. Create indexes on temp table.                                                                                        
4. Apply log to temp table until no tuples are left in the log and all of the old transactions are finished.                 When All old transactions are finished, go to next step.
5. Swap - original table with new temp table                                                                                                 6. Drop - temporary tables and triggers and functions
7. Analyze - the new table                                                                                                            

Triggers on table:

pg_repack operation creates trigger on the specified table:

\d t1
      Table "public.t1"
 Column |  Type   | Modifiers
 i      | integer | not null
 nam    | bytea   |
    "t1_pkey" PRIMARY KEY, btree (i)
Triggers firing always:
    z_repack_trigger BEFORE INSERT OR DELETE OR UPDATE ON t1 FOR EACH ROW EXECUTE PROCEDURE repack.repack_trigger('INSERT INTO repack.log_33648(pk, row) VALUES( CASE WHEN $1 IS NULL THEN NULL ELSE (ROW($1.i)::repack.pk_33648) END, $2)')



Continue reading →