Monday, November 12, 2012

Collect MySQL table statistics for millions of tables on a WordPress site

I was administering the database for a portal that hosted millions of blogs. The site was built on WordPress, so we had a MySQL database with the WordPress HyperDB setup. As each blog consists of 9 separate tables, we had a huge number of tables in the database. This huge number of tables made database maintenance very difficult; for instance, a single query against INFORMATION_SCHEMA to get table status would not complete in any reasonable time at that scale. However, we found that we could run SHOW TABLE STATUS in each database. As we had almost 150 databases, it was not practical to execute SHOW TABLE STATUS manually in each one and collect the results, so we planned to create a script for the task.
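For example, against a single blog database (the name blog_db_001 here is just a hypothetical placeholder), the per-database command looks like this; the WHERE clause skips views, exactly as the script below does:

SHOW TABLE STATUS FROM blog_db_001 WHERE COMMENT NOT LIKE 'view%';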
The script would traverse all the databases, execute SHOW TABLE STATUS in each one, and store the results in a table named table_stats.table_stat.
Here, we created a new schema, table_stats, to store all the statistics-related information.
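Creating the schema itself is a one-liner (shown for completeness; the exact options are not critical):

CREATE DATABASE IF NOT EXISTS table_stats;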
CREATE TABLE table_stats.`tmp_table_stat` (
`TABLE_NAME` varchar(64) NOT NULL DEFAULT '',
`ENGINE` varchar(64) DEFAULT NULL,
`VERSION` bigint(21) unsigned DEFAULT NULL,
`ROW_FORMAT` varchar(10) DEFAULT NULL,
`TABLE_ROWS` bigint(21) unsigned DEFAULT NULL,
`AVG_ROW_LENGTH` bigint(21) unsigned DEFAULT NULL,
`DATA_LENGTH` bigint(21) unsigned DEFAULT NULL,
`MAX_DATA_LENGTH` bigint(21) unsigned DEFAULT NULL,
`INDEX_LENGTH` bigint(21) unsigned DEFAULT NULL,
`DATA_FREE` bigint(21) unsigned DEFAULT NULL,
`AUTO_INCREMENT` bigint(21) unsigned DEFAULT NULL,
`CREATE_TIME` datetime DEFAULT NULL,
`UPDATE_TIME` datetime DEFAULT NULL,
`CHECK_TIME` datetime DEFAULT NULL,
`TABLE_COLLATION` varchar(32) DEFAULT NULL,
`CHECKSUM` bigint(21) unsigned DEFAULT NULL,
`CREATE_OPTIONS` varchar(255) DEFAULT NULL,
`TABLE_COMMENT` varchar(80) NOT NULL DEFAULT ''
);


CREATE TABLE table_stats.`table_stat` (
`TABLE_SCHEMA` varchar(64) NOT NULL DEFAULT '',
`STAT_COLLECTION_TIME` datetime DEFAULT NULL,
`TABLE_NAME` varchar(64) NOT NULL DEFAULT '',
`ENGINE` varchar(64) DEFAULT NULL,
`VERSION` bigint(21) unsigned DEFAULT NULL,
`ROW_FORMAT` varchar(10) DEFAULT NULL,
`TABLE_ROWS` bigint(21) unsigned DEFAULT NULL,
`AVG_ROW_LENGTH` bigint(21) unsigned DEFAULT NULL,
`DATA_LENGTH` bigint(21) unsigned DEFAULT NULL,
`MAX_DATA_LENGTH` bigint(21) unsigned DEFAULT NULL,
`INDEX_LENGTH` bigint(21) unsigned DEFAULT NULL,
`DATA_FREE` bigint(21) unsigned DEFAULT NULL,
`AUTO_INCREMENT` bigint(21) unsigned DEFAULT NULL,
`CREATE_TIME` datetime DEFAULT NULL,
`UPDATE_TIME` datetime DEFAULT NULL,
`CHECK_TIME` datetime DEFAULT NULL,
`TABLE_COLLATION` varchar(32) DEFAULT NULL,
`CHECKSUM` bigint(21) unsigned DEFAULT NULL,
`CREATE_OPTIONS` varchar(255) DEFAULT NULL,
`TABLE_COMMENT` varchar(80) NOT NULL DEFAULT ''
);

The first table, `tmp_table_stat`, holds the statistics records temporarily. The script then pushes the results into `table_stat`, adding the TABLE_SCHEMA and STAT_COLLECTION_TIME values explicitly (which is why STAT_COLLECTION_TIME needs no default in the table definition; note that in MySQL 5.x a DATETIME column cannot default to NOW() anyway).
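The script also records its progress in a command_log table in the same schema. Its definition didn't make it into this post; a minimal sketch that matches the INSERT statements in the script would be:

CREATE TABLE table_stats.`command_log` (
`execute_time` datetime DEFAULT NULL,
`command` varchar(64) DEFAULT NULL,
`object` varchar(255) DEFAULT NULL
);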
All these tasks were done by a Python script:

#!/usr/bin/python -tt
import os
import MySQLdb as mdb

db_server = 'localhost'
db_port = '*****'
db_user = '*****'
db_password = '******'
db_name = 'table_stats'

def save_table_stat():
    v_file = "/tmp/status.data"

    # Connect to the stats schema on the local server.
    mysql_con = mdb.connect(host=db_server, user=db_user, passwd=db_password,
                            db='table_stats', port=int(db_port))
    cur = mysql_con.cursor()

    # Collect the database names to process. LIKE 'test' was our test
    # pattern; in production the pattern matched all the blog databases.
    cur.execute("SHOW DATABASES LIKE 'test'")
    rows = cur.fetchall()

    for r in rows:
        db_name = str(r[0])

        # Empty the staging table before dumping the next database.
        cur.execute("TRUNCATE TABLE table_stats.tmp_table_stat")

        qry = ("INSERT INTO command_log (execute_time, command, object) "
               "SELECT now() as execute_time, 'Dumping stats' as command, "
               "'Stats DB Name:: " + db_name + "'")
        print qry
        cur.execute(qry)

        # Dump SHOW TABLE STATUS (skipping views) as tab-separated rows.
        cmd = ('mysql -h ' + db_server + ' -P' + db_port + ' -u' + db_user +
               ' -p' + db_password + ' ' + db_name +
               ' -e "SHOW TABLE STATUS WHERE COMMENT NOT LIKE \'view%\' " -B -N > ' + v_file)
        os.system(cmd)
        os.system('chmod 777 ' + v_file)  # the server must be able to read the dump
        print "Done"

        qry = ("INSERT INTO command_log (execute_time, command, object) "
               "SELECT now() as execute_time, 'Loading data' as command, "
               "'Stats DB Name:: " + db_name + "'")
        cur.execute(qry)

        # Bulk-load the dump file into the staging table.
        qry = ("LOAD DATA INFILE '" + v_file + "' INTO TABLE table_stats.tmp_table_stat "
               "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'")
        cur.execute(qry)

        qry = ("INSERT INTO command_log (execute_time, command, object) "
               "SELECT now() as execute_time, 'Permanent store' as command, "
               "'Stats DB Name:: " + db_name + "'")
        cur.execute(qry)

        # Stamp the staged rows with the schema name and collection time
        # and copy them into the permanent table.
        qry = ("INSERT INTO table_stats.table_stat "
               "SELECT '" + db_name + "' as TABLE_SCHEMA, "
               "now() as STAT_COLLECTION_TIME, t.* "
               "FROM table_stats.tmp_table_stat as t")
        cur.execute(qry)

        qry = ("INSERT INTO command_log (execute_time, command, object) "
               "SELECT now() as execute_time, 'Completed', "
               "'Stats DB Name:: " + db_name + "'")
        cur.execute(qry)

    mysql_con.commit()
    cur.close()
    mysql_con.close()

def main():
    save_table_stat()

if __name__ == '__main__':
    main()


The script first collects the database names into a tuple, then, for each schema, dumps the output of SHOW TABLE STATUS to a file, loads that file into the staging table, and finally copies the rows into the permanent statistics table. Note that LOAD DATA INFILE (without LOCAL) reads the file on the database server itself, which is why the script targets localhost and makes the dump file world-readable.

Our plan was to collect all the statistics data and then analyze it to identify unused blogs and tables for removal.
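For example, a query along these lines could flag candidates (the six-month cutoff is only an illustration, and UPDATE_TIME can be NULL for some storage engines):

SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, UPDATE_TIME
FROM table_stats.table_stat
WHERE TABLE_ROWS = 0
   OR UPDATE_TIME < NOW() - INTERVAL 6 MONTH
ORDER BY TABLE_SCHEMA, TABLE_NAME;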



