Wednesday, December 26, 2012

Rank Functions

RANK(), DENSE_RANK() and ROW_NUMBER() are three important reporting functions in PostgreSQL. They are also available in Oracle.

RANK
Returns a unique serial number for each row, except for rows that have duplicate values, in which case the same ranking is assigned and a gap appears in the sequence after each group of duplicates.

DENSE_RANK
Returns a unique serial number for each row, except for rows that have duplicate values, in which case the same ranking is assigned and no gap appears in the sequence.

ROW_NUMBER
Returns a unique serial number for each row. For rows that have duplicate values, the numbers are assigned arbitrarily.

Suppose we have a table t1 like the one below.
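(The post doesn't show the DDL for t1; here is a minimal sketch, assuming plain integer columns, that reproduces the data:)

CREATE TABLE t1 (id int PRIMARY KEY, points int);
INSERT INTO t1 (id, points) VALUES
(1,19),(2,86),(3,24),(4,5),(5,56),(6,48),(7,55),(8,74),(9,69),(10,34),
(11,96),(12,32),(13,35),(14,92),(15,18),(16,85),(17,25),(18,88),(19,59),(20,78);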

SELECT * FROM t1;

 id | points
----+--------
  1 | 19
  2 | 86
  3 | 24
  4 |  5
  5 | 56
  6 | 48
  7 | 55
  8 | 74
  9 | 69
 10 | 34
 11 | 96
 12 | 32
 13 | 35
 14 | 92
 15 | 18
 16 | 85
 17 | 25
 18 | 88
 19 | 59
 20 | 78
(20 rows)

There is no duplicate value in the "points" column, so the following query returns the same result for rank(), dense_rank() and row_number():

SELECT *,
RANK() OVER( ORDER BY points DESC ) rnk,
DENSE_RANK() OVER(ORDER BY points DESC ) drank,
ROW_NUMBER() OVER( ORDER BY points DESC ) rnum
FROM t1;

 id | points | rnk | drank | rnum
----+--------+-----+-------+------
 11 | 96 |   1 |     1 |    1
 14 | 92 |   2 |     2 |    2
 18 | 88 |   3 |     3 |    3
  2 | 86 |   4 |     4 |    4
 16 | 85 |   5 |     5 |    5
 20 | 78 |   6 |     6 |    6
  8 | 74 |   7 |     7 |    7
  9 | 69 |   8 |     8 |    8
 19 | 59 |   9 |     9 |    9
  5 | 56 |  10 |    10 |   10
  7 | 55 |  11 |    11 |   11
  6 | 48 |  12 |    12 |   12
 13 | 35 |  13 |    13 |   13
 10 | 34 |  14 |    14 |   14
 12 | 32 |  15 |    15 |   15
 17 | 25 |  16 |    16 |   16
  3 | 24 |  17 |    17 |   17
  1 | 19 |  18 |    18 |   18
 15 | 18 |  19 |    19 |   19
  4 |  5 |  20 |    20 |   20

Now let's remove a row from the table and execute the query again to see whether the result changes:

BEGIN;
DELETE from t1 where id=8;

SELECT *,
RANK() OVER( ORDER BY points DESC ) rnk,
DENSE_RANK() OVER(ORDER BY points DESC ) drank,
ROW_NUMBER() OVER( ORDER BY points DESC ) rnum
FROM t1;

 id | points | rnk | drank | rnum
----+--------+-----+-------+------
 11 | 96 |   1 |     1 |    1
 14 | 92 |   2 |     2 |    2
 18 | 88 |   3 |     3 |    3
  2 | 86 |   4 |     4 |    4
 16 | 85 |   5 |     5 |    5
 20 | 78 |   6 |     6 |    6
  9 | 69 |   7 |     7 |    7
 19 | 59 |   8 |     8 |    8
  5 | 56 |   9 |     9 |    9
  7 | 55 |  10 |    10 |   10
  6 | 48 |  11 |    11 |   11
 13 | 35 |  12 |    12 |   12
 10 | 34 |  13 |    13 |   13
 12 | 32 |  14 |    14 |   14
 17 | 25 |  15 |    15 |   15
  3 | 24 |  16 |    16 |   16
  1 | 19 |  17 |    17 |   17
 15 | 18 |  18 |    18 |   18
  4 |  5 |  19 |    19 |   19
(19 rows)

ROLLBACK;

Again, there is no difference among the results of rank(), dense_rank() and row_number().

Now I inserted a new record into the table to create a duplicate value in the "points" column.

INSERT INTO t1 VALUES (21, 78);
Now we have two records with 78 points. Running the same query gives the following result:

 id | points | rnk | drank | rnum
----+--------+-----+-------+------
 11 | 96 |   1 |     1 |    1
 14 | 92 |   2 |     2 |    2
 18 | 88 |   3 |     3 |    3
  2 | 86 |   4 |     4 |    4
 16 | 85 |   5 |     5 |    5
 21 | 78 |   6 |     6 |    6
 20 | 78 |   6 |     6 |    7
  8 | 74 |   8 |     7 |    8
  9 | 69 |   9 |     8 |    9
 19 | 59 |  10 |     9 |   10
  5 | 56 |  11 |    10 |   11
  7 | 55 |  12 |    11 |   12
  6 | 48 |  13 |    12 |   13
 13 | 35 |  14 |    13 |   14
 10 | 34 |  15 |    14 |   15
 12 | 32 |  16 |    15 |   16
 17 | 25 |  17 |    16 |   17
  3 | 24 |  18 |    17 |   18
  1 | 19 |  19 |    18 |   19
 15 | 18 |  20 |    19 |   20
  4 |  5 |  21 |    20 |   21
(21 rows)

The difference is obvious.
For the rank() function, in the "rnk" column we get rank 6 for both ID 21 and ID 20; the next rank is 8 and the last rank is 21. Rank 7 is skipped because rank 6 covers two records.
For the dense_rank() function, in the "drank" column we get rank 6 for both ID 21 and ID 20; the next rank is 7 and the last rank is 20. No gap is created.
For the row_number() function, in the "rnum" column we get two different numbers for the same "points" value, since it simply numbers the rows instead of ranking them.
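Note that when values tie, the order in which row_number() numbers them is not guaranteed. If deterministic numbering is needed, a tie-breaking column can be added to the ORDER BY (a sketch, not part of the original query):

SELECT *,
ROW_NUMBER() OVER( ORDER BY points DESC, id ) rnum
FROM t1;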

Tuesday, December 18, 2012

Remove log files using Python in Windows Server


I was working on Windows Server 2003, where one of our application services created a large log file every day; once a log file grew beyond 2 GB, the application rotated to a new log file. I had to create a script to remove the old, unneeded log files.

I wrote the script for a Python 2.7 environment and added it to Scheduled Tasks, and it has been working fine since.

#!/usr/bin/python -tt

import os

def main():
  filelist = []
  path = r'C:\test\log'
  if os.path.exists(path):
    for f in os.listdir(path):
      if f.endswith('.log'):
        # remember (last-modified time, file name) pairs so we can sort by age
        lastmod_date = os.path.getmtime(os.path.join(path, f))
        filelist.append((lastmod_date, f))
  if len(filelist) > 1:  # only clean up when there is more than one log file
    filelist.sort()      # sort by lastmod_date, oldest first
    filelist.pop()       # drop the newest file from the list; it is the current log
    for lastmod_date, fname in filelist:
      os.remove(os.path.join(path, fname))

if __name__ == '__main__':
  main()
 
Friday, December 14, 2012

pg_restore error with invalid input syntax for integer: "U" and out of memory

I was trying to restore a 6 GB database on my development machine, which was running CentOS 5.6 with 1 GB of memory.
The relevant postgresql.conf settings were:

shared_buffers = 128MB
maintenance_work_mem = 300MB
checkpoint_segments = 10  # as the disk space is limited
fsync = off
autocommit = off

During restoration I got errors as follows:

LOG:  checkpoints are occurring too frequently (22 seconds apart)
HINT:  Consider increasing the configuration parameter "checkpoint_segments".
pg_restore: out of memory
pg_restore: finished item 8570 TABLE DATA entity
pg_restore: [archiver] worker process failed: exit code 1
ERROR:  invalid input syntax for integer: "U"
CONTEXT:  COPY entity, line 2120568, column version: "U"
STATEMENT:  COPY entity (crmid, smcreatorid, smownerid, modifiedby, setype, description, createdtime, modifiedtime, viewedtime, status, version, presence, deleted, owner_type) FROM stdin;

LOG:  could not send data to client: Broken pipe
STATEMENT:  COPY entity (crmid, smcreatorid, smownerid, modifiedby, setype, description, createdtime, modifiedtime, viewedtime, status, version, presence, deleted, owner_type) FROM stdin;
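The checkpoint warning itself wasn't fatal, but its HINT can be addressed by raising checkpoint_segments for the duration of the restore, for example (a sketch; the right value depends on how much disk space is available for WAL):

checkpoint_segments = 30  # allow more WAL between checkpoints during the bulk load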

When the error occurred, the OS memory status was:
free -m
             total       used       free     shared    buffers     cached
Mem:          1024        975         48          0          3        857
-/+ buffers/cache:        114        909
Swap:         1027          0       1027


I posted in forums, discussed with experts, and lowered shared_buffers and maintenance_work_mem. But I was still getting the error.

Then I discarded this backup copy, made a fresh backup, and used that instead. This time I was able to restore the database successfully.

Later I came to know that this development server had suffered an I/O error a few days earlier during a restoration from this same backup, and I think that is what damaged the backup copy.
Monday, November 12, 2012

Collect MySQL table statistics for millions of tables on a WordPress site

I was administering the database for a portal where we had millions of blogs. The site was built on WordPress, so we had a MySQL database with the WordPress HyperDB setup. As each blog consists of 9 separate tables, we had a huge number of tables in the database. This huge number of tables created a great problem for database maintenance: we couldn't run a single query against INFORMATION_SCHEMA to get table status. However, we found that we could use SHOW TABLE STATUS in each database. As we had almost 150 databases, it was not practical to execute SHOW TABLE STATUS by hand and collect the data for each one, so we planned to create a script for the task.
The script would traverse all the databases, execute SHOW TABLE STATUS, and store the results in a table named table_stats.table_stat.
We created a new schema, table_stats, to store all the statistics-related information.
CREATE TABLE table_stats.`tmp_table_stat` (
`TABLE_NAME` varchar(64) NOT NULL DEFAULT '',
`ENGINE` varchar(64) DEFAULT NULL,
`VERSION` bigint(21) unsigned DEFAULT NULL,
`ROW_FORMAT` varchar(10) DEFAULT NULL,
`TABLE_ROWS` bigint(21) unsigned DEFAULT NULL,
`AVG_ROW_LENGTH` bigint(21) unsigned DEFAULT NULL,
`DATA_LENGTH` bigint(21) unsigned DEFAULT NULL,
`MAX_DATA_LENGTH` bigint(21) unsigned DEFAULT NULL,
`INDEX_LENGTH` bigint(21) unsigned DEFAULT NULL,
`DATA_FREE` bigint(21) unsigned DEFAULT NULL,
`AUTO_INCREMENT` bigint(21) unsigned DEFAULT NULL,
`CREATE_TIME` datetime DEFAULT NULL,
`UPDATE_TIME` datetime DEFAULT NULL,
`CHECK_TIME` datetime DEFAULT NULL,
`TABLE_COLLATION` varchar(32) DEFAULT NULL,
`CHECKSUM` bigint(21) unsigned DEFAULT NULL,
`CREATE_OPTIONS` varchar(255) DEFAULT NULL,
`TABLE_COMMENT` varchar(80) NOT NULL DEFAULT ''
);


CREATE TABLE table_stats.`table_stat` (
`TABLE_SCHEMA` varchar(64) NOT NULL DEFAULT '',
`STAT_COLLECTION_TIME` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`TABLE_NAME` varchar(64) NOT NULL DEFAULT '',
`ENGINE` varchar(64) DEFAULT NULL,
`VERSION` bigint(21) unsigned DEFAULT NULL,
`ROW_FORMAT` varchar(10) DEFAULT NULL,
`TABLE_ROWS` bigint(21) unsigned DEFAULT NULL,
`AVG_ROW_LENGTH` bigint(21) unsigned DEFAULT NULL,
`DATA_LENGTH` bigint(21) unsigned DEFAULT NULL,
`MAX_DATA_LENGTH` bigint(21) unsigned DEFAULT NULL,
`INDEX_LENGTH` bigint(21) unsigned DEFAULT NULL,
`DATA_FREE` bigint(21) unsigned DEFAULT NULL,
`AUTO_INCREMENT` bigint(21) unsigned DEFAULT NULL,
`CREATE_TIME` datetime DEFAULT NULL,
`UPDATE_TIME` datetime DEFAULT NULL,
`CHECK_TIME` datetime DEFAULT NULL,
`TABLE_COLLATION` varchar(32) DEFAULT NULL,
`CHECKSUM` bigint(21) unsigned DEFAULT NULL,
`CREATE_OPTIONS` varchar(255) DEFAULT NULL,
`TABLE_COMMENT` varchar(80) NOT NULL DEFAULT ''
);

The first table, `tmp_table_stat`, holds the statistics records temporarily. The script then pushes the results into `table_stat` together with the TABLE_SCHEMA and STAT_COLLECTION_TIME values.
All these tasks were done by a Python script:

#!/usr/bin/python -tt

import os
import MySQLdb as mdb

db_server = 'localhost'
db_port = '*****'
db_user = '*****'
db_password = '******'

def save_table_stat():
    v_file = "/tmp/status.data"

    mysql_con = mdb.connect(host=db_server, user=db_user, passwd=db_password,
                            db="table_stats", port=int(db_port))
    cur = mysql_con.cursor()
    cur.execute("SHOW DATABASES")  # collect every database name
    rows = cur.fetchall()

    for r in rows:
        # start with an empty staging table for this schema
        cur.execute("TRUNCATE TABLE table_stats.tmp_table_stat")
        db_name = str(r[0])

        qry = ("INSERT INTO command_log (execute_time, command, object) "
               "SELECT now(), 'Dumping stats', 'Stats DB Name:: " + db_name + "'")
        cur.execute(qry)

        # dump SHOW TABLE STATUS (excluding views) for this schema into a flat file
        cmd = ('mysql -h ' + db_server + ' -P' + db_port + ' -u' + db_user +
               ' ' + db_name +
               ' -e "SHOW TABLE STATUS WHERE COMMENT NOT LIKE \'view%\'" -B -N > ' + v_file)
        os.system(cmd)
        os.system('chmod 777 ' + v_file)  # make the file readable by mysqld for LOAD DATA

        qry = ("INSERT INTO command_log (execute_time, command, object) "
               "SELECT now(), 'Loading data', 'Stats DB Name:: " + db_name + "'")
        cur.execute(qry)
        qry = ("LOAD DATA INFILE '" + v_file + "' INTO TABLE table_stats.tmp_table_stat "
               "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'")
        cur.execute(qry)

        qry = ("INSERT INTO command_log (execute_time, command, object) "
               "SELECT now(), 'Permanent store', 'Stats DB Name:: " + db_name + "'")
        cur.execute(qry)
        # copy the staged rows into the permanent table, stamped with schema and time
        qry = ("INSERT INTO table_stats.table_stat "
               "SELECT '" + db_name + "' AS table_schema, now() AS stat_collection_time, t.* "
               "FROM table_stats.tmp_table_stat AS t")
        cur.execute(qry)

        qry = ("INSERT INTO command_log (execute_time, command, object) "
               "SELECT now(), 'Completed', 'Stats DB Name:: " + db_name + "'")
        cur.execute(qry)

    mysql_con.commit()
    cur.close()

def main():
    save_table_stat()

if __name__ == '__main__':
    main()


The script first collects all the database names, then connects to each schema and executes SHOW TABLE STATUS. The result is written to a file, loaded from that file into the tmp table, and finally copied into the permanent statistics table.

Our plan was to collect all the statistics data and then analyze it to identify unused blogs and tables that could be removed.
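For example, a query along these lines (a sketch; the one-year cutoff is an assumed threshold, and it relies on the storage engine maintaining UPDATE_TIME, as MyISAM does) could flag tables that haven't been written to in a long time:

SELECT TABLE_SCHEMA, TABLE_NAME, UPDATE_TIME
FROM table_stats.table_stat
WHERE UPDATE_TIME < NOW() - INTERVAL 1 YEAR
ORDER BY UPDATE_TIME;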




Tuesday, October 16, 2012

Getting Used to PostgreSQL Timestamp Conversion


I needed to get familiar with PostgreSQL time zone conversion. The developers needed a query where they store a timestamp without time zone,
but when they match this timestamp against user-supplied values, the database should convert both timestamp values to one time zone.
Time zone conversion in PostgreSQL is described clearly in the documentation, section 9.9.3:
http://www.postgresql.org/docs/9.1/static/functions-datetime.html#FUNCTIONS-DATETIME-ZONECONVERT

I went through it and tested it as follows.

I have a table as follows:

create table tmptab(id int, t timestamp);

insert into tmptab
select c, '2012-10-15 16:29:00'::timestamp without time zone as t
from generate_series(1,100) as c;

The system time zone is US/Eastern, which I got from the timezone system variable.
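For reference, it can be checked like this (output as reported on my machine):

SHOW timezone;

 TimeZone
------------
 US/Eastern
(1 row)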

select id, t,
(t::timestamp with time zone at time zone 'BDT') as from_est_to_bdt ,
(t::timestamp without time zone at time zone 'BDT') as from_bdt_to_est
from tmptab;

In the above query there are two columns - from_est_to_bdt and from_bdt_to_est

from_est_to_bdt follows Table 9-28 of the docs:
Expression:  timestamp with time zone AT TIME ZONE zone
Return type: timestamp without time zone
Description: Convert given time stamp with time zone to the new time zone, with no time zone designation
Here the value of t is treated as being in EST (our default time zone) and then converted to BDT.

from_bdt_to_est follows Table 9-28 of the docs:
Expression:  timestamp without time zone AT TIME ZONE zone
Return type: timestamp with time zone
Description: Treat given time stamp without time zone as located in the specified time zone
Here the value of t is treated as being in BDT and then converted to EST, our default time zone.
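Both directions can be sanity-checked with a single literal (assuming the session time zone is US/Eastern, as above):

SELECT
'2012-10-15 16:29:00'::timestamp with time zone AT TIME ZONE 'BDT' AS est_to_bdt,
'2012-10-15 16:29:00'::timestamp without time zone AT TIME ZONE 'BDT' AS bdt_to_est;

The first column comes back as a plain timestamp giving the BDT wall-clock time; the second comes back as a timestamp with time zone displayed in the session's EST zone.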

So, in our application, where a user gives us a date from a different location and we need to match that date against timestamp without time zone values, I used the following query:
select *
from tmptab
where (t::timestamp with time zone at time zone 'BDT')::date = '2012-10-17';
And it worked for me.

BTW, I got the time zone names from pg_timezone_names:
select * from pg_timezone_names;

Wednesday, July 25, 2012

MySQL function to check whether a value is numeric

In MySQL 5.1 I needed a function that would return true if a column value was numeric and false if it was not.
I searched the internet and found this link:
http://stackoverflow.com/questions/2756736/how-to-only-select-numeric-data-from-mysql

It gave me a good idea of how to do it. Then I wrote my own as follows:

CREATE FUNCTION is_numeric (sIn text) RETURNS tinyint
DETERMINISTIC
NO SQL
   RETURN sIn REGEXP '^-*[[:digit:]]+(\\.[[:digit:]]+)?$' ;
 
Description:

  • ^ = match at the beginning of the string
  • $ = match at the end of the string
  • [:digit:] = the digit character class
  • ^[[:digit:]]+$ = only digit characters from the beginning to the end
  • -* = 0 or more occurrences of - (minus sign); a stricter pattern would be -? to allow at most one sign
  • (\\.[[:digit:]]+)? = if a . (dot) appears, it must be followed by 1 or more digits, and the trailing ? makes the whole group optional (0 or 1 occurrence). The dot needs the double backslash because MySQL string literals consume one level of escaping; a single \. would reach the regexp engine as a bare dot and match any character.
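A quick usage check (with made-up sample values):

SELECT is_numeric('123') AS a, is_numeric('-12.5') AS b, is_numeric('12a') AS c;
-- returns 1, 1, 0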




  
Tuesday, July 24, 2012

MySQL query tuning for a WordPress site



Yesterday one of the developers on my team came to me with the following query, which summarizes user ratings on the homepage of our WordPress site.
The query was taking 22 seconds, which is not tolerable for something that runs on every home page load.

The basic query was as follows:

SELECT
userids.user_id,
u.user_login,
metaFirst.meta_value AS first_name,
metaPrimarySchool.meta_value AS primary_school_blog_id,
b.alias AS primary_school_blog_alias,
SUM(cp.measures) AS measures
FROM users AS u
INNER JOIN (
SELECT DISTINCT user_id FROM usermeta WHERE meta_key LIKE '%_capabilities'
AND
(
meta_value LIKE '%abc%'
)
) AS userids ON (u.ID = userids.user_id AND u.ID<>1)
INNER JOIN usermeta metaFirst ON (metaFirst.user_id=userids.user_id AND metaFirst.meta_key = 'first_name')
INNER JOIN usermeta metaPrimarySchool ON (metaPrimarySchool.user_id=userids.user_id AND metaPrimarySchool.meta_key = 'primary_school_blog_id')
INNER JOIN blogs b ON (b.blog_id = metaPrimarySchool.meta_value)
INNER JOIN cp cp ON (cp.uid=userids.user_id AND cp.timestamp>=1341100800)
GROUP BY cp.uid
ORDER BY measures DESC
limit 5;

We had to tune the query.

We took the following steps:

Created an index on the timestamp column of the cp table:
create index cp_timestamp on cp(timestamp);

Created a 2 GB hot cache and cached the most frequently used tables, such as blogs and usermeta:

set global hot_cache.key_buffer_size=2197815296;
cache index blogs, usermeta in hot_cache;

Rewrote the query, replacing the derived-table join with EXISTS, since we don't need any column from the derived "userids" table in the SELECT clause. We also replaced the hard-coded start timestamp (1341100800) with an expression that computes the Unix timestamp of the first day of the current month, so it no longer has to be updated by hand.



SELECT
cp.uid userid,
u.user_login,
metaFirst.meta_value AS first_name,
metaPrimarySchool.meta_value AS primary_school_blog_id,
b.alias AS primary_school_blog_alias,
b.blog_name AS blog_name,
SUM(cp.measures) AS measures
FROM users AS u
INNER JOIN usermeta metaFirst ON (metaFirst.user_id=u.id AND metaFirst.meta_key = 'first_name')
INNER JOIN usermeta metaPrimarySchool ON (metaPrimarySchool.user_id=u.id AND metaPrimarySchool.meta_key = 'primary_school_blog_id')
INNER JOIN blogs b ON (b.blog_id = metaPrimarySchool.meta_value)
INNER JOIN cp cp ON (cp.uid=u.id AND cp.timestamp>=cast(unix_timestamp(concat(year(curdate()),'-', lpad(month(curdate()),2,0),'-', '01')) as unsigned))
WHERE EXISTS
(
SELECT 1
FROM usermeta as userids
where meta_key LIKE '%_capabilities'
AND
(
meta_value LIKE '%abc%'
)
AND userids.user_id = u.id
)
GROUP BY cp.uid
ORDER BY measures DESC
LIMIT 100;



Now the query returns its result in under a second.

