Saturday, June 30, 2007

Parallel Earth?

Apparently, the American Department of Defense (DOD) is building a virtual earth with a node for every single man, woman, and child. This is done to aid in the process of predicting the output of certain scenarios such as how would "you" react if food was cut off, or if "you" were shown a piece of media with some propaganda in it.
We know that they've been running simulations to predict outcomes, one famous simulation is that the americans nationalized the suis channel in their simulations even before Jamal Abdel Naser did that. But this is taking it to a totally new level.

Hmmm, but i found a way to break the system!!! I'll run a simulation of their system, and so, they will simulate me simulating them simulating me simulating them ... and the system will just explode !!! ;-D

Full article can be found here: http://www.theregister.com/2007/06/23/sentient_worlds/

Tuesday, June 26, 2007

!!سلبطة عيني عينك

معقول يا عالم؟
و الله صرت خايف ييجوا يتملكوا داري عشان العبدلي!!

حسبي الله و نعم الوكيل


Sunday, June 24, 2007

Wise quote... or not!!

Give a man a fire, he is warm for a day.
Teach a man how to build a fire, he is warm for the rest of his life.
Set a man on fire, he is warm for the rest of his life, and a few hours after!!

Monday, June 18, 2007

Experience with MySQL

While working on a project the past two weeks, i had to deal with a rather large database to import and query. What follows is my experience in this project and some interesting trivia about how MySQL server works. My computer machine is a bit ancient, i have a P4 2.8HT CPU, 512 MB DDR RAM, and a SATA harddisk. My operating system is Ubuntu 7.04, and the only other process that was running was firefox (to pass the time reading bash :-p), Database server is MySQL 5.0.22, default installation (except i changed default charset to utf8). All queries were run using the MySQL terminal client.

Here's the database structure:
Table A:
id int(11) primary key not null
num int(15) unique
contains ~ 1 million rows

Table B:
id int(11) primary key not null
A_id int(11) references Table A, id not null
attribute int(5) (contains around 130 distinct values)
contains ~ 13.5 million rows

Now the idea is that based on user input i need to get Table A.num, filtered by whatever the user selects in Table B.attribute. My first reaction was to do a join between the tables:

SELECT DISTINCT a.num FROM a, b WHERE a.id = b.a_id AND (b.attribute = x OR b.attribute = y OR ...) ORDER BY rand() LIMIT z;

After two and a half hours of hogging the CPU at 100%, i killed the process and decided it wasn't such a good idea. The problem here is that a cartesean product was calculated first before starting to filter, which would generate a temporary table with 13,381,664 * 991,483 rows, followed by recursing over the entire temporary result set to check for my filter. After which, the result would be ordered according to a.num and duplicate values would be emitted according to the distinct rule, followed by re-ordering the result set randomly, and finally selecting LIMIT number of rows to display.

So i decided to find another way, asking some friends i learned that rewriting it as a nested query would give me better results so i changed my query to the following:

SELECT a.num FROM a WHERE a.id IN (SELECT DISTINCT a_id FROM b WHERE b.attribute = x or ... ORDER BY rand() LIMIT z);

The idea here was to work on table B by itself, filtering the data, getting a subset of rows that is z rows long, randomly ordered, and is filtered according to my criteria, and use it to get the data i required.
First I was surprised here to find out that my database server version does not support order by or rand clauses in the nested query, which would force me to move them to the outer query. This would make the inner query generate a much larger result set, which would be used to to filter the outer table then randomizing them and getting the number of rows i wanted. Needless to say this is slower than what i wanted but since it simply wasn't possible to do it the way i wanted to i rewrote it like so:

SELECT a.num FROM a WHERE a.id IN (SELECT DISTINCT a_id FROM b WHERE b.attribute = x or ...) ORDER BY rand() LIMIT z;

After about an hour and twenty minutes, the query finished and a result set was presented. I did a little victory dance, but i knew this huge delay won't be tolerated, so i had to think about how to hasten it even further.

Now, i've heard a lot about indices, but never actually had to use them. Other than my primary key, or unique constraints, i never created an index. So, i decided to try this index thing, thinking that it wouldn't help much... but what the heck. I created an index on the attribute field in table B. The process took about 3 hours.
Re-running both of my queries. I was extremely surprised to see the first query (join) was actually FASTER than my second query (nested). The only thing i managed to conclude was that MySQL doesn't support indices in inner statements (not sure, if you have a better explanation, please do share). The nested query took about the same time to finish (a few minutes faster, but since i only ran it once it might have been a fluke), while the join statement was reduced to a whooping 20 minute execution time.

Getting high on the results, and gaining some new found respect to indices, i decided to create a second index, this time on the fields (attribute and a_id) in table B. Creating this index took a LOT of time, like 6-7 hours, but the results were worth it. Now my join query finishes in about 40-50 seconds. creating the index on a_id actually makes a difference because in my query the first condition is "a.id = b.a_id", and it gets evaluated first, so when we get to the filtering part, we don't have a cartesean product, only the rows that are related.

I'm at this stage now, still looking for ways to make even faster, however i gained a lot of respect to indices. I still have no idea how google manages to finish our search queries in a fraction of a second, i know they have WAY better hardware, but they also have WAY more information in their servers. One thing though, is that they have the ability to tune their DB server, a thing i can't don't have access to. Also, when you have a read only database (like my case), indices are a VERY good investment. Also, understanding how your DB server works, and how the query gets evaluated could give you a huge advantage.

Special thanks goes to my project manager, Ghassan Noubani, for rewriting my join query to its present form, and suggesting the second index.