Unless you set sync_binlog = 1, a system crash on the master will likely fail any slave with an “Client requested master to start replication from impossible position” error. Generally, this kind of situation requires manual intervention. When we see this, we make sure things indeed failed “past the end” of a binlog (i.e. the bit that didn’t get to the physical platter before the crash), reposition the slave to the next binlog, and use the Maatkit tools to ensure the slave is properly synced.
sync_binlog=1 is a problem in itself, because it makes the server not just do one fsync per commit, but several and that’s serious overhead. sync_binlog is actually not a boolean but a “fsync binlog every N commits” where 0 meaning “never”. So you could set it to 10 (fsync every 10 commits) and thus reduce the loss a little bit while not doing too much harm to performance. But it’s not ideal and won’t always prevent the above error.
As previously mentioned, Darren Cassar has been working on a new automated installer for the DBbenchmark program. It’s now available for download: click here. All you need to do is save it to the directory that you want to install to and then make sure it’s executable: “chmod 700 installer.sh”, then run it “./installer.sh”.
How long it may take MySQL with Innodb tables to shut down ? It can be quite a while.
In default configuration innodb_fast_shutdown=ON the main job Innodb has to do to complete shutdown is flushing dirty buffers. The number of dirty buffers in the buffer pool varies depending on innodb_max_dirty_pages_pct as well as workload and innodb_log_buffer_size and can be anywhere from 10 to 90% in the real life workloads. Innodb_buffer_pool_pages_dirty status will show you the actual data. Now the flush speed also depends on number of factors. First it is your storage configuration – you may be looking at less than 200 writes/sec for single entry level hard drive to tens of thousands of writes/sec for high end SSD card. Flushing can be done using multiple threads (in XtraDB and Innodb Plugin at least) so it scales well with multiple hard drives. The second important variable is your workload, especially how dirty pages would line up on the hard drive. If there are a lot of sequential pages which are dirty Innodb will be able to use larger size IOs – up to 1MB flushing dirty pages which can be a lot faster than flushing data page by page.
So if we have system with single hard drive doing 200 IO/ssc, 48G buffer pool which is 90% dirty and completely random page writes we’ll look at 13500 seconds or about 5min per 1GB of Buffer pool size.
This is worse case scenario though it is quite common in practice to see shutdown time of about 1min per GB of buffer pool per hard drive.
So far the benchmarking script supports Linux, FreeBSD, and OSX. I’m installing virtual machines today to get ready for development on the next OS that the community wants to have supported. Vote today for your choice. Development will begin Friday 2010-09-03.
Note: There is a poll embedded within this post, please visit the site to participate in this post's poll. PlanetMySQL Voting:
Vote UP /
Vote DOWN
At yesterdays Eigenbase Developer Meetup at SQLstream’s offices in San Francisco we arrived at a new logo for LucidDB. DynamoBI is thrilled to have supported and funded the design contest to arrive at our new mascot. Over the coming months you’ll see the logo make it’s way out to the existing luciddb.org sites, wiki sites, etc. I’m really happy to have a logo that matches the nature of our database - BAD ASS!
Often, the first step in evaluating and deploying a database is to load an existing dataset into the database. In the latest version, TokuDB makes use of multi-core parallelism to speed up loading (and new index creation). Using the loader, MySQL tables using TokuDB load 5x-8x faster than with previous versions of TokuDB.
Measuring Load Performance
We generated several different datasets to measure the performance of TokuDB when doing a LOAD DATA INFILE … command. To characterize performance, we vary
rows to load
keys per row
row length (including keys)
All generated keys, including the primary, are random, 8-byte values. The remaining data, needed to pad out the row length to specified length, is text.
Two files files are produced as part of data generation.
data file, containing ‘|’ separated fields
sql file, containing the CREATE TABLE command corresponding to the generated data
For instance, if the number of keys is 3 and the row length is 256 bytes, the following SQL statement is produced:
CREATE TABLE load_table (\
val0 BIGINT UNSIGNED NOT NULL,\
val1 BIGINT UNSIGNED NOT NULL,\
val2 BIGINT UNSIGNED NOT NULL,\
pad VARCHAR(232) NOT NULL,\
PRIMARY KEY (val0),\
KEY valkey1 (val1),\
KEY valkey2 (val2)\
) ENGINE=tokudb
We can make the data generation program available if anyone is interested.
Load Test
A simple shell script
creates the test table
performs a LOAD DATA INFILE <datafile> INTO TABLE load_table FIELDS TERMINATED BY ‘|’
returns execution time
For the experiments to be meaningful, we created datasets that do not fit in memory.
Results
We ran our benchmark on an Amazon Web Services c1.large node with 8 cores and 7 GB of memory. The test loads 100M rows (NOT pre-sorted). The data file was on a 2 disk RAID-0, the MySQL DB files on a different 2 disk RAID-0.
TokuDB Version 3 (~single-threaded) v. TokuDB Version 4 (multi-threaded)
Keys
Row Len
v3 rows/s
v4 rows/s
Speedup
1
64
27K
142K
5.1
4
64
13K
82K
6.2
1
256
7K
54K
7.2
4
256
5K
43K
8.2
Other metrics
Several metrics can be used to measure performance:
rows per second : data insert rate
key-value pairs per second : indicates how fast the primary and secondary indexes are being created
MB/s : how much raw data is being added to the database
Metrics for TokuDB v4:
Keys
Row Len
Rows/sec
KV-pairs/sec
MB/sec
1
64
142K
142K
9.1
4
64
82K
330K
5.3
1
256
54K
54K
13.9
4
256
43K
173K
11.1
These results show
significant parallelization (we believe larger CPU core count machines will see even larger benefits)
a significant jump in absolute load performance
speed-ups are not limited to tables with many keys – even the 1 key tables are 5-7x faster
We will report further results, especially speedups on larger CPU count machines, as they become available.
I am speaking at MySQL Sunday. The title for my talk is Success with MySQL and I will focus on things that operations and users can do to make a MySQL deployment succeed. There are many interesting talks scheduled for Sunday, including several at the same time as mine. I hope to see you there.
I get a number of question about contentions/"stuck in..". So here comes some explanation to:
Contention
Thread Stuck in
What you can do about it
In 99% of the cases the contentions written out in the out file of the data nodes (ndb_X_out.log) is nothing to pay attention to.
sendbufferpool waiting for lock, contentions: 6000 spins: 489200 sendbufferpool waiting for lock, contentions: 6200 spins: 494721
Each spin is read from the L1 cache (4 cycles on a Nehalem (3.2GHz), so about a nanosecond). 1 spin = 1.25E-09 seconds (1.25ns)
In the above we have: (494721-489200)/(6200-6000)= 27 spins/contention Time spent on a contention=27 x 1.25E-09=3.375E-08 seconds (0.03375 us)
So we don't have a problem..
Another example (here is a lock guarding a job buffer (JBA = JobBuffer A, in short it handles signals for heartbeats and some other small things, all traffic goes over JobBuffer B).
jbalock thr: 1 waiting for lock, contentions: 145000 spins: 3280892543 jbalock thr: 1 waiting for lock, contentions: 150000 spins: 3403539479
(3403539479-3280892543)/(150000-145000)=24529 spins/contention Time spent on a contention: 3.06613E-05 seconds (30.66us )
This is a bit higher than I would have expected and I think more analysis is needed. However, i tend not to get these contentions on a busy system.
Ndb kernel thread X is stuck in ..
Ndb kernel thread 4 is stuck in: Job Handling elapsed=100Watchdog: User time: 82 System time: 667 Ndb kernel thread 4 is stuck in: Job Handling elapsed=200 Watchdog: User time: 82 System time: 668 Ndb kernel thread 4 is stuck in: Job Handling elapsed=300Watchdog: User time: 82 System time: 669 Ndb kernel thread 4 is stuck in: Job Handling elapsed=400Watchdog: User time: 82 System time: 670
Here the important is to look at how User time and System time behaves. If User time is constant (as it is here - 82ms), but the System time is growing (667, 668 etc) which indicates that the OS kernel is busy. Slow network? Sub-optimal kernel version? NIC drivers? swapping? some kernel process using too much cpu?
If User time is growing it is probably because the ndb kernel is overloaded.
What can you do about this?
In config.ini: RealtimeScheduler=1 LockExecThreadToCPU=[cpuids]
check that cpuspeed is not running ( yum remove cpuspeed )
.. and finally ask us to optimize more!
Also, pay attention if you get the contentions on an idle Cluster or a busy Cluster.
Join us at this live event in Milan to better understand what’s new with MySQL. You will learn more about the current and future state of MySQL, now part of the Oracle family of products. We will also cover Oracle’s investment in MySQL aiming to make it even a better MySQL.
In particular the following topics will be discussed:
Oracle’s MySQL Strategy
What’s New for:
The MySQL Server
MySQL Cluster
MySQL Enterprise
MySQL Workbench
Stay tuned because we are organizing a similar event in Rome that will be announced soon. Attendance is free, but you’ll need to register in advance. Seats are limited, register today!
(English version) Herhalde bildiğiniz gibi, biz tamamen sanal şirketiz, yani ana merkez gib bir şeyimiz yok. Hepimiz evinden çalışıyor ve biz sadece bir ya da iki kez yılda gerçek hayatta buluşuyorus. Bu sene kararımız İstanbul‘a düştü. Firmadaki tek Türk ben olduğumdan dolayı toplantının organizesi bana düştü.
İlk toplantı tarihi ile ilgili bir anket yaptık ve İstanbul’daki toplantı Ekim ayına karar verdik. Tam olarak, Perşembe, 7 Ekim’den Salı, 12 Ekim 2010′a kadar.
Yaklaşık 30 kişilik toplantıyı düzenlemek pek bir basit görev değil, bu nedenle toplantıyı hazırlanmak için bana bir asistan tahsis edildi – oldukçada ünlü bir asistan – My Widenius. My (okunuşu Mü) MySQL’ın My’sü ve ona bir mariadb.org e-posta adresinle ulaşabilisiniz.
Eğer Ekim’in başında İstanbul’da iseniz, sizi bizim toplantıya davet etmekten mutluluk duyarız. Üç toplantı günleri olacaktır: Cuma 8 Ekim, Cumartesi 9 Ekim ve Pazar 10 Ekim 2010. Toplantıların çoğu herkese açık olacaktır. Misafirlerimiz olarak Facebook, Percona ve Intel bizlen olucak.
Ayrıca yerel (İstanbul) kullanıcı grupları ve bizim toplantıya ilgilenlerini arıyoruz. Sizi İstanbul’da görmek üzere, …
Sphinx search is a full text search engine, commonly used with MySQL.
There are some misconceptions about Sphinx and its usage. Following is a list of some of Sphinx’ properties, hoping to answer some common questions.
Sphinx is not part of MySQL/Oracle.
It is a standalone server; an external application to MySQL.
Actually, it is not MySQL specific. It can work with other RDBMS: PostgreSQL, MS SQL Server.
And, although described as “free open-source SQL full-text search engine”, it is not SQL-specific: Sphinx can read documents from XML.
It is often described as “full text search for InnoDB”. This description is misleading. Sphinx indexes text; be it from any storage engine or external source. It solves, in a way, the issue of “FULLTEXT is only supported by MyISAM”. Essentially, it provided full-text indexing for InnoDB tables, but in a very different way than the way MyISAM’s FULLTEXT index works.
Sphinx works by reading documents, usually from databases. Considering the case of MySQL, Sphinx issues a SQL query which retrieves relevant data (mostly the text you want to index, but other properties allowed).
Being an external module, it does not update its indexes on the fly. So if 10 new rows are INSERTed, it has no knowledge of this. It must be called externally to re-read the data (or just read the new data), and re-index.
This is perhaps the greatest difference, functionality-wise, between Sphinx and MyISAM’s FULLTEXT. The latter is always updated, for every row INSERTed, DELETEd or UPDATEd. The latter also suffers by this property, as this makes for serious overhead with large volumes.
There’s more than one way to make that less of an issue. I’ll write some more in future posts.
Sphinx does not keep the text to itself; just the index. Sphinx cannot be asked “Give me the blog post content for those posts containing ‘open source’”.
Sphinx will only tell you the ID (i.e. Primary Key) for the row that matches your search.
It is up to you to then get the content from the table.
With SphinxSE (Sphinx Storage Engine for MySQL) this becomes easier, all-in-one query.
It can keep other numeric data. Such data can be used to filter results.
It provides with GROUP BY-like, as well as ORDER BY-like mechanism.
It allows for ordering results by relevance.
It allows for exact match search, boolean search, and more.
It has an API & implementation for popular programming languages: PHP, Python, Perl, Ruby, Java.
The above describes Sphinx as a general fulltext search engine for databases. It does, however, have special treatment for MySQL:
First and foremost, it knows how to query MySQL for data (duh!)
If you don’t mind compiling from source, you can rebuild MySQL with SphinxSE: a storage engine implementation. This storage engine does not actually hold any data, but rather provides an SQL-like interface to the search daemon.
Thus, you can query for search results using SELECT statements, JOINing to document tables, retrieving results, all in one step.
If you do mind compiling MySQL, be aware that MariaDB comes with SphinxSE built in in newer versions.
It implements the MySQL protocol. You can connect to the sphinx server using a MySQL client, and actually issue SQL statements to retrieve data. Not all SQL is supported. The valid subset is called SphinxQL.
Open Query is now three years old! We initially started with consulting and training services, and extended this with our proactive subscriptions that also offers system administration and monitoring.
So how is it going? Pretty well. We’ve been profitable from the start, without funding (beyond a few hundred $ startup costs paid by Arjen) or any credit – by choice. Our objective has never been to grow ridiculously in terms of revenue or number of customers, we simply charge reasonable prices for real service. Right now we have dozens of clients on an ongoing basis, a neat trickle of new clients, and Open Query sustains the livelyhood and lifestyle of a number of people.
For me (Arjen), the three year mark is particularly interesting, since most startups do not make it past their first two years. With our different approach to doing business, we’ve seen our fair share of skepticism. Not that we mind, if anything it’s encouragement
If you’d like to learn more about our business principles, see the Upstarta site.
I use Workbench for my daily work, and it’s a great tool. If you haven’t tried the 5.2 release yet, you should. While performing some maintenance, I happened to issue a DELETE statement against a table which had no indexes (it was 10 rows), and Workbench complained:
Error Code: 1175
You are using safe update mode and you tried to update a table without a WHERE that uses a KEY column
It turns out this is a new feature in 5.2.26 (and is still there in 5.2.27) – Workbench now uses the equivalent of –safe-updates mode for the mysql command-line client (also known as the –i-am-a-dummy option – seriously). This wasn’t exactly convenient for me, especially since the DELETE was part of a larger script which I then had to revise and step through manually after it failed, but there’s an easy way to change this behavior. If you’re like me, you might consider disabling this:
Go to Edit -> Preferences
Select the SQL Editor tab
Uncheck “Forbid UPDATE and DELETE statements without a WHERE clause (safe updates)”
Despite the text, the –safe-updates mode affects more than UPDATE and DELETE statements without WHERE clauses – it requires such statements to explicitly use indexes.
I’m changing this behavior on my installation before I run into other problems.
In database modeling, a m:n relationship is usually resolved by an additional table. But what if this relation is used only for archiving and the number of links in the resulting table is not too high? In that context, I got the idea to store all referring ID's as CSV string directly into a TEXT column of one of the referring tables. I came to this idea, because otherwise I would have to build complicated foreign keys and this way I also save one additional table. Certainly, this only makes sense if the data is not frequently accessed as foreign key. Nevertheless, I would like to tackle the problem, even if the implementation is very MySQL-oriented.
“Hey DBA! Remember that change you made for my group four or five months ago? Well, we were waiting for things to get better but they have not. Can you change it back ASAP?!?”
I know several of you DBAs after reading the above are reaching for their antacids. OR something stronger.
Keeping track of what changes were made to which systems in your head is bound to bite a hard working DBA sooner or later.
This is one of those oh-so-obvious tips that will be ignored by many but it will come back to haunt like a cheap RAID array. Get a notebook, a pen, and keep it by you when you work. Now anytime you make a material change or perform a maintenance function on one of your systems, note what EXACTLY the change you made, the reason for the change, plus the day and date. It will take some effort to do this the first two times but you this to become an ingrained habit. Keeping track of running myisamcheck, changing the IP address of a replication slave, or setting up a MyIsam buffer cache is tedious but it will pay off when you are assailed with something similar to the sentence that starts this entry.
And for those of us with poor rotten hand writing — take the effort to make it legible. You may need to read it at 3AM after twenty hours of fighting a problem. Or the poor person covering for you while you are on vacation can not tell if you changed max_connections or max_connect_errors and decides to set them both to zero.
So why not store this data electronically? Well, Murphy’s Law aside, it is very handy to have something to refer to that is portable, does not need electrons to read, and is based on a technology even the most pointy haired boss can understand.
And it helps to have a print out of your my.cnfs taped into the back of the notebook.
Here are the last two entries in my log.
August 30, 2010 — Installed UDF for calculating median value on IBMtest1.
September 10, 2010 — System ‘A1′ : changed DBBC.NumBlocksPCT from 66 to 20 in Calpont.xml. PrimProc process was not able to allocate 66% but was able to get 20%. Will slowly walk up setting until it fails again to determine maximum setting.
Not much new this time: just summary of part 3 published and fixed mistake in chapter 10 (thanks, Shane!).
Summary.
In the third part we discussed methods of application debugging in cases when query plays secondary role in the problem.
I'd like to bring your attention we only discussed most frequent cases while MySQL server has a lot of parameters which of them can affect application. Analyze parameters which you use. One of the methods is run problematic query using MySQL server running with option --no-defaults and examine if results are different for MySQL server run with parameter which you use. If results are different analyze why parameter affects it and solve the problem.
Most of the time it would be handy to have a native MySQL script which would allow one to kill the sleeping connections which are in sleep state for more than 180 sec.. On the other hand DBA's can use "wait_timeout" etc parameters to control this..
[code]
# -- Make sure you are logged as MySQL 'root' user or any user who have got super privileges
DELIMITER $$ DROP PROCEDURE IF EXISTS `uKillSleepingSessions`$$
CREATE PROCEDURE `uKillSleepingSessions`() COMMENT 'This routne is used to kill idle sessions' READS SQL DATA
BEGIN
DECLARE no_more_rows BOOLEAN; DECLARE loop_cntr INT DEFAULT 0; DECLARE num_rows INT DEFAULT 0; DECLARE uID bigint(4);
DECLARE my_cur CURSOR FOR SELECT ID FROM information_schema.PROCESSLIST PL WHERE PL.COMMAND='Sleep' AND PL.TIME > 180;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET no_more_rows = TRUE;
OPEN my_cur; select FOUND_ROWS() into num_rows;
the_loop: LOOP
FETCH my_cur INTO uID;
IF no_more_rows THEN CLOSE my_cur; LEAVE the_loop; END IF;
SET @tmp_sql= CONCAT("KILL ",uID); PREPARE s1 FROM @tmp_sql; EXECUTE s1;
DEALLOCATE PREPARE s1; SET loop_cntr = loop_cntr + 1; END LOOP the_loop;
END$$ DELIMITER ;
# Usage from mysql prompt/gui client # Once called , it would cleanup idle connections which are sleeping for more than 180 sec call uKillSleepingSessions();
We’re proud to announce the next release of MySQL Workbench, version 5.2.27. This is the second maintenance release for 5.2 GA (Generally Available). This maintenance release does not introduce any new features, but focuses on general product improvement and usability. We hope you will make MySQL Workbench your preferred tool for Design, Development, and Administration of your MySQL database applications.
As always, we want to thank everyone for the great feedback we have received. This helps us to continuously improve and extend the functionality and stability of MySQL Workbench – please keep up on approaching us with any ideas to develop our product even further.
MySQL Workbench 5.2 GA
Data Modeling
Query (replaces the old MySQL Query Browser)
Administration (replaces the old MySQL Administrator)
Please get your copy from our Download site. Sources and binary packages are available for several platforms, including Windows, Mac OS X and Linux.
In addition to the new Query/SQL Development and Administration modules, version 5.2 features improved stability and performance – especially in Windows, where OpenGL support has been enhanced and the UI was optimized to offer better responsiveness.
This release also includes improvements to the scripting capabilities of the SQL Editor. You can read more about it in
If you need any additional info or help please get in touch with us.
Post in our forums, leave comments on our blog pages or if you want to talk to us directly you can visit us on our IRC channel #workbench on irc.freenode.net.
Ignacio Nin and I (mostly Ignacio) have worked together to create tcprstat[1], a new tool that times TCP requests and prints out statistics on them. The output looks somewhat like vmstat or iostat, but we’ve chosen the statistics carefully so you can compute meaningful things about your TCP traffic.
What is this good for? In a nutshell, it is a lightweight way to measure response times on a server such as a database, memcached, Apache, and so on. You can use this information for historical metrics, capacity planning, troubleshooting, and monitoring to name just a few.
The tcprstat tool itself is a means of gathering raw statistics, which are suitable for storing and manipulating with other programs and scripts. By default, tcprstat works just like vmstat: it runs once, prints out a line, and exits. You’ll probably want to tell it to run forever, and continue to print out more lines. Each line contains a timestamp and information about the response time of the requests within that time period. Here “response time” means, for a given TCP connection, the time elapsed from the last inbound packet until the first outbound packet. For many simple protocols such as HTTP and MySQL, this is the moral equivalent of a query’s response time.
The statistics we chose to output by default are the count, median, average, min, max, and standard deviation of the response times, in microseconds. These are repeated for the 95th and 99th percentiles as well. Other metrics are also available. Here’s a sample:
tcprstat uses libpcap to capture traffic. It’s a threaded application that does the minimum possible work and uses efficient data structures. Your feedback on the kernel/userland exchange overhead caused by the packet sniffing would be very appreciated — libpcap allows the user to tune this exchange, so if you have suggestions on how to improve it, that’s great.
We build statically linked binaries with the preferred version of libpcap, which means there are no dependencies. You can just run the tool. In the future, packages in the Percona repositories will provide another means for rapid installation via yum and apt.
tcprstat is beta software. Several C/C++ experts reviewed its code and gave it a thumbs-up, so many eyes have been on the code. We’ve performed tests on servers with high loads and observed minimal resource consumption. I personally have been running it for many weeks on some production servers without stopping it and have seen no problems, so I am pretty sure it has no memory leaks or other problems. Nevertheless, it’s a first prototype release, and we want much more testing. We might also change the functionality; as we build tools around it, we discover new things that might be useful. When we’re happy with it and you’re happy with it, we’ll take the Beta label away and make it GA.
The tcprstat user’s manual and links to downloads are on the Percona wiki. Commercial support and services are provided by Percona. Bug reports, feature requests, etc should go to the Launchpad project linked from the user’s manual. General discussion is welcome on the Google Group also linked from the user’s manual.
[1] Historical note: we initially called this tool rtime, but did not publicize it. However, some of you might have heard of “rtime” before. This is the same tool.
On September 19, 2010, Oracle is hosting MySQL Sunday, a half-day technical conference jam-packed with the latest on MySQL, the world's most popular open source database. The sessions will offer you insights into the latest MySQL technical innovations and community developments. Check out the agenda.
Keynotes
We are very excited that Marten Mickos, CEO, Eucalyptus Systems, will be joining us to deliver the closing keynote at MySQL Sunday, in addition to Edward Screven, Oracle's Chief Corporate Architect and Head of the MySQL business.
Secure your seat
MySQL Sunday is open to all Oracle OpenWorld, JavaOne, and Oracle Develop attendees, including those with the value-priced Discover pass ($75 if you register by September 18). You will be asked if you are attending MySQL Sunday during the conference registration process. Register today.
When: Sunday, September 19, 12:30 p.m. - 5:00 p.m.
There was a lot of talk about this being the next menace after email spam. I’m not actually sure what it’s called for VoIP systems, but my Asterisk setup has started to be attacked over the last few days. Lots of entries like:
[Aug 27 19:20:30] NOTICE[18826] chan_sip.c: Registration from '"742"<sip:742@a.b.c.d>' failed for '208.109.86.187' - No matching peer found
...
[Aug 31 10:13:10] NOTICE[18826] chan_sip.c: Registration from '"1002" <sip:1002@a.b.c.d>' failed for '41.191.224.2' - Wrong password
Lots of messages get logged a second and I noticed this as suddenly CPU load on my PC jumped up quite a bit.
For the moment I’ve routed these addresses via the interface lo0 so they won’t bother me any more, but I need to come up with a better solution.
First I’m curious if applications like Asterisk or FreeSwitch have any built-in anti-abuse controls to recognise bad behaviour and to disable those abusers. I’m pretty sure that I’ve not read about anything for Asterisk, and I’m currently reading the FreeSWITCH book I bought but haven’t come across this mentioned yet. Seems that applications like this may need to have these controls added at some time, just as sendmail, postfix and most mail servers have had to adjust to a hostile world.
The other option of course is to use a firewall or packet filter to limit the incoming traffic rate from a single IP to port 5060 or whereever the SIP connection is being accepted so that when going over the limit the ip will be blocked for some time. iptables can do this I think so I’m going to have to read about how to configure and set that up.
There are other applications designed to watch logs and use them to automatically add temporary blocks. fail2ban is one of these. I’ll also have to see if I can configure it for this task.
So if this has happened to you how do you protect your VoIP systems from that hostile world of the Internet?
The development cycle is moving right along for the community’s newest MySQL benchmarking script. I’m pleased to announce that we now officially support FreeBSD (version 8.1 tested) so go ahead and download now and test your FreeBSD, Linux, or OSX MySQL server! Click here for the download.
Courtesy of Darren Cassar and some generous coding this weekend, we’re going to be releasing a auto-installer / updater for the application which you can use to automate that part of the process. Stay tuned for information on that release.
Are you thinking...not one more blog, please! We have received a lot of feedback that we at Oracle need to be more vocal about our investment and progress with MySQL. The MySQL team at Oracle is very excited to launch a new blog where we will offer you the latest and greatest updates on product announcements, news, events, customers, activities, and overall progress about MySQL. You can be sure to find a mix of technical and business content.
As you continue to follow your favorite MySQL bloggers, we also hope that you will add "Oracle's MySQL Blog" to that list over time.
I manage MySQL product marketing at Oracle. You will meet a number of my colleagues in product marketing, product management, community relations and product development over time as you see them write through this blog.
Thanks for listening, and we look forward to your feedback.
Scenario: Master-Master replication
Description: Master A is the active db server whilst Master B is a read only swappable db server hence both are creating binary logs. During backup I run “FLUSH LOGS” in order to have a simpler point in time recovery procedure if that case arises.
Problem: Flush logs is mean mean command :) …. it rotates not only my binary logs but my error log too (since I user error-log=blahblahblah in my my.cnf). Well given I flush logs every night my error log is cycled through every night, but unlike binary logs which have an incrimental number attached to the fine, error logs only have a `-log` attached to the filename and a second “FLUSH LOG” would just clear all error logs permanently. That is really not fun believe me!
So what is the solution? you could either:
1. Not use “FLUSH LOGS” (nah that aint happenin)
2. Not use –error-log (that aint happenin either cos I need to use it for my specific setup)
3. Create an ugly hack script which saves the error log or renames it or whatever (Ugly Ugly)
4. Create a variable in MySQL which says: flush-log-ignore-error-log which if set “FLUSH LOGS” will know not to mess around with my error log. Hence … create a feature request (nicer and friendlier solution) and submit a tested patch too whilst at it.
If you think it is not idea that “FLUSH LOGS” recycles through error logs or and you have been bitten by the way it works like me, then feel free to put your vote into http://bugs.mysql.com/bug.php?id=56385&thanks=sub so that MySQL sees the importance of it.
The bidding war between HP and Dell for 3PAR has created great theater. The rationale is simple, both HP and Dell want a complete set of products to sell into the new cloud space and 3PAR is the only bitsized morsel among EMC, IBM and Hitachi that addresses this space. What is the compelling advantage they offer in storage? Elasticity. 3PAR provides the ability for companies to add/remove storage in thin slices (AKA thin provisioning). How does this relate to ScaleDB? We do the exact same thing for databases in the cloud and we do it for the most popular database in the cloud, MySQL.
How does VMWare play into this? Their CEO Paul Maritz was on CNBC talking about the hybrid cloud and how companies want to run core cloud capabilities on premise and then use the public cloud providers to handle compute overflow during peak usage. This means that public cloud value to corporations, assuming Maritz is correct, is based largely on their ability to provide elasticity. It will no longer be sufficient for public cloud companies to provide reserved servers, because the reserved servers will be run in the company’s data center. The public cloud will add/remove servers to handle peaks in usage. So elasticity is EVERYTHING. ScaleDB is all about elasticity for the database.
It is also interesting to note from the Maritz interview that he sees the next wave of cloud (and hence the next wave of cloud consolidation) coming from the software sector. More specifically, the ability to take existing applications and make them run on the cloud. In other words, to make them elastic. Again, this is exactly what ScaleDB does. We take existing MySQL applications and make them elastic.
It is also interesting to note that HP and Dell have decimated their own R&D and are now looking to acquire that expertise from outside, and they are willing to pay for the expertise.
Another theme playing out in the background makes this situation even more interesting. Oracle has adopted a systems approach, where they combine their hardware and software:
“The heart of the interview focused on Oracle's interest in Sun. By combining Sun's expertise in hardware with Oracle's software, Ellison suggested, the combined company can become a powerful "systems" company that sells solutions to businesses. The competitor that Ellison wants to beat: IBM.”
Summary: Cloud is the next battle ground. It all starts with the hardware/infrastructure (e.g. 3PAR) and then moves upstream to software. Oracle will be focused on selling complete systems, alienating HP & Dell, among others. This is compounded by the fact that HP and Dell have decimated their R&D, so they are forced to partner/acquire. At the same time, if Maritz’s vision of public clouds becoming effectively excess capacity for handling peaks from corporations is realized, then elasticity in the cloud will become critical as well. This obviously plays to ScaleDB’s strengths.
Last week I followed an very interesting ORACLE webinar delivered by Chris Mason : The State of Btrfs File System for Linux
BTRFS was initiated by Chris Mason who used to be responsible for Reiserfs at Suse and now works for Oracle. The first release started in 2007. BTRFS has been merged into Linux kernel in 2009. Now there are developers from REDHAT, INTEL SUSE, IBM, HP ... storage vendors. The project is very active. Ubuntu is considering to use it soon as its default filesystem. BTRFS is licensed under the GPL license. An interesting to read short summary of the life of BTRFS : A short history of BTRFS
...
To get information from a database it is necessary to execute a query to get this data.
Usually an ordinary SQL editor is used to create queries. To use such editor, one should remember the syntax of the SELECT operator and the names of tables and columns.
Let’s use a visual instrument developed specially to design queries, and see that it’s much easier to create queries visually instead of typing them in an editor.
Task:
It’s necessary to show the salaries of the employees of departments situated in different cities for the 2008 year in descending order.
After this let’s drag tables from Database Explorer to the diagram, the order of tables during dragging doesn’t matter.
As we can see, the application joins these tables automatically.
Query Builder: Query Diagram
Now let’s select the columns you need to get data from.
Click the checkbox near the Loc column of the dept table on the diagram, and after that the SalAmount column of the sal table.
You can see the selected columns on the Selection tab.
Query Builder: Selection Tab
Now let’s select the sum function on this tab in the column with aggregate functions for the SumAmount column.
Query Builder: Aggregate
Now it is necessary to set grouping by the Loc column, but the application selected to group data by the Loc column automatically. Let’s make sure of that by going to the Group By tab.
Query Builder: Group By Tab
Now we should cut the selection and keep only data of the 2008 year in the result.
To do this, let’s go to the Where tab and click the button with the green plus on it. The “=” symbol should appear. Let’s click the first phrase – <enter a value>.
Query Builder: Enter Value
After this the Operand Editor form should appear. Let’s select the Date and Time group from the Function list and double click the year(date) function in the list. After that let’s choose and double click the SalDate column in the other list.
Query Builder: Operand Editor
Let’s close the form and click the second one. Let’s enter 2008 there.
Query Builder: Type Constant
It’s time to execute the query we’ve created visually. To do this, let’s click F5.
Query Builder: Query Result
Now let’s look at the structure of the query we’ve created. To do this, let’s open Document Outline and open all nods.
Query Builder: Document Outline
Now let’s look at the DML of the created query. To do this, let’s go to the Text tab.
Query Builder: SELECT Query
Conclusion:
As we can see, the usage of a visual instrument for building queries allows to solve the task visually without going deep into the refinements of syntax of the SELECT statement itself and of the specifics connected with differences between MySQL and SQL Server syntax, to look at the syntax of the created query, to decrease the duration of the data selection process, and to look at the structure of the available query as a tree.
Quick solution to an issue that the affected Debian Lenny release: the process used to collect the MEMORY_ACTIVE_BYTES variable has been modified to correct a situation where some systems report an array of memory information instead of the expected single integer value. The bug has been fixed in revision 21 and the current download (revision 22) is available for download or svn update. As usual, you can download the MySQL dbbenchmark script here: [downloads].
Thanks goes to Brian Vowell at Evernote.com for bringing this bug to my attention. The original bug report can be found here: [link]
There are many times when writing an application that single threaded database operations are simply too slow. In these cases it’s a matter of course that you’ll use multi-threading or forking to spawn secondary processes to handle the database actions. In this simple example for Python multi-threading you’ll see the how simple it is to improve the performance of your python app.
#!/usr/bin/python
## DATE: 2010-08-30
## AUTHOR: Matt Reid
## WEBSITE: http://themattreid.com
## LICENSE: BSD http://www.opensource.org/licenses/bsd-license.php
## Copyright 2010-present Matt Reid
from __future__ import division
from socket import gethostname;
import threading
import sys
import os
import MySQLdb
class threader(threading.Thread):
def __init__(self,method):
threading.Thread.__init__(self)
self.tx =
self.method = method
def run(self):
run_insert()
def run_insert():
sql = "INSERT INTO table (`id`,`A`,`B`,`C`) VALUES (NULL,'0','0','0');")
try:
cursor.execute(sql)
db.commit()
except:
print "insert failed"
def init_thread():
for thread in range(threads):
background = threader()
background.start()
background.join()
def main():
try:
init_thread()
except:
print "failed to initiate threads"
sys.exit(0)
if __name__ == "__main__":
mysql_host = "localhost" #default localhost
mysql_pass = "pass" #default dbbench
mysql_user = "user" #default dbbench
mysql_port = 3306 #default 3306
mysql_db = "schema" #default dbbench
threads = 4 #must be INT not STR
try:
db = MySQLdb.connect(host=mysql_host, user=mysql_user, passwd=mysql_pass, db=mysql_db, port=mysql_port)
except MySQLdb.Error, e:
print "Error %d: %s" % (e.args[0], e.args[1])
sys.exit (1)
cursor = db.cursor()
main()
db.close()
We had a very successful weekend of Planet.mysql users submitting their database statistics so I’ve pushed some code into a new release today so that everyone can benefit from some new features. The biggest change is to the threading logic. Previously the benchmarking script was serializing MySQL operations and only making use of a secondary thread (not the invoking thread) to query the database. Now you have the option of running with “–threads=x” to make use of your multi-core server. A good example of this improvement was on my Macbook Pro; before the threading change it was inserting ~700/sec, after the code change I tried –threads=4 and saw an improvement to ~900/sec. Rather significant.
“The wonderful thing about standards is that there are so many of them to choose from.” – Grace Hopper
A Shape file is the most common format for GIS vector data and just about every GIS program can use them. Unfortunately not all GIS data come in a shape file format. An E00 (E-zero-zero) file is the file format used by Environment Systems Research Institute’s (ESRI). ESRI is the Photoshop of the GIS workplace. E00 files are used by ArcInfo and ArcGIS Explorer. These are the two most common file formats. Spatial Data Transfer Standard (SDTS) is used by the US government to transfer data between dissimilar computer systems and GPS (.gpx) data used to transfer data to and from a GPS device.
Here is a list of the vector and raster file formats I have come across.
.shp/.shx/.dbf Shapefile – vector data (All three files make up a set.)
.gpx/ .gps a common GPS data format for software applications
.gml Geographic markup language
.pdb PocketAPRS vector map
.map APRSdos/WinAPRS/MacAPRS vector map
.gnis GNIS labels file (actually points instead of vectors)
.tif/.fgd geo TIFF raster image maps
Geospatial Data Abstraction Library (GDAL) is the work horse utility convert between these formats.. It is really a set of utilities and is the guts of the MapServer. The program ogr2ogr is included in GDAL. This program is key to converting and getting data into MySQL. The OGR toolkit is a sub kit of the FW Tools Toolkit. It has several command line tools. The two most useful are:
ogrinfo – inspects a GIS data source and spits out summary data or detailed information about the layers, kinds of geometries found in the file.
ogr2ogr – this is a command line tool that converts data. Ogr supports many formats: Shapefile, MapInfo Tab file, TIGER, s57, DGN, CSV, DBF, GML, KML, Interlis, SQLite, ODBC, ESRI GeoDatabase (MDB format), PostGIS/PostgreSQL, MySQL
Like the unix ‘file’ command, these utilities can recognize the type of data you feed them. One odd thing about ogr2ogr is, its command options are in output file then the input file order.
Here are a couple of useful examples from a cheat sheet on these utilities.
# Again this is based on Redhat 5.5
yum install gdal
ogrinfo ../data/Oklahoma/oklahoma_poi.shp
INFO: Open of `../data/Oklahoma/oklahoma_poi.shp'
using driver `ESRI Shapefile' successful.
1: oklahoma_poi (Point)
wget http://mappinghacks.com/data/TM_WORLD_BORDERS-0.2.zip
mkdir boundaries_shp
unzip TM_WORLD_BORDERS-0.2.zip -d boundaries_shp
cd boundaries_shp
ogr2ogr -f "MySQL" MySQL:"geo,user=root,host=localhost,password=" -nln world_borders -lco engine=MYISAM TM_WORLD_BORDERS-0.2.shp
mysql geo -e 'select iso3,astext(SHAPE) from world_borders where iso2="VA";'
+------+--------------------------------------------------------------------+
| iso3 | astext(SHAPE) |
+------+--------------------------------------------------------------------+
| VAT | POLYGON((12.445 41.903,12.451 41.907,12.456 41.901,12.445 41.903)) |
+------+--------------------------------------------------------------------+
In part one I said “GIS / Mapping Systems work with both text data and graphical data.” There are even more raster graphical file formats. GDAL has a library for raster GEO data types with a very large list of file formats it supports. Besides converting format types, GDAL can also warp, stitch and re-project images to fit a give projection. These blog posts are about MySQL so I will leave the conversion of graphical data to someone else.
In a coming posts I’ll go over:
What data is available and where can you find it?
More examples on what you can do with GIS data and MySQL.
MySQL Connector/Net 6.1.5, a new version of the all-managed .NET driver for MySQL has been released. This is a maintenance release of the 6.1 branch and is suitable for use with MySQL server versions 5.0 and higher. It is now available in source and binary form from http://dev.mysql.com/downloads/connector/net/6.1.html] and mirror sites (note that not all mirror sites may be up to date at this point of time – if you can’t find this version on some mirror, please try again later or choose another download site.)
MySQL Connector/Net 6.0.7, a new version of the all-managed .NET driver for MySQL has been released. This is a maintenance release of the 6.0 branch and is suitable for use with MySQL server versions 5.0 and higher. It is now available in source and binary form from http://dev.mysql.com/downloads/connector/net/6.0.html] and mirror sites (note that not all mirror sites may be up to date at this point of time – if you can’t find this version on some mirror, please try again later or choose another download site.)
I'm pleased to announce the release of Spider storage engine version 2.22(beta), Vertical Partitioning storage engine version 0.11(beta) and BKA-for-ha_partition version 0.3. Spider is a Storage Engine for database sharding. http://spiderformysql.com/ Vertical Partitioning is a Storage Engine for vertical partitioning for a table. http://launchpad.net/vpformysql BKA-for-ha_partition is a patch file of supporting "Batched Key Access" for table partitioning feature. http://launchpad.net/partitionmrrformysql
The main changes in this version are following. Spider - MS Windows support. - Add UDF "spider_copy_tables". Spider HA feature is now available!
Vertical Partitioning - MS Windows support.
Pathced MySQL source code and compiled MySQL binary are available from this release.
Please see "99_change_logs.txt" in the download documents for checking other changes. Thanks to Toru for working for supporting Windows.
Just in case it wasn't clear from Hakan's post, we are opening up the next Monty Program company meeting in October 7-12 to be a general MariaDB developers meeting. (In fact, we've had a few guests in all of the previous meetings too, but now it's formal and public.) Ever since Sun folded this annual MySQL AB tradition (to save money) there has been people asking when the next meeting would be, since for the developer community outside MySQL AB it was the main networking and information sharing event of the year. Last MySQL user conference we agreed that something needs to be done, and this is it. If you work on any of the MySQL variants, a storage engine, or are otherwise interested in deep architectural MySQL/MariaDB discussions, you are welcome to join and should contact Hakan or My for details.1 The invitation of course also is valid for Oracle employees, in case you were wondering.
And that's the other significant news hiding in Hakan's post: My Widenius, an experienced traveller herself, will be in charge of meeting logistics. This means even My herself now works for MariaDB, and can be reached with my at mariadb dot org :-)
1. If you are a Drizzle hacker, it is probably less interesting, but you are of course still welcome if you come! Anyway, we did also discuss having a similar developer meeting adjacent to an OpenSQL camp or something that could cover broader topics, and this is still an option. But as it is now, this is a MySQL/MariaDB focused meeting and there already was a separate Drizzle Developer day and we can just see what else is needed.
This has really been a long debate as to which approach is more performance orientated, normalized databases or denormalized databases. So this article is a step on my part to figure out the right strategy, because neither one of these approaches can be rejected outright. I will start of by discussing the pros and cons of both the approaches.
Pros and Cons of a Normalized database design.
Normalized databases fair very well under conditions where the applications are write-intensive and the write-load is more than the read-load. This is because of the following reasons:
Normalized tables are usually smaller and have a smaller foot-print because the data is divided vertically among many tables. This allows them to perform better as they are small enough to get fit into the buffer.
The updates are very fast because the data to be updated is located at a single place and there are no duplicates.
Similarly the inserts are very fast because the data has to be inserted at a single place and does not have to be duplicated.
The selects are fast in cases where data has to be fetched from a single table, because normally normalized tables are small enough to get fit into the buffer.
Because the data is not duplicated so there is less need for heavy duty group by or distinct queries.
Although there seems to be much in favor of normalized tables, with all the pros outlined above, but the main cause of concern with fully normalized tables is that normalized data means joins between tables. And this joining means that read operations have to suffer because indexing strategies do not go well with table joins.
Now lets have a look at the pros and cons of a denormalized database design.
Pros and cons of denormalized database design.
Denormalized databases fair well under heavy read-load and when the application is read intensive. This is because of the following reasons:
The data is present in the same table so there is no need for any joins, hence the selects are very fast.
A single table with all the required data allows much more efficient index usage. If the columns are indexed properly, then results can be filtered and sorted by utilizing the same index. While in the case of a normalized table, since the data would be spread out in different tables, this would not be possible.
Although for reasons mentioned above selects can be very fast on denormalized tables, but because the data is duplicated, the updates and inserts become complex and costly.
Having said that neither one of the approach can be entirely neglected, because a real world application is going to have both read-loads and write-loads. Hence the correct way would be to utilize both the normalized and denormalized approaches depending on situations.
Using normalized and denormalized approaches together.
The most common way of mixing denormalized and normalized approaches is to duplicate related columns from one table into another table. Let me show you by example:
Suppose you have a products table and an orders table.
The normalized approach would be to only have the product_id in the orders table and all the other product related information in the products table.
But that would make the query that filters by product_name and sorts by order_date inefficient because both are stored in different tables.
In a fully normalized schema, such a query would be performed in the following manner:
SELECT product_name, order_date
FROM orders INNER JOIN products USING(product_id)
WHERE product_name like 'A%'
ORDER by order_date DESC
As you can see MySQL here will have to scan the order_date index on the orders table and then compare the corresponding product_name in the products table to see if the name starts with A.
The above query can be drastically improved by denormalizing the schema a little bit, so that the orders table now includes the product_name column as well.
SELECT product_name, order_date
FROM orders
WHERE product_name like 'A%'
ORDER by order_date DESC
See how the query has become much simpler, there is no join now and a single index on columns product_name, order_date can be used to do the filtering as well as the sorting.
So can both the techniques be used together? Yes they can be, because real word applications have a mix of read and write loads.
Final words.
Although, denormalized schema can greatly improve performance under extreme read-loads but the updates and inserts become complex as the data is duplicate and hence has to be updated/inserted in more than one places.
One clean way to go about solving this problem is through the use of triggers. For example in our case where the orders table has the product_name column as well, when the value of product_name has to be updated, then it can simply be done in the following way:
Have a trigger setup on the products table that updates the product_name on any update to the products table.
Execute the update query on the products table. The data would automatically be updated in the orders table because of the trigger.
However, when denormalizing the schema, do take into consideration, the number of times you would be updating records compared to the number of times you would be executing SELECTs. When mixing normalization and denormalization, focus on denormalizing tables that are read intensive, while tables that are write intensive keep them normalized.
This weekend, I decided to attend BarcampJB pretty last minute. Lucky for me, barcamps are made for chaotics like me, so it was no problem at all. I found some friends that live here in Kuala Lumpur who I drove down to JB with (JB is around a 5 hour drive from KL, we did it in 3.5 ).
The camp was very interesting. Because JB is on the border with Singapore, there’s a good crossover between Malaysian and Singaporean techies.
I decided to go all out and give three talks on Saturday: First up was the MMM talk I’ve given at a few conferences before. All went well, and later on in the day some people approached me for more in-depth questions. It still seems that people have this idea in their head that they somehow need MySQL Cluster when there is more then one machine involved. When I explain them that that is very rarely the case and they can achieve what they want with MMM as well, they are often happy to hear that.
My next talk was more of a personal development one. People keep asking me here where I am from. When I explain to them that I’ve been location independent for the last 3 years, they are usually very eager to find out how I pull that off. I decided to summarise my experiences and put them in a talk. This talk was very well attended and I loved giving it. Most of the attendants were young techies, they are usually in a perfect position to do something very similar to what I’m doing.
The last talk was a lightning talk on Zabbix, the Open Source monitoring system we use at Open Query. Quick, and dirty, but effective.
Other interesting talks I attended were on breeze, an online banking application made for Standard Chartered bank that looks very slick and usable (If anyone from my bank is reading this: get with the program and fix our banking application to enter the 21st century please ).
Conary and Foresight Linux were interesting as well. Conary (the package management system in Foresight Linux) is not quite mature yet, but definitely a very interesting technology. I was interested to hear about it and hope to see it become more mainstream in the future.
Daniel Cerventus gave a good lightning talk on what not to do as a startup. The main message was to just do it, and not wait for grant money or VC’s. Some solid tips as well, one of them being to run your potential name through Namechk, a handy potential username checker for many services.
There was obviously also a lot of networking and we went for a foot massage at the end of the day. Funny fact: I was the only one to stay awake through the massage (Even though I am narcoleptic), while two of my friends (who I won’t name here ) snored all the way through it
All in all another succesful tech event in Malaysia. Definitely one of the many reasons I love living here!
How To Set Up MySQL Database Replication With SSL Encryption On
Debian Lenny
This tutorial describes how to set up database replication in MySQL
using an SSL connection for encryption (to make it impossible for
hackers to sniff out passwords and data transferred between the master
and slave). MySQL replication allows you to have an exact copy of a
database from a master server on another server (slave), and all updates
to the database on the master server are immediately replicated to the
database on the slave server so that both databases are in sync. This is
not a backup policy because an accidentally issued DELETE command will
also be carried out on the slave; but replication can help protect
against hardware failures though.
Just a quick note to let everyone know that our new benchmarking script now supports OSX 10.6 on Intel hardware. That means you can run one simple command and get all of the sequential and random INSERT and SELECT performance statistics about your database performance. As usual the script is open source and released under the new BSD license. Give is a try by downloading now! See the download page for more details.
Do you ever stop and think? It happens to me all the time, random subjects kick in and I end up searching (aka googling) and today I wondered … Why is MySQLs logo a dolphin? Shame on me it took me so long to ask the question but as the saying goes, better late than never.
As per MySQLs own article:
“We’ve been discussing a new logo for at least a year”, says main developer and MySQL AB founder Michael “Monty” Widenius. “I am personally concerned about the survival of endangered species, and I liked the idea of the dolphin as soon as it came up. It combines great symbol value with a powerful, modern design.”
After a while the name was chosen out of 6357 suggestions by the community and that is how Sakila the dolphin came to be :) …. and we lived happily ever after (or at least hope to!)
MySQL 5.1 has a great feature which not many people know about, that is the fact it can be extended via. the use of plugins. Unfortunately how you go about this is not incredibly well documented. You can search for examples on the internet, dig through the MySQL source code and ask on the forums and you may figure it all out. But doing all this is time consuming and could easily put someone off. So Sergei Golubchik and I have got together to bring you this book which will show you, using examples, how to write your own plugins.
We start by explaining the UDF API which has been around for a long time, and then move on to Daemon Plugins, Information Schema Plugins, Full-text Search Plugins and Storage Engine Plugins. Each with usable examples.
MySQL 5.1 Plugins Development has just been published by Packt Publishing and I believe it is well worth a look if you are thinking of extending MySQL.
Welcome to DBbenchmarks.com, a publicly accessible database that tracks anonymously submitted data about MySQL server performance. You can use this site to see research the performance of certain types of hardware when running MySQL. Our open-source benchmarking script is free to own and use, we only ask that you allow the script to connect to this database and submit the results. All results and data collected is anonymous and viewable on this site. We only track performance data from MySQL – you can see the list on the About page.
It’s that time of the year again — Oracle OpenWorld time — and it’s my pleasure to announce our regular Oracle bloggers meetup again this year. We all know that Oracle community has grown this year so we expect to see folks from all the different technologies including MySQL, Java, Sun hardware folks in addition to the core Oracle database and apps crowd.
So… all of you Oracle bloggers attending Oracle Open World 2010…
… you are invited to attend this Oracle Bloggers Meetup during OOW 2010 — a chance to meet your online buddies face-to-face in relaxed and informal atmosphere.
See the “Lower Dining Room” on the floor plan below and ask where is the “Bloggers Meetup” booked under my name — Alex Gorbachev. These are the keywords to find us easily.
The plan is to gather at 5:30pm on Wednesday after three (or four for those of us starting on Sunday) days of intense learning. This year, you won’t need to find where to kill few hours in between of the OOW sessions and customer appreciation event at the Treasure Island — the best place to be this year is our bloggers meetup — the place where all the “cool kids” are.
As usual, thanks to Oracle Technology Network and Pythian for sponsoring the venue and drinks. HP is planning to establish a prize again this year for something fun… yes, we will again do something fun.
Last year, we were collecting signatures on our Bloggers Meetup T-Shirts so feel free to wear them this year to show your seniority at the event. ;-) This year’s activity is a surprise but if you have something cool in mind — let me know privately {last_name} at pythian.com.
For those of you who don’t know the history… The Bloggers Meetups during the Oracle Open World were started by Mark Rittman and continued by Eddie Awad and then I picked up the flag. They have been great success so let’s keep them this way! To give you an idea, here are the photos from the OOW08 Bloggers Meetup (courtesy of Eddie Awad) and last year’s meetup blog post update from myself.
If you are planning to attend, please comment here with the phrase “COUNT ME IN”. This will help us make sure we have the attendance numbers right. I will maintain the list here. Make sure you provide your blog URL with your comment — it’s a Bloggers Meetup in the end! Make sure you comment here if you are attending so that we have enough room, food and (most important) drinks.
Of course, do not under any circumstances forget to blog and tweet about this year’s bloggers meetup.
Looking forward to seeing all of you again this year!
Stuttgart: a rainy day, waiting for Iftar. Good time for good news!
During my vacation I read about a request for a MariaDB package for Mac OS X and did some research. Back from vacation I have an alpha version of MariaDB package for Mac OS X for our community to test.
Caution: this is the first installer I ever wrote on a Mac, so use it on a test system only!
I would like you to test the installer and provide us with feedback.
Known issues in the MariaDB installer:
The Preferences Pane app for starting/stopping the server instance is missing
The installer for setting up MariaDB as a Startup Item is missing.
Side node: while digging into the Mac installer I found two
bugs in the MySQL Mac OS X installer.
TokuDB has a big advantage over B-trees when trickle loading data into existing tables. However, it is possible to preprocess the data when bulk loading into empty tables or when new indexes are created. TokuDB release 4 now uses a parallel algorithm to speed up these types of bulk insertions. How does the parallel loader performance compare with the serial loader? We use the Air Traffic Control (ATC) data and queries described in a Percona blog and also used in an experiment with TokuDB 2.1.0 to gain some insight.
Our ATC data is about 122M rows in size, is stored in a 40GiB CSV file, and can be found in our Amazon S3 public bucket. See the end of this blog for details. We use a table schema with 8 indices to speed up the ATC queries. The loader inserts the data into the primary fractal tree and one fractal tree for each of the 8 keys.
The load was run on two (old) machines:
2 core: Intel Core 2 Duo E8500 3.16GHz, 5GiB RAM, 1 disk SATA 1TB.
8 core: 2 socket quad core Xeon X5460 3.16GHz, 16GiB RAM, 6 disk SAS 1TB RAID0.
Load times:
Load Times for the ATC Database
TokuDB 2.1.0 and MySQL 5.1.36
TokuDB 4.1.1 and MySQL 5.1.46
TokuDB Speedup
2 core
10,974s
5,201s
2.1x
8 core
11,286s
2,655s
4.2x
We use “LOAD DATA INFILE” to load the CSV file into a MySQL table. The CSV file can be found in our Amazon S3 public bucket.
We see over 4x load time speed up for this database. This compares favorably with the other loaders. The next steps are to run experiments on machines with a larger number of cores, identify bottlenecks, and remove them.
BTW, the ATC load times into MyISAM are roughly the same as seen with the TokuDB 2.1.0 serial loader. Perhaps loads into MyISAM can be made faster with some twiddling of the MySQL system variables.
TokuDB data sizes (including indices):
6.7 GiB, or almost 6 times smaller than the CSV source file. The parallel loader uses the same compression parameters as the serial loader, so no change in size occurred.
Query times:
Query Times for the ATC Database
TokuDB 2.1.0
TokuDB 4.1.1
Q0
22s
28s
Q1
67s
33s
Q2
20s
24s
Q3
54s
29s
Q4
2s
1s
Q5
11s
7s
Q6
35s
29s
Q7
32s
39s
Q8.1y
7s
4s
Q8.2y
42s
32s
Q8.3y
37s
33s
Q8.4y
38s
35s
Q8.10y
139s
56s
Q9
36s
35s
TokuDB 4 uses the standard MySQL handler row iterator, so the query times are roughly unchanged. In addition, the query handling is not yet parallelized, so there is not much difference in query times on the 2 core and 8 core machines.
BTW, the query times for MyISAM are slightly longer than TokuDB.
The queries can be found in our Amazon S3 public bucket.
Future blogs:
Loader scalability on systems with larger number of cores. Machines with 48 cores (4 socket, 12 cores per socket) are now reasonably priced.
How we implemented the parallel loader: algorithms and parallel runtime.
The ATC CSV data files, the schema, and the queries can be retrieved from our public Amazon S3 bucket called tokutek-pub. Here are the Amazon S3 keys:
I recently encountered an interesting case. A customer reported that mysqld crashed on start on OpenSUSE 11.2 kernel 2.6.31.12-0.2-desktop x86_64 with 96 GB RAM when the innodb_buffer_pool_size was set to anything more than 62 GB. I decided to try it with 76 GB. The error message was an assert due to a failed malloc() in ut_malloc_low() in ut/ut0mem.c inside InnoDB source code. InnoDB wraps the majority of its memory allocations in ut_malloc_low(), so to get an idea of the pattern of requested allocations I added a debugging fprintf() to tell me how much was being allocated and whether it was successful.
I discovered something interesting. I expected the allocation to fail on the 76 GB of the buffer pool, due to some weird memory mapping issue and a continuous block of 76 GB not being available. However, that is not what happened. 76 GB buffer was allocated successfully. What was failing is the allocation of 3.37GB after that. What in the world could InnoDB need that was 3.37 GB? There was nothing in the settings that asked for anything close to 3 GB explicitly.
Source code is the ultimate documentation, and I took advantage of that. My good friend GDB guided me to buf_pool_init() in buf/buf0buf.c. There I found the following:
From the comments in the code (InnoDB code actually has very good comments), max_size is the maximum number of buffer pool pages (16K each), n_frames which is the same thing unless AWE is used, but it was not used, so I did not worry about it.
What shall we call that friend? It is used for storing some meta information about buffer pool pages. The most natural name I could come up with from reading the source code is the blocks array.
Thus we can see that we are allocating another chunk that is in proportion to the setting of innodb_buffer_pool_size for the blocks array. The exact proportions will probably vary from version to version, but roughly about 1 G for every 25 G of the buffer pool. This can become significant in the proper innodb_buffer_pool_size estimations when the system has a lot of RAM and you want to have the largest possible innodb_buffer_pool_size. Do not forget to give the blocks array some room!
While this was an interesting investigation, it nevertheless did not explain why there was not enough room for a 76 GB buffer pool. Even with the extra 3.37 GB allocation, there was still some free memory. Or was there? Maybe some hidden monster was eating it up? I quickly wrote this hack to prove or disprove the monster’s presence.
I verified that I could allocate and initialize two chunks of 40 GB from two separate processes, but not 80 GB from one. In fact, 80GB allocation failed right in malloc(), did not even get to initialization. I tested it with allocating 70 GB concurrently in each process so as to overrun physical memory + swap. Both allocations were successful, one initialized successfully, the other was killed by the OOM kill during initialization.
This smelled like a low ulimit, and sure enough it was. ulimit -m ulimited; ulimit -v unlimited did the magic, and mysqld successfully started with an 80 GB buffer pool. Apparenly OpenSUSE defaults are set in proportion to physical memory to keep the memory-hungry applications from taking the system down. On this particular system (96 GB physical memory, 2 GB swap it decided to set the virtual memory ulimit (-v) to 77.27 GB, and the physical memory (-m) to 80.40 GB).
This post is the fourth of a series that started here.
From the previous of this series, we now have resources configured but instead of starting MySQL, Pacemaker invokes a script to start (or restart) the EC2 instance running MySQL. This blog post describes the instance restart script. Remember, I am more a DBA than a script writer so it might not be written in the most optimal way.
First, let's recap what's the script has to perform (the full script is given below).
Kill the MySQL EC2 instance if running
Make sure the MySQL EC2 instance is stopped
Prepare the user-data script for the new MySQL EC2 instance
Launch the new MySQL instance
Make sure it is running
Reconfigure local heartbeat
Broadcast the new MySQL instance IP to the application servers
Kill the MySQL EC2 instance
In order to kill the existing MySQL EC2 instance, we first have to identify it. This is done by:
by filtering on the AMI type of the instance. Since an instance can be listed at the "stopped" state, it is mandatory to filter for states "running" or "pending". Then the instance is terminated with:
Terminating an EC2 instance is not instantaneous, we can confirm an instance is really stopped by monitoring its status and wait until it is actually "terminated". The code below is how the script performs this task.
Prepare the user-data script for the new MySQL EC2 instance
The new MySQL instance will be running heartbeat. Since we cannot use neither Ethernet broadcast or multicast, we need to configure the new instance so that it communicates through unicast with its partner node in the cluster, the node on which the restart script is run. This configuration is achieved by providing a user-data script (see the hamysql.user-data below) which completes the heartbeat configuration of the new instance. The hamysql.user-data script only performs a search and replace operation on the /etc/ha.d/ha.cf file and then restart the heartbeat service. In order for this to work properly, we just have to put the IP of the current instance in the script like here:
Now, Heartbeat, on the monitoring host, must be informed of the IP address of its new partner. In order to achieve this, a search and replace operation in the local ha.cf file followed of restart of Heartbeat is sufficient.
Broadcast the new MySQL instance IP to the application servers
The final phase is to inform the application servers that the IP of the MySQL has changed. The best way to list those application servers is through a security group and, provided the appropriate ssh keys have been exchanged, this code will push the IP update.
sleep 300 # 5 min before attempting again. Normally heartbeat should kill the script before
done
The hamysql.user-data script:
The script sets the IP of the monitor host in the heartbeat ha.cf configuration file and then, finishes up some missing configuration settings of the AMI.
The kind folks at O'Reilly sent me MySQL High Availability by Charles Bell, Mats Kindahl and Lars Thalmann . In summary, the book is awesome. Personally I didn't think that Replication was enough of a feature in mySQL to fill up 500+ pages about it, BUT, they did and not with fillers or info that is NOT actually important to you as the end designer. In Mark Callaghan's forward he states that this book "adequately" explains MySQL replication, this is a huge low-ball, I dare say if you did not look at the replication code (prior to row-based replication there was not a lot of it) your questions can be answered with this book. There are three parts to this book: Replication (roughly 50% of the book), Monitoring and Disaster Recovery, then finally High Availability Environments. Each part is well written and accurate. I have been using mySQL before Replication existed. When it was first released I have been using it and been to many talk about it, from the original author talks to Brian Aker’s talks to asking Jeremy Zawodny about it to giving my own talks about Scale-Out, replication, etc. My experience with mySQL Replication is rather robust. I broke holes in it, found bugs with it, and even wrote a quick C client (when it was still statement based replication only on the 4.1 branch) that would act as a proxy to gather all the replication feeds of all boxes onto a single box (it was a proof of concept). The book goes into clear detail about the common approaches and use cases of replication scaling out the apps reads. It even talks about data sharding that I must say I am an expert at (done it for Flickr/RockYou, for friends etc). The examples are clear although I would not recommend deploying their examples verbatim. The reason you never want to mod based on the number of shards in your system, because if you add more shards you are going to have to move your data all around. Central dictionaries are perfect to control balance. They go into this, but not how to lock a user to migrate to a lightly loaded shard. The example that they have creates holes in the flow, race conditions as well as downtime to move data, this is a nitpick, the book is good.
I do not want to give away the book, it is a good read and the quality is what you expect from O’Reilly. If you want to know about various replication techniques, replication shortcomings, building fail-over systems, and exploring other technologies that are comparable to replication, this is a great book for you. Oh and of note, they even go into exotic features that I rarely use in production and how that effects replication. Awesome.
We're opening up registration for our new training courses today. In short: we are moving from two days to a new four-day format. The new additions are created by:
Splitting our current InnoDB day in half. We now have one day for DBAs, and one day just on InnoDB topics.
A new Operations Day - covering how to maintain MySQL in production.
Our developer course has also undergone revision, and we now have more query tuning examples, and a new instrumentation chapter.
What is operations training?
Many companies split their technical staff between development, and operations. The operations team is responsible for tasks such as capacity planning, backup/disaster recovery, and carrying a pager. They are the heroes that fight fires.
Our operations day of training is delivered as a hands on class. Attendees will be divided into teams, and given a series of challenges to complete on EC2 machines. As part of the development of this course we wrote a sample LAMP application, embedded with minor flaws which students will need to fix while they try and keep the application up.
Where can you attend?
We're starting off by branding the operations day as a 'BETA'. You can attend for only $100 at San Francisco Thursday, 30 Sep 2010, New York Thursday, 14 Oct 2010 or Vancouver Thursday, 21 Oct 2010. There is also a discount of attending all four days for $1,450 if you book before 30 August.
After our initial BETA, the courses will be available in more USA and international locations. A partial list is already available on the training section of our website. We will confirm more cities in the coming weeks.
Recently I had the opportunity to do some testing on a large data set against two MySQL column-store storage engines. I'd like to note that this effort was sponsored by Infobright, but this analysis reflects my independent testing from an objective viewpoint.
I performed two different types of testing. The first focused on core functionality and compatibility of ICE (Infobright Community Edition) compared with MyISAM on a small data set. The second part of my testing compared the performance and accuracy of ICE with InfiniDB Community Edition on a 950GB data set.
The first first part of my analysis focused on testing specific MySQL functionally with Infobright's storage engine. A lot of my tests involved corner or edge cases since I have experience developing such cases for MySQL column-based storage engines in the past. I reported any bugs that I found, and contributed my test cases to ICE. In fact, some of the issues have already been addressed in the most recent software release. An example of such a problem would be "select avg(char_column) from table" where the column contains a mix of ascii data, such as names and numeric data. This is an example of implicit casting that probably wouldn't happen in a real application.
Importantly, I didn't find significant defects in Infobright that would be "show stoppers" for typical OLAP analysis queries. These tests were intended to approximate what would happen if you ported a MyISAM OLAP application to ICE. My testing suggests that with some basic testing, an application could be ported to ICE with a good chance of success. Of course, a good test environment is something that I think every operations team should insist on.
The second part of my analysis focused on testing a total of 29 queries on the large data set. I compared a number of different factors between the two databases, including:
Ease of installation
Loading capability and speed of loading
Accuracy of results of queries over the large data set
Speed of results of queries over the large data set
It should be noted that the second set of tests included 29 different queries, some of which were provided by Infobright and others which I contributed. InfiniDB does not support as many data types and aggregate functions as ICE, and therefore it could not run some of the queries. ICE supports almost all of the MySQL aggregation functions. Notably, GROUP_CONCAT is not supported, which is something I hope they rectify in a future version.
In addition, I was not able to get accurate results for all of the queries on InfiniDB. In particular the query "select count(*) from carsales.sales_fact" when run on InfiniDB returned a number that was higher than it should have been, and several GROUP BY queries returned unexpected results as well. I did not change any "out of the box" settings for Infobright. Even after I modified the configuration settings on a 16GB box, one query did not have enough memory to complete on InfiniDB.
Overall, I would say that ICE is more "ready for prime time" than InfiniDB given the inconsistencies that I encountered. I will try to reproduce the problems on InfiniDB into easily reproducible test cases which I may contribute to them, but this is difficult given the size of the data set involved. ICE was able to execute queries quickly, and with accurate results.
If you read Percona's whitepaper on Goal-Driven Performance Optimization, you will notice that we define performance using the combination of three separate terms. You really want to read the paper, but let me summarize it here:
Response Time - This is the time required to complete a desired task.
Throughput - Throughput is measured in tasks completed per unit of time.
Capacity - The system's capacity is the point where load cannot be increased without degrading response time below acceptable levels.
Setting and meeting your response time goal should always be your primary focus, but the closer throughput is to capacity the worse response time can be. It's a trade-off!Cary Millsap reminds us to think of this just like how traffic slows down with more cars on a highway:
You can actually choose to optimize a system in two different ways - for response, or for throughput. When you optimize for throughput you are relaxing (not eliminating) your response time objectives in order to have more tasks completed per unit of time.
It is much easier to relax response time objectives if the task is not user facing, which is why I often see applications and suggest that they convert a task that happens in the foreground to instead be sent to a message queue, or Gearman. Or in plain English: The same MySQL servers can achieve much more work, if you allow the potential for each individual task to take a little bit longer.
A couple of weeks ago one of my colleagues and I worked on a data corruption case that reminded me that sometimes people make unsafe assumptions without knowing it. This one involved SAN snapshotting that was unsafe.
In a nutshell, the client used SAN block-level replication to maintain a standby/failover MySQL system, and there was a failover that didn't work; both the primary and fallback machine had identically corrupted data files. After running fsck on the replica, the InnoDB data files were entirely deleted.
When we arrived on the scene, there was a data directory with an 800+ GB data file, which we determined had been restored from a SAN snapshot. Accessing this file caused a number of errors, including warnings about accessing data outside of the partition boundaries. We were eventually able to coax the filesystem into truncating the data file back to a size that didn't contain invalid pointers and could be read without errors on the filesystem level. From InnoDB's point of view, though, it was still completely corrupted. The "InnoDB file" contained blocks of data that were obviously from other files, such as Python exception logs. The SAN snapshot was useless for practical purposes. (The client decided not to try to extract the data from the corrupted file, which we have specialized tools for doing. It's an intensive process that costs a little money.)
The problem was that the filesystem was ext2, with no journaling and no consistency guarantees. A snapshot on the SAN is just the same as cutting the power to the machine -- the block device is in an inconsistent state. A filesystem that can survive that has to ensure that it writes the data to the block device such that it can bring into a consistent state later. The techniques for doing this include things like ordered writes and meta-data journaling. But ext2 does not know how to do that. The data that's seen by the SAN is some jumble of blocks that represents the most efficient way to transfer the changed blocks over the interconnect, without regard to logical consistency on the filesystem level.
Two things can help avoid such a disaster: 1) get qualified advice and 2) don't trust the advice; backups and disaster recovery plans must be tested periodically.
This case illustrates an important point that I repeat often. The danger of using a replica as a backup is that data loss on the primary can affect the replica, too. This is true no matter what type of replication is being used. In this case it's block-level SAN replication. DRBD would behave just the same way. At a higher level, MySQL replication has the same weakness. If you rely on a MySQL slave for a "backup," you'll be out of luck when someone accidentally runs DROP TABLE on your master. That statement will promptly replicate over and drop the table off your "backup."
I still see people using a replica as a backup, and I know it's just a matter of time before they lose data. In my experience, the types of errors that will propagate through replication are much more common than those that'll be isolated to just one machine, such as hardware failures.
There is a lot of discussions whenever running MySQL storing data on NFS is a good idea. There is a lot of things for and against this and this post is not about them.
The fact is number of people run their databases on NetApp and other forms of NFS storage and this post is about one of discoveries in such setup.
There are good reasons to have binary logs on NFS volume - binary logs is exactly the thing you want to survive the server crash - using them you can do point in time recovery from backup.
I was testing high volume replication today using Sysbench:
On this box I got around 12.000 of updates/sec which is not the perfect number, though it mainly was because of contention issues in MySQL 5.0 rather than any NAS issues.
This number was reachable even with binary log stored on NFS volume. This number is for sync_binlog=0 and innodb_flush_log_at_trx_commit=2
I noted however if I enable replication - connect the slave to this box the throughput on the Master drops to about 2800 updates sec.... which is very close to the magic number how many network roundtrips per second I can get over 1Gb link. It was even more interesting when that. If I would pause replication for prolonged period of time and let few GB of binary logs to accumulate the performance on Master will be high even with replication running, but it will slow down as soon as IO thread on the slave is caught up with master.
When I moved the Binary logs to the local storage I got very similar performance but there have been no degradation when replication is enabled.
I have not checked in details why this could be the case but I guess there is something which requires a network roundtrip when the binary log is written at the same time as slave-feeding thread is reading it.
I'd be curious to know if someone else can observe such behavior and if there is an NFS tuning which can be done to avoid it or if we need to fix MySQL
EQX root@cass01:/opt/cassandra/bin# ./nodeprobe -host localhost -port 8181 ring Address Status Load Range Ring facebook_1301003235_1301003235 10.129.28.22 Down 15.77 GB 9ZehBzpHHwnxiPJU |<--| 10.129.28.23 Up 7.59 GB facebook_100000471858343_1514390063 | | 10.129.28.14 Up 4.59 GB facebook_100000846936312 | | 10.129.28.20 Up 12.94 GB facebook_1301003235_1301003235 |-->|
Trying to get info from the host, the reads timeout. java.net.SocketTimeoutException: Read timed out
Doing an lsof -p on the java proc I see that it is holding open a bunch of sockets. So the node itself is hanging on something internal is my assumption.
Looking at /var/log/cassandra/system.log I see that the last rotation happened Jun 8th over a month ago and no new log is being written to. THe issue is the node just died today. So this seems like a bug to me.
Now since Cassandra does not tell me what the problem is, I assume that there is a bug in this version and searching Cassandra Jira bug database I see that a lot of stuff is fixed as well as added. So might as well as upgrade.
Before I upgrade I wanted to do research to see if anyone else has. To my surprise there doesn't seem to be any blog talking about upgrading from 0.5 to 0.6.3
I know its rather easy but there is some new stuff in 0.6.3 that is turned on by default: So let's see what changes in the conf
I see that in storage.xml there is some new XML attributes for the ColumnFamily tag such as RowsCached, new tags called HintedHandoffEnabled, Authenticator, DiskAccessMode, RowWarningThresholdInMB. Additional to this I noticed that a lot of XML tags are missing. A rolling upgrade is just not possible and is mentioned in NEWS.txt
Thus in my application I set this $GLOBALS['cfg']['disable_nosql_feature'] = 1; I have about 40 toggles to play with, a very helpful process to enable dynamically code with out breaking your site.
now time for an upgrade without the service running:
Steps:
Shut down Cassandra: dsh -g cassandra "pkill java" # same thing as stop-server
rpm -e cassandra-0.5.1
rpm -ivh cassandra-0.6.3.rpm
/opt/cassandra/bin/cassandra
Done. Note what the hell is cassandra-0.6.3.rpm, it's an rpm I created that has my storage-conf.xml log4j.properties cassandra.in.sh
After Upgrading:
*************************************************************** WARNING: ./nodeprobe is obsolete, use ./nodetool instead *************************************************************** Address Status Load Range Ring facebook_1301003235_1301003235 10.129.28.22 Up 11.75 GB 9ZehBzpHHwnxiPJU |<--| 10.129.28.23 Up 3.04 GB facebook_100000471858343_1514390063 | | 10.129.28.14 Up 2.33 GB facebook_100000846936312 | | 10.129.28.20 Up 4.4 GB facebook_1301003235_1301003235 |-->|
Now what is left to do it change my ganglia scripts / nagios scripts to use nodetool instead of nodeprobe.
Join us for MySQL Sunday, a half-day technical conference packed with the latest on MySQL, on September 19, 2010. You’ll also hear what's new and what's next directly from Edward Screven, Oracle Chief Corporate Architect, and other visionary technologists at the "MySQL Fireside Chat" general session.
So running through some various tasks, I'm finally on the section of work where I can resurrect a script that finds inconsistent data between master-master pairs.
Let's get a quick summary to find our problems to target the script at.
CHECKSUM TABLE uses a ACCUM algorithm to determine the table checksum. This gives a good fast snapshot in a shared lock mode so writes do not stop while doing a checksum (this is in the context of INNODB). RollingUserLog gets nearly 700 writes per sec per box (No alerts).
Ah so the data is inconsistent, we are not missing rows but 1 or more rows have different values. Now time to find it.
To resurrect my script now that I know what table is messed up:
Algorithm is this:
open connection to both servers - fork compare the data by scanning the table via a index walk crc32/md5/hash the rows compare columns if (dbfacebook38b.hash != dbfacebook38a.hash) mark row and record primary key to track position
print report
but wait why polish up my script when I can use a formal one that does the job great!
maatkit is awesome. Nearly every problem that I run into and built a script for has been formalized. I'm going to muster up some time and contribute to this toolkit like add my binary log rotate which makes sure that the slaves are caught up to the binary file that is being purged.
The book is a mixture of MySQL Cluster (NDB), Replication schemes, some performance tuning, some minor kernel tweaking, and some more exotic approaches to common High Availability problems. Overall, I found this book very informative and a good read. Now the specifics, the book starts out on NDB and stays focused on this fact for about 60% of the book. The next 20% is on mySQL replication then about 10% of the book is on tweaking kernel, mysql, network settings to get the most out of the system. The last 10% is a mixture of uses of exotic systems such as GFS, Conga, ISCSI and how to use these shared storage techs with mySQL.
High Availability MySQL Cookbook, is a good read, and jogged my memory on NDB (since I do not use it on a day-to-day basis). If you are looking for a good reference on how to get an HA system up then this is a good book for you. If you are looking for why to use NDB over mySQL multi-master replication, this is not a book for you. The “why-to-use-this-over-that” is not the scope of the book. The drawbacks and concerns of what technology to use are not a focus of this book, and should not be. It is assumed that the reader knows which direction to go into for the most part or gives the reader enough information to set up and environment to see which way to go.
In conclusion, I like this book. I think it is rather concise, and right to the point, which either gets you started into building HA systems or gives you a good reference for an existing HA environment.
In The Doom Of Multiple Storage Engines, PeterZ lists a number of unique technological challenges that MySQL has to face precisely because it chooses Storage Engines and the way it implements them.
There is another, nontechnical challenge that needs solving by vendors that are producing commercial closed source storage engines for MySQL - my experience is with one specific vendor, but the situation is basically the same for each and every closed source SE:
Where I work, we are a MySQL support customer. Actually, we are a very satisfied MySQL support customer, because in our opinion the MySQL support just plain rocks and has more than once saved our corporate asses, or at least greatly improved our MySQL experience.
If we were to load a closed source storage engine into our MySQL binaries, we would make life a lot harder for MySQL support, and despite the fact that MySQL support is actually extremely cooperative when it comes to non-MySQL binaries, we would most likely end up with a support situation that is a lot less stellar than what we have now.
In the case I am thinking about, the vendor does not just produce a storage engine, but according to their architecture diagram, more or less a fork of MySQL including changes to the optimizer and other non-SE parts of MySQL. The version strings of their binaries I have seen have smaller numbers than current MySQL versions, in some cases much smaller numbers down to what MySQL considers Alpha or Beta versions of the 5.1 server.
If there was a problem with that binary and we were to take it to MySQL support, they would most likely laugh us out of the room, and rightly so. If we took it to the storage engine maker and the problem relates to something that is not related to their product, but to the MySQL proper, we are probably not really talking to the right company for a fix. Or we are looking at a lot of potential for corporate finger pointing.
The solution - for us at least - is really to beat down any effort of our internal customers that is related to researching closed source MySQL storage engines, unless they can show us a support agreement or support endorsement for this storage engine by MySQL/Sun/Oracle. Their product may be technologically sound and the solution to all of our performance problems - we will probably not be interested until there is such an agreement. We'd rather introduce another commercial database server (which is neither MySQL nor Oracle) than run such an unsupported configuration.
So the most important thing for any storage engine maker is to be either open source, or be part of a support alliance with MySQL/Sun/Oracle. Without either you do not exist.
Fosimo 1.1.0 is available. Fosimo is a desktop app for Facebook which helps you control your Facebook account without browsing through Facebook.
Facebook released a new applications authentication flow based on OAuth protocol and a new Data permissions dialog so we moved forward and updated Fosimo with it.
Starting from Fosimo 1.1.0 version the user is no longer required to enter the username and password into Facebook's login page to authenticate and authorize the app. Instead the user have to provide a bunch of necessary data permissions for the app and Facebook processes the whole authorization process behind the scenes.
Unfortunately the new scheme is not fully backward compatible therefore if you are a Fosimo user you may notice some parts of the app not working starting today. In this case you need to remove the application from your Facebook account (via "My account" >> "Application settings") and restart it on your computer (make sure you have a new 1.1.0 version of Fosimo). This is supposed to solve permissions/authentication problems during the migration.
Check out this SlideShare Presentation: I'd have to say that this is the coolest presentation on Indexing and how it relates to INNODB. I have written on this subject many times in the pass but this presentation covers all the bases and does a great job at explaining WHY you should do certain statements over others.
Ok. So it all begins with somebody who is using INSERT ON DUPLICATE KEY UPDATE. That guy wants to count the number of UPDATE actions that statement has taken, as opposed to INSERT actions.
We could have been using mysql_info() to fetch that information. But instead we rig the UPDATE clause:
CODE:
root@localhost [kris]> create table t (
id integer unsigned not null primary key,
d integer unsigned not null
) engine = innodb;
Query OK, 0 rows affected (0.16 sec)
PACKT Publishing sent me titled "MySQL Admin Cookbook" to review and I told them that I would be brutally honest about it. They said cool and well here, we go.
Overall, the book is cool if you are starting out in MySQL administration and want to get a box up and running. If you are looking to scale MySQL or make your application faster this is not the book for you. If you are worried about consistency and getting the most out of your hardware-this is not the book for you. If you are trying to figure out what the best index combination is-again-this is not the book for you. If you want to know how to add users, or set up replication, or dump a CSV format text file of data then this is the book for you.
Some things that annoy me from this book is all of the GUI cut and paste screen shots. Explaining stuff with a GUI screen shot really sucks IMHO since by the time you read the book, the GUI changed. I personally stick with command line interfaces or write my own GUI layouts to administration actions since I know what the various ADMIN commands do. Let me stress again that GUI explanations really go out of date fast and is only pertinent for when the book is made. For instance if you ever used Eclipse, a common IDE for various languages (mainly Java), between Eclipse builds the GUI changes. The overall interface for the MySQL command line client has stayed the same since the very beginning. To be fair though the book does show some mySQL command line examples, like for handling NULLs but consistency is key to getting your ideas across.
Another pet peeve of mine is the book has a tag line 99 great recipes for mastering MySQL configuration and administration yet I couldn't confirm 99 recipes since the book is not actually structured this way IMHO. It is structured in the format of "How to do it", "How it works", and "there's more..." for certain actions and there is just not enough meat for Mastering MySQL configurations – like what is a Star Replication Schema and how to do it? How do you rotate in new servers when in a circular MySQL config? Where is MySQL clustering? Why are file sorts so slow? How is MySQL using the disk subsystem with this config ... etc.
IN conclusion, would I recommend this book to readers? If you need a starting point to ask Google for some more complicated questions-this is a good start. For experienced administrators, no it is not for you.
For the past couple of weeks I have been trouble shooting some Cassandra issues where data would not make it to Cassandra.
The image above graphs all the exceptions that are produced from Cassandra. The two big lines are
Transport Exceptions (te) - meaning that Cassandra could not answer the request think of this as MAX Connection errors in mySQL.
Unavailable Exceptions (ue) - meaning that Cassandra could answer the request but the "storage engine" cannot do anything with it because its busy doing something like communicating with other nodes or maintenance like a node cleanup.
So how did I get the graph to drop to 0? After looking at the error logs, I saw that Cassandra was getting flooded with SYN Requests and the kernel thought that it was a SYN Flood and did this
possible SYN flooding on port 9160. Sending cookies.
To stop this the puppet profile was changed to have
Next looking into the Cassandra log which I defined to exist in /var/log/cassandra/system.log
WARN [TCP Selector Manager] 2010-03-26 02:46:31,619 TcpConnectionHandler.java (line 53) Exception was generated at : 03/26/2010 02: Too many open files java.io.IOException: Too many open files
Then noticed that ulimit -n == 1024
thus I changed /etc/security/limits.conf so that It's at a server setting by adding this:
* - nofile 8000
Now my Transport Exceptions and Unavailable Exceptions are gone and data is being written to it consistently.
There are many other ways of doing the same thing, I could have modified my init script or did some other stuff but I choose this way. Default Distros set kernel and limits fields too low: settings for desktop levels.
In the past few months, I have tested many NoSQL solutions. Redis, MongoDB, HBase yet Cassandra is the Column Store DB I picked because of its speed (on writes), reliability, built in feature set that makes it multi-datacenter aware. The one other personal reward for Cassandra is it is written in Java. I like reading and writing in Java more than C++ although it really does not matter for me personally in the end.
Let us talk about the reason why I am introducing Cassandra into my infrastructure and some of its drawbacks I have noticed so far.
Why it is being introduced: We have a feature where we record every single click for 50 million Monthly Active Users (real-time) and storing this in mySQL is just waste of semi-good hardware for data that is only looked at for the past 24 hours. Over the course of some time (couple of months) more than 3 billion rows accumulated, which translated into a 3.5 TB distributed INNODB datafile. So purging/archiving this data just sucks.
Now introducing user clicks into Cassandra was rather easy. I researched various sites, asked my Digg buddy and then figured out the rest. Within two days I was up and running with Cassandra and had a great understanding about Column Families and SuperColumns.
Developing with a Cassandra Data layer:
Now that the infrastructure is up, I needed to add a data model to /opt/cassandra/conf/storage.conf. The SQL that drove this functionality consisted of two main SQL statements. Add click Get a Range of Clicks
For these operations, mySQL rarely takes 0.001 seconds (1 ms).
Cassandra for writes is rather fast, but for reads, Getting Range Clicks i.e. using recv_get_slice
it takes .02 seconds (20 ms).
What does this mean? MySQL is a hell of a lot faster! Is it because of my CF design? No. for instance, take this slug (keys separated by a delimiter to make a distinct key name).
The slug says that clicker A, clicked on page B and what is stored are columns of distinct clicks - remember this is a column store DB.
So {$clicker}_{$pageowner} == the KEY for the COLUMN $object_id_clicked with the value being a $value and the free extra value the $timestamp.
Here is the php code $columnPath = $this->getColumnPath($objectid); $this->getCassandraConnect()->insert(self::KEYSPACE, $this->getKey($clicker, $pageOwner), $columnPath, $this->today_ts, microtime(true), ConsistencyLevel::ZERO);
Now I want a list of items clicked. $data = $this->getCassandraConnect()->get_slice(self::KEYSPACE, $this->getKey($clicker, $pageOwner), $this->getColumnParent(), $this->getSlicePredicate(), ConsistencyLevel::ONE);
This says give me the last N logged clicks that the clicker A made for page owner B. This is a hash lookup (Big-O(1)) but a sorted list of columns (O(nlogn)) and return the last N elements.
Why is Cassandra sooo slow on reads. Is it because my memory config is not enough? No. 7 GB of data is allocated for data that fits in memory ( for now ).
My hunch is the slowdown is a combo of the low thread read pool and in the inherent sorting and there doesn't seem to be a way to turn it off, so without digging in the code I couldn't say (I will be able to once I get an understanding of the code layout).
In conclusion, I like Cassandra, it’s very fast in writes, slow (for my taste but fast enough) in reads and what takes 2 lines of SQL takes 250 lines of PHP code interfacing with the Cassandra.thrift suite. I am going to use it still because it is good enough and I love the built in HA of it. Additionally the performance is improving very quickly.
Recently I needed the query stream hitting a very busy master. Normally I would have been using the MySQL Proxy to collect queries, but on a very busy machine the Proxy is as much of a problem as it is part of the solution, so I chose a different approach.
I had a SPAN port configured for the master, which is Ciscospeak for a port on a switch which mirrors all traffic of one or more other ports. I had an otherwise idle machine listening to the SPAN port on a spare network card. That way it is possible to collect traffic to and from the master without any interference with the master.
On the listener box, I had tcpflow collecting data to my master (and only traffic to, not from the master):
CODE:
tcpflow -i eth1 dst master and port 3306
These tcpflow files now need to be processed into a slow-log like format for further processing. For that I wrote a very simple processor in C after some experimentation with tcpdump and mk-query-digest had been shown as being too slow to keep up.
The processor is called extract_queries and it's souce can be found below. It would be used like so:
CODE:
# mkdir flow
# cd flow
# tcpflow -i eth1 dst master and port 3306
(wait 1h)
(break)
# cd ..
# find flow -print0 | xargs -0 extract_queries -u > slow
# mysqldumpslow -s c slow > stats
Just released a desktop tool for facebook application administration.
"Please note: Fosimo.AA is not a separate app, but an extension to your existing apps, you configure it yourself with your app API_KEY/SECRET and it gives you extensible admin functionality for any app of yours just from your desktop. No need to browse to facebook site or login to facebook or authorize anywhere online in order to make it work.
Sometimes it is not worth an effort to develop a separate Ban module for each of your apps just to ban a couple of folks and it is not secure to host such module online."
Among features: ban and unban users, add global news, etc. The tool is not a freeware - $30 per license. License is not needed for some basic features to work.
Giving a talk at the University of Utah on everything from scaling, clustering, mysql, mysql internals, noSQL (Cassandra) to how to manage all this stuff. If you are there at University I'm bringing some Swag!
Brian Aker, a brilliant helpful duder, who I learn a lot from. Gives a great talk about what is NoSQL explained in a way for database guys. I warn you, there are some points in this video where you can't hear Brian due to the audience "participation" but you should get the content.
INNODB has some irritating gotchas that makes disk space management hard. In 2002ish INNODB, added innodb_file_per_table to get around allot of these issues, but it does not fix everything.
If you are running innodb_file_per_table, you will notice in your database directory
db.opt - database characteristics file.
tablename.frm - the table structure.
tablename.ibd - the actual innodb table space file
Imagine that you have a table with 10 million rows and you delete say 5 million rows in multiple chunks around 400K chunks, because deletes are slow. Next, you notice that the table space file did not shrink. So what do you do? OPTIMIZE tablename, tada all the wasted space is reclaimed, but here is the PROBLEM the ibdata file grew! ibdata stores all of the UNDO LOGS thus GROWS due to the deletes and space is never reclaimed.
Ok, lets try coping the tablespace file to another directory and re-import the tablespace file after wiping the data dictionary (ibdata).
For instance /etc/init.d/mysql stop cd /var/lib/mysql/DB cp * /tmp/hold rm /var/lib/mysql/ib* cp /tmp/hold/* /var/lib/mysql/DB /etc/init.d/mysql start // create the ibdata file
ALTER TABLE tablename IMPORT TABLESPACE ERROR 1146 (42S02): Table 'DB.tablename' does not exist
Really, there is nothing that you can do to force innodb to purge the undo logs in the ibdata file without a full dump of the data to text and then reimport the data.
So here are the steps to shrink all table spaces and the ibdata file
mysqldump --all-databases (or use mk-parallel-dump) stop mysql rm -f /var/lib/mysql/ib* /var/lib/mysql/DB/* start mysql mysqlimport (or use mk-parallel-restore)
Takes a bunch of time but there is no other recourse. If you know of another way please share :)
Finland-based Aito Technologies, developer of a customer experience analytics product suite, today announced its plans to employ Sun Microsystems’ MySQL Embedded Database Server software to successfully analyze up to billions of phone calls, text messages, and mobile data sessions on a daily basis.
Recently I moved from windows servers 2003 32 bit and IIS6 to windows servers 2008 64 bit and IIS7.
My experience was rather painful, windows 2008 seems like a wild animal after windows 2003 experience.
2008 has lots networking and scalability issues and it took me quiet a while and many tweaks and hacks
to achieve reliability and good performace plus scalability and solve many issues.
If you own a windows 2008 server under some significant load from the web issues described in this post maybe relevant for you. Maybe it may save you some time in desperate searching around the net for solutions...
Before we continue to windows 2008 and IIS7 tips - some relevant info about IIS7.
IIS7 breaking changes
IIS7 introduced new integrated pipeline model where asp.net pipeline is integrated into IIS
which has extensibility and performance benefits. This comes together with some breaking changes to configuration and asp.net.
IIS7 has static compression turned on by default and dynamic compression turned off by default.
My first tip is to disable static compression (website - compression in IIS7 manager).
It can be the problem when you have static XML files on you server for example.
XML files become unreadable by XML parsers with IIS7 static compression. Browsers and other 3rd parties will not be able to read them anymore because XML is malformed (possible bug?).
Change max concurrent requests per CPU setting
By default IIS7 has a limit of handling 12 concurrent requests per CPU and will queue requests above this limit. If you have some significant web load and many AJAX style requests to your server - this setting maybe very restrictive and it is hard to find out the root of the problem when you server performance is suddenly degraded.
See
this post
to get the picture how it can cause performance issues on your server.
Some relevant info about asp.net thread usage on IIS7 here:
Thomas Marquardt advice is to change this default limit. Recommended settings:
"All of this may be a little confusing, but for nearly everyone,
my recommendation is that for ASP.NET 2.0 you should use the same settings as the defaults in ASP.NET v4.0; that is, set maxConcurrentRequestsPerCPU = "5000" and maxConcurrentThreadsPerCPU="0".
This is done by adding DWORD MaxConcurrentRequestsPerCPU to the registry under
called MaxConcurrentRequestsPerCPU (DWORD). This key doesn't exist by default. Or/and in aspnet.config section which overrides registry setting (also doesn't exist by default)
aspnet.config is here on windows 64bit:
You need to add the following section under "configuration" section (here I used default values) -
don't forget change maxConcurrentRequestsPerCPU to 5000.
Disable offload network enhancements if you have network issues
Different unexplained network issue on your windows 2008 server
could be related to TCP Chimney and related "networking enhancements" in windows 2008 and certain
hardware vendors when hardware doesn't play well with these enhancements.
If you have networking problems similar to these below - try to disable these features.
Set Connection "close" request header explicitly posting to AJAX web services
When moving from IIS6 to IIS7 integrated mode you may experience issues similar to below
(POST request time-outs and request aborted errors) and all related to AJAX POST requests to IIS7 while AJAX GET requests don't have any problem.
I experienced the same issue. Some HTTP POST requests got aborted in IIS7 pipeline and never reach server code. It can happen one per 1000 requests. However it may be still a significant number and if you have many users it can degrade performance for other requests.
In HTTPERR logs you can notice Timer_EntityBody errors related. POST requests usually arrive split in 2 segments: headers and following request body while GET requests arrive in one segment. On network monitor you can notice that server received POST request headers but request body never arrives to the server and after some timeout IIS aborts such request.
The solution which can help minimize the number of these errors is to set Connection "close" request header on the caller explicitly. Somehow IIS7 doesn't treat ajax POST requests very well, while IIS6 is more robust and able to handle missing Connection "close" header.
Go figure.
Sun Microsystems today announced Cashpoint, the Austrian betting and gaming company, has selected Sun's MySQL™ database to manage its customer, Web, and results data. In order to protect its systems' high-availability, Cashpoint uses both a MySQL Enterprise™ Unlimited subscription and the MySQL Cluster database product. Employing an enterprise-wide MySQL Enterprise subscription along with MySQL Cluster contributes to considerable cost savings for the company.
Sun Microsystems, Inc. today announced the details of the Sun Tech Days worldwide developer conference for 2009-10 at http://developers.sun.com/events/techdays/index.jsp. Sun Tech Days is a multi-city world tour designed to showcase how the developer community can leverage Sun technologies, services and products to drive the next generation of industry innovation. Since 1998, the Sun Tech Days worldwide conference has brought together hundreds of thousands of developers and students to learn about technology from industry experts.
In a procedural language without the use of threads (or Inter Process Communication via forks), to execute I/O requests they are done one after another. Synchronous Queries produce at best a Big-O of N such that N is an element of I/O communication (queries) and N equals the number of queries needed to achieve the requested dataset. With IPC or threads we can speed up common O(N) problems to reduce the N with parallelism, its still functionally a O(N) yet from a single instance point of view N is much less because threads (IPC) takes that Serial computing component and executes the code in parallel. To better explain what I am talking about lets look at some PHP code:
foreach($friends as $friend){ $data[] = getMySQLData(“SELECT * FROM AccountData WHERE userid = $friend); }
The Primary key for the AccountData table is userid. Assuming that you have 5000 friends, the query has to be executed 5000 times. We can reduce the O(N) and change it to a O(nlogn) (Binary Tree - doesn't take into account other factors) by switching the query to
$data = getMYSQLData(“SELECT * FROM AccountData WHERE userid IN (….)”);
We just sped up the retrieval of the data significantly, yet we just introduced a bottle neck on the datalayer. Our architecture requires that the data is located in a single location.
What if AccountData’s data is spread across many servers federated by userid? This means that userid belongs to a server, so the server contains a shard of the AccountData’s Data.
Now we are back to a O(N) where each query needs to be executed on the corresponding shard. The logical next step is to group queries per shard and run across them all. For instance
$multiShardIDs = $genericShard->getMultipleShardIDs($objIds); foreach ($multiShardIDs as $shardID => $shardUserIDs) { if (stripos($orgQuery, " WHERE ") !== false){ $query = $orgQuery." AND {$column} IN (".implode(',', $shardUserIDs).") "; } else{ $query = $orgQuery." WHERE {$column} IN (".implode(',', $shardUserIDs).") "; }
$shard_to_sql[$shardID] = $query;
.... more stuff ....
Yet this is still a O(N) its just that N is smaller. Each query is still executed serially. Let’s look at some stats of synchronous queries of SELECT 1; This query is executed across 35 shards and the timings are from PHP point of view.
Field
End Value
Start Value
Delta
ru_oublock
0
0
0
ru_inblock
0
0
0
ru_msgsnd
0
0
0
ru_msgrcv
0
0
0
ru_maxrss
0
0
0
ru_ixrss
0
0
0
ru_idrss
0
0
0
ru_minflt
9872
9865
7
ru_majflt
0
0
0
ru_nsignals
0
0
0
ru_nvcsw
11344
11114
230
ru_nivcsw
977
968
9
ru_nswap
0
0
0
ru_utime.tv_usec
865054
849053
16001
ru_utime.tv_sec
16
16
0
ru_stime.tv_usec
556097
552097
4000
ru_stime.tv_sec
1
1
0
Total Execution Time
0.18323707580566
As you can see, to execute this from PHP it took 100 ms, 100s pages reclaimed and 200s voluntary context switches to query 35 servers.
Now let’s look at Asynchronous execution of SELECT 1; // the query generation is from PHP yet the execution is performed on a server that executes the query in parallel
Field
End Value
Start Value
Delta
ru_oublock
0
0
0
ru_inblock
0
0
0
ru_msgsnd
0
0
0
ru_msgrcv
0
0
0
ru_maxrss
0
0
0
ru_ixrss
0
0
0
ru_idrss
0
0
0
ru_minflt
9131
9121
10
ru_majflt
0
0
0
ru_nsignals
0
0
0
ru_nvcsw
3891
3889
2
ru_nivcsw
290
290
0
ru_nswap
0
0
0
ru_utime.tv_usec
596287
596287
0
ru_utime.tv_sec
4
4
0
ru_stime.tv_usec
460028
460028
0
ru_stime.tv_sec
0
0
0
Total Execution Time
0.019363880157471
As you can see from the table above executing the query asynchronously produced results with less context switching, less pages reclaimed and almost 10 times execution improvement over the synchronous query counterpart. How is the asynchronous query executed? Lets take a look at the figure below.
So A user comes through the firewall / load balancer with a HTTP Request to the www pool that runs PHP. PHP now makes a CURL request to the Async Shard Servers (through a LB same LB different PORT). The HTTP Request to the Async Shard Server contains the SQL we wish to execute. The Async Shard Servers has a thread per shard and executes the request in parallel. The results are merged and sent to the calling CURL process via JSON. The returned JSON is then converted into a PHP object. This is a typical three-tier environment.
When having to query multiple servers using an Asynchronous Tier is dramatically faster; in fact its as fast as the slowest server. This is the main sticking point of why asynchronous queries are faster then synchronous queries (in this context) since the total execution time for serial queries is the SUM of all the query execution.
The current version of the server is used for Friend Query execution across the datalayer. Its been solid for a few months now, and I'm currently getting permission to release it as an Open Source Product. The features this server contains:
Lightweight
CPU bounded
Scales Linearly
A Timer Thread to keep the database config up to date in memory and fetching the config from PHP so if PHP changes connections to the shards so does Java
Uses Java-6 Executor Service
Merges the result set prior to sending it to the calling process
Sun Microsystems, Inc. today announced that CASPUR, a non-profit, consortium of Italian universities focused on scientific supercomputing and innovative technologies, has subscribed to Sun's MySQL Enterprise™ database service.
E.ON ES, a subsidiary within E.ON Group, today announced that it has purchased a multi-year MySQL Enterprise™ database subscription from Sun Microsystems to help power its advanced work order management system, used for the service and maintenance of Sweden's electricity distribution grid.
I am not a fan of upgrading mySQL unless I need to. I am of the mind if it is not broke don't fix it, but when I do upgrade I follow these general steps.
If I have run into a mySQL bug, I look to see if that bug is fixed by searching the mySQL bug database.
If I've notice a performance bottleneck, I look to see if the performance bottleneck has been fixed by searching the same database.
I will NOT upgrade to the latest and greatest version of mySQL (5.4) I stay within my branch (5.0).
These are my three general motivations that drive my upgrade decisions. Anytime I upgrade I also make a list of things that might affect my environment for the stuff I use.
Ignore all the NDB changes... I don't use it and that's the majority of fixes. This is also, why I do not use it.
List the changes that will affect the production environment
Deploy the version that I picked on a few servers running my original config
Do data corruption tests (make sure my checksum scripts return the same data)
Verify that the problem I'm trying to fix is fixed
Deploy to more boxes
Let the new server bake for a period of no less than a week
Deploy everyplace
So now, I'm upgrading from 5.0.56 to 5.0.86. What I'm trying to fix is mysql memory overhead at high levels of ram.
For instance, I have a slew of 48GB boxes. I set the bufferpool to 40GB; the OS uses 1 GB of memory (roughly) leaving an overhead of 7GB for the system cache and various spikes of sort buffers. Over a period, I see that mySQL will consume and hold onto 47GB of memory for an unknown reason even with some tight my.cnf settings. (I'm certain they are tight since I know what each buffer does). Therefore, testing some later versions of mySQL we found that these later versions do not grow past the settings defined yet performs the same.
Next, since I decided that upgrading is a good solution, now it’s time to list all the changes that fixes things.
Therefore, overall, upgrading should give me a boost in performance. My own internal testing sees some tighter memory usage, even though this is not fixed explicitly, the product has matured overall so I can account for the reduction in memory to that.
As announced previously backup facebook inbox functionality is now available on Fosimo. It took some time until the app was whitelisted on Facebook in terms of reading users inbox permission but now it should be operational. If you already have Fosimo 1.0.8 it should work.
Download here. Hurry up because maybe it is the first app on Facebook which is capable of doing this. For now your inbox is saved in XML format.
What is Nagios? Nagios IMHO is the best Open Source monitoring system out there. It supports hosts checks, a level to determine on a host level if a box is considered "up". It supports service check, a level to determine if a particular service such as mySQL is up. It has features to log all events to a flat file or to a DB. It can notify you when a service is in a warning state, error state or unknown state.
For the purpose of this article, I am going to talk about handling events such as a clearing up swap.
First, let us look at some configuration of Nagios. We are going to define a command, then service acting on that command. Let us assume that the nagios install is in /usr/local/nagios.
Therefore, in /usr/local/nagios/ a few configuration files are key: - /usr/local/nagios/etc/objects/commands.cfg - the command file where the checks are defined - /usr/local/nagios/etc/hosts/*/hosts.cfg - the services file where the checks are defined for execution based on other directives in this file.
This says that check_local_swap executes check_swap with a warning threshold of $ARG1 and a critical threshold or $ARG2
Next when defining a service for a host
define service{ use generic-service; Name of service template to use host_name dbfacebook34b ; hostname service_description SYS:Swap ; what shows up in alerts is_volatile 0 check_period 24x7 ; threshold when to check (all the time) max_check_attempts 4 ; threshold to check before marking state event_handler handle-swap ; handle an event (another command) normal_check_interval 5 ; in seconds retry_check_interval 1 ; only try once before reporting the state contact_groups itops ; contact group to send notifications to notification_options w,u,c,r ; need to look this up for all defs notification_interval 600 ; retry sending notifs every 8 mins notification_period 24x7 ; keep sending them check_command check_nrpe!check_local_swap!80%!55% ; execute the event handler and warn like hell }
Lots of goodies as you can see. Let us look at the event handler
This means execute this script whenever any event for swap occurs (I decided to make this simple and not put a threshold on this).
What does handle_swap.pl do - well it’s a perl script that looks at free memory and if only a few 100K of swap is in use, swapoff -a; swapon -a;
In this case, it is a bit safe to do this. Why do this? Why not just turn of swap. I have talked in depth about this subject-but for a minor recap. Linux needs swap else, kswapd will freak out. Swap in DB's is bad so I clean it up automatically since O_DIRECT on my SAN is not an option.
Why not just run a cron job? Nagios keeps a log, I like to review what is happening from a central location, and nagios is freaking COOL.
Sun Microsystems, Inc. today announced an update to its MySQL Enterprise™ subscription, designed to assist corporate users improve the speed and uptime of their MySQL™ database applications. Thirty-day trial subscriptions of the new MySQL Enterprise Fall 2009 Release are now available free-of-charge from http://www.mysql.com/trials.
Fosimo 1.0.8 is available. Fosimo is a desktop app for Facebook which helps you control your Facebook account without browsing through Facebook.
Facebook has rolled out several important features recently including new inbox functionality today and new authentication scheme for desktop applications a couple of weeks ago so there are new features and adjustments.
New in Fosimo 1.0.8
Improved permissions configuration and authentication (supports recently introoduced "Publish Stream" and "Read Inbox" permission).
Improved photo upload functionality.
Improved set status (removed "is" verb). This is an old bug on Fosimo but finally fixed.
Many fixes and enhancements.
Backup facebook inbox: now you can backup your inbox on facebook to your hard drive with 1 click. Please note: Facebook is rolling out new inbox functionality now so it may take a couple of days until this feature is fully available for everyone.
I didn't try Windows 7 yet. But seems that Windows 7 inherited user access control feature from Windows Vista as is. It maybe useful info for people running MSIs on Windows 7.
I learned it because lifehacker linked yesterday to my 2-year-old post about MSI Error 2869 on Windows Vista where I explained the error and how to deal with it (lifehacker had the same MSI problem on Windows 7). In short — the error is caused by user access control feature and you need to run MSI from command line as administrator as a workaround (or patch the MSI). I was a bit surprised yesterday while reading that 2-year-old post since it seems very techy and it took me a while to fully understand it :). Good coverage after all...
For the first time ever I took a two-week vacation. How can a person who has 100s of database servers, 100s of web servers, and a system that supports 100K tps across 20 TB of data is able to take a vacation? Easy, I have a great team that is very competent in managing the platform by following our cookbook routines and guidelines for new application interaction with the databases.
Where did I go? I went to NYC-stayed in Tribeca and only ate Pizza, Hotdogs, White Castle and Hala Food from vendors that are nearly on every street corner. I also lost 6 pounds! After three months of half-ass dieting to prepare for my Pizza diet, who would have known that just going for it was the key to success! How did I do it? Well, first I walked to Little Italy, bought 2-4 slices, ate them, and then walked around the city. That is it. As a Database Dude, I do not do much walking except for when I am too tired to run on the treadmill for 20 mins. I was averaging around 10 miles a day for a 12-hour period walking around NYC taking pictures, enjoying the sites, eating, drinking, and then repeat. I even walked from Canal Street all the way to the Staten Island Ferry just because. Man it was great. After NYC, I went to Puerto Rico for the second time. My family and I had a freaking awesome time. We went to the only US National Park that is a Rain Forest. Went to Vieques to the Bio Luminous Bay, where the single cell organisms light up when kinetic energy is applied to them. Went to Culebra, to Flamenco Beach – which is off the hook: much better than any other beach I have been to. Drove around the entire Main Puerto Rican Island stopping off in Rincon and eating every 3-4 hours. I am fully rested and feel like a 20 year old again. So, expect a lot more posts. I think the order will be:
Migrate 20 TB of data into a new Format without downtime. How to make Friend Queries work with a database. Throughput increase from Asynchronous Queries.
Sun Microsystems, Inc. today announced that TweetMeme, the innovative online application that tracks the popularity of links sent via Twitter, has subscribed to Sun's MySQL Enterprise database offering in order to keep up with the microblogging service's explosive growth.
Created in 2008 by pioneering Web 2.0 startup Fav.or.it, TweetMeme gives the rapidly expanding Twitter community a means of easily seeing and sorting the most popular links on Twitter. TweetMeme also provides a realtime search facility of fully expanded, qualified and indexed links so that users can get answers to queries about very recent events or news.
So, when you connect on the fly to a database your subject to a variety of issues, like when the db is not available and when the db does not have a route.
One of the main reasons why a dev may want to connect on the fly is because they have too many front ends to hold a persistent connection on the backends. Since mySQL does not use libevent, holding open threads to mySQL is much more costly. Threads == Memory.
But, that's here nor there. The main purpose of this post is to talk about how to recover from failed connections that block apache threads.
Common Failures: No route to Host Flapping NIC Locked Tables Recovering from a Crash more of the same.
My Environment: I have a bunch of webservers (200+) that all have 300 possible threads (60000 possible connections to a single DB) behind a load balancer that uses the LB least connections protocol to distribute load across all 200+ webservers.
Since PHP is used in a stateless mode (no guarentee that the same user will hit the same server), I have to have some way of telling all other apache procs for said box that a server is down. I use stateless on the fly connections, so each apache proc will test the connection.
Here in lies my most common problem. If a box in the backend dies, all apache threads will block for a predefined time-out.
In /etc/php.ini (it could be in /etc/php.d/mysql.ini) I set
mysql.connect_timeout = 5 // the default is 60
In my common_db class when connecting to a database, the connect routine returns a database handle object
if (PEAR::isError($dbh) && $delta_to_connect >= 5){
Now PRIOR to calling the database connect code I check to see if the IP is up.
$status = apc_fetch($ip); if ($status[0] === 'DEAD'){ return false; }
... do connect ...
But, there could be a variety of issues that can call false positives, like network flaps, someone blocking the db for some time etc. So, I'll allow one request .1% of the time bypass the status check to try again.
But, if your app can't connect to the DB aren't you down anyway?
Yes, although I don't have to restart all the httpds. Also each DB has a redundant pair to when returning false, the app code will try the other set of servers.
I have a variety of methods to deal with these scenarios but this is the quickest to implement.
Infoworld has an article "The fight over open source 'leeches', about the failure of enterprises to contribute the changes they make to open source software back upstream to the project.
"When it comes to open source communities, individuals are much better citizens than institutions. The enlightened self-interest that causes individuals to send back bug fixes, contribute ideas for new features, and write documentation is much harder to find in institutions," Dan Woods, CTO of Evolved Media, wrote in Forbes earlier this year.
I have a very old article in my german language blog about Open Source Software and companies, and the infoworld article triggered me now to translate it:
Many companies do use open source, and that is a good thing. Most of that software is well suited to the task, and usually it is much easier to understand, more flexible and also easier to debug than their closed source equivalents. But open source requires that enterprises adjust their process with regard to changes and deployment of software in order to be successful open source users.
The fact that you can change open source software lures many companies into making changes to the software. But making the changes is just a small part of the work, and the most inexpensive one.
There are two PRIMARY directions to sort data in SQL: Ascending (ASC) and Descending DESC. When these two sort definitions are put together in a single statement a filesort is produced.
Why do we want to avoid filesorts?
Filesorts are bad. 1st they tickle a thread based buffer called sort_buffer_size. Additionally filesorts reads the data twice, unless max_length_for_sort_data limit is reached and as a result the Filesort runs slower to reduce disk I/O. If you want filesorts to run faster at the expense of the disk increase the default max_length_for_sort_data. You can read the filesort algorithm here.
So, here is an example
CREATE TABLE `ABCD` ( `A` int(10) unsigned NOT NULL default '0', `B` int(10) unsigned NOT NULL default '0', `C` int(10) unsigned NOT NULL default '0', `D` int(10) unsigned NOT NULL default '0', PRIMARY KEY (`a`,`b`,`c`,`d`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1
mysql> explain SELECT * FROM ABCD WHERE a=1 AND b=1 ORDER BY c DESC, d ASC\G
*************************** 1. row *************************** id: 1 select_type: SIMPLE table: ABCD type: ref possible_keys: PRIMARY key: PRIMARY key_len: 8 ref: const,const rows: 2 Extra: Using where; Using index; Using filesort 1 row in set (0.00 sec)
Notice the filesort? So how does one get around this filesort?
Well
Let's define some roles for columns C and D. C is the parent while D is the child.
We want all the latest parents (C) We want all the oldest children (D)
We require pagination of all the PARENTS (show 10 parents per page) so Queries like this is PRODUCED
SELECT * FROM ABCD WHERE A=? AND B=? ORDER BY C DESC explain SELECT * FROM ABCD WHERE a=1 AND b=1 ORDER BY c DESC LIMIT 10\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: ABCD type: ref possible_keys: PRIMARY key: PRIMARY key_len: 8 ref: const,const rows: 2 Extra: Using where; Using index 1 row in set (0.00 sec)
Now
FOREACH($C_parent as $i => $c_id) {
$C_parent[$i] = SELECT SQL_CALC_FOUND_ROWS * FROM ABCD WHERE A=? AND B=? AND C=$c_id ORDER BY D ASC LIMIT 1;
}
So, we changed 1 query into 11 queries (10 parents per page) to make the page load happen faster, by getting rid of the filesort.
What 11 queries is faster then 1? Yes, for this case it is. The reason is because filesorts are SLOOOOW, they chew up a lot of limited resources and they should be avoided. I've see filesorts take close to 50-60% of the query time.
This is the translation of an article from my german language blog. It is not a literal translation, but has been amended and changed a bit to take more recent information into account.
It started out as a discussion within the german language MySQL group in USENET. There the eternal question came up why phpMyAdmin gets no love at all from the helpers and regulars in that group. My answer was:
phpMyAdmin (PMA) like many other GUI tools for MySQL has a number of limitations. For a web tool such as PMA these come from its operating principles and can hardly be changed. But let's start at the beginning:
In MYSQL the connection is a special context or scope for many things. At least the following things are part of the connection scope:
Transactions. A disconnect implies a ROLLBACK.
Locks. Transactions generate locks with writing statements or SELECT for UPDATE. Table locks are generated by LOCK TABLES. Disconnect releases the locks.
The number returned by LAST_INSERT_ID() is cached within the connection context. It is unavailable after a disconnect.
Tables created by CREATE TEMPORARY TABLE. They are deleted on disconnect.
Prepared Statements and the parsed bytecode of stored routines are kept per connection, even if that is the wrong solution from a performance POV.
@-variables. SET @bla = 10 or SELECT @bla := count(*) FROM cookie; are defining variables within the context of a connection. They are lost on disconnect.
SESSION-parameters. SET SESSION mysiam_sort_buffer_size = 1024*1024*64 or SET @@session.myisam_sort_buffer_size = 1024*1024*64 are setting configuration parameters with session scope. They are lost on disconnect.
Replication specific SET-commands such as SET TIMESTAMP or SET LAST_INSERT_ID can affect the behavior of functions such as SELECT NOW() or SELECT LAST_INSERT_ID().
I am calling all of that connection scoped state.
A client that is disconnecting in an uncontrolled way or that does not report a disconnect properly is defective in the sense that all functionality dependent on connection scoped state is unavailable. The opposite case also exists - a client that reuses existing connections may have connections that have an unclean state, which may or may not be a security risk.
Now Sun has been bought, and not by IBM or Cisco, but by Oracle. In the aftermath everbody is asking themselves - what happens to MySQL?
Well, firstly MySQL and InnoDB are now part of the same company. Oracle has been maintaining InnoDB pretty well in the past, and that can only improve.
Will Oracle let MySQL die and try to push their on products into the market? Hardly so. Why should Oracle do that, and if so, using what products?
There are a finite set of cases where mySQL clients will hang on a connection-for a small period of time (seconds) or indefinitely. Most notably no-route to host causes a timeout to occur which in most clients are on the order of seconds to recovery.
In a web environment if a database is connected to on the fly, each connection made should take milliseconds-when the net is healthy. However, when a database server crashes a timeout for each connection takes seconds and there are cases where connections can just hang (recovery of INNODB as an example). For instance have you ever had a ssh session hang and a kill (SIGHUP..) does not work? This happens for mySQL client connections too.
I want the front ends to recover gracefully from a no-route to host, or more notably a hanged connection condition. I want to avoid that damn timeout all together on stateless connections. Why? Because if you use up all your worker httpd threads requests Fail-things crash, bosses get mad, everyone wakes up and you have a meeting about it later. So what are ways around this?
Use a Load balancer
This is cool and all, but is expensive since two or three are needed for HA purposes. In addition, this increases the complexity of managing servers, and most LBs are optimized for HTTPD traffic, not raw TCP traffic.
Use a memcache layer to keep DB state
This is cool as well, but the state server, which collects the state of your servers and updates memcache, is now a single point of failure-and a lot of time is spent to make this resilient. Not to mention all clients across Programming languages (bash scripts, Java, PHP, Python, etc) need to have access to this memcache layer and logically handle the connection the same. This is doable, but there is an easier way.
Use IRON DNS:
IRON DNS is a term I use to build an HA Resolver. IF a box fails a health check, a nagios event handler can tell IRON DNS to set the domain names IP address to 127.0.0.1 or a routable interface that produces a Connection Refused (111 instead of 110).
For an internal network, all database entries should be able to fit in memory. All resolves should take less then 2ms, and updates to DNS entries can happen in just a few seconds across your entire farm. The only drawback is if the DNS server fails to respond your site is boned. Making DNS Failure proof is rather easy-which is another post entirely.
For my solution, I am using DNS. This allows me to recover servers in a shard on different boxes without having to change code. This also allows my environment to recover from blocked I/O events that can spike load on the front-end that make requests slow.
Here are some conditions where connections take to long:
No Route to Host
INNODB recovery
Disk Fails (disk fills up)
Switch Fails / Flaps
Plug-in storage engine crashes, yet mySQL is up
DNS Resolve fails
If you have a better method, or want to advise me on flaws I should watch out for, drop a comment.
Memcache is a fantastic Hash table-very fast and one of the great successes of Brad Fitzpatrick-who in my opinion has done more for the open social movement as an individual then anyone else. I use memcache quite extensively, now I am thinking about adding ZLIB native to compress the value of each key-much like how INNODB does with the Barracuda file format. The theory is with a CPU hit, we can store more data per memcache instance. I've talked to the Northscale guys and they love the idea. What do you think?
for PHP for instance compresses the data on the client side.
There still might be some value compressing the data on the server-side, but now I'm not as motivated.
What might be a good alternative is to compress keys into 8 byte longs in memcached automatically instead of the actual string that can be huge. To give some more detail, 8 byte longs is a 64-bit int. A string can easily be converted into a big int by bit manipulation - and the address space is huge so key conflict is effectively removed.
Cloud computing is the big thing now days, weather you are an app developer using EC2 or the Google APP engine, or a new company trying to build your own cloud product. If you are hosting, or using a platform it costs money. I hate to spend money, especially money that is for my company. If I find an idle server, I use it to 100% utilization (prior to the saturation point).
I needed to build a new application that periodically crawls a website to update various lists. Building a crawler is expensive, especially from scratch. First, you have to define the amount of lag that is allowed from the crawl copy and the real copy. Of course the Project Manager does not want any lag, all events must be caught and near real-time without overloading the source of the data-but I am not hating, it is a challenge. Next, what technology to use, what language to write the app in-what considerations are left to be defined? How does one crawl Gigs, Tera, and amount of data in a guarantee period? On top of that, how much additional hardware is this going to cost. In addition, to be a cloud it needs to have an API so app developers can set, and get consistent data in an expected period. This is a lot of freaking requirements.
Therefore, to solve this issue, I know that mySQL will store the data, but getting the data is the hard part. This is what is going to cost money, lots of it. I looked around common architectures and found that nothing would do what I wanted to do in a cost effective manner. So, I designed my own using Seti @ Home as the basis for the design.
Get to the point already Dathan:
I have turned ever user who views my applications into a collector, using idle bandwidth without knowing who is collecting the data. My user base on spare cycles will fetch a feed of my choosing, and sends that data to my servers without any personal information. Instead of using an Amazon or Google service, I have turned my user base into a cloud to service their needs.
It is rather awesome-I must say. I am able to service the needs of more than 60 million users at the cost of development time, and NO NEW HARDWARE. The cloud does not have to be a service provider-it can be the end user as long as the end user is not impacted by the requests. BTW the team that I manage is freaking awesome-they built my vision with trial and error and a hand waved spec.
Currently the system scales as long as there is enough end users. If I lose all my users then well I am boned, but to support the feeds all I need is 100K nodes at the current rate. With 60 million end nodes, I am cool.
Imagine if Google with Adsense used this install base to tell Google if the data has changed for an arbitrary web address. All it needs is a few people to hit the same url, inform Google that the web address in question has a different checksum, then at that point Google’s crawlers can go fetch it, instead of constantly crawling data that doesn't change. Google would be able to reduce overall server cost significantly, if it just knew what data has changed instead of guessing what data has changed.
By next years Velocity conferance I hope to have a full disclosure on what technology my team used, how my team get around cross domain issues, and how to compute checksums to validate the data.
PS - I designed this, with my team we made it much better and one person implemented it and owns the product from this point on.
Where I work, Merlin is an important tool for us and provides a lot of insight that other, more generic monitoring tools do not provide. We love it, and in fact love it such much that we have about 140 database agents reporting into Merlin 2.0 from about 120 different machines. That results in a data influx of about 1.2G a day without using QUAN, and in a data influx of about 6G a day using QUAN on a set of selected machines.
It completely overwhelms the Merlin data purge process, so the merlin database grows out of bounds, which is quite unfortunate because our disk space is in fact very bounded.
The immediate answer to our purge problem was to disable the merlin internal purge and with the kind help of MySQL support to create a script which generates a list of record ids to delete. These ids end up in a number of delete statements with very large WHERE ... IN (...) clauses that do the actual delete.
I will be attending Community One tomorrow and on Thursday at Marriott Marquis Hotel, New York, NY. I am especially looking forward to the announcements tomorrow which sound very interesting :)
The first day is a free event featuring:
* Cloud Platforms – Development and deployment in the cloud. * Social and Collaborative Platforms – Social networks and Web 2.0 trends. * RIAs and Scripting – Rich Internet Applications, scripting and tools. * Web Platforms – Dynamic languages, databases, and Web servers. * Server-side Platforms – SOA, tools, application servers, and databases. * Mobile Development – Mobile platforms, devices, tools and application development. * Operating Systems and Infrastructure – Operating systems and virtualization. * Free and Open – Open-source projects, business models, and trends.
The second day of the event is focused on Deep Dives with two half-day sessions on MySQL and two full-day sessions on Java and Web development. I will be attending the session, "Using Java EE and SOA to Architect and Design Robust Enterprise Applications."
Following the conference, I will be a panelist at a Cloud Computing Seminar at Microsoft office in NY. It's going to be a long but exciting day!
It will be great to catch up with old and new friends at the event.
Walking a table means, traversing each row, commonly used in building queues, fixing data, or dumping a table. I've recently ran into a problem-caused by an assumption, where walking a table was taking way to long using the method
$pos = 0; do {
$result = SELECT col FROM TABLE LIMIT $pos, 1000; $pos += 1000; } while ($result);
The assumption was since INNODB uses a cluster index, this would traverse the table using the PRIMARY key. This is not the case, its not a problem in INNODB but a bad assumption, that I fell victim to. A table scan to each $pos occurs producing a Big-O of N^2. So, when the query:
SELECT col FROM TABLE LIMIT 1000000, 1000 is executed mySQL will scan all the rows up to row position 1001000 and for each subsequent iteration.
This is SLOOOOW. IMHO since the table is sorted by the primary key, mySQL should optimize this case - but it does not and will not. So, to walk an INNODB table fast, and keep liner time or a Big-O of N an alternative is
$last_id = 0 do {
$result = SELECT col FROM TABLE USE INDEX(PRIMARY) WHERE pkey_part > $last_id LIMIT 1000 $last_id = $result[count($result) - 1]->pkey_part
}while($result);
This dumps a table very fast, almost as fast as doing a count(*) on the PRIMARY KEY.
Another method is to
SELECT col INTO OUTFILE "/dir/file.ids" FROM TABLE;
but the data is local to the database - thus the need for the application to grab data. Another draw back of this method is that the dump produces more disk IO then walking a table off of a key, slowing down access to this table.
In conclusion, even if the storage engine keeps the table order consistent like INNODB does, do not assume that LIMIT 100000, 1000 is equivalent to a file seek of position 100000, without telling the Optimizer to use an index.
Tomorrow, I'll be attending the Executive Seminar on Cloud Computing at NASDAQ MarketSite (NY). Speakers include Dr. Werner Vogels and Mårten Mickos (ex-CEO of MySQL). Big thanks to Amazon and RightScale who were able to accommmodate my RSVP even when the registration had formally closed.
I hope to be able to catch up with Mårten Mickos during the event. In case I do succeed in catching up, is there any question you want me to ask him? You can email me or post a comment.
It's funny that the event site still shows Mårten's title as "SVP of Sun Microsystems’ Database Group."
Came across an interesting post by Bret (co-founder of FriendFeed) about how FriendFeed uses MySQL to store "schema-less" data. According to the post, they weren't having issues with scaling existing features but rather they were experiencing pain when trying to add features.
Now the way they are using MySQL is interesting and bizarre at the same time. At a very high level, it seems their approach is to use a RDBMS as if it is a column-oriented database. Of course, it makes me wonder why not just use a column-oriented database? I need to read the post again in the morning (too tired right now so just gave it a quick glance).
I am very interested in hearing thoughts from my peers at Planet MySQL regarding this approach. They seem to have gone great lengths to go this route. What issues and benefits you see of this approach and whether you ever see yourself taking this route? I, for one, am not entirely convinced of this approach and whether it can really scale down the road. Also, if it was someone other than Friend feed going down that route, I might have actually lost my tempered and yelled :)
Side note: Friendfeed is growing fast, and it would have been cool if Bret was speaking at one of the three upcoming MySQL events in April.
I'll be going into detail what is Sharding, how to Shard, pitfalls of Sharding, performance/throughput gains, shard roles, and performance scaling in general. I hope to make this the most comprehensive talk to date on the subject in 45 min.
The topic is called Scaling a Widget Company. I'll detail how I setup the data layer for Rockyou. How many transactions per second Rockyou is at, what the infrastructure is comprised of, how 99.999% uptime is achieved and hopefully get into BCP which I probably will not have time to go over.
If you want me to focus on specific aspects on the subject of shard'ing let me know and I will :).
If you have a slave, that slave is probably running with the read-only flag set in the mysqld-section of your my.cnf. To be able to write to a read-only slave you need to be the replication SQL_THREAD or have SUPER privilege.
Since 5.0.16, it is still possible to execute CREATE TEMPORARY TABLE on a read-only slave, so CREATE TEMPORARY TABLE privilege also allows you to write to a read-only slave in a limited and controlled way.
If you want to process a lot of data in a temporary table, you are probably creating the temporary table without any indices, then INSERT ... SELECT data into it, and then ALTER TABLE ... ADD INDEX afterwards, because that is usually faster than to insert data into a table with indices. Only that you cannot ALTER TABLE a temporary table, even on a server that is not read-only - in order to run ALTER TABLE on any table, even temporary onces, you need ALTER TABLE privilege which you might not want to give out lightly.
There is no reason at all to check ALTER TABLE privilege for an alter table operation on a temporary table, because that table is visible only to your connection and cannot be shared. It is also deleted when you disconnect. In fact there is no reason at all to check permissions for temporary tables. But it is done.
Because it is done, you can grant ALTER TABLE to a single table that does exist or even to a table that does not yet exist. If you grant ALTER TABLE privilege to an existing table on a read-only server, you cannot alter that table, because the server is read-only. If you then use CREATE TEMPORARY TABLE to create a table with that name, the temporary table will shadow the existing persistent table and for your connection the persistent table will become inaccessible.
The semantics of the GRANT will change, though, and will now apply to the temporary table, which is writeable on a read-only server because it is temporary, and is alterable because of the grant which was not meant to apply to it in the first place. Problem solved: I now can ALTER TABLE my temporary table on the read-only server after I have finished my data load.
All is well? Not!
There are multiple things at work here which I consider broken:
GRANTs are applied to temporary tables. This is not making any sense at all in my book. Temporary tables are connection-local objects and they cannot have grants applied to them which were always referring to persistent objects when they were made.
Temporary tables can shadow persistent tables in the namespace of a connection. Because GRANTS are tied to objects via the objects name and not an objects UUID or another form of truly unique object-identifier, GRANTS can refer to changing objects even when the grant does not change. This feels somehow broken, as in "not properly normalized". Does RENAME TABLE edit grant tables as well? I have to check!
By granting CREATE TEMPORARY TABLE privilege to a user I am allowing that user to shadow any other object within a schema. The temporary table will then pick up any rights granted to the shadowed object for the duration of its lifetime. This cannot be good.
So the good news is that the inaugural OpenSQL Camp is going to be an awesome event with mouth watering sessions by noted experts. The bad news (for me) is that I won't be attending it, which makes me sad. I cannot leave my town because my wife can go into labor anytime now.
The session list looks great! Congratulations and thanks to Baron, Sheeri, Ronald, all the sponsors and contributors for organizing the first OpenSQL Camp.
I will be watching PlanetMySQL closely for juicy blog posts. Hopefully, the one and only Sheeri is taking her camcorder!
Today I discovered Stack Overflow, a collaborative site that focuses on technical Questions. You can ask questions related to any language, apparently without having to register. The site is currently in beta. There are also a few MySQL questions that are currently unanswered.
I am currently tasked with writing Software Requirements Specification (SRS) document for a project. Effective sharding (based on specific criterion) and Scalability are key requirements of the project.
Scalability is traditionally classified as a non-functional requirement. My question to the community is that if scalability is crucial to a project, would it still be classified as a non-functional requirement? Are their cases when scalability requirements would be best classified as functional requirements?
Tomorrow morning I am presenting a session, Startup Scalability Strategies, at Startonomics, a conference being organized by Dave McClure and Deal Maker Media. The sessions will be streamed live using UStream. Check the Startonomics website at http://startonomics.com for more details.
I attended several events in June of this year including Graphing Social Patterns East, Velocity and Structure 08. At each of these events, I tried to take some notes and posted them to my personal blog. I received a few pings from readers of this blog to point them to a list of these posts. It took some time but here is the list of my notes. In some cases, I have linked directly to the presentation files.
Working the Clouds: NextGen Infrastructure for New Entrepreneurs: This panel on cloud computing featured panelists including Geva Perry (GigaSpaces), Jason Hoffman (Joyent), Tony Lucas (XCalibre), Lew Moorman (Rackspace), Christophe Bisciglia (Google / AppEngine) and Joe Weinman (AT&T). Christophe got grilled heavily by other panelists but he handled it pretty darn well.
Werner Vogels: Keynote at Structure 08: Dr. Vogels keynote was one of the highlights of Structure 08. He presented case study of Animoto and talked about the 70/30 switch among other things.
The Platform Revolution: A Look into disruptive technologies: Jonathan Yarmis AMR Research (VP of Disruptive Technologies) talked about technology trends, social networking, mobility, mobile , cloud computing, stream computing, business models, user 2.0 and the new enterprise reality.
Green Data Centers: Bill Coleman (Cassatt Corporation) presented this session. Bill is known for being responsible for the B in BEA. (Presentation slides)
Creating Bebo Applications: Bebo is now part of OpenSocial and this presentation presented at Graphing Social Patterns talks about how to create applications for Bebo.
Open Social and Google App Engine: Patrick Chanezon (API Evangelist) and Paul McDonald (Product Manager for Google App Engine) presented a technical overview of OpenSocial and Google App Engine at Graphing Social Patterns East.
OpenSocial: Open for Business: In this session, panelists were Patrick Chanezon (Google), Paul Lindner (hi5), Max Newbould (MySpace) and Sachin Rekhi (imeem). (Also see)
Mobile Social Networks: A Comparison: Benjamin Joffe's excellent eye opening session for anyone interested in using mobile platform for creating social networking solutions.
Top 5 Things that fail and win on social networks: Dave McClure, chair of Graphing Social Patterns, presented this concise but every effective presentation on what fails and what wins on social networks.
Widget Strategies & Social Platforms: Hooman Radfar, CEO of Clearspring Technologies discussed the new role of widgets and how to go about creating them.
Scaling MySQL - powered Web Sites by Sharding and Replication: Slides from Peter Zaitsev's session at Velocity. (Presentation slides)
Capacity Management: John Allspaw's signature presentation on capacity management. John also has a book coming out on this topic. (Presentation slides)
LinkedIn Communication Architecture: Slides about LinkedIn's platform built in Java. (Presentation slides)
SOX Compliance: A presentation by Skye Rogers. I missed this presentation but then caught up with Skye at dinner. I wish Skye would have received more time to discuss SOX Compliance. (Presentation slides)
There were several sessions I didn't get to go to which is a sad thing. You may want to check the conference websites directly (linked at top of this post) to see if there are presentation slides available. Also, if you took notes at these sessions, please feel free to drop the links as comments to this post.
“Funny how Amazon doesn't use S3 to store any assets for amazon.com”tweet by @gruber
Amazon's S3 suffered a major outage today knocking many websites offline. S3 outage started at approximately 12:00 PM EST and the last time I checked at 11:11PM EST, Smugmug, a popular photo hosting site that extensively uses S3, was still down.
Currently, there are several great opportunities with exciting companies available in the New York area. If you're a rock star Java/PHP/Ruby developer or a pixel-obsessed designer, contact me at your earliest convenience.
Web Developer:
Give Real is a well-funded startup in the midst of an exciting period of growth and success. Our technology uses a patent pending platform that combines the ubiquity of credit card transactions and the power of social networks to create a new gifting experience.
Our primary platform is Rails, but there are programming challenges that range from SOAP APIs to Facebook application development. We are searching for full-time developers with expertise and broad experience in:
* Ruby on Rails (we also use rSpec, Starling, Memcache) * MySQL * xHTML & CSS, and comfort with Javascript * Team development with tools like Git & Trac
In addition, we are also interested in candidates who have:
* Expert Javascript skills * Java & SOAP experience * Experience scaling with Rails, or any other web platform * Comprehensive Linux knowledge * UI and graphic design backgrounds
We are willing to pay top-notch developers very competitively (plus the possibility of options) to join our team and help write code that will be used by hundreds of thousands of users within a few months. We are ideally located in downtown Manhattan less than a minute walk from the BDFV and NRQW lines
Also, if you know someone who may be a good fit for us (developer or graphic designer), we are offering a $1000 referral reward for anyone we hire.
Please contact us at jobs@givereal.com
Graphic Design:
Give Real is a well-funded startup in the midst of an exciting period of growth and success. Our technology uses a patent pending platform that combines the ubiquity of credit card transactions and the power of social networks to create a new gifting experience.
We're searching for full-time designers with experience in:
* Design for advertisements * Design for consumer focused websites & applications * xHTML & CSS coding * HTML & design for emails * Working on top of an MVC or template system (we use Rails)
In addition, we are also interested in candidates who have:
* Team development with tools like Git & Trac * Comfort with Javascript programming * Rails programming experience
We are willing to pay top-notch developers very competitively (plus the possibility of options) to join our team and help design the look and feel of a service that will be used by hundreds of thousands of users with a few months.
Also, if you know someone who may be a good fit for us (RoR developer or graphic designer), we are offering a $1000 referral reward for anyone we hire.
Please help save Ivan, son of Andrii Nikitin (MySQL Support Engineer), who needs a bone marrow transplant. Andrii's message is below:
"My family got bad news - doctors said allogenic bone marrow transplantation is the only chance for my son Ivan.
"8 months of heavy and expensive immune suppression brought some positive results so we hoped that recovering is just question of time.
"Ivan is very brave boy - not every human meets so much suffering during whole life, like Ivan already met in his 2,5 years. But long road is still in front of us to get full recover - we are ready to come it through.
"Ukrainian clinics have no technical possibility to do such complex operation, so we need 150-250K EUR for Israel or European or US clinic. The final decision will be made considering amount we able to find. Perhaps my family is able to get ~60% of that by selling the flat where parents leave and some other goods, but we still require external help."
-- Andrii Nikitin, MySQL Engineer
Please remember, every little bit will help the family pay for Ivan's operation! Be as generous as you can.
Today at 1PM EST I am presenting the second part of memcached for MySQL webinar. I was told that the registration numbers look as good as the previous one. This one will be a bit more technical than the previous webinar. Sorry for the late notice but hope you can join!
In just a few minutes, I will be leaving for Graphing Social Patterns East, a conference by Oreilly. Dave McClure of 500 Hats is the conference chair. I plan to meet old friends and make new ones. It should be a lot of fun. More about Graphing Social Patterns.
Ever wish you could have a browser based shell for Google? One that was clutter and advertising free? Say hello to Goosh, one of the coolest service to hit the web.
It even recognizes 'clear' :) For now, I am addicted to it.
Disaster is really inevitable. Even with all the redundant power investments, ThePlanet (formerly EV1 and RackShack), had to shut down their backup generators at their H1 data center on the instructions of the fire crew. This happened after a wire-short in fault transformer led to an explosion that knocked off one of their walls, ultimately bringing 9,000 servers down. Luckily no one was injured.
This just goes on to show that just because a data center has redundant power and backup generators, it does not mean that a disaster cannot happen. IIRC, ThePlanet's last disaster was blamed on backup generators not kicking off properly.
While there was no damage to servers, I wonder how many MyISAM repairs need to be triggered once the servers do come back online?
Michael Arrington of TechCrunch asks Twitter a few questions. I have only included a sample list below but you should read his blog post for all the questions:
Is it true that you only have a single master MySQL server running replication to two slaves, and the architecture doesn’t auto-switch to a hot backup when the master goes down?
Do you really have a grand total of three physical database machines that are POWERING ALL OF TWITTER?
Is it true that the only way you can keep Twitter alive is to have somebody sit there and watch it constantly, and then manually switch databases over and re-build when one of the slaves fail?
A 'yes' answer to any of these questions by Twitter would be disturbing to say the least. However, it won't be surprising as companies expect databases to just somehow magically work without creating and supporting a proper architecture. High availability doesn't comes cheap and reputation for companies is everything.
I find it amusing that Twitter isn't even looking for a DBA. May be that's considered a job for the SA over there :)
A big thank you to all those who attended the memcached webinar today on which I was a panelist. I was told that there were more than 560 registrants.
The feedback I received directly and indirectly shows that there is a lot of interest about memcached. In the future, I hope to work again with MySQL/Sun on more memcached related webinars.
If you attended the webinar and have some suggestions, comments or questions, please contact me at fmashraqi at yahoo dot com or post a comment on this blog.
Special thanks to Jimmy Guerrero, Monty Taylor, Rich Taylor, Edwin DeSouza and Alex Roedling for their hard work in arranging the webinar. Also thanks to Brian Aker, Matt Ingenthron and Trond Norbye for their assistance at various phases.
Regarding my earlier post on memcached webinar, I was informed today that more than 420 registrants have signed up. Space is limited and filling up fast so if you are interested in memcached and haven't registered yet, click on the following link to register now!
Today, many large scale web 2.0 properties (including my employer) save millions of dollars by depending on memcached to bring their application response time under control and to offload pressure from databases.
There are several success stories about using memcached to speed up database driven websites. Facebook, for instance, runs the largest memcached installation and the numbers only keep increasing. In May 2007, Facebook was reportedly running 200 dedicated servers with 3TB of memory in their memcached cluster. At the "Scaling MySQL Up or Out" Keynote, Facebook revealed they are now using 805 dedicated memcached servers. That's more than a 400% increase in less than a year!
For my employer, memcached has been a crucial component of the infrastructure that has been instrumental in handling explosive growth in a cost-efficient manner. In addition, memcached has helped us offload billions of queries from our database.
To highlight several real-life use cases of memcached (see below), I will be presenting a memcached webinar on Thursday, June 29 at 1 PM EST (10 AM PST). Monty Taylor (Senior Consultant, Sun Microsystems) and Jimmy Guerrero (Sr Product Marketing Manager, Sun Microsystems - Database Group) will also be speaking at the event. Space is limited and filling up fast (200+ registrants already) so I highly recommend registering now.
In this webinar, I will be covering several use cases for memcached including (but not limited to):
deterministic cache
non-deterministic cache
proactive cache
"state" cache
filesystem cache replacement
Hope to "see" you at the webinar. Note: This memcached webinar is not to be confused with the memcached webinar being presented by Ivan Zoratti on June 28.
At the MySQL Conference and Expo, right after my participation in scaling up or scaling out keynote panel, I talked to Sun's Multimedia team about Sun and MySQL in our environment.
Recently, I found the interview on Sun's Multimedia page. The video of my discussion is embedded below:
Over the weekend I took some notes from a presentation and did some research from various sources. The result was a blog post about Internet trends that I posted on my personal blog. There are some very interesting statistics about Internet usage and social networking. Also, Facebook fans will find some interesting facts as well.
When InnoDB creates a new transaction it is not yet committed. The database has not yet made any promises to the application and so we do not really have to make anything persistent so far.
To be fast InnoDB tries to assemble the transaction in a memory buffer, the innodb_log_buffer. It should be sufficiently large that you actually can assemble such a transaction in memory without needing to write it out in part into the redo log. A size of 1M to 8M is normal.
Once a transaction is to be committed InnoDB has to read the page from disk which contains the image of the row that is being changed. It then has to actually make that change in memory. The changed page is cached in a memory pool called the innodb_buffer_pool. This pool also caches unchanged pages that have been accessed by read operations. All of these pages on disk and in memory are 16K in size and the innodb_buffer_pool_size determines how much RAM we will use as a cache for such pages - usually as much as we can spare.
InnoDB does transactions. Meaning: It collects statements working on InnoDB tables and applies them on COMMIT to all tables "at once". Either all of these statements inside one transaction succeed ("commit") or all of them fail ("rollback"), changing nothing.
By default, the database is in AUTOCOMMIT mode. Meaning: The server sees a virtual COMMIT command after each statement. You can disable autocommit completely, or you are starting an explicit transaction inside autocommit using the BEGIN statement.
MySQL offers a variety of storage engines giving you a lot of flexibility in managing your storage and data access needs. Still I encounter customers who are not using this flexibility when they should, because they lack information about the advantages of Non-MyISAM storage engines or which are using storage engines like InnoDB as if they were using MyISAM.
This is the introductory article in a series of texts that will hopefully once become an InnoDB tutorial. A german version of this article is available in my german language blog.
An InnoDB Tutorial
The InnoDB storage engine is an engine that can be operated ACID compliant, does transactions and foreign key constraints. It is useful for all applications that do online transaction processing or have a high rate of concurrent write accesses for other reasons.
PHP is a nice programming language for web applications with a large number of databases supported.
PDO is one of many database access abstractions trying to unify the way PHP talks to databases. Unlike most of these, PDO is written in C and not in PHP.
Wez Furlong wants to improve PDO "Version 2". In particular he and others want to produce an open spec that can be used by anybody including database vendors to code against, add unit testing for the spec and the drivers that claim to support it and improve metadata handling for PDO.
In order to get database vendors, notably IBM, into the game he proposes a Contributor License Agreement to make it easier for vendors to commit into the PDO part of PHP. There exist corporate and individual versions of the CLA, and the PDO license is yet another license that is different from the PHP license. Wez has prepared a FAQ covering questions regarding all that. Continue reading "PHP PDO V2 CLA"
My thinking was that the topic is already dead, but people have strange ideas off and on again. Have a look at the S9Y boards where you'll find someone who wants to have a storage backend "LDAP" for S9Y. This is sick and wrong! Let me explain why.
According to my findings in Bug #31876, MySQL does not commit data to disk in Windows using the same method MS SQL Server and DB/2 are using. The method MySQL uses appears to be seven times slower in pathological scenarios.
The bug report contains a patch - thanks to the MySQL WTF (The Windows Task Force) and the lab provided by the customer for helping me to find that.
Does this work for you? I want to hear about your test results.
Sorry for the low maintenance level on this site, personal matters have limited my available time. I plan to now come back to more active maintenance of this site, beginning with an update to the forums.
The forums now allow for registration without admin approval but have a new requirement that those registering must answer a simple question to submit their registration. This should hopefully allow new members to join without waiting for their approval, while also preventing spam registration.
When I return from the MySQL Conference I’ll be restoring the long-missing samplecode section to the site and restoring the three-parts of the Yatta tutorial.
One of the recent issues wiped out my copies of Yatta that were sent to me by you the readers. If you previously sent me Yatta, please re-send to mike@thisdomain.
During the site failure the articles and source for the Yatta series was lost, so I thought I would put a call out: do you have a local copy of the articles and source for the Yatta series? If so, please comment here or send me an email!
I didn’t have a chance to restore the full structure to vbmysqldirect’s project pages yet, but I have posted an archive of the project pages to our new Projects page.
Good news everyone! At this point most of the articles are restored. I’ll be finishing the restoration of the last two or three articles soon and moving on to a restoration of the sample code and the project pages (vbmysqldirect in particular).
Thank you for waiting so far, and if you find any errors in the restored articles please let me know.
Those of you who have been around for a while will likely know that part 4 of the tutorial had some code that became outdated when VB2005 came out of Beta. I’ve updated the code samples in the article and its zip file but lack the time right now to do a full validation due to work.
As such, I need your help. If someone out there could be so kind as to start at Part One and work through the tutorial using the current release of VB2005 to check for issues it will be much appreciated.
I’ll try to get parts 5-7 up soon and move on to the remaining article content. Then I’ll work on getting samplecode online.
Also, new content is always appreciated. If you have an article or sample you want to submit feel free to let me know and I’ll create a contributor account for you in the CMS.
I’m also looking for someone to manage the Spanish language articles and moderate the new Spanish language topic in the forums.
I’ve installed forums to http://www.vbmysql.com/forums. Because of the new hosting arrangements and the difficulty in restoring the previous forum software, I’ve changed backends. This means you’ll need to re-register and recreate any posts you were still looking for answers to.
I’ve restored two articles so far and would appreciate any feedback as to the appearance/style of the articles posted so far. Since I’m rebuilding site design as I restore articles, now is a good time to voice your opinion.