A fast query performing slow in production
I recently encountered a weird issue with MySQL query performance. A particular query would perform very badly (approximately 5 seconds) in production, but perform very well in the testing environments. The testing environments contain a recent copy of production, so one would expect similar performance, right?
I did what any somewhat seasoned engineer would do in this situation, and that is running an `EXPLAIN` query on both environments. This learned us that the query optimizer in each environment picked a different index for one particular join.
I was a bit puzzled by this unexpected result and began examining how the two MySQL servers could be configured differently, but I found no significant difference.
Luckily, Copilot came to the rescue and suggested running `ANALYZE TABLE 'table_name'` on the production MySQL database. And indeed, all of a sudden, production started picking the better-performing index as well. Problem solved, users happy again.
Outdated statistics and stale indexes
So what actually happened? MySQL's query optimizer picks the best index based on table statistics, more precisely, the key distribution. These statistics are not automatically updated. Therefore, after some time, it's possible that these statistics become too outdated and return an incorrect result. Running ANALYZE TABLE updates these statistics and thus improves how MySQL picks an index.
This could happen after introducing a new field with an index. At first, the data in the field is whatever value you made up that makes sense (null, or some other constant). After some time, new rows are added and more and more different values are added to the field. However, the key distribution values are still based on that time; the index only contained 1 or 2 different values (which MySQL interprets as: this is a very bad index, only use in case of no other options).
The data in the testing environment is regularly imported from a production data copy. The entire database is rebuilt from scratch, so it is only logical that the database statistics are more up to date in these environments.
A non-technical lesson learned
A second lesson learned here is to trust people when they say a specific action is slow. If you can’t recreate the issue in the testing environment, it is easy to dismiss it as a ‘probably busy server or something’.
Sometimes the best debugging comes from listening to users and questioning your own assumptions, not just running commands.
                Fewer developers. More progress.
Our staff engineers join your team to improve how you build, not just what you build.
			
				
		
		
		
	
	
	
	
	
Member discussion