Here's how to address why the code runs well on your development machine but is totally borked in production.
If your lovingly crafted application is running slowly after deployment, there are five common reasons it runs well on your development machine but is totally borked in production.
There are of course other reasons why your software doesn't work well in production, but these are the top reasons I've seen when developers say "it runs fine on my machine" and then discover that with great volume comes great responsibility for the concerns of scale.
Cause No. 1: You have one big thread
In some modern frameworks (like Node.js), threading is handled for you. You're supposed to make the right nonblocking I/O calls at the right time and keep your main line of code clear of heavy lifting. Failing that, you can start to starve the system of actual threads.
If you have this problem, it raises some questions. The most basic: If you're running a major algorithm inside some JavaScript in Node.js, are Node.js and (gasp) JavaScript the right technology to use here? If you must, you need to learn how Node.js handles concurrency and how to avoid blocking the event loop. You need to learn how to submit work to a worker pool instead. You may even have to learn (gasp!) about threads.
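To make the "submit work to a worker pool" idea concrete, here is a minimal sketch. It uses Python's asyncio as a stand-in for Node.js, since both are built around a single event loop; the function names and the workload are made up for illustration. The heavy computation is handed to a pool so a small heartbeat task keeps getting its turn on the loop.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def heavy_algorithm(n: int) -> int:
    """CPU-bound stand-in for the 'major algorithm' (hypothetical workload)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


async def heartbeat(ticks: list) -> None:
    """Record five timestamps; each tick needs a turn on the event loop."""
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def main() -> tuple:
    loop = asyncio.get_running_loop()
    ticks = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Hand the heavy work to the pool instead of calling it inline,
        # so the event loop stays free to run the heartbeat.
        work = loop.run_in_executor(pool, heavy_algorithm, 200_000)
        result, _ = await asyncio.gather(work, heartbeat(ticks))
    return result, len(ticks)


result, tick_count = asyncio.run(main())
print(result, tick_count)
```

Calling heavy_algorithm directly inside main() would give the same answer, but nothing else on the loop could run until it finished. In Python specifically, a process pool is often the better fit for CPU-bound work; the thread pool here just keeps the sketch short.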
Cause No. 2: Your database is !@#$
There are a lot of reasons your database can be blown. The first and most obvious is a missing index. If you're on a SQL database, you should learn how indexes work. If you have a WHERE clause with three key/value pairs in it and you run it over and over with different values, those columns should probably be indexed. For example:
select ... from customer c where c.firstname = :fname and c.lastname = :lname and c.state = :state
You could have three separate indexes (and there might be reasons to do that). However, that still results in index merges, meaning the results of three searches have to be merged. Instead, consider a single index that includes all three fields. There is a caveat to all of this: Indexes do tend to slow inserts and updates.
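Here is a small sketch of the single-index approach, using Python's built-in sqlite3 module so it runs anywhere. The table and column names come from the query above; the id column, the index name and the sample values are assumptions for illustration. Other databases have their own EXPLAIN syntax and may plan the query differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, state TEXT)"
)
# One composite index covering all three columns used in the WHERE clause
conn.execute(
    "CREATE INDEX idx_customer_name_state "
    "ON customer (firstname, lastname, state)"
)

query = (
    "SELECT id FROM customer c "
    "WHERE c.firstname = :fname AND c.lastname = :lname AND c.state = :state"
)
params = {"fname": "Ada", "lname": "Lovelace", "state": "NC"}

# EXPLAIN QUERY PLAN reports which index, if any, SQLite picks for the query
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)
```

If the plan mentions idx_customer_name_state, the lookup is served by the one composite index rather than by merging several single-column searches.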
Another common reason is a total misdesign of your schema. I was once on a large MongoDB project where the customer designed the schema like it was an RDBMS.
Nothing executed reasonably because MongoDB wasn't designed to join tables to do simple things like look up a phone number; it was designed to have that right in the customer document as an attribute. If you have a bad schema design, your application can't run well on it.
In modern applications, developer preference has had a lot to do with database selection, but not every application performs great on every type of database:
- If you are doing hierarchical queries or finding the relationship between two rows, you shouldn't be on an RDBMS.
- If you're basically reimplementing tables on top of a key-value store, just stop.
- If you have mostly friend-of-a-friend (FoaF) queries, maybe you need a graph database.
- If you're doing a lot of queries with conditional field names like Foo% searches, use an index like Apache Solr (yes, that's my company's product) instead (at least for that part).
Another less obvious reason your database is borked is that you've tried to open too many connections at once. For example, if you have one database connection pool locally that opens 100 connections but you've got 15 application servers all opening 100 connections, that's 1,500 connections that have to be opened at once. That may not work too well. You may need to do this a little at a time. Your ops people should know about this and how to constrain how much traffic makes it to the application servers at a given time (how to start up "warm" rather than "hot").
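One way to picture "a little at a time" is to open a pool's connections in batches rather than all at once. This is only a sketch: warm_pool, its parameters and the stand-in connect function are hypothetical, and a real connection pool library may already offer its own settings for initial size and growth.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def warm_pool(connect: Callable[[], T], target: int, batch_size: int,
              pause_seconds: float) -> List[T]:
    """Open `target` connections in batches, pausing between batches."""
    connections: List[T] = []
    while len(connections) < target:
        batch = min(batch_size, target - len(connections))
        connections.extend(connect() for _ in range(batch))
        if len(connections) < target:
            time.sleep(pause_seconds)  # give the database room between batches
    return connections


# Fleet-wide arithmetic from the example above: 15 servers x 100 connections
servers, per_server = 15, 100
total_at_once = servers * per_server

# Stand-in connect function; a real one would call your database driver
opened = warm_pool(connect=object, target=per_server, batch_size=10,
                   pause_seconds=0.01)
print(total_at_once, len(opened))
```

This only covers the client side. Throttling how much traffic reaches each application server during startup is still a job for your ops people and the load balancer.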
If the performance issue is the database, you should be able to find it with database monitoring tools, by logging query return times, or by listening to the DBA who told you the database couldn't handle all of those connections.
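Logging query return times does not need much machinery. The sketch below wraps a query in a timing context manager; the timed_query helper, the logger name and the half-second threshold are assumptions for illustration, and sqlite3 stands in for whatever driver you actually use.

```python
import logging
import sqlite3
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("query_timing")

SLOW_QUERY_SECONDS = 0.5  # assumed threshold; tune for your workload


@contextmanager
def timed_query(label: str, timings: list):
    """Time the wrapped block and log it, warning when it is slow."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings.append((label, elapsed))
        level = logging.WARNING if elapsed >= SLOW_QUERY_SECONDS else logging.INFO
        log.log(level, "%s took %.4f s", label, elapsed)


timings = []
conn = sqlite3.connect(":memory:")
with timed_query("select 1", timings):
    rows = conn.execute("SELECT 1").fetchall()
```

Once timings like these are in the log, the slow queries tend to identify themselves.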
Know the database you're writing to and what it likes in terms of schema and practices. Pick the appropriate database or databases for the job.
Cause No. 3: You didn't size memory correctly
Most modern business software runs on some sort of stack-based virtual machine. I'm not talking about VMware or Docker, but something more like the Java Virtual Machine (JVM). Without getting into much detail on the inner workings of VMs, nearly all of them require you to dedicate a certain amount of memory called a heap. They also use other types of memory every time they launch a thread. If they run low on heap memory, they'll spend a lot more time on memory management, which will look (until they crash and burn) like the application just got slow.
On the JVM, you can turn on garbage-collection logging, which will show you how many collections are being run. You can also just up the heap size, but do that judiciously.
Many people think the heap is the only kind of memory, but there is also the JVM's -Xss stack size option. Each thread gets a certain amount of stack memory. If
System Memory - (heap + otherstuff + (numthreads * numstack)) <= 0
then when you grab another thread, you'll throw a special kind of out-of-memory exception. Depending on the kinds of libraries you're running, that might show up as a thread pool that doesn't expand or a database connection pool that doesn't expand; both will look like a slowdown. The good news is this is captured in any log.
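The formula above is easy to turn into back-of-the-envelope arithmetic. The sketch below does that with made-up numbers; the function names are hypothetical, and it treats every thread stack as fully used, which is a simplification of how operating systems actually commit stack memory.

```python
def max_threads(system_mb: int, heap_mb: int, other_mb: int,
                stack_kb_per_thread: int) -> int:
    """How many thread stacks fit in what is left after heap and other memory."""
    headroom_mb = system_mb - heap_mb - other_mb
    if headroom_mb <= 0:
        return 0
    return headroom_mb * 1024 // stack_kb_per_thread


def next_thread_fails(system_mb: int, heap_mb: int, other_mb: int,
                      num_threads: int, stack_kb_per_thread: int) -> bool:
    """The condition from the text: memory left after all stacks is <= 0."""
    stacks_mb = num_threads * stack_kb_per_thread / 1024
    return system_mb - (heap_mb + other_mb + stacks_mb) <= 0


# Made-up numbers: 8 GB box, 6 GB heap, 1.5 GB other stuff, 1 MB stack per thread
print(max_threads(8192, 6144, 1536, 1024))             # 512
print(next_thread_fails(8192, 6144, 1536, 511, 1024))  # False
print(next_thread_fails(8192, 6144, 1536, 512, 1024))  # True
```

With these numbers, a bigger heap buys fewer threads: every extra megabyte of heap is a megabyte the thread stacks cannot use.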
Cause No. 4: You sized your thread or connection pools incorrectly
If you've got 1,000 concurrent users and five database connections in the pool, you've probably got requests waiting on that pool. If you've got 100 HTTP threads on top of that and a TCP backlog setting of 5, after 105 people try to connect, you're going to see a "connection refused" message, but things will get really slow before then. In addition, some software has a number of "accept" threads, which basically "answer the phone and hand it off to one of those other threads." Usually there is one, maybe two, of those.
There is no hard-and-fast rule for what those numbers should all be, but be reasonable on the proportions. Also remember cause No. 3 while doing this because you can run into other constraints.
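To get a feel for the proportions, here is some rough arithmetic on the numbers above. It is a deliberately simple model, not a sizing rule: it assumes every user arrives at once and each one holds a connection for a fixed 50 ms, and both function names are made up for illustration.

```python
import math


def worst_case_wait_ms(users: int, pool_size: int, hold_ms: int) -> int:
    """Wait for the last user if all arrive at once and each one holds a
    connection for `hold_ms` (a deliberately simple model)."""
    rounds = math.ceil(users / pool_size)
    return (rounds - 1) * hold_ms


def accepted_before_refusal(http_threads: int, tcp_backlog: int) -> int:
    """Connections being served plus connections queued in the backlog."""
    return http_threads + tcp_backlog


print(worst_case_wait_ms(1000, 5, 50))   # 9950
print(worst_case_wait_ms(1000, 50, 50))  # 950
print(accepted_before_refusal(100, 5))   # 105
```

Under those assumptions, the unluckiest of the 1,000 users waits nearly 10 seconds for one of the five connections, and a pool of 50 cuts that to under a second.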
Cause No. 5: You haven't set your limits and file handles correctly
Most operating systems limit the number of threads and files an operating system user is allowed to open. If you run at this limit, things get slow before they fail. You should see this in the log, and there are tools to show what file locks and handles are in use.
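You can also check the limits from inside a process. The sketch below reads the open-file limit with Python's resource module, which exists only on Unix-like systems, and counts open descriptors through /proc, which is Linux-specific; the helper names are made up. On the command line, ulimit -n and lsof show similar information.

```python
import os
import resource  # Unix-only standard library module


def fd_limits() -> tuple:
    """Soft and hard limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Count open descriptors via /proc (Linux-specific); None elsewhere."""
    try:
        return len(os.listdir("/proc/self/fd"))
    except OSError:
        return None


soft, hard = fd_limits()
in_use = open_fd_count()
print(f"open files: {in_use} in use, soft limit {soft}, hard limit {hard}")
```

If the in-use count sits close to the soft limit under load, that is a strong hint you have found cause No. 5.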


