Top 10 Causes of Java EE Enterprise Performance Problems

Performance problems are one of the biggest challenges to expect when designing and implementing Java EE related technologies. Some of these common problems can be faced when implementing either lightweight or large IT environments, which typically include several distributed systems, from Web portals and ordering applications to enterprise service bus (ESB), data warehouse, and legacy Mainframe storage systems.

It is very important for IT architects and Java EE developers to understand their client environments and ensure that the proposed solutions will not only meet their growing business needs but also provide a scalable and reliable production IT environment over the long term, at the lowest possible cost. Performance problems can disrupt your client's business, which can result in short- and long-term loss of revenue.

This article consolidates and shares the top 10 causes of Java EE performance problems I have encountered working with IT and Telecom clients over the last 10 years, along with high-level recommendations.

Please note that this article is in-depth, but I'm confident that this substantial read will be worth your time.

#1 - Lack of proper capacity planning

I'm confident that many of you can identify episodes of performance problems following Java EE project deployments. Some of these performance problems may have a very specific technical explanation, but they are often symptoms of gaps in the current capacity planning of the production environment.

Capacity planning can be defined as a comprehensive and evolving process of measuring and predicting the current and future capacity required by the IT environment. A properly implemented capacity planning process will not only keep track of current IT production capacity and stability but also ensure that new projects can be deployed with minimal risk in the existing production environment. Such an exercise can also conclude that extra capacity (hardware, middleware, JVM, tuning, etc.) is required prior to project deployment.

In my experience, this is the most common "process" problem that can lead to short- and long-term performance problems. The following are some examples.

Problems observed and possible capacity planning gaps

Problem observed: A newly deployed application triggers an overload of the current Java Heap or Native Heap space (e.g., java.lang.OutOfMemoryError is observed).
Possible capacity planning gaps:
- Lack of understanding of the current JVM Java Heap (YoungGen and OldGen spaces) utilization
- Lack of static and / or dynamic memory footprint calculation for the newly deployed application
- Lack of performance and load testing, preventing detection of problems such as a Java Heap memory leak

Problem observed: A newly deployed application triggers a significant increase of CPU utilization and performance degradation of the Java EE middleware JVM processes.
Possible capacity planning gaps:
- Lack of understanding of the current CPU utilization (e.g., established baseline)
- Lack of understanding of the current JVM garbage collection health (a new application / extra load can trigger increased GC and CPU)
- Lack of load and performance testing, failing to predict the impact on existing CPU utilization

Problem observed: A new Java EE middleware system is deployed to production but is unable to handle the anticipated volume.
Possible capacity planning gaps:
- Missing or inadequate performance and load testing
- Data and test cases used in performance and load testing do not reflect real-world traffic and business processes
- Not enough bandwidth (or pages are much bigger than capacity planning anticipated)

One key aspect of capacity planning is load and performance testing, which everybody should be familiar with. This involves generating load against a production-like environment or the production environment itself in order to:

Determine how many concurrent users / order volumes your application(s) can support

Expose your platform and Java EE application bottlenecks, allowing you to take corrective actions (middleware tuning, code change, infrastructure and capacity improvement, etc.)

There are several technologies available that allow you to achieve these goals. Some load-testing products generate load from inside your network, from a test lab, while other emerging technologies generate load from the "Cloud".

I'm currently exploring the free version of LoadTester, a new load testing tool I found that lets you record test cases and generate load from inside your network or from the Cloud.

Regardless of the load and performance testing tool that you decide to use, this exercise should be done on a regular basis for any dynamic Java EE environment, as part of a comprehensive and adaptive capacity planning process. When done properly, capacity planning will help increase the service availability of your client's IT environment.
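To make the idea of "generating load from concurrent users" concrete, here is a minimal sketch in plain Java. It is not a substitute for a real load-testing product, and the class `MiniLoadDriver`, its `Result` type, and the `run` parameters are all hypothetical names introduced for illustration. It assumes each simulated user runs in its own thread and repeatedly executes the same request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of any load-testing product.
class MiniLoadDriver {

    /** Aggregated outcome of one load run. */
    static final class Result {
        final int completed;     // requests that returned normally
        final int failed;        // requests that threw a RuntimeException
        final long maxLatencyMs; // slowest single request

        Result(int completed, int failed, long maxLatencyMs) {
            this.completed = completed;
            this.failed = failed;
            this.maxLatencyMs = maxLatencyMs;
        }
    }

    /** Runs `request` requestsPerUser times on each of `users` concurrent threads. */
    static Result run(int users, int requestsPerUser, Runnable request) {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        try {
            List<Callable<long[]>> workers = new ArrayList<>();
            for (int u = 0; u < users; u++) {
                workers.add(() -> {
                    long ok = 0, ko = 0, worst = 0;
                    for (int i = 0; i < requestsPerUser; i++) {
                        long start = System.nanoTime();
                        try {
                            request.run();
                            ok++;
                        } catch (RuntimeException e) {
                            ko++;
                        }
                        long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                        worst = Math.max(worst, ms);
                    }
                    return new long[] {ok, ko, worst};
                });
            }
            int completed = 0, failed = 0;
            long maxLatency = 0;
            // invokeAll blocks until every simulated user has finished.
            for (Future<long[]> f : pool.invokeAll(workers)) {
                long[] r = f.get();
                completed += (int) r[0];
                failed += (int) r[1];
                maxLatency = Math.max(maxLatency, r[2]);
            }
            return new Result(completed, failed, maxLatency);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("load run interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder request; a real test would call the application under test.
        Result r = run(10, 100, () -> Math.sqrt(42.0));
        System.out.println("completed=" + r.completed + " failed=" + r.failed
                + " maxLatencyMs=" + r.maxLatencyMs);
    }
}
```

A dedicated tool adds what this sketch leaves out: recorded test cases, ramp-up profiles, realistic think time, and reporting.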

#2 - Inadequate Java EE middleware environment specifications

The second most common cause of performance problems I have observed for Java EE enterprise systems is an inadequate Java EE middleware environment and / or infrastructure. Not making proper decisions at the beginning of a new platform can result in major stability problems and increased costs for your client in the long term. For that reason, it is important to spend enough time brainstorming on the required Java EE middleware specifications. This exercise should be combined with an initial capacity planning iteration, since the business processes, expected traffic, and application(s) footprint will ultimately dictate the initial IT environment capacity requirements.

The following are typical examples of problems I have observed:

Deployment of too many Java EE applications in a single 32-bit JVM

Deployment of too many Java EE applications in a single middleware domain

Lack of proper vertical scaling and under-utilized hardware (e.g., traffic driven by one or just a few JVM processes)

Excessive vertical scaling and over-utilized hardware (e.g., too many JVM processes vs. available CPU cores and RAM)

Lack of environment redundancy and fail-over capabilities

Trying to leverage a single middleware and / or JVM for many large Java EE applications can be quite attractive from a cost perspective. However, this can result in an operational nightmare and severe performance problems such as excessive JVM garbage collection and many domino effect scenarios (e.g., Stuck Threads) with high business impact (e.g., App A causing App B, App C, and App D to go down because a full JVM restart is often required to resolve problems).

Recommendations

The project team should spend enough time creating a proper operation model for the Java EE production environment.

Attempt to find a good "balance" for your Java EE middleware specifications to give the business and operations teams proper flexibility in the event of outage scenarios.

Avoid deployment of too many Java EE applications in a single 32-bit JVM. The middleware is designed to handle many applications, but your JVM may suffer the most.

Choose a 64-bit over a 32-bit JVM when it is required, but combine this with proper capacity planning and performance testing to ensure your hardware will support it.

#3 - Excessive Java VM garbage collections

Now let's jump to pure technical problems, starting with excessive JVM garbage collection. Most of you are familiar with this famous (or infamous) Java error: java.lang.OutOfMemoryError. This is the result of JVM memory space depletion (Java Heap, Native Heap, etc.).

I'm sure middleware vendors such as Oracle and IBM could provide you with dozens and dozens of support cases involving JVM OutOfMemoryError problems on a regular basis, so it is no surprise that it made the #3 spot in our list.

Keep in mind that a garbage collection problem will not necessarily manifest itself as an OOM condition. Excessive garbage collection can be defined as an excessive number of minor and / or major collections performed by the JVM GC Threads (collectors) in a short amount of time, leading to high JVM pause time and performance degradation. There are many possible causes:

The Java Heap size chosen is too small vs. the JVM concurrent load and application(s) memory footprint.

An inappropriate JVM GC policy is used.

Your application(s) static and / or dynamic memory footprint is too big to fit in a 32-bit JVM.

The JVM OldGen space is leaking over time (quite a common problem); excessive GC (major collections) is observed after a few hours / days.

The JVM PermGen space (HotSpot VM only) or Native Heap is leaking over time (quite a common problem); OOM errors are often observed over time following dynamic application redeployments.

The ratio of YoungGen / OldGen space is not optimal for your application(s) (e.g., a bigger YoungGen space is required for applications generating a massive amount of short-lived objects, while a bigger OldGen space is required for applications creating a lot of long-lived / cached objects).

The Java Heap size used for a 32-bit VM is too big, leaving little room for the Native Heap. Problems can manifest as OOM when trying to deploy a new Java EE application, create new Java Threads, or perform any computing task that requires native memory allocations.
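Several of the causes above come down to knowing how full each JVM memory space actually is. As a minimal sketch, the standard `java.lang.management` API can list the memory pools of the running JVM. The class name `MemoryPools` and the output format are illustrative choices, and the pool names you will see depend on the JVM vendor, version, and GC policy (for instance, HotSpot 8 and later report a Metaspace pool rather than PermGen).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper: one text line per JVM memory pool.
class MemoryPools {

    private static final long MB = 1024L * 1024L;

    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when no maximum is defined for the pool.
            String max = usage.getMax() < 0 ? "undefined" : (usage.getMax() / MB) + " MB";
            lines.add(String.format("%s [%s] used=%d MB max=%s",
                    pool.getName(), pool.getType().name(), usage.getUsed() / MB, max));
        }
        return lines;
    }

    public static void main(String[] args) {
        describe().forEach(System.out::println);
    }
}
```

Sampling this output at regular intervals under load gives a rough view of which space is growing, which helps decide whether a heap dump analysis is warranted.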

Before pointing a finger at the JVM, keep in mind that the actual "root" cause can be related to our #1 and #2 causes. An overloaded middleware environment will generate many symptoms, including excessive JVM garbage collection.

Proper analysis of your JVM-related data (memory spaces, GC frequency, CPU correlation, etc.) will allow you to determine whether you are facing a problem. A deeper level of analysis to understand your application memory footprint will require you to analyze JVM Heap Dumps and / or profile your application using a profiler tool of your choice (such as JProfiler).

Recommendations

Ensure that you monitor and understand your JVM garbage collection very closely. There are several commercial and free tools available to do so. At a minimum, you should enable verbose GC, which will provide the data that you need for your health assessment.

Keep in mind that GC-related problems are unlikely to be caught during development or functional testing. Proper garbage collection tuning will require you to perform load and performance testing with a high volume of simultaneous users. This exercise will allow you to fine-tune your Java Heap memory footprint as per your applications' behaviour and load level forecast.
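Alongside verbose GC logs, the JVM exposes cumulative collection time through the standard `GarbageCollectorMXBean` interface. The sketch below samples it twice and derives the share of wall-clock time spent in GC. The class `GcOverhead`, its method names, and the 5-second sampling interval are assumptions made for illustration, and what counts as "excessive" depends on your own baseline.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative helper: estimates the percentage of elapsed time spent in GC.
class GcOverhead {

    /** Cumulative GC time in milliseconds across all collectors since JVM start. */
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** GC time accumulated between two samples, as a percentage of elapsed time. */
    static double overheadPercent(long gcTimeStartMs, long gcTimeEndMs, long elapsedMs) {
        if (elapsedMs <= 0) {
            return 0.0;
        }
        return 100.0 * (gcTimeEndMs - gcTimeStartMs) / elapsedMs;
    }

    public static void main(String[] args) throws InterruptedException {
        long gcStart = totalGcTimeMs();
        long wallStart = System.nanoTime();
        Thread.sleep(5_000); // arbitrary sampling interval for this sketch
        long elapsedMs = (System.nanoTime() - wallStart) / 1_000_000;
        double pct = overheadPercent(gcStart, totalGcTimeMs(), elapsedMs);
        System.out.printf("GC overhead over the last %d ms: %.2f%%%n", elapsedMs, pct);
    }
}
```

Run inside the application under load (for example from a scheduled task), a steadily rising percentage is a signal to look at the verbose GC output in more detail.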

#4 - Too many or poor integrations with external systems

The next common cause of bad Java EE performance is mainly applicable to highly distributed systems, which are typical of Telecom IT environments. In such environments, a middleware domain (e.g., Service Bus) will rarely do all the work but will rather "delegate" some of the business processes, such as product qualification, customer profile, and order management, to other Java EE middleware platforms or legacy systems such as Mainframe via various payload types and communication protocols.

Such external system calls mean that the client Java EE application will trigger the creation or reuse of Socket Connections to write and read data to/from external systems across a private network. Some of these calls can be configured as synchronous or asynchronous depending on the implementation and the nature of the business process. It is important to note that the response time can change over time depending on the health of the external systems, so it is very important to shield your Java EE application and middleware via proper use of timeouts.

Major problems and performance slowdowns can be observed in the following scenarios:

Too many external system calls are performed in a synchronous and sequential manner. Such an implementation is also fully exposed to instability and slowdown of its external systems.

Timeouts between Java EE client applications and external systems are missing, or their values are too high. This will cause client Threads to get Stuck, which can lead to a full domino effect.

Timeouts are properly implemented, but the middleware is not fine-tuned to handle the "non-happy" path. Any increase in response time (or outage) of an external system will lead to increased Thread utilization and Java Heap utilization (increased # of pending payload data). The middleware environment and JVM must be tuned to anticipate and handle both "happy" and "non-happy" paths to prevent a full domino effect.
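As a minimal sketch of the missing-timeout scenario, the following uses the JDK's `HttpURLConnection`, whose connect and read timeouts both default to 0, meaning "wait forever". The class `TimeoutConfig`, the example URL, and the 2-second / 5-second values are assumptions for illustration. Your own integration may use a different client (JAX-WS, JMS, a vendor SDK), each with its own timeout settings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Illustrative helper: opens an HTTP connection with explicit timeouts.
class TimeoutConfig {

    static HttpURLConnection open(String url, int connectTimeoutMs, int readTimeoutMs) {
        try {
            // openConnection() only creates the object; no network I/O happens yet.
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            // Maximum time to establish the socket connection.
            conn.setConnectTimeout(connectTimeoutMs);
            // Maximum time to block on a read once connected.
            conn.setReadTimeout(readTimeoutMs);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical endpoint and timeout values; tune them through negative testing.
        HttpURLConnection conn = open("http://example.invalid/customer-profile", 2_000, 5_000);
        System.out.println("connect=" + conn.getConnectTimeout()
                + " ms, read=" + conn.getReadTimeout() + " ms");
    }
}
```

With both values set, a slow or unavailable external system surfaces as a `java.net.SocketTimeoutException` that the caller can handle, rather than as a Thread blocked indefinitely.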

Finally, I also recommend that you spend adequate time performing negative testing. This means that problem conditions should be "artificially" introduced to the external systems in order to test how your application and middleware environment handle failures of those external systems. This exercise should also be performed under a high-volume situation, allowing you to fine-tune the different timeout values between your applications and external systems.
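The first scenario in this section, too many synchronous and sequential calls, can often be softened by issuing independent calls in parallel under one overall time budget. The sketch below does this with `ExecutorService.invokeAll` and a timeout. The class `ParallelCalls` and the sample call names are hypothetical. It also assumes a plain Java SE thread pool for brevity, whereas inside a Java EE container you would normally use container-managed threads instead (for example a `ManagedExecutorService` on Java EE 7 and later).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative helper: runs independent external calls in parallel with a global timeout.
class ParallelCalls {

    /** Returns one entry per call, in order; null marks a call that failed or timed out. */
    static List<String> callAll(List<Callable<String>> calls, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, calls.size()));
        try {
            // Calls still running when the timeout expires are cancelled.
            List<Future<String>> futures =
                    pool.invokeAll(calls, timeoutMs, TimeUnit.MILLISECONDS);
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                if (f.isCancelled()) {
                    results.add(null); // timed out
                    continue;
                }
                try {
                    results.add(f.get());
                } catch (ExecutionException e) {
                    results.add(null); // the external call itself failed
                }
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for external calls", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> calls = new ArrayList<>();
        calls.add(() -> "customer profile");               // fast simulated system
        calls.add(() -> { Thread.sleep(5_000); return "late"; }); // slow simulated system
        System.out.println(callAll(calls, 500));
    }
}
```

The caller then decides how to treat a null entry (fallback value, partial response, or error), which is the kind of "non-happy" path behaviour that negative testing is meant to exercise.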

#5 - Lack of proper database SQL tuning & capacity planning

The next common performance problem should not be a surprise to anybody: database issues. Most Java EE enterprise systems rely on relational databases for various business processes, from portal content management to order provisioning systems. A solid database environment and foundation will ensure that your IT environment scales properly to support your client's growing business.

In my production support experience, database-related performance problems are very common. Since most database transactions are typically executed via JDBC Datasources (including for relational persistence APIs such as Hibernate), performance problems will initially manifest as Stuck Threads from your Java EE container Thread manager. The following are common database-related problems I have seen over the last 10 years:

* Note that the Oracle database is used as an example since it is a common product used by my IT clients. *

Isolated, long-running SQLs. This problem will manifest as Stuck Threads and is usually a symptom of a lack of SQL tuning, missing indexes, a non-optimal execution plan, a returned dataset that is too large, etc.

Table or row level data lock. This problem can manifest especially when dealing with a two-phase commit transactional model (e.g., the infamous Oracle In-Doubt Transactions). In this scenario, the Java EE container can leave some pending transactions waiting for final commit or rollback, leaving data locks that can trigger performance problems until such locks are removed. This can happen as a result of a trigger event such as a middleware outage or server crash.

Sudden change of execution plan. I have seen this problem quite often; it is usually the result of changes in data patterns, which can (for example) cause Oracle to update the query execution plan on the fly and trigger major performance degradation.

Lack of proper management of the database facilities. For example, Oracle has several areas to look at, such as REDO logs, database data files, etc. Problems such as lack of disk space and log files not rotating can trigger major performance problems and an outage situation.

Recommendations

Proper capacity planning involving load and performance testing is critical here to fine-tune your database environment and detect any problems at the SQL level.

If you are using Oracle databases, ensure that your DBA team is reviewing the AWR Report on a regular basis, especially in the context of an incident and root cause analysis process. The same analysis approach should also be applied to other database vendors.

Take advantage of the JVM Thread Dump and AWR Report to pinpoint the slow-running SQLs and / or use a monitoring tool of your choice to do the same.

Make sure to spend enough time fortifying the "Operation" side of your database environment (disk space, data files, REDO logs, table spaces, etc.) along with proper monitoring and alerting. Failure to do so can expose your client's IT environment to major outage scenarios and many hours of downtime.
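To illustrate how a Thread Dump helps pinpoint Threads waiting on the database, here is a minimal sketch that takes an in-process dump through the standard `ThreadMXBean` and keeps the Threads whose stack contains a frame from a given package. The class `StuckThreadFinder` is a hypothetical name, and the `oracle.jdbc` prefix used in `main` assumes the Oracle JDBC driver; substitute the package of whichever driver you use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper: lists threads currently executing code from a given package.
class StuckThreadFinder {

    static List<String> threadsWithFrame(String classPrefix) {
        List<String> matches = new ArrayList<>();
        // false, false: skip lock details, which keeps the dump cheaper to take.
        ThreadInfo[] dump = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : dump) {
            if (info == null) {
                continue;
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                if (frame.getClassName().startsWith(classPrefix)) {
                    matches.add(info.getThreadName() + " (" + info.getThreadState() + ")");
                    break; // one match per thread is enough
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed driver package; threads listed here are inside a JDBC call right now.
        threadsWithFrame("oracle.jdbc").forEach(System.out::println);
    }
}
```

Threads that appear in several consecutive samples are the likely candidates for slow-running SQL, and their full stack traces can then be correlated with the AWR Report.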
