IT Infrastructure & Operations

On-Premise vs. Cloud

Use a hybrid cloud strategy, not an “all-in” on-premise or “all-in” on cloud approach.

A large bank, concerned about security, decided to keep all its infrastructure on-premise. They struggled to innovate quickly. A tech startup went “all-in” on the public cloud but found costs spiraling for their high-performance computing workloads. A wiser company adopted a hybrid approach. They kept their sensitive customer data and high-performance computing on-premise for security and cost control, but used the public cloud for its agility in developing new, customer-facing applications. This balanced approach gave them the best of both worlds.

Stop doing a “lift and shift” migration for all your applications. Do re-architect for the cloud where it makes sense.

A company took their old, monolithic on-premise application and simply moved the virtual machine to the cloud. This “lift and shift” was easy, but the application was slow, not scalable, and expensive to run. They weren’t using any of the cloud’s benefits. For their next application, they re-architected it for the cloud. They broke it into microservices and used managed databases. This cloud-native application was more resilient, could scale automatically, and was significantly cheaper, unlocking the true promise of the cloud.

The #1 secret for a successful hybrid cloud implementation that balances cost, performance, and security.

The secret is not a specific technology, but a unified control plane. A company had workloads on-premise and in two different clouds. Their operations team was a mess, using three different sets of tools to manage security, networking, and deployments. A successful hybrid implementation uses a platform that provides a single, consistent interface for managing all resources, regardless of where they are physically located. This unified control plane is what turns a collection of separate environments into a true, manageable hybrid cloud.

The biggest lie you’ve been told about the cloud being cheaper than on-premise for all workloads.

The lie is that moving to the cloud will automatically save you money. For a spiky, unpredictable workload, the cloud’s pay-as-you-go model is often much cheaper. But for a predictable, high-utilization workload that runs 24/7, like a large database, a well-run on-premise environment can actually have a lower total cost of ownership over a three-to-five-year period than paying the on-demand rates for a large cloud instance. “Cheaper” depends entirely on the workload.
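A back-of-envelope sketch of why utilization drives the answer (the prices below are assumptions for illustration, not quotes from any provider):

```python
# Illustrative break-even sketch; all dollar figures are assumptions.
HOURS_PER_YEAR = 8760

cloud_hourly_rate = 3.50          # on-demand rate for a large instance (assumed)
onprem_annualized_cost = 18_000   # amortized hardware plus power/cooling/staff share (assumed)

def annual_cloud_cost(utilization):
    """Pay-as-you-go: cost scales with the fraction of hours the instance runs."""
    return cloud_hourly_rate * HOURS_PER_YEAR * utilization

# Spiky workload running 10% of the time: the cloud wins easily.
spiky = annual_cloud_cost(0.10)    # about $3,066 per year
# 24/7 database at 100% utilization: on-prem is cheaper at these assumptions.
steady = annual_cloud_cost(1.00)   # $30,660 per year
```

At these assumed rates the crossover sits somewhere around 60% utilization; the point is not the exact numbers but that the comparison has no single answer without the workload's utilization profile.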

I wish I knew this about data egress costs when I was planning my first major cloud migration.

I was so excited about the low storage costs in the cloud. I moved a petabyte of our company’s data archive to a cloud provider. The storage bill was tiny. Then, we needed to move a large portion of that data to another system for analysis. We were hit with a shocking, six-figure bill for “data egress.” I had only considered the cost of putting data in, not the much higher cost of taking it out. I wish I had known that data egress fees are a major, often overlooked, component of cloud economics.
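The arithmetic is sobering. This sketch uses an illustrative on-demand egress rate, since actual pricing varies by provider and volume tier:

```python
# Rough egress estimate; the $/GB rate is an illustrative assumption.
egress_rate_per_gb = 0.09            # assumed on-demand egress price
data_moved_tb = 1024                 # pulling the full petabyte back out
egress_cost = data_moved_tb * 1024 * egress_rate_per_gb
print(f"${egress_cost:,.0f}")        # roughly $94,372 for one bulk transfer
```

Storing that same petabyte might cost a few thousand dollars a month; a single bulk retrieval at on-demand rates can dwarf a year of storage fees.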

I’m just going to say it: On-premise infrastructure is not dead, especially for high-performance computing and data-sensitive workloads.

The tech world loves to proclaim the death of the data center. But for a scientific research institution that needs to process massive datasets with ultra-low latency, or a bank that, due to strict data sovereignty laws, cannot let its customer data leave the country, on-premise infrastructure is still the best and often the only choice. The cloud is a powerful tool, but it is not the right tool for every single job. On-premise has a crucial role to play in a hybrid world.

99% of IT departments make this one mistake when calculating the TCO of cloud vs. on-premise.

The most common mistake is only comparing the direct hardware and software costs. An IT department will compare the cost of buying a server to the cost of a cloud virtual machine. They forget to factor in all the “hidden” costs of their on-premise environment: the cost of the data center’s power and cooling, the physical security, and, most importantly, the salaries of the people required to rack, stack, and maintain that hardware. A true Total Cost of Ownership (TCO) analysis includes all of these indirect costs.
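As a sketch of the gap between the naive comparison and a real TCO (every figure below is an assumption for illustration):

```python
# TCO sketch that includes the "hidden" indirect costs; all figures assumed.
server_hardware = 12_000          # purchase price
amortization_years = 4

direct_annual = server_hardware / amortization_years
indirect_annual = {
    "power_and_cooling": 1_500,
    "datacenter_space_and_security": 900,
    "staff_time_share": 4_000,    # rack, stack, patch, maintain
}

naive_tco = direct_annual                            # what gets compared to the cloud
true_tco = direct_annual + sum(indirect_annual.values())
```

With these assumptions, the true annual cost is more than triple the hardware-only number, which is why hardware-only comparisons so often favor on-premise on paper.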

This one small action of implementing a cloud cost management platform will change your hybrid cloud economics forever.

A company was running workloads on-premise and in the cloud. They had no single view of their total infrastructure spending. They implemented a cloud cost management platform that could analyze both their on-premise costs and their cloud bills. For the first time, they had a single dashboard that showed them their total blended cost. The platform also identified which of their on-premise workloads would actually be cheaper to run in the cloud, allowing them to make data-driven decisions about their hybrid strategy.

The reason your cloud bill is so high is because you’re treating your cloud resources like on-premise servers.

An IT team migrated to the cloud but brought their on-premise mindset with them. They would provision a large virtual machine for an application and then leave it running 24/7, just like they did in their old data center. They were paying for a huge amount of idle capacity. The cloud is an elastic, pay-for-what-you-use environment. To control costs, you have to embrace cloud-native concepts like autoscaling and turning resources off when they are not in use.

If you’re still only considering on-premise solutions, you’re losing agility and scalability.

A retail company running their e-commerce site on-premise had to spend months planning and buying new hardware to prepare for the holiday shopping season. They were always guessing their capacity needs. Their competitor, running in the cloud, could automatically scale their infrastructure up to handle the Black Friday traffic spike in a matter of minutes, and then scale back down in January. This agility to respond instantly to business demand is a massive competitive advantage that on-premise infrastructure simply cannot match.

Infrastructure as Code (IaC)

Use declarative IaC tools like Terraform or Pulumi, not just imperative shell scripts.

An operations engineer wrote a series of shell scripts to set up a new server. The scripts had to be run in a specific order, and if one failed, the server was left in an inconsistent state. This is an imperative approach. She switched to a declarative tool like Terraform. She now just defined the desired end state of her server in a configuration file, and Terraform intelligently figured out how to get there. This declarative model was more reliable, predictable, and easier to manage.
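A minimal Terraform sketch of the declarative model (the AWS provider is assumed here, and the AMI ID and names are placeholders): you describe the desired end state, and `terraform apply` works out the steps to reach it.

```hcl
# Declarative: state what should exist; Terraform computes how to get there.
resource "aws_instance" "web" {
  ami           = "ami-0abc1234"   # placeholder AMI ID
  instance_type = "t3.small"

  tags = {
    Name = "web-server"
  }
}
```

Running the same configuration twice is safe: if the instance already matches this description, Terraform makes no changes, which is exactly the reliability the shell scripts lacked.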

Stop doing manual server configuration. Do manage your entire infrastructure as version-controlled code instead.

A sysadmin used to configure servers by manually SSHing into them and running a series of commands. Every server was slightly different, and there was no record of the changes he had made. This was a nightmare. He started using Infrastructure as Code. He now defined his entire server configuration in a text file and stored it in Git. His infrastructure was now versionable, testable, and perfectly reproducible. He could rebuild his entire environment from scratch with a single command.

The #1 hack for getting started with Infrastructure as Code without rewriting everything.

The secret is to use a tool that can import your existing, manually created infrastructure. A team had a complex cloud environment that had been created over years by clicking around in the web console. They wanted to start using IaC but were daunted by the idea of rewriting everything. They used a tool that could scan their cloud account and automatically generate the IaC configuration files that represented their existing resources. This allowed them to start managing their infrastructure as code without having to start from zero.

The biggest lie you’ve been told about IaC being only for cloud environments.

The lie is that Infrastructure as Code is only useful for managing public cloud resources. A network engineer thought that tools like Terraform were not for him. He then discovered that there are “providers” for almost everything. He was able to use the same IaC workflow to manage his on-premise VMware environment, his physical network switches, and even his DNS records. IaC is a methodology, not a cloud-specific technology. It can be used to manage any resource that has an API.

I wish I knew this about the importance of state management in Terraform when I first started.

When I first started using Terraform, I just ran it on my laptop. My “state file,” which keeps track of the resources Terraform manages, was just a local file. When my laptop crashed, I lost the state file, and Terraform no longer knew which infrastructure it was supposed to be managing. It was a disaster. I wish I had known to use a remote backend for my state file from day one. Storing the state in a shared, remote location is essential for any kind of collaborative or production-grade IaC workflow.
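Moving to a remote backend is only a few lines of configuration. This sketch assumes an S3 bucket for state and a DynamoDB table for locking; both names are placeholders:

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"   # placeholder bucket name
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"           # optional: prevents two people applying at once
    encrypt        = true
  }
}
```

With the state in a shared, locked, encrypted location, a dead laptop is an inconvenience instead of a disaster.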

I’m just going to say it: If you’re not using IaC in 2025, you’re committing professional malpractice.

In the modern era of cloud computing and complex, distributed systems, manually managing infrastructure is no longer a viable professional practice. It is slow, error-prone, and insecure. An IT department that is still relying on manual configuration is like an accounting department that is still using a paper ledger. Infrastructure as Code is the industry-standard, professional way to build and manage modern infrastructure. To not use it is to be willfully negligent.

99% of operations teams make this one mistake when adopting IaC.

The most common mistake is not creating a standardized, reusable module library. Each developer on a team will write their own, slightly different IaC code to deploy a new application. This results in inconsistency and duplicated effort. A smarter team creates a central library of standardized, blessed “modules” for common infrastructure patterns, like a web server or a database. This allows developers to build new environments quickly and consistently, knowing that they are using a pre-approved, secure configuration.

This one small habit of running a “plan” before you “apply” will save you from countless infrastructure disasters forever.

A developer made a small typo in his Infrastructure as Code file. He immediately ran the “apply” command. The typo caused the command to destroy his entire production database. His colleague, working on a similar change, had a habit of always running the “plan” command first. The plan showed her exactly what changes were going to be made, and she caught a similar typo before it could do any damage. The “plan” command is a dry run, and it’s the most critical safety feature in any IaC workflow.

The reason your environments are so inconsistent is because you’re not using Infrastructure as Code.

A company’s development, staging, and production environments were all configured slightly differently. A bug that appeared in production could not be reproduced in development. The reason for this inconsistency was “configuration drift”—small, manual changes that had been made over time. By defining all of their environments with Infrastructure as Code, they could ensure that every single environment was a perfect, 100% identical replica, which eliminated a huge class of “it works on my machine” problems.

If you’re still clicking around in a web console to provision infrastructure, you’re losing reproducibility and auditability.

An administrator provisioned a new virtual machine by clicking through a web interface. A week later, he needed to create an identical one. He couldn’t remember the exact settings he had chosen. There was also no audit trail of who had created the machine or why. If he had used Infrastructure as Code, the entire configuration would have been captured in a version-controlled text file. This would have made the infrastructure perfectly reproducible and would have provided a clear, auditable history of every single change.

Monitoring & Observability

Use an observability platform with distributed tracing, not just siloed monitoring tools for metrics and logs.

A website was slow. The operations team looked at their server metrics (CPU was fine). The development team looked at their application logs (no errors). Both teams were blind. A different company used an observability platform. With distributed tracing, they could see the entire lifecycle of a single user request as it traveled through a dozen different microservices. They instantly saw that one specific, downstream service was taking five seconds to respond. They found the needle in the haystack.

Stop asking “Is the server down?” Do ask “Why is the application slow for this specific user?” instead.

The old way of monitoring focused on system health: is the CPU high? Is the server running? A user complained that the app was slow. The monitoring dashboards all looked green. The system was “up,” but the user was still having a bad experience. A modern observability approach allows you to ask much more specific questions. An engineer could drill down and see the exact trace for that specific user’s request, identifying that a database query for their particular account was inefficient. It’s a shift from monitoring the system to understanding the user’s experience.

The #1 secret for transitioning from monitoring to full-stack observability.

The secret is to enrich your telemetry data with context. A log message that just says “User login failed” is not very useful. A log message enriched with context is much better: “User login failed for user_id: 123 from ip_address: 1.2.3.4, reason: invalid password.” By adding rich, high-cardinality metadata—like user IDs, request IDs, and feature flags—to your logs, metrics, and traces, you unlock the ability to slice and dice your data in powerful ways and ask much more interesting questions of your system.

The biggest lie you’ve been told about the “three pillars of observability”.

The lie is that if you are just collecting metrics, logs, and traces, you are “doing observability.” A team had all three data types but they were in separate, siloed tools. They couldn’t correlate a spike in a metric with a specific log message or a trace. Observability is not about having three different pillars; it’s about having a single, unified platform where you can seamlessly pivot between these different data types to understand the “why” behind an issue. The value is in the connection, not the collection.

I wish I knew this about the importance of high-cardinality data when I was choosing a monitoring solution.

When I chose my first monitoring tool, it was great at tracking system-level metrics like CPU utilization. But I couldn’t track metrics on a per-user basis. I couldn’t see the performance for a specific customer ID because that was “high-cardinality” data, and the tool wasn’t designed for it. I wish I had known that true observability requires a platform that can handle high-cardinality dimensions, allowing you to break down your data by unique identifiers like user ID, request ID, or shopping cart ID.

I’m just going to say it: You can’t fix what you can’t see.

An e-commerce site was randomly slow for some users. The team had no idea why. They had no detailed visibility into their production systems. They were flying blind. They implemented a modern observability platform. Within a day, they discovered that a specific, inefficient database query was being triggered by users from a certain country. They couldn’t fix the problem until they could see it. Observability is the practice of making your complex, opaque systems visible and understandable.

99% of SREs make this one mistake when setting up their alerts.

The most common mistake is creating alerts that are not actionable. A Site Reliability Engineer (SRE) would get paged at 3 AM with an alert that said, “CPU utilization is high on server X.” This alert told him the symptom, but not the cause or the impact. He still had to do a lot of work to figure out what to do. A good alert is actionable. It should be tied to a specific user-facing symptom (e.g., “checkout latency is high”) and should link to a playbook that tells the on-call engineer exactly what steps to take.
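As a sketch of what an actionable alert can look like, here is a Prometheus-style alerting rule (Prometheus and Alertmanager are assumed, and `checkout_latency_seconds` is a hypothetical histogram metric): it alerts on the user-facing symptom and links a runbook.

```yaml
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutLatencyHigh
        # p95 checkout latency over the last 5 minutes exceeds 2 seconds.
        expr: histogram_quantile(0.95, sum(rate(checkout_latency_seconds_bucket[5m])) by (le)) > 2
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "p95 checkout latency above 2s for 10 minutes"
          runbook_url: "https://wiki.example.com/runbooks/checkout-latency"  # placeholder URL
```

Note what it does not alert on: CPU. The page fires on the symptom users feel, and the runbook link tells the on-call engineer where to start.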

This one small action of implementing structured logging will change your ability to debug complex systems forever.

A developer used to write log messages as plain text strings, like “User logged in.” This was difficult to search and analyze. She switched to structured logging. Now, her log messages were in a JSON format with key-value pairs, like {"event": "user_login", "user_id": 123, "source_ip": "1.2.3.4"}. This one small change made her logs machine-readable. She could now easily search, filter, and create visualizations based on her log data, which dramatically accelerated her debugging process.
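A minimal sketch of structured logging with Python's standard logging module; the JsonFormatter class and the field names are illustrative, not a specific library's API:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (one line per event)."""
    def format(self, record):
        payload = {"level": record.levelname, "event": record.getMessage()}
        # Merge any structured fields passed via the `extra` argument.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Key-value context instead of a free-text string:
logger.info("user_login", extra={"fields": {"user_id": 123, "source_ip": "1.2.3.4"}})
```

Because every line is valid JSON, a log pipeline can index `user_id` directly, and "show me every event for user 123" becomes a query instead of a grep expedition.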

The reason you can’t find the root cause of an outage is because you only have monitoring, not observability.

During an outage, a team with a monitoring system could see that their error rate was high. But they had no idea why. They were just looking at the symptom. A team with an observability platform could see the spike in errors, drill down to the specific traces that were failing, see the exact line of code that was throwing the error, and see the full context of the request. Monitoring tells you that something is broken; observability helps you understand why it’s broken.

If you’re still only looking at CPU and memory graphs, you’re losing sight of what your users are actually experiencing.

An operations team was proud of their monitoring dashboard, which showed that all of their servers had low CPU and memory usage. They thought everything was healthy. Meanwhile, the customer support team was being flooded with complaints about the website being slow. The team was monitoring the health of the servers, not the health of the user experience. By implementing front-end performance monitoring and tracking user-centric metrics like page load time, they were able to get a much more accurate picture of their system’s actual performance.

Site Reliability Engineering (SRE)

Use SLOs (Service Level Objectives) and error budgets, not just 100% uptime goals.

A development team was afraid to release new features because the operations team had an unrealistic goal of 100% uptime. There was no room for error. They adopted an SRE approach. They defined a Service Level Objective (SLO) of 99.9% uptime. This gave them an “error budget”—a small, acceptable amount of downtime or risk they could “spend” each month. This allowed the development team to innovate and release new features more quickly, as long as they stayed within their budget.
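The error-budget arithmetic behind a 99.9% SLO is simple:

```python
# Turning an SLO into a concrete error budget over a 30-day window.
slo = 0.999                            # 99.9% availability target
window_minutes = 30 * 24 * 60          # 43,200 minutes in the window
error_budget_minutes = window_minutes * (1 - slo)
print(round(error_budget_minutes, 1))  # 43.2 minutes of downtime the team may "spend"
```

Those 43.2 minutes a month are the currency of the whole arrangement: risky releases, experiments, and planned maintenance all draw from the same budget, and when it is exhausted, reliability work takes priority.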

Stop blaming individuals for outages. Do conduct blameless postmortems to learn from failure instead.

An outage occurred, and the company’s leadership immediately asked, “Whose fault was this?” The engineer who made the change was blamed, creating a culture of fear. A company with an SRE culture had a different approach. After an outage, they conducted a “blameless postmortem.” The goal was not to find who to blame, but to understand the systemic reasons why the failure occurred. They focused on improving the process and the technology to prevent the same class of error from happening again.

The #1 tip for creating a successful SRE culture in your organization.

The most important tip is to get buy-in from leadership to give the SRE team the authority to say “no.” An SRE team identified that the system was becoming too unstable due to the rapid pace of new feature releases. Because they had the authority from leadership, they were able to temporarily halt new releases and force the development teams to spend a sprint focused exclusively on reliability and fixing technical debt. Without this authority, an SRE team is just a glorified operations team with no power to enforce reliability standards.

The biggest lie you’ve been told about SRE being “just DevOps by another name”.

The lie is that SRE and DevOps are the same thing. While they share many of the same principles, like automation and collaboration, there is a key difference. DevOps is a culture and a set of practices. SRE (Site Reliability Engineering) is a specific, prescriptive implementation of that culture, originating from Google. SRE provides a set of concrete practices—like SLOs, error budgets, and blameless postmortems—for how to run a reliable production system. SRE is how you do DevOps.

I wish I knew this about the importance of toil automation when I was a traditional sysadmin.

As a sysadmin, I spent most of my day on “toil”—manual, repetitive, tactical work like resetting passwords, provisioning servers, and responding to simple alerts. It was exhausting and unfulfilling. I wish I had known about the SRE principle of automating toil. An SRE’s goal is to automate themselves out of a job. They are given the time to write software that automates the repetitive tasks, which frees them up to work on more strategic, long-term engineering projects.

I’m just going to say it: Your SRE team should be writing more code than your operations team.

A traditional operations team spends most of its time reacting to incidents and performing manual tasks. A true SRE team is composed of software engineers who apply software engineering principles to operations problems. A core tenet of SRE is a “50% cap on ops work.” At least half of an SRE’s time must be spent on engineering projects—building automation, improving monitoring, or increasing the system’s reliability. If your SRE team is just manually putting out fires all day, they are not doing SRE.

99% of organizations make this one mistake when trying to implement SRE.

The most common mistake is to simply rename their existing traditional operations team the “SRE team” without changing anything else. The team still has the same old responsibilities, the same old tools, and no authority to enforce reliability. They are just sysadmins with a fancy new title. A true SRE implementation requires a fundamental cultural shift, a commitment to engineering, and the empowerment of the team to prioritize reliability over new features when necessary.

This one small action of defining an error budget will change the conversation between your development and operations teams forever.

The development team wanted to move faster and release new features. The operations team wanted to maintain stability and not have any outages. They were in constant conflict. They took one small action: they jointly agreed on an SLO and an error budget. This created a data-driven framework for their conversation. Now, the development team could release as fast as they wanted, as long as they didn’t “spend” the entire error budget. It aligned both teams around a single, shared goal of balancing innovation and reliability.

The reason your team is constantly firefighting is because you haven’t invested in SRE principles.

An operations team felt like they were on a treadmill. They would spend all day and night responding to alerts and manually fixing problems, only to have the same problems reappear the next week. They were constantly firefighting. The reason was a lack of investment in engineering. The SRE approach would have given them the time and mandate to automate the repetitive fixes and to engineer long-term solutions that would prevent the fires from starting in the first place.

If you’re still manually responding to every alert, you’re losing the opportunity to engineer long-term solutions.

An on-call engineer was paged at 3 AM because a server had run out of disk space. He manually logged in and cleared some old log files. The problem was solved for now. An SRE, faced with the same alert, would have a different approach. She would write a small piece of automation that automatically cleans up old log files, and she would improve the monitoring to provide an earlier warning before the disk filled up. She would engineer a permanent solution so that that specific alert would never happen again.
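A sketch of the kind of small automation the SRE would write instead; the file pattern and the 14-day retention policy are invented for illustration:

```python
import time
from pathlib import Path

def clean_old_logs(log_dir, max_age_days=14):
    """Delete *.log files older than max_age_days; return the paths removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in Path(log_dir).glob("*.log"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Run on a schedule (cron, systemd timer, or a cloud function), this turns a recurring 3 AM page into a task nobody thinks about again.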

Networking

Use a software-defined networking (SDN) approach for agility, not manual switch and router configuration.

A network engineer at a large data center needed to set up a new network segment for an application. It was a slow, manual process that involved logging into dozens of individual switches and routers to configure VLANs and access control lists. A company using a software-defined networking (SDN) approach had a centralized controller. The engineer could define the network policy once, and the controller would automatically configure all the underlying hardware. This made network provisioning dramatically faster and less error-prone.

Stop doing traditional, hardware-based networking. Do embrace the principles of network automation instead.

A network team’s workflow was based on manually logging into devices via the command-line interface (CLI). Every change was slow and risky. They embraced network automation. They started using tools like Ansible to automate their configuration changes. Instead of manually typing commands on 50 different switches, they could write a single playbook and push the change to all of them at once, in a consistent and repeatable way. This was a fundamental shift from being a “CLI jockey” to being a network automation engineer.
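A sketch of what such a playbook can look like, assuming Cisco IOS switches and the `cisco.ios` Ansible collection; the inventory group and VLAN details are invented:

```yaml
# Push the same VLAN configuration to every switch in one inventory group.
- name: Standardize VLAN 42 across access switches
  hosts: access_switches        # placeholder inventory group
  gather_facts: false
  tasks:
    - name: Ensure VLAN 42 exists with the right name
      cisco.ios.ios_config:
        parents: vlan 42
        lines:
          - name user-segment   # placeholder VLAN name
```

One playbook run replaces fifty CLI sessions, and because Ansible modules are idempotent, re-running it against an already-correct switch changes nothing.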

The #1 secret for designing a scalable and resilient data center network.

The secret is to use a “leaf-spine” architecture, not a traditional three-tier model. The old model had bottlenecks at the aggregation and core layers. A leaf-spine architecture is a flatter, more scalable design. Every “leaf” switch (where the servers connect) is connected to every “spine” switch. This creates a huge number of paths for traffic to travel, which dramatically increases the bandwidth and provides a high degree of resiliency. If any single link or switch fails, the traffic can be easily rerouted.
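The fabric math behind that resiliency is simple enough to sketch (the sizes here are illustrative):

```python
# Back-of-envelope leaf-spine fabric math; sizes are illustrative.
spines, leaves = 4, 16
uplinks_per_leaf = spines                  # every leaf connects to every spine
equal_cost_paths = spines                  # leaf -> any spine -> leaf
spine_failure_bandwidth_loss = 1 / spines  # a 25% degradation, not an outage
```

Losing a spine removes one of the equal-cost paths and a proportional slice of fabric bandwidth; traffic reroutes over the remaining spines automatically.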

The biggest lie you’ve been told about the “end of the network engineer”.

The lie is that with the rise of the cloud and automation, the role of the network engineer is becoming obsolete. The reality is that the role is not disappearing; it’s evolving. The network engineer of the future will spend less time manually configuring individual devices and more time writing code, building automation, and designing complex hybrid-cloud networks. The demand for engineers who understand both traditional networking principles and modern automation skills is higher than ever.

I wish I knew this about the power of Python for network automation when I was studying for my CCNA.

When I was studying for my networking certification, the entire curriculum was focused on the command-line interface. I spent hundreds of hours learning how to manually configure Cisco devices. I wish I had also spent time learning Python. A few years later, I realized that I could write a simple Python script that could log into a hundred devices and collect the information I needed, a task that would have taken me all day to do manually. The ability to code is the new superpower for network engineers.

I’m just going to say it: The command-line interface (CLI) is an outdated way to manage a network at scale.

For a single router, the CLI is fine. But for a network with hundreds or thousands of devices, it is an incredibly inefficient and error-prone management interface. A network engineer who has to manually type commands on 500 different switches to make a change is not scalable. Modern network management should be done through APIs and automation tools. The CLI is a legacy interface that should be used for troubleshooting, not for large-scale configuration management.

99% of network engineers make this one mistake when they start learning automation.

The most common mistake is trying to just replicate their manual CLI workflow in a script. A network engineer will write a script that just logs into a device and sends the same series of CLI commands they used to type by hand. A better approach is to use tools that interact with the device’s API. This allows them to work with structured data (like JSON) instead of just “screen scraping” the text output of the CLI, which is a much more robust and reliable way to build automation.
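A toy contrast between the two approaches; the CLI line and the JSON payload are made-up samples, not real device output:

```python
import json
import re

# Screen scraping: a fragile regex over free-text CLI output (made-up sample line).
cli_output = "GigabitEthernet0/1 is up, line protocol is up"
match = re.search(r"^(\S+) is (\S+),", cli_output)
iface, status = match.group(1), match.group(2)

# API-based: the device returns structured JSON, so no parsing heuristics are needed.
api_response = '{"interface": "GigabitEthernet0/1", "oper_status": "up"}'
data = json.loads(api_response)
status_from_api = data["oper_status"]
```

The regex breaks the moment a vendor reformats the output; the JSON field is a stable contract. That is the practical difference between scripting the CLI and building on an API.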

This one small action of using a source of truth (like NetBox) for your network configuration will change your operational efficiency forever.

A network team managed their IP addresses in a spreadsheet and their device configurations in a collection of text files. It was chaos. They implemented a “source of truth” database. This central database contained all the intended state of their network—every IP address, every VLAN, every interface. Their automation scripts could now pull data from this source of truth to generate device configurations and run validation checks. This one small action created a single, reliable foundation for their entire network automation strategy.
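A toy illustration of the pattern, with a plain Python dict standing in for a real source-of-truth system like NetBox; the hostnames, VLANs, and template are invented:

```python
# The intended state of the network lives in data, not in device configs.
source_of_truth = {
    "sw-access-01": {"mgmt_ip": "10.0.0.11", "vlans": [10, 20]},
    "sw-access-02": {"mgmt_ip": "10.0.0.12", "vlans": [10, 30]},
}

def render_config(hostname):
    """Generate a device configuration from the source of truth (toy template)."""
    device = source_of_truth[hostname]
    lines = [f"hostname {hostname}"]
    lines += [f"vlan {v}" for v in device["vlans"]]
    return "\n".join(lines)
```

In a real deployment the dict would be replaced by API calls to the source-of-truth system and the template by something like Jinja2, but the principle is the same: configurations are generated from data, never hand-edited.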

The reason your network changes are so slow and risky is because you’re not using automation.

A network engineer needed to update the access control list on 100 different switches. Doing it manually would take all day and there was a high risk of making a typo on one of the switches, which could cause an outage. With an automation script, she could push the change to all 100 switches in five minutes, with a guarantee that the change was applied consistently and correctly everywhere. Automation doesn’t just make network changes faster; it makes them significantly safer.

If you’re still logging into each network device individually to make a change, you’re losing your sanity.

Imagine having to make the same, simple change to 200 different network switches. The process of logging into each device, typing the same commands over and over again, and then logging out is mind-numbingly tedious and a recipe for human error. This is the daily reality for many network engineers who haven’t embraced automation. Using a simple automation tool can turn this soul-crushing, multi-day task into a single, five-minute command.

Storage

Use object storage for unstructured data, not a traditional file system (NAS).

A company was trying to store billions of image and video files on a traditional network-attached storage (NAS) system. The file system was struggling to handle the sheer number of files, and the performance was terrible. They switched to an object storage system. Object storage is designed to handle massive amounts of unstructured data at scale. Each file is an “object” with its own unique ID, which can be retrieved via a simple web API. It was a much more scalable and cost-effective solution for their needs.

Stop doing complex, multi-tiered storage architectures manually. Do use software-defined storage (SDS) to automate data placement instead.

A storage administrator used to spend hours manually migrating data between different tiers of storage. He would move “hot,” frequently accessed data to expensive, high-performance flash storage, and “cold,” archival data to cheaper, slower disk. He then implemented a software-defined storage (SDS) solution. The SDS platform could automatically and dynamically move the data between the different tiers based on real-time access patterns. This automated data placement saved him a huge amount of time and optimized both performance and cost.
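The core placement decision can be sketched in a few lines; the seven-day hot window is an invented policy for illustration, not any product's default:

```python
import time

def choose_tier(last_access_ts, now=None, hot_window_days=7):
    """Place data on flash if accessed within the hot window, else on disk."""
    now = now if now is not None else time.time()
    age_days = (now - last_access_ts) / 86400
    return "flash" if age_days <= hot_window_days else "disk"
```

An SDS platform applies a rule like this continuously and per-block, demoting data as it cools and promoting it again when access patterns change, with no administrator in the loop.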

The #1 hack for optimizing your enterprise storage costs.

The secret is aggressive data reduction through deduplication and compression. A company’s storage array was almost full. They were about to spend a huge amount of money on a new array. They enabled data deduplication and compression on their existing system. The system was able to find and eliminate redundant copies of data and shrink the size of the remaining files. They were able to achieve a 3-to-1 data reduction ratio, which freed up a huge amount of capacity and delayed the need for a costly hardware upgrade for over a year.
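The arithmetic behind a 3-to-1 reduction ratio (figures illustrative):

```python
# What a 3:1 dedup + compression ratio means for physical capacity.
logical_tb = 90          # data as written by applications
reduction_ratio = 3.0    # observed dedup + compression ratio (illustrative)
physical_tb = logical_tb / reduction_ratio
print(physical_tb)       # 30.0 TB actually consumed on the array
```

In other words, two-thirds of the array's consumed capacity comes back for free, which is why enabling data reduction is usually the first move before any hardware purchase.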

The biggest lie you’ve been told about the “death of the SAN”.

The lie is that with the rise of cloud and object storage, the traditional Storage Area Network (SAN) is dead. For many modern, cloud-native applications, this is true. But for a large enterprise that has a critical, high-performance database running on-premise that requires ultra-low latency and high IOPS, a high-performance Fibre Channel SAN is still the best and most reliable tool for the job. The SAN is not dead; its role has just become more specialized.

I wish I knew this about the performance implications of different storage protocols (iSCSI vs. Fibre Channel vs. NVMe-oF).

When I was a junior sysadmin, I thought all block storage was the same. I didn’t understand the difference between the different protocols used to connect to it. I wish I had known that Fibre Channel provides very reliable, low-latency performance, which is great for critical databases. I wish I had known that iSCSI, which runs over standard Ethernet, is cheaper and more flexible. And I wish I had known about the emerging NVMe-oF protocol, which provides the ultra-low latency of a local SSD over a network.

I’m just going to say it: For most modern applications, a distributed storage system is the right choice.

A traditional storage system is a monolithic box. If that box fails, your application goes down. A modern, distributed storage system, like Ceph, spreads your data across a cluster of many commodity servers. There is no single point of failure. If one of the servers fails, the system can automatically heal itself and continue operating without interruption. This software-defined, distributed approach provides a level of scalability and resiliency that is very difficult to achieve with a traditional, monolithic storage array.

99% of IT architects make this one mistake when designing their storage solution.

The most common mistake is focusing only on the capacity (how many terabytes) and not on the performance (how many IOPS). An architect will buy a massive, high-capacity storage array made of slow, spinning disks for a virtual desktop infrastructure (VDI) project. When hundreds of users try to log in at the same time, the system grinds to a halt because the storage can’t handle the high number of random I/O operations. They had plenty of capacity, but not nearly enough performance.
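The sizing check that architect skipped fits on a napkin. The per-user and per-disk figures below are rough rule-of-thumb assumptions for illustration, not vendor specifications.

```python
# Back-of-the-envelope IOPS sizing: does the array have enough performance,
# not just enough terabytes? All figures are illustrative assumptions.

def required_iops(users, iops_per_user=20):   # login storms run far higher
    return users * iops_per_user

def array_iops(disk_count, iops_per_disk):
    return disk_count * iops_per_disk

need = required_iops(500)               # 500 concurrent VDI users
have_hdd = array_iops(24, 150)          # 24 x 7.2k spinning disks, ~150 IOPS each
have_ssd = array_iops(24, 50_000)       # 24 x enterprise SSDs
print(need, have_hdd, have_ssd)         # plenty of TB, nowhere near enough HDD IOPS
```

The spinning-disk array delivers roughly a third of the required IOPS despite having ample capacity, which is exactly the login-storm meltdown described above.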

This one small action of implementing data deduplication and compression will change your storage capacity planning forever.

A company was constantly buying new storage arrays because they were running out of space. They finally enabled data reduction features on their existing system. They were shocked to find that they were storing dozens of identical copies of the same large virtual machine images. Data deduplication was able to store only a single copy of these redundant blocks. This one small action immediately freed up over 50% of their storage capacity, dramatically changing their budget and their capacity planning process.

The reason your application is so slow is because of a storage I/O bottleneck.

An application’s servers had plenty of CPU and memory, but the users were still complaining that it was slow. The reason was a storage bottleneck. The application was writing a huge number of small log files, and the underlying spinning disk storage couldn’t keep up with the high number of random I/O operations per second (IOPS). By moving the application’s data to a high-performance, all-flash storage array, they were able to eliminate the storage bottleneck and dramatically improve the application’s performance.

If you’re still buying proprietary storage hardware, you’re losing flexibility and paying a premium.

A company was locked into a single, proprietary storage vendor. They were paying a huge premium for the hardware and the expensive annual support contracts. They switched to a software-defined storage (SDS) solution. The SDS software could run on any standard, commodity server hardware from any manufacturer. This gave them the flexibility to choose the best hardware for their needs and avoid vendor lock-in, all while significantly lowering their total storage costs.

Virtualization & Hypervisors

Use lightweight containers for application virtualization, not just heavyweight virtual machines for every workload.

A developer needed to run ten different, small web applications. The old way would have been to spin up ten separate virtual machines (VMs), each with its own full operating system. It was a huge waste of resources. The modern approach is to use containers. She was able to run all ten applications as lightweight, isolated containers on a single VM. Because the containers shared the host operating system’s kernel, the overhead was a fraction of the traditional VM approach.

Stop managing individual VMs. Do use an orchestration platform like Kubernetes to manage your containerized applications instead.

A sysadmin was responsible for a large application that ran on a cluster of 50 different virtual machines. When a VM failed, he would have to manually restart it. It was a constant game of whack-a-mole. The company containerized the application and started managing it with Kubernetes. Now, if a container or a VM failed, the orchestration platform would automatically detect the failure and restart it on a healthy node in the cluster. The system became self-healing, and the sysadmin could finally get a good night’s sleep.

The #1 secret for optimizing your VM density on a hypervisor.

The secret is to take advantage of memory-saving techniques like Transparent Page Sharing (TPS). A hypervisor host was running ten identical Windows virtual machines. Without TPS, each VM would have its own separate copy of the common operating system files in RAM. By enabling TPS, the hypervisor was able to find the identical pages of memory across all the VMs and store only a single copy. This dramatically reduced the total memory usage on the host, allowing the administrator to run more virtual machines on the same physical hardware. (One caveat: recent vSphere releases disable page sharing between different VMs by default for security reasons, so inter-VM TPS may need to be explicitly enabled.)
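What TPS does can be illustrated with a toy model: hash the memory pages of every VM and keep only one physical copy of each identical page. The page counts below are invented for illustration.

```python
# Toy illustration of transparent page sharing: identical pages across VMs
# collapse to a single physical copy.

def shared_memory_savings(vm_pages):
    """vm_pages: one list of page contents per VM. Returns (requested, physical)."""
    all_pages = [page for vm in vm_pages for page in vm]
    return len(all_pages), len(set(all_pages))

vms = []
for n in range(10):                                       # ten identical Windows VMs
    pages = [f"os-page-{i}" for i in range(1000)]         # shared OS pages
    pages += [f"vm{n}-page-{i}" for i in range(50)]       # VM-private pages
    vms.append(pages)

total, unique = shared_memory_savings(vms)
print(total, unique)  # 10500 pages requested, 1500 physical copies needed
```

Real hypervisors do this with per-page hashing plus a byte-for-byte verification before collapsing, and use copy-on-write so a VM that modifies a shared page silently gets its own copy back.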

The biggest lie you’ve been told about containers being “less secure” than VMs.

The lie is that because containers share a single host kernel, they are inherently less secure than virtual machines, which have their own isolated kernels. While a poorly configured container environment can be insecure, a modern container platform with features like seccomp profiles, AppArmor, and gVisor can provide a level of security and isolation that is comparable to, and in some cases even stronger than, traditional VMs. Container security is a solvable problem, and it is not a valid reason to avoid using them.

I wish I knew this about the difference between Type 1 and Type 2 hypervisors when I first started with virtualization.

When I first started, I would run a “Type 2” hypervisor, like VirtualBox or VMware Workstation, on top of my Windows laptop to run a Linux virtual machine. It was slow. I wish I had known that a “Type 1” or “bare-metal” hypervisor, like ESXi or Hyper-V, runs directly on the server hardware. By eliminating the overhead of the underlying host operating system, a Type 1 hypervisor provides much better performance and scalability, which is why it’s the standard for all production data center virtualization.

I’m just going to say it: The future of virtualization is not VMs, but containers and serverless.

For two decades, the virtual machine was the fundamental unit of compute in the data center. That era is ending. For most new, cloud-native applications, containers are a much more lightweight and efficient way to package and run code. And for many event-driven workloads, serverless platforms are abstracting away the infrastructure entirely. While VMs will continue to exist for legacy applications, the center of gravity for application development has shifted to these higher levels of abstraction.

99% of sysadmins make this one mistake when allocating resources to their VMs.

The most common mistake is overprovisioning. A developer will ask for a virtual machine for their new application. To be “safe,” the sysadmin will give it 8 vCPUs and 32GB of RAM. The application then runs for a year, using only 5% of those allocated resources. This massive overprovisioning, multiplied across hundreds of VMs, leads to a huge amount of wasted hardware capacity. A better approach is to start with a small allocation and then monitor the VM’s actual usage to right-size it over time.
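The right-sizing step can be sketched as a simple rule: size to the observed peak plus a safety margin, not to the requester's guess. The 30% headroom factor is an assumption for illustration, not a universal standard.

```python
# Sketch of right-sizing a VM from observed usage rather than the request.
import math

def right_size(peak_used, headroom=1.3):
    """Recommend an allocation from the observed peak, rounded up."""
    return math.ceil(peak_used * headroom)

# A VM given 8 vCPUs / 32 GB that peaked at 0.4 vCPU and 2.1 GB over a year:
print(right_size(0.4), "vCPU,", right_size(2.1), "GB")  # 1 vCPU, 3 GB
```

Multiplied across hundreds of VMs, that is the difference between buying another host and reclaiming the capacity you already own.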

This one small action of installing guest tools in your VMs will change their performance forever.

A sysadmin deployed a new virtual machine but noticed that the mouse movement was jerky and the network performance was poor. He had forgotten to take one small but critical step: installing the hypervisor’s “guest tools” (like VMware Tools or Hyper-V Integration Services) inside the virtual machine’s operating system. These tools contain specialized drivers that allow the virtual machine to communicate much more efficiently with the underlying hypervisor, which dramatically improves the VM’s graphics, networking, and overall performance.

The reason your virtualization host is so slow is because of resource contention between your VMs.

A virtualization host was performing poorly, even though the overall CPU and memory usage looked okay. The reason was storage I/O contention. The host had a few “noisy neighbor” virtual machines that were running very disk-intensive applications. These VMs were consuming all of the available storage IOPS, which was starving the other, less aggressive VMs on the same host and making them unresponsive. Identifying and isolating these “noisy neighbors” is a key part of managing a multi-tenant virtual environment.

If you’re still running a single application on a physical server, you’re losing a massive amount of hardware resources.

In the old days, every single application would get its own dedicated physical server. A typical server would run at only 5-10% of its total CPU and memory capacity, meaning that 90% of the expensive hardware resources were being completely wasted. Virtualization changed this. By running multiple virtual machines on a single physical server, companies were able to dramatically increase their server utilization rates, which led to a massive reduction in hardware costs, power consumption, and data center footprint.

Disaster Recovery (DR)

Use a cloud-based disaster recovery (DRaaS) solution, not a secondary physical data center.

A company was paying a fortune to maintain a second, identical physical data center for disaster recovery. It was a massive capital expense, and the data center sat idle 99% of the time. They switched to a Disaster Recovery as a Service (DRaaS) solution. They could now replicate their critical systems to the cloud for a fraction of the cost. In the event of a disaster, they could spin up their recovery environment in the cloud on a pay-as-you-go basis. This transformed DR from a huge capital expenditure to a manageable operational expense.

Stop doing annual DR tests. Do automate your failover testing on a regular basis instead.

A company would conduct a massive, all-hands-on-deck disaster recovery test once a year. It was a stressful, disruptive, and expensive event. A modern company with a cloud-based DR solution could automate this process. They could write a script that would automatically fail over a non-critical application to the DR site, run a series of tests to ensure it was working, and then fail back, all without any human intervention. This allowed them to test their DR plan every single week, not just once a year.
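The weekly automated test described above follows a fixed sequence of steps. The sketch below encodes that sequence; `run_step` is a stub, and a real version would call your DRaaS provider's API and raise on any failure.

```python
# Sketch of an automated, repeatable DR failover test. Step names describe
# common DR practice; the execution is stubbed for illustration.

FAILOVER_TEST_STEPS = [
    "snapshot current replica state",
    "boot app-nonprod in DR site (isolated network)",
    "run smoke tests against DR copy",
    "record RTO/RPO metrics",
    "tear down DR instance and fail back",
]

def run_step(step):
    # Stub: pretend every step succeeds; real code would raise on failure.
    return (step, "ok")

def run_failover_test(steps):
    results = [run_step(s) for s in steps]
    return all(status == "ok" for _, status in results), results

passed, results = run_failover_test(FAILOVER_TEST_STEPS)
print("DR test passed:", passed)
```

Booting the DR copy on an isolated network is the detail that makes weekly testing safe: the test environment can never collide with production traffic.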

The #1 secret for a near-zero RTO (Recovery Time Objective) and RPO (Recovery Point Objective).

The secret is continuous data replication and automated failover. An organization with a critical application needed to ensure they could recover from a disaster in minutes (RTO) with almost no data loss (RPO). They used a technology that continuously replicated their data and virtual machines to a secondary site in real-time. They also had an automated workflow that could, with a single click, fail over their entire application to the recovery site. This combination of continuous replication and automation is the key to achieving near-zero downtime.

The biggest lie you’ve been told about tape backups.

The lie is that tape backups are a dead technology. While tape is not suitable for fast, operational recovery, it is still one of the most cost-effective and reliable solutions for long-term, archival data retention and for creating an “air-gapped” copy of your data. An air-gapped tape backup, stored physically offline, is one of the only ways to protect your data from an online threat like a ransomware attacker who is trying to delete all of your backups.

I wish I knew this about the importance of a DR plan that covers people and processes, not just technology.

I was part of a team that had a fantastic technical disaster recovery plan. We could fail over our entire data center in under an hour. We had a real disaster, and the technical failover worked perfectly. But our recovery was still a mess. We had never created a plan for how our employees would communicate, where they would work from, or how we would contact our customers. I wish I had known that a DR plan is not just about the technology; it’s about the people and the processes.

I’m just going to say it: Your backup is worthless if you haven’t tested your restore process.

A company diligently backed up their critical database every single night. The backup software always reported “Success.” They were hit with a ransomware attack and went to restore their data. They discovered that the backup files were corrupted and had been for months. Their backups were completely worthless. A backup is not a backup until you have successfully performed a full restore from it. Regularly testing your restore process is the most critical and often the most overlooked part of any data protection strategy.

99% of businesses make this one mistake in their disaster recovery plan.

The most common mistake is failing to update the plan. A company created a detailed disaster recovery plan. Two years later, their production environment had changed significantly—new applications, new servers, new dependencies. But they had never updated the DR plan. When they had to use it, it was completely out of date and useless. A DR plan is a living document. It must be reviewed and updated every single time there is a significant change to the production environment.

This one small action of replicating your critical data to a different geographic region will change your resilience to outages forever.

A company was running their entire business out of a single cloud data center. They were hit by a major, region-wide outage caused by a hurricane, and their business was offline for a full day. They took one small action: they started replicating their critical databases and applications to a different cloud region, a thousand miles away. Now, if their primary region goes down, they can fail over to the secondary region and continue operating. This geographic redundancy is the key to surviving large-scale disasters.

The reason your business will fail after a disaster is because your DR plan is sitting on a shelf collecting dust.

A company spent a lot of money creating a detailed, 200-page disaster recovery plan. They put it in a binder and put it on a shelf. They never trained their employees on it, and they never tested it. When a real disaster struck, nobody knew what to do or where to even find the plan. A DR plan is not a document; it’s a state of readiness. It requires regular training, testing, and practice to ensure that when the worst happens, your team can execute the plan calmly and effectively.

If you’re still not using immutable backups, you’re losing your last line of defense against ransomware.

A company was hit by a sophisticated ransomware attack. The attacker not only encrypted their production data, but also gained access to their backup system and deleted all of their backups. They had no way to recover and were forced to pay the ransom. A smarter company used immutable backups. An immutable backup, once written, cannot be altered or deleted by anyone—not even an administrator—for a set period of time. This provides a clean, guaranteed-to-be-uncorrupted copy of your data that can survive even the most destructive ransomware attack.

IT Automation

Use a centralized automation platform like Ansible or SaltStack, not a collection of disparate scripts.

An IT department had a collection of hundreds of different shell scripts and PowerShell scripts, written by different people over many years. There was no version control and no consistency. It was chaos. They decided to standardize on a centralized automation platform. By using a tool like Ansible, they were able to create a single, version-controlled repository of reusable automation “playbooks.” This made their automation more reliable, maintainable, and easier for the whole team to collaborate on.

Stop doing repetitive manual tasks. Do automate everything that can be automated instead.

An IT administrator spent the first hour of every single day manually checking the health of 50 different servers. It was a boring, repetitive, and soul-crushing task. He finally decided to automate it. He wrote a simple script that would check the servers for him and send him a single email report with the status. This one small act of automation saved him five hours a week and freed him up to work on more interesting and valuable projects.
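That morning ritual translates to a very small script. The TCP probe below uses only the standard library; the hostnames are placeholders, and a real version would loop over all 50 servers and send the report via something like `smtplib` instead of printing it.

```python
# Sketch of an automated server health check and summary report.
import socket

def check_tcp(host, port, timeout=2.0):
    """Return True if the host accepts a TCP connection on the port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def build_report(results):
    """results: {(host, port): bool}. Returns a plain-text summary."""
    lines = [f"{'OK  ' if up else 'DOWN'} {host}:{port}"
             for (host, port), up in results.items()]
    down = sum(1 for up in results.values() if not up)
    lines.append(f"-- {down} of {len(results)} checks failing")
    return "\n".join(lines)

# In production, results come from looping check_tcp over your server list:
sample = {("web-01", 443): True, ("db-01", 5432): False}
print(build_report(sample))
```

A TCP connect is a crude liveness signal, but it catches the most common failures (host down, service crashed, firewall change) and is a fine first iteration.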

The #1 tip for building a culture of automation in your IT team.

The secret is to celebrate the small wins and make the value of automation visible. A team leader started an “automation of the week” award. Every week, the team would celebrate the person who had automated the most tedious or time-consuming manual task. They even created a dashboard that showed how many hours of manual work the team had saved through automation that month. This made automation fun, competitive, and clearly demonstrated its value to both the team and to management.

The biggest lie you’ve been told about IT automation taking away jobs.

The lie is that if you automate a task, the person who used to do that task will be fired. A help desk technician used to spend half her day manually resetting user passwords. The team then automated this process. She wasn’t fired. Instead, she was now free to work on more complex and engaging problems, like troubleshooting difficult user issues and training people on new software. Automation doesn’t eliminate jobs; it eliminates the boring parts of jobs, allowing people to focus on higher-value work.

I wish I knew this about the importance of idempotency when I was writing my first automation scripts.

When I wrote my first automation script, it was not “idempotent.” If I ran it once, it would install a piece of software. If I ran it a second time, it would fail because the software was already installed. I wish I had known to write my scripts to be idempotent. An idempotent script can be run a hundred times, and the end result will always be the same. The script will check if the software is already installed, and if it is, it will do nothing. This makes your automation much more robust and reliable.
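Idempotency is easiest to see in code: check the current state first, and only act if the desired state is missing. The config file and setting below are invented examples.

```python
# A minimal idempotent operation: running it once or a hundred times
# leaves the system in exactly the same state.
import os
import tempfile

def ensure_line(path, line):
    """Append `line` to the file only if it is not already present."""
    existing = []
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read().splitlines()
    if line not in existing:
        with open(path, "a") as f:
            f.write(line + "\n")

path = os.path.join(tempfile.mkdtemp(), "app.conf")
for _ in range(100):                 # 100 runs, not 100 duplicate lines
    ensure_line(path, "max_connections=200")

with open(path) as f:
    print(f.read().count("max_connections"))  # 1
```

The non-idempotent version of this script is a plain append, which after a hundred runs has corrupted the config file with a hundred duplicate lines. Tools like Ansible are built around exactly this check-then-act pattern.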

I’m just going to say it: If you have to do a task more than twice, you should automate it.

A developer had to go through a ten-step manual process to set up her local development environment for a new project. She had to do it again a few months later. The third time she needed to do it, she stopped. She spent an afternoon writing a single script that automated the entire ten-step process. This one-time investment of a few hours saved her and every other developer on her team from ever having to waste time on that manual process again.

99% of IT professionals make this one mistake when they start with automation.

The most common mistake is trying to automate a huge, complex process as their very first project. A junior sysadmin will try to automate the entire deployment process for a complex application. They will get overwhelmed by the complexity and give up, concluding that automation is too hard. A better approach is to start with a small, simple, low-risk task. Automate the process of checking server disk space, or restarting a service. These small, early wins will build your confidence and your skills.
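A disk-space check is exactly the kind of small first project this advice points at, and it fits in a dozen lines of standard-library Python. The 85% warning threshold is an example value.

```python
# A good first automation project: check disk usage and flag full volumes.
import shutil

def disk_usage_percent(path="/"):
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def check_disk(path="/", warn_at=85.0):
    pct = disk_usage_percent(path)
    status = "WARN" if pct >= warn_at else "OK"
    return f"{status}: {path} at {pct:.1f}%"

print(check_disk("/"))
```

Run it from cron across your fleet and you have replaced a daily manual chore with something that never forgets and never gets bored.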

This one small action of starting with a small, low-risk automation project will change your team’s confidence in automation forever.

An IT team was skeptical about automation. Their manager took one small action: she identified one, simple, repetitive task—the daily process of checking backup logs—and worked with the team to automate it. The project was a success. The team saw that automation could save them time and reduce manual errors. This one small, successful project completely changed their mindset. They were no longer skeptical; they were excited, and they started actively looking for the next thing they could automate.

The reason your IT department is so slow to respond to requests is because of a lack of automation.

A user submits a ticket to the IT department to get access to a new software application. The request then has to be manually processed by three different people. The whole process takes a week. In a highly-automated IT department, this same request could trigger an automated workflow. The system would get the manager’s approval via email, automatically provision the license, and send the user an email with their login information, all within a few minutes, without any human intervention.

If you’re still manually provisioning user accounts, you’re losing hundreds of hours of productivity per year.

Think about the process of a new employee starting at a company. The IT department has to manually create their account in a dozen different systems: email, HR, payroll, etc. It’s a slow and error-prone process. By integrating these systems and using an IT automation platform, the entire onboarding process can be triggered from a single action in the HR system. When a new employee is added to the HR system, all of their other accounts can be provisioned automatically, securely, and instantly.

Green IT

Use modern, energy-efficient servers and data center designs, not outdated and power-hungry hardware.

A company was running a data center filled with servers that were ten years old. These old servers were consuming a huge amount of electricity and generating a lot of heat, which required even more electricity for cooling. They finally did a hardware refresh, replacing the old servers with modern, energy-efficient models. Their electricity bill was cut in half, and because the new servers ran cooler, their cooling costs also dropped significantly. Green IT is not just good for the planet; it’s good for the bottom line.

Stop running your servers at 10% utilization. Do use virtualization and consolidation to increase your server efficiency.

A typical, non-virtualized data center had hundreds of physical servers, each running a single application, and each server was only using about 10% of its CPU capacity. It was a massive waste of hardware and electricity. By using virtualization, they were able to consolidate those hundreds of applications onto a much smaller number of physical servers, driving the utilization of each server up to 70-80%. This server consolidation had a huge impact on their energy consumption and data center footprint.

The #1 secret for reducing your data center’s PUE (Power Usage Effectiveness).

The secret is to focus on optimizing your cooling. PUE is the ratio of the total energy entering your data center to the energy that actually reaches the IT equipment, so a perfect (and unattainable) score is 1.0. A typical data center’s biggest energy consumer, after the servers themselves, is the air conditioning system. By using modern, efficient cooling techniques—like hot aisle/cold aisle containment or liquid cooling—you can dramatically reduce the amount of energy wasted on cooling, which is the key to lowering your PUE.
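The PUE arithmetic is simple enough to show directly; the kWh figures below are invented to illustrate the before-and-after effect of better cooling.

```python
# PUE: total facility energy divided by IT equipment energy. 1.0 is ideal.

def pue(total_facility_kwh, it_equipment_kwh):
    return total_facility_kwh / it_equipment_kwh

before = pue(2_000_000, 1_000_000)  # half the power feeds cooling/overhead -> 2.0
after = pue(1_250_000, 1_000_000)   # after aisle containment -> 1.25
print(before, after)
```

Going from 2.0 to 1.25 in this example means the same IT load draws 750,000 kWh less from the grid, with no change to the servers at all.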

The biggest lie you’ve been told about the environmental impact of the cloud.

The lie is that the cloud is a magical, green solution. The reality is that the “cloud” is just someone else’s massive, energy-intensive data center. While major cloud providers are investing heavily in renewable energy and are often more efficient than a typical on-premise data center, the rapid growth of cloud computing is still a major driver of global energy consumption. The cloud can be more efficient, but it is not without a significant environmental footprint.

I wish I knew this about the e-waste problem in the IT industry when I was a junior technician.

As a junior technician, my job was to decommission old servers and PCs. I would just put them in a pile to be taken away. I had no idea where they went. I wish I had known about the massive global problem of electronic waste. Many of these devices end up in landfills in developing countries, where toxic materials can leach into the environment. I learned that it’s crucial to work with certified e-waste recycling companies that can safely and responsibly dispose of old IT equipment.

I’m just going to say it: The IT industry has a responsibility to be a leader in sustainability.

The IT industry is one of the most powerful and innovative sectors of the global economy. It is also a major and growing consumer of global electricity. This gives the industry a unique responsibility and opportunity. From designing more energy-efficient chips and building data centers powered by 100% renewable energy to creating software that helps other industries to be more efficient, the tech industry must be at the forefront of developing the solutions needed to address the climate crisis.

99% of companies make this one mistake with their old IT equipment.

The most common mistake is simply storing it in a closet to be forgotten. A company will have a storage room filled with old laptops, servers, and monitors that are five or ten years old. This equipment still contains sensitive company data, and it also contains valuable materials that could be recovered. A responsible company has a clear IT Asset Disposition (ITAD) policy. This involves securely wiping the data from the old devices and then sending them to a certified recycler to be properly disposed of.

This one small action of setting your servers to a power-saving mode during off-peak hours will change your energy bill forever.

A company’s servers were running at full power 24/7, even though their office was only open from 9 to 5. The servers were mostly idle overnight and on weekends. An administrator took one small action: he configured the servers to automatically enter a low-power state outside of business hours. This simple change, which had no impact on the business, significantly reduced the data center’s overall energy consumption and resulted in a noticeable decrease in their monthly electricity bill.
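The policy that administrator configured amounts to deciding a power state from the clock. The 9-to-5, Monday-to-Friday window below reflects this company's schedule, not a universal default, and a real deployment would hook the decision into the servers' power management rather than just print it.

```python
# Sketch of an off-hours power policy: choose the target power state
# from the current time.
from datetime import datetime

def target_power_state(now: datetime):
    business_hours = 9 <= now.hour < 17 and now.weekday() < 5  # Mon=0 .. Fri=4
    return "full-power" if business_hours else "low-power"

print(target_power_state(datetime(2024, 6, 3, 11, 0)))  # Monday 11:00 -> full-power
print(target_power_state(datetime(2024, 6, 8, 11, 0)))  # Saturday     -> low-power
```

The guard worth adding in practice is an exclusion list for servers that must stay at full power (backup windows, overnight batch jobs), which is usually what this "simple change" trips over first.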

The reason your data center is so expensive to run is because of its poor energy efficiency.

A company’s data center was costing them a fortune in electricity. The reason? They were using old, inefficient servers, and their cooling system was poorly designed. Their Power Usage Effectiveness (PUE) was very high, meaning a huge portion of their electricity bill was being wasted on just running the air conditioning. By investing in modern, energy-efficient hardware and optimizing their cooling, they were able to dramatically reduce their operating costs.

If you’re still not considering the environmental impact of your IT purchasing decisions, you’re losing the trust of your customers and employees.

Two companies were bidding for a large contract. The first company had no public sustainability goals. The second company had a clear corporate responsibility program, a commitment to using renewable energy, and a policy of purchasing energy-efficient hardware. The client chose to work with the second company. In today’s world, a company’s commitment to sustainability is becoming an increasingly important factor for customers, partners, and for attracting and retaining top talent.
