People are always asking me why complying with the PCI standards is important, as in, “What’s in it for my company?” So I thought I would take a known, documented breach and walk through where PCI compliance would have made a difference. And for those naysayers who point to the PCI DSS and say that compliance does not matter, I intend to show that compliance does lead to security.
The breach I am going to use is the Wal-Mart breach which was documented in an article in Wired magazine back in October 2009. Wal-Mart has what most professionals would consider a robust control environment. However, what this breach shows is that even with such an environment, a breach can still occur. That is not to say that Wal-Mart did not make mistakes and it is those mistakes that I want to point out so that we can all learn.
For some background, the Wal-Mart breach occurred sometime between 2005 and November 2006, when Wal-Mart discovered it. The good news, at least as far as Wal-Mart has ever publicly shared, was that no cardholder data was ever released as a result of the breach. However, the final report issued internally by Wal-Mart was never shared outside the company, so it is anyone’s guess as to whether that claim is accurate. This was not Wal-Mart’s first cardholder data breach. In the fall of 2005, a small number of Sam’s Club gas station systems were accessed by intruders and around 600 credit card accounts were believed to have been compromised.
The breach was discovered by accident when a server crashed. During the investigation to figure out what had happened to the server, one of the investigators found that L0phtcrack had been installed on the failed server and that it was L0phtcrack that had caused the server to fail. Obviously, L0phtcrack was not an approved application to have installed. That discovery caused an even larger investigation to be launched.
Before we discuss L0phtcrack, let us discuss file integrity monitoring. This incident points out why the PCI DSS mandates file integrity monitoring in requirement 11.5. But just monitoring known files is not enough. This is where an organization needs to go above and beyond in order to better ensure its security. While monitoring critical files, you also need to be monitoring for any new files that might be added to a system. And alerts generated by your file integrity monitoring system need to be reconciled to all changes being made to the systems. Any file addition, change or deletion not documented in a change record needs to be investigated to determine its cause. Based on the timeline, Wal-Mart may have had critical file monitoring in place, but one or more of the following was true: it was monitoring only a limited number of files and directories, it was not monitoring for new files, or alerts were not being followed up in a timely manner.
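To make the reconciliation idea concrete, here is a minimal sketch in Python. It is not how Wal-Mart or any commercial file integrity monitoring product works; the function names (`snapshot`, `diff_snapshots`, `unreconciled`) and the idea of passing in a list of paths covered by approved change records are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map every file under root (by relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_snapshots(baseline, current):
    """Report files added, changed or deleted since the baseline snapshot."""
    added = sorted(set(current) - set(baseline))
    deleted = sorted(set(baseline) - set(current))
    changed = sorted(
        path for path in set(baseline) & set(current)
        if baseline[path] != current[path]
    )
    return {"added": added, "changed": changed, "deleted": deleted}


def unreconciled(diff, approved_paths):
    """Keep only the differences that no documented change accounts for.

    approved_paths is a hypothetical export from a change management
    system listing the files each approved change was expected to touch.
    """
    approved = set(approved_paths)
    return {
        kind: [path for path in paths if path not in approved]
        for kind, paths in diff.items()
    }
```

The point of the third function is that a new file nobody can tie to a change record, such as a password cracker dropped onto a server, stays in the output until someone investigates it.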
Then there is a topic not even mentioned in the PCI DSS but just as important. Root Cause Analysis (RCA) is something that everyone should conduct in the event of a failure and needs to be an activity conducted as part of an organization’s incident response process. Because of their RCA process, Wal-Mart found that L0phtcrack was the cause of the server failure. Since L0phtcrack was not an approved program and was likely installed by an attacker, Wal-Mart personnel broadened their investigation to determine if L0phtcrack was installed on other systems.
While L0phtcrack is an obvious example of a program that should not be installed, the call is not always that easy. This is why requirements 2.2 and 12.3.7 are important: when doing an investigation, the investigators know what they should expect to see installed as well as what was approved, so they can quickly determine whether the server was running only approved software. Again, I am certain that L0phtcrack would not have been part of those standards.
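As a rough illustration of checking a server against its configuration standard, the sketch below compares an installed-software list to an approved list. The function name `unapproved_software` and both input lists are hypothetical; in practice the installed list would come from whatever inventory tool an organization already uses.

```python
def unapproved_software(installed, approved):
    """Return installed package names that are not on the approved list.

    Names are compared without regard to case or surrounding whitespace.
    """
    approved_norm = {name.strip().lower() for name in approved}
    return sorted(
        name for name in installed
        if name.strip().lower() not in approved_norm
    )


# Hypothetical example data, not taken from the Wal-Mart investigation.
approved = ["OpenSSH", "nginx", "PostgreSQL"]
installed = ["openssh", "nginx", "L0phtcrack"]
print(unapproved_software(installed, approved))
```

Anything the function returns is either a gap in the standard or something that should not be on the server, and both cases deserve a follow up.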
That even larger investigation led Wal-Mart to determine that over 800 systems and servers had been compromised or had been targeted for compromise. The compromise was traced back to a remote access VPN account that had belonged to a former Wal-Mart employee in Canada. That account had been used by the intruder to enter Wal-Mart’s network and begin the compromise of their systems. While investigating the breach, Wal-Mart personnel suspended that account and the intruder moved over to another terminated employee’s account. When they disabled the second account, the intruder moved over to a third terminated employee’s account.
Requirement 8.5.4 states that accounts for terminated employees should be disabled or removed immediately, and this was obviously not followed in this case. Requirement 8.5.5 states that inactive accounts should be removed if not used for 90 days or more. Unfortunately, we do not know if any of the accounts had been inactive for more than 90 days. We also do not know if any of these accounts were disabled. However, in such a breach, if the attacker has any sort of administrative access, it takes almost no time to activate a disabled account. That is why an organization needs to remove those accounts as soon as possible, particularly any account that might have administrative privileges.
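The two rules are simple enough to sketch as a periodic check. The `Account` record and `accounts_to_remove` function below are hypothetical, and treating a never-used account as inactive is my own assumption rather than something the requirement spells out.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Account:
    username: str
    terminated: bool            # owner no longer works for the company
    last_login: Optional[date]  # None if the account was never used


def accounts_to_remove(accounts, today, max_inactive_days=90):
    """Return {username: reason} for accounts that should be removed."""
    cutoff = today - timedelta(days=max_inactive_days)
    flagged = {}
    for acct in accounts:
        if acct.terminated:
            # Terminated users are flagged regardless of recent activity.
            flagged[acct.username] = "terminated user"
        elif acct.last_login is None or acct.last_login <= cutoff:
            # Assumption: an account that has never logged in counts as inactive.
            flagged[acct.username] = "inactive"
    return flagged
```

Note that a terminated user's account is flagged even if it logged in yesterday. In a case like this one, recent activity on such an account is the warning sign, not a reason to keep it.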
The investigation quickly focused on one particular Wal-Mart system, point-of-sale (POS). Documentation from the investigation indicates that the intruder(s) were very focused on POS source code, executables, databases and documentation. The intruder(s) were so focused on POS, that they even downloaded the latest technical specifications for Wal-Mart’s POS system. As a result, investigators focused much of their efforts on POS systems at store locations and at corporate.
If not already obvious, investigators inspected log files to determine that the compromise went at least as far back as June 2005. If you want to have a concrete example of why log information and proper time keeping are important and requirement 10 is so focused on log data and time setting, there is no better example than this breach. Thanks to an obviously large retention of log data, Wal-Mart was able to at least figure out when and where the breach started as well as trace the actions of the intruder(s) through their network and systems. It is implied that the time settings on servers and network devices must have been fairly closely synchronized as it is never mentioned if there were time correlation issues in the log data. Had Wal-Mart had to rely on system and event logs that were contained only on the network devices and servers, the when, where and how of this breach might have never been known.
Unfortunately, the log data was not as complete as it could have been, and the Wal-Mart investigators were somewhat stymied in their efforts to better understand the breach. Servers were configured to log only unsuccessful logon attempts. As a result, investigators were not able to track the successful logon attempts of the disabled accounts that were being used by the intruder(s), and therefore could not fully trace the actions of the intruder(s) through their network and systems. A lot of administrators save log space on internal systems by not logging all activities. I have been guilty of this myself, as I used to believe that successful internal logons were not a big deal to miss. However, as the internal threat has become more and more prevalent, I have changed my opinion and now I log everything I possibly can on all systems. Requirement 10.2.5 implies that all use of identification and authentication mechanisms is to be logged, but it does not specifically call out that both successful and unsuccessful attempts are to be logged.
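A small sketch shows why the missing success events matter. The event format (dictionaries with `user` and `outcome` keys) and the function name `suspicious_logons` are hypothetical and do not reflect Wal-Mart's actual log format.

```python
def suspicious_logons(events, disabled_accounts):
    """Return successful logon events for accounts that should not be in use."""
    disabled = set(disabled_accounts)
    return [
        event for event in events
        if event["outcome"] == "success" and event["user"] in disabled
    ]


# Hypothetical events as they would look with full logging enabled.
full_log = [
    {"user": "former_employee", "outcome": "failure"},
    {"user": "former_employee", "outcome": "success"},
    {"user": "current_admin", "outcome": "success"},
]

# The same activity when only unsuccessful attempts are recorded.
failures_only = [e for e in full_log if e["outcome"] == "failure"]

print(len(suspicious_logons(full_log, ["former_employee"])))
print(len(suspicious_logons(failures_only, ["former_employee"])))
```

With failure-only logging the check has nothing to find, which is the position the investigators were in: the one event that proves an intruder got in is the event that was never written.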
The saddest fact of all was that none of this should have been a shock to Wal-Mart IT and security personnel. Almost six months prior to discovering the breach, Wal-Mart’s QSA had completed their PCI assessment and had found numerous areas where Wal-Mart was not compliant. Many of those areas of non-compliance map directly to how the breach occurred.
So what are the lessons that should be learned from this incident?
- Compliance does matter and does result in security. I do not care whether you follow the PCI DSS, FISMA or any other well known security standard. The purpose of all security standards is to provide guidance on how to secure hardware and software so that it is difficult to compromise. If you comply with any of these standards, you greatly enhance your security posture. However, the best security comes down to more than just complying with a standard. If an organization really wants to be secure it will have to go beyond just what the standard requires.
- Security is not perfect. The purpose of any security program is to limit the damage of incidents when they occur so that they do not get out of control. All we can expect to gain out of a security program is minimizing the potential risk that an incident results in a breach of sensitive information. A good friend of mine has a great quote on this point. He always likes to say, “I just want my security program to be sufficient enough that it makes everyone else an easier target than my company.” What security standards give you is the information you need to know where the bar is set so that you can make investments to do that little bit more.
- Most breaches are discovered by accident. It has been my experience that even with great tools and instrumentation, the discovery of a breach or compromise comes down to someone uncovering a piece of information, becoming curious, digging further into the incident and discovering that systems and/or data have been compromised. This is not to say that monitoring and alerting are not worthwhile. It is just that it is very rare that a breach or compromise is uncovered at the moment the initial alert is issued. It takes follow up on all of the alerts to actually uncover the breach or compromise.
- Follow up should be the standard for all alerts and a documented Root Cause Analysis (RCA) process should be followed as part of an organization’s incident response plan. This is where most organizations get sloppy and miss the signs of a breach or compromise. They do not treat all alerts consistently, do not perform the RCA process every time and therefore earlier warnings go undiscovered until the situation gets truly serious such as when a production server crashes.
- If you do not have at least a year’s worth of log data, you are probably going to be in the dark about how, when and where the compromise occurred. There is a lot of pushback from organizations about hanging onto log data, particularly more than three months’ worth. A lot of this comes down to the cost of storing such a huge amount of data. However, had Wal-Mart only had three months’ worth of log data, they never would have known when they had been breached nor the focus of the breach.
- What gets logged is also very important. Wal-Mart’s breach would have been a bit easier to investigate had the log data been complete. Just because you are on the inside of the network is not an excuse to not log everything. As I have pointed out before, log data is IT’s version of a commercial airliner’s flight data recorder. Without all of the data, it can be almost impossible to isolate the cause of a compromise.
- As soon as employees and contractors are terminated, they need to be removed from the access control system. I know that this can cause issues with some operating environments, but there are workarounds to avoid those complications.
- And finally, there are no easy ways to ensure security. Security requires diligence. Extended diligence typically results in tedium which then results in diligence faltering. As a result, organizations interested in maintaining their security need to combat tedium by rotating security and operations personnel through positions so that tedium does not set in. This has an added benefit in improving cross training of personnel.