Adventures In Finding Cardholder Data

On page 10 of the PCI DSS v3 under the heading of ‘Scope of PCI DSS Requirements’, second paragraph, is the following sentence.

 “At least annually and prior to the annual assessment, the assessed entity should confirm the accuracy of their PCI DSS scope by identifying all locations and flows of cardholder data and ensuring they are included in the PCI DSS scope.”

Under the first bullet after that paragraph is the following.

 “The assessed entity identifies and documents the existence of all cardholder data in their environment, to verify that no cardholder data exists outside of the currently defined CDE.”

In the past, organizations would rely on their database and file schemas along with their data flow diagrams, and the project was done.  However, the Council has since clarified that the search for cardholder data (CHD), primarily the primary account number (PAN), needs to be more extensive to prove that PANs have not ended up on systems where they are not expected.

Data Loss Prevention

To deal with requirement 4.2, a lot of organizations invested in data loss prevention (DLP) solutions.  As a result, organizations with DLP have turned those DLP solutions loose on their servers to find PANs and to confirm that PANs do not exist outside of their cardholder data environment (CDE).

Organizations that do this quickly find out three things: (1) the scope of their search is too small, (2) their DLP solution is not capable of looking into databases, and (3) their DLP tools are not as good at finding PANs at rest as they are at finding PANs in motion, such as in an email message.

On the scope side of the equation, it is not just servers that are in scope for this PAN search; it is every system on the network, including infrastructure.  However, for most infrastructure systems such as firewalls, routers and switches it is a simple task to rule them out for storing PANs.  Where things can go awry is with load balancers, proxies and Web application firewalls (WAF), which can end up with PANs inadvertently stored in memory and/or on disk due to how they operate.

Then there is the scanning of every server and PC on the network.  For large organizations, the thought of scanning every server and PC for PANs can seem daunting.  However, the Council does not specify that the identification of CHD needs to be done all at once, so such scanning can be spread out.  The only time constraint is that this scanning must be completed before the organization’s PCI assessment starts.

The second issue that organizations encounter with DLP is that their DLP has no ability to look into their databases.  Most DLP solutions are fine when it comes to flat files such as text, Word, PDF and Excel files, but the majority of DLP solutions have no ability to look into databases and their associated tables.

Some DLP solutions have add-on modules for database scanning, but that typically requires a license for each database instance to be scanned and thus can quickly become cost prohibitive for some organizations.  DLPs that scan databases typically scan the more common databases such as Oracle, SQL Server and MySQL.  But legacy enterprise databases such as DB2, Informix, Sybase and even Oracle in a mainframe environment are only supported by a limited number of DLP solutions.

Another area where DLP solutions can have issues is with images.  Most DLP solutions have no optical character recognition (OCR) capability to seek out PANs in images such as images of documents from scanners and facsimile machines.  For those DLP solutions that can perform OCR, the OCR process slows the scanning process down considerably and the false positive rate can be huge particularly when it comes to facsimile documents or images of poor quality.

Finally, there is the overall issue of identifying PANs at rest.  It has been my experience that using DLP solutions for identifying PANs at rest is haphazard at best.  I believe the reason for that is that most DLP solutions rely on regular expressions (RegEx) to find the PANs.  As a result, they all suffer from the same shortcomings of RegEx and therefore their false positive rates end up being very similar.

The biggest reason for the false positive rate is the fact that most of these solutions using RegEx do not conduct a Luhn check to confirm that the number found is likely to be a PAN.  That said, I have added a Luhn check to some of the open source solutions and it has amazed me how many 15 and 16 digit combinations can pass the Luhn check and yet not be a PAN based on further investigation.  As a result, having a Luhn check to confirm a number as a potential PAN reduces false positives, but not as significantly as one might expect.
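For readers who have not seen it, here is a minimal Python sketch of a Luhn check.  The function name and the 13 to 19 digit length limit are my own choices for illustration, not taken from any particular tool.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn check."""
    # Keep only the digits so "4111 1111 1111 1111" and "4111-1111-..." both work.
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    # Assumed length window for a PAN; adjust to the card brands you care about.
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111111111111111"))  # well-known test number
print(luhn_valid("4111111111111112"))  # same number with the last digit changed
```

The arithmetic also explains why the check does not help as much as one might hope.  For any string of leading digits there is exactly one final digit that makes the total divisible by 10, so roughly one in ten arbitrary digit strings of the right length will pass.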

The next biggest reason RegEx has a high false positive rate is that RegEx looks at data both at a binary level and character level.  As a result, I have seen PDFs flagged as containing PANs.  I have also seen images that supposedly contained PANs when I knew that the tool being used had no OCR capability.

I have tried numerous approaches to reduce the level of false positive results, but have not seen significant reductions from varying the regular expressions.  That said, I have found that the best results are obtained using separate expressions for each card brand’s account range versus a single, all-encompassing expression.
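As a rough illustration of the one-expression-per-brand approach, here is a Python sketch.  The prefixes and lengths are deliberately simplified assumptions and do not cover every range a brand has issued, and the names BRAND_PATTERNS and find_candidates are hypothetical.  The patterns only match unbroken digit runs, so PANs written with spaces or dashes would need additional handling.

```python
import re

# Simplified, illustrative prefixes and lengths only; real brand ranges are
# broader and change over time, so treat these as placeholders.
# The lookarounds stop a match from starting or ending inside a longer digit run.
BRAND_PATTERNS = {
    "visa": re.compile(r"(?<!\d)4\d{15}(?!\d)"),
    "mastercard": re.compile(r"(?<!\d)5[1-5]\d{14}(?!\d)"),
    "amex": re.compile(r"(?<!\d)3[47]\d{13}(?!\d)"),
    "discover": re.compile(r"(?<!\d)(?:6011\d{12}|65\d{14})(?!\d)"),
}


def find_candidates(text: str) -> list[tuple[str, str]]:
    """Return (brand, digits) pairs for every candidate PAN found in text."""
    hits = []
    for brand, pattern in BRAND_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((brand, match.group()))
    return hits


sample = "order 4111111111111111 shipped, ref 378282246310005"
for brand, digits in find_candidates(sample):
    print(brand, digits)
```

Anything these expressions flag is still only a candidate and needs a Luhn check and a human look before it is called a PAN.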

Simple Solutions

I wrote a post a while back regarding this scoping issue when it was introduced in v2.  It documents all of the open source solutions available such as ccsrch, Find SSNs, SENF and Spider.  All of these solutions run best when run locally on the system in question.  For small environments, this is not an issue.  However, for large organizations, having to have each user run the solution and report the results is not an option.

In addition, the false positive rates from these solutions can also be high.  Then there is the issue of finding PANs in local databases such as SQLite, Access or MySQL.  None of these simple solutions are equipped to find PANs in a database.  As a result, PANs could be on these systems and you will not know it using these tools.
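To give a sense of what looking inside a local database involves, here is a minimal sketch using Python's standard sqlite3 module.  It is not how any of the tools above work; the function name scan_sqlite and the 15 to 16 digit pattern are assumptions for illustration, and it skips binary columns entirely.

```python
import re
import sqlite3
from collections import Counter

# Assumed candidate pattern: a standalone run of 15 or 16 digits.
CANDIDATE = re.compile(r"(?<!\d)\d{15,16}(?!\d)")


def scan_sqlite(path: str) -> dict:
    """Count rows per (table, column) that contain a candidate PAN."""
    hits = Counter()
    conn = sqlite3.connect(path)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        for table in tables:
            # Quote the table name, since it cannot be passed as a parameter.
            quoted = '"' + table.replace('"', '""') + '"'
            cursor = conn.execute(f"SELECT * FROM {quoted}")
            columns = [desc[0] for desc in cursor.description]
            for row in cursor:
                for column, value in zip(columns, row):
                    # Skip NULLs and BLOBs; this sketch only looks at text and numbers.
                    if value is None or isinstance(value, bytes):
                        continue
                    if CANDIDATE.search(str(value)):
                        hits[(table, column)] += 1
    finally:
        conn.close()
    return dict(hits)
```

Calling scan_sqlite on a database file returns a dictionary keyed by table and column, which at least tells you where to start looking.  Every hit is still just a candidate that needs a Luhn check and manual review.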

The bottom line is that while these techniques are better than doing nothing, they are not that much better.  PANs could be on systems and may not be identified depending on the tool or tools used.  And that is the reason for this post, so that everyone understands the limitations of these tools and the fact that they are not going to give definitive results.

Specialized Tools

There are a number of vendors that have developed tools specifically to find PANs.  While these tools are typically cheaper than a full DLP solution and some of these tools provide for the scanning of databases, it has been my experience that these tools are no better or worse than OpenDLP, the open source DLP solution.

Then there are the very specialized tools that were developed to convert data from flat files and older databases to new databases or other formats.  Many of these vendors have added modules to these tools in the form of proprietary methods to identify all sorts of sensitive data such as PANs.  While this proprietary approach significantly reduces false positives, it unfortunately makes these tools very expensive, starting at $500K and going ever higher based on the size of the environment in which they will run.  As a result, organizations looking at these tools will need more than just their need for PAN search capability to justify the cost.

The bottom line is that searching for PANs is not as easy as the solution vendors portray.  And even with extensive tuning of such solutions, the false positive rate is likely going to make the investigation into your search results very time consuming.  If you want to significantly reduce your false positive rate, then you should expect to spend a significant amount of money to achieve that goal.

Happy hunting.


3 Responses to “Adventures In Finding Cardholder Data”

  1. June 14, 2014 at 1:20 AM

    Your post focuses heavily on the technical solutions in the market place which often prove time consuming and distracting.

    In my experience the quickest, cheapest and easiest way to identify the bulk of the instances of card data within a merchant’s environment is to do what most Techies, QSAs and ‘Consultants’ are frightened to do: speak to people. Just starting the communications channels and opening dialogue can flush out in minutes what it takes techies and their servers months and £££’s to do.

    Every QSA has anecdotes about end of programme PCI audits that flush out previously unknown but long standing business processes and vast repositories of card data that haven’t been accounted for or detected and were only found because somebody was finally spoken to.

    And often these areas pose a greater risk of Fraud and breach than some 5 year old, expired card data that is sitting in deleted disk space of a desktop PC in an admin department.

    • June 14, 2014 at 9:00 AM

      Don’t lump in all QSAs as afraid to speak to people. I speak to all sorts of people as a QSA: first, because it’s required by the DSS; second, because it’s the only way to find out about their card processing; but more importantly, to determine if an organization has truly integrated security into its culture. However, even with all of the interviews I conduct there are still areas that do not come to light due to the size and complexity of a lot of Level 1 merchants. Merchants are not just Target or Wal*Mart with a single POS application and consistency, they are also large health care institutions, distributors and a wide variety of companies that have acquired other companies and card processing is not consistent.

      If that is not bad enough, there are changes occurring all of the time that managers and executives do not realize fall under PCI scrutiny because it’s outsourced, in “the Cloud”, whatever, only to find out that it’s not as outsourced as they believed. Thanks to “the Cloud” and the fact that everyone is now an IT “expert”, you have all sorts of IT decisions being made by non-IT people away from normal channels by people unaware of legal and regulatory requirements until it is too late. As a result, QSAs and merchants stumble across these changes all of the time because they are hidden away, not because people are deliberately hiding them, but because people didn’t realize the ramifications of what they were doing. Unfortunately, a lot of these discoveries occur towards the end of assessments, too late to include in the current assessment due to the time it would take to get the new discovery assessed.

      Then when you do find them, the business unit fights and argues over PCI scope making things even more tenuous and stressful. Even if it is out of scope, a QSA must do some amount of procedures to determine that it is truly out of scope. Even that effort comes under a certain amount of argument. As a result, it can take weeks or even months to determine what is what and whether something is in scope or not.

  2. May 18, 2014 at 11:18 AM

    I am glad someone actually went through and laid out all the issues to solve this problem.
    We have gone through all of these issues while building out our card data discovery product.
    We tackle almost all the things you mentioned in the article except of course as we all know OCR is not a magic bullet – not even the best solutions can identify the scribbles of a front desk person at a hotel jotting down the card number while on the phone!

    It is great that you do identify the fact that DLP solutions mostly do a regex – the false positive rate on a regex and even a regex + luhn is huge.

    We run so many more checks to limit false positives and keep improving our 5 year old algorithm with every version.

    We do all kinds of databases natively – and as you pointed out not many vendors can do that.

    We do have a free basic version for download and a heavy duty enterprise grade product.

    Have a look and let us know.

Welcome to the PCI Guru blog. The PCI Guru reserves the right to censor comments as they see fit. Sales people beware! This is not a place to push your goods and services.

May 2014
