What is Data Masking?

Data masking goes by many names: redaction, obfuscation, scrambling and de-identification, to name a few. It is the process of obscuring the meaning of data as an added layer of protection during development and testing, or in live systems (for example, for shielded customers). In the event of a data breach, masking makes personal and sensitive personal data useless to attackers, so the organisation and the individuals in the data are far less exposed. Organisations should prioritise masking the sensitive information they hold, as data breaches are becoming more common and more expensive to fix.
 
To achieve data masking, organisations redact data by removing or substituting all or part of a field's value.

Masking is usually the right approach when testing any environment that holds personal data which would be at risk in a breach. It supports GDPR compliance and, when done well, still gives you consistent, accurate test results.

Why do I need data masking? 

Data masking is an important tool from a data governance perspective. Not only does it protect your sensitive data, it also strengthens your ability to test environments in a safe and responsible way.
 
Say, for example, you are looking to test an outbound email system. You would need some form of masking so that you can test the functionality of the email system without sending out any sensitive information.
 
If you have multiple interlinked systems, you can mask across the board and keep the masking consistent across all of them. This keeps your testing accurate while also making sure that each system is protected, so that a breach of one does not compromise the entire network.

When should I mask data?

Masking is a requirement for making sure that the personal and sensitive personal data you are responsible for is protected during any testing you need to do on your pre-production systems.
 
When testing an environment, you should mask as much of your sensitive data as possible, so that a breach cannot leave you and your organisation liable. For example, if you are testing an outbound email system that sends out personal data, make sure that all personal data used in the test is masked for anonymity and that the emails go to an internal address, so nothing can leak.
 
Another use for masking is in a demo system. Here you want users to experience your full functionality without being able to access or tamper with important, sensitive data. Masking gives you a fully functional demo of your system without exposing real data.
 
However, when masking data to support system testing, it is also important to make sure that the masking does not undermine the validity of the testing itself. If testing is not carried out properly, especially for systems that contain personal data, this in itself is also regarded as a breach of data protection law. Simply overwriting real data with unrealistic test data is not correct. Because of this, you must make sure you are using the right masking technique for each data scenario.

Data Masking techniques


Column Based

Column-based security can ensure a sensitive column is not exposed to a user without the proper privileges. This method, while effective, can cause problems for the calling application (such as a BI tool or an application screen), which expects a certain number of columns to be returned from the database query.
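As a rough illustration of why this can trip up a calling application, here is a minimal Python sketch of column-based filtering. The column names, the sample row and the can_see_sensitive flag are all made up for the example and do not reflect any particular database product.

```python
# Hypothetical set of columns that only privileged users may see.
SENSITIVE_COLUMNS = {"card_number"}


def visible_columns(row: dict, can_see_sensitive: bool) -> dict:
    """Return the row, dropping sensitive columns for unprivileged users."""
    if can_see_sensitive:
        return dict(row)
    return {k: v for k, v in row.items() if k not in SENSITIVE_COLUMNS}


row = {"customer_id": 17, "name": "Derek-Paul", "card_number": "0000-1111-2222-0000"}

# The unprivileged result has two columns instead of three, which is exactly
# what can break a report or screen that expects a fixed column layout.
print(visible_columns(row, can_see_sensitive=False))
```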

Redaction

One of the most common methods of data masking is redaction, where you ‘redact’ a column and replace its contents with a set value. With this method, the query returns the proper number of columns, but the actual value in the column is replaced with a constant. For example, when masking is applied to a credit card number, the result might be ‘N/A’ or ‘XXX-XXX-XXXX’.
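A minimal sketch of redaction in Python might look like this. The function name, the column names and the default constant are illustrative assumptions.

```python
def redact(row: dict, columns: set, constant: str = "N/A") -> dict:
    """Replace the value of each listed column with a constant.

    Every column is still returned, so the shape of the result is unchanged.
    """
    return {k: (constant if k in columns else v) for k, v in row.items()}


row = {"customer_id": 17, "card_number": "0000-1111-2222-0000"}
print(redact(row, {"card_number"}))
# {'customer_id': 17, 'card_number': 'N/A'}
```

Because the number of columns does not change, a calling application that expects a fixed layout keeps working.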

Scrambling

Another method of masking is scrambling the data with random values. For example, ‘Derek-Paul’ may become ‘Kpeal-kopl’ by replacing each character with a random one from a-z, keeping the same number of characters. You can also replace only certain characters in a column, so GBP-2321-1231 may keep the GBP but scramble the rest. Numeric scrambling can involve swapping in random values or can be based on randomised numeric calculations. Care needs to be taken if the value ends in a check digit, as these calculations can often produce invalid check digits.
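The sketch below shows one way scrambling could be done in Python. It keeps the length, separators and an optional prefix, and swaps letters for random letters and digits for random digits. The function names are hypothetical. The check digit helper assumes the field uses the Luhn scheme, as payment card numbers do; other identifiers use other schemes, so treat that part as an example only.

```python
import random
import string
from typing import Optional


def scramble(value: str, keep_prefix: int = 0,
             rng: Optional[random.Random] = None) -> str:
    """Scramble letters and digits, leaving the prefix and punctuation alone."""
    rng = rng or random.Random()
    out = []
    for i, ch in enumerate(value):
        if i < keep_prefix:
            out.append(ch)                      # e.g. keep 'GBP'
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)                      # keep separators such as '-'
    return "".join(out)


def luhn_check_digit(payload: str) -> str:
    """Compute the Luhn check digit for a string of digits (check digit excluded)."""
    total = 0
    for position, ch in enumerate(reversed(payload)):
        digit = int(ch)
        if position % 2 == 0:                   # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


print(scramble("Derek-Paul"))                   # same shape, random letters
print(scramble("GBP-2321-1231", keep_prefix=3)) # keeps 'GBP', scrambles the digits

# Scramble everything except the check digit, then recompute it so it stays valid.
payload = scramble("7992739871")
print(payload + luhn_check_digit(payload))
```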

Shuffling

Another technique that can be used to mask data is shuffling values into columns instead of using a constant value. For example, a column ‘First Name’ might have a value of “John”, and the look-up would pick a name from a random list and replace it with “Katherine”. Another example would be replacing names with names from other records (such as taking the forename from the 100th record above and the surname from the 1200th record below). For master data management (MDM) products, consistent shuffling is needed so that all “Pauls” always become “Arif”.
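Both flavours can be sketched in a few lines of Python. The replacement list, the secret key and the function names below are assumptions made for illustration. The consistent variant uses a keyed hash (HMAC) to pick from the list, which is one possible way to make sure the same input always gets the same replacement across systems that share the key.

```python
import hashlib
import hmac
import random
from typing import List, Optional

# Illustrative look-up list; a real one would be much larger.
REPLACEMENT_NAMES = ["Arif", "Katherine", "John", "Mandy", "Derek"]


def shuffle_column(values: List[str], rng: Optional[random.Random] = None) -> List[str]:
    """Return the same values in a random order, so each row gets another row's value."""
    rng = rng or random.Random()
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def consistent_replacement(value: str, secret: bytes,
                           pool: List[str] = REPLACEMENT_NAMES) -> str:
    """Pick a replacement from the pool; the same value and key give the same result."""
    normalised = value.strip().lower().encode("utf-8")
    digest = hmac.new(secret, normalised, hashlib.sha256).digest()
    return pool[int.from_bytes(digest[:8], "big") % len(pool)]


key = b"example-key-keep-this-secret"
print(shuffle_column(["John", "Paul", "Mandy"]))
print(consistent_replacement("Paul", key) == consistent_replacement("paul ", key))  # True
```

With a small list like this one, different names will sometimes share a replacement and a name can occasionally map to itself, so the size of the list matters in practice.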

Machine Learning

Machine learning may be needed for more complex strings. For example, a comments field might contain “Mandy Smith called from 070000000 and wants you to call back about credit card 0000-1111-2222-0000”. Here Mandy, Smith, 070000000 and 0000-1111-2222-0000 all need to be identified and masked according to their data type, and machine learning is needed to do this effectively.
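To show where the difficulty lies, the Python sketch below handles only the pattern-based part of that example. It is not machine learning: the two regular expressions are assumptions fitted to the sample comment, and they catch the phone and card numbers but not the name, which is the part a trained model would be needed for.

```python
import re

# Assumed patterns for this example only; real phone and card formats vary widely.
CARD_PATTERN = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b")
PHONE_PATTERN = re.compile(r"\b0\d{8,10}\b")


def mask_structured_tokens(text: str) -> str:
    """Replace card and phone numbers with placeholders; names are left untouched."""
    text = CARD_PATTERN.sub("[CARD]", text)
    text = PHONE_PATTERN.sub("[PHONE]", text)
    return text


comment = ("Mandy Smith called from 070000000 and wants you to call back "
           "about credit card 0000-1111-2222-0000")
print(mask_structured_tokens(comment))
# Mandy Smith called from [PHONE] and wants you to call back about credit card [CARD]
```

"Mandy Smith" survives untouched, because no simple pattern separates a person's name from any other pair of capitalised words.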

Partial Masking

Partial masking is useful where you need part of a data string as an identifier. With credit card numbers, for example, you will notice that you are almost always asked for the last 4 digits of your card number as a form of verification. This is usually because, for security, the agent doing the verification can only see those last 4 digits. Partial masking is what obscures the rest of the number, leaving only the final 4 digits visible.
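A minimal Python sketch of this, assuming that separators should be kept and only digits are masked (the function name and mask character are illustrative):

```python
def mask_all_but_last(value: str, visible: int = 4, mask_char: str = "X") -> str:
    """Mask every digit except the last `visible` ones, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in value)
    to_mask = max(total_digits - visible, 0)
    out = []
    for ch in value:
        if ch.isdigit() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


print(mask_all_but_last("0000-1111-2222-0000"))
# XXXX-XXXX-XXXX-0000
```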

Putting Data Masking Into Practice

To get started with data masking, there are some key considerations. Review your sensitive data, identify which data needs to be de-identified, and then decide which masking technique is most suitable for each case. You will also need to consider whether the masked data will still be useful for data analysis later on.

Suppose, for example, you are implementing and carrying out user testing on a CMS with 5 different data sources that all need to interact. Each system has multiple database tables, warehouse schemas or even data lakes to pull data from. Modern databases often do not store a surname just once; it is held all over the place. You need to identify every place where similar or grouped data is held (all the dates of birth, forenames, middle names, surnames, NINOs and so on) and treat each as a group.

You need to group similar data together in order to put the right masking rules in place. This can be achieved using expert users, applications such as Microsoft Purview, or even machine learning to find the groups.
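As a very simple starting point, columns can be grouped by matching keywords in their names. The Python sketch below is a naive heuristic, not a substitute for expert review or a cataloguing tool. The keyword lists, table names and column names are all invented for the example.

```python
# Hypothetical keywords for each group of similar data.
GROUP_KEYWORDS = {
    "surname": ("surname", "last_name", "family_name"),
    "forename": ("forename", "first_name", "given_name"),
    "date_of_birth": ("dob", "date_of_birth", "birth_date"),
    "nino": ("nino", "ni_number"),
}


def group_columns(schema: dict) -> dict:
    """Map each group to the 'table.column' names whose column name matches a keyword.

    `schema` maps a table name to a list of its column names.
    """
    groups = {group: [] for group in GROUP_KEYWORDS}
    for table, columns in schema.items():
        for column in columns:
            name = column.lower()
            for group, keywords in GROUP_KEYWORDS.items():
                if any(keyword in name for keyword in keywords):
                    groups[group].append(f"{table}.{column}")
                    break
    return groups


schema = {
    "crm.customers": ["id", "Surname", "DOB"],
    "billing.accounts": ["cust_last_name", "ni_number", "balance"],
}
print(group_columns(schema)["surname"])
# ['crm.customers.Surname', 'billing.accounts.cust_last_name']
```

Name matching will miss badly named columns and free-text fields, which is where expert users or automated classification come in.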

Once you have groups of data, you can apply whichever masking method suits each group best.

When Not To Use Data Masking 

While data masking is a very powerful data protection tool, there are times where you do not want to be using it. 
 
Masking is typically not reversible (apart from shuffling and table backups). For example, if you have redacted all but the last 4 digits of a NINO and, after some analysis, decide you want the full, actual NINO, that is not possible. If reversal or re-identification to the real value is needed, other privacy techniques, such as tokenisation or reversible encryption, should be used.
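To make the contrast with masking concrete, here is a minimal Python sketch of tokenisation. The TokenVault class is hypothetical and keeps its mapping in memory; a real vault would be a secured, access-controlled store. The NINO shown is a dummy value.

```python
import secrets


class TokenVault:
    """Toy token vault: swaps real values for random tokens and back again."""

    def __init__(self) -> None:
        self._token_for = {}
        self._value_for = {}

    def tokenise(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to the same token.
        if value not in self._token_for:
            token = "tok_" + secrets.token_hex(8)
            while token in self._value_for:   # extremely unlikely, but keep tokens unique
                token = "tok_" + secrets.token_hex(8)
            self._token_for[value] = token
            self._value_for[token] = value
        return self._token_for[value]

    def detokenise(self, token: str) -> str:
        # Only callers with access to the vault can get back to the real value.
        return self._value_for[token]


vault = TokenVault()
token = vault.tokenise("QQ123456C")           # dummy NINO
print(token)                                  # random token, reveals nothing by itself
print(vault.detokenise(token))                # QQ123456C
```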
 
If the redacted field is a unique or direct identifier, or a unique key in database terms, partial masking can remove the ‘uniqueness’. Depending on the use of the field, this can be problematic. For example, if a report rolls up transactions by 8-digit account number, and the account number is redacted to the first 4 digits, this will cause an issue. In this case, 12347777 and 12348888 both become 1234. So, transactions for both accounts would roll up under the same number, which is not desired behaviour. 
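A quick check like the Python sketch below can flag this problem before a masking rule is rolled out. The function name and the sample account numbers are illustrative.

```python
from collections import defaultdict


def find_collisions(values, mask):
    """Return each masked value that more than one distinct original maps to."""
    originals_by_masked = defaultdict(set)
    for value in values:
        originals_by_masked[mask(value)].add(value)
    return {masked: sorted(originals)
            for masked, originals in originals_by_masked.items()
            if len(originals) > 1}


accounts = ["12347777", "12348888", "56781111"]
print(find_collisions(accounts, lambda account: account[:4]))
# {'1234': ['12347777', '12348888']}
```

An empty result means the masking rule kept the field unique for the data that was checked.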
 
Masking will also not be an option if real personal data is essential to how the environment you are testing works (for example, where customer data has to be matched between databases), as masking would invalidate your results. For this you will need to set up a “walled garden” environment. This is where you make sure all access to the system is secured and only authorised, necessary users are able to access the data, as if the test system were a live production system.

Data masking is a key part of any effective implementation of data governance best practice. The challenge, however, is knowing which data to mask and when it is the right time to do so.

