7 Big Data Security Issues to Know in an Era of Hacks


We live in an age of constant large-scale cyber attacks. Or at least it can seem that way.

Criminal hackers have targeted Capital One, Equifax and Marriott International, to name a handful of high-profile marks, and made off with a trove of sensitive personal information that impacted hundreds of millions of people and shattered their collective sense of privacy.

And yet, by cybersecurity standards, those breaches weren’t actually “destructive.” Which is to say, hackers stole data, but didn’t erase or change it. According to Tom Kellermann, though, destructive attacks are on the rise.

“You’ve got wipers [destructive malware] that delete data,” says Kellermann, who is chief cybersecurity officer at Carbon Black. “You’ve got ransomware that doesn’t ask for a ransom; it just encrypts all the data so you can’t get into it; you have the manipulation of timestamps, which is a nightmare.”

That’s not to mention “island hopping” attacks, invasions in which hackers break into an institution’s system and impersonate that institution in order to obtain sensitive information from anyone who trusts it: individuals, companies, government officials and others.

Governments, in fact, are common targets and perpetrators of destructive cyber attacks.

“Geopolitical tension is now manifesting in cyberspace,” Kellermann says. “We need to be very, very concerned [about] these massive data lakes. They have become targets not just for traditional hackers and disillusioned individuals, but also for nation states.”

So what’s the best way to safeguard enormous datasets, which are key to everything from cancer diagnosis to artificial intelligence training? Hackers can always crack weak passwords and employ social engineering tactics like phishing, but companies also need to be mindful of vulnerabilities tied to big data’s sheer size and recent appearance on the tech scene.

We asked three Boston-based experts to weigh in on some potential cybersecurity risks associated with big data.

Cloud configuration problems

By definition, big datasets won’t fit on traditional desktops and laptops; they have to be stored on a platform designed for huge volumes of data. That often means public cloud platforms, like Amazon Web Services (or AWS), Google Cloud or Microsoft Azure. Storing private information on public platforms, though, requires careful configuration. Here’s what Kellermann, Rickard Carlsson, CEO of Detectify, and Ryan Weeks, Datto’s chief information security officer, had to say.

Rickard Carlsson

CEO • Detectify

In the old days, you stored data on your own databases in your own servers. Today you store data, for example, in S3 buckets in AWS. An S3 bucket is just a general hole where you can store any type of data. It’s very common that people configure those buckets in the wrong way, though, so they become publicly accessible. Instead of only people inside the company having access, anyone can access it.

Ryan Weeks

Chief Information Security Officer • Datto

It’s a known attacker tactic and technique to basically just crawl for open Amazon S3 buckets, so I think most companies need a security operations program that is doing the same thing: looking for those open S3 buckets. But yeah, when you do anything in public cloud, you need to have an architecture in mind for how you’re going to store and provide access to your data. You need a plan to ensure that your configuration standards persist, too. That’s why you’re seeing the rise of services like CloudCheckr. Personally, I don’t view public cloud as any more or less challenging than private cloud, though. It’s just a different model.
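To make the idea of "looking for open buckets" concrete, here is a minimal sketch of one check such a program might run: scanning a bucket policy document for statements that allow access to any principal. The function names and the example policies are illustrative assumptions, not part of any tool mentioned here, and the check is a simplified heuristic. It ignores Condition blocks (so it errs on the side of flagging), and it does not look at bucket ACLs or account-level settings, so it is not a substitute for the cloud provider's own evaluation.

```python
import json


def statement_is_public(statement: dict) -> bool:
    """Return True if an Allow statement names the wildcard principal."""
    if statement.get("Effect") != "Allow":
        return False
    principal = statement.get("Principal")
    if principal == "*":
        return True
    if isinstance(principal, dict):
        aws = principal.get("AWS")
        values = aws if isinstance(aws, list) else [aws]
        return "*" in values
    return False


def policy_is_public(policy_json: str) -> bool:
    """Return True if any statement in the policy looks publicly accessible."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    return any(statement_is_public(s) for s in statements)


# Hypothetical example policies for illustration only.
OPEN_POLICY = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-bucket/*",
    }],
})

RESTRICTED_POLICY = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:role/analytics"},
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::example-bucket/*",
    }],
})

if __name__ == "__main__":
    print(policy_is_public(OPEN_POLICY))
    print(policy_is_public(RESTRICTED_POLICY))
```

Running a check like this on a schedule, rather than once at deployment, is one way to address Weeks's point that configuration standards need to persist.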

Tom Kellermann

Chief Cybersecurity Officer • VMware Carbon Black

Everyone assumes by moving to the public cloud, the cloud provider — Microsoft, let’s say — is going to protect their data automatically. But that’s not the case. The cloud is like a condo building. Microsoft is going to provide you with a concierge, maybe access to floors and elevators by key, maybe a security guard. But if you leave your door open at night that is still your problem. You definitely need to configure your own application security and endpoint security regardless of what cloud you’re using.

Weak identity governance

Enterprises that collect vast amounts of data often have correspondingly vast teams, but few employees need access to an entire enterprise worth of data to do their jobs. “Identity governance” means each employee has access to the data she needs — and nothing more. But when it’s weak or improperly administered, it’s a problem.

Weeks: A lot of the newer database containers that allow you to store large amounts of data in a non-relational way use a highly-privileged user as their default. If a company deploys that technology and doesn’t plan for it, they give their employees more data access than they need to do their jobs, and it creates a lot of challenges. Like if something gets accessed inappropriately, it creates a really difficult audit trail. You don’t know who did that thing.

Some of these NoSQL platforms, the way they’re architected, they don’t necessarily provide the robustness of role-based access that you might expect. A lot of times you actually need to put a layer in front of these platforms, like a middleware or an API, where you can specify which users can do certain things.
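The layer Weeks describes might look something like the sketch below: a thin gateway that sits in front of the data store, checks each request against a role, and records who did what. Everything here is a hypothetical illustration. The role names, the `DataGateway` class and the in-memory dictionary standing in for a NoSQL platform are assumptions, not any particular product's API.

```python
from dataclasses import dataclass, field

# Hypothetical roles and the actions each one may perform.
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
}


class AccessDenied(Exception):
    """Raised when a role is not permitted to perform an action."""


@dataclass
class DataGateway:
    """Single entry point in front of a data store (a dict stands in for it)."""
    store: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def _authorize(self, user: str, role: str, action: str) -> None:
        allowed = action in ROLE_PERMISSIONS.get(role, set())
        # Record every attempt, allowed or not, so there is an audit trail.
        self.audit_log.append((user, role, action, allowed))
        if not allowed:
            raise AccessDenied(f"{user} ({role}) may not {action}")

    def read(self, user: str, role: str, key: str):
        self._authorize(user, role, "read")
        return self.store.get(key)

    def write(self, user: str, role: str, key: str, value) -> None:
        self._authorize(user, role, "write")
        self.store[key] = value


if __name__ == "__main__":
    gateway = DataGateway()
    gateway.write("dana", "engineer", "report", {"rows": 3})
    print(gateway.read("sam", "analyst", "report"))
    try:
        gateway.write("sam", "analyst", "report", {})
    except AccessDenied as err:
        print(err)
```

Because every request passes through `_authorize`, the log answers the "who did that thing" question even when the underlying platform runs under a single highly privileged account.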

Kellermann: It’s not just that big data requires access privileges tailored to the user’s role. Ideally, you have the capacity for just-in-time administration. So, for example, you need access to some sensitive information today, and you’ll get it just for an hour, and then your privileges expire. Maybe you’ll even be a temporary administrator. But nobody should be a permanent administrator. With the modern consolidation of data, permanent access just allows for a lot of theft and nefarious things.
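The just-in-time idea can be sketched in a few lines: a grant carries an expiry time, and every access check compares the clock against it. The class and method names below are hypothetical, and a real deployment would rely on an identity provider rather than an in-memory table. The sketch only shows the expiry logic.

```python
import time


class JustInTimeGrants:
    """Tracks temporary access grants that expire on their own."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so the expiry can be tested
        self._expiry = {}            # (user, resource) -> expiry timestamp

    def grant(self, user: str, resource: str, ttl_seconds: float) -> None:
        """Give a user access to a resource for a limited time."""
        self._expiry[(user, resource)] = self._clock() + ttl_seconds

    def is_allowed(self, user: str, resource: str) -> bool:
        """Return True only while an unexpired grant exists."""
        expires_at = self._expiry.get((user, resource))
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            # Expired grants are removed so nothing lingers as standing access.
            del self._expiry[(user, resource)]
            return False
        return True


if __name__ == "__main__":
    grants = JustInTimeGrants()
    grants.grant("dana", "payroll-db", ttl_seconds=3600)  # one hour
    print(grants.is_allowed("dana", "payroll-db"))
    print(grants.is_allowed("sam", "payroll-db"))
```

The default here is denial: a user with no grant, or an expired one, gets nothing, which is the opposite of a standing administrator account.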

Clashing software

While a small dataset can be stored and analyzed in one program, storing and analyzing big data involves a whole ecosystem of technology. There’s the baseline storage platform, and then, often, various attached business intelligence tools, like Tableau or Kibana. Often, there’s a total of seven or eight interlocking applications extracting insights from a large dataset. Updating one of them can bring it into conflict with the rest of the system, or open up security vulnerabilities.

Weeks: One good practice around the management of that dataset is to have a single source of ingress and egress — access and exit — that you can monitor. If you have that middleware layer between your data and all your applications, you have to plan a change to that layer if you want to modify the data, or interact with it differently. It helps you maintain the security controls that you have. Too often people will deploy these databases very nicely, but when one application changes, that impacts the rest of the use of the data.

Kellermann: Yeah, it’s definitely a challenge. Enterprises with complex data management systems should definitely have expedited vulnerability management. They should also implement application control, which limits the behavior of the application. If it gets hacked or manipulated, it basically would stop operating.

Unsecured caches

Sometimes, data analysis software creates a cache, or a local copy, of a frequently-queried subset of a remotely-stored big dataset. This can present security problems.

Weeks: Let’s say I’m a software developer and I create an application that accesses big data, but I have a common query that I run a lot and I want it to go faster. I might pull the relevant data out and store it in a cache. That cache now becomes a store of potentially sensitive data and the cache has access and security considerations that come along with it. A lot of these modern application stacks that interact with big databases have very elegant performance solutions, but they need to be thought through in a robust way.
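One way to think a cache through, sketched below under stated assumptions: apply the same authorization check to the cached copy as to the original data, and let entries expire so sensitive results do not sit around indefinitely. The `GuardedCache` class, the `fetch` callback and the `is_authorized` callback are hypothetical names for illustration. This is not how any specific application stack implements caching.

```python
import time


class GuardedCache:
    """Cache that checks authorization on every read and expires entries."""

    def __init__(self, fetch, is_authorized, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch                  # slow query against the big dataset
        self._is_authorized = is_authorized  # same rule the source data enforces
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}                   # key -> (value, expiry timestamp)

    def get(self, user: str, key: str):
        # The check runs before the cache is consulted, so a cached copy
        # is never easier to reach than the original data.
        if not self._is_authorized(user, key):
            raise PermissionError(f"{user} may not read {key}")
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or now >= entry[1]:
            value = self._fetch(key)
            self._entries[key] = (value, now + self._ttl)
            return value
        return entry[0]


if __name__ == "__main__":
    cache = GuardedCache(
        fetch=lambda key: f"result for {key}",
        is_authorized=lambda user, key: user == "dana",
        ttl_seconds=60,
    )
    print(cache.get("dana", "top-customers"))
```

The expiry also limits how stale, and how long-lived, a local copy of sensitive data can become.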

Remote data storage

This is another aspect of cloud storage — when companies store in a public cloud, or any kind of off-site facility managed by a third party, they don’t have direct control over the data. Does this create security issues?

Carlsson: The companies we work with have everything stored off-site, because everything is stored with AWS. But let’s say you’re a small company, you have your own data center and your data center can no longer hold the amount of data you want to store. So you start to use AWS. Most likely you have increased your security. AWS has many thousands of people working on security. At a medium-sized company, you might have two people or three people working on security. AWS will always be an order of magnitude better than you are with your own data, in your data center. Unless you store it on actual paper, in a file in a vault. Then you might be able to handle it better.

Kellermann: If you’re going to consolidate your data in a remote cloud environment and that remoteness makes you nervous, one way to gain some control of your data is with BYOK, or bring your own cloud encryption key. (This means even if someone inappropriately accesses the keys to your cloud platform’s default encryption, they still can’t access your data.)

Human error

Brand-new data management technology comes with a risk of IT department flubs, a.k.a. human error. Mastery takes practice.

Carlsson: Human error is often the biggest issue in new technology. Maybe an IT team is very used to configuring a MySQL database in a very secure way. But suddenly they go to work with Elasticsearch or Hadoop, and they don’t really know how to secure that. When you’re adopting any type of new technology that you’re unfamiliar with, then of course you increase the risk of something going wrong. If you’ve done something 100 times before, then you’re less likely to do something wrong.

Kellermann: A lot of times you have people that are trained and have skills, but they don’t understand how to manage security risks, and they take shortcuts. Greater attention needs to be paid to training folks in DevOps, specifically training folks in security DevOps and security life cycles. This is a massive training issue. The SANS Institute, hands down, is the best place to get training for anything related to risk management. They’re a nonprofit that basically has the best certifications out there. I swear by them.

Glitchy new software

Carlsson: When data software’s new, often there are unknown vulnerabilities or security issues in it. The more time a team’s had looking at something, the more likely they’ve found all the bugs and security issues in it.

Weeks: A lot of these big data solutions get open sourced, or maybe they’re a minimum viable product to prove that there’s a gap in the industry, and then they get adopted very quickly. Next thing you know, they’re scrambling to bolt additional security controls on. When a company does due diligence on software, they should have a low water mark — so if someone has a security program less mature than X, they will not do business with that vendor.

But I wouldn’t immediately exclude a new software solution that provides an innovative approach. A new software and an established software could both use the same vulnerable library. So I don’t like the idea that I should use something that’s known and established over something new. One thing I really dislike is a false sense of security.

Answers have been condensed and edited. Images courtesy of Shutterstock and interviewees.