Map Your Data Integrations, Or You Don’t Understand Your Risk

It’s 3am. Do you know, exactly, where your data is?

Which vendors receive or touch your data?

What type of data each vendor is involved with?

Startups and software move fast. Pretty soon the SaaS and vendor integrations pile up. The security and privacy implications are real but invisible, unlike the actual connections over which your company’s data is flowing.

Thesis: you need to maintain a living map of your SaaS and vendor integrations, otherwise you don’t actually understand your data tenancy or security.

Ask yourself: at your company, do you know who owns the list of all vendors and data flows? “I think someone knows…” “We could pull that together…” “It exists somewhere…” If those are the answers, you’re already behind.

If you’re in a regulated industry, or your customer contracts carry high expectations, operating without this data integration map is risky.

  • Security
    • Data movements
    • Blast radius
  • Privacy and Data Governance
    • Where is PII/PHI flowing?
    • Subprocessors (approval and reviews)
    • Data retention + deletion obligations
  • Compliance & Customer Trust
    • SOC2, HITRUST, HIPAA, PCI, etc.
    • Enterprise security questionnaire: “List all your subprocessors”

Without the data integrations map, you’re running a structural risk, not just an operational inconvenience.

Data Integrations Graph (DIG)

Yes, ‘graph’ in the discrete mathematics sense: a visualization with nodes and edges. An xlsx, Google Sheet, or table simply isn’t going to cut it.

  • Nodes: your systems, vendors, SaaS tools, data partners
  • Edges: direct connections and data flows between Nodes

And a clarification: Data Flows vs. Data Integrations (and why this matters). The DIG can extend the idea of a formal Data Flow Diagram (DFD) beyond your own systems and services, to include every external system your data might touch. This is where most companies lose visibility.
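To make the nodes-and-edges idea concrete, here is a minimal sketch of a DIG held as plain data. Every node name and data type label below is hypothetical, and tagging each edge with the kinds of data it carries is my own assumption about what is worth recording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str            # node the data leaves
    target: str            # node the data arrives at
    data_types: frozenset  # hypothetical labels such as "PII" or "billing"


# Hypothetical systems and vendors, for illustration only.
NODES = {"web_app", "app_db", "crm_vendor", "email_vendor", "analytics_vendor"}

EDGES = [
    Edge("web_app", "app_db", frozenset({"PII", "billing"})),
    Edge("web_app", "crm_vendor", frozenset({"PII"})),
    Edge("crm_vendor", "email_vendor", frozenset({"PII"})),
    Edge("web_app", "analytics_vendor", frozenset({"usage"})),
]


def receivers_of(data_type, edges):
    """Return, sorted, every node that directly receives the given data type."""
    return sorted({e.target for e in edges if data_type in e.data_types})
```

With the edges written down like this, a question such as “where is PII flowing?” becomes a one-line query: `receivers_of("PII", EDGES)`.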

Codify

As mentioned above, a spreadsheet can’t convey the graph. Instead, use a lightweight syntax like Mermaid, Graphviz, or D2. (There are many online renderers for Graphviz. If you have a high-security environment, you’ll want to download and run Graphviz within your network to reduce the chance of publicly leaking the graph.) See basic and complex examples in Graphviz syntax. (Generated pics below.)

The rendered graph may become visually complex. That’s fine: the code/syntax definition is the thing meant to be referenced (i.e. control-F for all mentions of a node); it does not exist solely to generate visualizations. (But visualizations _are_ fun.)
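If you would rather generate the Graphviz definition than type it by hand, the sketch below emits DOT text from a list of edges using only string formatting. The edge list, the labels, and both helper names (`to_dot`, `mentions`) are hypothetical; `mentions` is just a scripted version of the control-F lookup.

```python
# Hypothetical (source, target, label) edges, for illustration only.
EDGES = [
    ("web_app", "crm_vendor", "PII"),
    ("crm_vendor", "email_vendor", "PII"),
    ("web_app", "analytics_vendor", "usage"),
]


def to_dot(edges, name="dig"):
    """Build a Graphviz DOT digraph definition as a string."""
    lines = [f"digraph {name} {{"]
    for source, target, label in edges:
        lines.append(f'  "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


def mentions(dot_text, node):
    """Return every line of the definition that references the quoted node."""
    return [line.strip() for line in dot_text.splitlines() if f'"{node}"' in line]
```

Saving the output of `to_dot(EDGES)` to a file such as `dig.dot` and running `dot -Tsvg dig.dot -o dig.svg` locally should render it without the graph leaving your network.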

Living Document

This is not a “generate once and we’re done” exercise. Put the graph under source control, assign an owner, and involve processes and teams to ensure it stays up to date.

  • Reviews: annual or quarterly for network or inventory
  • Incident Response: understand blast radius quickly
  • Shadow IT discovery: cross-check against other inventories or reality
  • Vendor / SaaS acquisitions, reviews, and offboarding
  • Threat Modeling
  • Data Flows: use the graph to inform the flow diagram
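Two of the uses above are easy to script once the edges live in code. The sketch below treats “blast radius” as every node reachable downstream of a compromised one (a simplifying assumption; a real assessment may also need to look upstream) and treats shadow IT discovery as a set difference against some other inventory. All names are hypothetical.

```python
from collections import deque

# Hypothetical (source, target) edges, for illustration only.
EDGES = [
    ("web_app", "crm_vendor"),
    ("crm_vendor", "email_vendor"),
    ("web_app", "analytics_vendor"),
    ("billing_app", "payment_vendor"),
]


def blast_radius(start, edges):
    """Return every node reachable downstream of start (breadth-first search)."""
    downstream = {}
    for source, target in edges:
        downstream.setdefault(source, set()).add(target)
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in downstream.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def unmapped_vendors(edges, other_inventory):
    """Return, sorted, vendors in another inventory that the graph never mentions."""
    mapped = {node for edge in edges for node in edge}
    return sorted(set(other_inventory) - mapped)
```

During an incident, `blast_radius("crm_vendor", EDGES)` gives a first answer in seconds; for shadow IT, `other_inventory` could be something like the vendor names on last quarter’s expense report.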

CTA

Start small (pick 5-10 systems), map the integrations, put it in source control, and integrate into regular company processes.

You don’t need a perfect graph. You need a living one.

The companies that win on trust aren’t the ones with the most policies — they’re the ones that actually understand their systems.

Side Project: Wordle Solver

TLDR: a side-project Wordle Solver, and the GitHub repository (with files/lines specifically linked throughout the rest of this post).

A New Side Project

“Side projects are good and fun.  So is Wordle.”

I always try to have a side-project in the mix. In software development it’s quite important to stay pliable (a la Tom Brady) and adaptable, and to stay current with the latest in software languages, frameworks, and hosting paradigms (not necessarily Cloud just for the sake of ‘Cloud’). It’s also important from a Product aspect. With a side project you (as an engineer/technologist) have total control over the direction of the implementation. The act of organizing/prioritizing what you want to implement can vastly help in your professional life, where there is not as much control over the direction of Product (but from your own project you will have learned to recognize pitfalls, best practices, and tools). And the value goes 4x when you collaborate on a side-project with 1+ other people.

Wordle Trie Search

“Using less electricity is good.”

One day my former roommate from college (Alex, a very bright computer scientist) sent me a text with a link to his GitHub repository. He had a very advanced start on a Wordle guess validation algorithm, implemented using a recursively traversable trie structure containing nearly all of the five-letter English words. (What’s nice about a trie is that a lookup takes time proportional to the length of the word, not the number of words stored, which is much more efficient than a naive/linear scan.)

From the Wikipedia article Trie
Credit: Booyabazooka (based on PNG image by Deco).  Modifications by Superm401., Public domain, via Wikimedia Commons

I was immediately mentally committed. This Wordle thing had taken off, I had played it a couple of times, and I loved the idea of being able to work with Alex again and build something in the Wordle arena.
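For anyone who hasn’t met a trie before, here is a minimal sketch in Python. It is not the code from Alex’s repository, just the textbook structure, and the three words are placeholders for a real word list.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a letter to the next TrieNode
        self.is_word = False  # True if a stored word ends at this node


def insert(root, word):
    """Walk down the trie one letter at a time, creating nodes as needed."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def contains(root, word):
    """Check membership in as many steps as the word has letters."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


root = TrieNode()
for w in ["crane", "crate", "slate"]:
    insert(root, w)
```

Note that “crane” and “crate” share the nodes for c-r-a, which is where the space and search savings come from.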

Updating the Algorithm

“He will win who knows when to fight and when not to fight.”— Sun Tzu

Alex had the algorithm at an 80% complete state. We recognized, though, that it was not using all the information from a guess that had a correct letter in the wrong location. This code change/commit fixed the algorithm and precludes unnecessary traversals of the trie.
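To illustrate the idea (this is my own sketch, not the code in the commit), here is one way a recursive trie walk can use a “right letter, wrong spot” result to skip whole branches. The parameter names `greens`, `yellows`, and `grays` are hypothetical, and the sketch ignores Wordle’s repeated-letter subtleties.

```python
END = "$"  # marker key for "a word ends here"


def build_trie(words):
    """Build a nested-dict trie from a list of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def candidates(node, greens, yellows, grays, prefix=""):
    """Yield words consistent with the feedback, pruning branches early.

    greens:  {position: letter} known to be correct
    yellows: {letter: set of positions where that letter is known NOT to be}
    grays:   set of letters known to be absent
    """
    depth = len(prefix)
    if node.get(END):
        # A finished word must still contain every yellow letter somewhere.
        if all(letter in prefix for letter in yellows):
            yield prefix
    for ch, child in node.items():
        if ch == END or ch in grays:
            continue
        if depth in greens and greens[depth] != ch:
            continue  # a different letter is confirmed at this position
        if depth in yellows.get(ch, ()):
            continue  # this letter is known not to sit at this position
        yield from candidates(child, greens, yellows, grays, prefix + ch)


trie = build_trie(["crane", "crate", "trace", "slate", "react"])
# Feedback for guessing "crate" when the answer is "trace":
# c and t are yellow (at positions 0 and 3); r, a, e are green.
result = list(candidates(trie, {1: "r", 2: "a", 4: "e"}, {"c": {0}, "t": {3}}, set()))
```

The yellow check is what prunes the entire “c…” subtree at the first letter, instead of walking it and rejecting each word at the end.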

For the Internet

“Real artists ship.”— Steve Jobs

(No I’m not claiming to be an artist.  Just a technologist.)

I didn’t start this project, so I went looking for how I could bring extra value (i.e. enter a space for implementation that wasn’t being served yet).

Software is useless unless you have a channel to distribute it.  That’s why the Internet is so valuable.  Professionally I was already very familiar with the Java Spring framework, so I committed myself to creating a REST API to expose the underlying algorithm.

I created a Spring sub-project within the same repo, and referenced the algorithm and supporting files using symlinks which actually worked with the build!  I thought this was a neat way to include Alex’s code.  (I don’t know if I would recommend this approach professionally, it’s a little hacky.)

Automated Testing

They test it.  Exactly.

A nice addition by our third collaborator, Tyler, brought in some GitHub workflows driving some unit tests. This helped identify whether anything was broken by a feature/bugfix branch. Bonus: the unit tests in the ‘sub-project’ Spring application could be run right after the root-level tests.

For the Internet, take two

Wrapping the algorithm in Spring was not the right idea. I had not thought through how I wanted to host the application. An executable jar could have been compiled, but it would have needed a virtual host or container to run on. So instead I spent a weekend wrapping the algorithm a second time, this time using the AWS Lambda Handler so it could run serverless. (This could cut low-traffic hosting from $20/month down to about $2.) Some AWS CloudFormation automation (from an AWSDocs repo) also helped with the iteration and deployments, though I manually integrated an API Gateway with the Lambda.
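The real handler was written in Java; purely to show the shape of the wrapper, here is a rough Python equivalent, assuming API Gateway’s Lambda proxy integration. The `pattern` query parameter and the `suggest_words` stand-in are both hypothetical and do not reflect the project’s actual API.

```python
import json


def suggest_words(pattern):
    """Hypothetical stand-in for the solver; returns canned data."""
    return ["trace"] if pattern else []


def handler(event, context):
    """Lambda entry point for an API Gateway proxy-integration request."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    pattern = params.get("pattern", "")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"suggestions": suggest_words(pattern)}),
    }
```

The whole “server” is one function that takes a request-shaped dict and returns a response-shaped dict, which is why there is no host or container left to manage.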

Front End

“If there’s a ‘trick’ to it, the UI is broken.”— Douglas Anderson

A little Bootstrap v4 CSS can go a long way, visually. I’m not a front-end developer, but being able to sling together some bare HTML, Bootstrap, forms, and jQuery.ajax makes a lot of webapp creation possible.

I also included a fun animated background from a codepen project.

The HTML page is completely static; I uploaded it to AWS S3 and aliased a Route 53 record for a domain I own.

Lastly, though it’s not quite a security thing (more about exclusivity), I modified the application to include an Access-Control-Allow-Origin header in every response from the API/algorithm. This instructs browsers to stop other websites from using my API, though anyone could curl against it if desired. And since the repository is public, someone could deploy their own Lambda!
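As a rough illustration of the header trick (again in Python rather than the project’s Java, and with a made-up origin), a small helper can stamp the header onto a proxy-style response dict:

```python
# Hypothetical origin; substitute the domain that serves the static page.
ALLOWED_ORIGIN = "https://wordle.example.com"


def with_cors(response, allowed_origin=ALLOWED_ORIGIN):
    """Return a copy of the response with Access-Control-Allow-Origin set."""
    headers = dict(response.get("headers") or {})
    headers["Access-Control-Allow-Origin"] = allowed_origin
    return {**response, "headers": headers}
```

A browser only lets a page’s JavaScript read the response when the page’s origin matches this header; curl never looks at it, which is why this is about exclusivity and not security.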

Thanks, and Wordle on!

Hello World!

After reading and benefiting from so many people’s cool blogs, I decided I wanted to have fun too.

I’m a Software Consultant, love talking about technology, and have wide interests outside of work (because I’m human).

To start my blog, I first started building a tech stack to host it. I was armpit deep in standing up a Jekyll app, connected to a GitHub repo, using an EC2 instance to build, then deploy to S3, which I could then front with CloudFront, then buy a cert in ACM and then use Route 53… blah.

I just want to blog, man. I know I have hated on WordPress before. Credit card swiped and I have a personal account for a year, and my own domain!

Hello World!