Building an Enterprise Git Repository (the Hard Bits!)

San Francisco 2016

Source code: Just put it in Git, right? Enterprise scale? GitHub!

But what about when you have a lot of source code? Thousands of repositories? No problem! GitHub Enterprise or Bitbucket Server to the rescue!

Now: Add PCI & SOX. Confidential information. Separation of concerns. Audit. SSO. Centralized SSH key management. DR. Geographic diversity.

This is the part where you roll up your sleeves, and start doing the real work.

This talk starts where the vendors stop, discussing workflows to keep work moving, security and audit protections to ensure code integrity, and automation to connect to other enterprise systems.

Full transcript

Matthew Barr

I'm Matthew Barr from Akamai. I actually am an architect there. And I'm going to be talking about enterprise Git and some of the hard bits that the vendors don't actually give us.

First of all, I'm not a lawyer. I'm not compliance. I'm not internal audit. I'm not a PCI assessor or QSA. I'm not the mama, but I am the daddy.

I actually am a sysadmin and DevOps engineer for the last 20 years or so. I came out of Lehman, MarkitSERV, so financials, but also some dating and social networking sites, as well as Nokia.

My focus at Akamai is actually in developer productivity. We provide tools to our engineers. Our group provides SCM, so both Git and Perforce, build systems, and CI and test systems.

My current project is building a horizontally scalable build farm with Docker agents, which is a little interesting. And then the next step is going to be looking at enterprise-scale CI, but not necessarily with multi-tenancy.

So, you want to store your code in Git. How many people actually remember that game? Okay. Yeah, small numbers.

I'm going to start with some of the existing solutions. You all probably know these. GitHub and Bitbucket: those are the cloud versions. Hosted, great features, low overhead, great for small teams. Pretty decent even for small-to-medium-sized teams.

You have self-hosted options like GitLab, Gitolite, cgit. And then you've got enterprise options: GitHub Enterprise, they're upstairs; Bitbucket Server from Atlassian, integrates nicely with Jira. It used to be called Stash. GitLab Enterprise, I know some of their folks are here too. And then Perforce GitSwarm, which is a little different because it's backed on the Perforce side.

I'm going to give you a little information on how we're doing it at Akamai. We're just now starting to move into the Git universe. We have been on Perforce for about 15 or more years. We're adding some teams using Git. About 6,000 repositories or more, and 115 projects, which is Stash's term for what GitHub calls organizations. It's not, by any means, the primary code repository right now.

It was launched about a year and a half ago or so. And we're using Stash Data Center Edition, which provides some things I'll talk about a little later.

We're currently in two sites, so two app servers, two DB nodes, NetApp filers, and load balancers. And I don't know if people saw, they have APIs now upstairs, supposedly, and that's going to be a big help for actually figuring out what's going on with your replication.

I don't know if people have seen this tendency to shorten "operationalization," but who the heck wants to type that?

So, you do need to operationalize these tools. They come out of the box. Yes, they can be clustered, in some cases, the enterprise ones, but they don't just magically happen. The HA features, the DR features, geo diversity, backups.

All the tools differ between themselves. For example, GitHub Enterprise has clustering, has active and passive nodes. Point-in-time snapshots, they do.

First of all, I may be wrong on some of these, and if I am, feel free to correct me if you're from the vendor or you know better. These tools have been moving very rapidly. I know Stash is releasing a new version pretty much every month. They're putting out a minor version, not even a patch. So you're getting major features being delivered nearly every one to two months from Stash.

I'm not sure of GitHub's cadence. And the key is we actually have Stash, which is why I know that one a little bit better.

But Bitbucket Server has self-service backups, so you're required to do all the backups. And you're also required to do all the database replication, and you need to provide it with snapshots.

There have been some improvements recently. They do have smart mirrors, which means you can actually put a local... So you've got a central site with your master servers, but you also can have remote replicas, full copies in remote locations, including anywhere in the world. Nice thing with Git is you can actually specify a read location and a write location when you're doing the origin setup. So you can actually fairly seamlessly have remote replicas.
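The read/write split mentioned above maps onto Git's separate fetch and push URLs for a remote. What follows is a minimal sketch, not the setup used at Akamai: the hostnames are hypothetical, and the helper only builds the `git remote set-url` command lines rather than running them.

```python
from typing import List


def split_remote_commands(remote: str, read_url: str, write_url: str) -> List[List[str]]:
    """Build the git commands that make `remote` fetch from a nearby
    mirror while still pushing to the central server."""
    return [
        # Fetch URL: clones, fetches, and pulls go to the local mirror.
        ["git", "remote", "set-url", remote, read_url],
        # Push URL: writes still go to the central (master) server.
        ["git", "remote", "set-url", "--push", remote, write_url],
    ]


# Hypothetical hostnames, for illustration only.
commands = split_remote_commands(
    "origin",
    "ssh://git@mirror.example.com/proj/repo.git",
    "ssh://git@primary.example.com/proj/repo.git",
)
for cmd in commands:
    print(" ".join(cmd))
```

Running the two printed commands inside a clone would leave `origin` reading from the mirror and writing to the primary.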

By the way, if people have questions, I'll probably take them at the end, or I'll hang out around and I can take them afterwards because I know this is really tight on time.

Another problem we had with pulling any of these tools in is authentication for the enterprise. We have a mandate: no passwords. You're not allowed to use a password to a system that's brought in. You must use SSO. If you do have to use a password, it needs to be local to the system. It can't be a reuse of another credential set. And no one likes that at all.

And we really don't want to have that in place because then you can't shut people down. It's just a pain on anything that's going to be enterprise scale.

We have three types of access that are needed to these tools. You've got the web UI. You've got Git over SSH or HTTPS, and typically you're going to want to go ahead and provide your users with API access as well. I specify that they're slightly different because you may want to use different types of credentials on the web UI versus the API.

In our case, we actually use SAML via an Apache reverse proxy to go ahead and actually do the web UI authentication. We use actual SSH keys synced from a central LDAP server through Active Directory, actually. But we have multiple SSH keys in Active Directory, and those are replicated out to the whole company for various types of uses.

So we can go ahead and use that and synchronize your SSH keys from that directly. We'll actually take whatever you've uploaded and overwrite it with whatever you have in the central system. But we also have X.509 client cert auth for APIs, because you can't really easily do SAML for an API connection.
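The "central directory wins" behavior described above can be sketched as a set difference. This is an illustration under assumptions, not the actual sync job: the function name and the idea of comparing public-key strings are hypothetical, and a real sync would read keys from the directory and call the Git server's API.

```python
from typing import Set, Tuple


def plan_key_sync(directory_keys: Set[str], server_keys: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (keys_to_add, keys_to_remove) so that the Git server ends up
    holding exactly the keys stored in the central directory."""
    to_add = directory_keys - server_keys      # in the directory, missing on the server
    to_remove = server_keys - directory_keys   # uploaded by the user, not in the directory
    return to_add, to_remove


# Placeholder key strings, for illustration only.
add, remove = plan_key_sync(
    directory_keys={"ssh-ed25519 AAAA...one", "ssh-ed25519 AAAA...two"},
    server_keys={"ssh-ed25519 AAAA...two", "ssh-rsa AAAA...self-uploaded"},
)
print("add:", sorted(add))
print("remove:", sorted(remove))
```

Treating the directory as the only source of truth is what makes it possible to shut a user down in one place.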

This has been something we had to negotiate with InfoSec and enterprise security. So it's just a question you're going to have to determine for your own enterprise, what's correct.

Moving into safety and best practices. PCI and SOX kind of boil down to two things: prevent unauthorized changes, and review changes.

Now, the prior presenter just a few minutes ago was talking about code in various contexts of compliance. This is really looking at the code context: what do you do about securing your code, and about safety and integrity? We're not talking about the whole range of controls for PCI or for SOX. This is really about what you need to worry about in the context of your source control.

From what we can tell, and what we've talked with our auditors about and our InfoSec team, we really need to have code review for almost all the various types of compliance. And we need to have sign-offs. So in some tools, that's a plus one, and in some tools, that's an approver.

We need to prevent merges without pull requests to master, or to, say, a release branch or something like that. You don't want people making changes that didn't go through your pull requests.

Pull requests, to us, are actually our audit mechanism. They are logged. They're done on the server side. They're not done on the actual client side on your laptop. And so we can actually record those and keep track of those and provide that as evidence to the various assessors and things along those lines.

And this is a big question. People are like, "Why don't you allow fast-forward merges?"

Well, we actually found that the merge commit itself from the pull request is an audit point. That is actually where you've put into your Git log, you've actually got a point in there that says, "Pull request merged by so-and-so. It was approved by so-and-so. Here's the actual changes and the Jira issues that were in it, and here's a description of the change."

And that's how you know what's actually in that changeset. So if you get rid of those, you lose that audit point. And because it's been encoded into the Git log, it can't easily be changed unless someone rewrites the whole history of Git. Well, you need to make sure that doesn't happen.
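One way to check that every change on a protected branch arrived through a merge commit is to walk the branch's first-parent chain and flag anything that is not a merge. The sketch below does that over an in-memory parent map; the commit names and the `baseline` parameter are made up for illustration, and this is not a tool described in the talk.

```python
from typing import Dict, List, Optional


def direct_commits(parents: Dict[str, List[str]], tip: str, baseline: str) -> List[str]:
    """Walk the first-parent chain from `tip` back to `baseline` and return
    the commits that are not merges, i.e. changes that landed on the branch
    without a pull-request merge commit to act as the audit point."""
    found: List[str] = []
    current: Optional[str] = tip
    while current is not None and current != baseline:
        commit_parents = parents.get(current, [])
        if len(commit_parents) < 2:  # a merge commit has two or more parents
            found.append(current)
        current = commit_parents[0] if commit_parents else None
    return found


# Hypothetical history: two pull-request merges (m1, m2), then a commit "x"
# pushed straight onto the branch.
history = {
    "base": [],
    "f1": ["base"],
    "m1": ["base", "f1"],
    "f2": ["m1"],
    "m2": ["m1", "f2"],
    "x": ["m2"],
}
print(direct_commits(history, tip="x", baseline="base"))
```

Against a real repository, `git log --first-parent --no-merges` over the same range should list the same kind of commits. A fast-forward merge puts the feature commits directly on the first-parent chain, so they would be flagged here too.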

In order to do some of this, we also really looked at a couple different types of workflows. We actually came to focus on a branching workflow.

How many people here use Git? Right. See, that's so simple this way. People are probably pretty familiar with, or have heard of, Git flow, feature branches, things like that. Yep. Nod yes. Sounds good. Okay. Anyone not familiar with any of this? That's great.

All right. So we actually designed a combination of Git flow and a feature branch, or GitHub, workflow. For people that know Git flow, one thing we noted: the develop branch didn't necessarily add a whole lot of value for us, but we did want the flexibility for a QA team to work off a release branch instead of going directly off of master.

So if you want to go ahead and make our workflow CD, really deploy directly off of master, you could. You just wouldn't necessarily make that release branch at the end.

We actually had to go ahead and put in place controls to protect branches. Thankfully, some of the tools have added this. Stash actually now has it. Well, Bitbucket Server. And we also had to limit the users that can merge things.

We need to make sure that people don't force-push onto protected branches. What we mean by protected branches, though, are things like master, for example, and your release branches, where basically they act as release candidates, where you're going to have a look at what's going to go out to production if it survives testing.
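A force-push is an update where the branch's old tip is no longer an ancestor of its new tip. The sketch below shows that check as a server-side hook might apply it; it is an illustration, not the control Bitbucket Server ships. The branch naming (`master`, `release/*`) and the in-memory parent map are assumptions, and a real `pre-receive` hook would read `<old> <new> <ref>` lines from standard input and ask Git for ancestry.

```python
from typing import Dict, List

ZERO_SHA = "0" * 40  # Git's placeholder for "no commit" in hook input


def is_protected(ref: str) -> bool:
    # Assumed naming convention: master plus release/* branches.
    return ref == "refs/heads/master" or ref.startswith("refs/heads/release/")


def is_ancestor(parents: Dict[str, List[str]], ancestor: str, descendant: str) -> bool:
    """True if `ancestor` is reachable from `descendant` via parent links."""
    stack, seen = [descendant], set()
    while stack:
        sha = stack.pop()
        if sha == ancestor:
            return True
        if sha in seen:
            continue
        seen.add(sha)
        stack.extend(parents.get(sha, []))
    return False


def should_reject(parents: Dict[str, List[str]], ref: str, old: str, new: str) -> bool:
    """Reject deletions and non-fast-forward updates of protected refs."""
    if not is_protected(ref):
        return False
    if old == ZERO_SHA:   # branch creation: nothing is being overwritten
        return False
    if new == ZERO_SHA:   # deleting a protected branch
        return True
    return not is_ancestor(parents, old, new)


# Hypothetical graph: a <- b <- c is the normal line; z rewrites history from a.
graph = {"a": [], "b": ["a"], "c": ["b"], "z": ["a"]}
print(should_reject(graph, "refs/heads/master", "b", "c"))
print(should_reject(graph, "refs/heads/master", "b", "z"))
```

The first call is an ordinary fast-forward and is allowed; the second replaces `b` with an unrelated commit, which is what a force-push looks like to the server.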

We also found that we wanted to unapprove pull requests when they were modified. Stash, or Bitbucket, doesn't actually support that out of the box. You have to add an optional plugin so that, if you make a change, the pull request is marked as unapproved and someone has to reapprove it once that is fixed.

I don't know how you would do this with GitHub Enterprise, where you've got plus ones, but what I'm talking about here is: you've pushed code, you've opened a pull request, someone has gone in and approved it. And then another reviewer looks at it and says, "Hey, you missed some problems here." You go in and push some new fixes to it.

Well, that's not actually an approved code base. They've had a review on code that wasn't actually fully reviewed. So you really want to make sure that you're actually capturing the fact that there was a change after the approval had been granted, and now it needs a new approval.
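The rule described above amounts to tying each approval to the exact commit that was reviewed. Here is a minimal sketch of that idea, assuming a simplified pull-request model; the class and method names are hypothetical and are not the plugin's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PullRequest:
    head_sha: str
    # reviewer -> the head SHA that reviewer actually approved
    approvals: Dict[str, str] = field(default_factory=dict)

    def approve(self, reviewer: str) -> None:
        self.approvals[reviewer] = self.head_sha

    def push(self, new_head_sha: str) -> None:
        # New commits move the head; earlier approvals now point at stale code.
        self.head_sha = new_head_sha

    def valid_approvers(self) -> List[str]:
        """Only approvals given on the current head count toward merging."""
        return sorted(r for r, sha in self.approvals.items() if sha == self.head_sha)


pr = PullRequest(head_sha="aaa111")
pr.approve("alice")
pr.push("bbb222")           # fixes pushed after alice approved
print(pr.valid_approvers())
pr.approve("bob")           # bob reviews the updated code
print(pr.valid_approvers())
```

Keeping the stale approval on record, rather than deleting it, also preserves the evidence that a change happened after approval.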

Something that some of you all may not have actually thought of is that the pusher of a piece of code in Git is not necessarily the committer. You can change the committer to whomever you'd like on your laptop. No problem, right?

When you first set up your laptop to talk to Git, you generally do this once and forget about it from then on. But, well, how do I know who wrote the code?

If you look in the Git log, it shows the committer, maybe the author, which is also supported by Git, but there's no proof that the committer is the person that actually wrote the code. And the server really only knows who the pusher was, for example.

So what do you do? Well, GPG signing is one way, and I think GitHub Enterprise just added it earlier this fall. But that's really painful and requires a whole other set of public key infrastructure.

Another option would be to do things like log all commits by pusher. The commit, the SHA, is not going to change unless you rewrite the history of the whole code base. So if you know the pusher and the commit SHA, you can identify where that commit came into your system, at the point it entered the central server. So that's something to consider.
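Logging commits by pusher can be sketched as a map from each SHA to the first push that delivered it. This is an illustration only: the record fields and class names are assumptions, and in practice the pusher identity would come from the server's authentication (SSH key or SSO), not from anything in the commit.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class PushRecord:
    pusher: str            # authenticated identity, not commit metadata
    ref: str
    shas: Tuple[str, ...]  # commits that arrived in this push


class PushLog:
    def __init__(self) -> None:
        self._first_seen: Dict[str, PushRecord] = {}

    def record(self, pusher: str, ref: str, shas: Iterable[str]) -> PushRecord:
        rec = PushRecord(pusher, ref, tuple(shas))
        for sha in rec.shas:
            # Keep only the first push that introduced each commit.
            self._first_seen.setdefault(sha, rec)
        return rec

    def who_introduced(self, sha: str) -> Optional[str]:
        rec = self._first_seen.get(sha)
        return rec.pusher if rec else None


log = PushLog()
log.record("alice", "refs/heads/feature/login", ["s1", "s2"])
log.record("bob", "refs/heads/master", ["s2", "s3"])
print(log.who_introduced("s2"))
print(log.who_introduced("s3"))
```

Because a SHA only changes if history is rewritten, the pair of pusher and SHA stays a stable reference for later audits.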

There are no perfect answers around this stuff right now, but people need to be aware of it, especially people coming from SVN or Perforce, where you do have 100% confidence that the person who pushed the code into your system is the person listed as the author of that code.

It's not like someone can't take a piece of code, hand it to a friend, have them actually commit it to SVN for them, but at least it'll record the person that actually uploaded the code accurately.

Another question we ran into is actually a question about access control. We've got thousands of repositories, and in many cases that yields potentially thousands of ACLs. Obviously, there's organizations and projects, which are good, but how do you decide who gets to write to those projects?

And in many cases, as people have talked about in other places, the ability to go ahead and self-service is great. But if someone has the ability to add and change your access permissions as a setting, they probably have the ability to change the actual settings on the technical controls, turning off some of the things we rely on, first of all.

But second of all, they also have, at the same time, access to the system to write to the repository. Well, that may violate your audit needs. You may want to be able to go ahead and say, the person who's approving access changes or granting the ability to write to the actual repository may not be allowed to have write access to the repository at the same time.

So who's managing and who's approving access requests? Are they the same people who should have access to write to the repository, as I mentioned? And what about access audits? Do you do them quarterly? Do you do them more often?
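The separation-of-duties question above can be checked mechanically once permissions are exported. A small sketch, assuming you already have, per repository, the set of users with write access and the set who can grant access; the input shape and names are hypothetical.

```python
from typing import Dict, Set


def duty_conflicts(
    writers: Dict[str, Set[str]],
    access_admins: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, per repository, the users who can both grant access and
    write to that repository."""
    conflicts: Dict[str, Set[str]] = {}
    for repo, admins in access_admins.items():
        overlap = admins & writers.get(repo, set())
        if overlap:
            conflicts[repo] = overlap
    return conflicts


# Hypothetical permission export.
writers = {"payments": {"dev1", "dev2", "lead"}, "docs": {"dev3"}}
access_admins = {"payments": {"lead", "manager"}, "docs": {"manager"}}
print(duty_conflicts(writers, access_admins))
```

Running a check like this on a schedule is one way to produce evidence for a periodic access audit.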

We also need, in many cases, to worry about separations of concerns so that ops can't actually modify code that was written by the developers, because that's how some of the controls work.

Now, "Can you prove it?" is the other side. With some of these tools, there may not be universal read access available. You may have to be a sysadmin in order to see the permissions on any repository. Well, now you can change things.

Also, we strongly recommend automation. APIs, of course. Thank God almost all the enterprise Git tools actually support APIs and provide really nice ones. And they let you configure and do things, but there's not 100% coverage on all the settings in some cases. I think we're getting pretty close to the point with Bitbucket Server where almost everything you want is now available via an API.

But it's not necessarily... We need to make sure the vendors are actually providing the ability to set up a new repository, for example, a new project, without actually having to go ahead and go in and have someone do something. We really want to make sure that happens.

Now, we've actually gone ahead and looked at this and said we're probably going to need a separate external front end for user management. We also are going to need separate front ends for managing settings. You may want to go ahead and webhook-notify a Slack system for this repository, but this other repository may need this.

Another area of interest is audit tooling. We need to provide the ability for a repository owner, or more importantly the manager of the repository owner, like the business owner, to say, "Okay, provide the evidence," and to store the evidence of the audit so that it's there when the QSA comes and reviews it. That's somewhat an audit of an audit, so the language kind of breaks down and it's a little confusing. So we need to have those.

And that's pretty much what I have, but I'm going to go ahead and note these are not going to help you. You can just Google them, and they'll be available later.

So one other thing we have found, though, is, speaking of webhooks: webhooks are not actually guaranteed delivery. Has anyone noted that?

Polling is terrible for your SCM system, right? So if you're not polling, and you're not doing a nightly build, and you've got a webhook that says, "Oh, I've changed," to your build system, and the build system doesn't get that webhook, did it happen?

So we've actually looked and are thinking of actually using something akin to RabbitMQ or something like that to go ahead and send a message of the fact that there's been a change to something. And on the build system, that the build has finished, and I can notify the downstream consumers, maybe my CI system, to go ahead and start running tests. But it's just something to think about. Your webhooks are not actually guaranteed.
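What a message queue adds over a fire-and-forget webhook is acknowledgement: a message stays pending until the consumer confirms it was handled, so a missed delivery gets retried. The toy in-memory queue below illustrates only that idea; it is not RabbitMQ's API, and the class and message format are made up.

```python
from typing import Dict, List, Tuple


class AckQueue:
    """Messages remain pending, and are redelivered, until acknowledged."""

    def __init__(self) -> None:
        self._pending: Dict[int, str] = {}
        self._next_id = 1

    def publish(self, body: str) -> int:
        msg_id = self._next_id
        self._next_id += 1
        self._pending[msg_id] = body
        return msg_id

    def deliver(self) -> List[Tuple[int, str]]:
        # Every unacknowledged message is offered again, oldest first.
        return sorted(self._pending.items())

    def ack(self, msg_id: int) -> None:
        self._pending.pop(msg_id, None)


queue = AckQueue()
queue.publish("repo=proj/app ref=refs/heads/master changed")

first_attempt = queue.deliver()   # build system is down: nothing is acked
second_attempt = queue.deliver()  # the same message is offered again
for msg_id, body in second_attempt:
    print("building for:", body)
    queue.ack(msg_id)             # acknowledge only after handling it
print("still pending:", queue.deliver())
```

Redelivery means the consumer can see the same message twice, so the build trigger should be safe to repeat.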

So here's some references. And I went relatively fast, actually. Perfect. I am happy to take questions and thoughts.

Q&A

Matthew Barr: Do we not have a microphone? So I'll repeat what you say when you ask. Try to speak loud.

Q: We use both Git and Perforce.

Matthew Barr: Okay.

Q: Perforce is used because there's more than just text, so we have content. Was that the reason why Akamai used Perforce?

A: So the question is, they're using Git and Perforce. We actually have Git and Perforce. Why do we use Git and Perforce?

Right now, Git has some very interesting attributes, meaning some really nice workflows that are possible. We've been using Perforce for something over 15 years. Git didn't exist. SVN, I don't believe existed. Perforce was really the best option, from what I understand then. I've been at the company about two years, so I really don't know what the thinking was 15 years ago.

Having said that, you did mention that Perforce supports things that are larger than text, or large files. In that vein, we've been looking at Git LFS. I believe both GitHub Enterprise and Bitbucket Server support Git LFS, as do Artifactory and a couple of other tools. So there are some interesting options available out there now.

We also have a custom build and dependency system, and so we have to actually look at... One of my projects is to figure some of this out, actually.

We're looking at Git LFS for things that are really integral to the actual repository, maybe a binary that's needed as part of the build, but not really a test object. But maybe a tarball or, say, a Debian package might be versioned independently, directly in Artifactory, and not be put in as part of the actual repository, but just listed as a dependency in a file. Similar to, say, a Maven-type dependency or a Python package or a RubyGem, where you're listing a dependency of, "Oh, I've got a dependency on that file."

That could be versioned independently. But yes, it's a valid question. And Git does impose some serious problems if you want to store large files because every clone needs to get the whole thing. And your CI tools are going to pound this.

One of the reasons I noted the mirrors is that that actually really helps there, to have the smart mirror, because it can have a full copy locally to your build system or all of your build systems around the world without having to go back to the central tool.

And because most Git commits are small, having the writes go a long distance and have a higher latency is much nicer than having those reads coming from far away. And so that heavy load of a CI build system on the master is really not so great, especially if it's any distance away.

We've got locations in the Bay Area, up in Washington State, Cambridge in Massachusetts, Poland, Israel, Bangalore, and those are just the major engineering offices. So we really don't have a perfect solution, which is why we're worrying about these kinds of problems.

And also why messaging buses are interesting, because we're going to want to build something for a developer that's in, say, the Bay Area, someplace close to the Bay Area. But I don't want to build something for a guy in Israel that produces a one-gig artifact from a one-line change in the Bay Area. I want to build it someplace near Israel so I don't have to transfer the one gig from here all the way there because it was only a one-line, 1K delta on the actual code base, but it produced a one-gig artifact. That's terrible.

Any other questions?

Okay. Oh, back there.

Q: Are you doing anything in your downstream system? So you talked about your webhooks, but further on too, you talked about how critical the auditing is in Git itself. Are you using that auditing further down the system as you move into deployment to verify things about your builds and verify things about your deployments?

A: So as I said, we're in the early stages of this, and we're designing some of the stuff. The question is, are we using any of the information from Git and the audit trails and things like that in the downstream systems?

And the answer is yes. We're actually currently updating the information in Bugzilla when a pull request is merged, which is a custom plugin that we wrote to do this inside of Stash, which is actually one thing to note.

We really like the idea of having the ability to put plugins into the code base for the tool. And the fact that Atlassian gives us the source code and lets you just add a plugin that you wrote can be really helpful when you're trying to deal with something that you don't have.

If you don't have a feature, you can write it at least, so you're not fighting the product dev time. But yes, there's some element. We're using some of that. We'll probably use more of it in terms of using the SHAs as part of the actual audit state and things along those lines.

And I think I am done. And I think you're supposed to go ahead and fill out the survey. They asked me to remind you. Bye.