# general
j
If you follow the concepts of immutable infrastructure, you don't want files to change on servers. Creating a new image of the server, bringing it up and rotating the old ones out is the way to do it. If that's not practical, then you need some kind of configuration management tool to help you mutate the state of servers. Ansible, Chef or Puppet are popular ones.
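For anyone following along, a minimal sketch of what that "mutate the state of servers" approach looks like with Ansible; the host group, package and file names here are hypothetical, not anything from this thread:
```yaml
# Hypothetical playbook: mutate existing servers in place instead of replacing them.
- hosts: app_servers
  become: true
  tasks:
    - name: Keep nginx at the latest packaged version
      ansible.builtin.package:
        name: nginx
        state: latest

    - name: Render the app config from a template
      ansible.builtin.template:
        src: app.conf.j2
        dest: /etc/myapp/app.conf
        mode: "0644"
      notify: restart app

  handlers:
    - name: restart app
      ansible.builtin.service:
        name: myapp
        state: restarted
```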
k
But assuming you have such immutable infrastructure, then a rolling deploy or a blue-green deploy is how you roll it out without bringing things down. We do this as part of our release process; twice a day almost every server gets replaced with a new one. There's a few minutes where everything is doubled, but it's far better than trying to manage instances. That being said, there are a few machines we have where we can't just replace them all the time. We've been pruning the number down, but those ones are where we use config management
b
I find it's way easier and faster to change files, rather than replace images.
Because if you use Packer, for example, the entire process to change and validate that is not a walk in the park
k
No, but it can be totally automated, which means the image build time is irrelevant. There are advantages like not having to worry about patching OSes, garbage filling up volumes, etc
Anyway, I'd never go back to just patching machines, but if you still want to go that route I'd pick Ansible over Chef or Puppet
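A rough sketch of what "totally automated" can look like, as a hypothetical CI job that rebuilds the image on every merge; the workflow, action refs and repo layout are assumptions, not anyone's actual pipeline:
```yaml
# Hypothetical CI workflow: rebuild the machine image on every merge to main,
# so image build time stays out of the interactive feedback loop.
name: bake-image
on:
  push:
    branches: [main]
jobs:
  packer:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-packer@main   # assumed action ref
      - name: Build the image
        run: |
          packer init .
          packer validate .
          packer build .
```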
b
Ok so if you have to set something up, i.e. during the development life cycle, how do you manage that? Do you build, pack, deploy and watch how that plays out? Cos that is about 15 mins minimum to see changes reflected
Patching helps a lot with the development life cycle, the feedback loop is way shorter
k
I suppose to clarify, the immutable infrastructure in my case is at production, with hundreds of servers and dozens of services. Also, machine images for us are different from the artifact image. The image pulls down the artifact when it starts, so we aren't building images all the time. At dev/test, it's easy to replace a file manually
b
Gotcha, that is what I was thinking. But even before you get to production you need to validate your config by creating a machine image as the final step
But i get you
k
Generally config changes at a slower pace than code. So they are handled as independent releases
Also, to clarify a bit more: the images contain templated config and an agent (proprietary in our case). Config that is environment specific is pulled by the agent from things like userdata, secrets manager, etc on load. The image itself is rebuilt regularly to pick up OS patches and whenever a template change is required. If a template change is required, it's deployed with the previous artifact version to ensure it doesn't break anything ahead of the artifact requiring the config change. If it's just a matter of tweaking values that are templated, it's simply a change to the source for those values and a redeploy, which can be done in about 2 minutes
We're now mostly k8s based, which makes all the above irrelevant, but it works well for us for our stuff that still lives on EC2 instances
b
Your k8s is managed, right?
a
If immutable infrastructure isn't an option, and you're building your AMIs using Packer with, say, the Ansible provisioner, you can use AWS dynamic inventory with Ansible to target those pet machines later and update them in situ https://developer.hashicorp.com/packer/plugins/provisioners/ansible/ansible https://docs.ansible.com/ansible/latest/collections/amazon/aws/docsite/aws_ec2_guide.html
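To make that concrete, a minimal sketch of an aws_ec2 dynamic inventory file; the region, tag key and group prefix are placeholders:
```yaml
# inventory/aws_ec2.yml - hypothetical dynamic inventory for the amazon.aws.aws_ec2 plugin.
# Groups running instances by their Role tag so a playbook can target e.g. role_web.
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
filters:
  instance-state-name: running
keyed_groups:
  - key: tags.Role
    prefix: role
hostnames:
  - private-ip-address
```
The same playbook the Packer Ansible provisioner bakes into the AMI could then, in principle, be re-run in situ with something like `ansible-playbook -i inventory/aws_ec2.yml site.yml --limit role_web`.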
b
interesting.
so what happens when you change the playbook?
a
You both rebuild the AMI and re-run the playbook against any EC2 instances still running historical versions of that AMI. If they're all tagged consistently, it should work as expected
If you hook in https://molecule.readthedocs.io tests you can get semi-automated in terms of building the AMIs safely from PRs and rolling out changes to your fleet in a canary fashion
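If it helps, a bare-bones molecule.yml sketch for testing such a role before it ever touches an AMI; the driver and platform image are assumptions:
```yaml
# molecule/default/molecule.yml - hypothetical Molecule scenario for the provisioning role.
# Converges the role in a throwaway container and verifies it before any AMI bake.
dependency:
  name: galaxy
driver:
  name: docker          # assumes the Molecule docker plugin is installed
platforms:
  - name: instance
    image: amazonlinux:2023
    pre_build_image: true
provisioner:
  name: ansible
verifier:
  name: ansible
```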
k
@Bubunyo Nyavor yes, managed. We used to manage a fair bit ourselves, but our team is small and in the time vs money debate, time was more important. We still run a very tight ship cost-wise, but the effort to manage things like databases and k8s ourselves wasn't worth it. Of course your mileage may vary, and what was the right decision for us can be the wrong one in your case
a
Very much agree with Kyle, what I suggested can work, but worth evaluating what makes sense for your situation and your team in the short-term and the long-term. Managing pet servers becomes a real pain-in-the-butt beyond a certain point
b
@Kyle Campbell agreed.
@Andrew Kirkpatrick what I am actually thinking is that when you run a packer build, it should identify all instances in your fleet and patch them forward, so you don't have to roll forward to make changes persist. This will present its own unique problems, but having to tear down and rebuild instances all the time is a waste of time, I reckon.
a
That's essentially what I was (trying to) describe. Update the AMI and use the same playbook used to build the AMI to update the fleet too.
having to tear down and rebuild instances all the time
As Kyle mentioned IMO that'll really depend on how big your fleet is and what it is running. Ansible and even Chef/Puppet hits a practicality limit at a certain point to manage pet servers (very much depending on each company)
k
As for tearing down/rebuilding, in our case it's actually faster than patching. We use a blue-green deploy process: we launch all the machines at once and flip over to the new set, an operation that only takes 2-3 minutes, then kill the old ones. With patching (which we used to do), we had to roll machine-by-machine to prevent things from going down. It was long, and invariably you'd hit an issue after a dozen servers, requiring a big ordeal to resolve before continuing the process
I call it “just-in-time” blue green actually, as we don’t want to maintain a whole other set of servers all the time. It’s just during the deployment window
b
interesting point all around. I will give the ansible approach a try on my next run
k
I’m curious what your process is currently
b
I am trying to bootstrap a Nomad/Consul fleet on a non-work project, and I ran into a few issues configuring TLS. I am using Packer to generate AMIs and provisioning using managed instances and templates in GCP. Any time I made a small change I had to pack, create a new template and use click-ops to roll over; the rollover takes about 15 mins. So I had to SSH in and make the changes there. But TLS is about comms between two nodes, so I had to have two terminals open. The TTL of my changes was absurd.

At work we run more than 1000 bare metal instances. I am not a platform engineer though, just a SWE, but I know it is impossible to have immutable images there, it is impractical. We built our own tooling, which we are currently migrating to Ansible. Cons: at any time, two machines that are supposed to be doing the same thing might have different configs, and given that our network is also proprietary, network issues are something we have to deal with constantly due to the constantly diverging states. Pros: changes are lightning fast, no matter the number of machines you target. So naturally, I am thinking how I can have both cakes.
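For the TLS pain specifically, a hedged sketch of pushing regenerated certs to every node in one Ansible run instead of juggling SSH sessions; the group name, file names and paths are guesses for illustration, not Consul's required layout:
```yaml
# Hypothetical playbook: distribute regenerated TLS material to all Consul/Nomad nodes
# in a single run, then reload the agents to pick it up.
- hosts: consul_nodes
  become: true
  tasks:
    - name: Copy CA and node certificates
      ansible.builtin.copy:
        src: "certs/{{ item }}"
        dest: "/etc/consul.d/tls/{{ item }}"
        owner: consul
        mode: "0600"
      loop:
        - consul-agent-ca.pem
        - node-cert.pem
        - node-key.pem
      notify: reload consul

  handlers:
    - name: reload consul
      ansible.builtin.service:
        name: consul
        state: reloaded
```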
k
Ah, baremetals. That definitely complicates things. Any of the big 3, Ansible, Chef or Puppet, are basically made for that scenario and would help detect configuration drift. I'm not sure I'd call any of them "lightning fast" though. Ansible playbooks can be time consuming to push to all the machines, and with Chef it's more an eventual consistency model, with changes applying whenever the agents on the machines pick up the new cookbook. I've never worked with Puppet, but I know it's conceptually similar to Chef.
b
our custom tool was written in perl, idk how fast it is but if changes are distributed in less than a minute, it sure beats 15 minutes.
k
Given how fast it distributes the changes, my assumption is it doesn't query the machines' current config state. It's usually that process of reconciling differences where tools like Ansible need time. Ansible in particular essentially describes your end state as code, so it takes some time to figure out how to get from your current state to the desired state, which is the same philosophy that ended up in Terraform. It may not be possible to have your cake and eat it too, but one thing you can do to mitigate how long it takes is have the playbook run on a regular recurring basis, so that it's always bringing your machines into alignment with the config and pushing out a new change should take minimal time
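A small sketch of that recurring-run idea, using ansible-pull on a cron so each machine keeps converging on its own; the repo URL, playbook name and schedule are placeholders:
```yaml
# Hypothetical playbook: schedule ansible-pull on every machine so config keeps
# converging in the background, and an explicit push only has a small delta to apply.
- hosts: all
  become: true
  tasks:
    - name: Run ansible-pull every 30 minutes
      ansible.builtin.cron:
        name: ansible-pull
        minute: "*/30"
        job: "ansible-pull -U https://example.com/config.git site.yml >> /var/log/ansible-pull.log 2>&1"
```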