We deleted the production database by accident


442 points | by caspii 3 days ago


  • skrebbel 3 days ago

    I'm appalled at the way some people here receive an honest postmortem of a human fuck-up. The top 3 comments, as I write this, can be summarized as "no, it's your fault and you're stupid for making it".

    This is not good! We don't want to scare people into writing less of these. We want to encourage people to write more of them. An MBA style "due to a human error, we lost a day of your data, we're tremendously sorry, we're doing everything in our power yadayada" isn't going to help anybody.

    Yes, there are all kinds of things they could have done to prevent this from happening. Yes, some of the things they did (or didn't do) were clearly mistakes that a seasoned DBA or sysadmin would not make. Possibly they aren't seasoned DBAs or sysadmins. Or they are, but they still made a mistake.

    This stuff happens. It sucks, but it still does. Get over yourselves and wish these people some luck.

    • t0mas88 3 days ago

      The software sector needs a bit of aviation safety culture: 50 years ago the conclusion "pilot error" as the main cause was virtually banned from accident investigation. The new mindset is that any system or procedure where a single human error can cause an incident is a broken system. So the blame isn't on the human pressing the button, the problem is the button or procedure design being unsuitable. The result was a huge improvement in safety across the whole industry.

      In software there is still a certain arrogance of quickly calling the user (or other software professional) stupid, thinking it can't happen to you. But in reality given enough time, everyone makes at least one stupid mistake, it's how humans work.

      • janoc 3 days ago

        It is not only that but also realizing that there is never a single cause to an accident or incident.

        Even when it was a suicidal pilot flying the plane into a mountain on purpose. Someone had to supervise him (there are two crew members in the cockpit for a reason), someone gave him a medical, there is automation in the cockpit that could have at least caused an alarm, etc.

        So even when the accident is ultimately caused by a pilot's actions, there is always a chain of events where if any of the segments were broken the accident wouldn't have happened.

        While we can't prevent a bonkers pilot from crashing a plane, we could perhaps prevent a bonkers crew member from flying the plane in the first place.

        Aka the Swiss cheese model. You don't want to let the holes align.

        This approach is widely used in accident investigations, and not only in aviation. Most industrial accidents are investigated like this, trying to understand the entire chain of events so that processes can be improved and the problem prevented in the future.

        Oh and there is one more key part in aviation that isn't elsewhere. The goal of an accident or incident investigation IS NOT TO APPORTION BLAME. It is to learn from it. That's why pilots in airlines with a healthy safety culture are encouraged to report problems, unsafe practices, etc. and this is used to fix the process instead of firing people. Once you start to play the blame game, people won't report problems - and you are flying blind into a disaster sooner or later.

        • oblio 3 days ago

          The user? Start a discussion about using better programming language and you'll see people, even here, blaming the developer.

          The common example is C: "C is a sharp tool, but with a sufficiently smart, careful and experienced developer it does what you want" (i.e. "you're holding it wrong").

          Developers still do this to each other.

          • m463 3 days ago

            That reminds me of the time during the rise of the PC when windows would do something wrong, from a confusing interface all the way up to a blue screen of death.

            What happened is that users started blaming themselves for what was going wrong, or started thinking they needed a new PC because problems became more frequent.

            From the perspective of a software guy, it was obvious that windows was the culprit but people would assign blame elsewhere and frequently point the finger at themselves.

            So yes - an FAA-style investigation would end up unraveling the nonsense and pointing the finger at Windows.

            That said, aviation-level safety is reliable and dependable, with few single points of failure, and... there are no private kit jets, darnit!

            There is a continuum from nothing changes & everything works to everything changes & nothing works. You have to choose the appropriate place on the dial for the task. Sounds like this is a one-man band.

            • watwut 3 days ago

              Yeah, but "pilot was drinking alcohol" would be considered an issue, would lead to the pilot being fired, and would lead to more alcohol testing.

              I understand what you are talking about, but aviation also has strong expectations of pilots.

              • neillyons 3 days ago

                This sounds quite interesting. Any books you could recommend on the "pilot error" topic?

                • ENOTTY 2 days ago

                  The idea that multiple failures must occur for catastrophic failure is found in certain parts of the computing community. https://en.wikipedia.org/wiki/Defense_in_depth_(computing)

                  • suzakus 3 days ago

                    It's a piece of software for scoreboards. Not the Therac-25, nor an airplane.

                  • ganafagol 3 days ago

                    It's good to have a post mortem. But this was not actually a post mortem. They still don't know how it could happen. Essentially, how can they write "We’re too tired to figure it out right now." and right after attempt to answer "What have we learned? Why won’t this happen again?"? Obviously you have not learned the key lesson yet, since you don't know what it is! And how can you even dream of claiming that it won't happen again before you know the root cause?

                    Get some sleep, do a thorough investigation, and the results of that are the post mortem that we would like to see published and that you can learn from.

                    Publishing some premature thoughts without actual insight is not helping anybody. It will just invite the hate that you are seeing in this thread.

                    • ordu 3 days ago

                      > I'm appalled at the way some people here receive an honest postmortem of a human fuck-up. The top 3 comments, as I write this, can be summarized as "no, it's your fault and you're stupid for making the fault".

                      It seems that people are annoyed mostly by the "complexity gremlins". They are so annoyed that they miss the previous sentence: "we’re too tired to figure it out right now." The guys fucked up their system, restored it as best they could, and tried to figure out what happened, but failed. So they decided to do the PR right now, to explain what they know, and to continue the investigation later.

                      But people see just "complexity gremlins". The lesson learned is do not try any humor in a postmortem. Be as serious, grave, and dull as you can.

                      • rawgabbit 2 days ago

                        For me, this is an example of DevOps being carried too far.

                        What is to stop developers from checking "drop database; drop table; alter index; create table; create database; alter permission;" into GitHub? They are automating environment builds, so that is more efficient, right? In my career, I have seen a Fortune 100 company's core system down and out for a week because of hubris like this. In large companies, data flows downstream from a core system. When you have to restore from backup, that cascades into restores in all the child systems.

                        Similarly, I once had to convince a Microsoft evangelist who was hired into my company not to redeploy our production database every time we had a production deployment. He was a pure developer and did not see any problem with dropping the database, recreating it, and re-inserting all the data. I argued that a) this would take 10+ hours, and b) the production database has data going back many years, and the schema/keys/rules/triggers have evolved during that time -- meaning that many of the inserts would fail because they didn't meet the current schema. He was unconvinced, but luckily my bosses overruled him.

                        My bosses were business types and understood accounting. In accounting, once you "post" a transaction to the ledger that becomes permanent. If you need to correct that transaction, then you create a new one that "credits" or corrects the entry. You don't take out the eraser.

                        • bromuro 2 days ago

                          I think you have to wait 10+ hours to read different kinds of comments on HN.

                          For example, if I open the comments on a "14 hours ago" post, I usually see a top comment about other comments (like yours).

                          I then feel so out of the loop, because I don't see the commenters you are referring to - so the thread that follows seems off topic to me.

                          • caspii 3 days ago


                            • qz2 3 days ago

                              I disagree.

                              Culturally speaking, we like to pat people on the back when they do something stupid and comfort them. But most of the time this isn't productive, because it doesn't instil the requisite fear for working out what decision to make.

                              What happens is we have growing complacency and disassociation from consequences.

                              Do you press the button on something potentially destructive because you are confident it is OK through analysis, good design, and testing, or confident it is OK through trite complacency?

                              The industry is mostly the latter, and it has to stop. The first step is calling bad processes, bad software, and stupidity out for what they are.

                              Honestly, these guys did good, but most will try to hide this sort of fuck-up or explain it away with weasel words.

                              • jurre 3 days ago

                                You should have zero fear instilled when pressing any button. The system or process has failed if a single person pressing a button can unintentionally bring something down. Fix the system/process; don't "instill fear" in the person. That's toxic, and now you also have to make sure every new person onboarded has the "fear instilled" - which is just hugely unproductive.

                            • michelpp 3 days ago

                              > Computers are just too complex and there are days when the complexity gremlins win.

                              I'm sorry for your data loss, but this is a false and dangerous conclusion to make. You can avoid this problem. There are good suggestions in this thread, but I suggest you use Postgres's permission system to REVOKE DROP action on production except for a very special user that can only be logged in by a human, never a script.

                              And NEVER run your scripts or application servers as a superuser. This is a dangerous antipattern embraced by many an ORM and library. Grant CREATE and DROP to non-super users.
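                              A sketch of the permissions side (all names are invented; note that in Postgres there is no grantable DROP privilege - only a table's owner or a superuser can drop it, so the safeguard is ownership):

                              ```sql
                              -- Illustrative only: app_user can read and write data but owns
                              -- nothing, so it cannot DROP anything. dba_admin is a human-only
                              -- login that owns the tables.
                              CREATE ROLE dba_admin LOGIN;
                              CREATE ROLE app_user  LOGIN;

                              GRANT CONNECT ON DATABASE prod TO app_user;
                              GRANT USAGE   ON SCHEMA public TO app_user;
                              GRANT SELECT, INSERT, UPDATE, DELETE
                                  ON ALL TABLES IN SCHEMA public TO app_user;

                              -- tables stay owned by dba_admin, the only role that can drop them
                              ALTER TABLE scores OWNER TO dba_admin;
                              ```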

                              • sushshshsh 3 days ago

                                As a mid level developer contributing to various large corporate stacks, I would say the systems are too complex and it's too easy to break things in non obvious ways.

                                Gone are the days of me just being able to run a simple script that accesses data read-only and exports the result elsewhere as an output.

                                • auroranil 3 days ago

                                  Tom Scott made a mistake with a similar outcome to the one in this article, but with an SQL query that is much more subtle than a DROP.


                                  By all means, find ways to fool-proof the architecture. But be prepared for scenarios where some destructive action happens to a production database.

                                  • thih9 3 days ago

                                    > You can avoid this problem.

                                    The article isn’t claiming that the problem is impossible to solve.

                                    On the contrary: “However, we will figure out what went wrong and ensure that this particular error doesn’t happen again.”.

                                    • DelightOne 3 days ago

                                      If you use terraform to deploy the managed production database, do you use the postgresql terraform provider to create roles or are you creating them manually?

                                      • bsder 3 days ago

                                        > You can avoid this problem.

                                        No, you can't. No matter how good you are, you can always "rm -rf" your world.

                                        Yes, we can make it harder, but, at the end of the day, some human, somewhere, has to pull the switch on the stuff that pushes to prod.

                                        You can clobber prod manually, or you accidentally write an erroneous script that clobbers prod. Either way--prod is toast.

                                        The word of the day is "backups".

                                        • centimeter 3 days ago

                                          > but this is a false and dangerous conclusion to make

                                          Until we get our shit together and start formally verifying the semantics of everything, their conclusion is 100% correct, both literally and practically.

                                        • oppositelock 3 days ago

                                          You have to put a lot of thought into protecting and backing up production databases, and backups are not good enough without regular testing of recovery.

                                          I have been running Postgres in production supporting $millions in business for years. Here's how it's set up. These days I use RDS in AWS, but the same is doable anywhere.

                                          First, the primary server is configured to send write-ahead logs (WAL) to a secondary server. What this means is that before a transaction completes on the master, the slave has written it too. This is a hot spare in case something happens to the master.

                                          Secondly, WAL logs will happily contain a DROP DATABASE in them, they're just the transaction log, and don't prevent bad mistakes, so I also send the WAL logs to backup storage via WAL-E. In the tale of horror in the linked article, I'd be able to recover the DB by restoring from the last backup, and applying the WAL delta. If the WAL contains a "drop database", then some manual intervention is required to only play them back up to the statement before that drop.

                                          Third is a question of access control for developers. Absolutely nobody should have write credentials for a prod DB except for the prod services. If a developer needs to work with data to develop something, I have all these wonderful DB backups lying around, so I bring up a new DB from the backups, giving the developer a sandbox to play in, and also testing my recovery procedure, double-win. Now, there are emergencies where this rule is broken, but it's an anomalous situation handled on a case by case basis, and I only let people who know what they're doing touch that live prod DB.
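                                          As a rough sketch of the WAL-shipping piece (values and timestamps are invented; `wal-push`/`wal-fetch` are WAL-E's standard archive commands):

                                          ```
                                          # postgresql.conf on the primary
                                          wal_level = replica
                                          archive_mode = on
                                          archive_command = 'wal-e wal-push %p'          # ship each WAL segment offsite

                                          # on a restored copy, replay only up to just before the bad statement
                                          restore_command = 'wal-e wal-fetch "%f" "%p"'
                                          recovery_target_time = '2020-10-12 21:14:00'   # a minute before the DROP
                                          recovery_target_action = 'pause'
                                          ```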

                                          • azeirah 3 days ago

                                            Quick tip for anyone learning from this thread.

                                            If you're using MySQL, it's called a binary log, not a write-ahead log. It was very difficult to find meaningful Google results for "MySQL WAL".

                                            • x87678r 3 days ago

                                              Interesting. I immediately thought they would have a transaction log; I didn't think it would have the delete as well.

                                              It's a real problem that we used to have trained DBAs who owned the data, whereas now devs and automated tools are relied upon, and there isn't a culture or toolset built up yet to handle it.

                                              • mr_toad 2 days ago

                                                > I have all these wonderful DB backups lying around, so I bring up a new DB from the backups

                                                It’s nice to have that capability, but some databases are just too big to have multiple copies lying around, or to able to create a sandbox for everyone.

                                              • danellis 3 days ago

                                                > after a couple of glasses of red wine, we deleted the production database by accident

                                                > It’s tempting to blame the disaster on the couple of glasses of red wine. However, the function that wiped the database was written whilst sober.

                                                It was _written_ then, but you're still admitting to the world that your employees do work on production systems after they've been drinking. Since they were working so late, one might think this was emergency work, but it says "doing some late evening coding". I think this really highlights the need to separate work time from leisure time.

                                                • cle 3 days ago

                                                  No. Your systems and processes should protect you from doing something stupid, because we’ve all done stupid things. Most stupid things are done whilst sober.

                                                  In this case there were like 10 relatively easy things that could have prevented this. Your ability to mentally compile and evaluate your code before you hit enter is not a reliable way to protect your production systems.

                                                  Coding after drinking is probably not a good idea of course, but “think better” is not the right takeaway from this.

                                                  • cranekam 3 days ago

                                                    The whole piece has a slightly annoying flippant tone to it. We were drunk! Computers just.. do this stuff sometimes! Better to sound contrite and boring in such a situation IMO.

                                                    Also I agree with other comments: doing some work after a glass or two should be fine because you should have other defences in place. “Not being drunk” shouldn’t be the only protection you have against disaster.

                                                    • danielh 3 days ago

                                                      According to the about page, "the employees" consist of one guy working on it in his spare time.

                                                      • bstar77 3 days ago

                                                        I am not a drinker myself (I drink 1-3 times a year), but in the past I have coded while slightly buzzed on a few occasions. I could not believe the level of focus I had. I never investigated it further, but I'm pretty sure the effects of alcohol on our coding abilities are not nearly as bad as its effects on our motor skills. Imo, fatigue is far worse.

                                                        • beervirus 3 days ago

                                                          Sounds like they weren't trying to do work on production.

                                                          • murillians 3 days ago

                                                            I think it's just a joke

                                                            • caspii 3 days ago

                                                              I have no employees. I only have myself to blame

                                                              • yelloweyes 3 days ago

                                                                lol it's a scoreboard app

                                                              • aszen 3 days ago

                                                                I had a narrow escape once doing something fancy with migrations.

                                                                We had several MySQL string columns using the LONGTEXT type in our database, but they should have been varchar(255) or so. I was assigned to convert these columns to their appropriate size.

                                                                Being the good developer I was, I decided to download a snapshot of the prod database locally and check the maximum string length we had for each column via a script. Using this script, I generated a migration query that would alter the column types to match their maximum used length, keeping varchar(255) as the minimum.

                                                                I tested that migration and everything looked good; it passed code review and was run on prod. Soon after, we started getting complaints from users that their old email texts had been truncated. I then realized the stupidity of the whole thing: the local dump of the production database always wiped many columns clean for privacy, like the email body column. So the script thought that column had a max length of 0 and decided to convert it to varchar(255).

                                                                I realize the whole thing may look incredibly stupid; that's only because the naming of the db columns was in a foreign European language, so I didn't even know the semantics of each column.

                                                                Thankfully my seniors managed to restore that column and took the responsibility themselves since they had passed the review.

                                                                We still fixed those unusually large columns, but this time with simple duplicated alter queries for each column instead of fancy scripts.

                                                                I think a valuable lesson was learned that day to not rely on hacky scripts just to reduce some duplicate code.

                                                                I now prefer clarity and explicitness when writing such scripts instead of trying to be too clever and automating everything.

                                                                • heavenlyblue 3 days ago

                                                                  And you didn't even bother to query the actual maximum length of the columns you were mutating? Or at least query and look at the text in there?

                                                                  Basically you just blindly ran the migration on the data and checked if it didn’t fail?

                                                                  The lesson here is not about cleverness unfortunately.

                                                                • john_moscow 3 days ago

                                                                   Just my 2 cents. I run a small software business that involves a few moderately-sized databases. The day I moved from fully managed hosting to a Linux VPS, I crontabbed a script like this to run several times a day:

                                                                       for db in `mysql [...] | grep [...]`; do
                                                                           mysqldump [...] > "$db.sql"
                                                                       done
                                                                       git commit -a -m "Automatic backup"
                                                                       git push [backup server #1]
                                                                       git push [backup server #2]
                                                                       git push [backup server #3]
                                                                       git gc
                                                                  The remote git repos are configured with denyNonFastForwards and denyDeletes, so regardless of what happens to the server, I have a full history of what happened to the databases, and can reliably go back in time.

                                                                  I also have a single-entry-point script that turns a blank Linux VM into a production/staging server. If your business is more than a hobby project and you're not doing something similar, you are sitting on a ticking time bomb.

                                                                  • candiddevmike 3 days ago

                                                                    Anyone reading the above: please don't do this. Git is not made for database backups, use a real backup solution like WAL archiving or dump it into restic/borg. Your git repo will balloon at an astronomical rate, and I can't imagine why anyone would diff database backups like this.

                                                                    • wolfgang000 3 days ago

                                                                      I don't believe having a massive repo with backups would be the ideal solution. Couldn't you just upload the backup to an s3 bucket instead?

                                                                      • Ayesh 3 days ago

                                                                        This is what I do too.

                                                                        The mysqldump command is tweaked to use individual INSERT clauses as opposed to one bulk one, so the diff hunks are smaller.

                                                                        You can also sed and remove the mysqldump timestamp, so there will be no commits if there are no database changes, saving the git repo space.
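                                                                        A minimal sketch of that filter (database and table names are invented; the sample input stands in for real mysqldump output, and recent mysqldump versions also have `--skip-dump-date`, which removes the timestamp without sed):

                                                                        ```shell
                                                                        # One INSERT per row keeps git diffs small:
                                                                        #   mysqldump --skip-extended-insert mydb > mydb.sql
                                                                        # Strip the trailing "Dump completed on ..." timestamp so an
                                                                        # unchanged database produces a byte-identical dump:
                                                                        printf '%s\n' 'INSERT INTO t VALUES (1);' '-- Dump completed on 2020-10-01 21:14:00' \
                                                                            | sed '/^-- Dump completed on/d'    # prints only the INSERT line
                                                                        ```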

                                                                        • mgkimsal 3 days ago

                                                                          Any issues with the privacy aspect of that data that's stored in multiple git repos? PII and such?

                                                                          • bufferoverflow 3 days ago

                                                                            You should really compress them instead of dumping them raw into Git. LZ4 or ZStandard are good.

                                                                          • amingilani 3 days ago

                                                                            Happens to all of us. Once I needed logs from the server. The log file was a few gigs and still in use, so I carefully duplicated it, grepped just the lines I needed into another file, and downloaded the smaller file.

                                                                            During this operation, the server ran out of memory - presumably because of all the files I'd created - and before I knew it I'd managed to crash 3 services and corrupt the database - which was also on this host - on my first day. All while everyone else in the company was asleep :)

                                                                            Over the next few hours, I brought the site back online by piecing commands together from the `.bash_history` file.

                                                                            • tempestn 3 days ago

                                                                              Seems unwise to have an employee doing anything with production servers on their first day, let alone while everyone else is asleep.

                                                                              • netheril96 3 days ago

                                                                                Why does the DB get corrupted? Does ACID mean anything these days?

                                                                              • xtracto 3 days ago

                                                                                This happened to me (someone on my team) a while ago, but with Mongo. The production database was ssh-tunneled to the default port on the guy's computer, and he ran tests that cleaned the database first.

                                                                                Now... our scenario was such that we could NOT lose those 7 hours because each customer record lost meant $5000 usd penalty.

                                                                                What saved us is that I knew about the oplog (the binlog in MySQL), so after restoring the backup I isolated the last N lost hours from the log and replayed them on the database.

                                                                                Lesson learned and a lucky save.

                                                                                • fma 3 days ago

                                                                                  The same happened to me many years ago: QA dropped the prod db. It's been a long time, but if I recall, "exit" and "drop database" were next to each other in the dropdown menu of the MongoDB browser... Spent a whole night replaying the oplog.

                                                                                  No one owned up to it, but had a pretty good idea who it was.

                                                                                  • 3np 3 days ago

                                                                                    A dangling port-forward was my first thought as to how this happened.

                                                                                  • muststopmyths 3 days ago

                                                                                    >Note that host is hardcoded to localhost. This means it should never connect to any machine other than the developer machine. We’re too tired to figure it out right now. The gremlins won this time.

                                                                                    Obviously, somehow the script ran on the database host.

                                                                                    some practices I've followed in the past to keep this kind of thing from happening:

                                                                                    * A script that deletes all the data can never be deployed to production.

                                                                                    * Scripts that alter the DB rename tables/columns rather than dropping them (with a matching rollback script), for at least one schema upgrade cycle. You can always restore from backups, but this can make rollbacks quick when you spot a problem at deployment time.

                                                                                    * the number of people with access to the database in prod is severely restricted. I suppose this is obvious, so I'm curious how the particular chain of events in TFA happened.
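                                                                                    The rename-instead-of-drop idea from the second point, sketched (table and column names are invented):

                                                                                    ```sql
                                                                                    -- upgrade script: keep the data around for one release cycle
                                                                                    ALTER TABLE users RENAME COLUMN legacy_notes TO legacy_notes_retired;

                                                                                    -- matching rollback script, runnable instantly if the deploy goes bad
                                                                                    ALTER TABLE users RENAME COLUMN legacy_notes_retired TO legacy_notes;

                                                                                    -- only after a full cycle with no problems:
                                                                                    -- ALTER TABLE users DROP COLUMN legacy_notes_retired;
                                                                                    ```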

                                                                                    • amluto 3 days ago

                                                                                      I have a little metadata table in production with a field that says “this is a production database”. The delete-everything script reads that flag via a SQL query that will error out if it's set, in the same transaction as the deletion. To prevent the flag from being cleared in production, the production software stack will refuse to run if the “production” flag is not set.
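                                                                                      A minimal sketch of that guard, using sqlite3 purely for illustration (the table and column names here are invented, not the commenter's actual schema):

```python
import sqlite3

def wipe_database(conn):
    """Delete all data, but refuse if the DB is flagged as production.

    The flag is read on the same connection/transaction as the deletion,
    so the check cannot be bypassed by a race.
    """
    cur = conn.cursor()
    row = cur.execute(
        "SELECT value FROM metadata WHERE key = 'is_production'"
    ).fetchone()
    if row is not None and row[0] == "true":
        raise RuntimeError("Refusing to wipe: production flag is set")
    cur.execute("DELETE FROM scores")  # the destructive part
    conn.commit()
```

                                                                                      The inverse check is what keeps the flag honest: since the production stack refuses to start when the flag is missing, nobody can make the wipe "work" by deleting the flag.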

                                                                                      • mcpherrinm 3 days ago

                                                                                        The blog mentions it's a managed DigitalOcean database, so the script likely wasn't run on the host itself.

                                                                                        More likely, I'd suspect, is something like an SSH tunnel with port forwarding was running, perhaps as part of another script.

                                                                                        • StavrosK 3 days ago

                                                                                          Someone SSHed to production and forwarded the database port to the local machine to run a report, then forgot about the connection and ran the deletion script locally.

                                                                                          • PeterisP 3 days ago

                                                                                            One aspect that can help with this is separate roles/accounts for dangerous privileges.

                                                                                            I.e. if Alice is your senior DBA who would have full access to everything including deleting the main production database, then it does not mean that the user 'alice' should have the permission to execute 'drop database production' - if that needs to be done, she can temporarily escalate the permissions to do that (e.g. a separate account, or separate role added to the account and removed afterwards, etc).

                                                                                            Arguably, if your DB structure changes generally are deployed with some automated tools, then the everyday permissions of senior DBA/developer accounts in the production environment(s) should be read-only for diagnostics. If you need a structural change, make a migration and deploy it properly; if you need an urgent ad-hoc fix to data for some reason (which you hopefully shouldn't need to do very often), then do that temporary privilege elevation thing; perhaps it's just "symbolic" but it can't be done accidentally.
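                                                                                            In PostgreSQL terms, a sketch of that setup might look like this (role names invented; the exact grants depend on your schema):

```sql
-- Everyday account: read-only diagnostics.
CREATE ROLE alice LOGIN;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO alice;

-- Dangerous privileges live in a separate, non-login role.
CREATE ROLE dba_destructive NOLOGIN;
GRANT DELETE, TRUNCATE ON ALL TABLES IN SCHEMA public TO dba_destructive;

-- Alice may assume the role, but must do so explicitly:
GRANT dba_destructive TO alice;
-- SET ROLE dba_destructive;   -- deliberate escalation
-- ...perform the ad-hoc fix...
-- RESET ROLE;                 -- drop back to read-only
```

                                                                                            Even if the escalation is only "symbolic", the point stands: the destructive path cannot be taken accidentally.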

                                                                                            • jlgaddis 3 days ago

                                                                                              > the number of people with access to the database in prod is severely restricted

                                                                                              And of those people, there should be an even smaller number with the "drop database" privilege on prod.

                                                                                              Also, from a first glance, it looks like using different database names and (especially!) credentials between the dev and prod environments would be a good idea too.

                                                                                            • unnouinceput 3 days ago

                                                                                              Quote: "Note that host is hardcoded to localhost. This means it should never connect to any machine other than the developer machine. Also: of course we use different passwords and users for development and production. We’re too tired to figure it out right now.

                                                                                              The gremlins won this time."

                                                                                              No they didn't. Instead, one of your gremlins ran this function directly against the production database. This isn't rocket science, just the common-sense conclusion. Now would be a good time to check those audit/access logs you're supposed to have enabled on said production machine.

                                                                                              • skytreader 3 days ago

                                                                                                > Instead one of your gremlins ran this function directly on the production machine.

                                                                                                Exactly my first hypothesis too. But then keepthescore claims,

                                                                                                > of course we use different passwords and users for development and production.

                                                                                                How would this hypothesis explain that?



                                                                                                John Watson: "I deduce that someone changed the dev config source so that it uses the production config values."

                                                                                                Sherlock Holmes: "My dear Watson, while that is sensible, it seems to me that the balance of probability leans towards their production instance also having the development access credentials."


                                                                                                Just my way of saying, I think this case isn't as open-and-shut as most comments (including parent) imply. I personally find the /etc/hosts mapping a likelier hypothesis, but even that can't explain how different credentials failed to prevent this. Without more details coming from a proper investigation, we are just piling assumptions on top of assumptions. We are making bricks without enough clay, as Holmes would say.

                                                                                                • ir123 3 days ago

                                                                                                  If managed DBs on DigitalOcean are anything like those on AWS, you cannot run commands directly on them, since SSH is prohibited. EDIT: there's also the fact that dev and prod environments use different credentials.

                                                                                                  • toredash 3 days ago

                                                                                                    My 2 cents: his hosts file points localhost at the prod DB's IP.

                                                                                                    • dschuetz 3 days ago

                                                                                                      As someone else said below: "hardcoded to localhost" doesn't mean it's hardcoded. It means it goes to whatever localhost resolves to. Really hardcoded should ALWAYS mean a literal IP address like 127.0.0.1, one that bypasses name resolution entirely.
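                                                                                                      The distinction is easy to demonstrate (illustrative sketch):

```python
import socket

# "localhost" is a name, not an address: it goes through name resolution
# (/etc/hosts, NSS configuration, sometimes even DNS), so it can be
# remapped to another machine without the script changing at all.
print(socket.gethostbyname("localhost"))  # usually 127.0.0.1, but not guaranteed

# A literal IP address bypasses name resolution; "resolving" it is a no-op:
assert socket.gethostbyname("127.0.0.1") == "127.0.0.1"
```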

                                                                                                    • This is bad operations.

                                                                                                      That it happened means there were many things wrong with the architecture, and summing the problem up as “these things happen” is irresponsible. Most importantly, your response to a critical failure needs to be in the mindset of figuring out how you would have prevented the error without knowing it was going to happen, and doing so in several redundant ways.

                                                                                                      Fixing the specific bug does almost nothing for your future reliability.

                                                                                                      • cblconfederate 3 days ago

                                                                                                        > Computers are just too complex and there are days when the complexity gremlins win.

                                                                                                        Wow. But then again it's not like programmers handle dangerous infrastructure like trucks, military rockets or nuclear power plants. Those are reserved for adults

                                                                                                        • geofft 3 days ago

                                                                                                          I'm not sure I follow your point - I think you'll find the same attitude towards complexity by operators of military rockets and nuclear power plants. If you look at postmortems/root-cause analyses/accident reports from those fields, you'll generally find that a dozen things are going wrong at any given time, they're just not going wrong enough to cause problems.

                                                                                                          • yunruse 3 days ago

                                                                                                            I feel that computers make it easier for this danger to be more indirect, however. The examples you give are physical, and even the youngest child would likely recognise they are not regular items. A production database, meanwhile, is visually identical to a test database if measures are not taken to make it distinct. Adults though we may be, we're human, and humans can make really daft mistakes without the right context to avoid them.

                                                                                                            • ineedasername 3 days ago

                                                                                                              Yep, I hate when I'm dealing with a system literally comprised of logic and a magic gremlin shows up to ruin my day.

                                                                                                              Seems like they thought a casual "everyman" type of explanation would suffice, but really, who would trust them after this?

                                                                                                              • eezurr 3 days ago

                                                                                                                One explanation for the author feeling that way is that the system has too much automation. Taking on more responsibilities of the system at a shallower level leads to less industry expertise. This, as it turns out, places the security of the system in a precarious position.

                                                                                                                This is pretty common, as devs' tool belts have grown longer over time.

                                                                                                                I think at some point we will stop automating or reverse some of the automation.

                                                                                                              • ricksharp 3 days ago

                                                                                                                Are you sure it was the production database that was affected?

                                                                                                                If you are not sure how a hardcoded script targeting localhost affected a production database, how do you know the database you saw dropped was even the production one?

                                                                                                                Maybe you were simply connected to the wrong database server?

                                                                                                                I’ve done that many times - where I had an initial “oh no“ moment and then realized I was just looking at the wrong thing, and everything was ok.

                                                                                                                I’ve also accidentally deployed a client website with the wrong connection string and it was quite confusing.

                                                                                                                In an even more extreme case: I had been deploying a serverless stack to the entirely wrong AWS account. I thought I was using an AWS named profile, but I was actually using the default (which changed when I got a new desktop system). I.e. the aws CLI uses the --profile flag, but the serverless CLI uses --aws-profile. (Thankfully this all happened during development.)

                                                                                                                I now have deleted default profiles from my aws config.

                                                                                                                • jrochkind1 3 days ago

                                                                                                                  The lack of seriousness/professionalism in the postmortem seemed odd to me too. So, okay, what is this site?

                                                                                                                  > KeepTheScore is an online software for scorekeeping. Create your own scoreboard for up to 150 players and start tracking points. It's mostly free and requires no user account.

                                                                                                                  And also:

                                                                                                                  > Sat Sep 5, 2020, Running Keepthescore.co costs around 171 USD each month, whilst the revenue is close to zero (we do make a little money by building custom scoreboards now and then). This is an unsustainable situation which needs to be fixed – we hope this is understandable! To put it another way: Keepthescore.co needs to start making money to continue to exist.


                                                                                                                  So okay, it's basically a hobby site, for a service that most users probably won't really mind losing 7 hours of data, and that has few if any paying customers.

                                                                                                                  That context makes it make a little bit more sense.

                                                                                                                  • emodendroket 3 days ago

                                                                                                                    This post is embarrassing. "Yeah, we were drinking and accidentally nuked the prod DB. Not sure why. Shit happens!" Who would read this and think they should trust this company? Any number of protections could have been put in place to prevent this, and production access in any state other than fully alert and attentive shouldn't happen unless it's absolutely necessary for emergency reasons.

                                                                                                                    • bstar77 3 days ago

                                                                                                                      I think it's kind of funny they chose to post this story rather than do a typical post mortem.

                                                                                                                      • corobo 3 days ago

                                                                                                                        This reply is embarrassing. It's a person working on their side project. Have a glass of wine mate.

                                                                                                                        • tcbasche 3 days ago

                                                                                                                          Yeah why should I treat anything this company does with any level of seriousness? Why should anyone?

                                                                                                                          It's lucky it's just some online scoreboard because I'm sure as shit this stuff has happened before with more critical systems and it scares the hell out of me that engineers are fine blaming "gremlins" instead of taking responsibility for their own incompetence.

                                                                                                                        • mbroshi 3 days ago

                                                                                                                          I love this post. This sort of thing happens to everyone, most people just are not willing to be so open about it.

                                                                                                          I was once sshed into the production server, cleaning up some old files created by an errant script, one of which was a file named '~'. So, to clean it up, I typed rm -rf ~.
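                                                                                                          For anyone who hits the same thing: the trap is that an unquoted ~ expands to your home directory. Quoting it (or prefixing ./) makes rm target the literal file. Illustrative:

```shell
touch './~'      # recreate the hazard: a file literally named "~"
ls -l './~'      # it's just a regular file in the current directory
rm -- './~'      # safe: removes the file, not your home directory
# rm -rf ~       # DANGEROUS: unquoted ~ expands to $HOME
```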

                                                                                                                          • meowface 3 days ago

                                                                                                                            Somewhat similar story from many years ago. Was in ~/somedirectory, wanted to clear the contents, ran `rm -rf *`. Turns out somewhere in between I had done a `cd ..`, but I thought I was still in the child directory. Fastest Ctrl+C ever once I saw some permission errors, but most of the home directory was wiped in that second or two.

                                                                                                                            Didn't have a backup of it unfortunately, though thankfully there wasn't anything too critical in there. Mostly just lost a bunch of utility scripts and dotfiles. I feel like it's beneficial in the long run for everyone to make a mistake like this once early on in their career.

                                                                                                                          • heelix 3 days ago

                                                                                                                            Ah man, these things happen. One of our developers - very new to elastic - was asked to modify some indexes. Folks were a bit too busy to help or heading out on holiday. One stack overflow answer later... delete and recreate it... and she was off to the races. When the test was tried, it looked like things still worked. A quick script did the same to stage and prod, in both data centers. Turns out that is not a great way to go about it. It deleted the documents. We got lucky, as we still had not killed off the system we were migrating off of and it only took three days of turn and burn to get the data back on the system.

                                                                                                                            So many lessons learned that day. I trust her with the master keys at this point, as nobody is more careful with production than her now. :)

                                                                                                                            • fideloper 3 days ago

                                                                                                              RDS is very much worth paying for to avoid this type of issue (in many cases; obviously $60 to multiple thousands a month isn't right for everything).

                                                                                                              Otherwise, having a binlog-based backup (or WAL, I guess, but I don't know PG that well) is critical.

                                                                                                                              The key point there is they provide point in time recovery possibilities (and even the ability to rewrite history).
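                                                                                                              On a self-managed MySQL, that means keeping the binary log enabled and retained long enough to replay. A sketch (values illustrative, MySQL 8.0 variable names):

```ini
[mysqld]
server_id                  = 1
log_bin                    = /var/log/mysql/mysql-bin
binlog_expire_logs_seconds = 604800   # retain 7 days of binlogs
```

                                                                                                              Point-in-time recovery is then "restore the last full backup, then replay binlogs with mysqlbinlog up to just before the bad statement".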

                                                                                                                              • latch 3 days ago

                                                                                                                                Barman (1) is really easy to setup and lets you avoid the many pitfalls of RDS (lower performance, high cost, no superadmin, no arbitrary extensions, waiting for releases, bad log interface).

                                                                                                                                (1) https://www.pgbarman.org/

                                                                                                                              • lysp 3 days ago

                                                                                                                                I had a client who had prod database access due to it being hosted internally. They called up saying "their system is no longer working".

                                                                                                                                After about an hour of investigation, I find one of the primary database tables is empty - completely blank.

                                                                                                                                I then spend the next hour looking through code to see if there's any chance of a bug that would wipe their data and couldn't find anything that would do that.

                                                                                                                                I then had to make "the phone call" to the client saying that their primary data table had been wiped and I didn't know what we did wrong.

                                                                                                                                Their response: "Oh I wrote a query and accidentally did that, but thought I stopped it".

                                                                                                                                • dvdbloc 3 days ago

                                                                                                                  At my job, the company computers are configured to send “localhost” lookups to the local company DNS servers, which happily reply with the IP address of the last machine that got a DHCP lease with the hostname “localhost”. Which happens often. Needless to say, our IT dept isn't the best.

                                                                                                                                  • junglejoose 3 days ago

                                                                                                                                    You aren’t a real engineer until you do this. So congrats on the promotion! :)

                                                                                                                                    • zmmmmm 3 days ago

                                                                                                                                      Indeed - after incidents like this I usually say, "This is called experience that you can't pay to get for any price. Learn from it well, and value your lesson."

                                                                                                                                      • li4ick 3 days ago

                                                                                                                        Yeah, imagine if a bridge engineer said the same thing: "You aren't an engineer until your bridge collapses. Congrats!" I am starting to hate tech culture. Nobody cares about correctness and discipline. Mention "math" and everybody scatters like cockroaches.

                                                                                                                                        • emerongi 3 days ago

                                                                                                                                          I wrote a migration that dropped columns for a functionality that was no longer to be used.

                                                                                                                                          Then the client wanted that functionality back. Oops.

                                                                                                                                          • Guest42 3 days ago

                                                                                                                            Definitely. The likelihood of these things happening goes way up alongside the number of concurrent tasks, meetings, and other forms of distraction. I've seen my fair share of production administration done as a side task.

                                                                                                                                          • smadge 3 days ago

                                                                                                                                            SSH tunnel from localhost to prod on database port?

                                                                                                                                            • macNchz 3 days ago

                                                                                                                                              A likely culprit. Having worked on a bunch of early-stage products where best practices are a distant future dream, I’ve developed a few “seatbelt” habits I use to avoid these kinds of things. One of them is to always use a random high-number local port if I’m tunneling to a production service.

                                                                                                                              Another is to change my terminal theme to a red background before connecting to anything in production... you never want to click the ‘psql ...’ tab, run “truncate table app_user cascade”, and realize afterwards it was a lingering connection to production.
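                                                                                                                              The random-high-port habit might look like this (hostnames, ports, and usernames invented for illustration):

```shell
# Forwarding to a random high local port means a habitual
# "psql -h localhost -p 5432" can never silently land on production:
ssh -N -L 54321:db.internal.example.com:5432 bastion.example.com &
psql -h localhost -p 54321 -U readonly_reporter production_db
```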

                                                                                                                                              • sfkdjf9j3j 3 days ago

                                                                                                                                                That's my guess too. Naughty naughty. That means his production creds are the same as his development creds!

                                                                                                                                                • rmrfrmrf 3 days ago

                                                                                                                                                  vscode remote extension perhaps

                                                                                                                                                • Negitivefrags 3 days ago

                                                                                                                                                  If you are using postgres, configure it to keep the WAL logs for at least 24 hours.

                                                                                                                                                  They could have used point-in-time recovery to not lose any data from this at all.
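                                                                                                                                  A sketch of the relevant postgresql.conf settings (paths illustrative; the PostgreSQL continuous-archiving docs have the full recipe):

```ini
wal_level       = replica
archive_mode    = on
# Copy each finished WAL segment somewhere safe; never overwrite:
archive_command = 'test ! -f /wal_archive/%f && cp %p /wal_archive/%f'
```

                                                                                                                                  With a base backup plus the archived WAL, recovery_target_time lets you replay to the moment just before the accidental deletion.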

                                                                                                                                                  • yjftsjthsd-h 3 days ago

                                                                                                                                                    If you can do this, then yes by all means do it, but that has significant impact on disk usage.

                                                                                                                                                  • madsbuch 3 days ago

                                                                                                                                                    Things like these happen, and we should be compassionate towards them.

                                                                                                                                                    Often small changes to the structure drastically reduce probability of stuff like this happening.

                                                                                                                                    Eg. we use Docker to set up test and dev databases and seed from (processed) dumps. When we need to clean our database, we simply tear down the Docker container. Ie. we never need to implement destructive database-cleanup code, eliminating structure that could potentially fail.

                                                                                                                                    Having policies against accessing the production database directly (and allowing the extra time for building tooling around that policy), good preview/staging environments, etc. all eliminate structure that could fail.

                                                                                                                                                    • prh8 3 days ago

                                                                                                                                                      Very unsettled by the flippancy of the entire article

                                                                                                                                                      • 3np 3 days ago

                                                                                                                                                        And all the emojis on top of that

                                                                                                                                                      • anonunivgrad 3 days ago

                                                                                                                                                        I wouldn’t want to be on the wrong side of a lawsuit, defending drunk employees working on production data. What outrageous recklessness. And how imprudent to admit this to the public. Some things are best kept to yourself. No one needs to know that.

                                                                                                                                                        • tedk-42 3 days ago

                                                                                                                                                          If you build good systems with reliable backups and rollback mechanisms, a dev should be able to have a beer on a Friday and do a deployment without worrying about the fire they just caused.

                                                                                                                                                          I'd rather have a culture where people admit their mistakes than one where they try to hide them, or get whipped for owning up to them.

                                                                                                                                                          We're people after all, and some of us like to have a glass of wine and unwind while still performing our duties as engineers. After all, it's not as if we're in charge of life support or critical systems which absolutely cannot fail.

                                                                                                                                                          • robjan 3 days ago

                                                                                                                                                            This is someone's side project. I'm doubtful there will be any lawsuits involved

                                                                                                                                                            • colesantiago 3 days ago

                                                                                                                                                              thank you for your highly valued expert opinion.

                                                                                                                                                              edit: a stunning new record of bots flagging my post within 3 minutes. woo hoo...

                                                                                                                                                              never change hn.

                                                                                                                                                            • suzzer99 3 days ago

                                                                                                                                                              We have something similar with AWS Cognito. If a user signs up but doesn't go through with the verification process, there's no setting to say "remove them after X days". So we have to run a batch job.

                                                                                                                                                              If I screw up one parameter, instead of deleting only unconfirmed users, I could delete all users. I have two redundant checks: first when the query is run to get the unconfirmed users, and then again checking the user's confirmed status before deleting them. And then I check one more time further down in the code for good measure. Not because I think the result will be different, but just in case one of the lines of code is altered somehow.
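A sketch of that belt-and-braces pattern (function names, the age cutoff and the overall structure are mine, not suzzer99's; the boto3 Cognito calls are the standard ones, but treat the whole thing as illustrative):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical guard: a user may be deleted only if it is UNCONFIRMED
# *and* older than the cutoff. Kept as a pure function so it can be
# unit-tested without touching AWS.
def is_safe_to_delete(user: dict, max_age_days: int = 7) -> bool:
    unconfirmed = user.get("UserStatus") == "UNCONFIRMED"
    created = user.get("UserCreateDate")
    old_enough = (
        created is not None
        and datetime.now(timezone.utc) - created > timedelta(days=max_age_days)
    )
    return unconfirmed and old_enough

def purge_unconfirmed(client, pool_id: str) -> int:
    """Delete stale unconfirmed users, re-checking status right before each delete."""
    deleted = 0
    # First check: the server-side filter only returns unconfirmed users.
    resp = client.list_users(UserPoolId=pool_id,
                             Filter='cognito:user_status = "UNCONFIRMED"')
    for user in resp.get("Users", []):
        # Second (redundant) check: never trust the filter alone.
        if not is_safe_to_delete(user):
            continue
        client.admin_delete_user(UserPoolId=pool_id, Username=user["Username"])
        deleted += 1
    return deleted
```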

                                                                                                                                                              I put BIG LOUD comments everywhere of course. But it still terrifies me.

                                                                                                                                                              • 6nf 3 days ago

                                                                                                                                                                Soft deletes reduce the scariness.
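For anyone unfamiliar with the pattern, a minimal sqlite3 illustration of the soft-delete idea (schema and names invented for the example):

```python
import sqlite3

# Soft delete: rows are flagged with a timestamp, never destroyed,
# so an "oops" is reversible with a single UPDATE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice'), ('bob')")

# "Delete" bob: just stamp the row instead of removing it.
conn.execute("UPDATE users SET deleted_at = datetime('now') WHERE name = 'bob'")

# Normal queries only see live rows...
live = conn.execute("SELECT name FROM users WHERE deleted_at IS NULL").fetchall()

# ...but an accidental soft delete can simply be undone.
conn.execute("UPDATE users SET deleted_at = NULL WHERE name = 'bob'")
restored = conn.execute("SELECT count(*) FROM users WHERE deleted_at IS NULL").fetchone()[0]
```

The cost is that every query has to remember the `deleted_at IS NULL` filter, which is why ORMs often wrap it in a default scope.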

                                                                                                                                                              • a-b 3 days ago

                                                                                                                                                                Recreating and seeding the test database is totally OK in the RoR world.

                                                                                                                                                                I think the main cause of this accident is the lack of separation between development and operations.

                                                                                                                                                                • jacquesm 3 days ago

                                                                                                                                                                  This happens more often than you might think.

                                                                                                                                                                • xnyan 3 days ago

                                                                                                                                                                  localhost is an abstraction: a network that is not routable outside your machine... except when it's not. It's nothing more than normal TCP traffic, plus a convention telling the OS and other programs that whatever is on that local network, you don't want it routed outside the local computer.

                                                                                                                                                                  There's absolutely nothing stopping anything with access to localhost from routing it anywhere that process wants. It doesn't even take a malicious actor; all kinds of legitimate programs expose remote services on localhost. It's really not something you should rely on for anything except as a signal to other well-behaved programs that you are using the network stack as a machine-local IPC bus.

                                                                                                                                                                  • defen 3 days ago

                                                                                                                                                                    The fact that the production db has the same username/password as the development one is perhaps more troubling.

                                                                                                                                                                  • rafamvc 3 days ago

                                                                                                                                                                    A very similar thing happened to LivingSocial during their brightest years, but the replication and the backups had failed too. The oldest valid backup was about a week old. It took the whole company offline for 2 days. It took a team of 10 people and some extra consultants to come up with a half-baked version of the latest database based on ElastiCache instances, Redis caches and other "not meant to be a backup" services. It was insane walking into an office that had hundreds of employees and seeing them all gone while we rebuilt this cobbled-together DB.

                                                                                                                                                                    At one point someone called it "good enough", and they basically had to honor the customer's word if they said they had purchased something and it wasn't there.

                                                                                                                                                                    It was a mess.

                                                                                                                                                                    It was on all major news, and it was really bad press. In the end, they actually had a massive bump in their sales afterwards. Everyone went to checkout their own purchases and ended up buying something else, and the news was like free ads.


                                                                                                                                                                    • bArray 3 days ago

                                                                                                                                                                      Hmm, there seem to be some holes in their system. A database might go down for any reason.

                                                                                                                                                                      I also have daily backups, but in addition I write logs for all database actions to disk (locally, and regularly copied off the production server), both to check through them if something goes wrong and to replay them in case something like this happens. So you have your database backups as "save points" and the logs to replay all the "actions" for that day.

                                                                                                                                                                      • jlgaddis 3 days ago

                                                                                                                                                                        The wonderful thing about computers is that they do exactly what they are told to do.

                                                                                                                                                                        The worst thing about computers? They do exactly what they are told to do.

                                                                                                                                                                        • fma 3 days ago

                                                                                                                                                                          I do a lot of work with middle/high school students. Without fail someone would yell "Why is it doing x!"...to which my standard reply is "Because you told it to".

                                                                                                                                                                          • wruza 3 days ago

                                                                                                                                                                            they do exactly what they are told to do

                                                                                                                                                                            Well, tell them to UNDROP and see what happens.

                                                                                                                                                                          • fallingfrog 3 days ago

                                                                                                                                                                            I once replaced a bunch of customer photos with a picture of Spock as part of a test in my first week on the job. The DB admin had just overwritten a Salesforce dev DB from production, and a previous developer had hardcoded the IP address of production in the code of a script somewhere.

                                                                                                                                                                            • webel0 3 days ago

                                                                                                                                                                              This is my greatest fear when it comes to terraform:

                                                                                                                                                                              > terraform destroy

                                                                                                                                                                              (And either a confirmation or a flag) and everything is deleted.

                                                                                                                                                                              I know you can add some locks but still :/

                                                                                                                                                                              • malwrar 3 days ago

                                                                                                                                                                                You can save yourself from scary operations like deleting everything by a.) not rooting your entire infra in the same main.tf and b.) using Terraform's lifecycle meta-argument: https://www.terraform.io/docs/configuration/resources.html#l...

                                                                                                                                                                                I like to use the lifecycle feature for suuuper core things that will never be deleted (VPC, r53 zone, etc), and when I start targeting multiple DCs w/ lots of infra I'll eventually move to many state roots (or use tools like Terragrunt, which make things mildly scary again).
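The lifecycle guard looks like this in practice (resource type and names are illustrative):

```hcl
# Terraform refuses to plan a destroy of this resource while the flag is set.
resource "aws_vpc" "core" {
  cidr_block = "10.0.0.0/16"

  lifecycle {
    prevent_destroy = true
  }
}
```

With this in place, `terraform destroy` fails with an error on the protected resource instead of silently removing it; you have to deliberately remove the flag first.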

                                                                                                                                                                              • boltefnovor 3 days ago

                                                                                                                                                                                Companies tend to get really good at backups after events like this.

                                                                                                                                                                                • yla92 3 days ago

                                                                                                                                                                                  I am sorry this happened.

                                                                                                                                                                                  > local_db = PostgresqlDatabase(database=database, user=user, password=password, host='localhost', port=port)

                                                                                                                                                                                  I am guessing it's this part. Even though the host is hardcoded as "localhost", when you do SSH port-forwarding, localhost might actually be the real production database, e.g. sudo ssh user@myserverip -L 3333:localhost:3306

                                                                                                                                                                                  • codegladiator 3 days ago

                                                                                                                                                                                    > Computers are just too complex and there are days when the complexity gremlins win.

                                                                                                                                                                                    > However, we will figure out what went wrong and ensure that that particular error doesn’t happen again.

                                                                                                                                                                                    How can you say statement 2 just after statement 1 ? Isn't statement 1 just plain acceptance of defeat ?

                                                                                                                                                                                    And looking at all the replies here, is this a feel good thread for the mistakes you made ?

                                                                                                                                                                                    • caconym_ 3 days ago

                                                                                                                                                                                      In context, statement 1 regards proactively eliminating all bugs and risks. Statement 2 regards understanding the root cause of this particular incident and reactively fixing it so it won’t happen again.

                                                                                                                                                                                      Acknowledging statement 1 doesn’t mean giving up—it simply means being clear and realistic about the nature and scale of the problem we’re facing when we try to build complex software systems. In the face of that we can give up, or we can just do the best we can, and it sounds more like these people are doing the latter.

                                                                                                                                                                                      • ssalka 3 days ago

                                                                                                                                                                                        I like to think that, by addressing the known bugs as they pop up, over time you can box the complexity gremlins into tighter and more predictable spaces. Though as long as humans are building these systems, that box will always be there, and the predictability of those gremlins' behavior will only go so far.

                                                                                                                                                                                      • 0kl 2 days ago

                                                                                                                                                                                        > Note that host is hardcoded to localhost. This means it should never connect to any machine other than the developer machine.

                                                                                                                                                                                        Just to help with the postmortem:

                                                                                                                                                                                        1) “localhost” is just a loopback to whatever machine you’re on

                                                                                                                                                                                        2) the user and pw are pulled from config

                                                                                                                                                                                        So someone was either running this from the production server, or had the production DB mapped to localhost and ran it with a production config for some reason (working with prod data, maybe). Hardcoding the host to localhost only ensures that it points at whatever machine it's called on: in this case, the prod server.

                                                                                                                                                                                        Things you might do to avoid this in the future include a wide spread of things, the main recommendations I’d have are:

                                                                                                                                                                                        1) only put production artifacts on prod

                                                                                                                                                                                        2) limit developer access to prod data

                                                                                                                                                                                        Best of luck
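One cheap, belt-and-braces version of those recommendations is to make destructive scripts refuse to run unless the environment explicitly identifies itself as non-production. A sketch (the APP_ENV variable name and allowed values are made up for illustration):

```python
import os
import sys

# Hypothetical guard: destructive tooling requires an explicit opt-in
# (APP_ENV) rather than inferring safety from "localhost".
ALLOWED_ENVS = {"dev", "test"}

def assert_not_production(env=None):
    """Exit loudly unless the environment explicitly says it is safe."""
    if env is None:
        env = os.environ.get("APP_ENV", "")
    if env not in ALLOWED_ENVS:
        sys.exit(f"Refusing to run destructive script: APP_ENV={env!r} "
                 f"is not one of {sorted(ALLOWED_ENVS)}")
    return env
```

The point is the default: with no configuration at all, the script refuses to run, so forgetting to set things up fails safe instead of wiping prod.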

                                                                                                                                                                                        • fencepost 3 days ago

                                                                                                                                                                                          Does the database name value allow specifying a host as part of it?

                                                                                                                                                                                          • mijoharas 3 days ago

                                                                                                                                                                                            This is what I came to say. Postgres usually accepts a connection string that contains all the details in place of the database name (postgres://username:password@host:port/dbname), and I think it takes priority over a separately specified host, depending on the client.

                                                                                                                                                                                            Tough way to learn that lesson though.

                                                                                                                                                                                          • exabrial 3 days ago

                                                                                                                                                                                            If you keep configuration in the environment (/etc/default/app-name) rather than in the application package, it's nearly impossible to make this mistake (especially with proper firewall rules). You can even package your config as a deb and keep it encrypted in version control.

                                                                                                                                                                                            • smarx007 3 days ago

                                                                                                                                                                                              A question to the DBA experts from a developer: is there a way in MySQL and Postgres to configure a log specifically for destructive SQL queries so that it's easier to investigate a situation like this? I.e. to log most queries except for usual SELECT/INSERTs.

                                                                                                                                                                                              Also, @oppositelock pointed out that the WAL would contain the destructive query too. How does one remove a single query from a WAL for replay, or how does one correctly use the WAL to recover after a 23-hour-old backup was restored?

                                                                                                                                                                                              Finally, how does one work on the WAL level with managed DBs on AWS or DO if they block SSH access?
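(For reference on the first question: Postgres has no built-in "everything except SELECT/INSERT" filter, but the `log_statement` setting gets close. `'ddl'` logs only CREATE/ALTER/DROP-style statements, while `'mod'` additionally logs data-modifying statements, though that includes INSERT. Finer per-class filtering needs an extension such as pgaudit. A postgresql.conf sketch:)

```conf
# postgresql.conf -- log DDL (CREATE/ALTER/DROP) statements
log_statement = 'ddl'                # or 'mod' to also capture INSERT/UPDATE/DELETE/TRUNCATE
log_line_prefix = '%m [%p] %u@%d '   # timestamp, pid, user, database
```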

                                                                                                                                                                                              • restlessbytes 3 days ago

                                                                                                                                                                                                While I'm very sympathetic to "we accidentally nuked our prod DB" because, let's admit it, we've all been there at some point, I'm also a bit baffled here: I don't think the problem lies with too much wine, Postgres permissions or scripts on localhost. Recreating a database by FIRST dropping all tables and THEN recreating them is like deliberately inviting those gremlins in.

                                                                                                                                                                                                But, as I said, that happens and blaming doesn't fix anything, so, for the future:

                                                                                                                                                                                                1. make a temporary backup of your database
                                                                                                                                                                                                2. create tables with new data
                                                                                                                                                                                                3. drop old tables
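The safer ordering above can be demonstrated with a rename-swap (sqlite3 and the table names are used here purely for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO items (name) VALUES ('old')")

# 1. Keep the old data around as a temporary backup (rename, don't drop).
conn.execute("ALTER TABLE items RENAME TO items_backup")

# 2. Create and populate the new table while the backup still exists.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO items (name) VALUES ('new')")

# 3. Only after verifying the new table do we drop the old one.
assert conn.execute("SELECT count(*) FROM items").fetchone()[0] == 1
conn.execute("DROP TABLE items_backup")

rows = conn.execute("SELECT name FROM items").fetchall()
```

At every step there is at least one intact copy of the data, so an interruption or mistake anywhere in the sequence is recoverable.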

                                                                                                                                                                                                • glintik 3 days ago

                                                                                                                                                                                                  "at around 10:45pm CET, after a couple of glasses of red wine, we deleted the production database by accident". That's not an accident, guys..

                                                                                                                                                                                                  Stop drinking and deploying to production, especially late in the evening.

                                                                                                                                                                                                  • vox17 3 days ago

                                                                                                                                                                                                    This line's the winner for me : "Thankfully nobody’s job is at risk due to this disaster. The founder is not going to fire the developer – because they are one and the same person."

                                                                                                                                                                                                    • ClumsyPilot 3 days ago

                                                                                                                                                                                                      To be fair he could, at least in theory: he could get someone else to do development on the project, for money or equity, and do something else himself.

                                                                                                                                                                                                    • jugg1es 3 days ago

                                                                                                                                                                                                      I totally sympathize with you and yours, I've made sphincter-clenching mistakes a handful of times during my 20 years of experience.

                                                                                                                                                                                                      This is an object lesson that understanding human psychology is actually a huge part of good architecture. Automating everything you do in production with a script that is QA-tested prior to use is the best way to avoid catastrophic production outages.

                                                                                                                                                                                                      It does take a bit longer to get from start to finish, and younger devs often try to ignore it, but it is worth putting a layer of oversight between your employees and your source of revenue.

                                                                                                                                                                                                      • jzer0cool 3 days ago

                                                                                                                                                                                                        Could someone explain more about what caused the prod wipe? The snippet here indicates it is using a 'dev' credential (it is a different password than prod, right?), so how does a DB connection occur at all?

                                                                                                                                                                                                        • fma 3 days ago

                                                                                                                                                                                                          Good catch. Wouldn't surprise me if there's 1 username/password for all their DB environments.

                                                                                                                                                                                                        • sergiotapia 3 days ago

                                                                                                                                                                                                          I did the same thing once by accident and thankfully only lost 1 hour of data. The single lowest point of my career and thinking about the details of that day makes my stomach sink even now as I type this.

                                                                                                                                                                                                          I ran a rake db schema dump command of some kind and, instead of returning the schema of the database, it decided to just completely nuke my entire database. Instantly. It's very easy to fuck up, so cover your ass, gents: back up often, and run your restores periodically to make sure you can actually do them in case of an emergency.

                                                                                                                                                                                                          • flurdy 3 days ago

                                                                                                                                                                                                            In one of my first jobs I deleted the last 30 days of our production data.

                                                                                                                                                                                                            Shit happens. You learn and try to never repeat it. And share with others so hopefully they learn.

                                                                                                                                                                                                            Ps. Don't do knee-jerk late at night quick patches. For example don't stop a database that has run out of disk space, try to migrate the data in memory first... And also do proper backup monitoring, and restores. Having 30 days of 0 byte backups is not that helpful. :)

                                                                                                                                                                                                            • ing33k 3 days ago

I run a replicated ClickHouse setup; ClickHouse uses ZooKeeper to enable replication. Our ZooKeeper instance was not replicated: it was a single node. The server ZooKeeper was running on ran out of disk space and ClickHouse went into read-only mode. Luckily, no data was lost while this happened, because we use RabbitMQ to store messages before they get written to the db, thanks to RabbitMQ's ACK mechanism.

                                                                                                                                                                                                              • cmeacham98 3 days ago

While several other users have posted takeaways for how to prevent this from happening, I'd be interested in whether anybody has an idea of how this happened, given the code that was posted?

                                                                                                                                                                                                                Presumably, a managed DB service should essentially never be available on `localhost`. Additionally, it would be very weird for `config.DevelopmentConfig` to return the production database credentials.

                                                                                                                                                                                                                • luord 3 days ago

                                                                                                                                                                                                                  This is what nightmares are made of.

                                                                                                                                                                                                                  > We’ve learned that having a function that deletes your database is too dangerous to have lying around.

Indeed, anything that might compromise the data, anything that involves deletion, should require manual confirmation, whether you manage the database yourself or it's a managed service.

                                                                                                                                                                                                                  Sadly, I learned this the hard way too, but at least it was a single column with a non-critical date and not the entire database.
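One way to enforce that manual confirmation is a guard that makes the destructive path impossible to hit by accident. A minimal sketch, assuming a Python codebase; the function and exception names here are hypothetical, not the article's actual code:

```python
class RefusedError(RuntimeError):
    """Raised when a destructive call is attempted without explicit confirmation."""


def drop_all_tables(db_name: str, confirm: str = "") -> str:
    # Require the caller to re-type the exact database name. A bare call
    # (the easy, accidental one) always fails; the destructive path has
    # to be spelled out deliberately.
    if confirm != db_name:
        raise RefusedError(
            f"refusing to drop tables in {db_name!r}; pass confirm={db_name!r}"
        )
    return f"dropped all tables in {db_name}"  # stand-in for the real DROP logic


# Accidental invocation fails loudly:
#   drop_all_tables("prod")                   -> raises RefusedError
# Deliberate invocation must name its target twice:
#   drop_all_tables("prod", confirm="prod")
```

The point of the double naming is that a script wired to the wrong environment still has to get the confirmation string right before anything is destroyed.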

                                                                                                                                                                                                                  • ystad 3 days ago

> Thankfully our database is a managed database from DigitalOcean, which means that DigitalOcean automatically do backups once a day.

Do cloud providers offer a smaller window for backup? Are there better ways to reduce the backup window for DBs? Would love to understand any techniques folks use to minimize it.

                                                                                                                                                                                                                    • axegon_ 3 days ago

                                                                                                                                                                                                                      I did something similar once - I had fiddled with my /etc/hosts and subsequently connected to the production database without realizing. I dropped a table but thankfully it wasn't much of a deal - the monitoring rang the bell and I recreated it a few seconds later. All that happened was that I had logged out several hundred users.

                                                                                                                                                                                                                      • dschuetz 3 days ago

                                                                                                                                                                                                                        Someone might have copy&pasted it elsewhere and that propagated away. This is why writing code can also be dangerous in open dev. Whoever programmed anything also should be sensible enough to judge their own code whether it could be dangerous in the wild. Once out there (or worse: on stackoverflow) it could wreak havoc.

                                                                                                                                                                                                                        • sdepablos 3 days ago

                                                                                                                                                                                                                          VPN to production and same hostname for dev and prod?

                                                                                                                                                                                                                          • kerng 3 days ago

                                                                                                                                                                                                                            ...and same credentials apparently also - there are lots of things that could have prevented something like this.

                                                                                                                                                                                                                          • rawgabbit 3 days ago

                                                                                                                                                                                                                            The article should be renamed to “We coded a function that drops database tables and were surprised when it broke production.”

                                                                                                                                                                                                                            • EugeneOZ 3 days ago

                                                                                                                                                                                                                              I love the honesty, self-irony and transparency of the article. It's sad and annoying to see so many young naive devs writing ”oh they are so bad, it will never happen to me”.

                                                                                                                                                                                                                              Yes, people are not perfect and computer systems are complex. Admit it and don't be so overconfident.

                                                                                                                                                                                                                              ”Errare humanum est”, prepare your backups.

                                                                                                                                                                                                                              • aerxes 3 days ago

                                                                                                                                                                                                                                This postmortem is incomplete: it fails to address the main three roots of the problem:

                                                                                                                                                                                                                                1. This business is too flippant with their write-able production access.

                                                                                                                                                                                                                                2. No user should have DROP DATABASE grants on production.

                                                                                                                                                                                                                                3. Clearly one of their employees was using a port forward to access production.

                                                                                                                                                                                                                                • hannofcart 3 days ago

                                                                                                                                                                                                                                  It takes a great deal of integrity to admit that you deleted a database because you were mucking around in your infra after red wine.

                                                                                                                                                                                                                                  And it bodes well for your firm that that doesn't get you fired either.

                                                                                                                                                                                                                                  These things happen to the best of us but having dealt with it responsibly and honestly as a team is something you can be proud of IMO.

                                                                                                                                                                                                                                  • bachmeier 3 days ago

                                                                                                                                                                                                                                    > And it bodes well for your firm that that doesn't get you fired either.

                                                                                                                                                                                                                                    Maybe there's more than one interpretation of "bodes well" but not knowing how to do the one thing customers were paying you to do isn't consistent with my definition.

                                                                                                                                                                                                                                    > having dealt with it responsibly and honestly as a team is something you can be proud of IMO.

                                                                                                                                                                                                                                    "We were drinking wine and deleted the database and now your data's gone LOL" is not something that should make you proud.

                                                                                                                                                                                                                                  • xupybd 2 days ago

                                                                                                                                                                                                                                    I don't understand how this happened if localhost is hard coded and the password is different. I don't think they fully understand why this happened. At least enough to prevent it from happening again.

                                                                                                                                                                                                                                    • minkeymaniac 3 days ago

                                                                                                                                                                                                                                      That is why we can only access the prod db from a jump box in our shop.... And even then it's just certain people with less privileges than a sysadmin account. No way you can do this accidentally from your laptop then...

                                                                                                                                                                                                                                      • pachico 3 days ago

I still remember how, many years ago, someone on my team told me one Friday afternoon: "there's not something like 'undo' for 'drop table', right?". He spent the weekend recreating the data.

                                                                                                                                                                                                                                        • pachico 3 days ago

                                                                                                                                                                                                                                          Yes, Manuel, I'm talking about you! :)

                                                                                                                                                                                                                                        • throwdbaaway 3 days ago

The script also unnecessarily complicates things. If it just did the equivalent of rake db:drop, this incident wouldn't have happened, since Postgres won't allow a database with active connections to be dropped.
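For reference, this is the PostgreSQL behavior the comment relies on: a plain drop is refused while sessions are attached, and newer versions require the override to be spelled out explicitly. A sketch, with a hypothetical database name:

```sql
-- With clients still connected, a plain drop is refused:
DROP DATABASE myapp;
-- ERROR:  database "myapp" is being accessed by other users

-- PostgreSQL 13+ makes the dangerous intent explicit:
DROP DATABASE myapp WITH (FORCE);
```

So a script that drops tables one by one quietly sidesteps a safety check that dropping the whole database would have triggered.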

                                                                                                                                                                                                                                          • gregors 3 days ago
                                                                                                                                                                                                                                            • majkinetor 3 days ago

                                                                                                                                                                                                                                              Once it happened to me, and now all scripts in last 10 years have

    if ($Env.Name -like '*prod*') { throw }

and similar guards on all destructive stuff.
                                                                                                                                                                                                                                              • nix23 3 days ago

To all the people who say "that could never happen to me": you've worked less than 5 years in the industry. That can happen to anyone, anytime.

                                                                                                                                                                                                                                                Remember: You just fix the errors YOU can think of.

                                                                                                                                                                                                                                                • jbverschoor 3 days ago

                                                                                                                                                                                                                                                  > Why? This is something we’re still trying to figure

Probably the admin has set the hba config to trust localhost. Solution: don't use the same db name in prod, just to be sure.
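For context, "trust localhost" refers to a pg_hba.conf line like the first one below, which waves through any local TCP connection with no password; a dev script pointed at a forwarded production port would get in unchallenged. A sketch of both variants (not the article's actual config):

```
# pg_hba.conf: "trust" lets any connection from 127.0.0.1 in with no
# password at all
host    all    all    127.0.0.1/32    trust

# safer: demand password authentication even for local connections
host    all    all    127.0.0.1/32    scram-sha-256
```

With password auth required everywhere, different credentials per environment become an actual barrier instead of a convention.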

                                                                                                                                                                                                                                                  • beefbroccoli 3 days ago

                                                                                                                                                                                                                                                    Anyone else think this is just a clever ad for DigitalOcean?

                                                                                                                                                                                                                                                    • info781 3 days ago

Not really; they lost a day of data from a toy database.

                                                                                                                                                                                                                                                    • info781 3 days ago

Why did they not have archive log mode on in production? Losing a db is one thing, but it should have cost only an hour of data.

                                                                                                                                                                                                                                                      • AdrianB1 3 days ago

                                                                                                                                                                                                                                                        Focus, time, expertise and cost. For a side project with no revenue, cost is a very important factor. The others come with having side projects.

                                                                                                                                                                                                                                                      • noja 3 days ago

                                                                                                                                                                                                                                                        A function called database_model_create() should not drop something.

                                                                                                                                                                                                                                                        Here it would have failed to create the already existing tables and raised an error.
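The create-only behavior described above can be sketched with the stdlib sqlite3 module; the table and function names are illustrative, not the article's actual code:

```python
import sqlite3


def database_model_create(conn: sqlite3.Connection) -> None:
    # Create-only: no DROP anywhere. Without IF NOT EXISTS, re-running
    # this against a live database raises instead of silently recreating.
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")


conn = sqlite3.connect(":memory:")
database_model_create(conn)       # first run: table created
try:
    database_model_create(conn)   # second run: fails loudly
except sqlite3.OperationalError as e:
    print("refused:", e)          # "table users already exists"
```

Against a populated production database, the accidental invocation becomes a noisy error rather than a wipe.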

                                                                                                                                                                                                                                                        • freitasm 3 days ago

                                                                                                                                                                                                                                                          Lost seven hours of data? Daily backup with no transaction log backup?


                                                                                                                                                                                                                                                          • lisper 3 days ago

                                                                                                                                                                                                                                                            Yeah, this. The problem is not that the production database was deleted by accident. The problem is that it was possible to (unrecoverably) delete the production database by accident.

                                                                                                                                                                                                                                                            • jrockway 3 days ago

                                                                                                                                                                                                                                                              You'll find that most users of cloud databases are in this boat. For example, on GCP, deleting the database instance deletes the backups! You have to write your own software if you want to survive that button click.

                                                                                                                                                                                                                                                            • tijuco2 3 days ago

That's the reason I never authorize dev machines to connect to production. And that's the reason developers hate my security team.

                                                                                                                                                                                                                                                              • ummonk 3 days ago

                                                                                                                                                                                                                                                                What a frustrating post :( Provides just enough technical detail to pique our curiosity then leaves us hanging.

                                                                                                                                                                                                                                                                • zerr 3 days ago

                                                                                                                                                                                                                                                                  It reads like the db was deleted intentionally for the sake of blog post - marketing that is. :)

                                                                                                                                                                                                                                                                  • jw360 3 days ago

                                                                                                                                                                                                                                                                    If your script connects to a production database by accident, you have a whole different issue.

                                                                                                                                                                                                                                                                    • gcc_programmer 3 days ago

The maturity of the article is laughable. I'm probably the same age as the people who wrote it, but this is unacceptable: dropping databases in prod is a serious issue, not a joke. I think the culture of the company is toxic and unprofessional at every level. #change-my-mind

                                                                                                                                                                                                                                                                      • azeirah 3 days ago

                                                                                                                                                                                                                                                                        It's one individual's side-project, chill.

                                                                                                                                                                                                                                                                      • tus88 3 days ago

                                                                                                                                                                                                                                                                        Well u have backups right...or did the gremlins eat them?

                                                                                                                                                                                                                                                                        • tzs 3 days ago

                                                                                                                                                                                                                                                                          Here is how we had our database deletion error, about 15 years ago. Our DBs had been on leased servers at a hosting company in New York City. They were getting out of the datacenter business so we had to move. We were moving to colocated servers at a Seattle datacenter.

                                                                                                                                                                                                                                                                          This was the procedure:

                                                                                                                                                                                                                                                                          1. Restore DB backups in Seattle.

                                                                                                                                                                                                                                                                          2. Set up replication from NYC to Seattle.

                                                                                                                                                                                                                                                                          3. Start changing things to read from Seattle, with writes still going to NYC.

                                                                                                                                                                                                                                                                          4. After everything is reading from Seattle and has been doing so with no problems for a while, change the replication to be two-way between NYC and Seattle.

                                                                                                                                                                                                                                                                          5. Start switching writes to Seattle.

                                                                                                                                                                                                                                                                          6. After both reads and writes are all going to Seattle and it has been that way for a while with no problems, turn off replication.

                                                                                                                                                                                                                                                                          7. Notify me that I can wipe the NYC servers, for which we had root access but not console access. I wasn't in the IT department and wasn't involved in the first 6 steps, but had the most Unix experience and was thought to be the best at doing a thorough server wipe.

                                                                                                                                                                                                                                                                          My server wipe procedure was something like this.

                                                                                                                                                                                                                                                                          8. "DELETE FROM table_name" for each DB table.

                                                                                                                                                                                                                                                                          9. "DROP TABLE table_name" for each DB table.

                                                                                                                                                                                                                                                                          10. Stop the DB server.

                                                                                                                                                                                                                                                                          11. Overwrite all the DB data files with random data.

                                                                                                                                                                                                                                                                          12. Delete all the DB data files.

                                                                                                                                                                                                                                                                          13. Delete everything else of ours.

                                                                                                                                                                                                                                                                          14. Uninstall all packages we installed after the base system install.

                                                                                                                                                                                                                                                                          15. Delete every data file I could find that #14 left behind.

                                                                                                                                                                                                                                                                          16. Write files of random data to fill up all the free space.
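                                                                                                                                                                                                                                                                          The destructive half of that list (steps 11, 12, and 16) can be sketched roughly as follows. This is a minimal illustration, not the actual commands used at the time; file paths, chunk sizes, and the use of Python are all my assumptions:

```python
import os
import secrets

def overwrite_and_delete(path, chunk=1024 * 1024):
    """Steps 11-12 sketch: overwrite a file with random data, then delete it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        written = 0
        while written < size:
            n = min(chunk, size - written)
            f.write(secrets.token_bytes(n))  # random bytes over the old contents
            written += n
        f.flush()
        os.fsync(f.fileno())  # force the overwrite to actually hit the disk
    os.remove(path)

def fill_free_space(directory, chunk=1024 * 1024):
    """Step 16 sketch: fill free space with random data, then remove the filler."""
    filler = os.path.join(directory, "filler.bin")
    try:
        with open(filler, "wb") as f:
            while True:
                f.write(secrets.token_bytes(chunk))
    except OSError:
        pass  # disk full: the free blocks have been overwritten
    finally:
        os.remove(filler)
```

Note that on modern SSDs and journaling filesystems an in-place overwrite like this is not a guaranteed wipe; on the 15-year-old leased spinning disks in the story it was a reasonable approach.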

                                                                                                                                                                                                                                                                          The problem was with step #6. They declared it done and turned it over to me for step #7 without actually having done the "turn off replication" part of step #6. Step #8 was replicated to Seattle.
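                                                                                                                                                                                                                                                                          A mechanical guard in the wipe script would have caught this. The story doesn't name the database, so the replica-listing call below is injected rather than real; on MySQL it would stand in for something like SHOW SLAVE HOSTS run on the server about to be wiped:

```python
def guarded_wipe(list_replicas, wipe):
    """Run `wipe` only if `list_replicas()` reports no attached replicas.

    Both arguments are injected so the guard is testable without a real
    database: `list_replicas` stands in for a replication-status query
    (e.g. MySQL's SHOW SLAVE HOSTS) on the server being wiped.
    """
    replicas = list_replicas()
    if replicas:
        raise RuntimeError(
            f"refusing to wipe: replication still active to {replicas}")
    wipe()
```

With this in place, step 8 would have aborted immediately instead of silently deleting the Seattle copy.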

                                                                                                                                                                                                                                                                          It took them a while to figure out that data was being deleted and why that was happening.

                                                                                                                                                                                                                                                                          We were split across three office buildings, and the one I was in did not yet have phones installed in all the offices; mine was one of the ones without a phone. None of the people whose offices did have phones were in, so they lost a few more minutes before realizing that someone would have to run a couple of blocks to my office to tell me to stop the wipe.

                                                                                                                                                                                                                                                                          It took about 12 hours or so afterwards for them to restore Seattle from the latest backup, and then replay the logs from between the backup time and the start of the deletes.

                                                                                                                                                                                                                                                                          After that they were overly cautious, taking a long time to let me resume the NYC wipe. They went right up to the point where I told them if we didn't start now we might not finish, and reminded them that those machines had sensitive customer personal information on them and were probably going to end up being auctioned off on eBay by the hosting company. They came to their senses and told me to go ahead.

                                                                                                                                                                                                                                                                          • rullelito 3 days ago

                                                                                                                                                                                                                                                                            LOL, I had forgotten how few safety measures startups have in place.

                                                                                                                                                                                                                                                                            • tluyben2 3 days ago

                                                                                                                                                                                                                                                                              Like this never happens in big corps: sure, mostly at the departmental level, but still enterprises, not startups. They try to escape the annoying, slow-as-molasses DBAs/devops and install/create/buy/SaaS systems to avoid the red tape. But then the same things go wrong as with startups, and rather often too; we regularly got calls asking if we could restore.

                                                                                                                                                                                                                                                                              Obviously there are also plenty of enterprise-wide data breaches, which I would say are actually worse than losing a day of data in a lot of cases. So not many safety measures there either; again, worse than startups, who at least have the excuse of being understaffed and underfunded.

                                                                                                                                                                                                                                                                              • xwdv 3 days ago

                                                                                                                                                                                                                                                                                Given the short time frames in which startups aim to capture value, it's not worth investing time in safety until you have plenty of capacity to spare later.

                                                                                                                                                                                                                                                                              • ineedasername 3 days ago

                                                                                                                                                                                                                                                                                Related: Little Bobby Tables (https://xkcd.com/327/)

                                                                                                                                                                                                                                                                                • Step 0) automated backups
                                                                                                                                                                                                                                                                                  Step 1) full manual testing of said automated backup
                                                                                                                                                                                                                                                                                  Step 2) weekly test of the supposedly automated process
                                                                                                                                                                                                                                                                                  ...
                                                                                                                                                                                                                                                                                  Now you can continue to your hare-brained engineering "processes"
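                                                                                                                                                                                                                                                                                Steps 1 and 2 boil down to one rule: a backup isn't tested until you've restored it somewhere and queried the result. A minimal sketch of that restore-and-verify loop, using sqlite3 purely as a stand-in for whatever database is actually in production:

```python
import sqlite3

def backup_and_verify(src_path, backup_path):
    """Take a backup, then prove it restores: same tables, same row counts."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(backup_path)
    src.backup(dst)  # take the backup (sqlite3's online-backup API)
    # verify against the live database, table by table
    q = "SELECT name FROM sqlite_master WHERE type='table'"
    for (table,) in src.execute(q):
        n_src = src.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        n_dst = dst.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        assert n_src == n_dst, f"backup of {table} lost rows"
    src.close()
    dst.close()
```

Row counts are only a smoke test, of course; the weekly drill in step 2 is what tells you the whole pipeline (cron, credentials, disk space, restore tooling) still works.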

                                                                                                                                                                                                                                                                                  • sanmak 3 days ago

                                                                                                                                                                                                                                                                                    That's awesome and crazy at the same time: thrilling, but foolish as well. This happened to one of my colleagues, and he was fired on the spot. Best of luck!

                                                                                                                                                                                                                                                                                    • known 3 days ago

                                                                                                                                                                                                                                                                                      Sack whoever is responsible;

                                                                                                                                                                                                                                                                                      • Biganon 3 days ago

                                                                                                                                                                                                                                                                                        Yeah, this person working alone on this hobby project should sack themself. Thank you for your macho no-bs insight.

                                                                                                                                                                                                                                                                                        • boltefnovor 3 days ago

                                                                                                                                                                                                                                                                                          So sack management?

                                                                                                                                                                                                                                                                                        • f223ff23 3 days ago

                                                                                                                                                                                                                                                                                          Haha, alcoholics complaining they deleted a production database by mistake :D

                                                                                                                                                                                                                                                                                          • NikolaeVarius 3 days ago

                                                                                                                                                                                                                                                                                            Thinking that localhost is anything special is a year-1.5 developer mistake.