A deleted database brings trouble: a Shunfeng senior engineer is "made to run away"!


The databases we deleted and the roads we fled down, all those years ago.


Among IT practitioners there is a group that keeps an even lower profile than programmers, yet holds the life and death of an enterprise in its hands in the era of big data: the legendary DBAs. Many say a DBA's job is far more complex than a programmer's day-to-day work, because it involves not only applications but also the operating system and the hardware. So what is it like to be a great DBA? Many who have been there chime in: can you understand the bitter feeling of being able to delete the database but not being able to run?

Recently, an engineer at Shunfeng showed us through personal experience: "Yes, that's right, that's exactly what it feels like."

A few days ago, the well-known Internet news blogger @big brother square gossip broke the news on Weibo that an engineer surnamed Deng at the Shunfeng Technology data center had deleted a production database by mistake, leaving a service unavailable for 590 minutes.

Source: @big brother square gossip on Weibo

Later, a netizen revealed that Deng was a senior IT operations and development engineer in the Internet product operations and maintenance group of the application delivery technology department at Shunfeng Technology's IT data center.

As for the cause of the incident, screenshots of an email circulating online give the following account:

After receiving the change request, Deng mistakenly selected the RUSS database while intending to delete a SQL statement that had already been executed. At the moment of selection, the cursor jumped back to the RUSS database instance because of Deng's careless operation. Without checking what was selected, Deng pressed Delete, ignored the pop-up warning, and pressed Enter directly, which deleted the RUSS production database.

The careless operation by operations and maintenance staff caused the OMCS operations monitoring and control system to malfunction, and the temporary car insurance issuance function on OMCS was unavailable for approximately 590 minutes.

In accordance with company regulations, Shunfeng has dismissed Deng and issued a notice of criticism across Shunfeng Technology. The incident immediately set off heated discussion among programmers. Many joked helplessly: once the database is gone, what is there to do but run? And remember to check the map and run by plane, or traffic will get you caught within minutes.

In fact, this is far from the first database deletion, at home or abroad, and the consequences of the Shunfeng incident are not the most serious. Next, let us look back at the databases deleted over the years and what followed. And how can we keep delete-and-run incidents from happening so often?

The databases deleted over the years

From the big companies

In February 2017, GitLab came under a DDoS attack while a system administrator was doing load-balancing work on an online database. After the attack was stopped, the operations staff found that the database had fallen out of sync and began repairing it. During the repair, the command to delete the database directory was mistakenly executed on the production environment.

This cut the 300 GB of data down to about 4.5 GB and forced GitLab offline.

In June 2017, a former administrator at verelox.com, a cloud hosting provider in The Hague, Netherlands, deleted all of the company's customer data and wiped most of its servers.

This eventually led Verelox to take its network offline temporarily. In an official announcement, Verelox said it had been working hard to recover the data, but that, unfortunately, the lost data might not all be recoverable at that time.

In September 2017, an engineer from a major IT vendor was helping Guangxi Mobile carry out a capacity expansion cutover (that is, increasing system capacity) and accidentally formatted the user data inside the HSS equipment, causing Guangxi Mobile to lose the data of nearly 800,000 users.

From netizens

Meanwhile, a number of users on Zhihu (https://www.zhihu.com/question/58802374) shared the deleted databases that horrified them and still haunt them to this day.

@ChangKaula:

A new graduate at our company was weak at the job and did not know how to do much of anything, so we started her off by asking her to help inventory the equipment in the server room.

One asset tag was hard to read, so she pulled the blade server straight out of its slot. The colleague beside her saw it and froze on the spot. The business system was down for ten minutes. The team lead did not say anything; he just would not let her continue the inventory. As for her, she did not even realize she had caused trouble.

@PotatoDad:

Years ago (2001), back when it was all a Unix character interface, I was doing routine maintenance in the middle of the night and deleted a database containing 200,000 books. After ten minutes spent confirming to myself that something really was wrong, I broke out in a sweat, and my stomach felt as if it had been punched so hard that it cramped and hurt too much for me to sit down.

It was only after I had gone out to the corridor and smoked two cigarettes that I recalled I had done a full system backup the day before, so not much data had been lost!

It is a feeling I will remember for the rest of my life.

@ai0by:

My server was on Vultr, with more than 1,000 users and plenty of traffic. One day I opened another test machine on Vultr, and when I finished testing and went to delete it, I deleted the wrong machine: the one hosting the website. (I have to gripe about Vultr's server interface. I assumed the newly opened machine must be the one at the bottom and deleted it without looking, not realizing the one at the bottom was not the most recently opened!)

All I can say is that I panicked. It felt like a dream. Sweating, I could only watch the message confirming the deletion. I submitted a ticket immediately, and Vultr told me the deleted machine could not be recovered. In an instant it felt as if all that time running the business had been for nothing. It is hard to imagine that, after operating for so long, one wrong operation could finish everything.

Later I remembered that the machine had been backed up a week earlier, so I opened another machine and restored the image to it. At least the site was saved, and I re-entered the lost data by hand afterwards.

The moment I deleted it, a flood of users came to me. I could only reply calmly that the site was under maintenance while panicking inside. By the time the problem was almost solved, my back was soaked. I never want to go through that again. Remember to make backups. Remember, remember, remember!

Back then, what else could we have done besides running away?

Looking at this incident alongside the ones above, many netizens questioned Shunfeng's response and its internal systems, asking: with the engineer dismissed, is Shunfeng itself completely free of responsibility? After paying such a steep price to train an ops engineer, do you just send him away?

Looking beneath the surface, we cannot help asking: can Shunfeng really set aside its own responsibility for process failures simply by firing someone? Fortunately, the accident has not caused irreparable consequences. What Shunfeng should do is not rush to dismiss the employee involved, but use the lesson to examine its internal problems:

The incident was partly the engineer's own mistake, but does it not also reflect lax day-to-day management processes and irregular operating procedures?

Safety responsibilities were not clearly assigned: apart from the employee involved, should his direct supervisors not also be held responsible?

Permission control was muddled: how could a single ops engineer operate directly on the production database?

Disaster recovery was weak: for a company as large as Shunfeng, how did it take 590 minutes to get from incident to recovery?
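
One safeguard that speaks to the ignored pop-up in the incident report is to replace a plain "press Enter to confirm" dialog with one that makes the operator retype the name of the target. The following is only a minimal shell sketch under assumptions: the function names confirm_target and guarded_run are hypothetical, the echo command in the usage line stands in for a real destructive operation, and nothing here reflects Shunfeng's actual tooling.

```bash
# confirm_target NAME: succeed only if the operator retypes NAME exactly.
# (Hypothetical helper; prompts go to stderr so stdout stays clean.)
confirm_target() {
    local expected="$1" typed
    printf 'About to run a destructive operation on "%s".\n' "$expected" >&2
    printf 'Type the name again to continue: ' >&2
    read -r typed || return 1
    [ "$typed" = "$expected" ]
}

# guarded_run NAME COMMAND...: run COMMAND only after confirm_target passes.
guarded_run() {
    local target="$1"
    shift
    if confirm_target "$target"; then
        "$@"
    else
        echo "Confirmation failed; nothing was executed." >&2
        return 1
    fi
}
```

A call such as guarded_run RUSS echo "placeholder for the real command" does nothing when the operator simply presses Enter, because an empty reply never matches the target name.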

Given these problems, how can we keep "delete the database and run" incidents from happening again?

On this front, companies must first get permission management and multi-party review mechanisms right. Beyond that, CSDN has also taught many programmers how to use rm carefully on Linux, so as to avoid the tragedy of deleting a database and running away.

One option is to remap the rm command onto mv, which amounts to a custom recycle bin for Linux. It can be implemented as follows.
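
A minimal sketch of such a recycle bin is given here, and it is not necessarily the script the article has in mind. It assumes a trash directory at ~/.trash (a hypothetical location, overridable through TRASH_DIR) and uses shell functions rather than aliases so that it also works in non-interactive shells. The command names rl, unrm and rmtrash follow the ones used in this article.

```bash
# Hypothetical trash location; set TRASH_DIR beforehand to change it.
TRASH_DIR="${TRASH_DIR:-$HOME/.trash}"

# Drop any existing rm alias so the function definition below parses cleanly.
unalias rm 2>/dev/null || true

# rm: move the arguments into the trash instead of deleting them.
# rm-style options (-r, -f, ...) are skipped because mv does not need them;
# as a side effect, files whose names start with "-" are skipped too.
# A file with the same name already in the trash is overwritten.
rm() {
    local item
    mkdir -p "$TRASH_DIR" || return 1
    for item in "$@"; do
        case "$item" in
            -*) continue ;;
        esac
        mv -- "$item" "$TRASH_DIR/" || return 1
    done
}

# rl: list what is currently in the trash.
rl() {
    ls -A "$TRASH_DIR" 2>/dev/null
}

# unrm NAME...: move the named entries from the trash back to the current path.
unrm() {
    local name
    for name in "$@"; do
        mv -i -- "$TRASH_DIR/$name" ./ || return 1
    done
}

# rmtrash: really delete the trash contents, but only after a y/Y answer.
rmtrash() {
    local answer
    printf 'Empty the recycle bin? [y/N] '
    read -r answer
    case "$answer" in
        y|Y) command rm -rf -- "${TRASH_DIR:?}"/* ;;
        *)   echo "Recycle bin left untouched." ;;
    esac
}
```

Inside rmtrash, command rm bypasses the function and calls the real rm, in the same way that typing /bin/rm does at the prompt.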

Finally, add the script to /etc/bashrc and run source /etc/bashrc so that it takes effect immediately.

The script defines several commands:

rl: list the files in the recycle bin.

unrm <file or directory>: restore it to the current path.

rmtrash: empty the recycle bin, after a friendly confirmation prompt.

Running rm no longer actually deletes anything; it uses mv to move the target into the recycle bin we specified. If you really want to delete something, use /bin/rm. Also note that some options of the original rm command may no longer work, because rm is now really mv.

Whether you work in ops, as a DBA, or as a programmer, pay close attention to your everyday working habits and remember that a single slip can bring lasting regret. It is also important to have automatic disaster recovery and data synchronization in place and to review them. And finally, important enough to say three times: don't forget to

Back up!

Back up!

Back up!
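
As an illustration of the habit rather than a complete backup strategy, the sketch below archives a directory under a timestamped name and then lists the archive back to confirm it is readable. The function name backup_dir and the paths in the usage line are hypothetical, and a real database would normally be backed up with its own dump tool rather than by copying files.

```bash
# backup_dir SRC DEST: archive directory SRC into DEST as
# <name>-<timestamp>.tar.gz and print the path of the new archive.
backup_dir() {
    local src="$1" dest="$2" stamp archive
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="$dest/$(basename "$src")-$stamp.tar.gz"
    # Archive relative to the parent directory so the restore path stays short.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Read the archive back; an unreadable backup is as bad as no backup.
    tar -tzf "$archive" >/dev/null || return 1
    echo "$archive"
}
```

For example, backup_dir /srv/site /backups would write something like /backups/site-20180920-031500.tar.gz, which can be restored with tar -xzf into an empty directory.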

