
The Surprising Truth Behind the Mysterious Failure of My Raspberry Pi


  • Community Administrator

The start of the day…

It was a typical Friday morning on March 17th, 2023, and just like any other day, @Neptune and I started our day at our computers. My routine usually began with a sprint to catch up on emails and sift through the "Stuff that happened whilst you were asleep" pile of work that always seemed to accumulate overnight - a fun perk of being business owners, as we liked to sarcastically joke. As @Neptune brewed us a cup of coffee, we dove into our work, but our grumbling stomachs soon interrupted us. It was time to fuel up.

Our stomachs were growling fiercely, so we headed straight to the kitchen, hoping to satiate our hunger pangs. We were both craving food - and we wanted it fast. After scouring the cupboards, we settled on a fresh loaf of bread and decided to make some good old-fashioned toast. Cereal isn’t really our thing, so we rarely keep it in the house. We retrieved some butter from the fridge, pulled out a few slices of bread, loaded them into the toaster, and pushed down the lever to initiate the toasting process. Side note: it turns out that our toaster has a special mode for bagels, although we aren't entirely sure how that differs from regular toasting, or why it’s even a thing.

Without warning, the toaster's supposedly innocuous red heating elements sparked a blinding blue flash. In an instant, all the devices plugged into our outlets went dead. We had inadvertently tripped the RCD in our fuse box, causing a total power outage in our home. As if that wasn't bad enough, the toaster had also failed to properly toast our bread, leaving it completely untoasted.

Fantastic… So my technical difficulties aren’t really limited to my streams!

In the wake of the terrifying incident, I couldn't help but let out a few expletives, while my wife was understandably distraught over the loss of all her hard work. After restoring the breaker to bring power back on, I set out to investigate what had gone wrong. Turning off the toaster at the wall, I removed the bread and took a closer look. It quickly became clear that the metal cage meant to hold the bread in place had become stuck, inadvertently making contact with those hazardous red wires. Thankfully, with a bit of jostling, I was able to free the cage and get the toaster up and running again, allowing us to finally enjoy our toast without any further complications.

After enjoying our toast, reassured by the fact that we had identified and resolved the issue, we switched our computers back on and resumed our day without a care in the world. The incident was quickly forgotten as we focused on the work ahead, grateful that it hadn't turned into something more serious.

But something was wrong. Very badly wrong.

Backstory…

In addition to being a business owner, I am also a streamer on Twitch. I stream three times a week, every Monday, Wednesday, and Friday in the evening (UK time), primarily for my own enjoyment. However, I also happen to be a software developer (admittedly not a great one, but I have my moments) with a knack for Linux servers and hardware. I consider myself a well-rounded tech enthusiast. As part of my streaming activities, I run a Discord server where my followers can gather and communicate. It's also where my bot, "Tem's Little Helper," posts a notification in a designated channel whenever I start streaming.

image.png

Twitch sends the emails out to my followers sometimes up to an hour after I start streaming, whereas my little helper can do it just 20 seconds after I start streaming, and sometimes even sooner. So Tem’s Little Helper is very important to me. It’s written in C# on .NET 7, and this little fellow runs on my Raspberry Pi 4. Yep. One of these:

1_c1vv7b34H8WdvqeD_SBxgw.jpg

You see, tucked away quietly on my desk is a Raspberry Pi 4, 8GB model, that has been… modified, to put it nicely. My Raspberry Pi 4 is a small computer that's perfect for running low-power tasks like this. It's always on, connected to the internet, silently doing its job. I love that little thing, and I'm constantly finding new things to do with it. I was extremely lucky to get my hands on one, because right now they’re being absolutely scalped on the second-hand market, and their prices are obscene, even if you can get hold of one. This little Raspberry Pi is the brains of my Discord bots, with the exception of @ECCHIA for the EcchiDreams Discord Server, which runs on the actual website’s server. My brother-in-law also runs several bots of his own, written in C++, on the same Pi. All of these bots have been running flawlessly for months now. It really does a very good job of running them.

IMG_0428.jpg

As you can see from the picture, I’ve modified mine so that a copper shim, with thermal paste on both sides, sits between the CPU and the massive heatsink that surrounds it. For those playing along at home: when I first got it and set it up without any cooling, it easily reached 80°C on the CPU, which would most likely have engaged thermal throttling, and even at idle it ran at 50-60°C. So I got a heatsink, which greatly improved things, and added two fans to it, which improved things again. The heatsink originally came with thermal pads to connect the SoC (System on a Chip) to the heatsink, but I got much better results by taking the thermal pad off, putting some quality thermal paste on the CPU, laying down a copper shim, and then putting more thermal paste on top of the shim where it meets the heatsink. This improved temperatures by around 5°C. By this point the idle temperature of the Pi was only a few degrees above ambient. Under full synthetic load I was only able to get it up to 31-32°C at most, which allowed me to introduce this into the config:

[all]
over_voltage=6
arm_freq=2100
gpu_freq=600

With this setting, I can increase the CPU performance across all four cores to 2.1GHz. Despite the boost, the temperature never exceeds 37°C even under full synthetic load. Given the efficient cooling system, I don't anticipate any issues with overclocking. The use of a heatsink and fans demonstrates how simple and cost-effective modifications can make a significant impact on performance.
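For anyone wanting to confirm that an overclock like this really stays clear of throttling, here is a minimal Python sketch. It assumes the `vcgencmd` tool is installed on the Pi and that you paste in the output of `vcgencmd get_throttled` and `vcgencmd measure_temp` yourself. The function names are made up for illustration, and the bit meanings are those listed in the Raspberry Pi documentation.

```python
# Flag bits reported by `vcgencmd get_throttled` (per the Raspberry Pi docs).
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(line: str) -> list[str]:
    """Turn a line such as 'throttled=0x50005' into readable flag names."""
    value = int(line.strip().split("=", 1)[1], 16)
    return [name for bit, name in sorted(THROTTLE_FLAGS.items()) if value & (1 << bit)]


def parse_temp(line: str) -> float:
    """Turn a line such as "temp=37.0'C" into degrees Celsius."""
    return float(line.strip().split("=", 1)[1].rstrip("'C"))


if __name__ == "__main__":
    print(decode_throttled("throttled=0x0"))  # an empty list means nothing was flagged
    print(parse_temp("temp=37.0'C"))
```

An empty list from `decode_throttled` is what you would hope to see after a long synthetic load at 2.1GHz.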

On February 5th, 2023, at 20:22:24 (UTC), I used the Raspberry Pi Imager to flash the SD card and started setting up the Raspberry Pi 24 minutes later at 20:46:22 (UTC), according to the boot partition timestamps visible under Windows. My brother-in-law and I spent about an hour or two configuring it before logging out of SSH and leaving it to run in the background while we worked on other projects. Throughout this time, Tem's Little Helper remained reliably punctual. No later than 20 seconds after I began broadcasting, he’d post on the Discord: “Everyone! Tema’s streaming such and such, right now!” Without fail.

The Storm Clouds are a Gatherin’!

Everything was going smoothly with Tem's Little Helper until Friday evening on March 17, 2023, when he unexpectedly took his first sick day. As usual, I started my stream and waited for Tem's Little Helper to announce it on the Discord channel, but nothing happened. One minute passed, then two, and still, there was no announcement. Three minutes into my stream, I began to worry that something was wrong.

I proceeded to manually send a notification, but to my dismay, I discovered that Tem’s Little Helper wasn't even listed as online on the server. In an attempt to resolve the issue, I restarted my Raspberry Pi, but unfortunately, there was nothing more I could do at that moment. I had to abandon the problem and start streaming the game as planned. It seemed that this was only the beginning of my troubles since the power cut had also caused some unexpected problems with my OBS settings. It became apparent that the issues I was experiencing with my bot and my stream were somehow linked to the earlier incident that day.

So yeah, not great.

Following the stream, I planned to investigate the issue, but I started feeling drowsy and decided to take a nap. 

My problems are about to get worse… 

After waking up and having a proper meal, I connected the Raspberry Pi to my HDMI capture card and plugged in a USB keyboard to run some diagnostics. However, when I restarted the Pi, I was greeted by an unexpected screen.

Thu Jan 1 00:00:04 UTC 1970
writable: recovering journal
writable: Superblock needs_recovery flag is clear, but journal has data.
writable: Run journal anyway

writable: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. 
	(i.e., without -a or -p options)

fsck exited with status code 4
The root filesystem on /dev/mmcblk0p2 requires a manual fsck

BusyBox v1.35.0 (Ubuntu 1:1.35.0-1ubuntu1) built-in shell (ash)

Enter ‘help’ for a list of built-in commands.

(initramfs) _

I am familiar with "Status Code 4" as it indicates that the errors on the filesystem were not corrected. I recall encountering a similar issue on a Linux system with a failing hard drive. When I tried to access it on a Windows system, I encountered multiple S.M.A.R.T errors, and eventually, the hard drive had to be disposed of as electronic waste.
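The fsck exit status is actually a sum of flag values, so a single number can carry several conditions at once. As a rough sketch (the function name is just for illustration; the meanings are taken from the fsck man page), it can be decoded in Python like this:

```python
# Exit status bits as listed in the fsck(8) man page.
FSCK_EXIT_BITS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking cancelled by user request",
    128: "shared-library error",
}


def explain_fsck_status(status: int) -> list[str]:
    """Return the human-readable conditions encoded in an fsck exit status."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in FSCK_EXIT_BITS.items() if status & bit]


if __name__ == "__main__":
    print(explain_fsck_status(4))
```

A status of 4 on its own, as seen on the screen above, decodes to "filesystem errors left uncorrected" and nothing else.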

"Well, bugger," I exclaimed aloud. I had never encountered this issue on a Raspberry Pi before, especially not with a high-end 128GB Sandisk Extreme MicroSD-XC card. It seemed likely that the improper shutdown resulting from the RCD trip that morning was to blame. After all, I had never experienced a failure with a Sandisk MicroSD card before. I attempted to use my Linux knowledge to repair the filesystem, but despite my efforts, nothing seemed to work. Although the system appeared to be writing the changes, they were not taking effect and the old data remained.

I removed the microSD card from the Raspberry Pi and inserted it into my computer running Windows. I opened the Disk Management service and noticed that the drive was labelled as "Offline: This disk is offline due to a policy set by an administrator." I found this strange, as I didn't recall setting any such policy. To fix this, I launched 'diskpart', selected the disk, brought it online, removed any attributes such as "readonly", and used the "clean" command. Diskpart reported that the disk clean was successful, and when I refreshed the disk management, I found that the partition and drive were still present and intact.

The current situation didn't match the expected outcome. Normally, when you run the "Clean" command in Diskpart, it should remove the partitions and volumes from the drive, making it ready to be formatted. However, in this case, the partition and drive were still intact after running the command. It's worth noting that using the "Clean" command in Diskpart is not recommended unless you have a good understanding of what you're doing.

I revisited Diskpart and repeated the clean command, followed by formatting the SD card within Diskpart. The formatting process seemed to complete without any issues, but when I checked the SD card afterwards, I found that all of the old data and partition tables were still present. Despite my computer's display of successful formatting, it hadn't actually wiped the data. This led me to consider the possibility that the SD card had failed, but I was left wondering what could have caused it to fail.

Could the reason for the failure of my memory card be attributed to the RCD trip or its underlying cause? It's difficult to fathom. There are multiple layers of safeguards, such as fuses, power supplies, and other protective measures between my kitchen toaster and my Raspberry Pi, which is connected to the 90W USB-C port on my monitor. There has to be a more plausible explanation for the memory card's failure, although the timing is a bit suspicious.

I attempted to resolve the issue by using a third-party partition software that promised to delete the partitions and fully format the new partition. The software even reported that the partition was created successfully, but in reality, nothing had changed. Although Windows was writing to the SD card, the writes were not persistent. I even added my own files to the boot partition, which were copied and stored on the drive, and I was able to copy them back from the MicroSD card. But upon unplugging and plugging back in the MicroSD card, the files vanished without a trace. It was as if they had never existed in the first place.

At this point, I was puzzled and decided to load up my preferred hex editor, HxD, and attempted to manually write some information to the memory card. To my surprise, the edits were successfully saved and appeared correct, but when I scrolled to another sector and went back to the one I had just written to, the changes had disappeared. In a last-ditch effort, I turned to Linux and used the ‘dd’ command to overwrite the entire card with zeros from start to finish, but it still retained the old data. As a final attempt, I ran a write test using ‘H2testw’ and discovered that none of the data it had written could be read back successfully.
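The key lesson from these tests is that a write only counts if it can be read back after the card has been removed and re-inserted, because until then the operating system may simply be serving the data from its cache. The following Python sketch illustrates the idea in two phases. It is not how H2testw or dd work internally, and the function names, block size and seed are assumptions chosen for illustration: run `write_pattern` on a file on the card, re-insert the card, then run `verify_pattern`.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # arbitrary block size for this sketch


def pattern_block(seed: str, index: int) -> bytes:
    """Deterministic pseudo-random block that can be regenerated later."""
    return hashlib.shake_128(f"{seed}:{index}".encode()).digest(BLOCK_SIZE)


def write_pattern(path: str, seed: str, blocks: int) -> None:
    """Phase 1: write the pattern and ask the OS to flush it to the device."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(pattern_block(seed, i))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path: str, seed: str, blocks: int) -> list[int]:
    """Phase 2 (after re-inserting the card): list the blocks that do not match."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(BLOCK_SIZE) != pattern_block(seed, i):
                bad.append(i)
    return bad
```

On a healthy card `verify_pattern` should return an empty list. On a card that silently discards writes, like this one, the file would most likely be missing or full of mismatches after re-insertion.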

Upon conducting several searches and utilising a non-Google based search engine tool, I stumbled upon an alarming discovery that my memory card was permanently locked in read-only mode due to hardware failure or other causes. It appears that as the card approaches the end of its lifespan, it will become locked in read-only mode to prevent any data corruption or loss. This can be likened to a final safeguard for your files, giving you ample time to back them all up. Fortunately, I was able to extract and safeguard all of the files from the MicroSD Card, and I was able to read from it without issue. Basic tests conducted also yielded favourable results.

Things aren’t as they appear…

I couldn't believe that our toaster was the culprit; it just didn’t make any sense to me. There should have been other hardware failures, not least my monitor. But as I needed a replacement, I ordered a new MicroSD Card from Amazon (this time a Samsung-branded one).

Whilst I was restoring data onto the new MicroSD card (thanks to same-day delivery!), I noticed a pattern in the backup image: all files were last modified on February 6th. 

Get ready for this because I had a sudden realisation. I had security software installed that frequently modifies a file by recording the Unix timestamp in a .lock file. And what do you know? The last time that file was updated was on February 6, 2023, at 1:30:42 AM, and it only contained the Unix timestamp 1675647042 which is the exact same timestamp the file was modified on. I realised that it wasn't the toaster after all! 

It turned out that the MicroSD Card had failed a mere 4 hours, 44 minutes, and 20 seconds after we had initially set it up. Even though we had spent several hours configuring the Pi, it had appeared to be functioning correctly at the time.
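The arithmetic is easy to double-check. This short Python sketch uses only the two values quoted above (the timestamp from the .lock file and the setup start time from the boot partition) and treats both as UTC:

```python
from datetime import datetime, timezone

# Unix timestamp found inside the .lock file.
lock_written = datetime.fromtimestamp(1675647042, tz=timezone.utc)

# Time we started setting up the Pi, from the boot partition timestamps.
setup_started = datetime(2023, 2, 5, 20, 46, 22, tzinfo=timezone.utc)

# How long the card accepted writes after setup began.
survived = lock_written - setup_started

print(lock_written.isoformat())  # 2023-02-06T01:30:42+00:00
print(survived)                  # 4:44:20
```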

Indeed, the Raspberry Pi was still functioning properly since it had all the necessary components loaded into its RAM. However, it was only capable of reading files from the MicroSD Card and not writing to them, rendering it a "zombie" device. The Pi continued to operate in this state for about two hours after we went to bed, without either of us realising it. As previously mentioned, the device is typically silent and unobtrusive, and sits quietly on my desk. It was dead Jim, but not as we know it.

The Pi's MicroSD card had become read-only; nonetheless, the bot continued to function flawlessly until the Pi was ultimately power cycled. The question of how long it would have continued to work without a reboot is unanswerable. Nevertheless, from February 6th, 2023, until the day it was rebooted, the bots running on the Pi were functioning perfectly, including the one my brother-in-law had written in C++ for his Discord server.

The fact that the Pi and bots continued to function flawlessly despite the damage to the MicroSD card is truly remarkable. We were left wondering what could have caused this to happen. Unfortunately, my brother-in-law and I couldn't come up with a definite answer. However, I will be sending the MicroSD card back to SanDisk for a warranty claim and they provided a possible explanation:

[Image]

The representative from SanDisk proposed a theory that the Raspberry Pi had written and overwritten data to the MicroSD card so frequently that it ultimately caused the card to fail. This theory has some merit, given that all NAND Flash has a limited number of times that data can be overwritten before it starts to fail. However, it is unclear how this is supposed to be prevented in high or maximum endurance cards.

Perhaps the solution lies in the controller itself. It's possible that a standard SD card controller allows for data overwriting on certain cells, leading to their eventual wear-out and triggering the read-only mode. This is somewhat concerning, as it implies that a few cells going bad could render the entire drive useless, rather than simply marking them as "bad sectors" like a hard drive would.

My assumption is that high-endurance or max-endurance cards incorporate either built-in wear-levelling or over-provisioning, or use superior NAND packages. It's possible that a combination of these factors is also at play. This could explain some of the differences between these cards and their standard counterparts.
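To illustrate why wear-levelling would matter so much, here is a deliberately simplistic Python toy model. It is pure assumption and says nothing about how SanDisk's controllers actually behave: the block count and endurance figure are invented, and real controllers are far more sophisticated. It only counts how many times one frequently rewritten piece of data (a log file, say) can be rewritten before some physical block runs out of write cycles.

```python
def writes_until_failure(blocks: int, endurance: int, wear_levelling: bool) -> int:
    """Count rewrites of one 'hot' piece of data before a physical block wears out."""
    wear = [0] * blocks  # write cycles used so far on each physical block
    writes = 0
    while True:
        # Without levelling, the hot data always lands on physical block 0.
        # With levelling, the controller picks the least-worn block each time.
        target = wear.index(min(wear)) if wear_levelling else 0
        if wear[target] >= endurance:
            return writes
        wear[target] += 1
        writes += 1


if __name__ == "__main__":
    # Hypothetical numbers: 8 blocks, each good for 100 write cycles.
    print(writes_until_failure(8, 100, wear_levelling=False))  # 100
    print(writes_until_failure(8, 100, wear_levelling=True))   # 800
```

In this toy model, spreading the writes across every block multiplies the card's useful life by the number of blocks, which is the sort of difference an endurance-rated card might be exploiting.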

Unfortunately:

[Image]

But they were accommodating of this. I’m in the process of sending this off to them now, although now it looks like this: 

20230321_095630.jpg

The Conclusion…

I'm apparently getting a replacement MicroSD Card from SanDisk, and they even gave me a 15% off coupon for a high/max endurance card if I order from their website. I have to say, that's some great customer service and I commend them for it. However, there's one downside - I have to ship the card from the UK to the Czech Republic, so I'm a bit worried about the shipping cost. But I'll keep you updated on how it goes. Overall, SanDisk's RMA process has been really impressive, and it definitely gives me more confidence in their products knowing that they stand behind their warranty claims. I'll link to any follow-ups on this post, so stay tuned.

  • Community Administrator
On 21/03/2023 at 12:10, Neptune said:

R.I.P Zombie SD card. You crashed harder than Brett o7

Brett liked to party hard and crash even harder. 

On 21/03/2023 at 12:21, Icarian Dreams said:

Maaaan, that's quite the series of events. I have to say, I'm really positively surprised by how accommodating the SanDisk support has been. And glad to learn that I should be using higher-endurance cards in case I ever end up with a Raspberry 😛 RiP Zombie SD card o7

I am very surprised by this as well. Good on SanDisk, I say. Let's see if the rest of the RMA process goes smoothly. I am even more surprised that the Pi was running absolutely fine for over a month, despite its only form of storage being effectively dead.

On 21/03/2023 at 12:53, SMFoxy said:

Well...

...

fsck

fsck, indeed.


Archived

This topic is now archived and is closed to further replies.
