Not legally an Engineer Yet, Call me an EIT
The best-practice rule for backups is that you need three copies of anything important, stored on at least 2 different media, with at least 1 copy in a separate location. This way if a site gets destroyed (say Google suddenly cancels Google Drive, they’ve canceled weirder), you still have whatever is at your other location. If one format becomes difficult to read (say CD players become untenable to get ahold of as obsolete tech), you still have the alternate format for your data. Overall this strategy keeps you safe from most data loss events. If you keep your backups organized, you can also always find things when you need them.
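As a rough illustration of the rule, here is a small Python sketch that checks whether a set of copies meets it. The `BackupCopy` type, its field names, and the example labels are all made up for illustration; the check is only as honest as the inventory you feed it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data (hypothetical type, for illustration only)."""
    name: str      # free-form label, e.g. "desktop"
    medium: str    # e.g. "hdd", "cloud", "optical"
    location: str  # e.g. "home", "office", "google-drive"


def satisfies_3_2_1(copies, primary_location):
    """Return True if there are >= 3 copies, on >= 2 media, with >= 1 off-site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.location != primary_location for c in copies)
    return enough_copies and enough_media and has_offsite


copies = [
    BackupCopy("desktop", "hdd", "home"),
    BackupCopy("nas", "hdd", "home"),
    BackupCopy("drive-backup", "cloud", "google-drive"),
]
print(satisfies_3_2_1(copies, primary_location="home"))
```

Dropping the cloud copy from the list leaves two copies on one medium in one place, and the check fails on all three counts.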
Using Rsync was the first serious backup effort that I actually figured out to some extent. With the ability to hook up to everything, it’s a great way to move data between different cloud services or even from the cloud onto a local system. Hooked up to a cron job it really is an ideal setup. My favorite use for it, however, is moving from a bad system to a good system. If you currently have spotty backups in various locations and are making the swap over to a properly backed up 3-2-1 setup, grabbing all that data is going to be a major task.
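A minimal sketch of what a cron-driven job might look like, assuming rsync is installed and both ends are reachable as local paths or over SSH (cloud storage would need to be mounted first or handled by another tool). The paths and the `build_rsync_command` helper are placeholders, not anything from my actual setup. It only builds and prints the command, so you can check it before running anything.

```python
import shlex


def build_rsync_command(source, destination, dry_run=True):
    """Build an rsync argument list (hypothetical helper, does not run rsync)."""
    # --archive keeps permissions and timestamps and recurses into directories;
    # --partial keeps partially transferred files so big copies can resume.
    command = ["rsync", "--archive", "--partial"]
    if dry_run:
        # --dry-run reports what would be copied without touching anything.
        command.append("--dry-run")
    # A trailing slash on the source copies its contents, not the folder itself.
    command += [source, destination]
    return command


command = build_rsync_command("/mnt/old-backups/", "/mnt/new-pool/backups/")
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Once the dry run looks right, the same script with `dry_run=False` can be called from a crontab entry, for example nightly.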
To keep a reliable drive system, my go-to used to be that drives should be in a safe RAID configuration, or tied together with some form of ZFS. This is no longer my go-to. Ceph manages all storage directly at the device level and keeps things properly copied, in such a way that if you have enough drives, you can lose drives constantly without having to manually rebuild. You just need enough drives to keep ahead of losses. You can even have your multiple locations and multiple servers built in to your setup through sufficient configuration. If a server goes down and a new one is added, then the data replication will simply be rebuilt on the new system.
One of the best things about Ceph Storage Clusters is the fact that they can happily handle variously sized drives without needing to downsize the larger drives to match the smaller, or creating any additional overhead. The system can handle making data highly available, so that servers can go down intermittently without destroying access to the cluster as a whole. It’s honestly such a powerful system that it’s surprising it isn’t more popular among individuals. Another very nice feature that is underappreciated is the fact that it can scale as more storage is added, without any drawbacks.
An example use case for an individual would be setting up their network storage. Say you start with 3 1TB drives, and are willing to be a bit risky with only 1 replica of any data (two copies in total). You’ll have a safe cluster size of about 1.5TB; if you want 2 replicas you need to drop down to 1TB max storage. This works for a while, and then you buy a 10TB drive and add it. The system will rebalance the data for best safety, and you get to choose whether to add additional replicas if you feel it’s necessary. You can then have a safe size of 3TB (you’re limited by the size of your smaller nodes), or a risky cluster size of up to 13TB with no replication. If you add an additional replica, the safe size drops by half. If you throw an additional 10TB drive on there you can bring the 1-replica safe cluster size up to about 11.5TB.
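To sanity-check numbers like these, here is a back-of-the-envelope Python sketch. It is an idealised model, not how Ceph actually places data: it assumes every copy of an object must sit on a different drive, and it ignores CRUSH weighting, metadata overhead, and the free space Ceph wants in reserve. The function name is made up, and `copies` counts total copies (so "1 replica" in the sense above is `copies=2`).

```python
def usable_capacity_tb(drive_sizes_tb, copies):
    """Best-case usable space when each object is stored `copies` times on distinct drives."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    if len(drive_sizes_tb) < copies:
        return 0.0  # not enough drives to hold every copy separately

    remaining_total = float(sum(drive_sizes_tb))
    remaining_copies = copies
    for size in sorted(drive_sizes_tb, reverse=True):
        fair_share = remaining_total / remaining_copies
        if size <= fair_share:
            # No remaining drive is oversized, so the space divides evenly.
            return fair_share
        # This drive is bigger than everything else can mirror: it holds one
        # copy of all the data and the rest of its space goes unused.
        remaining_total -= size
        remaining_copies -= 1
    return 0.0


print(usable_capacity_tb([1, 1, 1], copies=2))
print(usable_capacity_tb([1, 1, 1, 10], copies=2))
print(usable_capacity_tb([1, 1, 1, 10, 10], copies=2))
```

The thing to notice is the oversized-drive case: a single 10TB drive next to three 1TB drives cannot be mirrored past what the small drives hold, which is why adding a second large drive makes such a difference.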
If a drive fails, you can simply replace it without bringing down the cluster. So long as not too much of the cluster fails at any point, you’re safe from data loss or downtime. With Ceph, your biggest problem becomes finding places to plug in sufficient drives. Most techie people will have a bank of lower-capacity drives that they would love to add to their clusters; having sufficient power and SATA ports to support them all is the actual challenge with this setup.
I set up a single-node Ceph cluster using Hyper-V and 1 drive on my system. Ideally I’d like to expand out to many nodes and many drives; however, for the time being setting up the access and a single OSD is sufficient to learn a little bit.
My intention with the setup is to add additional Ceph nodes as possible to try and increase my available storage and reliability. Starting with my ~200GB of unsafe storage, I’ll add replication and additional safety as I increase my cluster size.
Cloud storage seems cheap, but they’ll torture you if you actually use what you pay for. I’m rather fed up with Google Drive’s storage settings, and I need to go and see if I can disable the warnings. I’ve got an account with a grand total of 4TB of storage. Thanks to backups, it’s 85% full. This drives Google up the wall, as they sell that storage clearly assuming you won’t use it all. Compared to their bucket storage costs, Google Drive is a steal of a deal, so long as you assume that Google is keeping all your data fully accessible and properly redundantly backed up (hint: google it and you’ll find instances of Google losing people’s files, at least temporarily…).
Cloud storage acts as a valid backup for local storage; however, for the average data hoarder it can get quite expensive, especially if it’s going to be a method you’re actually accessing on a regular basis. At time of writing:
There are other storage solutions in the cloud worth considering; however, a lot of them can be confusing because the pricing structure is so heavily dependent on the level of activity of the storage, as well as the physical distance. Another thing to bear in mind is the minimum storage time. A lot of the archival costs are based on a minimum time stored of 1 year. $1.20/TB/month is $14.40/TB/year. Again not a lot of money, but it can certainly creep up. Using the cloud as an emergency rebuild option is not a bad idea; the high costs for retrieval coupled with the relatively low monthly storage bill make it fairly reasonable, especially remembering that you can compress everything that you’re putting into your deeper archival storage.
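The minimum storage time is the part that is easy to miss, so here is a small Python sketch of the arithmetic. It assumes the $1.20/TB/month rate and the 1-year minimum mentioned above, and it deliberately leaves out retrieval and network fees, which vary by provider. The function name and parameters are made up for illustration.

```python
def archive_storage_cost_usd(size_tb, price_per_tb_month, months_stored, minimum_months=12):
    """Storage-only cost; you are billed for at least `minimum_months`."""
    billed_months = max(months_stored, minimum_months)
    return round(size_tb * price_per_tb_month * billed_months, 2)


# 1TB kept for a full year at $1.20/TB/month.
print(archive_storage_cost_usd(1, 1.20, months_stored=12))

# 3.4TB (85% of a 4TB account) deleted after 3 months still bills the full year.
print(archive_storage_cost_usd(3.4, 1.20, months_stored=3))
```

Twelve months at $1.20 comes to $14.40 per TB, and pulling the data out early does not make it any cheaper.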