I have always wondered this, as the amount of data sent to the Google servers every couple of seconds is like 1 TB.
Plus, sites like YouTube keep like 6 different encodes of the same video, and up to 9 if the original video is 8K, and for every video uploaded to YouTube they also keep the original file.
How can they keep up with so much data being sent to their servers?
Google is unimaginably rich. They can afford that space. But that is interesting, what sort of drives do they use, if there are any special storage devices, etc.
One of Google's long-term missions is to be the repository for all human-produced information.
Google is constantly increasing/upgrading their storage space, and there are many data centers around the world:
Which raises the question: what form of data storage has the highest density?
Used to be tape.
Would be flash memory probably.
Not the cheapest though!
I think atomic force microscopy still holds the record for densest information storage.
Reading the data back, though, is a bit more complex.
Since they own their own data centres they can pretty much control every aspect of them. My guess is they have dedicated staff whose entire job is adding disk space: ordering, assembling, racking, and plugging in systems. They probably have custom setups where everything snaps together; they put something like 60 of the highest-capacity drives available into a single 4U chassis, slide it in, plug it in, and the cloud system takes over from there. There are probably 10+ people doing this non-stop all day. When they run out of floor space they build a new data centre. At some point they probably take down older storage servers to replace them with newer, higher-density drives, and just repeat.
But it is mind-boggling just how much data they take in, so I do wonder how they in fact keep up. Even with people putting together storage pods day after day, it seems like it would still not be fast enough.
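Out of curiosity, here is a back-of-envelope sketch of what "keeping up" might mean in drives per day. All the input numbers are illustrative guesses pulled from this thread (the "1 TB every couple of seconds" figure, a 3x replication factor, and 12 TB drives), not anything Google has published.

```python
# Fermi estimate: how many new drives per day would it take to absorb
# a hypothetical ingest rate? All numbers are illustrative assumptions.

INGEST_TB_PER_SEC = 0.5   # assumed: ~1 TB every couple of seconds
REPLICATION = 3           # assumed: each byte stored 3 times
DRIVE_TB = 12             # largest enterprise drive mentioned in the thread
SECONDS_PER_DAY = 86_400

raw_tb_per_day = INGEST_TB_PER_SEC * SECONDS_PER_DAY * REPLICATION
drives_per_day = raw_tb_per_day / DRIVE_TB

print(f"{raw_tb_per_day:,.0f} TB/day -> {drives_per_day:,.0f} new 12 TB drives/day")
```

Even with these rough numbers it comes out to roughly ten thousand drives a day, which gives a feel for why dedicated rack-and-stack teams would be plausible.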
... but like MiniDV, does the tape degrade quickly after a couple of uses?
Actually, I worked with a mate who looked after a data storage centre. The tapes can last a very long time, 10-20 years depending on quality, BUT, to achieve that,
they are kept in a temperature-controlled room, flipped every few months or so to avoid print-through, and spooled and re-spooled. Orientation is very important, due to the effects of gravity.
I had looked into tapes for my home backups but felt they are not really practical due to those issues. You really need to cycle them often to avoid bit rot, and each time you do cycle them, you reduce their life; they are only good for a couple of thousand spools. I suppose if they were kept in a Faraday cage at like -100 °C they might last longer. But print-through might still be an issue.
While the hardware question is a good one - you have to admire the software design that manages it all. This is one place where I imagine scalability is challenged on a daily basis.
... But print-through might still be an issue.
This may actually be a good school/science experiment for the space station!! They're always looking for different ideas. Send up 20 tapes, with 20 control tapes on Earth, then bring one tape
down every year or so and test it against the reference ones. Any kids listening? :-)
I have it in my head that Google use commodity drives for their storage, the kind of thing you'd find in a desktop PC but that's based on a memory of a hardware reliability study they published a few years ago.
They also used to run their own blend of server, which amounted to little more than a commodity board on a folded metal tray. I think they'd worked out, for the number of cores and terabytes needed, that by having massive redundancy in their server and disk arrays and using consumer-grade parts (albeit mid to high end ones) instead of dedicated server hardware, they would save a *lot* of money at the cost of a shorter lifecycle. (I do remember the reliability difference between enterprise and commodity drives being surprisingly small once early mortalities were weeded out, provided they were all treated well.)
It may of course have all changed now and they've decided that racks crammed full of flash chips are much better.
I think this is the study or a close relative:
http://matrix.lt/download/google-hdd-disk-failures-analysis.white_paper.pdf
Yeah, the slight performance difference between "enterprise" and consumer drives is probably too negligible to justify the increased cost. An individual hard drive is also a very tiny unit as far as Google is concerned; heck, even for me. I don't store data on single hard drives, I store it on multiple at once, a.k.a. RAID. I imagine Google has their own custom RAID-like system that can span a crazy number of drives, with a crazy amount of performance and data-safety tweaks.
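For a sense of the trade-off such a custom system would be making, here is a minimal sketch comparing the usable capacity of plain replication versus a RAID-6-style parity layout. The drive counts are made-up examples; nothing here claims to describe Google's actual scheme.

```python
# Sketch: usable fraction of raw capacity under two redundancy schemes.
# Drive counts below are illustrative assumptions, not real configurations.

def usable_fraction_replication(copies: int) -> float:
    """With N full copies of every byte, only 1/N of raw capacity is usable."""
    return 1 / copies

def usable_fraction_parity(data_drives: int, parity_drives: int) -> float:
    """RAID-6-like stripe: only the data drives count toward usable space."""
    return data_drives / (data_drives + parity_drives)

print(f"3x replication: {usable_fraction_replication(3):.0%} usable")
print(f"10 data + 2 parity: {usable_fraction_parity(10, 2):.1%} usable")
```

Parity schemes keep far more of the raw capacity usable, which is one reason large operators tend to favour them over full replication for bulk storage.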
Due to the high volume of writes, flash is probably also not viable, as flash has write limitations, so I imagine they'll always stay on spinning disks, maybe also RAM.
how does google not run out of space
They add more, and more, and more, all the time.
A common commercial SAN uses 3U disk trays that hold 15 drives. There are now 12TB enterprise class drives. That's 180TB for every 3U, and you get 12 disk trays plus the 4U storage controller in a rack (with a couple of U left for the switch). 2 PB per rack.
And there are other brands and models of storage systems that have a much greater density.
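The per-rack arithmetic above can be checked directly; this just multiplies out the numbers from that post (15 drives per 3U tray, 12 TB drives, 12 trays per rack).

```python
# Checking the per-rack storage arithmetic from the post above.
DRIVES_PER_TRAY = 15   # drives in one 3U tray
DRIVE_TB = 12          # 12 TB enterprise-class drives
TRAYS_PER_RACK = 12    # trays alongside the 4U controller and switch

tb_per_tray = DRIVES_PER_TRAY * DRIVE_TB     # raw TB per 3U tray
tb_per_rack = tb_per_tray * TRAYS_PER_RACK   # raw TB per rack

print(f"{tb_per_tray} TB per tray, {tb_per_rack / 1000:.2f} PB per rack (raw)")
```

That gives 2.16 PB raw, consistent with the "2 PB per rack" figure once you allow for rounding (and before any RAID overhead).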
If you're looking for 'do it yourself' storage solutions, these are two pretty good places to start. I've used numerous 6048Rs over the last few years, and they're pretty damn rugged and easy to deal with. 36 drive version is pretty annoying if the rack isn't well organized to get those back drives out, but the 24drive version is really nice and simple.
https://www.supermicro.com/products/system/4U/6048/SSG-6048R-E1CR36N.cfm
Backblaze sells silly-cheap backup (and block storage), and they open-sourced their own storage pod design; details are here. The problem is that it's impossible to replace drives one by one when they break. There's lots of history about their pods on their website, and here's the most recent:
https://www.backblaze.com/blog/open-source-data-storage-server/
Having spent some time working on a Google for Work rollout last year...
Google will store your data in at least 3 and up to a maximum of 5 of their datacenters. You will access data from the datacenter you are nearest to. I assume it is similar for all their services... So take the 1 TB/sec and it becomes 3 TB/sec! Very few can compete with them; perhaps Amazon and Microsoft, but no one else could afford the infrastructure.
Due to the high volume of writes flash is probably also not viable as flash has write limitations so I imagine they'll always stay on spinning disks
That would assume they actually delete things. Flash write limitations only come into play when you write, delete, write, delete, over and over again, hundreds of TB per drive. If all you're doing is archiving, that's one write cycle, and then it just sits there forever, occasionally being read from.
I doubt Google ever deletes anything.
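To put the write-endurance point in numbers: flash drives are typically rated in total terabytes written (TBW) before wear-out. The capacity and TBW figures below are illustrative assumptions, but they show why a write-once archive barely dents the budget.

```python
# Sketch: flash write endurance vs a write-once archive workload.
# Capacity and TBW rating are illustrative assumptions.

DRIVE_TB = 4          # assumed flash drive capacity
TBW_RATING = 3_000    # assumed endurance: 3000 TB of lifetime writes

full_overwrites = TBW_RATING / DRIVE_TB   # how many times the whole drive
                                          # could be rewritten before wear-out
print(f"A {DRIVE_TB} TB drive rated for {TBW_RATING} TBW survives "
      f"~{full_overwrites:.0f} full overwrites; an archive uses just 1.")
```

So for archival data the endurance limit is essentially irrelevant; it only matters for workloads that churn the same cells over and over.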
However, would it actually require that much writing? Deleting is not usually the same as erasing. Isn't the data just marked as over-writable? I suppose that might depend on the system, though.
A common commercial SAN uses 3U disk trays that hold 15 drives. There are now 12TB enterprise class drives. That's 180TB for every 3U, and you get 12 disk trays plus the 4U storage controller in a rack (with a couple of U left for the switch). 2 PB per rack.
And there are other brands and models of storage systems that have a much greater density.
What do you mean by "a common commercial SAN uses 3U ..."?
SAN = storage area network. It consists of SAN switches/directors, storage arrays, and hosts, all connected through Fibre Channel (SCSI commands encapsulated into the FC protocol, transported over fiber)...
The SAN directors we use have hundreds of 8 Gbit/s FC ports, the storage arrays we use fill a full cabinet/rack (or two) with hundreds of spindles (disks), and the hosts are connected through FC host bus adapters with at least 2 ports. A common commercial SAN has 2 fabrics (each fabric is a separate FC network) for redundancy, so you have at least 2 switches/directors, and both storage arrays and hosts are connected to both of them (half of the available FC ports per fabric)...
So a common commercial SAN is definitely not a 3U disk array.
A common commercial SAN uses 3U disk trays that hold 15 drives. There are now 12TB enterprise class drives. That's 180TB for every 3U, and you get 12 disk trays plus the 4U storage controller in a rack (with a couple of U left for the switch). 2 PB per rack.
And there are other brands and models of storage systems that have a much greater density.
What do you mean by "a common commercial SAN uses 3U ..."?
SAN = storage area network. It consists of SAN switches/directors, storage arrays, and hosts, all connected through Fibre Channel (SCSI commands encapsulated into the FC protocol, transported over fiber)...
The SAN directors we use have hundreds of 8 Gbit/s FC ports, the storage arrays we use fill a full cabinet/rack (or two) with hundreds of spindles (disks), and the hosts are connected through FC host bus adapters with at least 2 ports. A common commercial SAN has 2 fabrics (each fabric is a separate FC network) for redundancy, so you have at least 2 switches/directors, and both storage arrays and hosts are connected to both of them (half of the available FC ports per fabric)...
So a common commercial SAN is definitely not a 3U disk array.
@rrinker specifically said uses 3U disk trays. IME, that's a very common (and perhaps the most common) form factor for the disk trays.
He was clearly trying to Fermi estimate the amount of storage per rack footprint.
My point is that a disk array is not a SAN; it's much more complex than that. So a SAN is not using disk trays. The storage array connected to a SAN uses disk trays with disks, and some of them might be using 3U trays.
That's exactly what he said in the rest of the sentence: that the SAN would be the rack with multiple trays, the controller, and switches...
Deleting is not usually the same as erasing. Isn't the data just marked as over-writable?
Yes, but the whole point of deleting is to make room to write something there again... so it will eventually lead to a write cycle.
I have it in my head that Google use commodity drives for their storage, the kind of thing you'd find in a desktop PC but that's based on a memory of a hardware reliability study they published a few years ago.
...
I think this is the study or a close relative:
http://matrix.lt/download/google-hdd-disk-failures-analysis.white_paper.pdf
Yes, that's the one that came to my mind as well, as I read the OP.