Really interesting post on how Facebook is tackling the storage demands of its enormous data warehouse by improving compression efficiency. More here.
By applying all these improvements, we evolved ORCFile to provide a significant boost in compression ratios over RCFile on our warehouse data, going from 5x to 8x. Additionally, on a large representative set of queries and data from our warehouse, we found that the Facebook ORCFile writer is 3x better on average than open source ORCFile.
We have rolled out this new storage format to many tens of petabytes of warehouse data at Facebook and have reclaimed tens of petabytes of capacity by switching from RCFile to Facebook ORCFile as the storage format. We are in the process of rolling out the format to additional tables in our data warehouse, so we can take further advantage of the improved storage efficiency and read/write performance. We have made our storage format available on GitHub and are working with the open source community to incorporate these improvements back into the Apache Hive project.
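To put the 5x to 8x figure in perspective, here is a rough back-of-the-envelope sketch of what that change means for space on disk. The function names and the 100 PB input are purely illustrative assumptions, not numbers or code from the post; only the two compression ratios come from the text above.

```python
def stored_size(raw_size: float, compression_ratio: float) -> float:
    """Size on disk for a given uncompressed size and compression ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression ratio must be positive")
    return raw_size / compression_ratio


def footprint_reduction(old_ratio: float, new_ratio: float) -> float:
    """Fraction of on-disk space saved when moving from old_ratio to new_ratio."""
    return 1.0 - old_ratio / new_ratio


if __name__ == "__main__":
    raw_pb = 100.0  # hypothetical uncompressed volume, NOT a figure from the post
    before = stored_size(raw_pb, 5.0)  # RCFile at roughly 5x
    after = stored_size(raw_pb, 8.0)   # Facebook ORCFile at roughly 8x
    print(f"{before:.1f} PB -> {after:.1f} PB")
    print(f"reduction: {footprint_reduction(5.0, 8.0):.1%}")
```

In other words, going from 5x to 8x shrinks the on-disk footprint of the same data by 37.5%, which is how a change in ratio that sounds modest can add up to tens of petabytes at warehouse scale.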