At the end of 2016, we integrated Ceph into Punchplatform. This article shares our feedback and explains the major advantages of using a Ceph cluster in a Punchplatform instead of standard storage solutions.
User features
Aside from reducing storage costs through well-known erasure coding, Ceph in Punchplatform offers some major user features you should know about.
A lot of APIs
Ceph is an object-storage technology: users put and get flat objects into and from a cluster instead of storing files in a hierarchical file system.
It can also expose:
- a POSIX file system, CephFS
- an S3 gateway, meaning Ceph is compatible with the standard S3 API
- Hive, to query data from a cluster
In a word, you have many ways to write, read or query data in a Ceph cluster. And good news: all of this is available as-is in Punchplatform.
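For instance, because the S3 gateway speaks the standard S3 API, any stock S3 client works. Here is a minimal sketch using the boto3 Python library; the endpoint, credentials, bucket and object key are hypothetical placeholders.

    import boto3

    # Connect to the Ceph S3 gateway through the standard S3 API.
    # Endpoint and credentials are hypothetical placeholders.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://rgw.example.com:7480",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # Put and get a flat object, exactly as you would on AWS S3.
    s3.put_object(Bucket="apache-logs", Key="2018/07/01/batch-0001.gz", Body=b"...")
    obj = s3.get_object(Bucket="apache-logs", Key="2018/07/01/batch-0001.gz")
    print(obj["Body"].read())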
Move it, move it
Ceph scales massively: a single cluster may store several petabytes. To extract data, re-encrypt it, or enforce a data life-cycle strategy, you may want to move (and process) your data. Given such large volumes, you need resilient and performant ways to do it. Punchplatform offers Storm or Spark jobs for this, all driven by descriptive configuration.
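As a rough illustration, here is a minimal sketch, using the python-rados bindings, of what moving one object between pools amounts to at the RADOS level. Pool and object names are hypothetical, and this is not the Punchplatform mechanism itself, which runs resilient Storm or Spark jobs at scale.

    import rados

    # Hypothetical sketch: move one object between two pools.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        src = cluster.open_ioctx("hot-pool")      # hypothetical pool names
        dst = cluster.open_ioctx("archive-pool")
        name = "apache-logs-2018-07-01-batch-0001"
        data = src.read(name, length=src.stat(name)[0])  # stat() returns (size, mtime)
        dst.write_full(name, data)                # atomic full-object write
        src.remove_object(name)                   # delete only after the copy succeeds
        src.close()
        dst.close()
    finally:
        cluster.shutdown()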
Watch your data
When you use Punchplatform tools to write data into a Ceph cluster, the software also writes an associated summary into Elasticsearch. This is powerful: it allows an operator to watch data through these summaries, compute statistics, look up records, or graph data sets over time.
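To make this concrete, here is a hypothetical sketch of the kind of summary that could be indexed alongside each object written to Ceph, using the official elasticsearch Python client. The index name and fields are assumptions for illustration, not the actual Punchplatform schema.

    from elasticsearch import Elasticsearch

    # Index one summary document per object written to Ceph.
    # Index and field names below are illustrative assumptions.
    es = Elasticsearch("http://localhost:9200")
    es.index(
        index="ceph-objects-summaries",
        document={
            "pool": "my-client-name",
            "topic": "apache-logs",
            "object_name": "apache-logs-2018-07-01-batch-0001",
            "from": "2018-07-01T00:00:00+01:00",
            "to": "2018-07-01T00:05:00+01:00",
            "event_count": 125000,
            "size_bytes": 8388608,
        },
    )

With such documents in place, a dashboard can graph volumes per topic over time without ever touching the Ceph cluster itself.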
Use-cases
As mentioned, Punchplatform gives you both data insertion and extraction capabilities. Let’s walk through some use cases.
Extract a little from a lot
Suppose you have a continuous stream of events, let’s say security logs, arriving at a high rate from a large number of servers into a Punchplatform. After a year, you have 500 TB stored on a Ceph cluster.
Now suppose you suspect an attack occurred in the middle of the year, in July 2018, and would like to extract the Apache logs from this period for an audit. Here is how it works.
Punchplatform offers several tools to perform an extraction:
- a command-line interface for very small sets of data
- a descriptive file to configure a resilient extraction task
To extract one month of data, you will probably need a resilient extraction task (extracting large volumes may take a while). Its configuration file will look like this:
{ "cluster": "ceph:main-cluster", "pool": "my-client-name", "topic": "apache-logs", "from": "2018-07-01T00:00:00+01:00", "to": "2018-07-31T23:59:59+01:00", "elasticsearch_cluster": "es-main-cluster" }
As you can see, in addition to the Ceph parameters and the time scope, you need to specify the Elasticsearch cluster that stores the data index (see the previous section).
That’s all: your task is configured and ready to be launched. You can then write the extracted data to flat files or into an Elasticsearch cluster to run advanced queries.
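Under the hood, such an extraction boils down to two steps: query Elasticsearch for the object summaries overlapping the time scope, then fetch those objects from Ceph. Here is a sketch of that logic in Python; the index name, field names and pool name are assumptions carried over from the earlier sketch, not the actual Punchplatform internals.

    from elasticsearch import Elasticsearch
    import rados

    # Step 1: find the summaries whose time range overlaps the time scope.
    es = Elasticsearch("http://localhost:9200")
    hits = es.search(
        index="ceph-objects-summaries",
        query={
            "bool": {
                "filter": [
                    {"term": {"topic": "apache-logs"}},
                    {"range": {"from": {"lte": "2018-07-31T23:59:59+01:00"}}},
                    {"range": {"to": {"gte": "2018-07-01T00:00:00+01:00"}}},
                ]
            }
        },
    )["hits"]["hits"]

    # Step 2: fetch the matching objects from the Ceph pool.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("my-client-name")
    for hit in hits:
        name = hit["_source"]["object_name"]
        data = ioctx.read(name, length=ioctx.stat(name)[0])
        # ... write data to a flat file or re-index it into Elasticsearch
    ioctx.close()
    cluster.shutdown()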
Store static resources
TODO
A touch of design
Integrating Ceph into Punchplatform demanded a lot from us. There were several design questions to resolve, among others:
- avoid writing small objects (it drastically reduces throughput): instead, write a whole batch of data into a single object (see the sketch at the end of this section)
- avoid ls operations (listing a large volume of objects may take a very long time): we developed an indexing solution based on Elasticsearch to retrieve data
- ensure the transactionality of batch construction and writes
- develop and integrate a solution to monitor a Ceph cluster: we built a lightweight piece of software written in Go, following the Elasticsearch Beats pattern
- write Ansible tasks to deploy a Ceph cluster offline
These subjects involve unique IDs, timestamps, concurrency and idempotency patterns; this is no piece of cake.
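As promised above, here is a minimal sketch of the batching and idempotency pattern, using the python-rados bindings. The names and the ID scheme are illustrative assumptions, not the actual Punchplatform implementation.

    import hashlib
    import rados

    # Records are accumulated into one large object, and the object name is
    # derived deterministically from the batch content, so a write replayed
    # after a crash overwrites the same object instead of duplicating it.
    def write_batch(ioctx, topic, records):
        payload = b"\n".join(records)
        batch_id = hashlib.sha1(payload).hexdigest()  # deterministic unique ID
        name = "%s-%s" % (topic, batch_id)
        ioctx.write_full(name, payload)               # atomic full-object write
        return name

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("my-client-name")      # hypothetical pool name
    write_batch(ioctx, "apache-logs", [b"log line 1", b"log line 2"])
    ioctx.close()
    cluster.shutdown()

Deriving the object name from the batch content is what makes the write idempotent: write_full is an atomic full-object replacement, so no duplicate or partial batch survives a replay.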
Conclusion
To sum up: forget about NAS, SAN and NFS. Deploy a Punchplatform on standard servers and have it all. It works the same way on small platforms and on very large ones.