\documentclass[conference]{IEEEtran} \pdfpagewidth=8.5in
\pdfpageheight=11in
\usepackage{subfigure}
\usepackage[pdftex]{}
\usepackage{graphicx}
\usepackage{hyperref}
\addtolength{\parskip}{-0.08in}
% in order for balance columns to work, it has to be before begin document...
%\balancecolumns
\IEEEoverridecommandlockouts
\IEEEpubid{\makebox[\columnwidth]{978-1-4673-8815-3/16/\$31.00 ©2016 IEEE \hfill} \hspace{\columnsep}\makebox[\columnwidth]{ }}
\begin{document}
\title{DAOS and Friends: A Proposal for\\an Exascale Storage System}
\author{
\IEEEauthorblockN{Jay Lofstead\IEEEauthorrefmark{1}, Ivo Jimenez\IEEEauthorrefmark{2}, Carlos Maltzahn\IEEEauthorrefmark{2}, Quincey Koziol\IEEEauthorrefmark{3}, John Bent\IEEEauthorrefmark{4}, Eric Barton\IEEEauthorrefmark{5}}
\IEEEauthorblockA{\IEEEauthorrefmark{1}Sandia National Laboratories [email protected]}
\IEEEauthorblockA{\IEEEauthorrefmark{2}University of California, Santa Cruz [email protected], [email protected]}
\IEEEauthorblockA{\IEEEauthorrefmark{3}Lawrence Berkeley National Laboratory [email protected]}
\IEEEauthorblockA{\IEEEauthorrefmark{4}Seagate Government Solutions [email protected]}
\IEEEauthorblockA{\IEEEauthorrefmark{5}Intel [email protected]}
}
\maketitle
\begin{abstract}
The DOE Extreme-Scale Technology Acceleration Fast Forward Storage and IO Stack
project is going to have significant impact on storage systems design within
and beyond the HPC community. With phase two of the project starting, it is an
excellent opportunity to explore the complete design and how it will address
the needs of extreme scale platforms. This paper examines each layer of the
proposed stack in some detail along with cross-cutting topics, such as
transactions and metadata management.
This paper not only provides a timely summary of important aspects of the
design specifications but also captures the underlying reasoning that is not
available elsewhere. We encourage the broader community to understand the
design, intent, and future directions to foster discussion guiding phase two
and the ultimate production storage stack based on this work. An initial
performance evaluation of the early prototype implementation is also provided
to validate the presented design.
\end{abstract}
%\category{D.4}{Software}{Operating Systems}
%\category{D.4.7}{Operating Systems}{Organization and Design}[hierarchical design]
%\terms{Design, Performance}
\section{Introduction}
Current production HPC IO stack design is unlikely to offer sufficient features
and performance to adequately serve extreme scale science platform
requirements. While new hardware, such as non-volatile memory, will help, we
still need a new software stack to incorporate this new hardware as well as
address the extreme parallelism and performance requirements demanded by
exascale applications. Adding to the problem complexity is the variety of Big
Data problems users want to address using these platforms. Unlike the
centralized storage arrays favored for HPC platforms, big data analytics
systems have grown up using storage distributed across all of the nodes, driving a
very different software architecture. With post-exascale platforms required to
address both workloads, a new storage stack is required.
A joint effort between the US Department of
Energy's Office of Advanced Simulation and Computing and Advanced Scientific
Computing Research commissioned a project to develop a design and prototype for
an IO stack suitable for the extreme scale environment. It will be referred to
as the Fast Forward Storage and IO (FFSIO) project. This is a joint effort led
by Lawrence Livermore National Laboratory, with the DOE Data Management Nexus
leads Rob Ross and Gary Grider as coordinators and contract lead Mark Gary. The
participating labs are LLNL, SNL, LANL, ORNL, PNL, LBNL, and ANL. Additional
industrial partners contracted include the Intel Lustre team, EMC, DDN, and the
HDF Group. This team has developed a specification
set~\cite{fastforward:2014:docs} for a future IO stack to address the
identified challenges. The first phase completed in 2014 with a second phase
underway. The first phase focused primarily on basic
functionality and design. While an idealized potential system would be the
perfect target architecture, the reality of budgets has tempered many of the
decisions. For example, extensive availability of NVRAM or SSDs on all of the
compute nodes is currently not economically feasible, limiting some of the
potential design choices. With this in mind, the second phase is
incorporating fault recovery and other missing features.
The complete design seeks to offer high availability, byte-granular,
multi-version concurrency control. Multiple versions of an object are stored
efficiently by using a copy-on-write mechanism. By assuming the client
interface will be through an IO library, a more complicated interface offering
richer functionality can be incorporated while requiring only minimal end-user
code changes. Managing most data access in a platform-local layer rather than
requiring writing to centralized storage will better support the performance
and energy requirements of extreme scale application compositions.
Overall, the architecture shifts from the idea of files and directories to
containers of objects. This shift avoids the bottlenecks related to the POSIX
files and directories structure such as file creation serialization, the
file count in a directory limitation and impact, and the limited
semantics of a byte stream. Instead, the new interface focuses on high-level
data models and their properties and relationships. This concept permeates the
entire IO stack.
In addition to addressing the traditional scientific workload, this project
seeks to expand functionality to better support Big Data type applications. The
key idea is to support Arbitrary Connected Graphs (ACGs) such as those used in
Map-Reduce systems. Key system features are introduced to efficiently support
these computing models in addition to the typically bursty IO loads of more
traditional HPC applications. These features are not discussed for space
reasons.
\begin{figure}[htbp]
%\vspace{-0.10in}
\centering
\includegraphics[width=\columnwidth]{images/arch-mapping}
%\vspace{-0.15in}
\caption{Target Architecture and Component Mapping}
\label{fig:arch-mapping}
%\vspace{-0.15in}
\end{figure}
The IO stack layers each contribute
different functionality. The architecture (Figure~\ref{fig:arch-mapping})
incorporates five layers, some of which have potentially optional components.
The top layer generally comprises a high-level IO library, such as the
demonstration HDF5 library~\cite{folk:2011:hdf5}, and a more complex API for accessing the
lower level components. This layer is in dark blue. Because the system supports
more complex architectures and richer functionality, the intent is to hide this
complexity behind a user-friendly API. It is possible to
access the storage stack through the more complex API, but the additional
requirements beyond standard POSIX calls will prompt most users to use an IO
library. This layer incorporates the necessary features for ACGs from an
end-user's perspective.
Below the user API is an IO forwarding layer that redirects IO calls from the
compute nodes to the IO dispatching layer (in black). This IO forwarding layer
is analogous to the function of the IO nodes in a BlueGene machine or the
passive data staging processes demonstrated
previously~\cite{nisar:2008:staging,Abbasi:2009:datatap}. One special function
of note for this layer is that it is where function shipping will be deployed.
This is discussed in Section~\ref{sec:end-user}. The next two layers have
considerable functionality.
The IO Dispatcher (IOD) serves as the primary storage interface for the IO
stack (in green) and offers features like Burst Buffers to insulate the
persistent storage array from bursty IO workloads. Ideally, the IOD layer's
functionality can be optional based on available hardware and compute power
provided on the IO Nodes (IONs). Transactions are handled primarily at this
layer. When the IOD layer is not deployed, much of the functionality offered
here would shift either up or down the stack, as discussed in detail below.
The Distributed Application Object Storage (DAOS) layer serves as the
persistent storage interface and translation layer between the user-visible
object model and the requirements of the underlying storage infrastructure.
Transactions work a bit differently at this layer and are called epochs to
distinguish them. DAOS is intended to be the traditional file system-like
foundation on which everything else is built with no dependence on any
technologies specified above it (in dark pink and yellow). For example, the IOD
layer with or without burst buffers is not required for DAOS to operate
properly. Instead, the DAOS layer can handle all of the IO operations from the
user API layer, albeit with the potential performance and usability penalty of
manipulating the shared, persistent storage array directly.
At the bottom is the Versioning Object Storage Device (VOSD) (in purple). It
serves as the interface for storing objects of all types efficiently for each
storage device in the parallel storage array. Think of this layer as the
physical disk interface layer. In terms of Lustre, this would replace the API
on individual storage devices with an interface friendlier to the containers of
objects and transactions/epochs concepts used in the higher layers.
This paper presents an analysis of the published design documents along with a
discussion of the design philosophy representing the overall intent. The design
philosophy conveys information that may or may not have been written down, but
aids understanding of the total design. This information was gathered through
interviews with the core FFSIO team members. These ideas are presented to
reveal the project future rather than dwelling on any limitations of the
published designs. This is most important to illustrate how different concepts
will work across layers since that information is spread across multiple
documents and may lack a cohesive overall view. Previously, a
poster~\cite{lofstead:2014:ffsio-poster} and a more detailed evaluation of
consistency and fault tolerance~\cite{lofstead:2014:ffsio-consistency} have
been published.
With DAOS and friends being groomed as the next generation for the Lustre
parallel file system, the information and analysis presented here can help
users determine how to adapt their thinking about storage as well as hopefully
influence the second phase.
The key contributions of this paper are the synthesized architectural overview
and the discussion and analysis motivating the architectural decisions. With
this information, community members can be better informed about an important
future storage system as well as potentially influence the evolution of the
design from prototype into production system.
The rest of the paper is organized as follows. An overview of related work is
presented first in Section~\ref{sec:related}. Section~\ref{sec:end-user}
discusses the programmatic interface end users will see when interacting with
the storage array. This will be discussed in the context of the HDF5 based
example library used for the functionality demonstration.
Section~\ref{sec:iof} briefly discusses the motivation and proposal for the IO
forwarding layer. Section~\ref{sec:iod} describes the IO Dispatcher layer and
the broad functionality it offers. This will detail the pieces of the layer
that are potentially optional and mention the cross-cutting features discussed
in a later, cross-cutting section. Section~\ref{sec:daos} discusses how the
DAOS layer functions. As with the IOD layer, the cross-cutting features will
be mentioned, but discussed more fully in the cross-cutting section. The VOSD
layer is discussed in Section~\ref{sec:vosd}. In particular, the mapping
between the DAOS and VOSD layers is explored as it pertains to the physical
storage. Next is an exploration of cross-cutting features like transactions
and metadata management in Section~\ref{sec:broader}. Since these and other
features are spread across multiple layers, it makes more sense to discuss them
independently once an understanding of the overall structure has been
presented. A demonstration of the functionality is presented in
Section~\ref{sec:evaluation}. This shows that the prototype system based on
the proposed design can function. Section~\ref{sec:conclusion} concludes the
paper with a summary of the broad issues.
\section{Related Work}
\label{sec:related}
Many projects over the last couple of decades have sought to address some
challenging aspect of parallel file system design. The recent rise of ``Big
Data'' applications with different characteristic IO patterns has somewhat
complicated the picture. Vendors are shifting products to address the far
larger market, forcing HPC systems to adapt to these different storage
approaches. Extreme scale machines will be expected to handle both the
traditional simulation-related workloads as well as applications more
squarely in the Big Data arena. This will require some adjustments to the
underlying system for good performance for both scenarios.
The major previous work is really limited to full file systems rather than the
mountain of file system refinements made over the years. A selection of these
other file systems and some features that make each relatively unique are
described below.
Lustre~\cite{braam:lustre-arch} is the de facto standard on most major clusters
offering scalable performance and fine-grained end-user and programmatic
control over how data is placed in the storage system. The broad community
support has led to a solid code base with sufficient optimizations to serve as
the low-cost, proven solution. For each installation, system-wide settings that
apply to all files on the file system are made. The end user, should they have
different needs, can reconfigure these characteristics on a file-by-file basis.
This becomes an issue because the dominant file size is tiny. In many cases, it
can be $<$ 4 KB. To keep from slowing the overall system performance when
creating and opening these files, most systems are configured to use a 1 MB
stripe size and a stripe count of 4 meaning only 4 storage targets are used per
file. This limits the default aggregate bandwidth to the combined speed of four
storage targets. By reconfiguring on a file-by-file basis, this default can be
overcome for large, parallel files, achieving very high performance. The
downside is that this setting must be made explicitly to take advantage of the full
parallel file system performance.
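
As a rough illustration of the default just described (a sketch that assumes
simple round-robin striping; the symbols $o$, $s$, $c$, $b_{\mathrm{OST}}$, and
$B_{\mathrm{file}}$ are introduced here for illustration and are not taken from
the Lustre documentation), a byte at offset $o$ in a file with stripe size $s$
and stripe count $c$ lands on stripe target
\[
t(o) = \left\lfloor \frac{o}{s} \right\rfloor \bmod c ,
\]
so with $s = 1$~MB and $c = 4$ only targets $0$ through $3$ of the file are
ever touched, and a file smaller than $s$ resides entirely on target $0$. The
bandwidth a single file can reach is then bounded by
\[
B_{\mathrm{file}} \le c \cdot b_{\mathrm{OST}} = 4\, b_{\mathrm{OST}} ,
\]
where $b_{\mathrm{OST}}$ is the bandwidth of one storage target. Raising $c$
for a large shared file raises this bound accordingly.
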
Ceph~\cite{weil:ceph} is a distributed object store and file system. It offers
both a POSIX and object interface including features typically found in
parallel file systems. Ceph's unique striping approach uses pseudo-random
numbers with a known seed eliminating the need for the metadata service to
track where each piece in a striped file is placed. Ceph's strengths are in
providing good performance and scalability with the ability to handle failures
and to deploy new storage, adapting the system in a live environment. However,
this failure handling advantage was shown~\cite{wang:2013:ceph} to limit peak
performance more than other systems like Lustre. More recently, some of these
limitations have begun to be addressed by the Ceph team, but no new evaluations
have been performed to determine if these changes close the gap sufficiently to
address the extreme HPC performance needs.
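
The metadata-free lookup described above can be sketched with a deliberately
simplified placement function. This is not Ceph's actual placement algorithm;
the hash $h$, seed $\sigma$, object identifier $x$, stripe index $i$, and
target count $N$ are symbols assumed here for illustration only:
\[
t(x, i) = h(\sigma, x, i) \bmod N .
\]
Because $h$ is deterministic and $\sigma$ is known to every client, any client
can recompute $t(x, i)$ locally instead of asking a metadata service where
piece $i$ of object $x$ resides.
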
PVFS~\cite{carns:pvfs} was built understanding the scalability bottlenecks
Lustre suffers. For example, Lustre requires all processes opening a file to
hit the metadata server to receive a proper file handle. PVFS reduces this load
by allowing a single process to open a file and sharing the handle with other
processes participating in the IO operation. There are also other optimizations
that enhance file system performance. It has been commercialized in recent
years as OrangeFS.
GPFS~\cite{schmuck:gpfs} offers a highly scalable parallel file system with
robust functionality to handle parallel storage, recovery, and
optimization. It only supports a hands-off approach for providing good
performance for scaling parallel IO tasks and is used extensively by its owner,
IBM. Unfortunately, the stripe size is fixed, potentially introducing false
sharing, in which two processes independently write non-overlapping regions of
the same stripe, and potentially reducing performance. Beyond these sorts of
fixed parameters, a wide variety of optimizations and features are available
for additional licensing fees.
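
The false sharing condition can be stated compactly. As a sketch, with symbols
assumed here rather than taken from the GPFS documentation, let the stripe
size be $s$ and let two processes write the disjoint byte ranges
$[a_1, e_1)$ and $[a_2, e_2)$. Process $k$ touches the stripe indices
\[
S_k = \left\{ \lfloor a_k / s \rfloor, \ldots, \lfloor (e_k - 1) / s \rfloor \right\},
\]
and false sharing occurs when $S_1 \cap S_2 \neq \emptyset$ even though the
byte ranges do not overlap. For a hypothetical $s = 1$~MB, writes to
$[0, 512\,\mathrm{KB})$ and $[512\,\mathrm{KB}, 1\,\mathrm{MB})$ both give
$S_k = \{0\}$ and so contend for the same stripe, whereas aligning the start
and end of each write to multiples of $s$ makes $S_1$ and $S_2$ disjoint.
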
Panasas~\cite{panasas:architecture} uses a fundamentally different approach to
parallel file system performance. When parallel writers simultaneously write to
a shared file, the system dynamically adapts the number of stripes to maintain
high performance. This adaption is invisible to the user other than seeing that
the system maintains high performance no matter what configuration the workload
exhibits.
This project learns from all of these parallel file system efforts and offers a
scalable approach that can work well for everything from a small cluster to the
largest exascale platforms. By understanding what works well and what the
limitations are for each of the above systems as well as emerging hardware
architectures, this project addresses the limitations while maintaining the
advantages of the above systems.
Other file systems, like GoogleFS~\cite{ghemawat:googlefs} and
HDFS~\cite{Shvachko:2010:hdfs}, address distributed rather than parallel
computing and cannot be compared directly. The primary difference between
distributed and parallel file systems is the ability of the file system to
store and retrieve data simultaneously from multiple clients, in parallel, and
treat the resulting collection of pieces as a single object. Distributed file
systems rely on a single client creating a file typically on a single storage
device. For performance, the file or object may be replicated. The other,
popular distributed file system of note is NFS~\cite{powlowski:1994:nfs3} that
has been used for decades for enterprise file systems. NFS is known to support
a global namespace with data migrating towards users on access and pushed
towards safer storage based on local platform characteristics. These other file
systems are mainly of interest in the context of the ACG features of FFSIO and
will be discussed more in Section~\ref{sec:acg}.
The main alternative from-scratch design for a file system is
Sirocco~\cite{sirocco,curry:2016:sirocco}. Rather than continuing the striped
design of existing parallel file systems, Sirocco is inspired by peer-to-peer
and object-based systems and includes features like transactions to protect
data modification and process independence when writing to avoid coordination
overhead. The base assumptions are that storage is pervasive and volatile.
Storage devices and locations may come and go randomly, reminiscent of the
Google or Ceph assumptions of regularly failing hardware. When data is pushed
into the system, initial resilience characteristics are guaranteed prior to
returning control back to the user. Then, as system pressures dictate, data
will be replicated as demanded by use and/or migrated towards long-term
resilience requirements. Unlike the FFSIO project, Sirocco assumes it is
possible that data may be successfully stored in the system yet be
currently inaccessible because all copies are offline. There is also
some potential difficulty in finding data since it will migrate around the
system. To be fair, Sirocco intends to function as the storage layer for a
higher level file system API making many of the awkward system features
invisible to the end user. Sirocco is in the process of being released
publicly.
\section{End-User API Layer}
\label{sec:end-user}
Since the proposal specifies a high-level IO API will be the primary end-user
interface for programmatically interacting with the FFSIO stack, the team used
the HDF5 API and leveraged its Virtual Object Layer (VOL) for the initial
design and implementation demonstration. This also serves as a good test
for determining which extensions to an existing IO API are strictly necessary to
support the new functionality. The additional functionality, such as
transactions, can be ignored for legacy implementations, but these applications
will not be able to take advantage of the asynchronous IO support inherent to
the new API. The additions comprise (Figure~\ref{fig:vol-arch}):
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
%\itemsep1pt\parskip0pt\parsep0pt
\item
API extensions to support new functionality provided by the FFSIO project.
This includes calls for managing asynchronous request lists, performing
asynchronous operations, creating and managing transactions, and ensuring
end-to-end data integrity, as well as data types and functions to support the
big data oriented functionality more efficiently than the current API.
\item
Function shipping from Compute Nodes (CN) to IO Nodes (ION). This provides
the application developer with the capability of sending computation down to
the IONs to get back results and to perform other operations such as indexing
and data reorganization for more efficient retrieval.
\item
Analysis Shipping from compute nodes to IO Nodes or DAOS nodes. This is
similar to function shipping, but instead of returning the result over the
network, it is stored on the nodes and pointers to the data are returned.
\end{enumerate}
Function and Analysis Shipping are part of the cross-cutting features and are
discussed in Section~\ref{sec:fn-shipping}.
HDF5~\cite{folk:2011:hdf5} has a versatile data model offering complex data objects
and metadata. Its information set is a collection of datasets, groups,
datatypes and metadata objects. The data model defines mechanisms for
creating associations between various information items. The main
conceptual components for data stored in HDF5 are described below.
\begin{itemize}
\item
\textbf{File}: In the HDF5 data model, the collection of data items stored
together is represented by a file. It is an object collection that also
describes the relationships among those objects. Every file begins with a root
group ``/'' serving as the ``starting-point'' in the object hierarchy.
\item
\textbf{Group}: A group is an object allowing association between HDF5
objects. It is analogous to a directory in a file system. A group
can contain multiple other groups, datasets, datatypes, or attributes within
it. Groups are named and then accessed using a standard path notation
similar to Linux with a ``/'' separating each group name in the hierarchy
from the root to the nested group of interest.
\item
\textbf{Dataset}: HDF5 datasets are objects representing actual data
or content. Datasets are typically arrays with potentially multiple
dimensions. Other types, such as strings and scalars, are also possible. A
dataset is characterized by a dataspace and a datatype. The dataspace
captures the rank (number of dimensions) and the current and maximum
extent in each dimension. The datatype describes the type of its data
elements. By default, the entire data set is stored as a single chunk
reassembled from all processes. It is possible to use a uniform chunking
format where data is stored in fixed sized chunks instead.
\item
\textbf{Attribute}: Attributes are used for annotating HDF5 objects. They are
datasets themselves and are attached to existing objects.
\end{itemize}
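
The dataspace and chunking description above implies some simple arithmetic.
As a sketch using symbols introduced here for illustration, a dataset of rank
$r$ with current extents $d_1, \ldots, d_r$, maximum extents
$m_1, \ldots, m_r$ (with $d_i \le m_i$), and a datatype of $w$ bytes per
element holds
\[
n = \prod_{i=1}^{r} d_i
\]
elements, occupying $n \cdot w$ bytes of raw data. With uniform chunks of
extent $k_i$ in dimension $i$, the dataset is instead stored as
\[
\prod_{i=1}^{r} \left\lceil \frac{d_i}{k_i} \right\rceil
\]
chunks. For a hypothetical $1000 \times 1000$ array of 8-byte values,
$n = 10^6$ elements and $8 \times 10^6$ bytes; with $100 \times 250$ chunks
this becomes $10 \times 4 = 40$ chunks.
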
\subsection{Virtual Object Layer}
\label{virtual-object-layer}
The Virtual Object Layer is an abstraction mechanism internal to the HDF5
library~\cite{folk:2011:hdf5}. As shown in Figure~\ref{fig:vol-arch}, it is implemented
just below the public API. The VOL exports an interface that allows writing
plugins for HDF5 enabling developers to handle data in ways other than writing
to storage in an HDF5 format. Plugin writers provide an implementation for a
set of functions and are trusted to provide the proper semantics for the new
environment. For example, data staging could be implemented in the VOL layer by
replacing writing to disk in the HDF5 format with sending data to a data staging
area using some messaging mechanism.
For this project, rather than the default writing to disk in the HDF5 format,
the VOL is used to interact with the IOD layer and the different concepts it
offers without requiring all of the functionality be exposed to users. For
example, the containers and objects concept is mapped to the files and datasets
existing HDF5 users are familiar with. This reduces the difficulty porting
applications to the new IO stack.
\begin{figure}[htbp]
%\vspace{-0.10in}
\centering
\includegraphics[width=\columnwidth]{images/vol-arch.png}
%\vspace{-0.20in}
\caption{Architectural view of the VOL abstraction mechanism}
\label{fig:vol-arch}
%\vspace{-0.10in}
\end{figure}
The IOD VOL plugin serves as the bridge between HDF5 and the IOD Layer
(Figure~\ref{fig:vol-arch}). The application calls the HDF5 library while
running on the system's compute nodes. Using the VOL architecture, the IOD VOL
plugin uses a function shipper (RPC library) to forward the VOL calls to a
server component running on the IO nodes (IONs). This function shipping is the
IO Forwarding Layer discussed briefly in Section~\ref{sec:iof}. Once the calls
arrive at the IO nodes, they are translated into IO Dispatcher (IOD) API calls
and executed at the IONs.
Since the IOD layer is optional by design, a second VOL plugin is required to
access DAOS directly. This additional complexity was deemed acceptable for the
flexibility it affords. Further, since all end user interactions are intended
to be through an IO API, having two different plugins is a small amount of
extra work to provide a single interface that would operate on widely different
scale deployments (e.g., one too small to have the IOD layer and one at the
other extreme with a large IOD layer interfacing with a large, shared DAOS
layer).
\subsection{HDF5 to FFSIO Mapping}
\label{sec:hdf-to-ffsio}
Since HDF5~\cite{folk:2011:hdf5} offers an interface focused on files and the internal
data types, such as datasets, these concepts must be mapped onto the proposed
FFSIO data storage concepts. This mapping is shown in Table~\ref{tab:mapping}.
\begin{table}[ht]
% \vspace{-0.10in}
\centering
\caption[HDF5 to FFSIO Mapping]{HDF5 Data Model to FFSIO Data Model Mapping}
\bigskip
% \vspace{-0.15in}
\begin{tabular}{|r|r|}
\hline
HDF5 & FFSIO\\
\hline
file & container \\
dataset & array \\
group & key-value store \\
attribute(s) & key-value store \\
\hline
\end{tabular}
\label{tab:mapping}
\end{table}
In Section~\ref{sec:iod}, the FFSIO types are described in more detail.
\section{IO Forwarding Layer}
\label{sec:iof}
The IO Forwarding layer offers a mechanism to reduce the concurrency impact of
the massive process count fan-in onto the storage stack. The current trend of
using both MPI and a node-level threading library like OpenMP, CUDA, or OpenACC
is addressing the same issue, but limited to handling the parallelism on a
single node rather than multiple nodes. Projected extreme scale platforms will
have far fewer storage stack end-points per compute process or even compute
node in which to receive requests and data. By reducing the number of
simultaneous requests, delays can be reduced. This has been demonstrated for
the file open operation with Lustre~\cite{lofstead:2009:adaptable} and to some
degree for accessing the storage devices
themselves~\cite{lofstead:2010:io-variability}. The BlueGene platform
incorporated dedicated hardware to perform this role. The proposed
functionality for this layer, beyond managing the number of connections to the
IOD layer, is to implement function shipping from the compute nodes to the IO
nodes.
For the basic HDF5 calls, this will work the same as how the Nessie
staging~\cite{lofstead:2011:nessie-staging} shifted the collective IO data
rearrangement calls to a reduced number of processes. The prototype
implementation will only support accessing functionality already deployed to
the IO nodes through an RPC mechanism. This initial implementation will use
Mercury~\cite{Soumagne:2013:mercury} to access the remote functionality. For
dynamically defined functions, a different system will be required leveraging
something like C-on-demand~\cite{abbasi:2011:c-on-demand} or some other dynamic
deployment and compilation or an interpreter system.
\section{IO Dispatcher Layer}
\label{sec:iod}
Strictly speaking, the IO Dispatcher layer and included functionality, such as
burst buffers, is optional. All of the functionality, such as function and
analysis shipping, transaction management, and managing asynchronous data
movement can be handled by other portions of the stack. For an extreme scale
platform, the IOD layer will be an essential pressure relief valve for the
underlying persistent storage layer. By making it optional, the proposed stack
can be deployed more easily on smaller clusters or for those with more
constrained budgets. For simplicity, the rest of this section will describe a
full stack including all of the proposed IOD components.
The core idea for IOD is to provide a way to manage IO load that is separate
from the compute nodes and the storage array. Communication intensive
activities, such as data rearrangement, can be moved to the IOD layer reducing
the number of participants and message count. The IOD has three main purposes.
First, the burst buffers work as a fast cache, either absorbing write operations
that then trickle out to the central storage array or pre-staging read operations.
It can also be used to retrieve objects from the central storage array for more
efficient read operations and offers data filtering to make client reads more
efficient. Second, it offers the transaction mechanism for controlling data
set visibility and to manage faults that could expose an incomplete or corrupt
data set to users. These transactions are local to the IOD layer until
persisted to the DAOS layer eliminating the need for burdening the persistent
storage with transient data. Third, data processing operations can be placed
in the IOD. These operations are intended to offer functionality like data
rearrangement and filtering prior to data reaching the central storage array.
While these ideas are not necessarily new, they are new twists on best-of-class
efforts for these technologies. For example, offloading the collective
two-phase data sieving from the compute nodes to reorganize data has proven
effective at reducing the total time for writing data due to fewer participants
involved in the communication patterns~\cite{lofstead:2011:nessie-staging}.
Beyond these broad items, there are many important details, some of which are
examined in more detail below.
\subsection{FFSIO Data Model Types}
\label{sec:data-model}
With the shift from a directories and stream-of-bytes files model to the
container and object model, some description is required to better understand
how these concepts are being used as well as the raw benefits.
\textbf{Container}
As mentioned above, the concept of a container is similar to that of a file in
a traditional file system. However, rather than being in a directory structure,
each container essentially is stored in a hash-space allowing direct access to
any container without regard to the current organizational context of the file
system. For example, there is no need to navigate a directory hierarchy to
name a particular container.
Functionally, a container plays the same role as a file in that it holds a
collection of presumably related data intended to be accessed and manipulated
as a unit. Since this is extended from HDF5 files, the container could also be
viewed as a directory tree of objects where each directory entry specifies
either a sub-directory (group) or some data or attribute. For the IOD layer, it
is a collection of objects. For HDF5, interpreting the objects builds the
structure.
\textbf{Key-Value Store}
This is the base type for the container. Since the container represents
something akin to HDF5 files, everything is stored within a hierarchical
namespace. The root namespace is represented by the base key-value store and
contains a list of all of the objects for this portion of the namespace as well
as additional key-value objects representing sub-groups for the hierarchy. Each
of those key-value store objects works identically. Attributes are stored in a
key-value store object, but use the multi-dimensional array and blob objects to
store the values for the attributes.
\textbf{Multi-Dimensional Array}
By treating arrays as a special case separate from blobs, additional
opportunities are enabled. For example, by knowing that an object is an array,
proper slicing of that array onto IO nodes can be done without involving higher
levels of the IO stack.
\textbf{Blob}
All other data is stored as a stream-of-bytes without regard to the actual
data type.
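
As an illustration only, the relationship among these four types can be
sketched in a few lines of Python. The class and method names
(\texttt{Container}, \texttt{KeyValueStore}, \texttt{Array}, \texttt{Blob},
\texttt{resolve}) are hypothetical and are not part of the IOD API; the sketch
only shows how interpreting nested key-value stores yields an HDF5-like
hierarchy.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple, Union


@dataclass
class Blob:
    data: bytes  # uninterpreted stream of bytes


@dataclass
class Array:
    shape: Tuple[int, ...]  # dimension metadata distinguishes arrays from blobs
    data: bytes


@dataclass
class KeyValueStore:
    # Values are nested stores (groups or attribute sets), arrays, or blobs.
    entries: Dict[str, "Obj"] = field(default_factory=dict)


Obj = Union[KeyValueStore, Array, Blob]


@dataclass
class Container:
    # The root key-value store represents the root of the namespace.
    root: KeyValueStore = field(default_factory=KeyValueStore)

    def resolve(self, path: str) -> Obj:
        """Walk an HDF5-style path such as '/fields/temperature'."""
        node: Obj = self.root
        for name in filter(None, path.split("/")):
            if not isinstance(node, KeyValueStore):
                raise KeyError(f"{name!r} is below a non-group object")
            node = node.entries[name]
        return node


# Hypothetical usage: one group holding one 2x3 array of 8-byte values.
container = Container()
fields = KeyValueStore()
fields.entries["temperature"] = Array(shape=(2, 3), data=bytes(6 * 8))
container.root.entries["fields"] = fields
```

In this sketch the hierarchy exists only in how the key-value stores are
interpreted, mirroring the statement above that the IOD layer sees a collection
of objects while HDF5 builds the structure.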
\subsection{Multi-Dimensional Arrays}
For both IO performance and to aid in analysis and other data processing, the
multi-dimensional array object can be split across multiple IO nodes. Each
piece of this array is called a {\em shard}. The idea of sharding is to store
a logically complete portion of a data set on a single storage target. This is
similar in concept to the HDF5 hyperslab. The FFSIO stack supports sharding
the data in the default or some other structured way as well as ``re-sharding''
based on application needs. For example, reordering the data so that a
different dimension is the ``fast'' dimension may greatly improve the
performance of a subsequent data analytics task. A common scenario where this
is useful is when a Fortran code (column-major) writes data for a C code
(row-major) to analyze. The IOD API supports the following sharding strategies:
\begin{itemize}
\item
\textbf{contiguous}. Fixed chunking, distributed in a round-robin
fashion across the IO nodes.
\item
\textbf{chunked}. Same as above but with irregular (sparse) chunking.
\item
\textbf{user-defined}. Either contiguous or chunked, but user specifies
where to place each individual shard.
\end{itemize}
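
A minimal sketch of the contiguous strategy follows, assuming that sharding is
done along the first dimension only and that IO nodes are identified by
integers. The function name \texttt{contiguous\_shards} and its parameters are
hypothetical and do not reflect the actual IOD API.

```python
from typing import Dict, List, Tuple


def contiguous_shards(
    n_rows: int, chunk_rows: int, n_io_nodes: int
) -> Dict[int, List[Tuple[int, int]]]:
    """Assign fixed-size row chunks [start, stop) to IO nodes round-robin."""
    if chunk_rows <= 0 or n_io_nodes <= 0:
        raise ValueError("chunk_rows and n_io_nodes must be positive")
    placement: Dict[int, List[Tuple[int, int]]] = {
        node: [] for node in range(n_io_nodes)
    }
    for index, start in enumerate(range(0, n_rows, chunk_rows)):
        stop = min(start + chunk_rows, n_rows)  # last chunk may be short
        placement[index % n_io_nodes].append((start, stop))
    return placement


# Hypothetical usage: 10 rows in chunks of 3 rows over 2 IO nodes.
layout = contiguous_shards(n_rows=10, chunk_rows=3, n_io_nodes=2)
```

Under these assumptions, a user-defined strategy would differ only in replacing
the round-robin placement rule with an explicit shard-to-node table.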
It is possible to request the transformation of an object's physical
layout to other formats resulting in multiple copies of the same objects in
multiple formats. Also, the user can pre-fetch objects from the storage cluster
into the IO nodes or read them directly from the storage cluster. At the
semantic level (HDF5), indices can be created for datasets resulting in being
able to read through an index instead of directly from the base array.
All of these distinct alternatives result in having many different ways for
executing the same analysis task. In the subsequent discussions, we consider
only data-movement optimization, i.e., sending the analysis code as close as
possible to the data. In practice, this means we focus on identifying sharding
of datasets and execute code accordingly over the appropriate shards.
\subsection{IO Nodes}
IOD processes are hosted on the IO nodes that interface a general compute area
with the storage array. The IO nodes handle requests forwarded by the
scientific applications, potentially integrate a tier of solid-state devices to
absorb the burst of random or high volume operations, and organize/re-format
the data so that transfers to/from the staging area from/to the traditional
parallel file system can be done more efficiently. They also have the capacity to
execute analysis on data recently generated by simulation applications running
at the compute nodes, but not persisted to the storage array. As the data
arrives, re-organization and data preparation can be applied in order to
anticipate the execution of analytical tasks.
\begin{figure}[htbp]
\centering
%\vspace{-0.10in}
\includegraphics[width=\columnwidth]{images/exa-arch.png}
%\vspace{-0.20in}
\caption{Extreme Scale Architecture}
\label{fig:exa-arch}
%\vspace{-0.10in}
\end{figure}
A common configuration for this type of deployment is shown in
Figure~\ref{fig:exa-arch}. The designated IO nodes (IONs) are connected to the
compute nodes (CNs) through the same fast fabric (e.g., InfiniBand) while the
connection to the external storage cluster is through a secondary, slower
channel (e.g., 10Gb Ethernet). By providing additional storage on the IO nodes,
such as SSDs, these nodes are capable of regulating the IO pressure on
the underlying storage array better than simple forwarding gateways. For this
project, using something like SSDs on the IO nodes is termed a {\em Burst
Buffer} and is discussed below.
\subsection{Burst Buffers}
\label{sec:burst}
The idea of burst buffers was initially explored in the context of data
staging~\cite{abbasi:2007:datatap,Abbasi:2009:datatap,nisar:2008:staging,zheng:2010:predata}.
These initial designs all use extra compute nodes to represent the data storage
buffer given the lack of any dedicated hardware support for this functionality.
The desired outcome of these initial studies is to motivate how such
functionality might be incorporated and the potential benefits. Later, these
concepts were proposed to be incorporated into the existing IO stack
architecture~\cite{nowoczynski:2008:zest,bent:2012:challenges,bent:2012:burst-buffer}.
In the case of the written IOD design, it describes a fixed-sized staging area
that is partitioned on a per-application basis. As part of an application
being deployed into the platform, each application will be allocated a fixed
number of IO nodes for exclusive use during the application run. This provides
guarantees about how much burst buffer space and processing capability will be
available for the applications.
Future work will generalize this model to potentially support dynamic IO node
allocation and examine the possibility of oversubscription. It will be strictly
necessary to consider shared IO nodes for cases where the number of deployed
applications exceeds the number of IO nodes. This first phase focuses on
extreme scale application runs that use the vast majority of a platform rather
than a capacity cluster where end-to-end performance is a lesser concern.
\subsection{Data Versioning}
Since space is limited in expensive, in-compute-area storage resources, a
copy-on-write approach is used for new versions of the same data. For example,
for a checkpoint/restart file, multiple versions will be written. The only
parts of this container that must be replicated are those that have changed
since the last write. With potentially many transactions written to the IOD
layer because it is fast, this approach will enable additional output to be
stored while reducing the space overhead. The inherent dependencies this
introduces into the data are a lesser issue for the generally transient data
in the IOD layer. For data intended for persistence in the DAOS layer, it may
expose all versions of the data to corruption unnecessarily. This is explored
in more detail in Section~\ref{sec:daos}.
\subsection{Design Philosophy}
The burst buffers design, as presented in the IOD documents, limits the
placement of the function operators and SSD buffers to the IO nodes. The
limitations of this design are acknowledged and the intent is to ultimately
spread the IOD layer from the IO nodes into the compute area as well. This is
intended to help address the limitations of the IO bandwidth and compute
capability of these few nodes for data processing and also to take advantage
of new layers in the storage hierarchy. Incorporating NVRAM into compute
nodes makes new options available for buffering data before it is moved to
centralized storage and addresses potential concerns about SSD
performance. For example, including a small amount of Phase Change memory into
many or most compute nodes offers a way to move data outside of both the
compute and IO path for data and communication intensive operations. Other
projects~\cite{zheng:2010:predata} have shown this will have value, but the
cost will have to be considered as part of the overall platform budget. This
lessens the impact of some operators while offering additional options for
places to store data.
Burst buffers being optional is a high level goal, but not considered in detail
within the phase one design. If there is no burst buffer, all of the advanced
functionality proposed for the IOD layer would have to work against the DAOS
layer instead. For example, function shipping assumes it will operate on fast,
local data within the IOD layer rather than against the globally shared DAOS
layer that will likely still be disk for at least a couple more generations of
platforms. With the additional desire to support using compute node resources
for these operations, serious work will be required to make a fully functional
end-to-end IOD layer implementation for a production system.
\section{DAOS Layer}
\label{sec:daos}
The Distributed Application Object Storage (DAOS) layer serves as the
traditional parallel file system interface layer for the storage devices. This
is the consistent, global view of the underlying devices represented in this
stack by the VOSD layer. This is the layer where the container/object model is
translated into the physical storage requirements dictated by the physical
storage underneath (the VOSD layer). The two key design elements of this layer
are the handling of epochs and the mapping of containers and objects to the
underlying storage.
There is a bit of a terminology shift between the IOD layer and the DAOS
layer. For the IOD layer, a shard represents a portion of an object that is
spread across potentially multiple IO nodes. For the DAOS layer, a shard
represents the portion of a container that is spread across potentially
multiple physical storage devices. The physical storage devices are
represented by the VOSD layer described in Section~\ref{sec:vosd}.
While transactions at the HDF5 and IOD layer use the same term, at the DAOS
layer the terminology shifts. Instead of transactions, the term {\em epochs}
is used. Rather than introducing confusion, this is intended
to help clarify how these concepts are used at different layers of the FFSIO
stack. In the HDF5 and IOD layer, every operation has a transaction that may
or may not ultimately be persisted to the DAOS layer. When a transaction is
persisted to DAOS, it is termed an epoch to reflect that this is a persistent
version of the container. For simplicity the epoch ID is the same as the
transaction ID that was persisted. Unlike transaction IDs, epoch IDs generally
are not consecutive reflecting that not all transactions will be persisted to
DAOS.
To deal with the potentially missing data versions between epochs because not
all transactions are persisted, a special procedure must be followed. The
``flattening'' process combines multiple copy-on-write versions of a
transaction into a single epoch. Since this stack uses a copy-on-write
approach to reduce the space requirement for new versions of existing files, all
of the changes between the last epoch and the current epoch must be combined
into a single entry. While not a cost free operation, it is generally
considered inexpensive since a backwards combining of transaction blocks can
be made ignoring any block that is already part of the combined changes.
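
The backwards combining step can be sketched as follows, assuming each
transaction is represented as a map from block number to the data written for
that block. The function name \texttt{flatten} and this block-map
representation are illustrative assumptions, not the DAOS implementation.

```python
from typing import Dict, List


def flatten(transactions: List[Dict[int, bytes]]) -> Dict[int, bytes]:
    """Combine copy-on-write transactions (oldest first) into one epoch.

    Walking from newest to oldest, a block is taken the first time it is
    seen and ignored afterwards, since a newer version is already included.
    """
    epoch: Dict[int, bytes] = {}
    for changes in reversed(transactions):
        for block, data in changes.items():
            epoch.setdefault(block, data)  # keep the newest version only
    return epoch


# Hypothetical usage: three transactions since the last epoch.
combined = flatten([{0: b"a", 1: b"b"}, {1: b"B"}, {2: b"c"}])
```

Each block is examined once per transaction that wrote it, which is consistent
with the operation being inexpensive but not free.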
The current implementation has the DAOS layer map the container/object data
model onto a directory/file data model used for most existing file systems.
Should a fully object-based file system be deployed at the VOSD layer, this
mapping would be unnecessary. The current projections suggest that a standard
POSIX-like file system will likely be used at the lowest level on each storage
device requiring the mapping at some level. To perform this mapping, DAOS
considers the following.
Each container is represented by a directory on some storage device containing
symbolic links to all of the shards it contains and maintains the epoch ID.
In particular the Highest Committed Epoch is an important concept for
quickly identifying which version of a shard to retrieve and to block writes to
older epochs since those have been committed.
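
As a hedged illustration of how the Highest Committed Epoch (HCE) could serve
both purposes, the sketch below tracks the committed epochs of a single shard
in memory. The class \texttt{ShardVersions} and its methods are hypothetical,
and the read rule (serve the newest committed epoch at or before the requested
one) is an assumption made to accommodate non-consecutive epoch IDs.

```python
from typing import Dict


class ShardVersions:
    """Tracks committed epochs of one shard and its Highest Committed Epoch."""

    def __init__(self) -> None:
        self.committed: Dict[int, bytes] = {}  # epoch ID -> shard contents
        self.hce = -1  # no epoch committed yet

    def commit(self, epoch: int, data: bytes) -> None:
        # Epochs at or below the HCE are already committed, so writes to
        # them are rejected.
        if epoch <= self.hce:
            raise ValueError(f"epoch {epoch} is not newer than HCE {self.hce}")
        self.committed[epoch] = data
        self.hce = epoch

    def read(self, epoch: int) -> bytes:
        # Epoch IDs may have gaps, so serve the newest committed epoch
        # at or before the requested one (an assumption of this sketch).
        candidates = [e for e in self.committed if e <= epoch]
        if not candidates:
            raise KeyError(f"no committed epoch at or before {epoch}")
        return self.committed[max(candidates)]

    def read_latest(self) -> bytes:
        return self.read(self.hce)


# Hypothetical usage: epochs 3 and 7 persisted, 4 through 6 never persisted.
shard = ShardVersions()
shard.commit(3, b"v3")
shard.commit(7, b"v7")
```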
Overall, the DAOS layer serves as the shared persistent storage interface for
the IO stack. In the case of a data center-wide storage array, the DAOS layer
would be shared across all of the platforms with the upper layers being local
to each individual platform.
To address consistency issues between platforms, containers at the DAOS layer
must know of every transaction. Accordingly, a container is updated every
time a transaction on it is created, closed, or aborted. This ensures
that if multiple platforms are writing to the same container sequentially that
they will not have conflicts in the highest transaction number. The FFSIO stack
does not support multiple applications from the same or different platforms
using a shared DAOS layer to write to the same container at the same time. This
functionality is not supported by popular existing parallel file systems
either.
\subsection{Design Philosophy}
The DAOS layer is the key storage management layer for this system. By handling
the translation between user-level concepts and the underlying hardware,
performance and functionality are both important. The choice of an object
interface is influenced by the performance gains achieved by the data analytics
community for non-shared data access. With the system design favoring, if not
requiring, this operation mode, using an object interface fits naturally. With
the broad array of object-based storage devices hitting the market, this layer
may thin, outsourcing much of the object creation and management to these
specialized devices.
Since this is the layer at which a storage system will be shared by multiple
platforms, consistency is also a concern. By shifting to an object model and
moving away from a POSIX-style directory tree, maintaining consistency will be
easier. No longer will a consistent view of a particular set of files
(containers) be required. Instead, only a single container need be consistent.
With container sharing between platforms generally being limited to downstream
analysis routines, waiting for a new epoch to be persisted can serve as an
analysis trigger.
Issues related to the handling of transactions and epochs are discussed in
Section~\ref{sec:transactions}. Maintaining storage system scalability with
this functionality will be challenging.
\section{VOSD Layer}
\label{sec:vosd}
The Versioning Object Storage Device (VOSD) layer operates as the interface for
each persistent storage device used to support the parallel storage array. In
the purest form, it uses a local file system to arrange storage of objects that
represent parts of the higher level objects in containers.
The base level implementation continues the space optimization of only storing
changes for new versions by using a copy-on-write file system. The prototype
uses ZFS~\cite{zhang:2010:zfs} for the known stability and integration with
Lustre. In a production version of the FFSIO stack,
btrfs~\cite{rodeh:2013:btrfs}, the Linux B-tree file system, is a likely
long-term choice given its open-source backing and GPL licensing.
At a more detailed level, the design for VOSD is an increment beyond the
current Lustre Object Storage Device design to incorporate the idea of shards
and the versioning aspects of transactions/epochs. For every DAOS shard, the
VOSD has information for storing and accessing the currently committed version,
the Highest Committed Epoch, as well as a staging dataset representing the
next version of the object being stored. Both of these are combined in a shard
root.
For data integrity, an intent log is maintained as part of the underlying file
system enabling fault recovery.
Beyond the functionality to incorporate and expose the copy-on-write nature
of the underlying file system and the semantics for storing and processing
shards and their associated epochs, this is largely an evolution of the
existing Lustre OSD layer.
\section{Broader Design}
\label{sec:broader}
Several concepts crosscut many of these layers and are best described in a
single location. For example, transactions and epochs are visible from the
user API level down into the VOSD layer. While each layer affects the concept,
it is best to look at each concept across all of the layers.
In the subsections that follow, we examine transactions and epochs, metadata
management, and function and analysis shipping.
\begin{figure*}[htbp!]
\centering
\vspace{-0.10in}
\subfigure[Write Hosts]{\label{fig:write-hosts}\includegraphics[width=\columnwidth]{images/write-hosts.png}}
\subfigure[Read Hosts]{\label{fig:read-hosts}\includegraphics[width=\columnwidth]{images/read-hosts.png}}
\vspace{-0.10in}
\caption{Functionality Demonstration Validation for Number of Hosts}
\label{fig:eval-hosts}
\vspace{-0.05in}
\end{figure*}
\begin{figure*}[htbp!]
\centering
\vspace{-0.10in}
\subfigure[Write Size]{\label{fig:write-size}\includegraphics[width=\columnwidth]{images/write-size.png}}
\subfigure[Read Size]{\label{fig:read-size}\includegraphics[width=\columnwidth]{images/read-size.png}}
\vspace{-0.10in}
\caption{Functionality Demonstration Validation for Data Sizes}
\label{fig:eval-size}
\vspace{-0.2in}
\end{figure*}
\subsection{Transactions and Epochs}
\label{sec:transactions}
In our previous work~\cite{lofstead:2014:ffsio-consistency}, we focused
strongly on not just exploring, but also critiquing using transactions, as
proposed, for consistency control. This section recaps some of that information
to give a more self-contained presentation.
As mentioned above, the transaction mechanism manifests in two forms. From the
user level down through the IOD layer, they are called transactions and are
used to judge whether or not a set of distributed, asynchronous modifications
across a set of related objects (i.e., within a container) is complete or not.
They are also used to control access by treating the transaction ID of a
committed transaction as a version identifier. At the DAOS layer and below, they
are called epochs and represent persisted (durable) transactions from the IOD
layer. The two forms offer different functionality, but are connected, as
explained below.
\subsubsection{Transactions}
To understand how transactions are used in the IOD layer, some terminology and
concepts must be explained first. At the coarsest grain level is a container.
Each container provides the single access context through which to access a
collection of objects. Transactions are the way that a series of modifications
to the objects within a container are treated atomically. Conceptually,
containers correspond to something akin to an HDF5 file in a traditional
file system. The objects in each container represent different data within a
file. The three initially defined object types are key-value stores,
multi-dimensional arrays, and blobs. The easiest way to understand these types
is to evaluate these from the perspective of an HDF5 file, the initial user
interface layer. The key-value store represents a collection of attributes or
groups. The array represents a potentially multi-dimensional array. The blob
represents a byte stream of arbitrary contents. The fundamental difference
between an array and a blob is that the array has metadata specifying the
dimension(s). At the physical layer within the IO nodes, all of these objects
may be striped across multiple IO nodes. Given this context, the transactions
come in two forms.
First is a single leader transaction, which the IOD manages based on calls from
a single client. The underlying assumption is that the client side will manage
the transactional operations itself and the single client is capable of and
responsible for reporting to the IOD how to evolve the transaction state.
The second form is called multi-leader and has the IOD layer manage the
transactions. In this case, when the transaction is created, a count of clients
is provided to the IOD layer. As clients commit their changes to the container,
the reference count is reduced. Once the count reaches 0, the transaction is
automatically committed. Aggregation into a smaller set of leaders is also
possible.
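
The multi-leader counting rule can be sketched as below. This is a
single-threaded illustration under stated assumptions: the class
\texttt{MultiLeaderTransaction}, its method names, and the string-valued states
are hypothetical, and the explicit \texttt{abort} call is added for
completeness rather than taken from the IOD API.

```python
class MultiLeaderTransaction:
    """Reference-counted commit: the last client to commit closes the transaction."""

    def __init__(self, txn_id: int, client_count: int) -> None:
        if client_count < 1:
            raise ValueError("client_count must be at least 1")
        self.txn_id = txn_id
        self.remaining = client_count  # count supplied at creation
        self.state = "open"

    def client_commit(self) -> str:
        """Record one client's commit and return the resulting state."""
        if self.state != "open":
            raise RuntimeError(f"transaction {self.txn_id} is {self.state}")
        self.remaining -= 1
        if self.remaining == 0:
            self.state = "committed"  # automatic commit at zero
        return self.state

    def abort(self) -> None:
        # Assumption: an open transaction can be aborted at any time.
        if self.state == "open":
            self.state = "aborted"


# Hypothetical usage: three clients participate in transaction 7.
txn = MultiLeaderTransaction(txn_id=7, client_count=3)
states = [txn.client_commit() for _ in range(3)]
```

A real implementation would need the decrement to be atomic across concurrent
clients, which this sketch does not attempt.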
\subsubsection{Epochs}
The Epoch mechanism differs from transactions. Instead of focusing on when a
particular output is complete, an epoch represents incremental persisted
container copies. To simplify the mapping between an IOD transaction and the
DAOS epochs, when an IOD transaction is persisted to DAOS, the IOD transaction
ID is then used as the epoch ID. The key difference is that at the DAOS layer,
some transaction (epoch) IDs will not be represented with data since not all
IOD transactions are necessarily persisted. Maintaining this ID continuity is
critical for multiplatform use. Since the shared point is the DAOS layer, any
user adding a new version to a file must be able to determine the most recent
transaction ID no matter from where the container was updated last.
\subsubsection{Design Philosophy}
Undocumented, but inherent in the design of these transactions, is how faults
are detected. The initial design assumes the current Lustre fault detection
mechanism that can determine if a process or node is no longer reachable. This
detection happens at the DAOS layer and when a fault is detected, the rollback
process is pushed up to the IOD layer for all non-persisted or non-committed
transactions. This defines how a fault will be detected and what will trigger a
passive fault recovery (i.e., transaction abort). The challenge with this
approach will be scalability. Existing Lustre systems can use the IO node
status as a proxy for compute area status. Since the DAOS layer must now know
the state of every node, if not every process on every node, to properly
handle transactions, some scalable status tracking mechanism is required.
There are two steps for beginning a transaction on a container. The first step
is for one or more processes to open the container. The resulting handle can be shared,
eliminating the need for every participating process to hit the IOD layer to
open the file. The second step is a call to determine how many leaders will
participate in the transaction. In the single leader case, there is no
IOD-side aggregation of success/fail statuses to determine the final
transaction state. Instead, it is assumed that the client will fully manage
the transaction. In the multi-leader model, some subset of 2 to $n$ processes,
where $n$ is the count of all processes, declare themselves leaders for this
container operation to the IOD layer. Any number of processes can participate in
modifying the container without regard to whether or not they are leaders. Once
each leader has finished, with the assumption that any clients a leader may be
responsible for are finished as well, the IOD layer aggregates those responses
to either commit or abort the transaction. For scalability and performance,
phase two is favoring single leader transactions. With libraries like
D\textsuperscript{2}T~\cite{lofstead:2012:txn,lofstead:2013:pdsw-txn,lofstead:2014:txn}
to ease implementing client-side transactions, this burden is lessened.
Ultimately, with the passive detection of faults for transaction leaders, the
transaction mechanism can work very well. A mostly unstated restriction that is
being relaxed for phase two is that every sequential transaction on a container
is considered dependent on the earlier transaction. Should one output be
delayed and the subsequent five succeed, when the delayed process finally
fails, all six transactions are rolled back. The thought of using this
mechanism to store subsequent checkpoint outputs in the same container, both
to save space and to tolerate the failure of any one output, cannot work in the
current form. This
has been acknowledged and is being relaxed requiring a new parameter to the
creation of a transaction determining if it will be dependent or not. The
downside to supporting this functionality is the reduced ability to use