Eryndlia's Prog Blog: Comma-Separated Values, cont. (csv_processing.ads, csv_types.ads)

Introduction
In the previous post, we walked through the library package containing the main program. This program calls a couple of other subprograms and also instantiates a generic package. In this post we will take a brief look at the file that contains that generic package and at some of the deeper structure of the overall program. The package specifications that we will examine are CSV_Processing and CSV_Types (files csv_processing.ads and csv_types.ads, respectively). We'll get more into the nuts and bolts of how the program body actually works in the next post.
Additionally, another slight change was made to the main program Demo_CSVs: The procedure Process_CSV_Record has had its name changed to Process_CSV_Value_Array, which is more descriptive and less confusing as to what it actually does. Some of the names, by the way, that I have used in my code will seem horribly long to many software developers. My only response is that they are descriptive and help not only others but myself understand what in the world I was thinking. I like that.

Here is the beginning of the specification for package CSV_Processing. The specification for the main processing procedure for the CSV text file appears here. Its name is Process_Csv_File. The first three parameters to the subprogram are straightforward; the fourth needs a little further explanation:

Package Specification CSV_Processing, part 1

The_Text_File: A parameter of type Ada.Text_IO.File_Type. The_Text_File denotes a text file that is already opened for input. This is the CSV file to be read. Note that this type is directly visible as a result of the previous "use Ada.Text_IO" clause.
File_Record_Size: The maximum length of a line of text that can be input from The_Text_File.
The_Separators: A String of one or more Characters that are used as separators between the values (and Headers/Titles) in the CSV file. Note that a final break Character is optional.
CSV_Processor: An access to a callback subprogram, a procedure to be called by Process_Csv_File for each value (including each Header/Title String) in order to process that value. One of the value processing procedures that are supplied within the generic packages below should be sufficient in most cases. Note that this parameter's type is anonymous; that is, it is not formally declared with a name prior to this point. Also note that the access value supplied as the argument to the call of the procedure Process_Csv_File may not be null; in other words, the argument must reference an actual procedure. This helps to ensure that the program does not attempt a call to an invalid address (zero, for instance) at run-time.

The parameters of the actual callback procedure are: 1) Csv_Value, which is the String for the value (or Title/Header) taken directly from the text file's record; 2) First_Line, which is a Boolean indicating whether the value in parameter (1) was taken from the first record in the file, in which case it will be True; 3) Current_Field, which is the Positive number of the column from which the value in parameter (1) was taken; and 4) Last_Value, which is a Boolean that is True when the value in parameter (1) is the last value in the current file record/line.
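The original listing is not reproduced in this text, so what follows is only a minimal sketch of the callback mechanism just described. The callback profile uses the parameter names given above; Process_Line, Is_First_Line, Show, and the sample data are hypothetical stand-ins, and Process_Line splits a line held in memory rather than reading The_Text_File.

```ada
with Ada.Text_IO;

procedure Callback_Sketch is

   --  Hypothetical stand-in for Process_Csv_File: it splits one line held
   --  in memory and calls the callback once per value.
   procedure Process_Line
     (Line           : String;
      The_Separators : String;
      Is_First_Line  : Boolean;
      CSV_Processor  : not null access procedure
        (Csv_Value     : String;
         First_Line    : Boolean;
         Current_Field : Positive;
         Last_Value    : Boolean))
   is
      Start : Positive := Line'First;
      Field : Positive := 1;

      function Is_Separator (C : Character) return Boolean is
      begin
         for S of The_Separators loop
            if S = C then
               return True;
            end if;
         end loop;
         return False;
      end Is_Separator;
   begin
      for I in Line'Range loop
         if Is_Separator (Line (I)) then
            CSV_Processor (Line (Start .. I - 1), Is_First_Line, Field, False);
            Field := Field + 1;
            Start := I + 1;
         end if;
      end loop;
      --  Whatever follows the final separator is the last value.
      CSV_Processor (Line (Start .. Line'Last), Is_First_Line, Field, True);
   end Process_Line;

   --  A callback whose profile matches the anonymous access type.
   procedure Show
     (Csv_Value     : String;
      First_Line    : Boolean;
      Current_Field : Positive;
      Last_Value    : Boolean) is
   begin
      Ada.Text_IO.Put_Line
        ((if First_Line then "title" else "value")
         & Positive'Image (Current_Field) & ": " & Csv_Value
         & (if Last_Value then " (last)" else ""));
   end Show;

begin
   Process_Line ("Name,Dept", ",", True, Show'Access);
   Process_Line ("Ann,R&D", ",", False, Show'Access);
end Callback_Sketch;
```

Because the parameter is declared "not null", passing a null access value raises Constraint_Error at the call rather than failing later inside the loop.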

The Generic_Packages
The package CSV_Processing contains two nested generic packages. They are made a part of the outer package partly for convenience but mainly so that the procedure Process_CSV_File, called by the main program Demo_CSVs, can have access to them; that is, Process_CSV_File can "see" their visible (non-private) components. These components include variables and constants, subprograms, and types, including a couple of protected types (one in each generic).

The two generic packages are CSVs_No_Titles_Processing, for processing a CSV file with values only — no header record — and CSVs_With_Titles_Processing, for processing a CSV file in which the first record is a comma- (or break character-) separated list of column headers for the data that follows in subsequent records. I had originally planned to have the "With_Headers" generic also handle files with no headers — and that functionality is partially implemented — however, it seemed to be more feasible for the purposes of this example to just add a more limited version of the package for simplicity and efficiency. The reader is left to implement the remainder of the partial coding in CSVs_With_Titles_Processing.

The Generic Package Specifications
Generic units — packages and subprograms (procedures and functions) — exist as one method of providing for reusability. The generic unit allows types, objects, and subprograms to be parameterized. These "parameter" declarations must appear between the word generic and the package specification proper. The "No_Titles" package is the simpler of the two all the way around, so we'll begin there.

Package Specification CSV_Processing, part 2

CSVs_No_Titles_Processing
There are three generic parameters for the generic package CSVs_No_Titles_Processing:

CSV_Index_Type
is an integer type, the exact range of which will be fixed when the generic is instantiated. This type will be the type used for indices for setting and looking up the values appearing within a single file record. Its range needs to be at least as large as the number of values appearing in the records. It is used as part of the declaration of the next type, Value_Components_Type.
Value_Components_Type
is the second generic parameter. It is an array that is indexed by CSV_Index_Type, which we just looked at. The array elements are actually Strings, whose type (including length) is declared in the package CSV_Types shown below.
Process
The third generic parameter Process is a procedure whose parameter profile is shown. The name can be Process or something else. Although I did not do this, I could have required the generic parameter to be a specified name. The procedure supplied for this parameter when CSVs_No_Titles_Processing is instantiated will be called upon completion of each record's processing. The single parameter supplied to Process is the array of String values garnered from that record.
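As a rough illustration of these three generic parameters, here is a minimal sketch of a generic formal part and one instantiation. Only CSV_Index_Type, Value_Components_Type, and Process come from the description above; Value_String, No_Titles_Sketch, Finish_Record, and the types used for the instantiation are assumptions made for the example.

```ada
with Ada.Text_IO;

procedure Generic_Sketch is

   --  Assumed fixed-length String subtype (the real one lives in CSV_Types).
   subtype Value_String is String (1 .. 20);

   generic
      type CSV_Index_Type is range <>;           --  any signed integer type
      type Value_Components_Type is
        array (CSV_Index_Type) of Value_String;  --  indexed by the type above
      with procedure Process (The_Values : Value_Components_Type);
   package No_Titles_Sketch is
      --  Hypothetical: hands a completed record to the callback.
      procedure Finish_Record (The_Values : Value_Components_Type);
   end No_Titles_Sketch;

   package body No_Titles_Sketch is
      procedure Finish_Record (The_Values : Value_Components_Type) is
      begin
         Process (The_Values);
      end Finish_Record;
   end No_Titles_Sketch;

   --  Actual types and procedure supplied at instantiation.
   type Column_Index is range 1 .. 3;
   type Column_Values is array (Column_Index) of Value_String;

   procedure Print_Values (The_Values : Column_Values) is
   begin
      for V of The_Values loop
         Ada.Text_IO.Put_Line (V);
      end loop;
   end Print_Values;

   package Three_Columns is new No_Titles_Sketch
     (CSV_Index_Type        => Column_Index,
      Value_Components_Type => Column_Values,
      Process               => Print_Values);

begin
   Three_Columns.Finish_Record ((others => (others => '-')));
end Generic_Sketch;
```

Note that the actual procedure is named Print_Values, not Process: the formal name only matters inside the generic.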

CSVs_No_Titles_Processing — the specification

Moving on, the protected type specification CSVs appears next. The protected type provides controlled — one client at a time — access to its internal variable Values. Values is the actual data array that will hold all the String values plucked from a file record. There are two entry declarations, which provide for a (potential) queue of clients to Values; a single procedure Set_Value; and a single function The_Values. Set_Value does what the name says and sets a specific array element indicated by The_Index. The_Values returns the entire array of String values as its function result.

Following the specification for CSVs, we find a single object, the variable CSVs_No_Titles, whose type is the protected type CSVs that we just looked at. This variable now serves as our storage for both the values array and the entry points and subprograms for accessing the array.

Next is the only subprogram which appears in the specification for the package CSVs_No_Titles_Processing. This is the procedure CSV_Processor_No_Header_Line. This procedure can be called, once the package is instantiated, to process each data record from the CSV text file. Finally, this subprogram calls the procedure Process after all the values in a record have been stored in the protected array.
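A protected type along these lines might look like the sketch below. The names CSVs, Values, Set_Value, The_Values, The_Index, and CSVs_No_Titles come from the text; the two entry names, the In_Progress flag, and the concrete index and string types are assumptions, since the post does not name them.

```ada
with Ada.Strings.Fixed;
with Ada.Text_IO;

procedure Protected_Sketch is

   subtype Value_String is String (1 .. 20);   --  assumed length
   type CSV_Index_Type is range 1 .. 3;
   type Value_Components_Type is array (CSV_Index_Type) of Value_String;

   protected type CSVs is
      entry Begin_Record;   --  hypothetical entry names
      entry End_Record;
      procedure Set_Value
        (The_Index : CSV_Index_Type; The_Value : Value_String);
      function The_Values return Value_Components_Type;
   private
      Values      : Value_Components_Type := (others => (others => ' '));
      In_Progress : Boolean := False;
   end CSVs;

   protected body CSVs is
      --  Callers queue here while another client is filling a record.
      entry Begin_Record when not In_Progress is
      begin
         In_Progress := True;
      end Begin_Record;

      entry End_Record when In_Progress is
      begin
         In_Progress := False;
      end End_Record;

      procedure Set_Value
        (The_Index : CSV_Index_Type; The_Value : Value_String) is
      begin
         Values (The_Index) := The_Value;
      end Set_Value;

      function The_Values return Value_Components_Type is
      begin
         return Values;
      end The_Values;
   end CSVs;

   CSVs_No_Titles : CSVs;   --  the single object of the protected type

begin
   CSVs_No_Titles.Begin_Record;
   --  Head pads "Ada" with blanks to the fixed length of 20.
   CSVs_No_Titles.Set_Value (1, Ada.Strings.Fixed.Head ("Ada", 20));
   CSVs_No_Titles.End_Record;

   declare
      Snapshot : constant Value_Components_Type := CSVs_No_Titles.The_Values;
   begin
      Ada.Text_IO.Put_Line (Snapshot (1));
   end;
end Protected_Sketch;
```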

CSVs_With_Titles_Processing
There are multiple ways in which this package could have been implemented. Probably the most flexible in terms of the column headers (or titles) themselves would be to define a String access type and then create an array of these access variables for holding the titles and indexed by the title's position in the array, just as the values are indexed in the previous generic package "No_Titles." I decided against using this method (which is left to the reader as an exercise) and decided to make this an opportunity to demonstrate the usage of Ada enumeration types.
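For comparison, the access-type alternative mentioned above could start out something like this. It is only a sketch of the idea, and String_Access, Title_Array, and the sample titles are all hypothetical names.

```ada
with Ada.Text_IO;

procedure Title_Access_Sketch is

   --  Each title is allocated separately, so titles may differ in length
   --  and need not be known when the program is compiled.
   type String_Access is access String;
   type Title_Array is array (Positive range <>) of String_Access;

   Titles : Title_Array (1 .. 3);

begin
   Titles (1) := new String'("Name");
   Titles (2) := new String'("Department");
   Titles (3) := new String'("Annual Salary");

   for I in Titles'Range loop
      Ada.Text_IO.Put_Line (Positive'Image (I) & " => " & Titles (I).all);
   end loop;
end Title_Access_Sketch;
```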

Package Specification CSV_Processing, part 3

As one might surmise, there are both pros and cons to implementing the headers with the aid of an enumeration. There are two major cons which stand out:

Each header must match, except for case, a corresponding enumeration literal exactly. [This is done to keep the code simple, although I can imagine other schemes, for instance, detecting a mismatch and then checking to see whether the String form of the literal matches the first n Characters of the header String; removing spaces; replacing spaces with underlines, etc.]
The headers must have been placed into the code as an enumeration type before the program is run. [This is often a pain, no doubt. There are ways, however, to reduce the inconvenience. One way is to have an executable library (a Windows .dll library, for instance) to hold the various header types to be used by the program. This would allow a software department or company to distribute the program while allowing the client department or company to have to deal with only a slight inconvenience but gaining hugely in flexibility.]

The pros:

Each header must match, except for case, a corresponding enumeration literal exactly. [When one is considering safety and reliability, forcing the actual headers in the file (in both form and count) to correspond with internal representations helps to ensure that the expected file is the one actually being read: not a newer/older version and, worse, not a totally different file.]
The enumeration type itself can be used directly for indexing the array of values which is generated from the subsequent text file records. [This aids in simpler code and more easily understandable code, as each literal directly represents the column of values it heads.]
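Both points can be seen in a few lines. The sketch below uses the predefined 'Value attribute, which ignores case and raises Constraint_Error for text that is not a literal of the type; whether the actual package body matches headers this way is an assumption. The literals and the names Hits and Note_Header are made up for the example.

```ada
with Ada.Text_IO;

procedure Enum_Title_Sketch is

   type Title_Type is (Name, Department, Salary);   --  hypothetical literals

   --  The enumeration indexes the array directly.
   Hits : array (Title_Type) of Natural := (others => 0);

   procedure Note_Header (Header : String) is
      Title : Title_Type;
   begin
      Title := Title_Type'Value (Header);   --  case-insensitive match
      Hits (Title) := Hits (Title) + 1;
   exception
      when Constraint_Error =>
         Ada.Text_IO.Put_Line ("Unexpected header: " & Header);
   end Note_Header;

begin
   Note_Header ("name");
   Note_Header ("DEPARTMENT");
   Note_Header ("Wage");        --  not a literal, so it is rejected

   for T in Title_Type loop
      Ada.Text_IO.Put_Line (Title_Type'Image (T) & Natural'Image (Hits (T)));
   end loop;
end Enum_Title_Sketch;
```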

Proceeding, there are three additional generic parameters for the generic package CSVs_With_Titles_Processing which do not appear in CSVs_No_Titles_Processing. These additional parameters are part of the extra code necessary to implement this functionality:

Title_Type
This type is the enumeration the literals of which are the expected column headers from the first record in the text file. Note that the position numbers of the literals must be sequential for the program to work properly; this is the default, so you normally do not have to worry about it.
Titles_Type
This is the array of Headers/Titles that is indexed by CSV_Index_Type. This array allows us to match a column number (derived from CSV_Index_Type) with a specific Header/Title.

2a. Note that this CSV_Index_Type varies from the CSV_Index_Type in the generic package CSVs_No_Titles_Processing. In that package, described above, this generic parameter's type must have the exact range 1 .. expected number of values.

The type of the parameter of the same name here must be declared with exactly the same range as the contiguous range of position numbers of Title_Type. The position numbers for the enumeration literals normally begin at zero but can begin at any value (for instance, when Title_Type is a subtype that omits the leading literals), and this CSV_Index_Type must match their range.
Raise_Exception_When_Missing_Or_Invalid_Title
This constant generic parameter should be set to True in order for an exception to be raised for a missing or invalid Title/Header in the text input file.
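Putting the parameters together, a generic formal part for this package might be sketched as follows. The parameter names are the ones described above, but their order, the component type of Titles_Type, and everything else (Value_String, Titles_Sketch, Ranges_Match, and the types used to instantiate) are assumptions. The Ranges_Match constant just makes the range requirement from note 2a checkable.

```ada
procedure With_Titles_Sketch is

   subtype Value_String is String (1 .. 20);   --  assumed length

   generic
      type CSV_Index_Type is range <>;
      type Value_Components_Type is array (CSV_Index_Type) of Value_String;
      type Title_Type is (<>);                 --  any discrete type
      type Titles_Type is array (CSV_Index_Type) of Title_Type;
      Raise_Exception_When_Missing_Or_Invalid_Title : Boolean;
      with procedure Process (The_Values : Value_Components_Type);
   package Titles_Sketch is
      --  True when the index range equals the range of position numbers.
      Ranges_Match : constant Boolean :=
        CSV_Index_Type'Pos (CSV_Index_Type'First) =
          Title_Type'Pos (Title_Type'First)
        and then
        CSV_Index_Type'Pos (CSV_Index_Type'Last) =
          Title_Type'Pos (Title_Type'Last);
   end Titles_Sketch;

   type Headers_Type is (Name, Department, Salary);   --  hypothetical

   --  Index range derived from the position numbers, here 0 .. 2.
   type Index_Type is range
     Headers_Type'Pos (Headers_Type'First) ..
     Headers_Type'Pos (Headers_Type'Last);

   type Values_Type is array (Index_Type) of Value_String;
   type Header_Array is array (Index_Type) of Headers_Type;

   procedure Ignore (The_Values : Values_Type) is null;

   package Employees is new Titles_Sketch
     (CSV_Index_Type        => Index_Type,
      Value_Components_Type => Values_Type,
      Title_Type            => Headers_Type,
      Titles_Type           => Header_Array,
      Raise_Exception_When_Missing_Or_Invalid_Title => True,
      Process               => Ignore);

begin
   pragma Assert (Employees.Ranges_Match);
   null;
end With_Titles_Sketch;
```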

CSVs_With_Titles_Processing — the specification

Let's now take a quick look at the differences between the specification of this package and that of the No_Titles package. These differences exist, of course, to process the expected Headers/Titles which the first record of the CSV text file should contain. The first difference is that the constant Title_Pos_First is created and set to the position number of the first literal of the enumeration type Title_Type. It is used in the following subprogram.

Title_Field_Index appears next. It calculates the corresponding index into the array Titles_Type given the column number in which a value appears. It may appear strange to some of you: This is because it is an example of an Ada 2012 expression function, in which the function's specification and its body (a single parenthesized expression) appear together in the package specification. In this case a separate subprogram specification is not needed, so the body with its calculation appears solo.
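For readers who have not met expression functions, here is a small sketch in the same spirit. Title_Pos_First and Title_Field_Index are the names used above; the formula, the concrete types, and the helper Title_For are assumptions, not the actual code.

```ada
package Title_Index_Sketch is

   type Title_Type is (Name, Department, Salary);   --  hypothetical literals
   type CSV_Index_Type is range 0 .. 2;             --  matches positions 0 .. 2

   --  Position number of the first literal (zero here).
   Title_Pos_First : constant := Title_Type'Pos (Title_Type'First);

   --  Expression function: specification and body in one declaration.
   --  Column numbers start at 1, position numbers at Title_Pos_First.
   function Title_Field_Index (Column : Positive) return CSV_Index_Type is
     (CSV_Index_Type (Column - 1 + Title_Pos_First));

   --  Hypothetical helper: the title that heads a given column.
   function Title_For (Column : Positive) return Title_Type is
     (Title_Type'Val (Title_Field_Index (Column)));

end Title_Index_Sketch;
```

Since every subprogram here is an expression function, the package needs no body.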

The protected type CSVs now includes the variable Titles, which is the array of Headers/Titles indexed by CSV_Index_Type, and the subprograms required to set and access it. Also, an additional Boolean variable Title_Record_Present is added to signal other parts of the code when the subprogram Set_Title is called. This functionality was added not because it is required currently for the code to work correctly but because I am still experimenting and may use it in the future.

Finally, the specification for the procedure CSV_Processor_With_Header_Line has been added. The body of this subprogram processes the initial record of the input text file in order to extract the column Headers/Titles.

CSV_Types — specification

The package CSV_Types contains declarations that are used program-wide. They are placed into this second package for two reasons:

Convenience and simplicity and
Avoidance of possible circular references

There are two kinds of declarations that appear in CSV_Types:

General declarations
Declarations primarily used as arguments in the instantiation of the generic packages above.

Package Specification CSV_Types

(1) CSV_Types — general declarations

Max_CSV_Length is simply the maximum length that a value (non-Header/-Title) may be. We process values strictly as Strings without attempting to interpret them as a target type. Such conversion is left to the callback routine specified by generic parameter Process. A default value of 20 seems reasonable for Max_CSV_Length but may be set to any required value.

Value_Index_Type is the range, based upon Max_CSV_Length, to be used in the following array declaration.

Value_String_Type is the array of Strings that will hold the textual values garnered from a particular record of the text CSV file.
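The general declarations could be sketched as below. The three names are the ones described above, but the exact form of each declaration is an assumption; in particular, Value_String_Type is shown here as a constrained String subtype, which is one reading of the description. Blank_Value is a hypothetical addition.

```ada
package CSV_Types_Sketch is

   --  Maximum length of a single value; 20 is the default mentioned above.
   Max_CSV_Length : constant Positive := 20;

   --  Index range for the characters of one value.
   subtype Value_Index_Type is Positive range 1 .. Max_CSV_Length;

   --  Fixed-length String that holds one value from a record.
   subtype Value_String_Type is String (Value_Index_Type);

   --  Hypothetical convenience constant: an all-blank value.
   Blank_Value : constant Value_String_Type := (others => ' ');

end CSV_Types_Sketch;
```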

(2) CSV_Types — package Employee_CSV

These latter declarations are grouped within a nested package specification Employee_CSV. This is done so that multiple such packages may be defined for the Headers/Titles required by any number of different applications.

Headers_Type is the declaration of the enumeration that contains the literals, each of which can head a column of values within the CSV text file. When adding additional packages, only the type Headers_Type must be changed: The remainder of the types are declared referencing Headers_Type. All of these type/subtype declarations are used when instantiating one of the two generic packages. See procedure Demo_CSVs for an example of this instantiation.
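A nested package of this kind might be sketched as follows. Employee_CSV and Headers_Type are named in the text; the literals, the other type names, and the enclosing package name are hypothetical. Every declaration after Headers_Type is derived from it, so adding a package for a different file means changing only the enumeration.

```ada
package Employee_Types_Sketch is

   subtype Value_String_Type is String (1 .. 20);   --  assumed length

   package Employee_CSV is

      --  The only declaration that changes from one application to the next.
      type Headers_Type is (Name, Department, Salary);

      --  Index range taken from the position numbers of the literals.
      type Index_Type is range
        Headers_Type'Pos (Headers_Type'First) ..
        Headers_Type'Pos (Headers_Type'Last);

      --  One String value per column.
      type Values_Type is array (Index_Type) of Value_String_Type;

      --  One header per column.
      type Titles_Type is array (Index_Type) of Headers_Type;

   end Employee_CSV;

end Employee_Types_Sketch;
```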

Summary

This blog post has covered the package specifications CSV_Types and CSV_Processing. CSV_Types has no body; there are no constructs — no package, no procedure, no function, no task, and no protected type — that require a body. The body for the package CSV_Processing will be presented in the following post. It shows the primary component of the package's implementation and should be quite interesting to those getting to know Ada.

Friday, October 7, 2011

Comma-Separated Values, cont. (csv_processing.ads, csv_types.ads)
