Eryndlia's Prog Blog: Comma-Separated Values, cont. (csv_exceptions.ads, csv

Introduction
We now turn our attention to the bulk of the executable code, which is found in various constructs within the body of the package CSV_Processing (file csv_processing.adb). Before we do so, however, let's take a quick glance at a tiny package that houses the definitions of a couple of exceptions, which are referenced in CSV_Processing's body.

Package Specification CSV_Exceptions

CSV_Exceptions — specification
Two exceptions are declared in the package specification CSV_Exceptions (file csv_exceptions.ads):

Invalid_Csv_Header
This exception is raised in CSVs_With_Titles_Processing whenever the constant generic parameter Raise_Exception_When_Missing_Or_Invalid_Title is set to True and either a Header or Title is missing or is invalid.
Protected_Type_Not_Reserved
This exception is raised throughout the code whenever a call to an entry or subprogram within a protected type is made "illegally;" that is, the type has not been reserved for use by first calling the entry Reserve.
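Based on the description above, the specification could be as small as the following sketch. Only the two exception names come from this post; the comments and layout are assumptions.

```ada
--  csv_exceptions.ads (sketch reconstructed from the description above)
package CSV_Exceptions is

   --  Raised when a Header/Title is missing or invalid and the generic
   --  parameter Raise_Exception_When_Missing_Or_Invalid_Title is True.
   Invalid_Csv_Header : exception;

   --  Raised when a protected operation is called before entry Reserve.
   Protected_Type_Not_Reserved : exception;

end CSV_Exceptions;
```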

Package Body CSV_Processing (intro)

CSV_Processing — body

As mentioned above, the bulk of the executable code is contained within the body of the package CSV_Processing (file csv_processing.adb), which is the primary subject here. Note that there is a small amount of executable code that is needed for the elaborations, including constants, within the various package specifications; and, of course, there is the main procedure Demo_CSVs, which we will examine in the next and final posting.

The requisite with clauses appear at the top of the file to indicate dependencies. Note that a with clause that already occurs in the specification, CSV_Types for instance, does not need to be repeated in the body.

Next, a pair of renaming declarations allows us to use shorter names in place of the fully qualified ones. They not only save typing but also make the code much more readable.
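As an illustration of package renaming, here is a minimal sketch. The short name Fixed appears later in this post; the name Maps and the demo procedure itself are assumptions made for the example.

```ada
with Ada.Strings.Fixed;
with Ada.Strings.Maps;
with Ada.Text_IO;

procedure Renames_Demo is
   --  Short names for the fully qualified library packages.
   package Fixed renames Ada.Strings.Fixed;
   package Maps  renames Ada.Strings.Maps;   --  assumed name

   Separators : constant Maps.Character_Set := Maps.To_Set (", ");
   Line       : constant String := "alpha,beta";
begin
   --  Fixed.Index is Ada.Strings.Fixed.Index under its shorter name.
   --  The comma is the sixth character, so this prints " 6".
   Ada.Text_IO.Put_Line (Natural'Image (Fixed.Index (Line, Separators)));
end Renames_Demo;
```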

Procedure Body CSV_Processing.Process_CSV_File, part 1/3

Process_CSV_File
The first executable entity we come upon is the body of the procedure Process_CSV_File. Process_CSV_File has four parameters, which were detailed in the description of the specification of this subprogram in the previous blog post. For convenience, the descriptions are repeated here:

The_Text_File — A parameter of type Ada.Text_IO.File_Type. The_Text_File denotes a text file that is already opened for input. This is the CSV file to be read. Note that this type is directly visible as a result of the previous "use Ada.Text_IO" clause.
File_Record_Size — The maximum length of a line of text that can be input from The_Text_File.
The_Separators — A string of one or more characters that are used as separators between the values (and Headers/Titles) in the CSV file. Note that a final break character is optional. An obvious enhancement would be to give this parameter a default value of "," (a comma) or ", " (a comma or space).
CSV_Processor — An access to a callback subprogram, a procedure to be called by Process_Csv_File for each value (including each Header/Title string) in order to process that value. One of the value processing procedures that are supplied within the two generic packages below should be sufficient in most cases. Note that this parameter's type is anonymous; that is, it is not formally declared with a name prior to this point. Also note that the access value supplied as the argument to the call of the procedure Process_Csv_File may not be null; in other words, the argument must reference an actual procedure. This helps to ensure that the program does not attempt a call to an invalid address (zero, for instance) at run-time.

The parameters of the callback procedure are: (1) Csv_Value, which is the string for the value (or Title/Header) taken directly from the text file's record; (2) First_Line, which is a Boolean indicating whether the value in parameter (1) was taken from the first record in the file, in which case it will be True; (3) Current_Field, which is the Positive number of the column from which the value in parameter (1) was taken; and (4) Last_Value, which is a Boolean that is True when the value in parameter (1) is the last value in the current file record/line.

Note that the procedures CSV_Processor_With_Header_Line and CSV_Processor_No_Header_Line are designed to be used as possible actual parameters, or arguments, for formal parameter CSV_Processor. These subprograms appear within the nested generic packages inside this package.
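The shape of such a callback parameter can be sketched as follows. Visit_Values and Show are hypothetical stand-ins, not the real Process_CSV_File or its processors; the parameter names follow the description above and their modes are assumed to be "in".

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Callback_Demo is

   --  Hypothetical stand-in for Process_CSV_File: it only shows the
   --  anonymous, null-excluding access-to-procedure parameter.
   procedure Visit_Values
     (CSV_Processor : not null access procedure
        (Csv_Value     : String;
         First_Line    : Boolean;
         Current_Field : Positive;
         Last_Value    : Boolean))
   is
   begin
      --  .all dereferences the access value to reach the procedure.
      CSV_Processor.all ("Name", True, 1, False);
      CSV_Processor.all ("Age",  True, 2, True);
   end Visit_Values;

   --  Hypothetical value processor with a matching profile.
   procedure Show
     (Csv_Value     : String;
      First_Line    : Boolean;
      Current_Field : Positive;
      Last_Value    : Boolean) is
   begin
      Put_Line
        (Positive'Image (Current_Field) & " " & Csv_Value
         & " first=" & Boolean'Image (First_Line)
         & " last=" & Boolean'Image (Last_Value));
   end Show;

begin
   Visit_Values (Show'Access);
   --  Visit_Values (null);  --  would be rejected: the parameter excludes null
end Callback_Demo;
```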

The first task of Process_CSV_File is to declare a number of objects — constants and variables — that will be used within the subprogram:

Break_Characters — The String parameter The_Separators is converted to a character set to be used in the Index functions for Fixed Strings. This is a convenient way to search for any of a set of characters within a string.
Find_A_Break_Character, Find_A_NonBreak_Character — These renaming declarations help to make the code more understandable by giving task-specific names to the more general names of Inside and Outside.
Pos1, Pos2 — These are the two variables that will be used to frame values within the input record as they are discovered.
Processing_First_Line — This variable is True when the first record in the file is being processed and False otherwise. We initialize this variable to True in the declaration. It could just as easily have been set to True through an assignment statement after begin. Note that this variable has no effect when Header/Title records are not present in the input file.
Buffer_Size — A constant, computed at run-time, that determines the size of the buffer to be allocated (on the stack in this case) for receiving each text record from the file. Note that the initialization value for this constant is a case expression, a construct that is new in Ada 2012.
Buffer — The memory into which will be placed the data from each file record.
Length_Read — This variable will hold the actual length of each record read, since a text file typically has variable-length records.
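A few of these declarations can be sketched as shown below. The object names follow the list above, but Show_Declarations, the default size of 1_024, and the choices in the case expression are assumptions; the post does not show the actual values.

```ada
with Ada.Strings;
with Ada.Strings.Maps;
with Ada.Text_IO;

procedure Declarations_Demo is

   package Maps renames Ada.Strings.Maps;

   --  Hypothetical stand-in for the declarative part of Process_CSV_File.
   procedure Show_Declarations
     (File_Record_Size : Natural;
      The_Separators   : String)
   is
      --  The separator string becomes a character set.
      Break_Characters : constant Maps.Character_Set :=
        Maps.To_Set (The_Separators);

      --  An enumeration literal can be renamed as a parameterless function.
      function Find_A_Break_Character return Ada.Strings.Membership
        renames Ada.Strings.Inside;
      function Find_A_NonBreak_Character return Ada.Strings.Membership
        renames Ada.Strings.Outside;

      --  Case expression (Ada 2012); the choices are assumed.
      Buffer_Size : constant Positive :=
        (case File_Record_Size is
            when 0      => 1_024,
            when others => File_Record_Size);

      Buffer : String (1 .. Buffer_Size);
   begin
      Buffer := (others => ' ');
      Ada.Text_IO.Put_Line
        ("Buffer length:" & Natural'Image (Buffer'Length));
      Ada.Text_IO.Put_Line
        ("Comma is a break character: "
         & Boolean'Image (Maps.Is_In (',', Break_Characters)));
      Ada.Text_IO.Put_Line
        (Ada.Strings.Membership'Image (Find_A_Break_Character) & " / "
         & Ada.Strings.Membership'Image (Find_A_NonBreak_Character));
   end Show_Declarations;

begin
   Show_Declarations (File_Record_Size => 0,  The_Separators => ", ");
   Show_Declarations (File_Record_Size => 80, The_Separators => ";");
end Declarations_Demo;
```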

We next come across a second procedure, Process_A_Line_From_The_CSV_File, which is embedded (nested) within Process_CSV_File. Examination of this procedure will be deferred until after the description of the remainder of Process_CSV_File is completed.

Procedure Body CSV_Processing.Process_CSV_File, part 2/3

The rest of Process_CSV_File exists to read each text file record, perform an initial search for a non-break (or non-separator) character to determine that some value exists within the record, and if so, to call Process_A_Line_From_The_CSV_File. This executable portion is placed after the word begin. The first item of business is to establish a loop, which will exit upon detection of end-of-file. Note that end-of-file can also be detected as an exception (End_Error) by means of an exception handler, in which case we would set up the loop with no conditions, that is, as an infinite loop.

The Get_Line procedure reads a record from the external medium. It is a part of the package Ada.Text_IO. We do not have to give the fully qualified name because Ada.Text_IO appears in a use clause within the package specification for CSV_Processing. The data is placed within Buffer as a String, and Length_Read is set to the number of characters placed into Buffer.
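The read loop can be sketched as follows. To stay self-contained, the example first writes a small scratch file; the file name, its contents, and the buffer size of 80 are all assumptions made for the demonstration.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Read_Loop_Demo is
   File_Name   : constant String := "read_loop_demo.csv";  --  scratch file
   The_File    : File_Type;
   Buffer      : String (1 .. 80);
   Length_Read : Natural;
   Line_Count  : Natural := 0;
begin
   --  Create a small input file so the sketch can run on its own.
   Create (The_File, Out_File, File_Name);
   Put_Line (The_File, "Name,Age");
   Put_Line (The_File, "Ada,36");
   Close (The_File);

   Open (The_File, In_File, File_Name);

   --  Loop until end-of-file; the alternative is an unconditional loop
   --  with a handler for End_Error.
   while not End_Of_File (The_File) loop
      Get_Line (The_File, Buffer, Length_Read);
      Line_Count := Line_Count + 1;
      Put_Line
        (Natural'Image (Line_Count) & ": " & Buffer (1 .. Length_Read));
   end loop;

   Delete (The_File);   --  remove the scratch file again
end Read_Loop_Demo;
```

Only Buffer (1 .. Length_Read) holds data from the current record; the rest of Buffer is left over from earlier reads or is uninitialized.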

Next we determine whether a value exists within the record, and, if so, we save its position in the variable Pos1. Of interest is the combination of arguments to Fixed.Index. Break_Characters specifies the set of characters either to scan for (stop on the first character that is in the set) or to span (skip sequential break characters and stop on the first character that is not in the set). The Test parameter specifies which operation we want performed: Find_A_Break_Character (scan for one of them) or, in this case, Find_A_NonBreak_Character (span them). In this subprogram call, we span all sequential break characters (spaces and commas, for instance) looking for a value (anything other than a space or comma).

The function Index returns zero if no such character can be found. Its result is assigned to Pos1, and the next test treats zero as meaning that there is no value in the record, in which case the record is not processed. A positive result leads us to invoke the nested procedure Process_A_Line_From_The_CSV_File.
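The spanning call can be tried in isolation with a sketch like the one below. It uses the library names Inside/Outside directly, and the separator set ", " and the sample lines are assumptions.

```ada
with Ada.Strings;       use Ada.Strings;
with Ada.Strings.Fixed;
with Ada.Strings.Maps;
with Ada.Text_IO;       use Ada.Text_IO;

procedure Index_Demo is
   package Fixed renames Ada.Strings.Fixed;

   Break_Characters : constant Ada.Strings.Maps.Character_Set :=
     Ada.Strings.Maps.To_Set (", ");

   --  Hypothetical helper: reports where the first value starts, if any.
   procedure Show (Line : String) is
      --  Test => Outside spans the break characters and stops on the
      --  first character that is NOT in the set.
      Pos1 : constant Natural :=
        Fixed.Index (Source => Line,
                     Set    => Break_Characters,
                     Test   => Outside);
   begin
      if Pos1 = 0 then
         Put_Line ("no value in record");
      else
         Put_Line ("first value starts at" & Natural'Image (Pos1));
      end if;
   end Show;

begin
   Show (", ,abc,def");   --  first value starts at 4
   Show (" , ,");         --  no value in record
end Index_Demo;
```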

This subprogram will both set and use the variables Pos1 and Pos2 that are defined in the parent procedure Process_CSV_File, thus having "global" effects. These global effects are limited to Process_CSV_File only, so the risk is acceptable; and anyway, Process_CSV_File only modifies Pos1 and that only happens when a new record is read, starting the process over again. For purists, Pos2 could easily be moved into the declarative part of Process_A_Line_From_The_CSV_File, but they have been placed together to aid in understanding the program.

Process_A_Line_From_The_CSV_File

This procedure is made up of code extracted from Process_CSV_File in order to give that subprogram increased legibility. The pragma Inline asks the compiler to generate the procedure's instructions inline at the point of the call rather than setting up stack storage and calling it as a normal subprogram. It is a request only; the compiler is free to ignore it.
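A minimal sketch of a nested procedure that carries pragma Inline and updates a variable owned by its parent follows. Add and Total are hypothetical names chosen for the example.

```ada
with Ada.Text_IO;

procedure Inline_Demo is
   Total : Natural := 0;   --  owned by the parent, updated by the nested procedure

   procedure Add (Amount : Natural);
   pragma Inline (Add);    --  a request; the compiler may still emit a call

   procedure Add (Amount : Natural) is
   begin
      Total := Total + Amount;   --  "global" effect limited to Inline_Demo
   end Add;
begin
   Add (2);
   Add (3);
   Ada.Text_IO.Put_Line (Natural'Image (Total));   --  prints " 5"
end Inline_Demo;
```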

Two variables local to Process_A_Line_From_The_CSV_File are declared:

Current_Field — Essentially the column number of the field (Value or Header/Title) being processed.
Done_With_This_Line — A Boolean value controlling the iteration of the loop that locates and processes each column of text in turn. It is set to True when the end of the record or line is encountered.

Procedure Body CSV_Processing.Process_CSV_File, part 3/3

Process_The_Values is the loop which iterates until the code determines that it is Done_With_This_Line.

The first thing inside the loop is a search for the end of the value residing at the position specified by Pos1. Pos2 is set to the resulting value. Again, a zero indicates that none of the looked-for characters was found. That means there is no final break character in the record, so code was added to handle this scenario by pointing Pos2 just past the last character read. The result is that in either case Pos2 is set to the last character position of the Value + 1.

When the values of Pos1 and Pos2 are unequal, they indicate that another Value is to be processed, and the procedure passed to Process_CSV_File in the parameter Csv_Processor is called. Note that because the parameter is an access value, the reference to Csv_Processor includes .all to tell the compiler that the procedure and not the access value is desired.

After the value has been processed, we check to see whether the end of the record has been reached. If not, then there may be another value on the line, so we continue the search with another Index call starting one character past the assumed break character (assumed, since Pos2 may point past the end of the Buffer). A return result of zero means we're done with this record; otherwise, we increment the number of the Current_Field and loop to process this new value.

When the end of the record is finally encountered, there are no more values. The subprogram returns to Process_CSV_File with Current_Field and Done_With_This_Line evaporating along the way. These variables, of course, will be reinitialized upon the next call after reading of the next record is complete.
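The framing of values with Pos1 and Pos2 can be sketched as follows. This is a simplified variant, not the original code: it searches slices of the line instead of passing a starting position, uses a while loop in place of the Done_With_This_Line flag, and prints each value rather than calling Csv_Processor. The separator set and sample line are assumptions.

```ada
with Ada.Strings;       use Ada.Strings;
with Ada.Strings.Fixed;
with Ada.Strings.Maps;
with Ada.Text_IO;       use Ada.Text_IO;

procedure Split_Demo is
   package Fixed renames Ada.Strings.Fixed;

   Break_Characters : constant Ada.Strings.Maps.Character_Set :=
     Ada.Strings.Maps.To_Set (", ");

   --  Hypothetical helper: frames each value as Line (Pos1 .. Pos2 - 1).
   procedure Process_A_Line (Line : String) is
      Pos1          : Natural :=
        Fixed.Index (Line, Break_Characters, Outside);
      Pos2          : Natural;
      Current_Field : Positive := 1;
   begin
      while Pos1 /= 0 loop
         --  Scan for the break character that ends the value at Pos1.
         Pos2 := Fixed.Index
           (Line (Pos1 .. Line'Last), Break_Characters, Inside);
         if Pos2 = 0 then
            Pos2 := Line'Last + 1;   --  no final break character
         end if;

         Put_Line
           (Positive'Image (Current_Field) & ": "
            & Line (Pos1 .. Pos2 - 1));

         --  Span the break characters to find the next value, if any.
         --  A null slice (Pos2 past the end) makes Index return zero.
         Pos1 := Fixed.Index
           (Line (Pos2 .. Line'Last), Break_Characters, Outside);
         if Pos1 /= 0 then
            Current_Field := Current_Field + 1;
         end if;
      end loop;
   end Process_A_Line;

begin
   Process_A_Line ("Ada, 36,London");
end Split_Demo;
```

Because sequential break characters are spanned as a group, the comma and space between "Ada" and "36" count as one separator, and empty fields are not reported.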

CSVs_No_Titles_Processing — generic body

This package body actually is the body for the generic package whose specification we have examined in the previous posting. It contains two constructs:

protected body CSVs — containing the entry and subprogram bodies declared in the specification we have previously discussed
procedure CSV_Processor_No_Header_Line — the body of the procedure previously specified

Package Body CSV_Processing.CSVs_No_Titles_Processing, part 1/1

The entry bodies are self-explanatory. Reserve, in addition to its primary function of setting the Reserved variable, also initializes the array Values to the null String. Note that there are entry guards to ensure that each entry call awaits its proper turn.

The procedure Set_Value, in addition to setting a single String value within the array, tests the flag Reserved to verify that the call to this subprogram is legal, that is, that the Reserve entry has been called to block other users of the Values array.

The_Values, likewise, tests the state of the Reserved flag and, if the call is allowed, returns the entire array of String values corresponding with the comma-separated values retrieved from the file record.
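The reserve-then-use protocol can be sketched with a small protected type. The operation names follow the text above; the element type (Unbounded_String), the array bounds, the guard on Release, and the locally declared exception are assumptions, since the actual declarations are not reproduced in this post.

```ada
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO;           use Ada.Text_IO;

procedure Protected_Demo is

   Protected_Type_Not_Reserved : exception;   --  local stand-in

   type Value_Array is array (1 .. 3) of Unbounded_String;   --  assumed shape

   protected type CSVs is
      entry Reserve;
      entry Release;
      procedure Set_Value (Field_Index : Positive; Value : String);
      function The_Values return Value_Array;
   private
      Reserved : Boolean := False;
      Values   : Value_Array;
   end CSVs;

   protected body CSVs is
      --  The guard makes a second caller wait until Release is called.
      entry Reserve when not Reserved is
      begin
         Reserved := True;
         Values   := (others => Null_Unbounded_String);
      end Reserve;

      entry Release when Reserved is
      begin
         Reserved := False;
      end Release;

      procedure Set_Value (Field_Index : Positive; Value : String) is
      begin
         if not Reserved then
            raise Protected_Type_Not_Reserved;
         end if;
         Values (Field_Index) := To_Unbounded_String (Value);
      end Set_Value;

      function The_Values return Value_Array is
      begin
         if not Reserved then
            raise Protected_Type_Not_Reserved;
         end if;
         return Values;
      end The_Values;
   end CSVs;

   The_CSVs : CSVs;

begin
   begin
      The_CSVs.Set_Value (1, "too soon");   --  Reserve has not been called
   exception
      when Protected_Type_Not_Reserved =>
         Put_Line ("rejected: not reserved");
   end;

   The_CSVs.Reserve;
   The_CSVs.Set_Value (1, "Ada");
   declare
      Snapshot : constant Value_Array := The_CSVs.The_Values;
   begin
      Put_Line (To_String (Snapshot (1)));
   end;
   The_CSVs.Release;
end Protected_Demo;
```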

The body of the procedure CSV_Processor_No_Header_Line appears next. It is quite simple, since it does not have to deal with a Header or Title record. The Field_Index is set to be the column number, i.e. the number of the position of the value within the record Buffer. Then Set_Value is called to place the String value into the Values array. Note that, because the protected body of CSVs is the body of a protected type, we cannot call into it directly but must declare an object of the type, which we can then call into. You will recall that we did this in the specification for CSVs_No_Titles_Processing.

The last thing CSV_Processor_No_Header_Line does is to check whether the value is the final value within the record Buffer. If so, it invokes the procedure Process, passing to it the Values array, pulled from the protected object, for final processing. Process should contain the code that interprets the Values as data specific to a domain and/or stores it. As a reminder, the actual subprogram that Process represents is supplied as one of the generic parameters when CSVs_No_Titles_Processing is instantiated.
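The idea of collecting values and handing the finished record to a generic formal procedure can be sketched like this. Line_Collector, Collect, Max_Fields, and Print_Record are hypothetical names, and the protected object is left out to keep the example short; only the formal procedure Process mirrors the text.

```ada
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO;

procedure Generic_Demo is

   type Value_Array is array (Positive range <>) of Unbounded_String;

   generic
      Max_Fields : Positive;                           --  assumed parameter
      with procedure Process (Values : Value_Array);   --  domain-specific step
   package Line_Collector is
      procedure Collect
        (Csv_Value     : String;
         Current_Field : Positive;
         Last_Value    : Boolean);
   end Line_Collector;

   package body Line_Collector is
      Values : Value_Array (1 .. Max_Fields);

      procedure Collect
        (Csv_Value     : String;
         Current_Field : Positive;
         Last_Value    : Boolean) is
      begin
         Values (Current_Field) := To_Unbounded_String (Csv_Value);
         if Last_Value then
            --  The record is complete: pass it on for final processing.
            Process (Values (1 .. Current_Field));
         end if;
      end Collect;
   end Line_Collector;

   --  Hypothetical client procedure supplied at instantiation.
   procedure Print_Record (Values : Value_Array) is
   begin
      for V of Values loop
         Ada.Text_IO.Put ("[" & To_String (V) & "]");
      end loop;
      Ada.Text_IO.New_Line;
   end Print_Record;

   package Printer is new Line_Collector
     (Max_Fields => 3, Process => Print_Record);

begin
   Printer.Collect ("Ada",    1, False);
   Printer.Collect ("36",     2, False);
   Printer.Collect ("London", 3, True);   --  prints [Ada][36][London]
end Generic_Demo;
```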

CSVs_With_Titles_Processing — generic body

The package body CSVs_With_Titles_Processing has much the same code as CSVs_No_Titles_Processing. Some of the code is a little different, plus there is additional code. All of these differences are to support the interpretation of the first record in the file as labels or Headers for the columns of data to be read in subsequent records. Let's take a look at the differences. Note first of all that even though some constructs in this package are the same as those in CSVs_No_Titles_Processing, they can be identified separately, if necessary, through qualification — CSVs_With_Titles_Processing.CSVs.Reserve, for instance.

Package Body CSV_Processing.CSVs_With_Titles_Processing, part 1/3

The protected body CSVs contains modified entry and subprogram bodies plus two new subprograms — procedure Set_Title and function The_Titles. You will recall that the specification also contains the declarations of two new variables — Title_Record_Present, which is set when Set_Title is called, and Titles, which is the actual array variable containing the cross-reference of title by column number, so that the client subprogram Process can know what each column of Values represents.

The entry body Reserve remains the same. The entry body Release now also resets the Boolean variable Title_Record_Present.

Set_Title, after checking to be sure the call is legal, stores The_Title into the array Titles at The_Index, so that it can be looked up later given a particular column-of-values/field number. It then sets Title_Record_Present to alert the Set_Value routine that a record of Titles was indeed seen.

Set_Value also checks the legality of any call to itself. Then, a test is made to determine whether Title_Record_Present is set. The intent here is (1) to be sure that an expected Title record is present or else (2) to let the calling software know — through the exception mechanism in this case — that an illegal call has been made. The actual processing of the Title in this case is quite simple: A declarative block is created in which a constant (The_Title) is declared that is the internal binary (not String) form of the Title. The_Title is then used to determine the specific Values array element to be set. That element is then set to be the String value gleaned from the record Buffer.
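The post does not show how a Title string becomes its internal (enumeration) form. One common way is the 'Value attribute, sketched below under that assumption. The Title_Type literals, the Integer values, and the locally declared exception are all hypothetical.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Title_Demo is

   Invalid_Csv_Header : exception;             --  local stand-in

   type Title_Type is (Name, Age, City);       --  hypothetical titles
   type Title_Array is array (Positive range <>) of Title_Type;

   Titles : Title_Array (1 .. 3);               --  title by column number
   Values : array (Title_Type) of Integer := (others => 0);

   procedure Set_Title (The_Index : Positive; The_Title : String) is
   begin
      --  'Value converts the text to the enumeration value and raises
      --  Constraint_Error when the text matches no literal.
      Titles (The_Index) := Title_Type'Value (The_Title);
   exception
      when Constraint_Error =>
         raise Invalid_Csv_Header with "unknown title: " & The_Title;
   end Set_Title;

begin
   Set_Title (1, "City");
   Set_Title (2, "Name");
   Set_Title (3, "Age");

   --  A value read from column 1 lands in the element chosen by its title.
   declare
      The_Title : constant Title_Type := Titles (1);
   begin
      Values (The_Title) := 42;
   end;
   Put_Line
     (Title_Type'Image (Titles (1)) & " =>" & Integer'Image (Values (City)));

   begin
      Set_Title (1, "Colour");
   exception
      when Invalid_Csv_Header =>
         Put_Line ("invalid header rejected");
   end;
end Title_Demo;
```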

Package Body CSV_Processing.CSVs_With_Titles_Processing, part 2/3

The function The_Titles again checks the legality of the call and then returns the array of Titles arranged in order of their appearance within the header record of the file.

The body of the procedure CSV_Processor_With_Header_Line appears next. This subprogram, or a similar user-written subprogram, must be specified in place of CSV_Processor_No_Header_Line to the call of Process_Csv_File in order to process a CSV file in which the first record contains the list of break character-separated field/column Headers/Titles. Its sole job is to process the Titles from the first record of the file and then to pass the Values from subsequent records straight through to CSV_Processor_No_Header_Line for handling.

Package Body CSV_Processing.CSVs_With_Titles_Processing, part 3/3

The body of procedure CSV_Processor_No_Header_Line is virtually identical to the one in the package CSVs_No_Titles_Processing. The only difference occurs in the initialization of the constant Field_Index, which is used to index the array Values. In order to increase the flexibility of using this program we are allowing the position numbers of the literals specified in the enumeration Title_Type parameter to start at an arbitrary number, rather than the default zero. To handle this situation we use the expression function Title_Field_Index appearing in the specification to perform the calculation. The rest of CSV_Processor_No_Header_Line is identical to the one in CSVs_No_Titles_Processing.
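The actual body of Title_Field_Index is not reproduced here. A sketch of how such an expression function could normalize position numbers follows; the formula and the types All_Titles and Title_Type are assumptions. A subtype is used so that the first position number is 1 rather than 0.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Field_Index_Demo is

   type All_Titles is (Id, Name, Age, City);
   --  The first literal of this subtype has position number 1, not 0.
   subtype Title_Type is All_Titles range Name .. City;

   --  Expression function (Ada 2012): maps the first title to index 1
   --  regardless of where its position numbers start.
   function Title_Field_Index (Title : Title_Type) return Positive is
     (Title_Type'Pos (Title) - Title_Type'Pos (Title_Type'First) + 1);

begin
   for T in Title_Type loop
      Put_Line
        (Title_Type'Image (T) & " ->"
         & Positive'Image (Title_Field_Index (T)));
   end loop;
end Field_Index_Demo;
```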

Summary

In this posting we have examined the body of package CSV_Processing, which has a number of components and sub-components. All these components work together to complete the non-domain-specific processing of the comma- or break-separated-values (CSV) file. The knowledge regarding the meaning of the Titles and/or Values read from the text file is embodied within the subprogram supplied for the generic formal procedure Process. The next and final posting for the program Demo_CSVs describes a very simple test procedure Process_CSV_Value_Array, which is used for the demonstration.

Thursday, October 13, 2011

Comma-Separated Values, cont. (csv_exceptions.ads, csv_processing.adb)
