Optimization

 

By default, Stat/Transfer will optimize data whenever it is necessary. This ensures that text variables are not truncated and that numeric output data is stored as compactly as possible.

 

For some output formats, such as fixed format text, this optimization step is absolutely necessary.    We strongly recommend that you just let Stat/Transfer do its work and not attempt to override or modify the optimization process. 

 

         

You should only modify the options which control optimization if:

 

  1.  you know what you are doing AND...

  2. your input dataset is so large that the time you may save is worth the risk of corrupted or less-than-optimal data.

 

That said, there are options that allow you to control the optimization process.

 

How Stat/Transfer Optimizes

 

The Stat/Transfer optimization process makes an additional pass through the data before transferring it.  For each variable, in each selected case, running statistics are accumulated. The following calculations are performed:

 

Single or Double Width?

For numeric variables, the minimum and maximum values are recorded. Each value is then tested to see whether it has a fractional part.  If so, it is tested to see if it can be represented in a single precision floating point number or if it must remain in double.
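The test described above can be sketched as follows. This is a minimal illustration in Python, not Stat/Transfer's actual code: the helper names `fits_in_single` and `scan_numeric` are hypothetical, and treating `None` as a missing value is an assumption.

```python
import struct


def fits_in_single(value: float) -> bool:
    """True if value is unchanged after a round trip through a 32-bit float."""
    try:
        packed = struct.pack("<f", value)
    except OverflowError:  # magnitude too large for single precision
        return False
    return struct.unpack("<f", packed)[0] == value


def scan_numeric(values):
    """Accumulate running statistics for one numeric variable (hypothetical helper)."""
    stats = {"min": None, "max": None, "has_fraction": False, "needs_double": False}
    for v in values:
        if v is None:  # assumption: None stands for a missing value and is skipped
            continue
        stats["min"] = v if stats["min"] is None else min(stats["min"], v)
        stats["max"] = v if stats["max"] is None else max(stats["max"], v)
        if not float(v).is_integer():  # the value has a fractional part
            stats["has_fraction"] = True
            if not fits_in_single(v):
                stats["needs_double"] = True
    return stats
```

A value such as 2.5 is exactly representable in single precision, while 0.1 is not, so a variable containing 0.1 would have to remain in double under this test.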

 

Fixed Width

For output formats that are essentially fixed-width, formatted text, it is necessary to determine the maximum width of each numeric variable. To do this, each value is formatted in memory and the width of the result is measured. This is only done for output formats that require it, but for those formats it is mandatory.
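As a rough illustration of this width calculation, the sketch below formats each value into a string and keeps the longest length. The function name and the fixed number of decimal places are assumptions made for the example. The formatting rules Stat/Transfer really applies are not described here.

```python
def max_formatted_width(values, decimals=2):
    """Return the widest formatted representation among the values.

    `decimals` is an assumed formatting choice for this sketch.
    """
    width = 0
    for v in values:
        if v is None:  # assumption: None stands for a missing value
            continue
        text = f"{v:.{decimals}f}"  # format the number in memory
        width = max(width, len(text))
    return width
```

For the values 1.5, -123.25 and 42.0 with two decimals, the widest formatted string is "-123.25", so the column would need seven characters.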

 

Transcoding Strings

String variables are transcoded into the output encoding and compared to their stored maximum encoded length, which is adjusted as necessary.  Converting from one character set to another can change the string width. For example: if a single-byte character set (such as Greek or Russian) is transcoded to UTF-8, the output string width could be twice as long as the input.  Thus, when there are any strings in the input file, optimization is usually needed.
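The effect of transcoding on width can be seen in a short Python sketch. The function name is hypothetical, and ISO 8859-7 is used here only as an example of a single-byte Greek character set.

```python
def max_encoded_width(raw_values, source_encoding, target_encoding="utf-8"):
    """Return the longest byte length of the values after transcoding."""
    width = 0
    for raw in raw_values:
        text = raw.decode(source_encoding)  # bytes in the input encoding
        width = max(width, len(text.encode(target_encoding)))
    return width


# A five-letter Greek word stored in a single-byte character set.
greek = "Αθήνα".encode("iso8859_7")
print(len(greek))                                # 5 bytes in the input
print(max_encoded_width([greek], "iso8859_7"))   # 10 bytes in UTF-8
```

Each Greek letter takes one byte in the single-byte encoding and two bytes in UTF-8, so the stored width doubles.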

 

Final Step

After all cases have been read, the target types for integers are determined based on their maximum and minimum values. For any variable that is not integral, the target type will be either double or single precision floating point. The width of strings in the output file will be determined by their maximum width, as will the width of numbers that are written to fixed format ASCII file types.
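The final type decision might look something like the following sketch. The type names and integer ranges are assumptions chosen for illustration. Real output formats have their own type sets and often reserve some values for missing data.

```python
# Assumed integer types, smallest first: (name, lowest value, highest value).
INTEGER_TYPES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
]


def choose_target_type(minimum, maximum, has_fraction, needs_double):
    """Pick the most compact type that can hold every observed value."""
    if has_fraction:
        return "double" if needs_double else "float"
    for name, lowest, highest in INTEGER_TYPES:
        if lowest <= minimum and maximum <= highest:
            return name
    return "double"  # assumption: fall back to double for very large integers
```

With these assumed ranges, a variable whose values run from 0 to 100 would be stored as `int8`, while one running from -5 to 40000 would need `int32`.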

 

Optimization must be performed:

 

  • When the input file format gives no information about the widths of variables (e.g. CSV, Excel) and the output format needs the widths of its variables. This includes almost everything except .csv files and worksheets.

  • When the output file requires the calculation of the formatted width of numeric variables (fixed format ASCII, dBASE).

 

If optimization is required, you will be able to set options to suppress it, but they will not be honored.  Messages about optimization can be found in the log if the log-level is set to “information and errors” or higher.
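The two conditions above, together with the rule that a request to suppress optimization is not honored when optimization is required, can be summarized in a small sketch. The function and parameter names are hypothetical and do not correspond to any Stat/Transfer option names.

```python
def optimization_runs(suppress_requested,
                      input_has_widths,
                      output_needs_widths,
                      output_needs_formatted_numbers):
    """Return True if the optimization pass will run."""
    required = (
        (not input_has_widths and output_needs_widths)
        or output_needs_formatted_numbers
    )
    if required:
        return True  # a request to suppress optimization is ignored
    return not suppress_requested
```

For example, when reading CSV (no width information) and writing fixed format ASCII, the pass runs even if suppression was requested.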

 

If Stat/Transfer detects any variable truncation when optimization has been turned off or limited, the transfer will be halted with an error message.

 

Guide to Optimization Options

 

Number of records to optimize

The possible values for this are:

 

All (the default) – Optimization will be performed on all cases whether it is strictly necessary or not. This is the recommended setting. For datasets of reasonable size, it takes very little time and results in an output file that is guaranteed to be correct and as small as possible.

 

None – This is not recommended and should only be used if your dataset is extremely large and you are confident that optimization is not necessary.

 

Number – This setting is a compromise and is sometimes sensible if your dataset is very large and you are confident that the first n cases will fairly represent your data.
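The risk of optimizing on only the first n cases can be shown with a short sketch. The function name is hypothetical, and measuring string width as UTF-8 byte length is an assumption made for the example.

```python
from itertools import islice


def string_width(values, limit=None):
    """Return the maximum UTF-8 width over all values, or over the first `limit`."""
    sample = values if limit is None else islice(values, limit)
    return max((len(v.encode("utf-8")) for v in sample), default=0)


cities = ["NY", "LA", "San Francisco"]
print(string_width(cities))           # 13: all cases scanned
print(string_width(cities, limit=2))  # 2: the long value in case 3 is missed
```

If the first n cases do not represent the data, the computed width is too small. This is the situation in which a truncation would be detected and the transfer halted with an error.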

 

Preserving Widths

If the Number of Records to Optimize is set to “all” or to a specific number, you can exert some control over the widths of both string and numeric variables.

 

Preserve String Widths if Possible

Normally, when optimizing, Stat/Transfer will calculate the minimum string width for each variable. This ensures that the output file will be as small as possible. This option allows you to maintain the input string width, which is particularly useful when combining different files. If this option is checked, Stat/Transfer will use the input width as the output width if it can do so without losing data. More precisely, the variable width will be the larger of the transcoded string width and the input width, so the output width can be greater than, but never less than, the input width. For plain ASCII data, it will be the same as the input width.
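Since the output width can be greater than, but not less than, the input width, the rule amounts to taking the larger of the two widths. A minimal sketch, with a hypothetical function name:

```python
def output_string_width(input_width, transcoded_width, preserve):
    """Return the width used for a string variable in the output file."""
    if preserve:
        # Keep the input width unless the transcoded data needs more room.
        return max(input_width, transcoded_width)
    return transcoded_width  # default: the minimum width that fits the data
```

With an input width of 20 and data that needs only 8 bytes, the result is 20 when the option is checked and 8 otherwise. If transcoding widens the data to 24 bytes, the result is 24 in both cases.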

 

Preserve Numeric Widths if Possible

This option is useful if you are reading text data or data which comes from a format which preserves the width originally used to read the data (SPSS is a prime example). If you are writing text data or want a variable to be formatted with the same width as was present in the input, check this option. If, after optimization, the variable is wider than the input width, the variable will be widened to prevent loss of data.
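For text output, the effect of this option might be pictured as follows. The sketch is illustrative only: the function name, the two-decimal formatting and the right-justified padding are all assumptions.

```python
def format_column(values, input_width, preserve, decimals=2):
    """Format one numeric column as fixed-width text."""
    texts = [f"{v:.{decimals}f}" for v in values]
    needed = max((len(t) for t in texts), default=0)
    # Widen beyond the input width only when the data would not fit.
    width = max(input_width, needed) if preserve else needed
    return [t.rjust(width) for t in texts]


print(format_column([1.5, 22.25], input_width=8, preserve=True))
# ['    1.50', '   22.25']
print(format_column([1.5, 22.25], input_width=8, preserve=False))
# [' 1.50', '22.25']
```

When the formatted data is wider than the input width, the column is widened so that no data is lost.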

 

Special Option for Stat/Transfer Schemas

If you want total control over your variables and you know what you are doing, you can put the keyword

 

NO-OPT

 

in your Stat/Transfer schema. If it is present, your data will not be optimized, no matter what output format is chosen.