Refine Detection of Text Files

During text detection, File Content Extraction analyzes the start and end of the document. It compares the usage of printable ASCII characters to other types of characters to determine whether the document is a plain text file.

Depending on the type of documents you are working with, the default settings might not provide the desired level of accuracy. You can use configuration flags to change the amount of data to read at the end of a file, the percentage of non-ASCII characters permitted in a text file, whether to allow consecutive NULL bytes in a text file, and whether to use or ignore the file extension to determine the document format.

Change the Amount of File Data to Read

During format detection, File Content Extraction reads characters from the beginning and end of a file. Large text files might contain many irrelevant characters at the end of a file, so File Content Extraction might not accurately detect the file format. You can set a configuration flag to increase the amount of data to read from the end of a file during detection.

To change the amount of data to read during detection

  • In the formats_e.ini file, set the following flag in the detection_flags section:

    [detection_flags]
    non_ascii_chars_end_block_size=kB

    where kB is the number of kilobytes to read from the end of the file, from 0 to 10.

    NOTE: This flag is ignored when the value exceeds the input file size.
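As a rough illustration of why the tail block size matters, the following Python sketch samples the start and end of a file. It is not the File Content Extraction implementation: the function name sample_head_and_tail, the 1 kB head size, and the choice to skip the tail read when the requested size exceeds the file size are all assumptions made for this example.

```python
import os

def sample_head_and_tail(path, tail_kb, head_kb=1):
    """Return (head, tail) byte blocks sampled from a file.

    Hypothetical helper: head_kb and the skip-on-oversize behavior are
    assumptions for illustration, not documented product behavior.
    """
    if not 0 <= tail_kb <= 10:
        raise ValueError("tail_kb must be between 0 and 10")
    size = os.path.getsize(path)
    tail_len = tail_kb * 1024
    with open(path, "rb") as f:
        head = f.read(head_kb * 1024)
        if tail_len == 0 or tail_len > size:
            # Nothing requested, or the request is larger than the file:
            # this sketch returns an empty tail block.
            return head, b""
        f.seek(size - tail_len)
        tail = f.read(tail_len)
    return head, tail
```

A larger tail_kb value makes the sampled tail block cover more of the file, so trailing irrelevant characters carry less weight in any statistic computed over the sample.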

Change the Percentage of Allowed Non-ASCII Characters

File Content Extraction detects a document as a text file only if it contains a small number of control characters. If you expect your files to contain higher levels of control characters, changing the default percentage might increase detection accuracy.

File Content Extraction uses this setting only if the analyzed data contains a high enough proportion of printable ASCII characters. For example, a document that consists mostly of control characters is not detected as a text file even if you set this percentage to 100.

To change the percentage of non-ASCII characters allowed in text files

  • In the formats_e.ini file, set the following flag in the detection_flags section:

    [detection_flags]
    non_ascii_chars_in_text=N

    where N is the percentage of non-ASCII characters to allow in text files. Files that contain a lower percentage of non-ASCII characters than N are detected as text files. The default value is 10.
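The threshold comparison can be sketched in Python as follows. This is a simplified illustration, not the product's algorithm: the names non_ascii_percentage and looks_like_text are hypothetical, the sketch treats every byte outside printable ASCII (plus tab, line feed, and carriage return) as non-ASCII, and it omits the preliminary check on the proportion of printable ASCII characters.

```python
# Bytes counted as printable ASCII in this sketch (an assumption):
# 0x20 through 0x7E, plus tab, line feed, and carriage return.
PRINTABLE = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

def non_ascii_percentage(data):
    """Return the percentage of bytes in data that are not printable ASCII."""
    if not data:
        return 0.0
    other = sum(1 for b in data if b not in PRINTABLE)
    return 100.0 * other / len(data)

def looks_like_text(data, non_ascii_chars_in_text=10):
    """Return True if the non-ASCII percentage is lower than the threshold."""
    return non_ascii_percentage(data) < non_ascii_chars_in_text

sample = b"a" * 90 + b"\xff" * 10          # exactly 10% non-ASCII bytes
print(looks_like_text(sample))             # False: 10 is not lower than 10
print(looks_like_text(sample, 25))         # True: 10 is lower than 25
```

Under these assumptions, a sample sitting exactly at the threshold is rejected, because only a percentage lower than N qualifies.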

Allow Consecutive NULL Bytes in a Text File

By default, if a document contains consecutive NULL bytes, it is not detected as text. Depending on the type of files that you are working with, changing the default might increase detection accuracy.

To allow consecutive NULL bytes in text files

  • In the formats_e.ini file, set the following flag in the detection_flags section:

    [detection_flags]
    ascii_allow_null_bytes=1

    The default value is 0 (do not allow consecutive NULL bytes).
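The effect of this flag can be pictured with a short Python sketch. The function names are hypothetical, and the sketch assumes that "consecutive" means two or more adjacent NULL bytes; the actual rule used by File Content Extraction might differ.

```python
def has_consecutive_nulls(data):
    """Return True if data contains at least two adjacent NULL bytes."""
    return b"\x00\x00" in data

def passes_null_check(data, ascii_allow_null_bytes=0):
    """Return True if the NULL-byte rule does not rule out a text file.

    With the flag at 0 (the default), consecutive NULL bytes fail the check.
    With the flag at 1, they are allowed.
    """
    return bool(ascii_allow_null_bytes) or not has_consecutive_nulls(data)

print(passes_null_check(b"ab\x00\x00cd"))      # False with the default of 0
print(passes_null_check(b"ab\x00\x00cd", 1))   # True when the flag is 1
```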