How do I determine the excel file format?

If you do not go into details, you come across files .xlsx which are named as .xls and vice versa, .xls that are named as .xlsx. There are also those that correspond to their extensions.

In general, you need to go through the files and rename them correctly, but I did not find how to determine the file format without relying on the extension in its name

 2
Author: Fangog, 2019-07-23

2 answers

By the cap. The new excel xlsx files start with the PK signature of the zip 50 4B. With such a signature, there can be not only xlsx, but also docx and zip archives, you need to read the list of packaged files for clarification.

Files .xls-there are two types, "new" xls (usually these are found). They have the signature D0 CF 1 E0 A1 B1 1A E1 (in fact, this is the signature of the office document, for more detailed identification, i.e. doc and xls will have the same signature, you need to look for the "main" attachment in biff wrapper, and check the signature there, or by the list of biff wrapper files), and all the old ones (before version 7, the 6th one was not in use) have the signature 09 08

For example, you can do this

 byte[] xlsDoc = f(); // Загрузили документ в байты
 if ((xlsDoc[0]==0x50) && (xlsDoc[1]==0x4B)) {
     // Новый ексель
      } 
 else if ((xlsDoc[0]==0xD0) && (xlsDoc[1]==0xCF) /*...*/) {
    // Офис 97
 }
 else if ((xlsDoc[0]==9) && (xlsDoc[1]==8)) {
    // Старый одностраничный
 } else {
    // В других случаях - думаю что мусор
 }

Since the file can be large, it is better to subtract the first 16 bytes, and analyze them, and then return Stream.Position to zero.

Even with excel, I came across jokes

  • xml <?xml - file with excel (if you save the file as xml in excel)
  • html excel <html - often I advise you to make such a file on the Internet javascipt-om php and so on, - table c <td> <tr> tags and specific postings.

References to signatures

  1. http://uk.wikipedia.org/wiki/%D0%A1%D0%B8%D0%B3%D0%BD%D0%B0%D1%82%D1%83%D1%80%D0%B0_%D1%84%D0%B0%D0%B9%D0%BB%D1%83_(%D0%BF%D0%B5%D1%80%D0%B5%D0%BB%D1%96%D0%BA)
  2. http://www.filesignatures.net/index.php?page=search&search=XLS&mode=EXT
  3. http://www.filesignatures.net/index.php?page=search&search=XLSX&mode=EXT
  4. http://en.wikipedia.org/wiki/Microsoft_Excel
 3
Author: nick_n_a, 2019-07-23 09:43:26

Alternatively, in Windows, you can use the Structured Storage API to determine whether a file is a valid XLS file. According to the specification, the XLS format is a Structured Storage format file that contains a stream named Workbook.

MS-XLS: Excel Binary File Format Structure, paragraph 2.1.2:

A file of the type specified by this document consists of storages and streams as specified in [MS-CFB]...

A workbook MUST contain the workbook stream...

You can use the following code to check for XLS, based on this rule:

using System;
using System.Collections;
using System.Runtime.InteropServices;

namespace ConsoleApplication1
{
    class Program
    {
        [DllImport("ole32.dll")]
        static extern int StgOpenStorageEx(
            [MarshalAs(UnmanagedType.LPWStr)] string pwcsName,
            uint grfMode,
            uint stgfmt,
            uint grfAttrs,
            IntPtr pStgOptions,
            IntPtr reserved2,
            [In] ref Guid riid,
            out IStorage ppObjectOpen);

        const uint STGM_DIRECT = 0;
        const uint STGM_READ = 0;        
        const uint STGM_SHARE_EXCLUSIVE = 0x10;        
        const uint STGFMT_STORAGE = 0;        
        const uint PID_FIRST_USABLE = 2;
        const uint STGC_DEFAULT = 0;               

        [Guid("0000000B-0000-0000-C000-000000000046")]
        [InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
        public interface IStorage
        {
            void a();
            
            [PreserveSig]
            int OpenStream(string pwcsName, 
                IntPtr reserved1, 
                uint grfMode, 
                uint reserved2, 
                [MarshalAs(UnmanagedType.Interface)] out object ppstm);
            
            void CreateStorage(string pwcsName, uint grfMode,  uint reserved1, uint reserved2, out IStorage ppstg);            
            void OpenStorage(string pwcsName, IStorage pstgPriority,  uint grfMode,  IntPtr snbExclude, uint reserved, out IStorage ppstg);            
            void CopyTo( uint ciidExclude, Guid[] rgiidExclude,  IntPtr snbExclude, IStorage pstgDest);            
            void MoveElementTo(string pwcsName, IStorage pstgDest, string pwcsNewName, uint grfFlags);            
            void Commit( uint grfCommitFlags);            
            void Revert();            
            void b();            
            void DestroyElement(string pwcsName);            
            void RenameElement(string pwcsOldName, string pwcsNewName);            
            void c();            
            void SetClass( ref Guid clsid);            
            void SetStateBits( uint grfStateBits,  uint grfMask);            
            void d();
        }

        public static bool IsXLS(string path)
        {            
            IStorage pStorage = null;
            object o = null;
            int hr;            
            Guid guidStorage = typeof(IStorage).GUID;            

            try
            {
                //открываем файл
                hr = StgOpenStorageEx(path, STGM_READ | STGM_SHARE_EXCLUSIVE, STGFMT_STORAGE,
                    0, IntPtr.Zero, IntPtr.Zero, ref guidStorage, out pStorage);

                if (hr != 0) return false; //NOT Structured storage file                 
                       
                //открываем поток
                hr = pStorage.OpenStream("Workbook", IntPtr.Zero, STGM_DIRECT | STGM_READ | STGM_SHARE_EXCLUSIVE, 0, out o);
                                            
                return hr == 0;
            }
            finally
            {
                //освобождение ресурсов                
                if (pStorage != null) Marshal.ReleaseComObject(pStorage);
                if (o != null) Marshal.ReleaseComObject(o);
            }
        }       
    }  
}

Since the XLSX file is a ZIP archive of a certain structure, you can use the same logic for checking, and use any library for working with ZIP archives (in .NET 4.5+ there is a built-in System. IO. Compression).

 2
Author: MSDN.WhiteKnight, 2020-06-12 12:52:24