Bulk File Importer

From Sense/Net Wiki
  • 100%
  • 6.0.8
  • Enterprise
  • Community
  • Planned

Overview

If you need to import a vast amount of data quickly into your Sense/Net Content Repository, you can use the Bulk File Importer tool. This console application imports whole folder structures, including files, from the file system into the Content Repository. It keeps the source structure and indexes content during the import, so the files are searchable immediately afterwards. The Bulk File Importer is optimized to import a large number of files in the shortest possible time. This feature is currently available as a separate package for our enterprise customers.

Details

How it works

The Bulk File Importer is a console application that connects to the Content Repository of an existing Sense/Net installation and imports files from the given source path to the specified target location. It also indexes content during the import, so the imported files are searchable in the target Sense/Net installation immediately after the import. The Sense/Net website must be shut down during the import.

Our measurements

We have conducted a number of tests importing a real-life data set that includes a vast number of files from our own intranet installations.

Our results

Metric          Value
sec             32 026
hour            8.9
folder          299 284
file            1 554 405
total count     1 853 689
MB              328 430
cps             57
MB/s            10
avg size (B)    221 553
commit          5 000
buffer (kB)     10 240
thread          100
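
The derived columns in the results above can be cross-checked from the raw counts. The following is a minimal sketch, not part of the importer. It assumes that "cps" means content items (folders plus files) per second and that "MB" is counted in units of 1024 * 1024 bytes. Both readings are assumptions, chosen because they reproduce the published figures.

```python
# Raw figures taken from the results above.
elapsed_sec = 32_026
folders = 299_284
files = 1_554_405
imported_mb = 328_430  # assumption: 1 MB = 1024 * 1024 bytes

# Derived figures.
total_items = folders + files                        # "total count"
hours = elapsed_sec / 3600                           # "hour"
items_per_sec = total_items / elapsed_sec            # "cps" (assumed meaning)
mb_per_sec = imported_mb / elapsed_sec               # "MB/s"
avg_file_bytes = imported_mb * 1024 * 1024 / files   # "avg size (B)"

print(f"total count : {total_items}")
print(f"hours       : {hours:.1f}")
print(f"items/sec   : {items_per_sec:.2f}")
print(f"MB/sec      : {mb_per_sec:.2f}")
print(f"avg size (B): {avg_file_bytes:.0f}")
```

Under these assumptions the computed values agree with the published ones once the rates are truncated to whole numbers.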

Our test environment

Hardware:

  • IBM X3550 rack mount server:
    • 2 x Intel Xeon E5420 2.5 GHz processor
    • 32GB Memory
    • 4 x 146GB SAS HDD (Raid-10) LuceneIndex
  • HP MSA100 SAN (2Gbit Fibre Channel connection)
    • 4x300GB SCSI HDD (Raid-10) SOURCE: documents to import
    • 4x300GB SCSI HDD (Raid-10) DESTINATION: SQL database disk

Software:

  • Windows Server 2008R2 64-bit with Service Pack 1 and all updates installed
  • Microsoft SQL Server 2008R2 64-bit Service Pack 1 and Cumulative update package 4 installed
    • Boost SQL server priority enabled

Recommended test environment

Hardware:

  • 2 x Intel Xeon E55xx series processor
  • >32GB Memory
  • SAS Raid10 or Raid5 (with minimum of 4 disks) volumes as source
  • SAS Raid10 (with minimum of 4 disks, 8 disks are recommended) as destination SQL database disk
  • SAS Raid-1 (4 SAS disk Raid10 is recommended if possible) for LuceneIndex destination

If the above volumes are on a SAN, a 4 or 8 Gbit Fibre Channel connection is recommended. The speed of the whole import process depends on the performance of the source and destination volumes.

Example/Tutorials

The following description shows how to create a clean install and start bulk import right away:

=== Configure database ===
1. Create a new, empty database.
2. Set the recovery model to Simple (Database/Properties/Options). Do this only for the duration of the import, to make it faster.
3. Set file autogrowth to at least 500MB (Database/Properties/Files/Autogrowth/File growth).
4. Restart SQL Server so that its memory usage drops to a minimum.
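
Steps 2 and 3 can also be done with T-SQL instead of the Properties dialog. The sketch below only builds the two statements, so you can review them and run them with the SQL client of your choice. The function name, the database name and the logical file name are hypothetical placeholders, not values from a Sense/Net installation.

```python
def import_prep_sql(database: str, logical_file: str, growth_mb: int = 500) -> list[str]:
    """Build T-SQL for the recovery model (step 2) and autogrowth (step 3)."""
    if growth_mb < 500:
        raise ValueError("this guide recommends an autogrowth of at least 500 MB")
    # Quote the identifier and the string literal for T-SQL.
    db = "[" + database.replace("]", "]]") + "]"
    name = logical_file.replace("'", "''")
    return [
        f"ALTER DATABASE {db} SET RECOVERY SIMPLE;",
        f"ALTER DATABASE {db} MODIFY FILE (NAME = N'{name}', FILEGROWTH = {growth_mb}MB);",
    ]


# Hypothetical names; replace them with your own database and data file.
for statement in import_prep_sql("ImportDb", "ImportDb_data"):
    print(statement)
```

After the import, the recovery model can be set back the same way by replacing SIMPLE with the required level.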
 
 
=== Configure executables ===
1. Set datasource and initial catalog in the following files:
 
	 - Deployment\InstallSenseNet.bat
	 - WebSite\Web.config
	 - WebSite\bin\import.exe.config
	 - WebSite\bin\indexpopulator.exe.config
	 - TurboImport\TurboImport.exe.config
 
2. Set the path of the source files in TurboImport\TurboImport.exe.config:
 
	<add key="SourcePath" value="\\reposql01\50000"/>
 
3. Set the target Content Repository path in TurboImport\TurboImport.exe.config:
 
	<add key="TargetPath" value="/Root/Import"/>
 
4. Set the index directory path in TurboImport\TurboImport.exe.config:
 
	<add key="IndexDirectoryPath" value="..\WebSite\LuceneIndex"/>
 
(this must point to the same index directory that the website uses)
 
5. Set the maximum thread count in TurboImport\TurboImport.exe.config:
 
	<add key="MaxThreads" value="100"/>
 
(the default value is fine for an 8-core installation. Too many threads can cause excessive overhead on machines with fewer cores).
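 
Before starting a long import, it can help to verify that the four settings from steps 2 to 5 are all present. The following is a minimal sketch using only the Python standard library. It assumes the keys sit under configuration/appSettings, as is usual for .NET config files; the source does not confirm the exact file layout. The function name and the sample document are illustrative.

```python
import xml.etree.ElementTree as ET

# Keys named in steps 2-5 above.
REQUIRED_KEYS = ("SourcePath", "TargetPath", "IndexDirectoryPath", "MaxThreads")


def read_import_settings(config_xml: str) -> dict[str, str]:
    """Return the required importer settings, or raise ValueError if any is unusable."""
    root = ET.fromstring(config_xml)
    # Assumption: settings are <add key="..." value="..."/> under <appSettings>.
    settings = {
        add.get("key"): add.get("value", "")
        for add in root.findall("./appSettings/add")
    }
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    if int(settings["MaxThreads"]) < 1:
        raise ValueError("MaxThreads must be a positive integer")
    return {key: settings[key] for key in REQUIRED_KEYS}


# Sample built from the values shown in the steps above.
SAMPLE = r"""<configuration>
  <appSettings>
    <add key="SourcePath" value="\\reposql01\50000"/>
    <add key="TargetPath" value="/Root/Import"/>
    <add key="IndexDirectoryPath" value="..\WebSite\LuceneIndex"/>
    <add key="MaxThreads" value="100"/>
  </appSettings>
</configuration>"""

for key, value in read_import_settings(SAMPLE).items():
    print(f"{key} = {value}")
```

To check a real file, read its contents and pass them to read_import_settings.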
 
 
=== Install Sense/Net ===
Go to Deployment and execute 
 
	InstallSenseNet.bat
 
This will create the database and import the startup Content Repository.
 
 
=== Check Sense/Net installation by running it ===
1. Create a new IIS website and point it to the WebSite folder.
2. Check that the portal runs, which confirms that the configuration settings above are correct.
3. Stop the website in IIS.
4. Stop the IIS service if SQL Server and IIS are on the same machine (to release memory).
 
 
=== Import folders & files ===
Do this only after the website has been stopped.
Go to Deployment and execute 
 
	TurboImport.bat
 
This imports all files under the path given in the SourcePath config element. The importer logs its current status to the console and also creates the following files:
 
	detailedlog.csv			- a detailed information log of all imported files
	errorlog.txt			- an error log containing exception messages that occurred during import
	importlog.txt			- an excerpt of status infos similar to the console output
 
 
=== Start Sense/Net website ===
Do this after 'Import folders & files' is finished.
 
1. Set database recovery mode back to the required level.
2. Check the following settings:
 - for a successful start after the import, the Content Repository's invalid name characters pattern should allow everything except the '*' character. This might not be restrictive enough for a real-life ECMS repository. You can change this setting with the
 
	<add key="InvalidNameCharsPattern" value="[*]" />
 
	element in web.config. TurboImport imports everything regardless of this setting.
 
 - the RestoreIndex feature should be turned off for the website, because the possibly huge LuceneIndex will not be stored in the database. When using multiple web nodes, copy the contents of the LuceneIndex folder to each node manually. This option can be configured with the
 
	<add key="RestoreIndex" value="false"/>
 
element in web.config.
 
3. Start the website (and the IIS service if it was stopped), then check the imported files in the Content Repository.
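
To see what the InvalidNameCharsPattern value "[*]" from step 2 accepts and rejects, the sketch below applies the same pattern to a few sample names. Sense/Net evaluates the pattern with the .NET regular expression engine; Python is used here only for illustration, and a simple character class like this one behaves the same in both. How the portal applies the pattern (a match anywhere in the name makes it invalid) is an assumption, and the function name and sample names are illustrative.

```python
import re

# Same pattern as the web.config value shown above: a character class
# containing only the literal '*' character.
INVALID_NAME_CHARS = re.compile(r"[*]")


def is_valid_name(name: str) -> bool:
    """Assumption: a name is invalid if the pattern matches anywhere in it."""
    return INVALID_NAME_CHARS.search(name) is None


for name in ("report 2011.docx", "a&b#c.txt", "draft*.txt"):
    print(name, "->", "valid" if is_valid_name(name) else "invalid")
```

A stricter pattern would list more characters inside the brackets, at the cost of possibly rejecting names that TurboImport has already imported.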
