Text extraction of document files
I have this script that pulls text out of .docx, .doc and .pdf files and uploads that text to an Azure SQL Server, so that users can search the contents of those documents without using Windows Search / Azure Search.
The filenames are all in the following format:
firstname surname - id.extension
The id in the filename is from an outdated database, though; the new database that I am updating holds both IDs (newID and oldID).
COLUMNS:
- ID - New ID of the candidate record
- OldID - Old ID of the candidate record (old database schema)
- OriginalResumeID - Document link ID for the candidate table to the document table
- CachedText - The field I am updating (holds the document text); at the moment it will mostly be NULL
Here is the script:
## Get resume list
$params = @{
    'Database'       = $TRIS5DATABASENAME
    'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
    'Username'       = $AdminLogin
    'Password'       = $InsecurePassword
    'query'          = "SELECT id, OldID, OriginalResumeID FROM Candidate WHERE OriginalResumeID IS NOT NULL"
}
$IDCheck = Invoke-Sqlcmd @params

## Word object
$files = Get-ChildItem -force -recurse $documentFolder -include *.doc, *.pdf, *.docx
$word = New-Object -ComObject word.application
$word.Visible = $false
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatText")

foreach ($file in $files) {
    Write-Output "Processing: $($file.FullName)"
    $doc = $word.Documents.Open($file.FullName)
    $fileName = $file.BaseName + '.txt'
    $doc.SaveAs("$env:TEMP$fileName", [ref]$saveFormat)
    Write-Output "File saved as $env:TEMP$fileName"
    $doc.Close()

    $4ID = $fileName.split('-')[1].replace(' ', '').replace(".txt", "")

    $text = Get-Content "$env:TEMP$fileName"
    $text = $text.replace("'", "''")

    $resumeID = $IDCheck | where {$_.OldID -eq $4id} | Select-Object OriginalResumeID
    $resumeID = $resumeID.OriginalResumeID

    <# Upload to azure #>
    $params = @{
        'Database'       = $TRIS5DATABASENAME
        'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
        'Username'       = $AdminLogin
        'Password'       = $InsecurePassword
        'query'          = "Update Document SET CachedText = '$text' WHERE id = $ResumeID"
    }
    Invoke-Sqlcmd @params -ErrorAction "SilentlyContinue"

    Remove-Item -Force "$env:TEMP$fileName"
}
$word.Quit()
The problem is that running this on a large dataset, say 750,000 documents, takes far too long per document. I'm fairly certain this is because it has to search through the entire $IDCheck object of 750,000 records before it can get the OriginalResumeID of the record to upload to.
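For reference, here is a minimal sketch of what swapping that per-file Where-Object scan for a hash table keyed on OldID could look like (assuming OldID values are unique; the variable names are illustrative):

# Sketch only: index the Candidate rows once so each file does an O(1) hash
# lookup instead of filtering all 750,000 rows with Where-Object.
$resumeIdByOldId = @{}
foreach ($row in $IDCheck) {
    $resumeIdByOldId[[string]$row.OldID] = $row.OriginalResumeID
}

# inside the file loop, instead of the Where-Object pipeline:
$resumeID = $resumeIdByOldId[[string]$4ID]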
Running this on a smaller database is quite quick (around 200,000 documents per 24 hours). I was thinking I could check the documents table, only pull rows where the CachedText field is NULL, and loop that in batches of 50,000 documents so it would get quicker as it goes. The problem is that the documents table will be massive and will take a long time to search through every time this is called.
Any help on speeding this up would be much appreciated.
EDIT:
It looks like the upload to Azure is causing the delay:
<# Upload to azure #>
$params = @{
    'Database'       = $TRIS5DATABASENAME
    'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
    'Username'       = $AdminLogin
    'Password'       = $InsecurePassword
    'query'          = "Update Document SET CachedText = '$text' WHERE id = $ResumeID"
}
Invoke-Sqlcmd @params -ErrorAction "SilentlyContinue"
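For context, one way to cut that per-row overhead is to keep a single connection open and reuse a parameterized command, rather than starting a new Invoke-Sqlcmd call (and connection) for every document. A rough sketch, assuming System.Data.SqlClient is available; the connection-string format below is an assumption, and parameters also remove the need to double up single quotes in the text:

# Sketch: one open connection and one parameterized UPDATE reused per row.
$connStr = "Server=$($AzureServerInstance.FullyQualifiedDomainName);Database=$TRIS5DATABASENAME;User ID=$AdminLogin;Password=$InsecurePassword;"
$cn = New-Object System.Data.SqlClient.SqlConnection $connStr
$cn.Open()
$cmd = $cn.CreateCommand()
$cmd.CommandText = "UPDATE Document SET CachedText = @text WHERE id = @id"
[void]$cmd.Parameters.Add("@text", [System.Data.SqlDbType]::NVarChar, -1)   # -1 = NVARCHAR(MAX)
[void]$cmd.Parameters.Add("@id", [System.Data.SqlDbType]::Int)

# per file, inside the loop:
$cmd.Parameters["@text"].Value = $text
$cmd.Parameters["@id"].Value = $resumeID
[void]$cmd.ExecuteNonQuery()

# after the loop:
$cn.Close()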
Tags: time-limit-exceeded, powershell, pdf, ms-word, azure
Your Get-Content is super slow too. Use the -Raw parameter to get the text as a string.
– wOxxOm, Sep 29 '18 at 7:02
Thanks for the response. Can you explain why this would be faster? I did a test by creating 10,000 text files in a folder and running both Get-Content and Get-Content -Raw on all of them; it took 7 seconds on both runs.
– Owain Esau, Oct 1 '18 at 1:21
Those were smallish files, I guess. Get-Content without -Raw creates an array of lines, so with really large files the difference can be enormous. This is a known peculiarity.
– wOxxOm, Oct 1 '18 at 4:23
Thanks, this made a small difference; the files aren't very large. They are resumes, so 1-2 page documents.
– Owain Esau, Oct 2 '18 at 0:07
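A minimal illustration of the -Raw difference discussed in these comments (the file path is just an example):

# Default Get-Content returns an array of lines; -Raw returns one string.
$lines = Get-Content "$env:TEMP\example.txt"        # [string[]]
$text  = Get-Content "$env:TEMP\example.txt" -Raw   # [string]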
1 Answer
I would try using SqlBulkCopy to load all of the IDs and their CachedText at once into a staging table in Azure, and then do a single update on your document table.
CREATE TABLE document
(docKey BIGINT IDENTITY(1, 1) PRIMARY KEY,
CachedText NVARCHAR(MAX),
id INT
);
CREATE TABLE document_stage
(CachedText NVARCHAR(MAX),
id INT
);
As you iterate over the files, create a PSObject with the properties you want in your SQL table and add it to a collection. Then, after all files are done (or at set batch limits), use Out-DataTable to convert the collection into a DataTable, let SqlBulkCopy upload it to the staging table in one batch, and run a single UPDATE to update your primary table.
UPDATE Document
SET
CachedText = stg.CachedText
FROM document_stage stg
WHERE document.id = stg.id;
PS script
$files = Get-ChildItem -force -recurse $documentFolder -include *.doc, *.pdf, *.docx
$stagedDataAsArray = @()

foreach ($file in $files) {
    $fileName = $file.name
    $4ID = $fileName.split('-')[1].replace(' ', '').replace(".txt", "")

    $text = Get-Content "$($file.FullName)"
    $text = $text.replace("'", "''")

    $resumeID = $IDCheck | where {$_.OldID -eq $4id} | Select-Object OriginalResumeID
    $resumeID = $resumeID.OriginalResumeID

    <# Create the row and add it to our collection #>
    $fileInstance = New-Object -TypeName psobject
    $fileInstance | add-member -type NoteProperty -Name cachedText -Value $text
    $fileInstance | add-member -type NoteProperty -Name resumeID -Value $resumeID
    $stagedDataAsArray += $fileInstance

    Remove-Item -Force "$env:TEMP$fileName"
}

$stagedDataAsTable = $stagedDataAsArray | Out-DataTable

$cn = new-object System.Data.SqlClient.SqlConnection("YOUR AZURE DB CONNECTION STRING");
$cn.Open()
$bc = new-object ("System.Data.SqlClient.SqlBulkCopy") $cn
$bc.DestinationTableName = "dbo.document_stage"
$bc.WriteToServer($stagedDataAsTable)
$cn.Close()

$params = @{
    'Database'       = $TRIS5DATABASENAME
    'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
    'Username'       = $AdminLogin
    'Password'       = $InsecurePassword
    'query'          = "UPDATE Document
                        SET
                            CachedText = stg.CachedText
                        FROM document_stage stg
                        WHERE document.id = stg.id;"
}
Invoke-Sqlcmd @params -ErrorAction "SilentlyContinue"
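One caveat on the script above: Out-DataTable is not a built-in cmdlet (it is commonly taken from a community script such as Chad Miller's Out-DataTable function), and growing $stagedDataAsArray with += re-allocates the array on every iteration, which gets expensive at hundreds of thousands of files. A generic list avoids that; a minimal sketch under the same assumptions as the script above:

# Collect rows in a growable list instead of rebuilding an array with +=.
$stagedData = [System.Collections.Generic.List[psobject]]::new()
foreach ($file in $files) {
    # ... same per-file extraction and ID lookup as above ...
    $fileInstance = [pscustomobject]@{ cachedText = $text; resumeID = $resumeID }
    $stagedData.Add($fileInstance)
}
$stagedDataAsTable = $stagedData | Out-DataTable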