Text extraction of document files
I have this script that pulls text out of .docx, .doc and .pdf files and uploads that text to an Azure SQL Server, so that users can search the contents of those documents without using Windows Search / Azure Search.
The filenames are all in the following format:
firstname surname - id.extension
The id in the filename is from an outdated database, though; the new database that I am updating holds both IDs (newID and oldID).
COLUMNS:
- ID - New ID of the candidate record
- OldID - Old ID of the candidate record (old database schema)
- OriginalResumeID - Document link ID for the candidate table to the document table
- CachedText - The field I am updating (holds the document text); at the moment it will mostly be NULL
Here is the script:
## Get resume list
$params = @{
    'Database'       = $TRIS5DATABASENAME
    'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
    'Username'       = $AdminLogin
    'Password'       = $InsecurePassword
    'query'          = "SELECT id, OldID, OriginalResumeID FROM Candidate WHERE OriginalResumeID IS NOT NULL"
}
$IDCheck = Invoke-Sqlcmd @params

## Word object
$files = Get-ChildItem -force -recurse $documentFolder -include *.doc, *.pdf, *.docx
$word = New-Object -ComObject word.application
$word.Visible = $false
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatText")

foreach ($file in $files) {
    Write-Output "Processing: $($file.FullName)"
    $doc = $word.Documents.Open($file.FullName)
    $fileName = $file.BaseName + '.txt'
    $doc.SaveAs("$env:TEMP$fileName", [ref]$saveFormat)
    Write-Output "File saved as $env:TEMP$fileName"
    $doc.Close()

    $4ID = $fileName.split('-')[1].replace(' ', '').replace(".txt", "")

    $text = Get-Content "$env:TEMP$fileName"
    $text = $text.replace("'", "''")

    $resumeID = $IDCheck | where {$_.OldID -eq $4id} | Select-Object OriginalResumeID
    $resumeID = $resumeID.OriginalResumeID

    <# Upload to azure #>
    $params = @{
        'Database'       = $TRIS5DATABASENAME
        'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
        'Username'       = $AdminLogin
        'Password'       = $InsecurePassword
        'query'          = "Update Document SET CachedText = '$text' WHERE id = $ResumeID"
    }
    Invoke-Sqlcmd @params -ErrorAction "SilentlyContinue"

    Remove-Item -Force "$env:TEMP$fileName"
}
$word.Quit()
The problem is that running this on a large dataset, say 750,000 documents, takes far too long per document. I'm fairly certain this is because it has to search through the entire $IDCheck object of 750,000 records before it can get the OriginalResumeID of the record to upload to.
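For reference, here is a minimal sketch of what swapping that per-file Where-Object scan for a hash table keyed on OldID could look like (assuming OldID values are unique; the variable names are illustrative):

# Sketch only: index the Candidate rows once so each file does an O(1) hash
# lookup instead of filtering all 750,000 rows with Where-Object.
$resumeIdByOldId = @{}
foreach ($row in $IDCheck) {
    $resumeIdByOldId[[string]$row.OldID] = $row.OriginalResumeID
}

# inside the file loop, instead of the Where-Object pipeline:
$resumeID = $resumeIdByOldId[[string]$4ID]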
Running this on a smaller database is quite quick (around 200,000 documents per 24 hours). I was thinking I could check the documents table, only pull rows where the CachedText field is NULL, and loop that in batches of 50,000 documents so it would get quicker as it goes. The problem is that the documents table will be massive and will take a long time to search through every time this is called.
Any help on speeding this up would be much appreciated.
EDIT:
It looks like the upload to Azure is causing the delay:
<# Upload to azure #>
$params = @{
    'Database'       = $TRIS5DATABASENAME
    'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
    'Username'       = $AdminLogin
    'Password'       = $InsecurePassword
    'query'          = "Update Document SET CachedText = '$text' WHERE id = $ResumeID"
}
Invoke-Sqlcmd @params -ErrorAction "SilentlyContinue"
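For context, one way to cut that per-row overhead is to keep a single connection open and reuse a parameterized command, rather than starting a new Invoke-Sqlcmd call (and connection) for every document. A rough sketch, assuming System.Data.SqlClient is available; the connection-string format below is an assumption, and parameters also remove the need to double up single quotes in the text:

# Sketch: one open connection and one parameterized UPDATE reused per row.
$connStr = "Server=$($AzureServerInstance.FullyQualifiedDomainName);Database=$TRIS5DATABASENAME;User ID=$AdminLogin;Password=$InsecurePassword;"
$cn = New-Object System.Data.SqlClient.SqlConnection $connStr
$cn.Open()
$cmd = $cn.CreateCommand()
$cmd.CommandText = "UPDATE Document SET CachedText = @text WHERE id = @id"
[void]$cmd.Parameters.Add("@text", [System.Data.SqlDbType]::NVarChar, -1)   # -1 = NVARCHAR(MAX)
[void]$cmd.Parameters.Add("@id", [System.Data.SqlDbType]::Int)

# per file, inside the loop:
$cmd.Parameters["@text"].Value = $text
$cmd.Parameters["@id"].Value = $resumeID
[void]$cmd.ExecuteNonQuery()

# after the loop:
$cn.Close()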
Tags: time-limit-exceeded, powershell, pdf, ms-word, azure
Your Get-Content is super slow too. Use the -Raw parameter to get the text as a string.
– wOxxOm, Sep 29 '18 at 7:02
Thanks for the response. Can you explain why this would be faster? I did a test by creating 10,000 text files in a folder and running both Get-Content and Get-Content -Raw on all of them; it took 7 seconds on both runs.
– Owain Esau, Oct 1 '18 at 1:21
Those were smallish files, I guess. Get-Content without -Raw creates an array of lines, so with really large files the difference can be enormous. This is a known peculiarity.
– wOxxOm, Oct 1 '18 at 4:23
Thanks, this made a small difference; the files aren't very large. They are resumes, so 1-2 page documents.
– Owain Esau, Oct 2 '18 at 0:07
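A minimal illustration of the -Raw difference discussed in these comments (the file path is just an example):

# Default Get-Content returns an array of lines; -Raw returns one string.
$lines = Get-Content "$env:TEMP\example.txt"        # [string[]]
$text  = Get-Content "$env:TEMP\example.txt" -Raw   # [string]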
1 Answer
I would try using SqlBulkCopy to load all of the IDs and their CachedText at once into a staging table in Azure, and then do a single update on your document table.
CREATE TABLE document
(docKey BIGINT IDENTITY(1, 1) PRIMARY KEY,
CachedText NVARCHAR(MAX),
id INT
);
CREATE TABLE document_stage
(CachedText NVARCHAR(MAX),
id INT
);
As you iterate over the files, create a PSObject with the properties you want in your SQL table and add it to a collection. Then, after all files are done (or at set batch limits), use Out-DataTable to convert the collection into a DataTable, let SqlBulkCopy upload it to the staging table in one batch, and run a single UPDATE to update your primary table.
UPDATE Document
SET
CachedText = stg.CachedText
FROM document_stage stg
WHERE document.id = stg.id;
PS script
$files = Get-ChildItem -force -recurse $documentFolder -include *.doc, *.pdf, *.docx
$stagedDataAsArray = @()

foreach ($file in $files) {
    $fileName = $file.name
    $4ID = $fileName.split('-')[1].replace(' ', '').replace(".txt", "")

    $text = Get-Content "$($file.FullName)"
    $text = $text.replace("'", "''")

    $resumeID = $IDCheck | where {$_.OldID -eq $4id} | Select-Object OriginalResumeID
    $resumeID = $resumeID.OriginalResumeID

    <# Create the row and add it to our collection #>
    $fileInstance = New-Object -TypeName psobject
    $fileInstance | add-member -type NoteProperty -Name cachedText -Value $text
    $fileInstance | add-member -type NoteProperty -Name resumeID -Value $resumeID
    $stagedDataAsArray += $fileInstance

    Remove-Item -Force "$env:TEMP$fileName"
}

$stagedDataAsTable = $stagedDataAsArray | Out-DataTable

$cn = new-object System.Data.SqlClient.SqlConnection("YOUR AZURE DB CONNECTION STRING");
$cn.Open()
$bc = new-object ("System.Data.SqlClient.SqlBulkCopy") $cn
$bc.DestinationTableName = "dbo.document_stage"
$bc.WriteToServer($stagedDataAsTable)
$cn.Close()

$params = @{
    'Database'       = $TRIS5DATABASENAME
    'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
    'Username'       = $AdminLogin
    'Password'       = $InsecurePassword
    'query'          = "UPDATE Document
                        SET
                            CachedText = stg.CachedText
                        FROM document_stage stg
                        WHERE document.id = stg.id;"
}
Invoke-Sqlcmd @params -ErrorAction "SilentlyContinue"
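One caveat on the script above: Out-DataTable is not a built-in cmdlet (it is commonly taken from a community script such as Chad Miller's Out-DataTable function), and growing $stagedDataAsArray with += re-allocates the array on every iteration, which gets expensive at hundreds of thousands of files. A generic list avoids that; a minimal sketch under the same assumptions as the script above:

# Collect rows in a growable list instead of rebuilding an array with +=.
$stagedData = [System.Collections.Generic.List[psobject]]::new()
foreach ($file in $files) {
    # ... same per-file extraction and ID lookup as above ...
    $fileInstance = [pscustomobject]@{ cachedText = $text; resumeID = $resumeID }
    $stagedData.Add($fileInstance)
}
$stagedDataAsTable = $stagedData | Out-DataTable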