Optimizing Code to Minimize Time to Process Large List of Strings












I am working on a C# WPF application that reads a large file containing a number of HL7-formatted reports. The application reads the file, extracts every line that starts with OBX, and stores those lines in a List. It then tries to extract a report header from each line, if one exists, based on a handful of rules:




  1. Ends with a ':'.

  2. Is in all caps.

  3. Is fewer than 6 words (not including words in brackets).

  4. Contains more than 4 characters.

  5. May be on its own line or embedded in the content of the string (always at the start).
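
To make rules 1-4 concrete, here is a minimal sketch of each rule written as its own small check. The class name `HeaderRules` and its method names are made up for illustration and are not part of my application, and the sketch assumes that "all caps" ignores spaces, digits and punctuation. How the checks are combined is deliberately left open.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class: one small check per rule, for illustration only.
public static class HeaderRules
{
    // Matches a bracketed group such as "(PA AND LATERAL)".
    private static readonly Regex Bracketed = new Regex(@"\(.*?\)", RegexOptions.Compiled);

    // Rule 1: the candidate ends with ':'.
    public static bool EndsWithColon(string text)
    {
        return text.EndsWith(":", StringComparison.Ordinal);
    }

    // Rule 2: every letter is upper case (assumption: non-letters are ignored,
    // and at least one letter must be present).
    public static bool IsAllCaps(string text)
    {
        bool sawLetter = false;
        foreach (char c in text)
        {
            if (!char.IsLetter(c)) continue;
            if (!char.IsUpper(c)) return false;
            sawLetter = true;
        }
        return sawLetter;
    }

    // Rule 3: number of space-separated words once bracketed groups are removed.
    public static int WordCount(string text)
    {
        string stripped = Bracketed.Replace(text, string.Empty);
        return stripped.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;
    }

    // Rule 4: the candidate contains more than 4 characters.
    public static bool HasMoreThanFourChars(string text)
    {
        return text.Length > 4;
    }
}
```

A caller would apply rule 3 as `HeaderRules.WordCount(candidate) < 6`.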


I have the algorithm down and it works, but I am dealing with files that can contain upward of a million lines. My initial design took about 10-15 minutes to read and process around 1 million lines. Through hours of research, I was able to optimize the code and bring that down to a few minutes. However, I am hoping to reduce the processing time even further, and I do not know what else I can do to improve the performance of my code.



I was able to narrow the bottleneck down to this method, which extracts the headers from the collected strings. Below is the most recent version of the method, which is as optimized as I can get it (hopefully it will be better with your help):



    // Requires: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
    private List<string> GetHeader(List<string> FileLines)
    {
        List<string> headers = new List<string>();
        foreach (string line in FileLines)
        {
            string header = string.Empty;
            // Skips lines that contain a date or a report id; otherwise anything before a ':' is assumed to be the header
            if (Regex.IsMatch(line, @"\w{2,4}[/-]\w{2,3}[/-]\w{2,4}", RegexOptions.Compiled) || Regex.IsMatch(line, @"^\w+, \w{2} \d{5}-{0,1}\d{0,5}", RegexOptions.Compiled))
            {
                continue;
            }

            string nobrackets = Regex.Replace(line, @".*?\(.*?\)", string.Empty, RegexOptions.Compiled);
            if (line.IndexOf(':') != -1)
            {
                string nobracks = Regex.Replace(line.Substring(0, line.IndexOf(':') + 1), @"\(.*?\)", string.Empty, RegexOptions.Compiled);
                if (nobracks.Split(' ').Length < 5 && nobracks.Length > 6)
                {
                    headers.Add(line.Substring(0, line.IndexOf(':') + 1));
                    continue;
                }
            }

            // Skips strings of 5 or more words (not including brackets) or of 6 or fewer characters
            if (!(nobrackets.Split(' ').Length < 5 && nobrackets.Length > 6))
                continue;

            // Checks if the string is in all CAPS
            char[] letter = nobrackets.ToCharArray();

            if (letter.All(l => char.IsUpper(l)))
            {
                headers.Add(line);
                continue;
            }

            // Checks if the string is 5 words or fewer
            string temp = Regex.Replace(line, @"\(.*?\)", string.Empty, RegexOptions.Compiled);
            if (temp.Split(' ').Length < 6)
            {
                headers.Add(line);
            }

            // Checks for an all-caps header embedded at the start of a string
            bool caps = true;
            string[] word = line.Split(' ');
            int lastCapWordIndex = 0;
            for (int i = 0; i < word.Length && caps; i++)
            {
                char[] char_array = word[i].ToCharArray();

                if (!char_array.All(l => char.IsUpper(l)))
                {
                    caps = false;
                    continue;
                }
                if (caps)
                    lastCapWordIndex++;
            }
            if (lastCapWordIndex > 0)
            {
                for (int i = 0; i < lastCapWordIndex; i++)
                {
                    header += " " + word[i];
                }
                headers.Add(header.Trim());
                continue;
            }
        }

        // Final check: drop headers of 4 characters or fewer
        string[] tempH = headers.ToArray();
        headers = new List<string>();
        foreach (string h in tempH)
        {
            if (h.Length > 4)
            {
                headers.Add(h);
            }
        }
        return headers;
    }
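
For anyone who wants to reproduce the timing, a measurement of this kind can be taken with `System.Diagnostics.Stopwatch`. The sketch below is only an illustration of that approach: the `Timing` class and its `Measure` method are hypothetical names, not part of my application, and a single run like this is a rough wall-clock figure rather than a proper benchmark.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper for illustration: times one call of a header-extraction method.
public static class Timing
{
    // Runs `work` once over `lines` and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Func<List<string>, List<string>> work, List<string> lines)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        work(lines);
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

Inside the class that owns `GetHeader`, this would be called as `TimeSpan elapsed = Timing.Measure(GetHeader, fileLines);`, where `fileLines` stands for the list of OBX lines.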








Tags: c#, performance, regex, wpf





asked 4 mins ago by ShandowViper18
