Select every paragraph in text via regex using Python











I read a PDF into Python and would like to extract specific paragraphs from it using a regex. To illustrate the case, here is an example of the extracted text:



INTERNATIONAL MONETARY FUND            7\n\x0cBELGIUM\n\n\n\nPOLICY DISCUSSIONS—MAINTAINING THE REFORM\nMOMENTUM\n7.     The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7   First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n    has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n    require following through on plans to gradually move toward structural balance.\n\n\uf0b7   Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n    further labor and product market reforms are needed to increase productivity growth, raise\n    potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7   Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n    vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n    and proactive policies.3\n\n8.      The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n9.      Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n  The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n\n8


Each paragraph starts with a one- or two-digit number, followed by a dot and three to seven spaces. It ends at the next double newline (\n\n) that is followed by a one- or two-digit number and a dot; that number is also the start of the next paragraph. In the example above, I should find three paragraphs:



first paragraph:





  7.     The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7   First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n    has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n    require following through on plans to gradually move toward structural balance.\n\n\uf0b7   Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n    further labor and product market reforms are needed to increase productivity growth, raise\n    potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7   Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n    vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n    and proactive policies.3\n\n




second paragraph:





  8.      The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n




and finally the third:





  9.      Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n  The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n




I tried the following regex: r'(?m)[0-99].*[.] {3,7} (.*?) \n\n', with the idea of selecting everything from the start marker to the end marker:

  1. (?m)[0-99].*[.] {3,7}: identifies the beginning, for each line separately.
  2. \n\n: specifies the end.

However, it doesn't match anything.










  • 2




    If you think [0-99] matches numbers from 0 to 99, you are wrong. You may replace that with \d\d?. re.M ((?m)) modifies ^ and $, and you do not have them in the pattern. You must have wanted to use (?s). Try r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n|\Z)', see the regex demo.
    – Wiktor Stribiżew
    Nov 20 at 13:41












  • Can you provide the raw input?
    – Edilson Borges
    Nov 20 at 13:42















python regex














edited Nov 20 at 14:04

























asked Nov 20 at 13:38









math

510923




1 Answer
accepted










The [0-99] pattern is erroneous: a character class matches a single character, so it matches any one digit from 0 to 9. See "Why doesn't [01-12] range work as expected?". The re.M flag ((?m)) modifies the ^ and $ anchors, but you have neither in the pattern.
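
To see these points in isolation, here is a minimal sketch. The variable names and the two-line sample string are made up for illustration and are not part of the question's data.

```python
import re

# A character class matches exactly one character. In [0-99] the range 0-9 is
# followed by a redundant literal 9, so the class is the same set as [0-9].
single_digit = re.compile(r'[0-99]')
one_or_two_digits = re.compile(r'\d\d?')

print(single_digit.fullmatch('7') is not None)        # True
print(single_digit.fullmatch('42') is not None)       # False
print(one_or_two_digits.fullmatch('42') is not None)  # True

# Hypothetical sample, not taken from the question.
text = "intro\n7.     body\nmore body"

# (?m) lets ^ match at the start of every line, not only at the start of the string.
print(re.findall(r'^\d\d?\.', text))      # []
print(re.findall(r'(?m)^\d\d?\.', text))  # ['7.']

# (?s) lets . match newlines, so .* can run across line breaks.
print(re.search(r'body.*more', text) is None)          # True
print(re.search(r'(?s)body.*more', text) is not None)  # True
```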



You may use



r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)'


See the regex demo.



Details





  • (?sm) - re.DOTALL and re.MULTILINE options enabled
  • ^ - start of a line
  • \d\d? - 1 or 2 digits (0 to 99)
  • \. - a dot
  • " {3,7}" - 3 to 7 spaces (replace with [^\S\r\n]{3,7} to match any horizontal whitespace)
  • (.*?) - Group 1: any 0+ chars, as few as possible
  • (?=\n\n\d\d?\. |\Z) - a position immediately followed by two newline chars (\n\n), then 1 or 2 digits (\d\d?), a dot and a space, or (|) by the end of the whole string (\Z).
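
As a variation, the pieces listed above can be rearranged to keep the paragraph number as well. This is only a sketch: the short sample string, the extra capture group, and the names sample, pattern and paragraphs are assumptions for illustration. It also uses the horizontal-whitespace class [^\S\r\n] in place of the literal space.

```python
import re

# Hypothetical short sample; the numbering and spacing mimic the PDF text.
sample = (
    "HEADER 7\n"
    "7.     First paragraph,\ncontinued on a second line.\n\n"
    "* a bullet inside the first paragraph.3\n\n"
    "8.      Second paragraph.\n\n"
    "9.      Third paragraph.\n\n\n3\n A footnote.\n"
)

# Group 1 is the paragraph number, group 2 the paragraph body.
# The lookahead stops the lazy body before the next numbered paragraph
# or, for the last paragraph, at the end of the string.
pattern = re.compile(
    r'^(\d\d?)\.[^\S\r\n]{3,7}(.*?)(?=\n\n\d\d?\.[^\S\r\n]|\Z)',
    re.DOTALL | re.MULTILINE,
)

paragraphs = {int(m.group(1)): m.group(2) for m in pattern.finditer(sample)}
for number, body in paragraphs.items():
    print(number, repr(body))
```

Because the end condition is a lookahead, the "\n\n8." that closes one paragraph is not consumed and is still available as the start of the next match.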


Python demo:



import re

s = "INTERNATIONAL MONETARY FUND            7\n\x0cBELGIUM\n\n\n\nPOLICY DISCUSSIONS—MAINTAINING THE REFORM\nMOMENTUM\n7.     The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7   First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n    has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n    require following through on plans to gradually move toward structural balance.\n\n\uf0b7   Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n    further labor and product market reforms are needed to increase productivity growth, raise\n    potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7   Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n    vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n    and proactive policies.3\n\n8.      The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n9.      Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n  The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n\n8"

# findall returns Group 1 (the paragraph body) for every match
for r in re.findall(r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)', s):
    print(r, "\n---------")


Output:



The current recovery is an opportunity to strengthen the resilience and growth
potential of the Belgian economy. The government's ability to deal with future shocks will depend
on whether it implements the right policies now while the economy continues to recover.

 First, with public debt above 100 percent of GDP and only starting to come down, Belgium still
has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will
require following through on plans to gradually move toward structural balance.

 Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,
further labor and product market reforms are needed to increase productivity growth, raise
potential output, and integrate vulnerable groups into the labor market.

 Third, although the financial sector has recovered since the crisis and is generally sound, cyclical
vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance
and proactive policies.3
---------
The government agreed last summer on a new package of measures related to
taxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was
a reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be
phased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in
2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was
modified to apply only to incremental corporate equity rather than to the total stock, and new anti-
tax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the
measures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.
---------
Policy discussions focused on the importance of maintaining the reform momentum
and not yielding to complacency. Achieving the balanced budget goal will require efforts at all
levels of government to make spending more efficient and safeguard revenues (Section A).
A combination of policies and reforms could help raise productivity growth, including increasing
investment in infrastructure and enhancing competition in services (Section B). To fully realize
Belgium's employment potential, it will be critical to address the severe fragmentation of the labor
market (Section C). To preserve financial stability, the authorities should address vulnerabilities in the
mortgage market and carefully navigate the transition toward a European Banking Union (Section D).




3
A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector
Assessment Program (FSAP).
4
The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with
a deduction that is the product of corporate equity and a notional interest rate.


8
---------
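
If the paragraph numbers are needed as well, the same pattern can be given a second capturing group. This is only a sketch: the helper name numbered_paragraphs and the short sample string are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as above, with the paragraph number captured as Group 1.
PARAGRAPH = re.compile(r'(?sm)^(\d\d?)\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)')

def numbered_paragraphs(text):
    """Map each paragraph number to its text, with line breaks collapsed to spaces."""
    return {int(m.group(1)): " ".join(m.group(2).split())
            for m in PARAGRAPH.finditer(text)}

# Hypothetical sample that mimics the layout of the extracted PDF text.
sample = "Header\n7.     First line\nsecond line.\n\n8.      Another\nparagraph."
result = numbered_paragraphs(sample)
print(result)
```

Collapsing whitespace this way also joins words that were hyphenated across a line break with a space (for example "anti- tax"), so that case may need separate handling.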




























  • many thanks for your answer. However, if I paste it there, I don't get any match, see the demo regex101.com
    – math
    Nov 20 at 13:48










  • @math I already did it for you - see this demo. And here is a Python demo.
    – Wiktor Stribiżew
    Nov 20 at 13:49












  • I've just noticed that my ending condition was not correct. I will change it above. It should be \n\n followed by a number (two or three digits) and a dot. Otherwise we don't select the complete first paragraph, as there are characters like \n\n\uf0b7 in there. I tried to change your solution to r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n\d\d?\. |\Z)', but then the last paragraph is not selected
    – math
    Nov 20 at 14:02
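
A small sketch of why swapping the lookahead (?=...) for a non-capturing group (?:...) drops paragraphs: the non-capturing group consumes the next paragraph's number, so that paragraph can no longer match at the start of a line. The three-item text below is a made-up stand-in for the PDF text.

```python
import re

# Hypothetical miniature input with three numbered paragraphs.
text = "1.   alpha\n\n2.   beta\n\n3.   gamma"

lookahead = r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)'
consuming = r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n\d\d?\. |\Z)'

with_lookahead = re.findall(lookahead, text)
with_consuming = re.findall(consuming, text)

print(with_lookahead)  # the lookahead leaves "2. " in place for the next match
print(with_consuming)  # "2. " was consumed by the first match, so "beta" is skipped
```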












  • Because otherwise it will end at .\n\n\uf0b7 in the first paragraph, which is not correct
    – math
    Nov 20 at 14:12










  • @math \uf0b7 is a control (other) character, it is not a digit. If you need to match ASCII digits only, use [0-9] instead of \d.
    – Wiktor Stribiżew
    Nov 20 at 14:14
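
To illustrate the difference between \d and [0-9] mentioned in this comment: for str patterns in Python 3, \d also matches non-ASCII decimal digits unless re.ASCII is set. The sample string is invented for the demonstration; \u0663 is ARABIC-INDIC DIGIT THREE.

```python
import re

# Hypothetical sample: an ASCII digit, a non-ASCII decimal digit, and the bullet char.
sample = "7. \u0663. \uf0b7"

unicode_digits = re.findall(r'\d', sample)            # Unicode-aware \d
ascii_class = re.findall(r'[0-9]', sample)            # explicit ASCII range
ascii_flag = re.findall(r'\d', sample, re.ASCII)      # \d restricted by re.ASCII

print(unicode_digits, ascii_class, ascii_flag)
```

In all three cases \uf0b7 is left unmatched, so the bullet lines in the first paragraph never look like the start of a numbered paragraph.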











The [0-99] pattern is erroneous since it matches any 1 digit from 0 to 9. See Why doesn't [01-12] range work as expected?. The re.M ((?m)) modifies ^ and $ anchors, but you haved neither in the pattern.



You may use



r'(?sm)^dd?. {3,7}(.*?)(?=nndd?. |Z)'


See the regex demo.



Details





  • (?sm) - re.DOTALL and re.MULTILINE options enabled


  • ^ - start of a line


  • dd? - 1 or 2 digits (0 to 99)


  • . - a dot


  • <code> {3,7}</code> - 3 to 7 spaces (replace with[^Srn]{3,7}` to match any horizontal whitespace)


  • (.*?) - Group 1: any 0+ chars as few as possible


  • (?=nndd?. |Z) - a location, immediately followed with two newline chars (nn) and then 1 or 2 digits (dd?) and a dot followed with space or (|) end of the whole string (Z).


Python demo:



import re

# Sample text as extracted from the PDF (escape sequences and spacing restored)
s = "INTERNATIONAL MONETARY FUND            7\n\x0cBELGIUM\n\n\n\nPOLICY DISCUSSIONS—MAINTAINING THE REFORM\nMOMENTUM\n7.     The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7   First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n    has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n    require following through on plans to gradually move toward structural balance.\n\n\uf0b7   Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n    further labor and product market reforms are needed to increase productivity growth, raise\n    potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7   Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n    vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n    and proactive policies.3\n\n8.      The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n9.      Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n\n8"

# Each match starts at a numbered line and runs up to the next numbered paragraph or the end of the string
for r in re.findall(r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)', s):
    print(r, "\n---------")


Output:



The current recovery is an opportunity to strengthen the resilience and growth
potential of the Belgian economy. The government's ability to deal with future shocks will depend
on whether it implements the right policies now while the economy continues to recover.

 First, with public debt above 100 percent of GDP and only starting to come down, Belgium still
has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will
require following through on plans to gradually move toward structural balance.

 Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,
further labor and product market reforms are needed to increase productivity growth, raise
potential output, and integrate vulnerable groups into the labor market.

 Third, although the financial sector has recovered since the crisis and is generally sound, cyclical
vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance
and proactive policies.3
---------
The government agreed last summer on a new package of measures related to
taxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was
a reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be
phased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in
2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was
modified to apply only to incremental corporate equity rather than to the total stock, and new anti-
tax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the
measures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.
---------
Policy discussions focused on the importance of maintaining the reform momentum
and not yielding to complacency. Achieving the balanced budget goal will require efforts at all
levels of government to make spending more efficient and safeguard revenues (Section A).
A combination of policies and reforms could help raise productivity growth, including increasing
investment in infrastructure and enhancing competition in services (Section B). To fully realize
Belgium's employment potential, it will be critical to address the severe fragmentation of the labor
market (Section C). To preserve financial stability, the authorities should address vulnerabilities in the
mortgage market and carefully navigate the transition toward a European Banking Union (Section D).




3
A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector
Assessment Program (FSAP).
4
The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with
a deduction that is the product of corporate equity and a notional interest rate.


8
---------
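
Note that the last match above runs to the end of the string, so it also picks up the footnotes and the page number. A minimal sketch of one way to trim that, assuming that a run of three or more newlines always marks the end of the body text (that holds for this sample but is only an assumption about the rest of the PDF; the helper name numbered_paragraphs is made up for illustration):

```python
import re

# Same pattern as in the demo above, compiled once
PARAGRAPH = re.compile(r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)')

def numbered_paragraphs(text):
    """Return paragraph bodies, cutting each at the first run of 3+ newlines."""
    bodies = []
    for body in PARAGRAPH.findall(text):
        # Assumption: three or more consecutive newlines separate body text
        # from footnotes and page numbers.
        bodies.append(re.split(r'\n{3,}', body, maxsplit=1)[0].strip())
    return bodies

# Small made-up sample with the same layout as the PDF text
sample = "7.     Alpha\nbeta.\n\n8.      Gamma.\n\n\n\n3\n A footnote.\n\n\n8"
print(numbered_paragraphs(sample))
```

If the real document uses a different separator before the footnotes, the split pattern would need to change accordingly.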














edited Nov 20 at 14:35

























answered Nov 20 at 13:45









Wiktor Stribiżew













  • Many thanks for your answer. However, if I paste it there, I don't get any match; see the demo on regex101.com
    – math
    Nov 20 at 13:48










  • @math I already did it for you - see this demo. And here is a Python demo.
    – Wiktor Stribiżew
    Nov 20 at 13:49












  • I've just noticed that my ending condition was not correct. I will change it above. It should be \n\n followed by a number of two or three digits and a dot. Otherwise we don't select the complete first paragraph, as there are sequences like \n\n\uf0b7 in there. I tried to change your solution to r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n\d\d?\. |\Z)' but then the last paragraph is not selected.
    – math
    Nov 20 at 14:02
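
The difference between the lookahead (?=...) and the non-capturing group (?:...) tried in this comment can be seen on a small made-up text. This is only a sketch: the three-paragraph string below is invented for illustration and is not the PDF text. A non-capturing group consumes the delimiter, including the number of the next paragraph, so the next match can no longer start there; a lookahead only checks for it.

```python
import re

# Made-up sample: three numbered paragraphs separated by blank lines
text = "1.    First para.\n\n2.    Second para.\n\n3.    Third para."

start = r'(?sm)^\d\d?\. {3,7}(.*?)'

# Lookahead: the "\n\n2. " delimiter is checked but not consumed
with_lookahead = re.findall(start + r'(?=\n\n\d\d?\. |\Z)', text)

# Non-capturing group: the delimiter is consumed, so "2." is already gone
# when the search for the next match resumes
with_consuming = re.findall(start + r'(?:\n\n\d\d?\. |\Z)', text)

print(with_lookahead)   # all three paragraphs
print(with_consuming)   # every other paragraph is skipped
```

Which paragraphs get dropped with the consuming variant depends on the input, but some are always lost when two numbered paragraphs follow each other directly.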












  • Because otherwise it will end at .\n\n\uf0b7 in the first paragraph, which is not correct
    – math
    Nov 20 at 14:12










  • @math \uf0b7 is a control (other) character, it is not a digit. If you need to match ASCII digits only, use [0-9] instead of \d.
    – Wiktor Stribiżew
    Nov 20 at 14:14
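
To illustrate the \d versus [0-9] point, here is a short sketch. It assumes Python 3 str patterns, where \d matches any Unicode decimal digit unless the re.ASCII flag is set; the sample string is made up. It also checks how the bullet character from the PDF is classified.

```python
import re
import unicodedata

# ASCII seven plus U+0668 ARABIC-INDIC DIGIT EIGHT (made-up sample)
sample = "7 and \u0668"

unicode_digits = re.findall(r'\d', sample)               # Unicode-aware \d
ascii_class = re.findall(r'[0-9]', sample)               # explicit ASCII range
ascii_flag = re.findall(r'\d', sample, flags=re.ASCII)   # \d restricted to ASCII

# The bullet character from the extracted PDF text
bullet = "\uf0b7"
bullet_is_digit = re.match(r'\d', bullet) is not None
bullet_category = unicodedata.category(bullet)

print(unicode_digits, ascii_class, ascii_flag)
print(bullet_is_digit, bullet_category)
```

Here unicodedata reports the bullet as Co (private use, one of the "Other" categories), so \d never matches it either way; switching to [0-9] only matters if the text contains non-ASCII digits.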



































