Select every paragraph in text via regex using Python
I read a PDF into Python and would like to extract specific paragraphs from it using a regex. To illustrate the case, here is an example.
INTERNATIONAL MONETARY FUND 7\n\x0cBELGIUM\n\n\n\nPOLICY DISCUSSIONS—MAINTAINING THE REFORM\nMOMENTUM\n7. The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7 First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n require following through on plans to gradually move toward structural balance.\n\n\uf0b7 Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n further labor and product market reforms are needed to increase productivity growth, raise\n potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7 Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n and proactive policies.3\n\n8. The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n9. Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n\n8
Each paragraph starts with a one- or two-digit number, followed by a dot and three to seven spaces. It ends at the next double newline (`\n\n`) that is followed by a one- or two-digit number and a dot. Note that this point also marks the start of the next paragraph. In the example above, I should find three paragraphs:
first paragraph:
- The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7 First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n require following through on plans to gradually move toward structural balance.\n\n\uf0b7 Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n further labor and product market reforms are needed to increase productivity growth, raise\n potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7 Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n and proactive policies.3\n\n
second paragraph:
- The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n
and finally the third:
- Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n
I tried the following regex: `r'(?m)[0-99].*[.] {3,7} (.*?) \n\n'`
My reasoning was to select everything from the start to the end:
`(?m)[0-99].*[.] {3,7}` identifies the beginning, for each line separately.
`\n\n` specifies the end.
However, it doesn't find anything.
python regex
If you think `[0-99]` matches numbers from `0` to `99`, you are wrong. You may replace that with `\d\d?`. `re.M` (`(?m)`) modifies `^` and `$`, but you do not have them in the pattern. You must have wanted to use `(?s)`. Try `r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n|\Z)'`, see the regex demo.
– Wiktor Stribiżew
Nov 20 at 13:41
Can you provide the raw input?
– Edilson Borges
Nov 20 at 13:42
edited Nov 20 at 14:04
asked Nov 20 at 13:38
math
1 Answer
accepted
The `[0-99]` pattern is erroneous: it is a character class, so it matches any single digit from `0` to `9`. See Why doesn't [01-12] range work as expected?. The `re.M` (`(?m)`) flag modifies the `^` and `$` anchors, but you have neither in the pattern.
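A quick check makes the character-class point concrete. This is only a minimal sketch; the sample string and variable names are made up for illustration and are not from the question.

```python
import re

sample = "7 42 99"  # hypothetical input, not taken from the PDF text

# [0-99] is a character class: the range 0-9 plus a redundant literal 9,
# so each match is exactly one digit.
single_digits = re.findall(r"[0-99]", sample)

# \d\d? matches one digit, optionally followed by a second one.
one_or_two = re.findall(r"\d\d?", sample)

print(single_digits)  # ['7', '4', '2', '9', '9']
print(one_or_two)     # ['7', '42', '99']
```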
You may use
r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)'
See the regex demo.
Details
- `(?sm)` - `re.DOTALL` and `re.MULTILINE` options enabled
- `^` - start of a line
- `\d\d?` - 1 or 2 digits (`0` to `99`)
- `\.` - a dot
- ` {3,7}` - 3 to 7 spaces (replace with `[^\S\r\n]{3,7}` to match any horizontal whitespace)
- `(.*?)` - Group 1: any 0+ chars, as few as possible
- `(?=\n\n\d\d?\. |\Z)` - a location immediately followed by two newline chars (`\n\n`), then 1 or 2 digits (`\d\d?`) and a dot followed by a space, or (`|`) the end of the whole string (`\Z`).
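The lookahead is what keeps a paragraph intact across its internal blank lines. The sketch below contrasts it with a pattern that stops at the first blank line, and it also tries the `[^\S\r\n]` variant for the indent. The short `text` sample and the variable names are assumptions made up for illustration.

```python
import re

# Hypothetical miniature of the PDF text: paragraph 1 contains a blank line,
# and paragraph 2 has a tab inside its indent instead of only spaces.
text = "Title\n1.   First part.\n\nStill paragraph one.\n\n2. \t Second paragraph.\n"

# Ending at any blank line cuts paragraph 1 short, and the space-only
# indent check does not accept the tab in paragraph 2.
stop_at_blank = re.findall(r"(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n|\Z)", text)

# Ending only where a new numbered paragraph follows keeps paragraph 1 whole.
# [^\S\r\n] matches whitespace other than CR/LF, so the tab is accepted too.
end = r"(?=\n\n\d\d?\.[^\S\r\n]|\Z)"
whole = re.findall(r"(?sm)^\d\d?\.[^\S\r\n]{3,7}(.*?)" + end, text)

print(stop_at_blank)  # ['First part.']
print(whole)          # ['First part.\n\nStill paragraph one.', 'Second paragraph.\n']
```

Because `\Z` only matches at the very end of the string, the last paragraph keeps any trailing newline; call `.strip()` on each result if that is unwanted.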
Python demo:
import re
# Runs of spaces were collapsed in this copy of the text; three spaces after each
# paragraph number are assumed here (the question states three to seven).
s = "INTERNATIONAL MONETARY FUND 7\n\x0cBELGIUM\n\n\n\nPOLICY DISCUSSIONS—MAINTAINING THE REFORM\nMOMENTUM\n7.   The current recovery is an opportunity to strengthen the resilience and growth\npotential of the Belgian economy. The government's ability to deal with future shocks will depend\non whether it implements the right policies now while the economy continues to recover.\n\n\uf0b7 First, with public debt above 100 percent of GDP and only starting to come down, Belgium still\n has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will\n require following through on plans to gradually move toward structural balance.\n\n\uf0b7 Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,\n further labor and product market reforms are needed to increase productivity growth, raise\n potential output, and integrate vulnerable groups into the labor market.\n\n\uf0b7 Third, although the financial sector has recovered since the crisis and is generally sound, cyclical\n vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance\n and proactive policies.3\n\n8.   The government agreed last summer on a new package of measures related to\ntaxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was\na reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be\nphased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in\n2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was\nmodified to apply only to incremental corporate equity rather than to the total stock, and new anti-\ntax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the\nmeasures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.\n\n9.   Policy discussions focused on the importance of maintaining the reform momentum\nand not yielding to complacency. Achieving the balanced budget goal will require efforts at all\nlevels of government to make spending more efficient and safeguard revenues (Section A).\nA combination of policies and reforms could help raise productivity growth, including increasing\ninvestment in infrastructure and enhancing competition in services (Section B). To fully realize\nBelgium's employment potential, it will be critical to address the severe fragmentation of the labor\nmarket (Section C). To preserve financial stability, the authorities should address vulnerabilities in the\nmortgage market and carefully navigate the transition toward a European Banking Union (Section D).\n\n\n\n\n3\n A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector\nAssessment Program (FSAP).\n4\n The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with\na deduction that is the product of corporate equity and a notional interest rate.\n\n\n8"
# Print each captured paragraph body (Group 1), followed by a separator line.
for r in re.findall(r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)', s):
    print(r, "\n---------")
Output:
The current recovery is an opportunity to strengthen the resilience and growth
potential of the Belgian economy. The government's ability to deal with future shocks will depend
on whether it implements the right policies now while the economy continues to recover.
First, with public debt above 100 percent of GDP and only starting to come down, Belgium still
has a long way to go to rebuild buffers and achieve a more sustainable fiscal position. This will
require following through on plans to gradually move toward structural balance.
Second, with real GDP growth projected at only around 1½ percent for the foreseeable future,
further labor and product market reforms are needed to increase productivity growth, raise
potential output, and integrate vulnerable groups into the labor market.
Third, although the financial sector has recovered since the crisis and is generally sound, cyclical
vulnerabilities are rising and new challenges are emerging, suggesting the need for vigilance
and proactive policies.3
---------
The government agreed last summer on a new package of measures related to
taxation, the labor market, and social benefits (Table 2 and Box 1). The most notable reform was
a reduction in Belgium's corporate income tax (CIT) rate from 34 percent to 25 percent, to be
phased in over the next three years (SMEs will benefit from a reduced rate of 20 percent starting in
2018). To compensate for the resulting revenue loss, the notional interest rate deduction (NID) was
modified to apply only to incremental corporate equity rather than to the total stock, and new anti-
tax avoidance measures were introduced consistent with Belgium's EU obligations.4 Together, the
measures are designed to enhance Belgium's competitiveness while preserving revenue neutrality.
---------
Policy discussions focused on the importance of maintaining the reform momentum
and not yielding to complacency. Achieving the balanced budget goal will require efforts at all
levels of government to make spending more efficient and safeguard revenues (Section A).
A combination of policies and reforms could help raise productivity growth, including increasing
investment in infrastructure and enhancing competition in services (Section B). To fully realize
Belgium's employment potential, it will be critical to address the severe fragmentation of the labor
market (Section C). To preserve financial stability, the authorities should address vulnerabilities in the
mortgage market and carefully navigate the transition toward a European Banking Union (Section D).
3
A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector
Assessment Program (FSAP).
4
The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with
a deduction that is the product of corporate equity and a notional interest rate.
8
---------
Many thanks for your answer. However, if I paste it there, I don't get any match; see the demo on regex101.com.
– math
Nov 20 at 13:48
@math I already did it for you - see this demo. And here is a Python demo.
– Wiktor Stribiżew
Nov 20 at 13:49
I've just noticed that my ending condition was not correct. I will change it above. It should be \n\n followed by a number (two or three digits) and a dot. Otherwise we don't select the complete first paragraph, as there are characters like \n\n\uf0b7 in there. I tried to change your solution to r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n\d\d?\. |\Z)' but then the last paragraph is not selected.
– math
Nov 20 at 14:02
Because otherwise it will end at .\n\n\uf0b7 in the first paragraph, which is not correct.
– math
Nov 20 at 14:12
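A small sketch of why swapping the lookahead (?=...) for a non-capturing group (?:...) loses paragraphs: the group consumes the next paragraph's number, so that paragraph can no longer match at the start of a line. The three-paragraph text and the variable names below are invented for illustration; with only two paragraphs, the dropped one would be the last.

```python
import re

# Hypothetical miniature input with three numbered paragraphs.
text = "1.    A\nmore\n\n2.    B\n\n3.    C"

lookahead = r'(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)'
consuming = r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n\d\d?\. |\Z)'

# The lookahead only checks what follows, so "2." is still there for the next match.
with_lookahead = re.findall(lookahead, text)
# The non-capturing group swallows "\n\n2. ", so paragraph 2 is skipped.
with_consuming = re.findall(consuming, text)

print(with_lookahead)  # ['A\nmore', 'B', 'C']
print(with_consuming)  # ['A\nmore', 'C']
```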
@math \uf0b7 is a private-use (other) character, it is not a digit. If you need to match ASCII digits only, use [0-9] instead of \d.
– Wiktor Stribiżew
Nov 20 at 14:14
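To illustrate the difference between \d and [0-9] on Python 3 str patterns, here is a short check. The Arabic-Indic digit is just an arbitrary example of a non-ASCII decimal digit and does not occur in the PDF text; the variable names are illustrative.

```python
import re
import unicodedata

bullet = "\uf0b7"        # the bullet character from the PDF extract
arabic_indic = "\u0663"  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

bullet_category = unicodedata.category(bullet)                        # 'Co' (private use)
bullet_is_digit = re.fullmatch(r"\d", bullet) is not None             # False
unicode_digit = re.fullmatch(r"\d", arabic_indic) is not None         # True
ascii_class = re.fullmatch(r"[0-9]", arabic_indic) is not None        # False
ascii_flag = re.fullmatch(r"\d", arabic_indic, re.ASCII) is not None  # False

print(bullet_category, bullet_is_digit, unicode_digit, ascii_class, ascii_flag)
```

So \d never matches the bullet, and either [0-9] or the re.ASCII flag restricts matching to ASCII digits.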
levels of government to make spending more efficient and safeguard revenues (Section A).
A combination of policies and reforms could help raise productivity growth, including increasing
investment in infrastructure and enhancing competition in services (Section B). To fully realize
Belgium's employment potential, it will be critical to address the severe fragmentation of the labor
market (Section C). To preserve financial stability, the authorities should address vulnerabilities in the
mortgage market and carefully navigate the transition toward a European Banking Union (Section D).
3
A comprehensive assessment of Belgium's financial sector took place in 2017 under the Financial Sector
Assessment Program (FSAP).
4
The NID aims to neutralize the CIT treatment of debt and equity by supplementing the deductibility of interest with
a deduction that is the product of corporate equity and a notional interest rate.
8
---------
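To make the pattern easier to read, here is a minimal sketch of the same regex written in verbose mode and run on a much shorter, made-up sample. The `sample` text and the name `PARAGRAPH` are assumptions for illustration only; they are not part of the original PDF or answer.

```python
import re

# Hypothetical miniature of the PDF text: numbered paragraphs whose number is
# followed by several spaces, separated by blank lines, with a bullet inside.
sample = (
    "HEADER\n"
    "7.   First paragraph,\ncontinued.\n\n"
    "\uf0b7 A bullet inside paragraph 7.\n\n"
    "8.   Second paragraph.\n\n"
    "9.   Last paragraph.\n\n3\n A footnote."
)

PARAGRAPH = re.compile(
    r"""
    ^\d\d?\.            # one- or two-digit paragraph number and a dot at line start
    [ ]{3,7}            # three to seven spaces (written as [ ] because of verbose mode)
    (.*?)               # paragraph body, as short as possible
    (?=                 # stop just before ...
        \n\n\d\d?\.[ ]  # ... a blank line followed by the next paragraph number
      | \Z              # ... or the end of the string
    )
    """,
    re.S | re.M | re.X,
)

paragraphs = PARAGRAPH.findall(sample)
for p in paragraphs:
    print(p, "\n---------")
```

The bullet line stays inside paragraph 7 because the blank line before it is not followed by a number and a dot, and the footnote stays attached to the last paragraph for the same reason.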
edited Nov 20 at 14:35
answered Nov 20 at 13:45
Wiktor Stribiżew
Many thanks for your answer. However, if I paste it there, I don't get any match, see the demo regex101.com
– math
Nov 20 at 13:48
@math I already did it for you - see this demo. And here is a Python demo.
– Wiktor Stribiżew
Nov 20 at 13:49
I've just noticed that my ending condition was not correct. I will change it above. It should be \n\n followed by a number (two or three digits) and a dot. Otherwise we don't select the complete first paragraph, as there are sequences like \n\n\uf0b7 in there. I tried to change your solution to r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n\d\d?\. |\Z)' but then the last paragraph is not selected.
– math
Nov 20 at 14:02
Because otherwise it will end at .\n\n\uf0b7 in the first paragraph, which is not correct.
– math
Nov 20 at 14:12
@math \uf0b7 is a control (other) character, it is not a digit. If you need to match ASCII digits only, use [0-9] instead of \d.
– Wiktor Stribiżew
Nov 20 at 14:14
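A likely reason a paragraph goes missing with the (?:...) ending is that a non-capturing group still consumes the next paragraph number, so the following match cannot start there; a lookahead only checks for it. The sketch below shows the difference on a made-up three-paragraph string (the `sample` text is an assumption, not the PDF content), and also how \d differs from [0-9] on non-ASCII input.

```python
import re

# Hypothetical input: three numbered paragraphs separated by blank lines.
sample = "7.   A\n\n8.   B\n\n9.   C"

# Non-capturing group: the terminator "\n\n8. " becomes part of the match,
# so paragraph 8 can no longer be found at a line start.
consuming = re.findall(r"(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n\d\d?\. |\Z)", sample)

# Lookahead: the terminator is only tested, never consumed.
lookahead = re.findall(r"(?sm)^\d\d?\. {3,7}(.*?)(?=\n\n\d\d?\. |\Z)", sample)

print(consuming)  # ['A', 'C']
print(lookahead)  # ['A', 'B', 'C']

# \uf0b7 (the bullet from the PDF) is not a digit, so \d never matches it.
bullet_is_digit = re.fullmatch(r"\d", "\uf0b7") is not None

# In str patterns \d also matches non-ASCII decimal digits such as U+0663;
# [0-9] or the re.ASCII flag restricts matching to ASCII digits.
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None
ascii_class = re.fullmatch(r"[0-9]", "\u0663") is not None
ascii_flag = re.fullmatch(r"\d", "\u0663", re.ASCII) is not None
```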
If you think [0-99] matches numbers from 0 to 99, you are wrong. You may replace that with \d\d?. re.M ((?m)) modifies ^ and $, and you do not have them in the pattern. You must have wanted to use (?s). Try r'(?sm)^\d\d?\. {3,7}(.*?)(?:\n\n|\Z)', see the regex demo.
– Wiktor Stribiżew
Nov 20 at 13:41
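The two points in the comment above can be checked quickly. This is a small sketch with made-up inputs (the `text` value is an assumption for illustration): [0-99] is a character class that matches a single digit, and without (?s) the dot does not cross line breaks.

```python
import re

# [0-99] is a character class: the range 0-9 plus a redundant 9, so it
# matches exactly one digit, not the numbers 0 to 99.
one_digit = re.fullmatch(r"[0-99]", "7") is not None
two_digits = re.fullmatch(r"[0-99]", "42") is not None
digit_pair = re.fullmatch(r"\d\d?", "42") is not None

# Hypothetical two-line paragraph.
text = "7.   line one\nline two"

# (?m) only changes where ^ and $ match; "." still stops at a newline.
no_dotall = re.findall(r"(?m)^\d\d?\. {3,7}(.*)", text)

# (?s) lets "." match newlines, so the body can span several lines.
with_dotall = re.findall(r"(?sm)^\d\d?\. {3,7}(.*)", text)

print(no_dotall)    # ['line one']
print(with_dotall)  # ['line one\nline two']
```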
Can you provide the raw input?
– Edilson Borges
Nov 20 at 13:42