Java regex to retrieve link from text



























I have an input String:

    String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it";


I want to convert this text to:



Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it


So here:

1) I want to replace the link tag with a plain link. If the tag contains a label, it should go in parentheses after the URL.

2) If the URL is relative, I want to prefix the base URL (http://www.google.com).

3) I want to append a parameter to the URL (&myParam=pqr).

I am having trouble retrieving the tag with its URL and label, and replacing it.



I wrote something like:

    public static void main(String[] args) {
        String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it";
        text = text.replaceAll("&lt;", "<");
        text = text.replaceAll("&gt;", ">");
        text = text.replaceAll("&amp;", "&");

        // this is not working
        Pattern p = Pattern.compile("href=\"(.*?)\"");
        Matcher m = p.matcher(text);
        String url = null;
        if (m.find()) {
            url = m.group(1);
        }
    }

    // helper method to append new query params once I have the url
    public static URI appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
        URI oldUri = new URI(uriToUpdate);
        String newQueryParams = oldUri.getQuery();
        if (newQueryParams == null) {
            newQueryParams = queryParamsToAppend;
        } else {
            newQueryParams += "&" + queryParamsToAppend;
        }
        URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
                oldUri.getPath(), newQueryParams, oldUri.getFragment());
        return newUri;
    }


Edit1:

    Pattern p = Pattern.compile("HREF=\"(.*?)\"");

This works, but I want the match to be case-insensitive: Href, HRef, href, hrEF, etc. should all work.

Also, how do I handle text that contains several URLs?



Edit2:

Some progress.

    Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(text);
    String url = null;
    while (m.find()) {
        url = m.group(1);
        System.out.println(url);
    }

This handles the case of multiple URLs.

The last pending issue: how do I get hold of the label and replace the link tags in the original text with the URL and label?



Edit3:

By "multiple URLs" I mean that several URLs are present in the given text.

    String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";

    Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(text);
    String url = null;
    while (m.find()) {
        url = m.group(1); // this variable should contain the link URL
        url = appendBaseURI(url); // helper that prefixes the base URL (not shown here)
        url = appendQueryParams(url, "license=ABCXYZ").toString();
        System.out.println(url);
    }


































  • Start by converting the HTML entities with: import org.apache.commons.lang.StringEscapeUtils; String entities_decode = StringEscapeUtils.unescapeHtml(text);

    – Pedro Lobito
    Nov 22 '18 at 3:23
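The decoding step suggested in the comment above can also be sketched without a library, assuming the input only contains the few entities seen in the sample string. The class and method names here (`EntityDecoder`, `decodeBasicEntities`) are hypothetical, and a real HTML document would need a full decoder such as the Commons one.

```java
final class EntityDecoder {
    // Decodes only the entities that appear in the sample input (an assumption).
    // "&amp;" is handled last so that "&amp;lt;" becomes "&lt;" and not "<".
    static String decodeBasicEntities(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String raw = "&lt;A HREF=\"/fruit.cgi?a=1&amp;b=2\"&gt;URL Label&lt;/A&gt;";
        // prints: <A HREF="/fruit.cgi?a=1&b=2">URL Label</A>
        System.out.println(decodeBasicEntities(raw));
    }
}
```

`String.replace` is used instead of `replaceAll` because the entities are literal text, not regular expressions.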


















java regex string url text














asked Nov 22 '18 at 2:39 by Nik, edited Nov 22 '18 at 3:12 by Kartik



























4 Answers














    // Needs java.net.URISyntaxException, java.util.regex.Matcher and java.util.regex.Pattern.
    // StringEscapeUtils.unescapeHtml4 comes from Apache Commons Text (or Commons Lang 3).
    public static void main(String[] args) throws URISyntaxException {
        String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";
        text = StringEscapeUtils.unescapeHtml4(text);
        Pattern p = Pattern.compile("<a href=\"(.*?)\">(.*?)</a>", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        while (m.find()) {
            // m keeps scanning the decoded original; text accumulates the replacements
            text = text.replace(m.group(0), cleanUrlPart(m.group(1), m.group(2)));
        }
        System.out.println(text);
    }

    private static String cleanUrlPart(String url, String label) throws URISyntaxException {
        // Prefix the base URL only when the link is relative
        if (!url.startsWith("http") && !url.startsWith("www")) {
            if (url.startsWith("/")) {
                url = "http://www.google.com" + url;
            } else {
                url = "http://www.google.com/" + url;
            }
        }
        // appendQueryParams is the helper from the question
        url = appendQueryParams(url, "myParam=pqr").toString();
        if (label != null && !label.isEmpty()) url += " (" + label + ")";
        return url;
    }


Output



Some content which contains link as http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&myParam=pqr (URL Label) and some text after it and another link http://www.google.com/relative-path/vegetables.cgi?param1=abc&param2=xyz&myParam=pqr (URL2 Label) and some more text
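The loop above calls `text.replace` once per match, which rescans the whole string each time. As a hedged alternative, the same rewrite can be sketched in a single pass with `Matcher.appendReplacement` and `Matcher.appendTail`. The class name `LinkFlattener`, its parameters, and the slightly looser pattern (it tolerates extra attributes after `href`) are assumptions for illustration, and the input is assumed to be already entity-decoded. Fragments (`#...`) in the URL are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkFlattener {
    // Group 1: href value, group 2: label between the tags.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href=\"(.*?)\"[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE);

    static String flatten(String html, String baseUrl, String extraParam) {
        Matcher m = LINK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String url = m.group(1);
            String lower = url.toLowerCase();
            boolean absolute = lower.startsWith("http://") || lower.startsWith("https://");
            if (!absolute) {
                // Prefix the base URL only for relative links
                url = baseUrl + (url.startsWith("/") ? "" : "/") + url;
            }
            // Use "&" when a query string already exists, "?" otherwise
            url += (url.contains("?") ? "&" : "?") + extraParam;
            String label = m.group(2);
            String replacement = label.isEmpty() ? url : url + " (" + label + ")";
            // quoteReplacement keeps '$' and '\' in the URL from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling `LinkFlattener.flatten(text, "http://www.google.com", "myParam=pqr")` on the decoded two-link sample should give the same output as shown above.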





























  • Oh, I didn't see this and posted my own answer. I am just struggling with the replacing part. I will try to finish my answer first, otherwise I will try yours. Thanks!

    – Nik
    Nov 22 '18 at 6:10

































You can use Apache Commons Text's StringEscapeUtils to decode the HTML entities and then replaceAll, e.g.:

    import org.apache.commons.text.StringEscapeUtils;

    String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it";
    String output = StringEscapeUtils.unescapeHtml4(text).replaceAll("([^<]+).+\"(.*?)\">(.*?)<[^>]+>(.*)", "$1https://google.com$2&your_param ($3)$4");
    System.out.print(output);
    // Some content which contains link as https://google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&your_param (URL Label) and some text after it




Demos:




  1. jdoodle


  2. Regex Explanation
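The pattern above consumes the whole input, so it rewrites a single link. A minimal sketch of a variant that matches each tag on its own, and therefore handles several links in one `replaceAll`, might look like this. It assumes the text is already entity-decoded, that every href is relative, and that every href already has a query string. The class name `ReplaceAllDemo` and the hard-coded base URL and parameter are illustrative only.

```java
final class ReplaceAllDemo {
    static String flatten(String html) {
        // (?i) makes the tag and attribute names case-insensitive.
        // "/?" drops an optional leading slash so the base URL can always end with one.
        return html.replaceAll(
                "(?i)<a\\s+href=\"/?(.*?)\"[^>]*>(.*?)</a>",
                "http://www.google.com/$1&myParam=pqr ($2)");
    }

    public static void main(String[] args) {
        System.out.println(flatten(
                "x <A HREF=\"/a.cgi?p=1\">L1</A> y <a href=\"b.cgi?q=2\">L2</a> z"));
    }
}
```

Because the base URL is always prefixed, this does not cover absolute links. Deciding per link needs code between the match and the replacement rather than a fixed replacement string.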
































  • This is really sleek and will fit my requirements perfectly if it can handle the multiple-URL scenario. Also, I guess your solution assumes that the URL always has to be prefixed with google.com, which is not the case, as mentioned in point (2) of my question: I add the base URI only if it's missing. Thanks for the answer though! I will try to expand on it.

    – Nik
    Nov 22 '18 at 5:00

  • Make the base URL dynamic as well.

    – Pedro Lobito
    Nov 22 '18 at 15:07


































// this is not working




Because your regex is case-sensitive.

Try:

    Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);


Edit1:

To get the label, use Pattern.compile("(?<=>).*?(?=</a>)", Pattern.CASE_INSENSITIVE) and m.group(0).

Edit2:

To replace the tag (including the label) with your final string, use:

    text.replaceAll("(?i)<a href=\"(.*?)</a>", "new substring here")
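As a hedged alternative to running separate patterns for the URL and the label, both can be captured by one pattern with two groups, so each URL stays paired with its own label. The class name `LinkExtractor` and the use of `String[]` pairs are assumptions for this sketch, and the input is assumed to be entity-decoded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkExtractor {
    // Group 1: href value, group 2: label between the opening and closing tags.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href=\"(.*?)\"[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE);

    // Returns one {url, label} pair per link, in document order.
    static List<String[]> extract(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2) });
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "a <A HREF=\"/fruit.cgi?p=1\">URL Label</A> b <a href=\"/veg.cgi\">URL2 Label</a>";
        for (String[] link : extract(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```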































  • Thanks. I just found this out and have edited the question accordingly.

    – Nik
    Nov 22 '18 at 2:47

  • So this doesn't answer your question? If not, what's the next issue?

    – Kartik
    Nov 22 '18 at 2:48

  • 3 issues actually: 1) How do I handle multiple URL cases? 2) How do I get hold of the label? 3) Once I have the URLs with the base URL prefixed and the parameter attached, how do I replace them in the original text?

    – Nik
    Nov 22 '18 at 2:50

  • 1) What do you mean by multiple URL cases? Can you update your question with an example? 2) Updated the answer for the label. 3) Just like you replaced before, do the reverse, and oh, use replace instead of replaceAll.

    – Kartik
    Nov 22 '18 at 3:00

  • Edited. I did not understand the replace part. What do you mean by "like you replaced before"?

    – Nik
    Nov 22 '18 at 3:05

































Almost there:

    public static void main(String[] args) throws URISyntaxException {
        String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";
        text = StringEscapeUtils.unescapeHtml4(text);
        System.out.println(text);
        System.out.println("**************************************");
        Pattern patternTag = Pattern.compile("<a([^>]+)>(.+?)</a>", Pattern.CASE_INSENSITIVE);
        Pattern patternLink = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
        Matcher matcherTag = patternTag.matcher(text);

        while (matcherTag.find()) {
            String href = matcherTag.group(1); // attributes of the tag, including href
            String linkText = matcherTag.group(2); // link text
            System.out.println("Href: " + href);
            System.out.println("Label: " + linkText);
            Matcher matcherLink = patternLink.matcher(href);
            String finalText = null;
            // only the first href inside the tag is used
            if (matcherLink.find()) {
                String link = matcherLink.group(1);
                System.out.println("Link: " + link);
                finalText = getFinalText(link, linkText);
            }
            System.out.println("***************************************");
            // replacing logic goes here
        }
        System.out.println(text);
    }

    public static String getFinalText(String link, String label) throws URISyntaxException {
        link = appendBaseURI(link);
        link = appendQueryParams(link, "myParam=ABCXYZ");
        return link + " (" + label + ")";
    }

    public static String appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
        URI oldUri = new URI(uriToUpdate);
        String newQueryParams = oldUri.getQuery();
        if (newQueryParams == null) {
            newQueryParams = queryParamsToAppend;
        } else {
            newQueryParams += "&" + queryParamsToAppend;
        }
        URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
                oldUri.getPath(), newQueryParams, oldUri.getFragment());
        return newUri.toString();
    }

    public static String appendBaseURI(String url) {
        String baseURI = "http://www.google.com/";
        // drop a leading slash so it is not doubled after the base URI
        if (url.startsWith("/")) {
            url = url.substring(1);
        }
        if (url.startsWith(baseURI)) {
            return url;
        } else {
            return baseURI + url;
        }
    }
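The hand-written `appendBaseURI` above treats any URL that does not start with the base as relative, so an absolute link to another host would also get the prefix. One possible alternative, sketched here under the assumption that the hrefs are syntactically valid URIs, is to let `java.net.URI.resolve` decide. The class name `BaseUrlResolver` is hypothetical, and the base is given with a trailing slash to keep path resolution unambiguous.

```java
import java.net.URI;

final class BaseUrlResolver {
    // Resolves href against baseUrl; absolute hrefs are returned unchanged.
    // Throws IllegalArgumentException if href is not a valid URI (e.g. contains spaces).
    static String toAbsolute(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.google.com/";
        // prints: http://www.google.com/relative-path/fruit.cgi?param1=abc&param2=xyz
        System.out.println(toAbsolute(base, "/relative-path/fruit.cgi?param1=abc&param2=xyz"));
        // prints: https://example.org/a?b=c
        System.out.println(toAbsolute(base, "https://example.org/a?b=c"));
    }
}
```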





















































First answer (the replace loop with cleanUrlPart): answered Nov 22 '18 at 5:48 by Kartik













    • oh.. didn't see this and posted my answer.. l am just struggling with replacing part.. will try to do with my answer first... else will try yours.. thanks!

      – Nik
      Nov 22 '18 at 6:10



















    • oh.. didn't see this and posted my answer.. l am just struggling with replacing part.. will try to do with my answer first... else will try yours.. thanks!

      – Nik
      Nov 22 '18 at 6:10

















    oh.. didn't see this and posted my answer.. l am just struggling with replacing part.. will try to do with my answer first... else will try yours.. thanks!

    – Nik
    Nov 22 '18 at 6:10





    oh.. didn't see this and posted my answer.. l am just struggling with replacing part.. will try to do with my answer first... else will try yours.. thanks!

    – Nik
    Nov 22 '18 at 6:10













    1














    You can use apache commons text StringEscapeUtils to decode the html entities and then replaceAll, i.e.:



    import org.apache.commons.text.StringEscapeUtils;

    String text = "Some content which contains link as &lt;A HREF="/relative-path/fruit.cgi?param1=abc&amp;param2=xyz"&gt;URL Label&lt;/A&gt; and some text after it";
    String output = StringEscapeUtils.unescapeHtml4(text).replaceAll("([^<]+).+"(.*?)">(.*?)<[^>]+>(.*)", "$1https://google.com$2&your_param ($3)$4");
    System.out.print(output);
    // Some content which contains link as https://google.com/relative-path/fruit.cgi?param1=abc&param2=xyz&your_param (URL Label) and some text after it




    Demos:




    1. jdoodle


    2. Regex Explanation






    share|improve this answer


























    • This is really sleek and will fit my required solution perfectly if it can handle the multiple URL scenarios. Also, I guess your solution assumes that the URL will always have to be prefixed with google.com, which is not the case as mentioned in point (2) of my question. I will add the base URI only if its missing. Thanks for the answer though! will try to expand on it.

      – Nik
      Nov 22 '18 at 5:00











    • make baseurl also dinamic.

      – Pedro Lobito
      Nov 22 '18 at 15:07
























    answered Nov 22 '18 at 4:04, edited Nov 22 '18 at 13:04

    Pedro Lobito




























    // this is not working



    That is because your regex is case-sensitive: the pattern looks for lowercase href, but the input contains HREF.



    Try:



    Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);


    Edit1:

    To get the label, use Pattern.compile("(?<=>).*?(?=</a>)", Pattern.CASE_INSENSITIVE) and read m.group(0).



    Edit2:

    To replace the whole tag (including the label) with your final string, use:



    text.replaceAll("(?i)<a href=\"(.*?)</a>", "new substring here")
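The two patterns above can also be folded into one pattern with two capture groups and driven by a while (m.find()) loop, which covers text that contains several links. The following is only a sketch under that assumption; the class name LinkExtractor and the method extract are hypothetical, and the pattern expects a double-quoted href value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkExtractor {

    // Group 1 is the href value, group 2 is the label.
    // CASE_INSENSITIVE accepts href, HREF, HrEf and so on; DOTALL lets a label span lines.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*?href=\"(.*?)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one {href, label} pair per link, in the order the links appear.
    static List<String[]> extract(String text) {
        List<String[]> links = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2) });
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "one <A HREF=\"/x.cgi?p=1\">First</A> two <a HrEf=\"/y.cgi\">Second</a>";
        for (String[] link : extract(text)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

Each call to find() resumes after the previous match, so no manual bookkeeping of positions is needed.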































    • Thanks, I just found this out and have edited the question accordingly.

      – Nik
      Nov 22 '18 at 2:47

    • So this doesn't answer your question? If not, what's the next issue?

      – Kartik
      Nov 22 '18 at 2:48

    • Three issues, actually: 1) How do I handle text with multiple URLs? 2) How do I get hold of the label? 3) Once I have the URLs with the base URL prefixed and the parameter appended, how do I replace them in the original text?

      – Nik
      Nov 22 '18 at 2:50

    • 1) What do you mean by multiple URL cases? Can you update your question with an example? 2) I updated the answer for the label. 3) Just like you replaced before, do the reverse, and use replace instead of replaceAll.

      – Kartik
      Nov 22 '18 at 3:00

    • Edited. I did not understand the replace part. What do you mean by "like you replaced before"?

      – Nik
      Nov 22 '18 at 3:05
























    answered Nov 22 '18 at 2:45, edited Nov 22 '18 at 3:06

    Kartik



























    Almost there:



    import java.net.URI;
    import java.net.URISyntaxException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.commons.text.StringEscapeUtils;

    public class Main {

        public static void main(String[] args) throws URISyntaxException {
            String text = "Some content which contains link as &lt;A HREF=\"/relative-path/fruit.cgi?param1=abc&amp;param2=xyz\"&gt;URL Label&lt;/A&gt; and some text after it and another link &lt;A HREF=\"/relative-path/vegetables.cgi?param1=abc&amp;param2=xyz\"&gt;URL2 Label&lt;/A&gt; and some more text";
            text = StringEscapeUtils.unescapeHtml4(text);
            System.out.println(text);
            System.out.println("**************************************");
            // Group 1: everything between "<a" and ">" (the attributes); group 2: the link text.
            Pattern patternTag = Pattern.compile("<a([^>]+)>(.+?)</a>", Pattern.CASE_INSENSITIVE);
            // Group 1: the value of the href attribute.
            Pattern patternLink = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
            Matcher matcherTag = patternTag.matcher(text);

            while (matcherTag.find()) {
                String attributes = matcherTag.group(1);
                String linkText = matcherTag.group(2);
                System.out.println("Href: " + attributes);
                System.out.println("Label: " + linkText);
                Matcher matcherLink = patternLink.matcher(attributes);
                String finalText = null;
                // Only the first href in the tag is relevant.
                if (matcherLink.find()) {
                    String link = matcherLink.group(1);
                    System.out.println("Link: " + link);
                    finalText = getFinalText(link, linkText);
                }
                System.out.println("***************************************");
                // replacing logic goes here
            }
            System.out.println(text);
        }

        public static String getFinalText(String link, String label) throws URISyntaxException {
            link = appendBaseURI(link);
            link = appendQueryParams(link, "myParam=ABCXYZ");
            return link + " (" + label + ")";
        }

        public static String appendQueryParams(String uriToUpdate, String queryParamsToAppend) throws URISyntaxException {
            URI oldUri = new URI(uriToUpdate);
            String newQueryParams = oldUri.getQuery();
            if (newQueryParams == null) {
                newQueryParams = queryParamsToAppend;
            } else {
                newQueryParams += "&" + queryParamsToAppend;
            }
            URI newUri = new URI(oldUri.getScheme(), oldUri.getAuthority(),
                    oldUri.getPath(), newQueryParams, oldUri.getFragment());
            return newUri.toString();
        }

        public static String appendBaseURI(String url) throws URISyntaxException {
            String baseURI = "http://www.google.com/";
            // A URL that already has a scheme is absolute and is left untouched.
            if (new URI(url).isAbsolute()) {
                return url;
            }
            // Drop the leading slash so the base URI's trailing slash is not doubled.
            if (url.startsWith("/")) {
                url = url.substring(1);
            }
            return baseURI + url;
        }
    }
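The "replacing logic goes here" placeholder above could be filled in with Matcher.appendReplacement and Matcher.appendTail, which rebuild the text while substituting each match. The sketch below is one possible way to do that using only the JDK, not the author's final code. The class name LinkReplacer, the BASE and EXTRA_PARAM constants and both method names are hypothetical. It assumes the entities are already decoded, the href values are double-quoted and carry no #fragment, and it uses URI.resolve in place of the appendBaseURI helper so that absolute URLs pass through unchanged.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkReplacer {

    // Assumed values, taken from the example in the question.
    private static final URI BASE = URI.create("http://www.google.com/");
    private static final String EXTRA_PARAM = "myParam=pqr";

    // Group 1 is the href value, group 2 is the label (which may be empty).
    private static final Pattern TAG = Pattern.compile(
            "<a\\s[^>]*?href=\"(.*?)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Resolves a relative href against BASE and appends EXTRA_PARAM.
    // Assumption: the href has no fragment, so the parameter can go at the end.
    static String rewrite(String href) {
        URI uri = BASE.resolve(href); // an absolute href is returned as-is
        String separator = (uri.getRawQuery() == null) ? "?" : "&";
        return uri + separator + EXTRA_PARAM;
    }

    static String replaceLinks(String text) {
        Matcher m = TAG.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String label = m.group(2);
            String replacement = rewrite(m.group(1))
                    + (label.isEmpty() ? "" : " (" + label + ")");
            // quoteReplacement stops '$' or '\' in a URL from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last link
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Some content which contains link as "
                + "<A HREF=\"/relative-path/fruit.cgi?param1=abc&param2=xyz\">URL Label</A>"
                + " and another link "
                + "<a href=\"/relative-path/vegetables.cgi?param1=abc&param2=xyz\">URL2 Label</a>"
                + " and some more text";
        System.out.println(replaceLinks(text));
    }
}
```

StringBuffer is used because the StringBuilder overloads of appendReplacement and appendTail only exist from Java 9 onward.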















        answered Nov 22 '18 at 6:09

        Nik





























