How to extract content between two delimiters in PHP

Posted on August 29, 2008, under PHP 

Hi,

Here’s a function that is useful when you need to extract the content between two delimiters, for instance when a robot connects to a page and you need to pull one piece of text out of it.

<?php
/*
Credits: Bit Repository
URL: http://www.bitrepository.com/web-programming/php/extracting-content-between-two-delimiters.html
*/

function extract_unit($string, $start, $end)
{
    $pos = stripos($string, $start);

    $str = substr($string, $pos);

    $str_two = substr($str, strlen($start));

    $second_pos = stripos($str_two, $end);

    $str_three = substr($str_two, 0, $second_pos);

    $unit = trim($str_three); // strip leading and trailing whitespace

    return $unit;
}

// Usage example:

$text = 'PHP is an acronym for "PHP: Hypertext Preprocessor".';

$unit = extract_unit($text, 'an', 'for');

// Outputs: acronym
echo $unit;
?>

How does it work?

First, we use stripos() to determine the numeric position of the first occurrence of the needle ($start) in the haystack ($string). Note that stripos() is case-insensitive. In our example, there are 7 characters from the beginning of the string until ‘an’.

$pos = stripos($string, $start);

Now, we will use this information to get the content of $string, from the $pos character until the last one:

an acronym for "PHP: Hypertext Preprocessor".

$str = substr($string, $pos);

Remove ‘an’ from the start of the newly created string:

 acronym for "PHP: Hypertext Preprocessor".

$str_two = substr($str, strlen($start));

Determine the number of characters from the beginning of $str_two until ‘for’ (9 in this case):

$second_pos = stripos($str_two, $end);

Now use this number to get the content from the beginning of the string until ‘for’:

$str_three = substr($str_two, 0, $second_pos);

The last variable now holds ‘ acronym ’ (with the surrounding spaces). Finally, let’s strip the whitespace from the beginning and end of the string:

acronym

$unit = trim($str_three); // remove whitespaces
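
The steps above can be traced with a short script. This is only a sketch that repeats the function body outside the function so each intermediate value can be inspected; the values in the comments are what the walkthrough predicts for the example string.

```php
<?php
// Trace of each intermediate value from the walkthrough above.
$string = 'PHP is an acronym for "PHP: Hypertext Preprocessor".';
$start  = 'an';
$end    = 'for';

$pos        = stripos($string, $start);         // 7
$str        = substr($string, $pos);            // 'an acronym for "PHP: ...'
$str_two    = substr($str, strlen($start));     // ' acronym for "PHP: ...'
$second_pos = stripos($str_two, $end);          // 9
$str_three  = substr($str_two, 0, $second_pos); // ' acronym '
$unit       = trim($str_three);                 // 'acronym'

var_dump($pos, $second_pos, $str_three, $unit);
```

Keep in mind that stripos() returns false when a delimiter is missing, and the function above does not check for that case.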

If you have any comments or suggestions regarding this snippet, please post them.

18 Replies to "How to extract content between two delimiters in PHP"

  1. Hi, nice script!
    But I have an additional question… what if I want all instances of $unit in an array, for example when my string contains more than one occurrence of $start and $end?
    example:
    $string = “This is a string“;
    $start = ““;
    $end = “
    “;

    desired output:
    array (“This”, “string”);

  2. my comment stripped the Bold-tags…
    i meant:
    $string = “[bold]This is a string[/bold]“;
    $start = “[bold]“;
    $end = “[/bold]“;

  3. Here’s the solution:

    <?php
    $string = "<b>This</b> is a <b>string</b>";
    $start = "<b>";
    $end = "<\/b>";
    
    // Regexp for the extractor
    $regexp = '/'.$start.'(.*)'.$end.'/Ui';
    
    preg_match_all($regexp, $string, $out, PREG_PATTERN_ORDER);
    
    $desired = $out[1];
    
    echo "<pre>"; print_r($desired); echo "</pre>";
    
    /*
    Output:
    
    Array
    (
        [0] => This
        [1] => string
    )
    */
    ?>

    Notice the backslash before the slash in the ending bold: <\/b>.

    Helpful address: http://www.php.net/manual/ro/function.preg-quote.php
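
    As a sketch of what the preg_quote() link above is hinting at: the delimiters can be escaped automatically, so no manual backslash is needed. The function name extract_all_units is hypothetical (it is not part of the original post), and the lazy (.*?) group with the i and s modifiers is an assumption that replaces the U modifier used above.

    ```php
    <?php
    // Hypothetical helper, not from the original post.
    function extract_all_units($string, $start, $end)
    {
        // preg_quote() escapes regex metacharacters; passing '/' as the
        // second argument also escapes the pattern delimiter.
        $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/is';

        preg_match_all($pattern, $string, $matches);

        // $matches[1] holds the text captured between the delimiters.
        return array_map('trim', $matches[1]);
    }

    $found = extract_all_units('<b>This</b> is a <b>string</b>', '<b>', '</b>');
    print_r($found);
    ```

    With this approach the ending delimiter can be passed as a plain </b>.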

  4. Wow, this is a great solution. I hoped that it would work without the use of a regular expression, but it works!
    Now I want to create a sort of database in an associative array. So, for example, the data between tags A and B needs to go in the same record as the data extracted between tags C and D, which has to result in something like this:
    Array (“data from A and B” => “from C and D”,
    “data from A and B” => “from C and D”,
    and so on…
    )

    I now have the following:

    <?php
    function test($string, $start1, $end1, $start2, $end2)
    {
    // Regexp for the extractor
    $regexp1 = '/'.$start1.'(.*)'.$end1.'/Ui';
    $regexp2 = '/'.$start2.'(.*)'.$end2.'/Ui';

    preg_match_all($regexp1, $string, $out1, PREG_PATTERN_ORDER);
    preg_match_all($regexp2, $string, $out2, PREG_PATTERN_ORDER);

    $combined = array_combine($out1[1], $out2[1]);

    echo ""; print_r($combined); echo "";

    //checking one row
    print_r (explode(",",$combined[3]));
    }
    ?>

    But it doesn’t work unfortunately…Thanks for your help in advance! It’s appreciated big time!

  5. Array (”data from A and B” => “from C and D”,
    “data from A and B” => “from C and D”,

    Are you sure that your example (with dupes) is a good one?

  6. yeah for example:

    $string = “<b>Street 23</b> – Paris
    <b>Street 43</b> – Berlin
    <b>Street 453</b> – London”;

    $string_1 = “<b>”;
    $string_2 = “”;
    $string_3 = “”;
    $string_4 = “”;

    required result:
    Array (“Street 23” => “Paris”,
    “Street 43” => “Berlin”,
    “Street 453” => “London”)

    Thanks

  7. This is a way to get the desired result:

    <?php
    $string = '<b>Street 23</b> - Paris
    <b>Street 43</b> - Berlin
    <b>Street 453</b> - London';
    
    $array = explode("<b>", $string);
    
    $desired_array = array();
    
    foreach($array as $value)
    {
    $value = trim(strip_tags($value));
    
    if($value)
    	{
    	list($street, $city) = explode("-", $value); // hyphen is our delimiter
    	$desired_array[trim($street)] = trim($city); // remove whitespaces
    	}
    }
    
    echo "<pre>"; print_r($desired_array); echo "</pre>";
    
    /*
    Array
    (
        [Street 23] => Paris
        [Street 43] => Berlin
        [Street 453] => London
    )
    */
    ?>

    PS: Consider using &lt; for < and &gt; for > when you write a new comment.
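
    For comparison, here is a sketch of the same result built with a single regular expression and array_combine(), which is closer to the attempt in comment 4. The pattern is an assumption about the input format: one "<b>street</b> - city" pair per line, with a plain hyphen as the separator.

    ```php
    <?php
    $string = "<b>Street 23</b> - Paris\n<b>Street 43</b> - Berlin\n<b>Street 453</b> - London";

    // Group 1: text inside <b>...</b>. Group 2: rest of the line after the hyphen
    // (without the s modifier, "." stops at the newline).
    preg_match_all('/<b>(.*?)<\/b>\s*-\s*(.*)/i', $string, $m);

    // Streets become the keys, cities the values.
    $desired_array = array_combine(
        array_map('trim', $m[1]),
        array_map('trim', $m[2])
    );

    print_r($desired_array);
    ```

    Note that array_combine() keeps only the last value when two keys are identical, so duplicate streets would overwrite each other.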

  8. Thanks a lot! It brought me on new ideas!

  9. Very helpful, thanks!

  10. My question, i use the following:

    $q = 0;
    function extract_unit($string, $start, $end)
    {
    $pos = stripos($string, $start);
    $str = substr($string, $pos);
    $str_two = substr($str, strlen($start));
    $second_pos = stripos($str_two, $end);
    $str_three = substr($str_two, $q, $second_pos);
    $unit = trim($str_three); // remove whitespaces
    return $unit;
    }

    $h2 = extract_unit($pagina, ”, ”);

    to extract the h2 header from a webpage. $pagina is the string which holds the source code of a webpage. The only problem right now: it only extracts the first h2 header from the page. I tried a for loop, but then I get the same h2 header multiple times.

    I have a page with 5 h2 headers, and I want to extract all of them and put them in an array, for example.

    Does anybody know how?

    Maximus

    1. Hi Maximus,

      i have the same problem. Did you solve it?

      Best,

      Nils

      1. Nope, but I don’t work on this project anymore. If you find the solution, let me know.
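
    One possible way to collect every match without regular expressions is to pass a search offset to stripos() and move it forward after each hit. This is a sketch only: extract_units is a hypothetical name, and the sample $pagina string is made up for illustration.

    ```php
    <?php
    // Hypothetical helper (not from the original post): returns every unit
    // found between $start and $end, searching case-insensitively.
    function extract_units($string, $start, $end)
    {
        $units = array();

        if ($start === '' || $end === '') {
            return $units; // empty delimiters would never advance the offset
        }

        $offset = 0;
        while (($from = stripos($string, $start, $offset)) !== false) {
            $from += strlen($start);             // first character after $start
            $to = stripos($string, $end, $from); // next $end after that point
            if ($to === false) {
                break;                           // unmatched $start: stop
            }
            $units[] = trim(substr($string, $from, $to - $from));
            $offset  = $to + strlen($end);       // resume after this $end
        }

        return $units;
    }

    $pagina = '<h2>First</h2><p>text</p><H2> Second </H2><h2>Third</h2>';
    $h2s = extract_units($pagina, '<h2>', '</h2>');
    print_r($h2s);
    ```

    Calling the original function in a loop returns the same header each time because every call starts searching from position 0; the offset is what makes the search progress.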

  11. it should be
    $h2 = extract_unit($pagina, "<h2>", "</h2>");

    forgot the &lt; and &gt;

  12. [...] a practical cURL function & another function that extracts content between 2 delimiters (click here to view details about [...]

  13. [...] while ago I have written a short tutorial of how you can write a short PHP function to extract content from specific delimiters. It has come to my attention that many people are looking for a way to replace and even modify [...]

  14. [...] 5.2.0) which doesn’t have the json_decode() function included, you can use the alternative extract_unit function to get the data between total_posts”: and “ (will return the actual number). [...]

  15. [...] is specified. The extract_unit function gets all the text between 2 specified strings. Thanks to Bit Repository for this function. Since we need to extract text from javascript, we need to decode the output [...]

  16. A more efficient version of the algorithm (no string copies required), which copes with $start or $end not existing in $string
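
    The code for this comment is not included above. As an illustration of the idea only (not the commenter's version), here is a minimal sketch that searches with an offset, creates a single substring, and returns null when either delimiter is missing. The name extract_unit_safe and the null return value are assumptions.

    ```php
    <?php
    // Hypothetical variant of extract_unit(); a sketch, not the commenter's code.
    function extract_unit_safe($string, $start, $end)
    {
        $from = stripos($string, $start);
        if ($from === false) {
            return null;                     // $start not found
        }

        $from += strlen($start);             // skip past $start
        $to = stripos($string, $end, $from); // look for $end after $start only
        if ($to === false) {
            return null;                     // $end not found
        }

        // One substr() call instead of three intermediate copies.
        return trim(substr($string, $from, $to - $from));
    }

    $text = 'PHP is an acronym for "PHP: Hypertext Preprocessor".';

    var_dump(extract_unit_safe($text, 'an', 'for'));  // the unit from the article's example
    var_dump(extract_unit_safe($text, 'xyz', 'for')); // missing $start
    var_dump(extract_unit_safe($text, 'an', 'xyz'));  // missing $end
    ```

    Returning null lets the caller tell "delimiter not found" apart from an empty match, which the original function cannot do.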
